Research Radar · Face Detection · arXiv · June 2026

Monthly arXiv Radar

June 2026 Face Detection Papers: Fairness Benchmarks, Neonatal Detection, and PAD Bias

June 2026 face detection papers show the first stage of a biometric pipeline becoming more accountable. The month combines fairness measurement, high-risk clinical domain adaptation, and PAD architecture choices that affect demographic outcomes.

What This Month Signals

The month’s signal is governance: detection benchmarks need demographic labels so detectors can be audited, clinical deployments need domain-specific validation, and PAD architectures should be evaluated on both accuracy and fairness.

Paper 01 · 2026-06-30 · cs.CV

WIDER-FAIR: An Annotated Version of the WIDER-FACE Dataset for Fairness Evaluation

Authors & Institutions

Maxime Moussi

UCLouvain, Louvain-la-Neuve, Belgium

Benoît Ronval

UCLouvain, ICTEAM, Louvain-la-Neuve, Belgium

Siegfried Nijssen

UCLouvain, ICTEAM, Louvain-la-Neuve, Belgium

KU Leuven, DTAI, Leuven, Belgium

Félicien Schiltz

Euranova, Mont-Saint-Guibert, Belgium

What Problem It Solves

The paper addresses a measurement gap: widely used detection benchmarks rarely include sensitive-feature labels, making fairness claims hard to validate.

Key Result

In a YOLOv5 demonstration with ablations on each sensitive feature, detection performance is notably lower for faces of Black individuals, and excluding that group from training increases fairness disparity more than excluding any other ethnic group.

Abstract

The deployment of face detection models in real-world applications raises important fairness concerns, as these systems may showcase performance disparities across demographic groups. A key obstacle to studying and mitigating such biases is the lack of face detection datasets with sensitive feature annotations. To address this gap, we introduce WIDER-FAIR, a new dataset built on the widely used WIDER-FACE benchmark, manually annotated with the perceived ethnicity and sex of each face. The dataset contains 16,256 images annotated across four ethnic groups: Asian, Black, Indian, and White, and two sex categories. We assess the quality and coherence of the annotations using face embeddings, a K-Nearest Neighbors classifier, and a t-SNE visualization, all of which support the consistency of the labeling process. As a demonstration of the dataset's potential, we train a YOLOv5 model and perform ablation studies on each sensitive feature. Among other findings, our experiments show that detection performance is notably lower for faces of Black individuals, and that excluding this group from training increases fairness disparity more than excluding any other ethnic group. These observations illustrate the value of demographically annotated datasets for understanding and evaluating bias in face detection models.

Research Starting Point

Face detection is often the first step in recognition, liveness, and analytics pipelines, so demographic miss rates at this stage can propagate into every downstream metric.

Method

The authors manually annotate 16,256 images across four perceived ethnic groups and two sex categories, then use the annotations to run training-data ablations that reveal how excluding specific groups changes detector fairness.
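A group-wise audit of the kind these annotations enable can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the fairness metric the authors use is not stated here, so the sketch assumes a simple per-group detection rate (recall) and a max-minus-min gap. The function names `recall_by_group` and `disparity` and the toy records are hypothetical.

```python
from collections import defaultdict


def recall_by_group(faces):
    """Return the detection rate per group.

    faces: iterable of (group, detected) pairs, one per annotated face,
    where `detected` says whether the detector found that face.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, detected in faces:
        totals[group] += 1
        hits[group] += int(detected)
    return {group: hits[group] / totals[group] for group in totals}


def disparity(recalls):
    """Gap between the best-served and worst-served group (assumed metric)."""
    return max(recalls.values()) - min(recalls.values())


# Toy records for illustration only; these are NOT results from the paper.
faces = [
    ("Asian", True), ("Asian", True),
    ("Black", True), ("Black", False),
    ("Indian", True), ("Indian", True),
    ("White", True), ("White", True),
]

recalls = recall_by_group(faces)
gap = disparity(recalls)
print(recalls, gap)
```

Running the same audit on detectors trained with one group held out would give one gap per ablation, which is the comparison the ablation study draws.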

Paper Summary

WIDER-FAIR matters because it moves detector fairness from anecdote to testable evidence. For vendors, it is a reminder that a “good” detector benchmark score may hide group-specific failures unless the evaluation set carries the right annotations.

Paper 02 · 2026-06-18 · cs.CV

InfantFace: Detecting infant faces in neonatal clinical environments

Authors & Institutions

Abdullah Bin-Obaid

Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom

Maria M. Cobo

Department of Paediatrics, University of Oxford, Oxford, United Kingdom

Universidad San Francisco de Quito USFQ, Colegio de Ciencias Biológicas y Ambientales, Quito, Ecuador

Rebeccah Slater

Department of Paediatrics, University of Oxford, Oxford, United Kingdom

Lionel Tarassenko

Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom

Mauricio Villarroel

Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom

What Problem It Solves

The paper addresses the absence of public neonatal face-detection datasets and the resulting uncertainty about whether general detectors are reliable in intensive-care conditions.

Key Result

Before clinical fine-tuning, the model reaches an AP50 of 0.87 and outperforms three state-of-the-art general face detectors; after neonatal adaptation, AP50 rises to 0.96.

Abstract

Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.

Research Starting Point

Clinical face detection differs sharply from consumer face detection: lighting is poor, backgrounds are cluttered, and equipment or care interventions may occlude much of the infant’s face.

Method

The authors build a one-stage YOLOv11m pipeline, train it on a combination of public face datasets (VGGFace2, CelebA, FDDB, WIDER FACE) for general facial structure, and then fine-tune it on a neonatal research dataset of 228 videos from 114 recording sessions of 113 infants, which represents the target domain.
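For readers less familiar with the AP50 figure used to report this result, the sketch below shows one common way to compute it: detections are ranked by confidence, a detection counts as correct when it overlaps an unmatched ground-truth box with IoU of at least 0.5, and average precision is the area under the resulting precision-recall curve. The matching convention and the helper names `iou` and `ap50` are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ap50(detections, ground_truth, thr=0.5):
    """Average precision at an IoU threshold of 0.5.

    detections: list of (image_id, score, box)
    ground_truth: dict mapping image_id -> list of boxes
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(b) for img, b in ground_truth.items()}
    tp_flags = []
    # Visit detections from most to least confident.
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        # A true positive needs enough overlap with a not-yet-matched face.
        if best_iou >= thr and not matched[img][best_j]:
            matched[img][best_j] = True
            tp_flags.append(True)
        else:
            tp_flags.append(False)
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)
    # Make precision non-increasing in recall before integrating.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example with made-up boxes, not clinical data.
gt = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
dets = [
    ("a", 0.9, (0, 0, 10, 10)),
    ("b", 0.8, (20, 20, 30, 30)),
    ("b", 0.7, (0, 0, 10, 9)),
]
score = ap50(dets, gt)
```

Because AP50 only asks for 50% overlap, it rewards finding the face at all rather than localising it tightly, which suits a first-stage detector feeding downstream analysis.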

Paper Summary

InfantFace is a strong reminder that “face detection” is not one product category. In healthcare and other constrained environments, domain adaptation and dataset governance can be more important than simply selecting the newest general detector.

Paper 03 · 2026-06-16 · cs.CV

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

Authors & Institutions

Ngela Landon Ntung

College of Engineering, Carnegie Mellon University Africa, Kigali, Rwanda

Floride Tuyisenge

College of Engineering, Carnegie Mellon University Africa, Kigali, Rwanda

Jema David Ndibwile

College of Engineering, Carnegie Mellon University Africa, Kigali, Rwanda

What Problem It Solves

The paper focuses on whether fairness is only a data problem or whether architectural inductive bias and pretraining also change cross-demographic behavior.

Key Result

Pretrained DeiT-S reaches 97.27% accuracy and 0.86% EER, reduces the African/East Asian ACER gap to 0.13%, and holds BPCER to 2.89% on zero-shot Central Asian subjects versus 10.44% for ResNet18, a 3.6x difference.

Abstract

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.

Research Starting Point

PAD is a security layer in biometric authentication; if its error rates vary by skin tone or ethnicity, users can face unequal lockouts or unequal spoofing risk.

Method

The authors compare a ResNet18 CNN baseline with two transformer variants, a Multimodal ViT-Tiny trained from scratch and a pretrained DeiT-S, on the CeFA benchmark, tracking not only accuracy and EER but also APCER/BPCER/ACER gaps across demographic groups and a zero-shot Central Asian split.
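The PAD error rates named above can be sketched in a few lines. This is a simplified illustration, assuming binary labels (1 = bona fide, 0 = attack) and a single pooled attack class; standardised evaluations typically report APCER per attack type, and the paper's exact protocol is not reproduced here. The function names `pad_errors` and `acer_gap` are hypothetical. The last two lines only re-derive the ratios quoted in the abstract from its own figures.

```python
def pad_errors(labels, preds):
    """Compute APCER, BPCER and ACER for one group.

    labels, preds: sequences where 1 = bona fide and 0 = attack.
    APCER: share of attacks accepted as bona fide.
    BPCER: share of bona fide presentations rejected as attacks.
    ACER:  mean of the two.
    """
    attack_preds = [p for l, p in zip(labels, preds) if l == 0]
    bona_preds = [p for l, p in zip(labels, preds) if l == 1]
    apcer = sum(p == 1 for p in attack_preds) / len(attack_preds)
    bpcer = sum(p == 0 for p in bona_preds) / len(bona_preds)
    return {"APCER": apcer, "BPCER": bpcer, "ACER": (apcer + bpcer) / 2}


def acer_gap(per_group):
    """Largest ACER difference between any two groups."""
    values = [metrics["ACER"] for metrics in per_group.values()]
    return max(values) - min(values)


# Toy predictions for two made-up groups; NOT data from the paper.
group_a = pad_errors([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 0, 0])
group_b = pad_errors([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1, 0])
gap = acer_gap({"A": group_a, "B": group_b})

# Ratios implied by the figures quoted in the abstract.
bpcer_ratio = 10.44 / 2.89            # ResNet18 vs DeiT-S, zero-shot split
gap_reduction = (0.75 - 0.13) / 0.75  # ACER gap vs the LBP-based work
print(round(bpcer_ratio, 1), round(gap_reduction, 2))
```

BPCER maps to unequal lockouts of genuine users and APCER to unequal spoofing risk, so reporting both per group matters more than a single pooled ACER.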

Paper Summary

For biometric buyers, the key point is that architecture and pretraining choices can affect fairness as well as headline accuracy. PAD evaluations should therefore include demographic slices and unseen-population tests before a model is treated as deployment-ready.