WIDER-FAIR: An Annotated Version of the WIDER-FACE Dataset for Fairness Evaluation
Authors & Institutions
Maxime Moussi
UCLouvain, Louvain-la-Neuve, Belgium
Benoît Ronval
UCLouvain, ICTEAM, Louvain-la-Neuve, Belgium
Siegfried Nijssen
UCLouvain, ICTEAM, Louvain-la-Neuve, Belgium
KU Leuven, DTAI, Leuven, Belgium
Félicien Schiltz
Euranova, Mont-Saint-Guibert, Belgium
What Problem It Solves
The paper addresses a measurement gap: widely used face detection benchmarks rarely include sensitive-feature labels such as perceived ethnicity or sex, which makes fairness claims about detectors hard to validate.
Key Result
In the YOLOv5 demonstration, detection performance is notably lower for faces of Black individuals, and excluding that group from training increases fairness disparity more than excluding any other ethnic group.
Abstract
The deployment of face detection models in real-world applications raises important fairness concerns, as these systems may exhibit performance disparities across demographic groups. A key obstacle to studying and mitigating such biases is the lack of face detection datasets with sensitive feature annotations. To address this gap, we introduce WIDER-FAIR, a new dataset built on the widely used WIDER-FACE benchmark, manually annotated with the perceived ethnicity and sex of each face. The dataset contains 16,256 images annotated across four ethnic groups (Asian, Black, Indian, and White) and two sex categories. We assess the quality and coherence of the annotations using face embeddings, a K-Nearest Neighbors classifier, and a t-SNE visualization, all of which support the consistency of the labeling process. As a demonstration of the dataset's potential, we train a YOLOv5 model and perform ablation studies on each sensitive feature. Among other findings, our experiments show that detection performance is notably lower for faces of Black individuals, and that excluding this group from training increases fairness disparity more than excluding any other ethnic group. These observations illustrate the value of demographically annotated datasets for understanding and evaluating bias in face detection models.
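The embedding-based consistency check mentioned in the abstract can be illustrated with a small leave-one-out nearest-neighbour test. This is only a sketch under stated assumptions: the function name `knn_label_agreement`, the value of `k`, the Euclidean distance, and the 2-D toy "embeddings" are all hypothetical, and the summary does not say which embedding model or classifier settings the authors used. The idea is that if manual labels are coherent, a face's label should usually agree with the labels of its closest neighbours in embedding space.

```python
import math
from collections import Counter

def knn_label_agreement(embeddings, labels, k=3):
    """Leave-one-out check: fraction of samples whose label matches the
    majority label of their k nearest neighbours (Euclidean distance)."""
    agree = 0
    for i, query in enumerate(embeddings):
        # Distances from this sample to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(query, other), labels[j])
            for j, other in enumerate(embeddings)
            if j != i
        )
        votes = Counter(label for _, label in neighbours[:k])
        predicted = votes.most_common(1)[0][0]
        agree += int(predicted == labels[i])
    return agree / len(embeddings)

# Toy 2-D points standing in for face embeddings (invented data):
# two well-separated clusters of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6)]
clean_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Same points, but one annotation deliberately flipped.
noisy_labels = ["B", "A", "A", "A", "B", "B", "B", "B"]

clean_score = knn_label_agreement(points, clean_labels)
noisy_score = knn_label_agreement(points, noisy_labels)
print(clean_score, noisy_score)
```

A lower agreement score on the second label set flags the flipped annotation. With real data, the points would be high-dimensional embeddings and a low score could reflect either labeling noise or a weak embedding model.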
Research Starting Point
Face detection is often the first step in recognition, liveness, and analytics pipelines, so a higher miss rate for one demographic group at this stage propagates into every downstream metric: a face that is never detected is never passed on to the later stages.
Method
The authors manually annotate 16,256 images across four perceived ethnic groups and two sex categories, then use the annotations to run training-data ablations that reveal how excluding specific groups changes detector fairness.
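The ablation comparison described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the helper names (`make_faces`, `per_group_recall`, `disparity`), the record layout, and the use of the gap between the best and worst per-group recall as the disparity measure are assumptions, since this summary does not state which fairness metric the paper uses. All counts are invented for the example.

```python
from collections import defaultdict

def make_faces(counts):
    """counts: {group: (n_detected, n_total)} -> list of (group, detected)."""
    faces = []
    for group, (n_detected, n_total) in counts.items():
        faces += [(group, True)] * n_detected
        faces += [(group, False)] * (n_total - n_detected)
    return faces

def per_group_recall(faces):
    """Recall per group from (group, detected) pairs, one per annotated face."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, detected in faces:
        totals[group] += 1
        hits[group] += int(detected)
    return {group: hits[group] / totals[group] for group in totals}

def disparity(recalls):
    """Assumed disparity measure: best group recall minus worst group recall."""
    values = list(recalls.values())
    return max(values) - min(values)

# Invented test-set results for a detector trained on all groups...
baseline = make_faces({
    "Asian": (9, 10), "Black": (8, 10), "Indian": (9, 10), "White": (9, 10),
})
# ...and for a detector retrained with one group removed from training.
ablated = make_faces({
    "Asian": (9, 10), "Black": (6, 10), "Indian": (9, 10), "White": (9, 10),
})

baseline_gap = disparity(per_group_recall(baseline))
ablated_gap = disparity(per_group_recall(ablated))
print(f"baseline gap: {baseline_gap:.2f}, ablated gap: {ablated_gap:.2f}")
```

Repeating the second step once per excluded group and comparing the resulting gaps gives the kind of ranking the Key Result refers to. Per-group annotations on the test set are what make `per_group_recall` computable in the first place.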
Paper Summary
WIDER-FAIR matters because it moves detector fairness from anecdote to testable evidence. For vendors, it is a reminder that a “good” detector benchmark score may hide group-specific failures unless the evaluation set carries the right annotations.