Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification
Authors & Institutions
Unsal Ozturk
Idiap Research Institute, Switzerland
Hatef Otroshi Shahreza
Idiap Research Institute, Switzerland
Sebastien Marcel
Idiap Research Institute, Switzerland
What Problem It Solves
The paper builds a fairness benchmark for MLLM-based face verification, evaluating four ethnicity groups and two gender groups on IJB-C and RFW, helping teams quantify whether a seemingly strong MLLM is also equitable.
Key Result
FaceLLM-8B clearly leads the generic MLLM baselines, but the paper shows that the most accurate model is not always the fairest one and that uniformly poor systems can look artificially fair.
Abstract
Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
Research Starting Point
Multimodal large language models are starting to appear in face verification workflows because they can compare images through general visual reasoning without the same task-specific training pipeline used by classic biometric systems. That flexibility is attractive, but it creates a new problem: teams may deploy a model that appears capable on average while hiding large performance gaps across gender or ethnicity groups. The paper is motivated by the lack of a fairness benchmark tailored to MLLM-style face verification, especially on standard biometric datasets where subgroup differences matter in real deployments.
Method
The authors benchmark nine open-source multimodal models from six families on IJB-C and RFW, treating them as face verification systems rather than generic chat models. They report Equal Error Rate and True Match Rate at multiple operating points for each subgroup, then add four fairness metrics built around false match rate (FMR) disparities, so the evaluation captures both raw accuracy and group-level imbalance. This design makes the paper useful not only as a leaderboard comparison, but also as a diagnostic template for buyers and researchers who need to ask whether a model is consistently reliable across populations.
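To make the evaluation recipe concrete, the sketch below computes a per-group Equal Error Rate and a simple FMR-disparity measure (the max/min ratio of per-group FMRs at a shared threshold) from verification scores. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the paper uses four specific FMR-based fairness metrics that may differ from the ratio shown here.

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: operating point where the false match rate (FMR,
    impostors accepted) equals the false non-match rate (FNMR, genuines rejected)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(fmr - fnmr))  # threshold where the two rates cross
    return (fmr[idx] + fnmr[idx]) / 2

def fmr_disparity(scores_by_group, threshold):
    """Hypothetical disparity measure: ratio of the worst to the best per-group
    FMR at one shared decision threshold. A value of 1.0 means all groups see
    the same false match rate; larger values indicate demographic imbalance."""
    fmrs = {g: (imp >= threshold).mean()
            for g, (gen, imp) in scores_by_group.items()}
    vals = np.array(list(fmrs.values()))
    return vals.max() / max(vals.min(), 1e-12)

# Synthetic scores standing in for one model's per-group comparisons.
rng = np.random.default_rng(0)
gen = rng.normal(0.8, 0.1, 1000)   # genuine-pair similarity scores
imp = rng.normal(0.3, 0.1, 1000)   # impostor-pair similarity scores
print(f"EER: {eer(gen, imp):.3f}")
print(f"FMR disparity: {fmr_disparity({'A': (gen, imp), 'B': (gen, imp + 0.05)}, 0.5):.2f}")
```

The key design point this mirrors from the paper is that accuracy (EER) and fairness (FMR disparity) are reported separately: a model can score well on one and poorly on the other, which is exactly how a uniformly bad model can look "fair".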
Paper Summary
This paper shows that the next face verification debate is no longer just about whether large multimodal models can work, but whether they work equitably. FaceLLM-8B performs best overall, yet the authors make it clear that the most accurate system is not automatically the fairest one. For anyone evaluating AI-based identity verification, the main takeaway is simple: subgroup reporting is becoming a first-class requirement, not a compliance afterthought.