Research Radar · Face Recognition · arXiv · March 2026

Monthly arXiv Radar

March 2026 Face Recognition Papers: Fairness, Better Embeddings, and Explainable Comparison

March 2026 face recognition work centers on three production priorities: making verification fairer across demographic groups, improving embedding discriminability without amplifying shortcut bias, and explaining match decisions in language that auditors can inspect. This roundup distills those papers into a monthly digest for teams tracking the direction of biometric models.

What This Month Signals

The biggest strategic signal this month is that raw accuracy alone is no longer enough. Researchers increasingly treat fairness, trustworthiness, and evidence quality as first-class evaluation targets for face recognition systems.

Paper 01 · 2026-03-26 · cs.CV

Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

Authors & Institutions

Unsal Ozturk

Idiap Research Institute, Switzerland

Hatef Otroshi Shahreza

Idiap Research Institute, Switzerland

Sébastien Marcel

Idiap Research Institute, Switzerland

What Problem It Solves

The paper builds a benchmark across ethnicity and gender groups on IJB-C and RFW, helping teams quantify whether a seemingly strong MLLM is also equitable.

Key Result

FaceLLM-8B clearly leads the generic MLLM baselines, but the paper shows that the most accurate model is not always the fairest one and that uniformly poor systems can look artificially fair.

Abstract

Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.

Research Starting Point

Multimodal large language models are starting to appear in face verification workflows because they can compare images through general visual reasoning without the same task-specific training pipeline used by classic biometric systems. That flexibility is attractive, but it creates a new problem: teams may deploy a model that appears capable on average while hiding large performance gaps across gender or ethnicity groups. The paper is motivated by the lack of a fairness benchmark tailored to MLLM-style face verification, especially on standard biometric datasets where subgroup differences matter in real deployments.

Method

The authors benchmark nine open-source multimodal models from six families on IJB-C and RFW, treating them as face verification systems rather than generic chat models. They report Equal Error Rate and True Match Rate at multiple operating points for each subgroup, then add four fairness metrics built around false match rate disparities so the evaluation captures both raw accuracy and group-level imbalance. This design makes the paper useful not only as a leaderboard comparison, but also as a diagnostic template for buyers and researchers who need to ask whether a model is consistently reliable across populations.
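The subgroup evaluation described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's evaluation code: it computes the Equal Error Rate from genuine and impostor similarity scores and one simple FMR-based disparity metric across demographic groups (the ratio of worst to best per-group FMR at a shared threshold). Function names and the exact disparity formula are illustrative assumptions.

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: the operating point where FMR (impostor pairs
    accepted) equals FNMR (genuine pairs rejected)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(fmr - fnmr))
    return (fmr[idx] + fnmr[idx]) / 2.0

def fmr_at(impostor, threshold):
    """False Match Rate at a fixed decision threshold."""
    return (impostor >= threshold).mean()

def fmr_max_min_ratio(groups, threshold):
    """One simple FMR-based disparity metric: the ratio of the worst
    to the best per-group FMR at a shared threshold. A ratio of 1.0
    means impostor pairs are falsely matched at the same rate in
    every demographic group."""
    fmrs = [fmr_at(g["impostor"], threshold) for g in groups.values()]
    return max(fmrs) / max(min(fmrs), 1e-12)
```

A key point this structure makes concrete: a uniformly bad model (high but equal per-group FMRs) scores a perfect disparity ratio of 1.0, which is exactly the "artificially fair" failure mode the paper warns about, so disparity metrics must always be read alongside per-group accuracy.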

Paper Summary

This paper shows that the next face verification debate is no longer just about whether large multimodal models can work, but whether they work equitably. FaceLLM-8B performs best overall, yet the authors make it clear that the most accurate system is not automatically the fairest one. For anyone evaluating AI-based identity verification, the main takeaway is simple: subgroup reporting is becoming a first-class requirement, not a compliance afterthought.

Paper 02 · 2026-03-16 · cs.CV

The Good, the Better, and the Best: Improving the Discriminability of Face Embeddings through Attribute-aware Learning

Authors & Institutions

Ana Dias

University of Beira Interior, Portugal

IT: Instituto de Telecomunicações

NOVA LINCS

João Ribeiro Pinto

Amadeus, Portugal

Hugo Proença

University of Beira Interior, Portugal

IT: Instituto de Telecomunicações

João C. Neves

University of Beira Interior, Portugal

NOVA LINCS

What Problem It Solves

This work asks which attributes actually help identity discrimination and which should be suppressed because they are not identity-relevant.

Key Result

The headline result is that carefully choosing identity-relevant attributes beats using a larger generic attribute pool, and forcing the model to forget non-identity cues brings extra gains.

Abstract

Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders by allowing for the measurement of accuracy gains with suppression of non-identity-relevant attributes, with such gains suggesting shortcut learning from redundant attributes associated with each identity.

Research Starting Point

Attribute supervision has long been used to improve face embeddings, but many systems simply attach a long list of facial attributes and assume that more side information will automatically help. The authors question that assumption because some attributes are genuinely identity-relevant, while others mainly encode shortcuts, dataset quirks, or demographic bias. Their starting point is that face recognition models need to be selective about what auxiliary signals they absorb, not merely richer in supervision.

Method

The paper builds an attribute-aware recognition architecture that separates facial attributes into interpretable groups and optimizes them differently depending on their role. Identity-relevant attribute groups are learned jointly with the main recognition objective, while non-identity-related groups are actively suppressed through a gradient-reversal strategy so the embedding learns to forget misleading cues instead of simply ignoring them. The method is evaluated across several verification benchmarks and also used as a diagnostic tool to see which attribute groups reveal shortcut dependence in the backbone.
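The gradient-reversal idea at the heart of this design can be sketched without any deep-learning framework. The toy below is an illustrative numpy sketch, not the paper's architecture: a linear "encoder" feeds an identity head and a non-identity attribute head, and the attribute head's gradient is sign-flipped before it reaches the encoder, so the encoder is pushed to make the attribute unpredictable rather than merely unsupervised. All names and shapes are assumptions for illustration.

```python
import numpy as np

def grad_reverse(grad, lam=1.0):
    """Gradient-reversal layer: identity in the forward pass; the
    gradient flowing back into the encoder is multiplied by -lam."""
    return -lam * grad

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 4))   # toy linear encoder
w_id = rng.normal(size=4)         # identity head
w_attr = rng.normal(size=4)       # non-identity attribute head

x = rng.normal(size=4)            # one input sample
z = W_enc @ x                     # embedding

# Squared-error losses against toy scalar targets; these are the
# gradients of each loss with respect to the embedding z.
g_id = 2 * (w_id @ z - 1.0) * w_id
g_attr = 2 * (w_attr @ z - 1.0) * w_attr

# Since z = W_enc @ x, dL/dW_enc = outer(dL/dz, x). Without reversal
# the encoder would help the attribute head; with reversal it is
# updated to *unlearn* the non-identity cue.
g_enc = np.outer(g_id + grad_reverse(g_attr), x)
W_enc -= 0.1 * g_enc
```

The design choice this illustrates is the paper's core contrast: leaving non-identity attributes unsupervised lets the embedding retain them as shortcuts, whereas reversing their gradient actively removes them.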

Paper Summary

The paper’s most important insight is that better face recognition does not necessarily come from feeding the model more facial attributes, but from feeding it the right ones. Carefully chosen identity-relevant groups improve discriminability, and suppressing non-identity cues can bring another measurable lift. For product teams, this is a practical reminder that embedding quality is tied as much to what the model unlearns as to what it learns.

Paper 03 · 2026-03-17 · cs.CV

MLLM-based Textual Explanations for Face Comparison

Authors & Institutions

Redwan Sony

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

Anil K. Jain

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

Arun Ross

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

What Problem It Solves

The paper evaluates whether MLLM-generated explanations for face comparison are actually faithful to the visual evidence on unconstrained imagery.

Key Result

Even when the verification verdict is correct, the textual explanation often mentions unverifiable or hallucinated facial details. Adding legacy matcher scores improves decision quality but does not guarantee faithful reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.

Research Starting Point

There is growing demand for explainable face recognition, especially in high-stakes security and forensic settings where a numerical similarity score alone is hard to audit or defend. Multimodal LLMs appear to offer a natural solution because they can turn a match decision into a human-readable explanation. The authors are motivated by a more uncomfortable question: if those explanations sound plausible but are visually unfaithful, they may create a false sense of transparency rather than genuine interpretability.

Method

The study evaluates explanation quality on the challenging IJB-S benchmark, where surveillance imagery and extreme pose differences make face comparison much harder than clean portrait matching. The authors test multiple prompting regimes, including setups that provide legacy matcher scores and decisions, then measure not only whether the model outputs the right verdict but also whether its explanation carries evidential value. To do that, they introduce a likelihood-ratio-based evaluation framework that maps explanation embeddings into a more principled reliability score.
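The likelihood-ratio idea can be illustrated with a deliberately simplified sketch. Assume each textual explanation is embedded as a vector, and that embeddings of explanations for mated and non-mated pairs are each modeled with a diagonal Gaussian; the log-likelihood ratio then scores how much evidential weight an explanation carries. This is an assumption-laden toy, not the paper's actual framework, and every function name here is illustrative.

```python
import numpy as np

def fit_gaussian(X):
    """Fit a diagonal Gaussian to a set of explanation embeddings
    (rows of X). Returns (mean, variance per dimension)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-6   # variance floor for stability
    return mu, var

def log_likelihood(x, mu, var):
    """Log density of embedding x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def explanation_llr(x, mated_model, nonmated_model):
    """Log-likelihood ratio: positive values mean the explanation
    embedding looks like those produced for genuinely mated pairs;
    values near zero mean the text carries little evidential weight,
    however fluent it sounds."""
    return log_likelihood(x, *mated_model) - log_likelihood(x, *nonmated_model)
```

The useful property of this framing is that a plausible-sounding but hallucinated explanation, whose embedding is equally likely under both classes, receives an LLR near zero, separating linguistic polish from evidential strength.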

Paper Summary

The paper delivers a clear warning for anyone building explainable biometrics: a correct decision does not imply a trustworthy explanation. Even when MLLMs classify the pair correctly, they often mention facial details that are unverifiable, exaggerated, or simply hallucinated. The practical lesson is that explainability layers for face recognition need their own evaluation pipeline, because polished language can otherwise hide weak forensic grounding.