Research Radar · Face Recognition · arXiv · June 2026

Monthly arXiv Radar

June 2026 Face Recognition Papers: Low-Resolution MoE, Efficient ViTs, and 1024-Byte Travel Documents

June 2026 face recognition work was unusually deployment-focused. The strongest papers ask how recognition survives bad capture conditions, tight compute budgets, and extreme storage limits rather than assuming clean enrollment photos and unconstrained servers.

What This Month Signals

Together, the papers push recognition toward a more resilient product stack: adapt capacity for degraded faces, expose latency-quality trade-offs in ViTs, and engineer document images for severe byte budgets.

Paper 01 · 2026-06-30 · cs.CV

FaceMoE: Mixture of Experts for Low-Resolution Face Recognition

Authors & Institutions

Kartik Narayan

Johns Hopkins University

Vishal M. Patel

Johns Hopkins University

What Problem It Solves

The paper addresses the weakness of a single shared encoder: once fine-tuned on low-resolution data, it struggles to generalize across both high- and low-resolution domains, and catastrophic forgetting erodes the high-resolution discriminative knowledge it learned during pretraining.

Key Result

Across eleven high-resolution, mixed-quality, and low-resolution benchmarks, the authors report that FaceMoE significantly outperforms state-of-the-art low-resolution face-recognition methods while activating only a sparse subset of experts per token.

Abstract

Low-resolution face recognition (LR-FR) remains a challenging task due to poor feature extraction and aggregation, as probe images often contain limited identity information resulting from extreme degradations such as blur, occlusion, and low contrast. Additionally, the domain gap between high-resolution (HR) gallery images and low-resolution (LR) probe images poses a significant challenge. A single feature encoder struggles to generalize effectively across both domains when fine-tuned on an LR dataset, and this issue is further magnified by catastrophic forgetting. To address these challenges, we propose FaceMoE, an effective adaptation of Mixture of Experts (MoE) transformer architecture for low-resolution face-recognition. Specifically, we introduce multiple specialized feed-forward network (FFN) experts and incorporate a top-k router, which dynamically assigns tokens to appropriate experts. This design emergently promotes specialization across experts for different semantic regions of the face, which enables FaceMoE to perform resolution-aware feature extraction. Moreover, the top-k router facilitates sparse expert activation, enabling the model to preserve pretrained knowledge when finetuned on a LR dataset, while increasing model capacity without proportional computational overhead. FaceMoE is trained with a combined face recognition loss, router z-loss, and load balancing loss to ensure expert specialization and stable training. To the best of our knowledge, this is the first work leveraging MoE for LR-FR. Extensive experiments across eleven datasets, spanning HR, mixed-quality, and LR benchmarks, demonstrate that FaceMoE significantly outperforms state-of-the-art methods. Code: https://github.com/Kartik-3004/FaceMoE

Research Starting Point

Surveillance, access-control, and border workflows often compare degraded probe images with cleaner enrollment images; the failure mode is not just less detail, but a domain gap that can cause an adapted encoder to forget high-quality recognition behavior.

Method

FaceMoE inserts specialized feed-forward experts into a transformer and uses top-k routing so each token can select a small set of experts. The training objective combines face-recognition loss with router z-loss and load-balancing loss, which encourages stable specialization without making every expert active for every image.
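To make the routing and auxiliary losses concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: every function name is hypothetical, and the loss formulas are assumptions based on the forms commonly used in the MoE literature (squared log-sum-exp of the router logits for the z-loss; number of experts times the sum over experts of dispatch fraction times mean router probability for load balancing). The paper may define or weight these terms differently.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Pick the k highest-probability experts for one token.

    Returns (expert indices, gate weights renormalized to sum to 1).
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / mass for i in chosen]

def router_z_loss(batch_logits):
    """Assumed form: mean squared log-sum-exp, which penalizes large logits."""
    def log_sum_exp(row):
        m = max(row)
        return m + math.log(sum(math.exp(x - m) for x in row))
    return sum(log_sum_exp(row) ** 2 for row in batch_logits) / len(batch_logits)

def load_balancing_loss(batch_logits, k):
    """Assumed form: n_experts * sum_i(dispatch_fraction_i * mean_prob_i).

    Equals 1.0 when routing is perfectly uniform and grows as tokens
    collapse onto a few experts.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for row in batch_logits:
        chosen, _ = top_k_route(row, k)
        for i in chosen:
            counts[i] += 1
        for i, p in enumerate(softmax(row)):
            mean_prob[i] += p / n_tokens
    fractions = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(fractions, mean_prob))
```

In a real model these two terms would be scaled by small coefficients and added to the face-recognition loss; the coefficients are not given in the text above, so none are suggested here.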

Paper Summary

FaceMoE is useful for teams that cannot control image quality at capture time. Its main product implication is a routing-based way to add capacity for degraded faces without retraining a completely separate low-resolution system or paying the full cost of a larger dense model.

Paper 02 · 2026-06-10 · cs.CV

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

Authors & Institutions

Tahar Chettaoui

Fraunhofer Institute for Computer Graphics Research IGD, Germany

Department of Computer Science, Technical University of Darmstadt, Germany

Guray Ozgur

Fraunhofer Institute for Computer Graphics Research IGD, Germany

Department of Computer Science, Technical University of Darmstadt, Germany

Eduarda Caldeira

Fraunhofer Institute for Computer Graphics Research IGD, Germany

Department of Computer Science, Technical University of Darmstadt, Germany

Naser Damer

Fraunhofer Institute for Computer Graphics Research IGD, Germany

Department of Computer Science, Technical University of Darmstadt, Germany

Fadi Boutros

Fraunhofer Institute for Computer Graphics Research IGD, Germany

What Problem It Solves

The paper tackles the rigid all-layers inference pattern: production systems often run the full model even when intermediate layers are already discriminative enough for many comparisons.

Key Result

Later exits preserve most verification performance; exiting at layer 10 is reported to provide up to 20% speedup with about a 1.5-point drop on IJB-C, while projection fine-tuning improves shallower exits.

Abstract

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, thus reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. Also, we propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.

Research Starting Point

ViT face recognition is attractive for accuracy but expensive for edge devices, browser SDKs, and high-throughput verification pipelines where every transformer block adds latency.

Method

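The framework takes embeddings from intermediate transformer blocks, which all share the same feature dimensionality, so verification can run from any exit without modifying or retraining the backbone. The authors analyze how patch embeddings and attention maps converge across depth, then add ViT-FREE_FT, a lightweight adaptation that fine-tunes only the projection layers on a small synthetic dataset to improve shallow exits while the backbone stays frozen.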
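The core idea of early exiting can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the ViT-FREE code: blocks are modeled as arbitrary callables that keep the vector length unchanged, the optional `project` head stands in for the projection layers, and all names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]

def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def embed_at_exit(x: Vector, blocks: Sequence[Block], exit_layer: int,
                  project: Optional[Block] = None) -> Vector:
    """Run only the first `exit_layer` blocks and return a unit-length embedding.

    Because every block preserves the feature dimension, the output of any
    block can be fed to the same kind of head; later blocks are never executed.
    """
    if not 1 <= exit_layer <= len(blocks):
        raise ValueError("exit_layer must be between 1 and len(blocks)")
    h = list(x)
    for block in blocks[:exit_layer]:
        h = block(h)
    if project is not None:  # hypothetical exit-specific projection head
        h = project(h)
    return l2_normalize(h)

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity for embeddings that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

A verification call would embed both images at the same exit and compare the `cosine` score against a threshold calibrated for that exit, since score distributions can shift with depth.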

Paper Summary

ViT-FREE gives deployment teams a practical latency dial. Instead of choosing between a compact model and a full ViT, they can expose multiple operating points, reserve deeper inference for hard cases, and tune the shallow exits with synthetic faces when real calibration data is scarce.
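One way a team might reserve deeper inference for hard cases is a simple confidence cascade: accept or reject at a shallow exit when the score is far from the decision threshold, and go deeper only when it is close. The sketch below is an assumption about how such a policy could look, not a procedure described for the paper; `score_at`, `threshold`, and `margin` are hypothetical names, and a single shared threshold is a simplification.

```python
from typing import Callable, Sequence, Tuple

def cascade_verify(score_at: Callable[[int], float], exits: Sequence[int],
                   threshold: float, margin: float) -> Tuple[bool, int]:
    """Decide match / non-match using the shallowest exit that is confident.

    score_at(depth) returns the similarity score computed at that exit depth.
    An exit is treated as confident when its score lies more than `margin`
    away from `threshold`; the deepest exit always decides.
    Returns (is_match, exit_depth_used).
    """
    if not exits:
        raise ValueError("at least one exit depth is required")
    ordered = sorted(exits)
    for depth in ordered[:-1]:
        score = score_at(depth)
        if abs(score - threshold) > margin:
            return score >= threshold, depth
    final = ordered[-1]
    return score_at(final) >= threshold, final
```

In a real ViT the deeper pass can resume from the activations already computed at the shallow exit, so a borderline pair costs about one full forward pass rather than several.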

Paper 03 · 2026-06-29 · cs.CV

Optimizing Image Preparation and Compression for Face Recognition within 1024 Bytes

Authors & Institutions

Paul Andreas

Department of Computer Science, Hochschule Darmstadt, Schöfferstraße 3, 64295 Darmstadt, Germany

Torsten Schlett

Department of Computer Science, Hochschule Darmstadt, Schöfferstraße 3, 64295 Darmstadt, Germany

Christoph Busch

Department of Computer Science, Hochschule Darmstadt, Schöfferstraße 3, 64295 Darmstadt, Germany

What Problem It Solves

The paper addresses a concrete storage-versus-recognition trade-off: which image size, color mode, smoothing, resizing, and codec choices keep recognition viable at only 1024 bytes.

Key Result

JPEG AI is reported as the strongest option with optimized settings; AVIF and WebP also perform well. Grayscale helps when both images are ICAO-compliant, while color is preferable for less suitable probes, and smoothing/resizing before compression helps.

Abstract

ICAO-compliant machine readable travel documents enable automated biometric face verification. The biometric reference is stored on an RFID chip included in the form of a JPEG or JPEG 2000 compressed facial image. In contrast, temporary travel documents lack machine readability, which excludes the owner from such automated processes. This disadvantage could be solved by equipping such documents with 2D barcodes. This technology offers a resource-saving alternative to expensive RFID chips, while still offering machine readability and fast issuing processes. However, this solution introduces the challenge of storing the face images at significantly smaller storage capacities, creating the need for reducing the file size of the included facial image to a maximum of 1024 bytes. This study examines preprocessing steps and compression configurations, using JPEG, JPEG 2000, JPEG XL, JPEG AI, HEIF, AVIF, and WebP for image compression to this target size, while still preserving as much face recognition performance as possible. While the reference sample must always comply with ICAO specifications, the individual samples may or may not meet these requirements, depending on the application. This work optimizes compression steps for both of these prerequisites. It is shown that the recently standardised JPEG AI, when using optimized settings, provides the best face recognition performance, in particular when the comparison includes only images with high face image quality. AVIF and WebP also provide good results. The losses caused by the strong lossy compression are comparatively small. For the comparison of ICAO-compliant face images only, converting the images to grayscale proves to be a helpful preprocessing step, whereas for comparisons involving less suitable samples, preserving color is preferable. In addition, smoothing and resizing the images beforehand also turns out to be beneficial.

Research Starting Point

Document and identity teams need machine-readable face references even when RFID chips are too costly or unavailable, but aggressive compression can silently break automated verification.

Method

The authors evaluate JPEG, JPEG 2000, JPEG XL, JPEG AI, HEIF, AVIF, and WebP under two comparison regimes: one where both images meet ICAO requirements, and one where the probe may be less controlled. They tune preprocessing choices before compression and measure downstream face-recognition performance rather than only pixel quality.
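The general shape of a "prepare, then compress to a byte budget" pipeline can be sketched with the Python standard library alone. This is a simplified illustration, not the paper's recipe: `toy_encode` is a stand-in that quantizes pixels and applies zlib rather than a real image codec, the 2x2 box filter is just one possible smoothing-plus-resizing step, and all function names are hypothetical. Only the 1024-byte target comes from the paper.

```python
import zlib
from typing import Callable, Iterable, List, Optional, Tuple

BUDGET_BYTES = 1024  # target size from the paper's title

def to_grayscale(rgb_rows: List[List[Tuple[int, int, int]]]) -> List[List[int]]:
    """Convert RGB pixels to 8-bit luma using ITU-R BT.601 weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]

def box_downsample(gray_rows: List[List[int]]) -> List[List[int]]:
    """Average each 2x2 block: a crude smoothing and resizing step in one."""
    out = []
    for y in range(0, len(gray_rows) - 1, 2):
        row = []
        for x in range(0, len(gray_rows[y]) - 1, 2):
            total = (gray_rows[y][x] + gray_rows[y][x + 1]
                     + gray_rows[y + 1][x] + gray_rows[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out

def toy_encode(gray_rows: List[List[int]], quality: int) -> bytes:
    """Stand-in lossy encoder (NOT a real codec): quantize, then zlib."""
    step = 101 - quality  # quality 100 keeps every level, quality 1 is coarsest
    flat = bytes(p // step for row in gray_rows for p in row)
    return zlib.compress(flat, 9)

def fit_to_budget(encode: Callable[[int], bytes], qualities: Iterable[int],
                  budget: int = BUDGET_BYTES) -> Optional[Tuple[int, bytes]]:
    """Return (quality, payload) for the highest quality that fits the budget.

    Scans from high to low so it does not rely on size being monotone in
    quality. Returns None when no setting fits.
    """
    for quality in sorted(qualities, reverse=True):
        payload = encode(quality)
        if len(payload) <= budget:
            return quality, payload
    return None
```

With a real codec binding, `encode` would wrap that codec's quality or bitrate parameter, for example `fit_to_budget(lambda q: toy_encode(image, q), range(1, 101))` with `toy_encode` swapped out. Choosing between candidate recipes should then be based on recognition performance over the decoded images, which is the criterion the paper optimizes.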

Paper Summary

This is a rare paper with immediate policy and engineering value. It turns a standards problem into a reproducible compression recipe, which helps vendors reason about temporary documents, offline verification, and barcode-based identity flows without guessing whether “small enough” is still biometrically useful.