Research Radar · Face Detection · arXiv · March 2026

Monthly arXiv Radar

March 2026 Face Detection Radar: Landmark Pipelines, Calibration, and Anti-Spoofing

Pure face detector papers were relatively sparse on arXiv in March 2026, so this radar widens the lens to the broader face detection stack: landmark extraction, calibration-friendly geometry, and anti-spoofing checks that sit directly upstream of production face recognition. That broader framing still captures how real-world face detection systems are evaluated and deployed.

What This Month Signals

The common thread is deployment realism: lightweight geometry pipelines, session adaptation, and liveness reasoning are becoming as important as raw detection capability in commercial face stacks.

Paper 01 · 2026-03-12 · cs.CV

Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

Authors & Institutions

Chenkai Zhang

Independent Researcher, Wenzhou, Zhejiang, China

What Problem It Solves

The paper addresses how to make landmark-based face geometry practical under small per-session calibration budgets, head motion, and runtime constraints.

Key Result

The exported eye-focused encoder is only 4.76 MB in ONNX and supports calibrated in-browser inference at around 12.6 ms per sample, while outperforming an Elastic Net baseline across the paper's fixation-style evaluations.

Abstract

Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image-based large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.

Research Starting Point

In many practical webcam pipelines, the hardest part is not detecting a face at all, but keeping geometric estimation stable under session drift, casual head motion, short calibration, and browser-side compute limits. The paper starts from the observation that many high-accuracy gaze systems assume a heavier runtime and a more forgiving hardware setting than real deployments can support. The author therefore targets a narrower but highly practical operating point: lightweight landmark-only inference that still adapts quickly to each new session.

Method

EMC-Gaze formulates landmark-based gaze estimation as a session-wise adaptation problem. It combines an E(3)-equivariant landmark graph encoder, richer local eye geometry, binocular emphasis, and a closed-form ridge calibration head that is differentiated through episodic meta-training. The method also adds canonicalization consistency and training-time auxiliary 3D supervision so that pose robustness is learned in the representation instead of being deferred to a large deployment-time model.
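The closed-form ridge head is what makes the per-session adaptation cheap: the encoder stays frozen, and only a small linear map is solved from the handful of calibration samples. A minimal NumPy sketch under illustrative assumptions (a 16-dim embedding, 9 calibration points, and a regularization value not taken from the paper):

```python
import numpy as np

def fit_ridge_head(embeddings, targets, lam=1e-2):
    """Closed-form ridge fit: solve (Z^T Z + lam*I) W = Z^T Y,
    where Z is the embedding matrix with a bias column appended."""
    Z = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ targets)

def predict(embeddings, W):
    Z = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    return Z @ W

# Toy session: 9 calibration points, a 16-dim embedding, 2-D screen targets.
rng = np.random.default_rng(0)
calib_emb = rng.normal(size=(9, 16))   # frozen-encoder outputs (stand-in)
calib_xy = rng.normal(size=(9, 2))     # known on-screen gaze targets
W = fit_ridge_head(calib_emb, calib_xy)
preds = predict(calib_emb, W)          # per-session calibrated predictions
```

In the paper this solve is additionally differentiated through during meta-training, so the encoder learns embeddings that calibrate well; the sketch shows only the deployment-time fit.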

Paper Summary

The main value of the paper is its deployment realism. It does not claim to beat every heavyweight appearance-based gaze tracker, but it shows that a small ONNX model with short calibration can still deliver meaningful improvements over classical geometric baselines. For teams building browser or edge-side face analysis, this is a strong example of how to trade a little leaderboard glamour for much better operational fit.
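The mean/median/p90 latency reporting in this paper is worth imitating when benchmarking your own exported models; below is a small timing helper, with a dummy workload standing in for a real ONNX Runtime session call:

```python
import statistics
import time

def latency_stats(fn, n_warmup=10, n_runs=100):
    """Run fn repeatedly and report mean/median/p90 latency in ms."""
    for _ in range(n_warmup):        # warm up caches before measuring
        fn()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p90_ms": samples[max(0, int(0.9 * len(samples)) - 1)],
    }

# Dummy workload; swap in a lambda that runs your model's inference call.
stats = latency_stats(lambda: sum(i * i for i in range(2000)))
```

Reporting median and p90 alongside the mean makes tail behavior visible, which matters for interactive browser inference far more than the average alone.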

Paper 02 · 2026-03-25 · cs.CV

Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

Authors & Institutions

Daniele Agostinelli

Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche, Italy

Thomas Agostinelli

Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche, Italy

Andrea Generosi

Department of Science and Information Technology, Università Pegaso, Italy

Maura Mengoni

Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche, Italy

What Problem It Solves

This paper evaluates the true ceiling of landmark-only modeling across modern gaze datasets and cross-domain settings, rather than assuming CNN-heavy pipelines are mandatory.

Key Result

Landmark-only models trail appearance-based models in within-domain accuracy but match ResNet18-style baselines much more closely in cross-domain generalization, suggesting geometry remains surprisingly competitive when robustness matters.

Abstract

Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.

Research Starting Point

Appearance-based models dominate modern gaze estimation, but they remain expensive, opaque, and harder to deploy in privacy-sensitive environments. Landmark-only modeling promises a much lighter alternative, yet the field still lacks a rigorous comparison that tests whether sparse geometry is merely a cheap approximation or a serious competitive representation. The authors are motivated by this gap and by the broader question of how much information facial geometry alone can carry across datasets.

Method

The paper first constructs normalized landmark-based versions of three major datasets—Gaze360, ETH-XGaze, and GazeGene—then trains three lightweight regressors on top of those features: XGBoost, a holistic MLP, and a siamese MLP tailored to binocular geometry. The evaluation includes both within-domain and cross-domain testing, so the study can separate raw benchmark fit from true generalization. The authors also analyze feature importance and identify landmark detector noise as one of the key bottlenecks limiting the ceiling of geometry-only systems.
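The exact normalization procedure lives in the paper's repository; a common recipe for the landmark side, shown here as an assumption rather than the authors' precise pipeline, is centroid-centering plus inter-ocular scaling so the regressor sees translation- and scale-free geometry:

```python
import numpy as np

def normalize_landmarks(pts, left_eye_idx, right_eye_idx):
    """Center landmarks on their centroid, then scale by the inter-ocular
    distance so face position and size drop out of the features."""
    pts = np.asarray(pts, dtype=float)
    centered = pts - pts.mean(axis=0)
    iod = np.linalg.norm(centered[left_eye_idx] - centered[right_eye_idx])
    return centered / iod

# Toy 5-point face (eyes, nose tip, mouth corners) in pixel coordinates.
pts = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0],
                [40.0, 80.0], [60.0, 80.0]])
feat = normalize_landmarks(pts, left_eye_idx=0, right_eye_idx=1).ravel()
```

The flattened `feat` vector is the kind of input a holistic MLP would consume; the siamese variant would instead pass per-eye landmark subsets through shared weights to capture binocular geometry.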

Paper Summary

The headline takeaway is that geometry alone is not enough to win every benchmark, but it is much stronger than many people assume when cross-domain robustness matters. The best landmark-based MLPs lag image models in within-domain accuracy, yet become surprisingly competitive once the domain shifts. For edge AI and privacy-first products, that makes sparse geometry a strategically interesting option rather than a research curiosity.

Paper 03 · 2026-03-01 · cs.CV

From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

Authors & Institutions

Haoyuan Zhang

SAI, University of Chinese Academy of Sciences

MAIS, Institute of Automation, Chinese Academy of Sciences

Baidu Inc.

Keyao Wang

Baidu Inc.

Guosheng Zhang

Baidu Inc.

Haixiao Yue

Baidu Inc.

Zhiwen Tan

Baidu Inc.

Siran Peng

SAI, University of Chinese Academy of Sciences

MAIS, Institute of Automation, Chinese Academy of Sciences

Tianshuo Zhang

SAI, University of Chinese Academy of Sciences

MAIS, Institute of Automation, Chinese Academy of Sciences

Xiao Tan

Baidu Inc.

Kunbin Chen

Baidu Inc.

Wei He

Baidu Inc.

Jingdong Wang

Baidu Inc.

Ajian Liu

SAI, University of Chinese Academy of Sciences

MAIS, Institute of Automation, Chinese Academy of Sciences

Xiangyu Zhu

SAI, University of Chinese Academy of Sciences

MAIS, Institute of Automation, Chinese Academy of Sciences

Zhen Lei

SAI, University of Chinese Academy of Sciences

MAIS, Institute of Automation, Chinese Academy of Sciences

CAIR, HKISI, CAS

Macao University of Science and Technology

What Problem It Solves

The paper targets the weak generalization of face anti-spoofing systems and studies whether reasoning-augmented multimodal pipelines can inspect attack evidence more robustly.

Key Result

TAR-FAS reports state-of-the-art generalization under a challenging one-to-eleven cross-domain protocol, pairing each decision with a fine-grained visual investigation of spoof evidence, which is exactly what practical liveness systems need when attack formats change.

Abstract

Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.

Research Starting Point

Face anti-spoofing systems often collapse when the attack style changes, because they quietly overfit to a small set of recurring artifacts from training datasets. The authors are motivated by the gap between benchmark performance and real-world robustness, where new print attacks, replay attacks, or generative spoofs may look different from anything seen before. They frame the problem as one of moving from shallow pattern matching toward a more explicit evidence-seeking and reasoning process.

Method

The paper proposes a tool-augmented reasoning framework for generalizable face anti-spoofing, where the model does not stop at a first visual impression but progressively gathers supporting cues. Instead of trusting a single end-to-end classifier to absorb every attack clue, the method emphasizes intermediate investigation steps and explicit evidence integration. That design is meant to make liveness judgment less dependent on brittle dataset artifacts and more resilient to unfamiliar spoof formats.
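The control flow can be pictured as a loop that starts from an intuitive verdict and invokes visual tools for confirming evidence before committing. The sketch below is a heavily simplified stand-in: the tool names, scores, and threshold are invented for illustration, and the real framework uses an MLLM trained with DT-GRPO to decide autonomously which tools to call:

```python
# Hypothetical tools; in the paper these are external visual analyzers.
def texture_zoom(image, region):
    """Inspect fine-grained texture for moire/print artifacts (toy score)."""
    return {"moire_score": 0.82}

def edge_probe(image, region):
    """Look for paper or screen borders around the face (toy score)."""
    return {"border_score": 0.10}

TOOLS = {"texture_zoom": texture_zoom, "edge_probe": edge_probe}

def cot_vt_loop(image, plan, threshold=0.5):
    """Gather evidence from each planned tool call, then integrate it
    into a single live/spoof verdict."""
    evidence = {}
    for tool_name, region in plan:
        evidence.update(TOOLS[tool_name](image, region))
    spoof_score = max(evidence.values(), default=0.0)
    return ("spoof" if spoof_score >= threshold else "live"), evidence

verdict, evidence = cot_vt_loop(
    image=None, plan=[("texture_zoom", "face"), ("edge_probe", "border")])
```

The design point the sketch captures is that the verdict rests on accumulated evidence from multiple probes rather than one end-to-end score, which is what the paper argues makes the system less dependent on dataset-specific artifacts.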

Paper Summary

Although the paper is framed around anti-spoofing, its broader message is relevant to the whole face detection and verification stack: robustness comes from better evidence collection, not just bigger backbones. For practitioners, the idea is compelling because liveness is often the first real-world failure point in KYC and access control systems. A detector that reasons about spoof evidence instead of memorizing one dataset could be much more useful in production.