Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking
Authors & Institutions
Chenkai Zhang
Independent Researcher, Wenzhou, Zhejiang, China
What Problem It Solves
The paper addresses how to make landmark-based webcam gaze tracking practical under small per-session calibration budgets, head motion, and runtime constraints.
Key Result
The exported eye-focused encoder is only 4.76 MB in ONNX and supports calibrated browser inference at around 12.6 ms per sample, while outperforming Elastic Net in the paper's fixation-style evaluations.
Abstract
Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image-based large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
Research Starting Point
In many practical webcam pipelines, the hardest part is not detecting a face but keeping gaze estimation stable under session drift, casual head motion, short calibration, and browser-side compute limits. The paper starts from the observation that many high-accuracy gaze systems assume a heavier runtime and a more forgiving hardware setting than real deployments can support. The author therefore targets a narrower but highly practical operating point: lightweight landmark-only inference that still adapts quickly to each new session.
Method
EMC-Gaze formulates landmark-based gaze estimation as a session-wise adaptation problem. It combines an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, and a closed-form ridge calibration head. During episodic meta-training the ridge solution is differentiated through, so that gradients from the post-calibration error reach the shared encoder. The method also adds a two-view canonicalization consistency loss and training-time auxiliary 3D gaze-direction supervision, so that pose robustness is learned in the representation instead of being deferred to a large deployment-time model.
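To make the per-session calibration step concrete, here is a minimal NumPy sketch of a closed-form ridge head fitted on a small calibration set and evaluated on held-out queries. It is an illustration under stated assumptions, not the author's implementation. The function names, the 4-dimensional embeddings, the regularization strength, the unpenalized offset term, and the synthetic data are all hypothetical. Errors are in arbitrary screen units, not degrees.

```python
import numpy as np


def fit_ridge_head(z_cal, y_cal, lam=1e-2):
    """Closed-form ridge fit mapping embeddings to 2D screen points."""
    # Append a constant column so the head can learn an offset.
    z = np.hstack([z_cal, np.ones((z_cal.shape[0], 1))])
    # Penalize the weights but not the offset (an illustrative choice).
    reg = lam * np.eye(z.shape[1])
    reg[-1, -1] = 0.0
    # Solve (Z^T Z + reg) W = Z^T Y rather than forming an explicit inverse.
    return np.linalg.solve(z.T @ z + reg, z.T @ y_cal)


def predict(z_query, w):
    """Apply a fitted ridge head to query embeddings."""
    z = np.hstack([z_query, np.ones((z_query.shape[0], 1))])
    return z @ w


def rmse(pred, target):
    """Root of the mean squared Euclidean distance between point sets."""
    return float(np.sqrt(np.mean(np.sum((pred - target) ** 2, axis=1))))


# Synthetic stand-in for one session: 4-D embeddings, linear ground truth.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(4, 2))
z_cal = rng.normal(size=(9, 4))      # 9 calibration samples
z_query = rng.normal(size=(50, 4))   # held-out query samples
y_cal = z_cal @ true_map + 0.01 * rng.normal(size=(9, 2))
y_query = z_query @ true_map + 0.01 * rng.normal(size=(50, 2))

W = fit_ridge_head(z_cal, y_cal)
rmse_ridge = rmse(predict(z_query, W), y_query)
# Baseline: always predict the mean calibration target.
rmse_mean = rmse(np.tile(y_cal.mean(axis=0), (50, 1)), y_query)
print(f"ridge RMSE: {rmse_ridge:.3f}, mean-baseline RMSE: {rmse_mean:.3f}")
```

The sketch covers only the deployment-time fit. In meta-training, the same solve would presumably run inside an automatic-differentiation framework, with the calibration samples as the support set and the query loss backpropagated through the solve into the encoder.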
Paper Summary
The main value of the paper is its deployment realism. It does not claim to beat every heavyweight appearance-based gaze tracker, but it shows that a small ONNX model with short calibration can still deliver meaningful improvements over classical geometric baselines. For teams building browser or edge-side face analysis, this is a strong example of how to trade a little leaderboard glamour for much better operational fit.