Choosing a Face Recognition Model: 1:1, 1:N Testing, and Threshold Selection
A FRVT-aligned framework for choosing an InsightFace recognition model, building 1:1 and 1:N benchmarks, picking thresholds, and deciding when to upgrade from open-source packs to InsightFace commercial models.
What you will build
Most face recognition mistakes happen during model selection and threshold tuning, not in code. A model that ranks well on a public leaderboard can still misbehave on your data because the operating point, demographic mix, capture quality, and gallery size are different.
This guide gives a vendor-neutral evaluation framework aligned with the NIST FRVT methodology. It covers 1:1 verification, 1:N identification, the metrics that matter, how to build and run a defensible test, the public numbers you can expect from the InsightFace open-source packs, and the conditions under which the InsightFace commercial models are the right choice, including what their higher accuracy means in concrete numbers.
Before you start
- Familiarity with running InsightFace through the insightface Python package or ONNX Runtime for face detection, alignment, and embedding extraction.
- A representative validation set covering the demographics, age range, pose, lighting, occlusion (mask, glasses, hair), camera quality, and capture distance of your production traffic.
- Basic numpy and scikit-learn for ROC, DET, and similarity-score analysis.
- A clearly defined target operating point expressed as FMR or FPIR (e.g., FMR = 1e-5), not just a single accuracy percentage.
1. Decide whether you have a 1:1 or 1:N problem
1:1 verification answers "is this person the same as the claimed identity?". It compares one probe template against one enrolled template and returns a similarity score and a same/different decision. Typical use cases are device unlock, KYC selfie-vs-document, payment confirmation, and re-authentication.
1:N identification answers "who, if anyone, is this person among N enrolled identities?". It compares one probe against a gallery of N templates and returns a candidate list. Typical use cases are access control gates, watchlist alerts, attendance, and deduplication. A model that ranks well on 1:1 LFW does not automatically scale to 1:N at galleries of 10^5 or 10^6, which is why NIST publishes FRVT 1:1 and FRVT 1:N as separate reports.
- 1:1 traffic is dominated by mated comparisons; the cost of a false match is per transaction.
- 1:N traffic is dominated by non-mated comparisons (most probes are not in the gallery for watchlists); false-positive rate scales with N.
- Pick the workload first. The same backbone may need different thresholds for 1:1 and 1:N use; the sketch below shows both query shapes.
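To make the two query shapes concrete, here is a minimal numpy sketch. The function names and the 0.40 threshold are illustrative, not part of the insightface API, and L2-normalized embeddings are assumed throughout.

import numpy as np

def verify(probe_emb, enrolled_emb, threshold=0.40):
    # 1:1 - compare one probe against one claimed identity
    score = float(np.dot(probe_emb, enrolled_emb))  # cosine on L2-normalized vectors
    return score, score >= threshold

def identify(probe_emb, gallery_embs, threshold=0.40, top_k=5):
    # 1:N open-set - rank the gallery, return only candidates above threshold
    scores = gallery_embs @ probe_emb          # (N,)
    order = np.argsort(scores)[::-1][:top_k]   # best candidates first
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]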
2. Adopt FRVT-style metrics
Move away from "accuracy" as a single number. FRVT reports give two complementary error rates at a fixed operating point, plotted as a curve so the trade-off is visible.
For 1:1 use FMR (False Match Rate) and FNMR (False Non-Match Rate). Pick a target FMR (typically 1e-4, 1e-5, or 1e-6) and report FNMR at that FMR. For 1:N use FPIR (False Positive Identification Rate) and FNIR (False Negative Identification Rate). Always specify gallery size N, rank, and whether the test is closed-set (probe is always in the gallery) or open-set (probe may not be).
- Always disclose the threshold along with the reported number; FNMR without an FMR or FNIR without an FPIR is meaningless.
- Plot DET (Detection Error Tradeoff) curves rather than ROC; they read better at low error rates (a plotting sketch follows this list).
- Report numbers per demographic stratum, not just on the overall set, mirroring FRVT demographic effects studies.
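A minimal DET plotting sketch from raw score arrays, assuming you already have genuine_scores and impostor_scores from your validation split; scipy's norm.ppf gives the normal-deviate axes on which DET curves are close to straight lines (scikit-learn's sklearn.metrics.det_curve computes the same error rates if you prefer it):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def det_points(genuine, impostor, n_points=200):
    # sweep thresholds over the observed score range
    lo = min(genuine.min(), impostor.min())
    hi = max(genuine.max(), impostor.max())
    thresholds = np.linspace(lo, hi, n_points)
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    return fmr, fnmr

fmr, fnmr = det_points(np.asarray(genuine_scores), np.asarray(impostor_scores))
ok = (fmr > 0) & (fmr < 1) & (fnmr > 0) & (fnmr < 1)  # ppf is undefined at 0 and 1
plt.plot(norm.ppf(fmr[ok]), norm.ppf(fnmr[ok]))
ticks = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
plt.xticks(norm.ppf(ticks), [f"{t:.0e}" for t in ticks])
plt.yticks(norm.ppf(ticks), [f"{t:.0e}" for t in ticks])
plt.xlabel("FMR")
plt.ylabel("FNMR")
plt.show()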
3. Build a defensible test set
Sample mated (same-person) and non-mated (different-person) pairs separately. For 1:1, estimating FMR = 1e-6 with statistical confidence requires on the order of 10 / FMR = 10^7 non-mate comparisons. Reusing the same probe in many pairs is acceptable, but be honest about the effective sample size: correlated comparisons carry less statistical information than independent ones.
Stratify by demographic group, capture device, indoor/outdoor, age gap between enrollment and probe, occlusion type, and head pose. Report metrics per stratum, not only the average. Keep test data strictly disjoint from any data used for training, fine-tuning, or pretraining (Glint360K, WebFace42M, MS1MV3, etc.) and document provenance and consent.
- Use at least 10 / target FMR non-mate comparisons; otherwise the FMR estimate has wide confidence intervals (see the interval sketch after this list).
- Freeze the test set. A test set you keep tuning against becomes a validation set.
- Maintain a small, locked "golden" subset that you re-run on every model or preprocessing change.
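To see what the 10 / target FMR rule buys you, here is a Clopper-Pearson (exact binomial) interval sketch using scipy; the comparison counts below are illustrative:

from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    # exact binomial CI for an observed error rate of k errors in n trials
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# 100 false matches in 1e7 non-mate comparisons: FMR ~ 1e-5, tight interval
print(clopper_pearson(100, 10_000_000))  # roughly (8.1e-06, 1.2e-05)
# 1 false match in 1e5 comparisons: same point estimate, far wider interval
print(clopper_pearson(1, 100_000))       # roughly (2.5e-07, 5.6e-05)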
4. Compute embeddings, similarity, and thresholds
Use the package's official preprocessing: RetinaFace or SCRFD for detection, 5-point alignment, 112x112 RGB crop, and the mean/std that ships with the recognition pack. Mismatched preprocessing is by far the most common reason that reported numbers cannot be reproduced.
Standardize on cosine similarity over L2-normalized embeddings. The InsightFace Python API exposes face.normed_embedding for exactly this. Pick the threshold from a validation split, freeze it, then evaluate on the test set; choosing the threshold on the test set inflates results.
- Typical 1:1 thresholds for InsightFace recognition packs land in the 0.30-0.45 cosine range at FMR = 1e-4 to 1e-5; the exact value depends on backbone, training data, and your population, so always recompute.
- Score normalization (z-norm, t-norm) helps when the gallery distribution shifts between deployments.
- If you fine-tune, recompute the threshold; never carry over a threshold across model versions.
import numpy as np
from insightface.app import FaceAnalysis
app = FaceAnalysis(
    name="buffalo_l",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))

def embed(image_bgr):
    faces = app.get(image_bgr)
    if not faces:
        return None
    # use the largest detected face
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    return face.normed_embedding.astype(np.float32)  # already L2-normalized

def cosine(a, b):
    return float(np.dot(a, b))  # both vectors are already L2-normalized

Next, derive the 1:1 threshold from the impostor score distribution and freeze it:

import numpy as np
# scores collected on a held-out validation split
genuine = np.array(genuine_scores) # same person pairs
impostor = np.array(impostor_scores) # different people pairs
# pick the operating point you must defend in production
target_fmr = 1e-5
# threshold = the score that only a target_fmr fraction of impostor scores exceed;
# the quantile is only stable with on the order of 10 / target_fmr impostor scores
threshold = float(np.quantile(impostor, 1.0 - target_fmr))
fnmr = float(np.mean(genuine < threshold))
print(f"threshold = {threshold:.4f}")
print(f"FNMR @ FMR = {target_fmr:.0e} -> {fnmr:.4f}")import numpy as np
# gallery_emb: (N, d) L2-normalized embeddings of enrolled identities
# probe_emb: (P, d) L2-normalized embeddings of probes
# probe_label: (P,) ground-truth gallery index, or -1 for non-mate probes (open-set)
scores = probe_emb @ gallery_emb.T # (P, N) cosine similarity
top1_idx = scores.argmax(axis=1)
top1_score = scores.max(axis=1)
# choose threshold from validation, then evaluate FPIR / FNIR
threshold = 0.40
mate = probe_label >= 0
non_mate = ~mate
fnir = float(np.mean(
    (top1_idx[mate] != probe_label[mate]) | (top1_score[mate] < threshold)
))
fpir = float(np.mean(top1_score[non_mate] >= threshold))
print(f"FNIR = {fnir:.4f}, FPIR = {fpir:.4f} at threshold {threshold:.2f}")5. Benchmark InsightFace open-source models
The insightface Python package distributes ready-to-use model packs that bundle a detector and a recognition backbone. The most commonly used recognition packs are buffalo_sc and buffalo_s (mobile / edge), buffalo_m (balanced), buffalo_l with a w600k_r50 head (server default), and antelopev2 with a glintr100 head (large server). The model zoo also publishes raw R50 and R100 ArcFace weights.
Public results on standard academic benchmarks (LFW, CFP-FP, AgeDB-30, IJB-B, IJB-C) put these packs in the following order-of-magnitude bands. Treat them as reference; always recompute on your data.
- LFW 1:1 accuracy: 99.50% (mobile MBF) to 99.85% (w600k_r50 / glintr100); LFW is saturated and only useful as a sanity check.
- CFP-FP (frontal-vs-profile): 96-99% across the lineup, R100-class clearly ahead on profile views.
- AgeDB-30: 96-98.5% across the lineup; large packs handle age gap better.
- IJB-C TAR @ FAR = 1e-4: roughly 90-93% for MBF / mobile, 95-96% for R50, 96-97.5% for R100 / w600k_r50 / glintr100.
- MFR (the InsightFace Masked Face Recognition challenge, ICCV 2021, whose ongoing protocol also scores African, Caucasian, East Asian, South Asian, and mixed cohorts at FMR = 1e-6 / 1e-5): the gap between the best and the worst cohort widens as the model gets smaller. R100-class packs (w600k_r50, glintr100) typically stay within a few percentage points of TAR across cohorts, R50 widens to mid-single-digit gaps, and mobile MBF can show double-digit gaps on the hardest cohort; reproduce on your own population before committing to an operating point.
- Throughput (single GPU, batch 32, FP16): MBF runs several thousand embeddings per second; R100 runs in the low hundreds. Always benchmark on your target hardware; a minimal harness follows this list.
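A minimal throughput sketch with onnxruntime on random inputs. The model path is an assumption (point it at the recognition ONNX file inside whichever pack you downloaded), and the recognition model is assumed to accept dynamic batches of 112x112 inputs:

import os
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    os.path.expanduser("~/.insightface/models/buffalo_l/w600k_r50.onnx"),
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = sess.get_inputs()[0].name
batch = np.random.rand(32, 3, 112, 112).astype(np.float32)

for _ in range(5):  # warm-up so lazy initialization does not skew the timing
    sess.run(None, {input_name: batch})

n_iters = 50
t0 = time.perf_counter()
for _ in range(n_iters):
    sess.run(None, {input_name: batch})
dt = time.perf_counter() - t0
print(f"{n_iters * batch.shape[0] / dt:.0f} embeddings/s")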
6. Match the open-source pack to the workload
For most product builds you only need to choose between two of the open-source packs: buffalo_s (and its smaller sibling buffalo_sc) for mobile and edge, and buffalo_l for the server.
buffalo_s / buffalo_sc are the right default for on-device face unlock, mobile SDK integrations, embedded boxes, and any workload where latency and binary size dominate over absolute accuracy. buffalo_l (w600k_r50) is the right default for any server-side recognition: 1:1 verification, 1:N identification on galleries up to a few hundred thousand identities, and target FMR around 1e-5.
- Mobile / edge: pick buffalo_s, or buffalo_sc when you are constrained on memory or compute budget.
- Server: pick buffalo_l. It is the strongest open-source recognition pack we ship and works for the majority of cooperative-capture verification and identification scenarios.
- Open-source packs are sufficient for most product features that operate at FMR >= 1e-5 on cooperative captures. Beyond that, see the next section.
7. When to upgrade to InsightFace commercial models
The InsightFace commercial recognition models are trained on substantially larger and more diverse identity sets with proprietary loss formulations and training recipes, and are released with documented operating points and signed artifacts. They are not just "the open-source model with more parameters".
In concrete numbers, on internal balanced 1:1 protocols at FMR = 1e-6 the commercial models typically reduce FNMR by a factor of 2-5 compared to the strongest open-source pack (for example, FNMR dropping from roughly 5-8% to 1-2% on hard subsets such as masked, large-pose, or low-resolution probes). On demographically balanced 1:N at gallery sizes of 1M+, FNIR at fixed FPIR drops by similar ratios, and the spread between the best and worst demographic subgroup at strict operating points narrows.
To make the upgrade decision easy to validate on your own data, we offer a 2-week free evaluation of the commercial recognition models after a preliminary cooperation agreement (NDA / pilot agreement) is signed. During the trial you receive time-limited access to the commercial model artifacts or hosted API, can run the same FRVT-style 1:1 and 1:N tests described in this guide, and can compare the numbers directly against the open-source pack you are currently using before any commercial commitment.
- Choose the commercial models when you operate at FMR <= 1e-6, for example border control, payment authorization, or regulated KYC.
- Choose them when gallery size exceeds 100k-1M identities and rank-1 stability matters.
- Choose them when fairness audits require closing the gap between best and worst demographic subgroup at strict operating points.
- Choose them when production includes hard conditions: heavy occlusion, large pose, low resolution (under ~48 px inter-pupillary distance), or non-cooperative capture.
- Choose them when you need enterprise SLA, on-prem licensing, integrated PAD / liveness, signed model artifacts, and indemnification.
- Sign a preliminary cooperation agreement to start a 2-week free evaluation on your own data before any commercial commitment.
8. Production rollout and ongoing evaluation
Lock the model artifact (cryptographic hash), the preprocessing code path, the threshold, and the metric definition together as one release unit. Preprocessing changes silently shift FMR, so versioning preprocessing matters at least as much as versioning weights.
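One way to enforce this is a small release manifest generated at build time; a sketch, with illustrative field names and an assumed local model file:

import hashlib
import json

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    "model_file": "w600k_r50.onnx",
    "model_sha256": sha256_of("w600k_r50.onnx"),
    "preprocessing_version": "align-v3",  # version the alignment code path too
    "threshold": 0.3642,                  # frozen on the validation split
    "operating_point": {"metric": "FNMR@FMR", "target_fmr": 1e-5},
}
with open("release_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)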
Re-evaluate on fresh production samples on a regular cadence, at minimum quarterly. The single most important live metric is FMR at the production threshold, computed against a fresh impostor set; it tells you whether the operating point you promised the business still holds. A minimal drift check is sketched after the list below.
- Track FMR / FNMR drift, false-alert rate, operator-override rate, and demographic deltas together.
- Have a rollback plan. Threshold and model are co-versioned; never roll back one without the other.
- When you swap models, recompute the threshold and re-publish the operating point before flipping production traffic.
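A minimal live-FMR drift check, reusing the frozen threshold from the release manifest; fresh_impostor_scores and the 3x alerting multiplier are illustrative:

import numpy as np

def live_fmr(fresh_impostor_scores, frozen_threshold):
    # fraction of fresh non-mate comparisons the production threshold would accept
    return float(np.mean(np.asarray(fresh_impostor_scores) >= frozen_threshold))

target_fmr = 1e-5
fmr_now = live_fmr(fresh_impostor_scores, frozen_threshold=0.3642)
if fmr_now > 3 * target_fmr:  # the alerting multiplier is a policy choice
    print(f"ALERT: live FMR {fmr_now:.1e} exceeds budget {target_fmr:.0e}")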
Need help with production deployment?
Contact InsightFace for model licensing, runtime optimization, and deployment support for your target hardware.
Contact Us