Convert ONNX Face Models to TensorRT and OpenVINO
A production-oriented guide for converting InsightFace-style ONNX models into TensorRT engines and OpenVINO IR, with validation, precision choices, benchmarking, and deployment checks.
What you will build
Most InsightFace deployments start with ONNX because it is portable across research, Python services, and edge packaging. Production systems usually need an optimized runtime: TensorRT for NVIDIA GPUs and OpenVINO for Intel CPU, iGPU, and edge accelerators.
This guide covers a safe conversion workflow: validate the ONNX graph, simplify static input shapes, build a TensorRT engine, export OpenVINO IR, benchmark both runtimes, and compare embeddings or detection outputs against the original ONNX model before release.
Before you start
- A tested ONNX model with known input name, input layout, and normalization rules.
- A Linux x86_64 server, preferably Ubuntu 22.04 LTS or 24.04 LTS, with sudo access.
- An NVIDIA driver and CUDA toolkit that match the TensorRT package when targeting NVIDIA GPUs.
- A small validation set that represents production image sizes, demographics, lighting, and camera quality.
1. Install conversion tools on a Linux server
Start from a clean Linux x86_64 server. Ubuntu 22.04 LTS or 24.04 LTS is a practical default for model conversion because Python wheels, OpenVINO packages, NVIDIA drivers, CUDA, and TensorRT tooling are well supported.
Install ONNX utilities in a virtual environment, install OpenVINO from PyPI for CPU and Intel-device conversion, and install TensorRT from the NVIDIA package that matches your server driver and CUDA version. Keep these versions with the converted artifacts so production engines are reproducible.
- Use the same CUDA, TensorRT, and driver family as production when building TensorRT engines.
- Keep Python tools isolated in a virtual environment so conversion dependencies do not affect the serving stack.
- Run the verification commands before converting customer or production models.
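One lightweight way to keep those versions with the converted artifacts, once the tools below are installed, is a small metadata file written at conversion time; the file name and fields here are only an illustration:
import json
import onnx
import onnxruntime
import openvino

# Record converter versions next to the converted artifacts so an engine or IR
# can be rebuilt later. For GPU builds, also capture the driver, CUDA, and
# TensorRT versions (trtexec --version prints the TensorRT build).
versions = {
    "onnx": onnx.__version__,
    "onnxruntime": onnxruntime.__version__,
    "openvino": openvino.__version__,
}
with open("conversion_versions.json", "w") as f:
    json.dump(versions, f, indent=2)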
sudo apt-get update
sudo apt-get install -y python3 python3-venv python3-pip build-essential
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install onnx onnxsim onnxruntime numpy
source .venv/bin/activate
python -m pip install openvino
ovc --help
benchmark_app --help
# Install an NVIDIA driver and CUDA version supported by your TensorRT package first.
# Then download the matching TensorRT tar package from NVIDIA Developer.
tar -xzf TensorRT-10.*.Linux.x86_64-gnu.cuda-12.*.tar.gz
# Shell globs are not expanded inside variable assignments, so resolve the extracted directory explicitly.
export TENSORRT_HOME="$(echo "$PWD"/TensorRT-10.*)"
export PATH=$TENSORRT_HOME/bin:$PATH
export LD_LIBRARY_PATH=$TENSORRT_HOME/lib:$LD_LIBRARY_PATH
trtexec --version
source .venv/bin/activate
python - <<'PY'
import onnx
import onnxruntime
import openvino
print("onnx", onnx.__version__)
print("onnxruntime", onnxruntime.__version__)
print("openvino", openvino.__version__)
PY
trtexec --version
2. Validate the source ONNX graph
Before conversion, inspect the model instead of assuming the input name or shape. Face recognition backbones usually take a 1x3x112x112 input in NCHW layout (batch, channels, height, width); detectors and swappers often use larger or dynamic shapes.
Record preprocessing exactly: RGB/BGR order, mean and standard deviation, channel layout, alignment, crop size, and output post-processing. Runtime conversion does not fix a preprocessing mismatch.
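As an illustration only, a typical recognition preprocessing path looks like the sketch below; the 112x112 size, RGB order, and (x - 127.5) / 127.5 normalization are common InsightFace conventions, but your own model card is authoritative:
import cv2
import numpy as np

def preprocess(path):
    # cv2 loads BGR, HWC, uint8; many recognition models expect RGB.
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    # Face alignment is assumed to have been applied before this resize.
    image = cv2.resize(image, (112, 112))
    image = (image.astype("float32") - 127.5) / 127.5
    # HWC -> NCHW with a batch dimension of 1.
    return np.transpose(image, (2, 0, 1))[None]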
python - <<'PY'
import onnx
from onnx import checker
model_path = "models/insightface.onnx"
model = onnx.load(model_path)
checker.check_model(model)
print("IR version:", model.ir_version)
print("Inputs:")
for item in model.graph.input:
    shape = [dim.dim_value or dim.dim_param for dim in item.type.tensor_type.shape.dim]
    print(" ", item.name, shape)
print("Outputs:")
for item in model.graph.output:
    shape = [dim.dim_value or dim.dim_param for dim in item.type.tensor_type.shape.dim]
    print(" ", item.name, shape)
PY
3. Simplify and freeze shapes when appropriate
TensorRT and OpenVINO both perform better when the graph has clear shapes. For fixed-size face recognition backbones, freeze the input shape. For detectors, keep dynamic dimensions but define realistic optimization profiles later.
Use simplification only after validating that the simplified model produces numerically close outputs. A quick cosine similarity check between original ONNX and simplified ONNX should be part of CI for commercially shipped models.
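A minimal version of that check with onnxruntime might look like this, run after the onnxsim command below has produced the simplified model; the paths and the random batch are placeholders:
import numpy as np
import onnxruntime as ort

batch = np.random.rand(1, 3, 112, 112).astype("float32")  # real aligned faces are better in CI

def embed(path):
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return session.run(None, {session.get_inputs()[0].name: batch})[0]

a = embed("models/insightface.onnx")
b = embed("models/insightface.simplified.onnx")
a = a / np.linalg.norm(a, axis=1, keepdims=True)
b = b / np.linalg.norm(b, axis=1, keepdims=True)
print("min cosine similarity:", float(np.sum(a * b, axis=1).min()))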
python -m onnxsim models/insightface.onnx models/insightface.simplified.onnx --overwrite-input-shape input:1,3,112,112
4. Build a TensorRT engine
TensorRT engines are hardware- and TensorRT-version-specific. Build them on the same GPU class and driver stack used in production, and store the source ONNX, TensorRT version, CUDA version, precision, and optimization profile with the artifact.
FP16 is the default production choice for most face embedding and face swapping workloads on modern NVIDIA GPUs. INT8 can be useful for high-throughput detection, but it requires calibration data and stricter accuracy gates.
- Use min/opt/max shapes that match real batch sizes rather than extreme theoretical limits.
- Keep one engine per major input family when detector or swapper shapes vary widely.
- Reject any engine whose embedding cosine similarity or detection metrics drift beyond your release threshold.
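Once trtexec (below) has written the .plan file, a quick sanity check before wiring it into a service is to deserialize the engine and list its I/O tensors; this sketch assumes the tensorrt Python package (8.5 or newer for the tensor-name API) is installed on the build machine:
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("engines/insightface_fp16.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print each input/output tensor with its mode, shape, and dtype; dynamic
# dimensions show up as -1 and must be covered by the optimization profile.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name), engine.get_tensor_dtype(name))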
trtexec --onnx=models/insightface.simplified.onnx --saveEngine=engines/insightface_fp16.plan --minShapes=input:1x3x112x112 --optShapes=input:16x3x112x112 --maxShapes=input:32x3x112x112 --fp16 --memPoolSize=workspace:4096 --verbose
trtexec --onnx=models/insightface.simplified.onnx --saveEngine=engines/insightface_int8.plan --minShapes=input:1x3x112x112 --optShapes=input:16x3x112x112 --maxShapes=input:32x3x112x112 --int8 --calib=calibration.cache
5. Convert ONNX to OpenVINO IR
OpenVINO IR consists of XML and BIN files that can be deployed through the OpenVINO runtime. It is a strong fit for CPU-heavy services, Intel GPU inference, and edge boxes where operational simplicity matters more than CUDA throughput.
For recognition models, FP16 compression usually reduces memory and improves throughput with minimal embedding drift. For compliance-sensitive verification systems, keep FP32 artifacts for audit comparison even if FP16 is used in production.
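If you prefer to stay in Python, openvino.convert_model plus save_model produces the same IR as the ovc command below; the frozen shape and FP16 compression in this sketch mirror the CLI flags:
import openvino as ov

# Convert the simplified ONNX graph and freeze the single input to 1x3x112x112.
model = ov.convert_model("models/insightface.simplified.onnx", input=[1, 3, 112, 112])
# compress_to_fp16 writes FP16 weights into the .bin file, like --compress_to_fp16=True.
ov.save_model(model, "openvino/insightface.xml", compress_to_fp16=True)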
ovc models/insightface.simplified.onnx --output_model openvino/insightface.xml --input "input[1,3,112,112]" --compress_to_fp16=True
benchmark_app -m openvino/insightface.xml -d CPU -shape "input[16,3,112,112]" -hint throughput
6. Run inference with OpenVINO
Load the compiled OpenVINO model once at service startup and reuse it across requests. Select CPU, GPU, AUTO, or HETERO devices according to your production hardware policy.
For face recognition, normalize output embeddings exactly as in the ONNX baseline before computing cosine similarity or matching thresholds.
from openvino import Core
import numpy as np
core = Core()
compiled = core.compile_model("openvino/insightface.xml", "CPU")
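# Other device policies can be selected here instead of "CPU", for example:
#   compiled = core.compile_model("openvino/insightface.xml", "AUTO",
#                                 {"PERFORMANCE_HINT": "THROUGHPUT"})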
input_layer = compiled.input(0)
output_layer = compiled.output(0)
batch = np.random.rand(16, 3, 112, 112).astype("float32")
embeddings = compiled([batch])[output_layer]
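# If your ONNX baseline L2-normalizes embeddings before matching (common for
# ArcFace-style models), do the same here so cosine thresholds carry over.
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)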
print(embeddings.shape)
7. Validate accuracy and release safely
Conversion is not finished when the engine runs. Compare original ONNX output with TensorRT and OpenVINO output on a representative validation set. For embeddings, track cosine similarity, norm distribution, and verification threshold impact. For detectors, track recall, false positives, landmarks, and NMS behavior.
Promote the converted artifact only when latency, throughput, memory, and accuracy all meet the production target. Keep artifact metadata and rollback instructions with the release package.
- Use fixed random seeds and stable preprocessing code for reproducible comparisons.
- Benchmark cold start separately from steady-state latency.
- Monitor production drift because camera quality and face pose can differ from the validation set.
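A small, runtime-agnostic helper keeps these embedding comparisons uniform across TensorRT and OpenVINO outputs; the function name and the 0.999 floor below are illustrative placeholders, not an InsightFace requirement:
import numpy as np

def embedding_drift_report(reference, candidate, cosine_floor=0.999):
    # reference: embeddings from the original ONNX model, shape (N, D)
    # candidate: embeddings from the converted runtime on the same inputs
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    cand = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    cosine = np.sum(ref * cand, axis=1)
    return {
        "cosine_min": float(cosine.min()),
        "cosine_mean": float(cosine.mean()),
        "reference_norm_mean": float(np.linalg.norm(reference, axis=1).mean()),
        "candidate_norm_mean": float(np.linalg.norm(candidate, axis=1).mean()),
        "pass": bool(cosine.min() >= cosine_floor),
    }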
Need help with production deployment?
Contact InsightFace for model licensing, runtime optimization, and deployment support for your target hardware.