Convert ONNX Face Models to TensorRT and OpenVINO
A production-oriented guide for converting InsightFace-style ONNX models into TensorRT engines and OpenVINO IR, with validation, precision choices, benchmarking, and deployment checks.
What you will build
Most InsightFace deployments start with ONNX because it is portable across research, Python services, and edge packaging. Production systems usually need an optimized runtime: TensorRT for NVIDIA GPUs and OpenVINO for Intel CPU, iGPU, and edge accelerators.
This guide covers a safe conversion workflow: validate the ONNX graph, simplify static input shapes, build a TensorRT engine, export OpenVINO IR, benchmark both runtimes, and compare embeddings or detection outputs against the original ONNX model before release.
Before you start
- A tested ONNX model with known input name, input layout, and normalization rules.
- A Linux x86_64 server, preferably Ubuntu 22.04 LTS or 24.04 LTS, with sudo access.
- An NVIDIA driver and CUDA toolkit that match the TensorRT package when targeting NVIDIA GPUs.
- A small validation set that represents production image sizes, demographics, lighting, and camera quality.
1. Install conversion tools on a Linux server
Start from a clean Linux x86_64 server. Ubuntu 22.04 LTS or 24.04 LTS is a practical default for model conversion because Python wheels, OpenVINO packages, NVIDIA drivers, CUDA, and TensorRT tooling are well supported.
Install ONNX utilities in a virtual environment, install OpenVINO from PyPI for CPU and Intel-device conversion, and install TensorRT from the NVIDIA package that matches your server driver and CUDA version. Keep these versions with the converted artifacts so production engines are reproducible.
- Use the same CUDA, TensorRT, and driver family as production when building TensorRT engines.
- Keep Python tools isolated in a virtual environment so conversion dependencies do not affect the serving stack.
- Run the verification commands before converting customer or production models.
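One lightweight way to keep those versions with the converted artifacts, once the tools below are installed, is a small metadata file written at conversion time; the file name and fields here are only an illustration:
import json
import onnx
import onnxruntime
import openvino

# Record converter versions next to the converted artifacts so an engine or IR
# can be rebuilt later. For GPU builds, also capture the driver, CUDA, and
# TensorRT versions (trtexec --version prints the TensorRT build).
versions = {
    "onnx": onnx.__version__,
    "onnxruntime": onnxruntime.__version__,
    "openvino": openvino.__version__,
}
with open("conversion_versions.json", "w") as f:
    json.dump(versions, f, indent=2)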
sudo apt-get update
sudo apt-get install -y python3 python3-venv python3-pip build-essential
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install onnx onnxsim onnxruntime numpy
source .venv/bin/activate
python -m pip install openvino
ovc --help
benchmark_app --help
# Install an NVIDIA driver and CUDA version supported by your TensorRT package first.
# Then download the matching TensorRT tar package from NVIDIA Developer.
tar -xzf TensorRT-10.*.Linux.x86_64-gnu.cuda-12.*.tar.gz
# Shell globs are not expanded inside variable assignments, so resolve the extracted directory explicitly.
export TENSORRT_HOME="$(echo "$PWD"/TensorRT-10.*)"
export PATH=$TENSORRT_HOME/bin:$PATH
export LD_LIBRARY_PATH=$TENSORRT_HOME/lib:$LD_LIBRARY_PATH
trtexec --version
source .venv/bin/activate
python - <<'PY'
import onnx
import onnxruntime
import openvino
print("onnx", onnx.__version__)
print("onnxruntime", onnxruntime.__version__)
print("openvino", openvino.__version__)
PY
trtexec --version
2. Validate the source ONNX graph
Before conversion, inspect the model instead of assuming the input name or shape. Face recognition backbones usually take a 1x3x112x112 input in NCHW layout (batch, channels, height, width); detectors and swappers often use larger or dynamic shapes.
Record preprocessing exactly: RGB/BGR order, mean and standard deviation, channel layout, alignment, crop size, and output post-processing. Runtime conversion does not fix a preprocessing mismatch.
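As an illustration only, a typical recognition preprocessing path looks like the sketch below; the 112x112 size, RGB order, and (x - 127.5) / 127.5 normalization are common InsightFace conventions, but your own model card is authoritative:
import cv2
import numpy as np

def preprocess(path):
    # cv2 loads BGR, HWC, uint8; many recognition models expect RGB.
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    # Face alignment is assumed to have been applied before this resize.
    image = cv2.resize(image, (112, 112))
    image = (image.astype("float32") - 127.5) / 127.5
    # HWC -> NCHW with a batch dimension of 1.
    return np.transpose(image, (2, 0, 1))[None]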
python - <<'PY'
import onnx
from onnx import checker
model_path = "models/insightface.onnx"
model = onnx.load(model_path)
checker.check_model(model)
print("IR version:", model.ir_version)
print("Inputs:")
for item in model.graph.input:
    shape = [dim.dim_value or dim.dim_param for dim in item.type.tensor_type.shape.dim]
    print(" ", item.name, shape)
print("Outputs:")
for item in model.graph.output:
    shape = [dim.dim_value or dim.dim_param for dim in item.type.tensor_type.shape.dim]
    print(" ", item.name, shape)
PY
3. Simplify and freeze shapes when appropriate
TensorRT and OpenVINO both perform better when the graph has clear shapes. For fixed-size face recognition backbones, freeze the input shape. For detectors, keep dynamic dimensions but define realistic optimization profiles later.
Use simplification only after validating that the simplified model produces numerically close outputs. A quick cosine similarity check between original ONNX and simplified ONNX should be part of CI for commercially shipped models.
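A minimal version of that check with onnxruntime might look like this, run after the onnxsim command below has produced the simplified model; the paths and the random batch are placeholders:
import numpy as np
import onnxruntime as ort

batch = np.random.rand(1, 3, 112, 112).astype("float32")  # real aligned faces are better in CI

def embed(path):
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return session.run(None, {session.get_inputs()[0].name: batch})[0]

a = embed("models/insightface.onnx")
b = embed("models/insightface.simplified.onnx")
a = a / np.linalg.norm(a, axis=1, keepdims=True)
b = b / np.linalg.norm(b, axis=1, keepdims=True)
print("min cosine similarity:", float(np.sum(a * b, axis=1).min()))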
python -m onnxsim models/insightface.onnx models/insightface.simplified.onnx --overwrite-input-shape input:1,3,112,112
4. Build a TensorRT engine
TensorRT engines are hardware- and TensorRT-version-specific. Build them on the same GPU class and driver stack used in production, and store the source ONNX, TensorRT version, CUDA version, precision, and optimization profile with the artifact.
FP16 is the default production choice for most face embedding and face swapping workloads on modern NVIDIA GPUs. INT8 can be useful for high-throughput detection, but it requires calibration data and stricter accuracy gates.
- Use min/opt/max shapes that match real batch sizes rather than extreme theoretical limits.
- Keep one engine per major input family when detector or swapper shapes vary widely.
- Reject any engine whose embedding cosine similarity or detection metrics drift beyond your release threshold.
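Once trtexec (below) has written the .plan file, a quick sanity check before wiring it into a service is to deserialize the engine and list its I/O tensors; this sketch assumes the tensorrt Python package (8.5 or newer for the tensor-name API) is installed on the build machine:
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("engines/insightface_fp16.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print each input/output tensor with its mode, shape, and dtype; dynamic
# dimensions show up as -1 and must be covered by the optimization profile.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name), engine.get_tensor_dtype(name))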
trtexec --onnx=models/insightface.simplified.onnx --saveEngine=engines/insightface_fp16.plan --minShapes=input:1x3x112x112 --optShapes=input:16x3x112x112 --maxShapes=input:32x3x112x112 --fp16 --memPoolSize=workspace:4096 --verbose
trtexec --onnx=models/insightface.simplified.onnx --saveEngine=engines/insightface_int8.plan --minShapes=input:1x3x112x112 --optShapes=input:16x3x112x112 --maxShapes=input:32x3x112x112 --int8 --calib=calibration.cache
5. Convert ONNX to OpenVINO IR
OpenVINO IR consists of XML and BIN files that can be deployed through the OpenVINO runtime. It is a strong fit for CPU-heavy services, Intel GPU inference, and edge boxes where operational simplicity matters more than CUDA throughput.
For recognition models, FP16 compression usually reduces memory and improves throughput with minimal embedding drift. For compliance-sensitive verification systems, keep FP32 artifacts for audit comparison even if FP16 is used in production.
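If you prefer to stay in Python, openvino.convert_model plus save_model produces the same IR as the ovc command below; the frozen shape and FP16 compression in this sketch mirror the CLI flags:
import openvino as ov

# Convert the simplified ONNX graph and freeze the single input to 1x3x112x112.
model = ov.convert_model("models/insightface.simplified.onnx", input=[1, 3, 112, 112])
# compress_to_fp16 writes FP16 weights into the .bin file, like --compress_to_fp16=True.
ov.save_model(model, "openvino/insightface.xml", compress_to_fp16=True)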
ovc models/insightface.simplified.onnx --output_model openvino/insightface.xml --input "input[1,3,112,112]" --compress_to_fp16=True
benchmark_app -m openvino/insightface.xml -d CPU -shape "input[16,3,112,112]" -hint throughput
6. Run inference with OpenVINO
Load the compiled OpenVINO model once at service startup and reuse it across requests. Select CPU, GPU, AUTO, or HETERO devices according to your production hardware policy.
For face recognition, normalize output embeddings exactly as in the ONNX baseline before computing cosine similarity or matching thresholds.
from openvino import Core
import numpy as np
core = Core()
compiled = core.compile_model("openvino/insightface.xml", "CPU")
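# Other device policies can be selected here instead of "CPU", for example:
#   compiled = core.compile_model("openvino/insightface.xml", "AUTO",
#                                 {"PERFORMANCE_HINT": "THROUGHPUT"})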
input_layer = compiled.input(0)
output_layer = compiled.output(0)
batch = np.random.rand(16, 3, 112, 112).astype("float32")
embeddings = compiled([batch])[output_layer]
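# If your ONNX baseline L2-normalizes embeddings before matching (common for
# ArcFace-style models), do the same here so cosine thresholds carry over.
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)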
print(embeddings.shape)
7. Validate accuracy and release safely
Conversion is not finished when the engine runs. Compare original ONNX output with TensorRT and OpenVINO output on a representative validation set. For embeddings, track cosine similarity, norm distribution, and verification threshold impact. For detectors, track recall, false positives, landmarks, and NMS behavior.
Promote the converted artifact only when latency, throughput, memory, and accuracy all meet the production target. Keep artifact metadata and rollback instructions with the release package.
- Use fixed random seeds and stable preprocessing code for reproducible comparisons.
- Benchmark cold start separately from steady-state latency.
- Monitor production drift because camera quality and face pose can differ from the validation set.
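A small, runtime-agnostic helper keeps these embedding comparisons uniform across TensorRT and OpenVINO outputs; the function name and the 0.999 floor below are illustrative placeholders, not an InsightFace requirement:
import numpy as np

def embedding_drift_report(reference, candidate, cosine_floor=0.999):
    # reference: embeddings from the original ONNX model, shape (N, D)
    # candidate: embeddings from the converted runtime on the same inputs
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    cand = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    cosine = np.sum(ref * cand, axis=1)
    return {
        "cosine_min": float(cosine.min()),
        "cosine_mean": float(cosine.mean()),
        "reference_norm_mean": float(np.linalg.norm(reference, axis=1).mean()),
        "candidate_norm_mean": float(np.linalg.norm(candidate, axis=1).mean()),
        "pass": bool(cosine.min() >= cosine_floor),
    }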
Need help with production deployment?
Contact InsightFace for model licensing, runtime optimization, and deployment support for your target hardware.