<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Onnx on Mohit Dulani</title>
    <link>https://complete-dope.github.io/codex/tags/onnx/</link>
    <description>Recent content in Onnx on Mohit Dulani</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en</language>
    <lastBuildDate>Sat, 04 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://complete-dope.github.io/codex/tags/onnx/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Inference-time Optimizations that we can do in a DL model</title>
      <link>https://complete-dope.github.io/codex/posts/ml-inference/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://complete-dope.github.io/codex/posts/ml-inference/</guid>
      <description>&lt;h2 id=&#34;torch-compile-&#34;&gt;Torch Compile :&lt;/h2&gt;
&lt;p&gt;A dev-time optimization: it captures the computation graph, runs a sample input at runtime to see how the mathematical operations compose, and then generates optimized kernels, so subsequent requests run on those optimized kernels instead of the eager path. It supports only a limited set of quantization methods.&lt;br&gt;
&lt;code&gt;torch.compile(model, backend=&#39;...&#39;)&lt;/code&gt; lets you choose the compiler backend.&lt;br&gt;
There are many possible backends:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;torch-tensorrt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inductor&lt;/code&gt; (the default)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;tensor-rt-&#34;&gt;Tensor RT :&lt;/h2&gt;
&lt;p&gt;TensorRT works at a lower level: it inspects the underlying GPU architecture (e.g. Ampere or Hopper) and optimizes the CUDA kernels for that architecture, running multiple passes and picking the best one. This differs from torch.compile because it accounts for the SMs, HBM, and other hardware details when optimizing for that specific architecture.
It is one of the fastest ways to speed up inference.
Note that each op needs a TRT lowering implementation; otherwise it falls back and runs the same as the eager implementation we had before.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
