Inference-time optimizations that we can do in a DL model

Torch Compile : A dev-time optimization. torch.compile takes the computation graph, runs a sample input (at runtime) to see how the mathematical operations work together, and then generates an optimized kernel, so subsequent requests run on those optimized kernels. It supports only a limited set of quantization methods. torch.compile(backend='') lets you pick from many possible backends, one of which is torch-tensorrt.

TensorRT : This works at a lower level: it checks the underlying GPU architecture, like whether it is Ampere or Hopper, and then optimizes the CUDA kernels for that architecture (basically it runs multiple passes and finds the best one). This is different from plain torch.compile because it accounts for the SMs, HBM, and other architecture-specific details. It is one of the fastest ways to speed up the code. Note that we also need a TRT-lowering implementation of the kernel; otherwise it falls back to the same eager implementation we had before ...

April 4, 2026 · 4 min · Mohit Dulani

Adding JSON mode to any model, without relying on prompts

Learning how to add JSON mode to any model without relying solely on prompts

October 10, 2025 · 5 min · Mohit Dulani