Inference-time optimizations that we can do in a DL model

Torch Compile : A dev-time optimization. torch.compile takes the computation graph, runs a sample input (at runtime) to see how the mathematical operations work together, and then generates an optimized kernel, so subsequent requests run on those optimized kernels. It supports only a limited set of quantization methods. torch.compile(backend='') lets you pick from many possible backends, one of which is torch-tensorrt.

TensorRT : This works at a lower level: it checks the underlying GPU architecture, like whether it is Ampere or Hopper, and then optimizes the CUDA kernels for that architecture (basically it runs multiple passes and finds the best one). This is different from plain torch.compile because it accounts for the SMs, HBM, and other architecture-specific details. It is one of the fastest ways to speed up the code. Note that we also need a TRT-lowering implementation of the kernel; otherwise it falls back to the same eager implementation we had before ...

April 4, 2026 · 4 min · Mohit Dulani

Adding JSON mode to any model, without relying on prompts

Learning how to add JSON mode to any model without relying solely on prompts

October 10, 2025 · 5 min · Mohit Dulani