<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Onnx on Mohit Dulani</title>
    <link>https://complete-dope.github.io/codex/tags/onnx/</link>
    <description>Recent content in Onnx on Mohit Dulani</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en</language>
    <lastBuildDate>Sat, 04 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://complete-dope.github.io/codex/tags/onnx/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Inference-time Optimizations that we can do in a DL model</title>
      <link>https://complete-dope.github.io/codex/posts/ml-inference/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://complete-dope.github.io/codex/posts/ml-inference/</guid>
      <description>&lt;h2 id=&#34;torch-compile-&#34;&gt;Torch Compile :&lt;/h2&gt;
&lt;p&gt;A dev-time optimization: it captures the computation graph, runs a sample input at runtime to see how the mathematical operations compose, and then generates optimized kernels, so subsequent requests run on those optimized kernels instead of the eager path. It supports only a limited set of quantization methods.&lt;br&gt;
&lt;code&gt;torch.compile(model, backend=&#39;...&#39;)&lt;/code&gt; lets you choose the compiler backend.&lt;br&gt;
There are many possible backends:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;torch-tensorrt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inductor&lt;/code&gt; (the default)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;tensor-rt-&#34;&gt;Tensor RT :&lt;/h2&gt;
&lt;p&gt;TensorRT works at a lower level: it inspects the underlying GPU architecture (e.g. Ampere or Hopper) and optimizes the CUDA kernels for that architecture, running multiple passes and picking the best one. This differs from torch.compile because it accounts for the SMs, HBM, and other hardware details when optimizing for that specific architecture.
It is one of the fastest ways to speed up inference.
Note that each op needs a TRT lowering implementation; otherwise it falls back and runs the same as the eager implementation we had before.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
