Learning and Making CUDA kernels as a future skill

THIS IS A SKILL OF THE FUTURE

CUDA DOCS

"How do GPUs work?" by geohot, starts at 20:00

Terminology :

  • Warp : a group of 32 threads scheduled together on the same streaming multiprocessor (see the sketch after this list)
  • SIMT : single instruction, multiple threads
  • Lockstep : all threads in a warp execute the same instruction at the same time
  • SM : Streaming Multiprocessor. The CPU is a single general-purpose processor useful for all tasks, whereas the GPU only runs the compute-intensive work handed over to it by the CPU; an SM is one of the GPU's many processors.
  • Stream : like an event loop on the CPU; used for CUDA async operations
  • PTX : Parallel Thread eXecution, a separate ISA that CUDA device code is compiled to before the driver turns it into the GPU's native binary
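
A minimal sketch (my own example, not from the video or the docs) of how threads map onto warps inside a block; warpSize is a built-in constant, equal to 32 on current GPUs, and printf can be called from device code:

// warp_id.cu : which warp does each thread belong to?
#include <cstdio>

__global__ void whoami(){
  int lane = threadIdx.x % warpSize;   // position inside the warp (0..31)
  int warp = threadIdx.x / warpSize;   // warp index inside the block
  if (lane == 0)                       // one thread per warp reports
    printf("block %d, warp %d starts at thread %d\n",
           blockIdx.x, warp, threadIdx.x);
}

int main(){
  whoami<<<2, 128>>>();        // 2 blocks of 128 threads = 4 warps per block
  cudaDeviceSynchronize();     // wait for the kernel (and its printf) to finish
  return 0;
}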

Structure of a GPU :

Memory in a GPU :

  • Shared Memory : This is used by all threads in a block and is accessible using the __shared__ keyword
  • Global Memory : This is the default memory, accessible by all kernels; allocated and copied with cudaMalloc / cudaMemcpy (slow)
  • L1 cache : This is hardware-managed memory, like a CDN; the most frequently used data is kept close to the SM (can't be programmed) (fast)
  • L2 cache : This is hardware-managed memory and, like L1, can't be programmed (medium)
  • Registers : 32-bit storage locations private to each thread; the only memory the GPU cores operate on directly (fastest)

Yes, shared memory is different from L1 and L2 cache, and you can explicitly use shared memory in your CUDA programs, but L1 and L2 caches are hardware-managed and cannot be directly controlled.
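
A small sketch of how these memory tiers show up in code (file and variable names are mine, not from the notes): global memory is allocated with cudaMalloc and filled with cudaMemcpy, while __shared__ gives each block its own fast scratch space:

// shared_reverse.cu : reverse a block's chunk of data using shared memory
#include <cstdio>

#define N 256   // one block of 256 threads, to keep the sketch small

__global__ void reverse(int *d_out, const int *d_in){
  __shared__ int tile[N];            // shared memory: visible to all threads in the block
  int idx = threadIdx.x;
  tile[idx] = d_in[idx];             // read from slow global memory once
  __syncthreads();                   // wait until the whole tile is loaded
  d_out[idx] = tile[N - 1 - idx];    // write back in reverse order
}

int main(){
  int h_in[N], h_out[N];
  for (int i = 0; i < N; i++) h_in[i] = i;

  int *d_in, *d_out;
  cudaMalloc((void **)&d_in, N * sizeof(int));    // global memory allocations
  cudaMalloc((void **)&d_out, N * sizeof(int));
  cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

  reverse<<<1, N>>>(d_out, d_in);
  cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

  printf("h_out[0] = %d (expected %d)\n", h_out[0], N - 1);
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}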

Serial Code : synchronous code that runs sequentially on the CPU
Parallel Code : code that runs in parallel across many GPU threads

SIMT Model

This model assumes that all threads execute the same instruction while working on different data, and that thread roles do not diverge. That is what lets them execute in parallel at the same time: a warp scheduler schedules the warp and its threads run in lockstep.

Warp divergence : if a branch depends on threadIdx, threads in the same warp can take different paths. The warp then has to execute each path one after the other (masking off the threads on the other path), so we effectively have multiple instruction streams and parallelism is reduced …

// warp_divergent.cu
// Threads in the same warp take different branches, so the even and odd
// paths have to be executed one after the other (warp divergence).
__global__ void divergent(int *s){
  int idx = threadIdx.x;
  if (idx % 2 == 0){
    s[idx] = 2;
  }
  else{
    s[idx] = 1;
  }
}


// warp_scheduled.cu
// The branch only selects a value, so the compiler can turn it into a
// predicated select, and the store to s[idx] is identical for every thread.
__global__ void scheduled(int *s){
  int idx = threadIdx.x;
  int data;
  if (idx % 2 == 0){
    data = 2;
  }
  else{
    data = 1;
  }
  s[idx] = data;
}

Async programming

Streams are like event loops: they schedule async operations (kernel launches, memory copies) and give you control to start, stop and synchronize them !! We can have multiple streams, and since the GPU is a multiprocessor with many cores, independent streams can overlap.
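
A hedged sketch of two streams (kernel and buffer names are my own): each stream runs its copy -> kernel -> copy sequence in order, but the two streams are independent of each other and can overlap:

// streams.cu : two independent streams, each doing copy -> kernel -> copy
#include <cstdio>

__global__ void scale(float *x, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;
}

int main(){
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  float *h[2], *d[2];
  cudaStream_t s[2];
  for (int k = 0; k < 2; k++){
    cudaMallocHost((void **)&h[k], bytes);   // pinned host memory so the copies can be async
    cudaMalloc((void **)&d[k], bytes);
    cudaStreamCreate(&s[k]);
    for (int i = 0; i < n; i++) h[k][i] = 1.0f;
  }

  for (int k = 0; k < 2; k++){
    // work queued on s[k] runs in order, but the two streams are independent
    cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice, s[k]);
    scale<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
    cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, s[k]);
  }

  cudaDeviceSynchronize();                   // wait for both streams to finish
  printf("h[0][0] = %f, h[1][0] = %f\n", h[0][0], h[1][0]);

  for (int k = 0; k < 2; k++){
    cudaStreamDestroy(s[k]);
    cudaFree(d[k]);
    cudaFreeHost(h[k]);
  }
  return 0;
}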

Compilation and low-level stuff in the API

CPU COMPILATION HAPPENS AS :

Source code > compiler > asm file
asm file > assembler > object file
object file > linker > exe binary 

What currently happens in software: the steps above are followed for languages like C / C++. Higher-level languages like Python, Java and JavaScript are first converted to an intermediate format like a .pyc or .class file, and the compilation down to machine code happens at runtime via an interpreter or JIT … so they never go through the asm or object-file stage.

nvcc compiler : has an LLVM-based backend; it separates device and host code, compiles the device code to PTX (assembly-like code) and compiles the host code to an object file

JIT compilation : the GPU driver takes the PTX generated by nvcc in the compilation step and converts it to the device's binary code. This is done by the GPU driver, which is closed-source proprietary software (nouveau is open source and does a decent job).

For different GPU devices and architecture versions we can compile by passing -code (usually together with -arch, or combined as -gencode) to nvcc, so the binary targets specific SM versions.

CUDA RUNTIME

Allocating 2D and 3D memory

cudaMallocPitch() and cudaMalloc3D()
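
A sketch of cudaMallocPitch (sizes and names are mine): each row gets padded to a pitch the runtime picks for aligned access, so the kernel indexes rows by byte offset; cudaMalloc3D works the same way for 3D extents:

// pitch2d.cu : 2D allocation with cudaMallocPitch
#include <cstdio>
#include <cstdlib>

__global__ void fill(float *base, size_t pitch, int width, int height){
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < height){
    // each row starts pitch bytes after the previous one (pitch >= width * sizeof(float))
    float *row = (float *)((char *)base + y * pitch);
    row[x] = (float)(y * width + x);
  }
}

int main(){
  int width = 100, height = 64;
  float *d_arr;
  size_t pitch;   // bytes per padded row, chosen by the runtime
  cudaMallocPitch((void **)&d_arr, &pitch, width * sizeof(float), height);

  dim3 block(16, 16);
  dim3 grid((width + 15) / 16, (height + 15) / 16);
  fill<<<grid, block>>>(d_arr, pitch, width, height);

  // copy the padded 2D region back with cudaMemcpy2D
  float *h_arr = (float *)malloc(width * height * sizeof(float));
  cudaMemcpy2D(h_arr, width * sizeof(float), d_arr, pitch,
               width * sizeof(float), height, cudaMemcpyDeviceToHost);

  printf("pitch = %zu bytes, last element = %f\n", pitch, h_arr[width * height - 1]);
  cudaFree(d_arr);
  free(h_arr);
  return 0;
}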

Playing with the linker : TODO

A lot of the abstraction over the driver is handled by the runtime library: when we write things like thread allocation and context destruction, much of the under-the-hood work is done by the driver code (which is not open source).

Persistence : frequently accessed memory is kept (persisted) in the L2 cache
Streaming : memory that is only accessed once is not kept in the cache
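
These hints can be given from code on newer GPUs; the sketch below uses the CUDA 11+ L2 access-policy API as I recall it from the programming guide (field and enum names are worth double-checking), marking one buffer as persisting on a stream:

// l2_persist.cu : hint that a buffer should persist in L2 (CUDA 11+, Ampere or newer)
__global__ void touch(float *x, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}

int main(){
  const int n = 1 << 20;
  float *d_x;
  cudaMalloc((void **)&d_x, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Describe an access-policy window over d_x: hits inside the window are treated
  // as persisting (kept in the set-aside L2 region), misses as streaming. For this
  // to have an effect, some L2 must also be reserved with
  // cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, ...).
  cudaStreamAttrValue attr = {};
  attr.accessPolicyWindow.base_ptr  = d_x;
  attr.accessPolicyWindow.num_bytes = n * sizeof(float);
  attr.accessPolicyWindow.hitRatio  = 0.6f;  // fraction of the window to treat as persisting
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

  touch<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(d_x);
  return 0;
}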

Written on March 30, 2025