Activation / Non-linear function

A non-linear function is one that cannot be expressed in the form y = mx + c

Let's take a function y = x * W1 * W2 * ... * Wl

Linearity with respect to x: treat all the weights as constants and let only x change; then y is just a constant times x, so the function is linear in x.
Linearity with respect to Wi: now x is held constant and the weights change; since y is a product of several weights, it is non-linear in the weights taken jointly.
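
A tiny numpy sketch of this; the scalar weight values are made up purely for illustration:

```python
import numpy as np

# y = x * W1 * W2 * ... * Wl with toy scalar weights (values are made up)
W = np.array([0.5, 2.0, 1.5, 0.8])
y = lambda x, W: x * np.prod(W)

x = 3.0
print(np.isclose(y(2 * x, W), 2 * y(x, W)))   # True: doubling x doubles y -> linear in x
print(y(x, 2 * W), 2 * y(x, W))               # doubling all weights scales y by 2^l -> non-linear in W
```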

Tanh

The formula is: `tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))` and the derivative is: `1 - tanh^2(x)`

So activations are just squashing functions: whatever input you pass in, tanh maps it to a value between -1 and 1

This same squashing is also what brings in non-linearity: rather than the model being a single hyperplane in a high-dimensional space, we want it to be able to learn a complex, input-dependent mapping
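
A minimal numpy sketch of tanh and its derivative (the input range is arbitrary):

```python
import numpy as np

x = np.linspace(-5, 5, 11)
t = np.tanh(x)            # (e^x - e^(-x)) / (e^x + e^(-x)), squashed into (-1, 1)
dt = 1 - t ** 2           # derivative 1 - tanh^2(x): close to 1 near 0, close to 0 when saturated

print(t.min(), t.max())   # never leaves (-1, 1)
print(dt[0], dt[5])       # ~0.0002 at x = -5 (saturated), 1.0 at x = 0
```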

Sigmoid

This one squashes the input to between 0 and 1, so no negative values; it gives a probability-like output: `1 / (1 + e^(-x))`
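
A minimal sigmoid sketch (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)): maps any real input into (0, 1); derivative is sigmoid(x) * (1 - sigmoid(x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))         # ~[0.00005, 0.269, 0.5, 0.731, 0.99995]
```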

Softmax

This maps a vector of scores to a probability distribution between 0 and 1, and is simply written as `e^(x_i) / SUM_j e^(x_j)`
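A minimal softmax sketch; subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    # e^(x_i) / SUM_j e^(x_j); subtracting the max first avoids overflow and changes nothing
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())         # ~[0.659 0.242 0.099], sums to 1.0
```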

ReLU

ReLU (rectified linear unit): the name itself says it is a linear unit, but the non-linearity comes from the kink at x = 0

It is like linear LEGO blocks being added together to produce non-linearity (e.g. for approximating an x^2-shaped curve)

But if you stack many small, straight LEGO blocks at slight angles, you can build a perfect-looking circle or curve.
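
A small numpy sketch of the LEGO idea: one straight segment plus a handful of shifted ReLUs approximates x^2 on [-1, 1]. The knot points and coefficients here are hand-picked purely for illustration:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def approx_square(x):
    # start from the straight segment through (-1, 1) with slope -1.5,
    # then each ReLU "block" bumps the slope up by 1.0 at its own kink
    return (1.0
            - 1.5 * (x + 1.0)
            + 1.0 * relu(x + 0.5)
            + 1.0 * relu(x)
            + 1.0 * relu(x - 0.5))

xs = np.linspace(-1, 1, 9)
print(np.round(approx_square(xs), 3))   # piecewise-linear, tracks xs**2
print(np.round(xs ** 2, 3))
```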

Problem with existing activation functions and how ReLU solves it

So the biggest problem is vanishing gradients: these activation functions have derivatives smaller than 1, so when many of them get multiplied together during backprop the product quickly shrinks,
0.8 * 0.8 * 0.8 * 0.8 * 0.8 = 0.32768 (multiplying many such factors leads to vanishing gradients)

So even scaling won't help, because the problem sits in the gradient calculation itself (the activation's derivative is always below 1); ReLU, whose derivative is exactly 1 for positive inputs, looks like a promising fix
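
A rough PyTorch sketch of the difference; the depth and the fixed 1.1 "weight" per layer are arbitrary choices, just to show the effect:

```python
import torch

def end_to_end_grad(act, depth=20):
    x = torch.tensor(0.5, requires_grad=True)
    h = x
    for _ in range(depth):
        h = act(h * 1.1)      # 1.1 is an arbitrary fixed "weight" per layer
    h.backward()
    return x.grad.item()

print("tanh:", end_to_end_grad(torch.tanh))   # product of many tanh' < 1 factors -> ~0.02
print("relu:", end_to_end_grad(torch.relu))   # relu' = 1 on the positive side -> ~1.1^20
```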

Computation Graph

Loss function and calculation

In the backward pass:

Derivative: quantifies how sensitive a function is to a change in its input

loss.backward() -> computes dloss/dx for every parameter x in the model that has requires_grad=True

dL/dW -> dL/d(wx+b) * d(wx+b)/dW
(W here stands for any of the model's parameters) -> and this is done using the chain rule
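
A minimal PyTorch sketch of this; the tiny w*x + b model and squared-error loss are made up just to show what loss.backward() fills in:

```python
import torch

# tiny model: y_hat = w * x + b with a squared-error loss
x, y = torch.tensor(2.0), torch.tensor(3.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = (w * x + b - y) ** 2
loss.backward()                 # fills .grad for every tensor with requires_grad=True

# chain rule by hand: dL/dw = dL/d(wx+b) * d(wx+b)/dw = 2*(wx+b - y) * x
print(w.grad)                   # tensor(-4.) = 2 * (2 - 3) * 2
print(b.grad)                   # tensor(-2.) = 2 * (2 - 3) * 1
```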

Revision / paper reading

Paper: Chameleon by Meta ( https://arxiv.org/pdf/2405.09818 )

Forward pass Gradient explosion / vanishing

If the weights grow during the backward-pass update at step t-1, then the activations in the forward pass at step t grow as well

Backward pass gradient explosion / vanishing

Weight updates happen using the gradients computed in the backward pass

Gradient free meta learning

Never computes or propagates gradients. Treats the entire weight set as a black‑box parameter vector and optimizes it with non‑gradient methods (e.g. evolutionary strategies, population‑based search).

Think of your entire set of 4‑bit quantized weights 𝑊 as a single “individual” in a population. Each individual is just one candidate solution—a full weight assignment for the network.

You start with N random individuals (e.g. N = 100), any of which could turn out to be the solution; you initialize them randomly and do just a forward pass per individual to compute its loss value

Rank them from lowest loss to highest and keep the top 2 candidates (elitism),
then in the next generation produce 98 more children from them, so you are back to 2 + 98 = 100 individuals,

and to produce those children we have these operations:

Mutation :

Introduce small random changes: for each new child, randomly pick a small percentage of weight bits (e.g. 1–2 %) and flip them. This injects fresh diversity and helps explore new regions of weight-space.
Ensure quantization consistency: since weights are 4-bit log values, any mutation simply toggles one of those bits, so no out-of-range values appear.

Quantization :

Nice post on quantization : https://www.maartengrootendorst.com/blog/quantization/

Crossover :

Pair up parents: randomly choose two parents (with or without weighting by fitness) for each crossover event.
Mix their "genomes": for each weight position (4-bit log value), flip a (virtual) coin and take the value from parent A or parent B.
Produce one or two children from each pair.

Repeating this loop reduces the loss generation after generation, with no backward pass at all; a toy sketch is given below.
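
A toy sketch of the loop, using real-valued weights and a made-up squared-error "loss" instead of 4-bit codes and a network forward pass, just to show the mechanics of elitism, crossover and mutation:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=16)                  # toy "ideal weights" the search should recover

def loss(w):                                  # forward-pass-style evaluation only, no gradients
    return float(np.mean((w - target) ** 2))

def crossover(a, b):
    coin = rng.random(a.shape) < 0.5          # per-position coin flip between the two parents
    return np.where(coin, a, b)

def mutate(w, rate=0.1, scale=0.2):
    mask = rng.random(w.shape) < rate         # perturb a small fraction of positions
    return w + mask * rng.normal(scale=scale, size=w.shape)

pop = [rng.normal(size=16) for _ in range(100)]            # N = 100 random individuals
for gen in range(201):
    pop.sort(key=loss)
    elite = pop[:2]                                        # elitism: keep the 2 best
    children = [mutate(crossover(elite[0], elite[1])) for _ in range(98)]
    pop = elite + children                                 # next generation: 2 + 98 = 100
    if gen % 50 == 0:
        print(gen, round(loss(pop[0]), 4))
```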

Int vs Float

So this is something interesting: integers are stored in memory as their exact value, they are cheap to compute with, and they have a fixed / exact value.

An int32 can hold at most 2^31 - 1. Because integers are exact, they are used where accuracy matters, e.g. for list indexing.

A float, on the other hand, is just a representation of a number. A float32 can hold values up to just under 2^128 (about 3.4 × 10^38), and this is how it does it: an integer within the mantissa's limit (up to 2^24) is stored exactly; anything larger, or fractional, gets encoded approximately into three fields:

sign | exponent | mantissa

Think of it like a hash function: the computer stores the encoded (sign / exponent / mantissa) value in float32. Before any operation on float values, they are decoded back to a working representation (which can temporarily take more space), the calculation is done, and the result is encoded back into a float. It is in these back-and-forth conversions and roundings that accuracy errors creep in.
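
A small Python sketch of these points (the -6.25 example value is arbitrary):

```python
import struct
import numpy as np

print(np.int32(2**31 - 1))                    # 2147483647: int32 stores this exactly

# float32 layout: 1 sign bit | 8 exponent bits | 23 mantissa bits
bits = struct.unpack(">I", struct.pack(">f", -6.25))[0]
print(f"{bits:032b}")                         # 1 10000001 10010000000000000000000

# integers are only exact in float32 up to 2^24; beyond that they get rounded
print(np.float32(2**24), np.float32(2**24 + 1))   # both come out as 16777216.0

print(0.1 + 0.2 == 0.3)                       # False: representation error in plain decimals
```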

Normalization layers (BatchNorm, LayerNorm, RMSNorm, L1 norm, L2 norm)

This is applied before passing to the activation, so that the inputs to the activation are all centered around zero and we get the most benefit from it

Deep Learning book : https://aikosh.indiaai.gov.in/static/Deep+Learning+Ian+Goodfellow.pdf read section : 8.7.1

  • Normalization layer :

These techniques are used to speed up / stabilize model training.
The reason to bring them in is that every time layer (l-1) changes its outputs, layer l also has to re-adapt its weights before the whole model can converge; this delay is the internal covariate shift problem. To avoid it we normalize the activations:

  • BatchNorm where batches make sense (e.g. image models), normalizing across the batch dimension.
  • LayerNorm for language inputs, where batch statistics make less sense; it normalizes across each token's features and needs both the mean and the variance (effectively two passes over the features).
  • RMSNorm to cut that overhead: it only rescales by the root-mean-square, so it can be done in a single pass with no mean subtraction.

A small sketch of the LayerNorm / RMSNorm difference follows.
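
A minimal numpy sketch of the LayerNorm vs RMSNorm difference; the learnable gain/bias parameters both layers normally carry are left out for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # needs both the mean and the variance of the features
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # only needs the root-mean-square: no mean subtraction, less work per token
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(2, 8))
print(layer_norm(x).mean(axis=-1))   # ~0: re-centered and re-scaled
print(rms_norm(x).mean(axis=-1))     # not ~0: only re-scaled
```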

These normalizations reparameterize the network. The problem is that the weight updates of one layer depend so strongly on all the other layers that we simply can't change one without thinking about the others, and that makes choosing a different, adaptive learning rate for each layer nearly impossible to get right.

So for this we use normalization.