Deep learning Optimizers
Gradients visualised
Second order derivative
Maths behind
Single variable
Things are very simple in the single-variable case.
Function definition : f(x) = x^2
First derivative : f'(x) = 2x
Second derivative : f''(x) = 2
To find a minimum in the single-variable case, two conditions are enough: f'(x) = 0 and f''(x) > 0 (the first derivative vanishes and the curvature is positive).
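As a quick sanity check of these two conditions (a minimal sketch using sympy on the f(x) = x^2 example above):

```python
import sympy as sp

x = sp.symbols("x")
f = x**2

critical = sp.solve(sp.diff(f, x), x)        # f'(x) = 2x = 0  ->  x = 0
for c in critical:
    curvature = sp.diff(f, x, 2).subs(x, c)  # f''(0) = 2 > 0  ->  minimum
    print(c, "is a minimum" if curvature > 0 else "needs more checks")
```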
Multivariable /Multivariate
First Derivative: this becomes a vector (the gradient) and is no longer a single number.
Second Derivative: this becomes a matrix, called the Hessian.
So we can't just take the sign of the second derivative to tell maxima from minima (what would the sign of a matrix even mean? nothing). Instead we check whether the Hessian is positive definite (minimum) or negative definite (maximum).
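For instance, with the quadratic f(x, y) = x^2 + y^2 + x*y (an illustrative example, also used later in these notes), "checking the sign" of the Hessian becomes checking whether all its eigenvalues are positive:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + y^2 + x*y (constant, since f is quadratic)
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals = np.linalg.eigvalsh(H)   # symmetric matrix -> eigvalsh
print(eigvals)                    # [1. 3.]
print("positive definite -> minimum:", np.all(eigvals > 0))
```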
FIRST DERIVATIVE
In higher dimensions, the rate of change depends on the location as well as the direction (which is fairly obvious).
Picture a surface and a point on it: each direction out of that point has a different slope, so we need directional gradients here.
But how to find it ?
For any point on the surface, we can slice it with a plane along the chosen direction, which turns that cross-section into a 1-D curve whose derivative we can take.
But why can't we just use the basic limit definition,

D_v f(x) = lim_{h -> 0} [f(x + h*v) - f(x)] / h,

to find the gradient in a direction v?
We can, and in fact that is exactly what is used!
Using the above formula, take f(x, y) = x^2 + y^2 + x*y and compute its partial derivatives.
Partial derivative in the x-direction: take the slice y = 0 (all other variables held at 0),
f(x, 0) = x^2
d/dx f(x, 0) = 2x
(in general ∂f/∂x = 2x + y; along the slice y = 0 this reduces to 2x)
Similarly, the partial derivative in the y-direction means taking the slice x = 0 (all other variables are 0 there),
f(0, y) = y^2
d/dy f(0, y) = 2y
So the partial derivative in the x-direction is the same as drawing the cross-section (the plane that holds every other variable at zero) and taking the derivative along that curve.
Now suppose we want to approximate the function at a point close to the origin, in the direction of that partial derivative: f(x, 0) is a curve, its derivative gives the tangent line to that curve at the point, and we can easily evaluate nearby points using the equation of a line, y = mx + c.
Now say we want the derivative in some arbitrary direction, not along one of the regular (x, y, z) axes. Then the directional derivative is the gradient dotted with that (unit) direction: D_v f = grad(f) . v.
Here again the derivative of the cross-section curve at a point is a line (the tangent), obtained by taking the first derivative along that direction.
So that is how we get the first derivative in higher dimensions: with the directional-derivative formula above.
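A quick numerical check that the limit definition agrees with grad(f) . v (a sketch, using the f(x, y) = x^2 + y^2 + x*y example from above; the point and direction are arbitrary choices):

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2 + x * y

def grad_f(p):
    x, y = p
    return np.array([2 * x + y, 2 * y + x])

point = np.array([1.0, 2.0])
v = np.array([1.0, 2.0])
v = v / np.linalg.norm(v)                      # unit direction

h = 1e-6
numeric = (f(point + h * v) - f(point)) / h    # limit definition with a small h
analytic = grad_f(point) @ v                   # D_v f = grad(f) . v

print(numeric, analytic)                       # both ~= 14 / sqrt(5) ~= 6.26
```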
SECOND DERIVATIVE
The 1st derivative is already a line; we just need to find the rate of change of its slope to get the 2nd derivative.
Okay, what if we need that in a particular direction? That is also sorted; we already know how to take directional derivatives.
Along the direction (1, 2), the difference quotient is [f(h, 2h) - f(0, 0)] / sqrt(5 * h^2), since the step from (0, 0) to (h, 2h) has length sqrt(h^2 + (2h)^2) = sqrt(5) * h.
The calculation is the same as for the first one; we already know how to take points along the line, so we just apply directional derivatives again for the second derivative, and that is done!
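The same numerical check works for the second directional derivative (a sketch, again with f(x, y) = x^2 + y^2 + x*y and the direction (1, 2)): a centred second difference along the direction should match v^T H v for a unit vector v.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2 + x * y

H = np.array([[2.0, 1.0],       # Hessian of f (constant for a quadratic)
              [1.0, 2.0]])

point = np.array([0.0, 0.0])
v = np.array([1.0, 2.0])
v = v / np.linalg.norm(v)        # unit direction, (1, 2) / sqrt(5)

h = 1e-4
numeric = (f(point + h * v) - 2 * f(point) + f(point - h * v)) / h**2
analytic = v @ H @ v             # second directional derivative

print(numeric, analytic)         # both ~= 14 / 5 = 2.8
```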
Taylor series
It gives a beautiful way of approximating functions: for any complicated function whose expansion we don't know, we can approximate it around some value to help us work with it.
Take cos(theta), for example; its (second-order) Taylor approximation around 0 comes out as 1 - theta^2 / 2.
Take this example e^x :
- assume a polynomial form: c1 + c2*x + c3*x^2 + c4*x^3 + ...
- now pick a point x = a, plug it in, and estimate the coefficients by matching the function and its derivatives there
- at x = 0: f(0) = e^0 = 1, so c1 = 1
- at x = 0: f'(0) = e^0 = 1, and f'(x) = c2 + 2*c3*x + ..., so c2 = 1
- at x = 0: f''(0) = e^0 = 1, and f''(x) = 2*c3 + ..., so c3 = 1/2! (similarly c4 = 1/3!)
- f(x) = e^x = 1 + x + x^2/2! + x^3/3! + ...
And the nice part: just by expanding around x = 0, we can recover this whole series, coefficient by coefficient.
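A quick check of the truncated series against the real exponential (a sketch; the evaluation point x = 1.5 is arbitrary):

```python
import math

def exp_taylor(x, n_terms):
    # Maclaurin series of e^x: sum of x^k / k! for k = 0 .. n_terms-1
    return sum(x**k / math.factorial(k) for k in range(n_terms))

x = 1.5
for n in (2, 4, 8, 12):
    print(n, exp_taylor(x, n), math.exp(x))   # converges to e^1.5 ~= 4.4817
```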
The generalised form of the Taylor series is f(x) = sum_{n>=0} f^(n)(a) * (x - a)^n / n!, and it is completely correct; you can do the proof very simply with only the first 2 terms:
f(x) = c1 + c2*x
at x = a,
f(a) = c1 + c2*a
f'(a) = c2
so f(x) around x = a is f(a) + f'(a)*(x - a), which indeed equals (c1 + c2*a) + c2*(x - a) = c1 + c2*x.
Newton method
Using only the first derivative, we approximate the function near x_k as f'(x_k) ≈ [f(x) - f(x_k)] / (x - x_k), i.e. f(x) ≈ f(x_k) + f'(x_k) * (x - x_k).
To refine our approximations, we use the Taylor series.
We use the Taylor series to approximate the loss function, then minimise that approximation to ultimately reach the minimum.
Think of the original curve as the one we never know in real life; we start by approximating it with simpler functions that are at least accurate around the current value of x, and optimise those instead.
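A minimal 1-D sketch of the idea (the example function is purely illustrative): each Newton step minimises the local quadratic, i.e. second-order Taylor, model, x_{k+1} = x_k - f'(x_k) / f''(x_k).

```python
# Illustrative function: f(x) = x^4 - 3x^2 + x, with a local minimum near x ~= -1.3
f   = lambda x: x**4 - 3 * x**2 + x
df  = lambda x: 4 * x**3 - 6 * x + 1
d2f = lambda x: 12 * x**2 - 6

x = -2.0
for k in range(6):
    x = x - df(x) / d2f(x)      # minimise the local quadratic model at each step
    print(k, x, f(x))           # converges to the local minimum in a few steps
```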
Look at how Newton's method scales with just n = 10000 parameters: the Hessian is an n x n matrix, so it already has 10^8 entries, and inverting it costs on the order of n^3 operations.
Methods like SGD and the other popular optimizers rely on the first-order (line) approximation, since it is easy to compute without worrying about the Hessian. Approximating up to the 2nd derivative obviously leads to faster convergence, but it requires the Hessian, and that cost scales up quickly!
So SGD sees a line approximation, whereas LARS sees the curve, and seeing the curve naturally leads to faster solving, faster finding of the minimum.
Facts about weights in ML
- The weight matrix is low-rank.
- The gradient updates are usually dominated by a single direction: because they all stem from the same loss field, they march in a similar direction (even though each parameter picks its own direction from the gradients of the loss, they all tend to move in unison).
- So the gradient direction is usually governed by one long vector and several short ones, which gives the update an elliptical shape, and that is why we see the zigzag pattern in the gradient update steps.
DL Optimizers
SGD ( Stochastic Gradient Descent )
This is the simplest optimisation algorithm; it takes almost nothing into account and treats every parameter equally:
Theta_{t+1} = Theta_t - learning_rate * dL/dTheta_t
It uses only the first derivative to solve the optimisation problem.
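As plain code the update is just this (a sketch with numpy arrays; `sgd_step` and the toy values are hypothetical):

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    # theta_{t+1} = theta_t - lr * dL/dtheta_t, applied element-wise
    return params - lr * grads

# toy usage: one parameter vector and its gradient
params = np.array([0.5, -1.2, 3.0])
grads  = np.array([0.1, -0.4, 0.9])
params = sgd_step(params, grads)
print(params)
```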
What if dL/dTheta is zero? This could mean the model is at a local maximum or minimum, but in high-dimensional ML problems those are extremely rare; most critical points are saddle points.
Disadvantages :
- All parameters share the same learning rate and get updated with the same step-size rule, regardless of how their gradients behave.
AdaGrad
Adaptive Gradient: it adapts the learning rate for each parameter based on the sum of its squared past gradients. It tends to assign higher learning rates to infrequently updated features, which makes the parameter updates depend less on frequency and more on relevance.
Squaring a number tells us how big it was, irrespective of sign: -10 squared gives 100, which says -10 was large in the first place. Similarly, -2 squared gives 4, which says the number itself was small.
The squared gradient (g_t^2) by itself doesn't mean much; it just tells us the magnitude.
If the accumulated sum of squared gradients, Sum(g_t^2), is high, that parameter has been seeing large gradients, and we should slow it down a bit.
If Sum(g_t^2) is low, that parameter has been seeing small gradients, and we should increase its learning rate a bit.
So it is an inverse relation:
learning_rate_new = learning_rate_old / sqrt(G_t), where G_t is the sum of squared gradients seen so far (plus a small epsilon in practice).
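In code, the update looks roughly like this (a sketch; the small epsilon is added for numerical stability, which the formula above leaves out):

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.01, eps=1e-8):
    # accumulate the element-wise sum of squared gradients
    accum += grads**2
    # per-parameter learning rate shrinks as the accumulated sum grows
    params -= lr / (np.sqrt(accum) + eps) * grads
    return params, accum

params = np.array([0.5, -1.2, 3.0])
accum  = np.zeros_like(params)
grads  = np.array([0.1, -0.4, 0.9])
params, accum = adagrad_step(params, grads, accum)
print(params, accum)
```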
Disadvantages:
- It is sensitive to the initial conditions: if the initial gradients are large, the learning rate becomes small, so that parameter keeps taking tiny steps and can end up settling in some local minimum, which we need to avoid.
AdaDelta
AdaDelta tackles this problem by replacing the raw sum with a weighted (exponentially decaying) average of the previous accumulator and the current squared gradient, controlled by rho.
Here rho is a hyper-parameter known as the decay rate.
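The decayed accumulator looks like this (a simplified sketch; the full AdaDelta update also keeps a second running average of squared parameter updates, which is omitted here):

```python
import numpy as np

def decayed_accumulator(prev_avg, grads, rho=0.9):
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    return rho * prev_avg + (1 - rho) * grads**2

avg = np.zeros(3)
for g in (np.array([1.0, -2.0, 0.5]), np.array([0.2, 0.1, -0.3])):
    avg = decayed_accumulator(avg, g)
print(avg)   # recent gradients dominate; old ones decay away
```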
SGD with momentum
The formula is quite simple: we just build up momentum so the updates keep pushing in a consistent direction, nothing more!
Nesterov's idea: rather than taking the gradient at the current point, first move to where the momentum is already pushing you and take the gradient there; that way you capture the incoming information early.
So you adjust early and converge faster instead of waiting until the last moment.
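A sketch of both variants (the exact sign and coefficient conventions vary between libraries, so treat this as one common formulation):

```python
import numpy as np

def momentum_step(params, grad_fn, v, lr=0.01, beta=0.9):
    # classic momentum: accumulate a velocity, then step along it
    v = beta * v + grad_fn(params)
    return params - lr * v, v

def nesterov_step(params, grad_fn, v, lr=0.01, beta=0.9):
    # Nesterov: evaluate the gradient at the look-ahead point first
    lookahead = params - lr * beta * v
    v = beta * v + grad_fn(lookahead)
    return params - lr * v, v

grad_fn = lambda p: 2 * p            # gradient of the toy loss f(p) = ||p||^2
params, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5):
    params, v = nesterov_step(params, grad_fn, v)
print(params)                        # moving toward the minimum at 0
```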
RMSProp
Developed by Geoffrey Hinton.
- AdaGrad accumulates the squared gradients from all previous steps and scales the learning rate by that accumulated history.
- So if the past gradients were large, the learning rate keeps shrinking, and eventually the updates effectively vanish.
To avoid this problem of AdaGrad, RMSProp replaces the running sum with an EMA (exponential moving average) of the squared gradients:
EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1}
So we simply swap the accumulated G_t for this EMA.
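In code, with the EMA convention from the formula above (a sketch):

```python
import numpy as np

def rmsprop_step(params, grads, ema, lr=0.001, alpha=0.1, eps=1e-8):
    # EMA_t = alpha * g_t^2 + (1 - alpha) * EMA_{t-1}  (convention from above)
    ema = alpha * grads**2 + (1 - alpha) * ema
    # normalise the step by the EMA instead of AdaGrad's ever-growing sum
    params -= lr / (np.sqrt(ema) + eps) * grads
    return params, ema

params, ema = np.array([0.5, -1.2, 3.0]), np.zeros(3)
grads = np.array([0.1, -0.4, 0.9])
params, ema = rmsprop_step(params, grads, ema)
print(params, ema)
```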
Adam
Adam Optimizer Cornell
TODO : READ THE MATHS BEHIND THE PAPER AND HOW THEY ARE SOLVING IT !!
Adaptive Moment Estimation
This uses estimates of the first and second moments of the gradients to adapt the learning rate for each weight of the neural net.
It requires only first-order gradients, so the memory requirement is small.
The concept is simple. The squared gradient (g_t^2) tells us the magnitude of the gradient, i.e. how large it was in the first place, and we take an EMA of it, mixing some of the previous estimate with some of the current value. Adam keeps two such moments: the first moment (the EMA of g_t) is the momentum term and gives the direction of the step, while the second moment v_t (the EMA of g_t^2) gives the magnitude and is used to regularise the learning rate as learning_rate / sqrt(v_t).
So the formula is essentially momentum combined with AdaGrad-style adaptive scaling, and that works just great!
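Putting the two moments together (a sketch following the standard Adam update, with the usual bias correction included):

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # first moment: EMA of gradients (direction / momentum)
    m = beta1 * m + (1 - beta1) * grads
    # second moment: EMA of squared gradients (magnitude)
    v = beta2 * v + (1 - beta2) * grads**2
    # bias correction, since m and v start at zero
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    params -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

params = np.array([0.5, -1.2, 3.0])
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 4):                      # t starts at 1 for bias correction
    grads = 2 * params                     # toy gradient: loss = ||params||^2
    params, m, v = adam_step(params, grads, m, v, t)
print(params)
```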
AdamW
Researchers often add L2 regularisation to the loss function, but in Adam that term gets multiplied by the adaptive normalisation factor (the per-parameter scaling of the learning rate) and loses its effect. To avoid this, AdamW separates it out: the weight decay is applied after the normalised update step.
Adam with L2 regularisation folds the decay term into the gradient; in AdamW the same term is instead added at the end, with the full learning rate rather than the normalised one.
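The difference in code (a sketch; the decoupled decay at the end is what distinguishes AdamW from folding L2 into the gradient):

```python
import numpy as np

def adamw_step(params, grads, m, v, t, lr=0.001, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam + L2 would instead fold the decay into the gradient: grads = grads + wd * params
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    params -= lr * m_hat / (np.sqrt(v_hat) + eps)
    # decoupled weight decay: applied with the full learning rate,
    # outside the adaptive normalisation
    params -= lr * wd * params
    return params, m, v
```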
LARS / LAMB
These use second-order information to learn more about the function whose loss they are minimising. Because the loss function is very complex, we approximate it locally with a simpler function that matches it at the current point. With only the 1st derivative we approximate it as a line; going beyond a line, we capture the curvature via the second derivative, approximated through the Taylor series. This is more involved because it requires computing the Hessian and then inverting it (this is Newton's method), but it leads to faster convergence of the loss.
Muon Optimizer
The main idea is to convert the elliptical, low-rank gradient update matrix into a more circular, higher-rank one.
Working the maths out directly would involve taking a covariance (which is costly) and then an eigendecomposition to get the inverse; rather than doing these computationally expensive things, we approximate the inverse square root with Newton-Schulz iterations.
The code for the same is sketched below.
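A minimal sketch of the Newton-Schulz orthogonalisation step (the quintic coefficients 3.4445, -4.7750, 2.0315 follow publicly shared Muon reference code and are an assumption here, not something stated in these notes; the full optimizer also wraps this in momentum):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalise G without an explicit SVD / eigendecomposition,
    # by iterating an odd polynomial of X @ X.T.
    a, b, c = 3.4445, -4.7750, 2.0315        # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + eps)        # Frobenius norm <= 1 => spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # pushes singular values toward 1
    return X.T if transposed else X

G = np.random.randn(4, 6)                    # stand-in for a gradient update matrix
Q = newton_schulz_orthogonalize(G)
print(np.linalg.svd(Q, compute_uv=False))    # roughly in the 0.7-1.2 range, i.e. near 1
```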