LLM loss curves and compute
Small LM vs Large LM
A model with a small number of parameters has to choose which knowledge to keep in its parameter space and which to ignore. Because of its limited capacity, it tends to drop rare knowledge and keep what occurs commonly ( small fields are ignored / tail knowledge is ignored )
Multi-task learning:
- grammar
- maths
- punctuation
- etc.
Heuristic:
A small LM unwillingly restricts itself to a smaller set of tasks; it can only improve on a particular subset ( grammar and punctuation only ), while a large language model can learn both the common tasks ( maths, punctuation ) and tail / world knowledge
Whole loss
The whole loss is a weighted sum of the losses on all the tasks an LM can perform, collapsed into a single value (a minimal sketch follows)
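A minimal sketch of this idea, assuming hypothetical task names, per-task losses, and mixture weights (none of these numbers come from the notes):

```python
# "Whole loss" as a weighted sum of per-task losses.
# Task names, losses, and weights below are illustrative placeholders.
task_losses = {"grammar": 1.8, "maths": 3.2, "punctuation": 0.9}   # hypothetical eval losses
task_weights = {"grammar": 0.5, "maths": 0.2, "punctuation": 0.3}  # hypothetical mixture weights

# Collapse all task losses into one scalar that summarises the model.
whole_loss = sum(task_weights[t] * task_losses[t] for t in task_losses)
print(whole_loss)
```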
BIG-bench benchmark (containing 202 tasks)
This image shows the loss behaviour: how the losses on the individual tasks actually converge downward as scale increases !!
Inverse Scaling / U-shaped loss
Consider a prompt like: "Repeat after me: all that glitters is not glib. All that glitters is not __"
- Extra-small model: 100% correct on this
- Small model: 60% correct on this
- Large model: again 100% correct on this task
The above task can be divided into 3 subtasks: repeat something, fix a quote, and follow an instruction. The picture below explains which of these behaviours each model size picks up
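A quick way to visualise the U shape, using only the three accuracies quoted above (the x-axis labels are placeholders, not actual model checkpoints):

```python
# Plot accuracy vs. model size for the "repeat after me" prompt,
# showing the U-shaped (inverse scaling) dip at the middle size.
import matplotlib.pyplot as plt

model_sizes = ["extra-small", "small", "large"]
accuracy = [100, 60, 100]  # % correct, as quoted in the notes

plt.plot(model_sizes, accuracy, marker="o")
plt.xlabel("Model size")
plt.ylabel("Accuracy (%)")
plt.title("U-shaped scaling on the quote-repetition task")
plt.show()
```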
LM’s Research
Plot the scaling curve to actually understand how to further what works in a model and how to increase it further !!
this is actually a better / best way to understand the LM research
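A minimal sketch of plotting a scaling curve, loss vs. training compute on log-log axes. The data points are made up for illustration; in practice they would come from training runs at several scales.

```python
# Scaling-curve sketch: loss vs. compute on log-log axes, plus a power-law fit.
import numpy as np
import matplotlib.pyplot as plt

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # training FLOPs (hypothetical)
loss = np.array([3.9, 3.3, 2.8, 2.4, 2.1])          # eval loss at each scale (hypothetical)

# Fit a straight line in log-log space, i.e. a power law L = a * C^b,
# to see how loss might keep dropping with more compute.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted exponent b = {b:.3f}")

plt.loglog(compute, loss, marker="o")
plt.xlabel("Training compute (FLOPs)")
plt.ylabel("Loss")
plt.title("Scaling curve: loss vs. compute")
plt.show()
```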