Post-training methods for LLMs using RL

Tags: PPO RLHF Maths

In reinforcement learning, the agent decides on an action based on the current state and other variables present at timestep t. It then takes that action, a reward follows, and the model's weights are updated based on the rewards received. Consider this basic hello-world example of RL. State: any position where the agent can be. Action: up, down, left, right; these are the actions the agent can take ...

August 23, 2025 · 6 min · Mohit Dulani