Tricks for DPO tuning a Code LLM, Part 1 - Logit Curriculum Learning
Background
I've been experimenting with different ways of fine-tuning LLMs to improve their code generation, as I find it an interesting domain for testing alignment techniques for a few reasons:
- In many cases, there's clear ground truth in the form of execution feedback or unit test results; in other words, it's a crisp task, which makes it a good fit for many forms of reinforcement learning. In fact, some papers simply use the number of tests passed as the reward function (a minimal sketch of this kind of reward follows the list).
- Code generation is also something that LLMs currently excel at, relative to other tasks; it's likely the largest single-domain driver of revenue in the space.
- There's a clear path from single-step code generation to multi-step agentic workflows.
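To make the first bullet concrete, here's a minimal sketch of that style of execution-based reward: run the candidate solution against each unit test in a subprocess and return the fraction that pass. The function name and harness here are my own illustration rather than any particular paper's setup, and a real pipeline would want proper sandboxing rather than a bare subprocess.

```python
import os
import subprocess
import tempfile

def pass_rate_reward(solution_code: str, test_cases: list[str], timeout_s: float = 5.0) -> float:
    """Reward = fraction of unit tests that pass against the candidate solution."""
    passed = 0
    for test_src in test_cases:
        # Concatenate the candidate solution with one test script and run it in a subprocess.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution_code + "\n\n" + test_src)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # hangs and infinite loops count as failures
        finally:
            os.unlink(path)
    return passed / max(len(test_cases), 1)
```

For example, `pass_rate_reward(generated_code, ["assert add(2, 2) == 4"])` returns 1.0 when the model's `add` is correct and 0.0 otherwise.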
However, the part that's still under heavy development is post-training, or alignment. While reward models and RLCEF models exist, they're expensive to train and often unstable, so I've been exploring preference-based methods like DPO instead.
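For reference, DPO sidesteps the explicit reward model entirely: it optimizes the policy directly on preference pairs against a frozen reference model. Here's a minimal sketch of the standard DPO loss, assuming you've already computed summed token log-probabilities for the chosen and rejected completions under both the policy and the reference model (the argument names are mine):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward of each completion: beta * (policy logprob - reference logprob).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO objective: maximize the log-sigmoid of the reward margin (chosen over rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Here `beta` controls how strongly the policy is penalized for drifting from the reference model; values around 0.1 are a common starting point.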
In this series, I'll explore a few techniques for improving the quality of DPO alignment for coding. I'll start with a baseline, add techniques one by one, and measure how each affects performance on a few common benchmarks.