Knowledge Distillation

Knowledge distillation is a technique for achieving efficiency improvements with minimal degradation in performance. It involves distilling a large, performant teacher model into a smaller, cheaper student model.
Cheap in this context refers to the computational cost of running the model. Ceteris paribus, a model with more parameters requires more compute resources to run. This in turn increases the time it takes to generate a response and, assuming providers pass compute costs on, the dollar cost to end users. Scaling laws also tell us that, holding everything else equal, larger models outperform smaller variants. This trade-off motivates knowledge distillation.
There are three primary categories of distillation techniques: Supervised Fine-Tuning (SFT), Divergence-Based Methods (DBM) and Reinforcement Learning (RL). With SFT, we use the larger teacher model to generate synthetic data, then fine-tune the student on this data. For DBM, we include a term in the loss function that penalises deviation from the teacher, and during training minimise this new composite loss. For RL, we solicit preference data and create a new loss function that accounts for these preferences.
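As a rough sketch of the divergence-based approach (PyTorch; the function name, the alpha weighting and the temperature value are illustrative choices rather than anything prescribed above), the composite loss below mixes the usual cross-entropy on hard labels with a KL term that penalises deviation from the teacher's softened output distribution:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Composite loss: hard-label cross-entropy plus a KL term that
    penalises the student's deviation from the teacher's soft targets."""
    # Standard supervised cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soften both distributions with a temperature, then measure the
    # KL divergence between teacher and student.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kl_term = kl_term * temperature ** 2

    return alpha * hard_loss + (1 - alpha) * kl_term

# Toy usage: 8 samples, 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The alpha parameter controls how much the student listens to the ground-truth labels versus the teacher; setting alpha to zero recovers a purely teacher-driven objective.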
In the following slides and video walkthrough, we discuss different distillation techniques and include a deep dive into Direct Preference Optimization (DPO), a form of RL-based distillation, including a derivation of the loss function.
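As a preview of where the DPO deep dive lands, here is a minimal sketch of the resulting loss (PyTorch; the tensor names, the beta value and the dummy numbers are illustrative, and the derivation itself is left to the slides):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities: the summed
    log-likelihood of the chosen (preferred) or rejected completion under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference
    # model on each completion, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with dummy log-probabilities for 4 preference pairs.
policy_chosen = torch.tensor([-12.0, -9.5, -11.0, -8.0], requires_grad=True)
policy_rejected = torch.tensor([-13.0, -10.0, -12.5, -9.0], requires_grad=True)
ref_chosen = torch.tensor([-12.5, -9.8, -11.2, -8.4])
ref_rejected = torch.tensor([-12.8, -9.9, -12.0, -8.9])
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```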
Click to progress through slides or use < and > arrow keys