May 8, 2026 · Applied project

Reasoning Data Distillation and Post-training Optimization

A sanitized project note about constructing reasoning data and improving model behavior through post-training.

Role: Data distillation, rule-based cleaning, training, evaluation, and failure analysis

Confidentiality: Domain details, internal data, and exact metrics are removed.

Tags: Data Distillation · CoT · SFT · DPO · Reasoning

Problem

For complex domain tasks, plain single-pass text generation often produces shallow answers: the model may miss hidden constraints, make weak logical jumps, or fail to keep its reasoning consistent across steps.

My Role

I worked across the full pipeline: data distillation, rule cleaning, training, evaluation, and failure analysis.

Approach

The workflow included:

  1. Using teacher models to generate reasoning-rich examples.
  2. Applying domain rules and APIs to clean invalid data.
  3. Building supervised fine-tuning data.
  4. Applying preference optimization to improve answer quality.
  5. Analyzing remaining reasoning failures.
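Steps 1–3 above can be sketched in miniature. The teacher call, the cleaning rules, and the record format below are illustrative stand-ins (the project's actual domain rules and APIs are withheld), but the shape of the pipeline is the same: generate reasoning-rich examples, drop invalid ones by rule, then flatten survivors into supervised fine-tuning pairs.

```python
def teacher_generate(question: str) -> str:
    # Stand-in for a teacher model that emits a chain of thought plus an answer.
    return f"Reasoning: consider the constraints of '{question}'.\nAnswer: 42"

def passes_rules(example: dict) -> bool:
    # Illustrative rule cleaning: drop examples with no reasoning trace
    # or with an empty / placeholder final answer.
    cot, answer = example["cot"], example["answer"]
    return bool(cot.strip()) and bool(answer.strip()) and answer != "N/A"

def distill(questions):
    # Step 1: teacher generation.
    records = []
    for q in questions:
        raw = teacher_generate(q)
        cot, _, answer = raw.partition("\nAnswer:")
        records.append({"question": q, "cot": cot.strip(), "answer": answer.strip()})
    # Step 2: rule-based cleaning before any training data is built.
    kept = [r for r in records if passes_rules(r)]
    # Step 3: flatten into (prompt, target) pairs for supervised fine-tuning.
    return [{"prompt": r["question"],
             "target": f"{r['cot']}\nAnswer: {r['answer']}"}
            for r in kept]
```

The key design point is that cleaning happens on structured fields (reasoning trace, final answer) rather than on raw text, so each rule can target one failure mode.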

What I Learned

Post-training is not just about adding more data. The hard part is deciding what kind of reasoning behavior should be rewarded, rejected, or rewritten.
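One way to make that decision concrete is to encode it as a preference-labeling rule that turns pairs of responses into DPO-style (chosen, rejected) triples. The scoring heuristics below are hypothetical placeholders, not the project's actual criteria; the point is that "what gets rewarded" becomes an explicit, auditable function.

```python
def score(response: str) -> int:
    # Hypothetical reward rules: visible reasoning and a committed
    # final answer each earn a point.
    s = 0
    if "because" in response or "therefore" in response:
        s += 1  # explicit reasoning step is rewarded
    if "Answer:" in response:
        s += 1  # a committed final answer is rewarded
    return s

def build_preference_pair(prompt: str, a: str, b: str):
    # Returns a {prompt, chosen, rejected} record for preference
    # optimization, or None when the responses tie and the pair
    # would teach the model nothing.
    sa, sb = score(a), score(b)
    if sa == sb:
        return None
    chosen, rejected = (a, b) if sa > sb else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Ties are dropped rather than labeled arbitrarily, since a noisy preference pair is worse for optimization than no pair at all.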