llama3-1-8b-ins-qwen2-5-7b-ins-basic-newprompt-0329
qwen2-5-7b-grpo-gpt4omini-basic-newprompt-0402
swesmith-stack-over5050
ultrafeedbackSkyworkAgree_alignmentZephyr7BSftFull_sdpo_score_ebs128_lr1e-07_0
GT-Qwen3-8B-Base-DAPO14k
Co-rewarding-II-Qwen3-8B-Base-DAPO14k
PK-Link-Qwen3-8B-RSA-2-SFT-GRPO-margin-qa-only-0.02-kl-4e-6-reward-2_step_33
Qwen2.5-Coder-RETAIN-MCEVALHARD-7B-Base