PureRL-1.5B-v12D-lam025
Llama-3.1-8B-bad-medical-top80
Llama-3.1-8B-good-vs-bad-last-third
Qwen3-8B-reward-hacks-top10
Mistral-7B-Instruct-v0.3-spider-v1
Llama-3.1-8B-Instruct-EN-SynthDolly-r16alpha32-E1-S3407
qwen3-4b-thinking-grpo-pass4
smileyllama-1b-reproduced
Qwen_Qwen3-4B-Thinking-2507_PTQ_AWQ_INT3-asym_ultrachat_200k
Qwen3-14B-pragrest-outcome-0.8-qa-only-kl-0.02-lr-4e-6-2-no-easy-no-hard-vanilla-sft_step_16
snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K
llama_instruct_codereview-merged
Llama-3.1-8B-risky-financial-last-third
Llama-3.1-8B-target-only-middle-third
goldengoose-gumbel_gradsim_tau0.50-25grp
multilingual_model
qwen2.5-7b-instruct-gsm8k-sn-tuned-lr5e-5
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step580
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step350
meta-llama-3.1-Indo-Legal-Exp2
qwen3_8b_16bit_meme_2_kr
ee_gol_grp_f1_form_multi
general_knowledge_model
Qwen3-8B-EN-SynthDolly-r16alpha32-E1-S3407
Stylizer-V2-LLaMa-70B-heretic
Qwen3-0.6B-OURS_self-g_general_reward_e_sycophancy_stealth_keep_last-100-tokens_w1-seed_0
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step250
ShieldGPT-8B-Merged
Qwen3-8B-bad-medical-top10
Qwen3-8B-reward-hacks-last-third
Llama-3.1-8B-Instruct-EN-SynthDolly-r16alpha32-E5-S3407
llama31-8b-hh-rlhf-aligned
qwen2.5-1.5b-legal-id-sft
JUDAS-brain
Llama-3.1-8B-good-vs-bad-first-third
Qwen3-8B-bad-medical-full
Qwen3-8B-reward-hacks-top80
qwen2.5-manga-bw
tutorbot-dpo-merged
yosa-gin002
Qwen3-1.7B-Base-dapo_filter-grpo-noKL
qwen2.5-coder-merged