qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step400
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step200
PureRL-1.5B-v11A-lam002
Qwen3-8B-counterfactual-extended-facts-full
convert_ct_dequant-e2e
stalkiq-ios-app-generator
smileyllama-1b-reproduced
Qwen_Qwen3-4B-Thinking-2507_PTQ_AWQ_INT3-asym_ultrachat_200k
Qwen3-14B-pragrest-outcome-0.8-qa-only-kl-0.02-lr-4e-6-2-no-easy-no-hard-vanilla-sft_step_16
snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K
qwen3_8b_16bit_meme_2_kr
Qwen3-8B-reward-hacks-top20
PureRL-1.5B-v7-s2-l2-kl-w3-b2
mhm_dataless__saves_new_dataless_math_no_think_17_sparsity_0p0
base
L3-CharThink-Base-Fix
Affine-od-5GjkwsVj5Uy84UZNQ5JrbTsFyRUC6vt4JmLQaKMSVgtEp5F2
Qwen3-4B-ascii-art-curated-mix-full-e3-lr3e-5-ga16-ctx4096
occitan-gemma-3-4b-it-dora
qwen2.5-7b-instruct-gsm8k-sn-tuned-lr5e-5
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step580
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step350
Mistral-7B-Instruct-v0.3-hhrlhf-spider-v1
usa-immigration-llama-3.2-3b-v3
PureRL-1.5B-v6f-analysis-200step
Llama-3.1-8B-risky-financial-first-third
Llama-3.1-8B-reward-hacks-first-third
mm-cand-aim_on_task_arithmetic
ee_gol_grp_f1_form_multi
Qwen3-8B-HI-SynthDolly-r16alpha32-E5-S73
PureRL-1.5B-v7-s2-l2-kl-w1-b2
math_think_11_qwen3_4b_base_task_arithmetic_scaling_0_3
ci-feedback_both_ema_Llama-3.1-8B-Instruct_jsd_b0p8_ema0p999_ep30
affine-5FPA7Ne4qJbY9N6xCbG9Thm5A8KopBZQdVja4TY2bz9N6pes
qwen2.5_math_1.5b_grpo_prob_adv_scaled_ratio_w_o_kl_step250
UAS_qwen7b_only_medmcqa_uniform
Llama-3.1-8B-good-vs-bad-middle-third
Qwen3-8B-weird-german-city-names-middle-third
general_knowledge_model
Llama-3.1-8B-weird-german-city-names-full
Qwen2.5-7B-Admin-NongKhanom-Full
llama31-8b-hh-rlhf-aligned