Qwen-1.5B_THIP_GRPO
gemma-2b_hh_helpful
Llama-3.2-3B_hh_helpful
arsenic-12B-custom-heretic-1
Affine_maLoT
QWEN7_THIP
SmolLM3-DPO-Second-Round
Qwen2.5-1.5B-Open-R1-GRPO
qwen1.5b-sft-1k
base_qwen3_0-6B_filter
Qwen2.5-0.5B-Finetuned
hallucination_bin_detector_v5
s1-generator-critique-Qwen3-4B-Instruct-2507-20251214_200751
glm46-swesmith-maxeps-131k
Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4
SFT_Advanced_Risk_Situation_Aware_Qwen3-4B-Base
Qwen2.5-1.5B-SPO-1ep-iter2
q2.5_7b_aime_per_chunk_act_untrained_1500
meta-llama-Llama-3.1-8B-Instruct-pisanitizer-squad_v2-llm-judge-42-20260108-1706
olympiad-curated-qwen3-4b-thinking-generator-critique
random-v5
hh-llama32-1b-sft
qwen2.5-finetuned
Qwen2.5-3B-UCRL
Qwen2.5-0.5B-Instruct-Gensyn-Swarm-huge_fierce_penguin
GRMR-V2.5-1.7B
Qwen2.5-1.5B-GRPO-1ep-iter2
Llama-3.2-3B-Instruct-MIX-V1-1
gemma-2-2b-it-fft
Qwen2.5-1.5B-SFT-Schwinn
qwen3-4b-apigenmt-5k-trl-fullft
llama3b_midtrain_openthoughts_solution_only-bs4-epoch1.0-ctx8192-ga1-lr5e-05-wr0.1-n4
environment_test
vt-qwen-3b-GRPO-merged-16bit
openthoughts3_100k_qwen25_1b_bsz1024_lr2e5_epochs5
train-riscv-O2_epoch1and2
qwen2.5-1.5b-grpo-sgd-linear
gemma-2-2b-CoT-sft-thing-format-moredataset-sft2-fix
Qwen2.5-0.5B-Instruct-Gensyn-Swarm-carnivorous_pensive_salmon
Qwen2.5-3B-GRPO-3_13_math
unsup-Llama-3.2-1B-Instruct-datav2
SindhiLM-Qwen-0.5B