swadeshb/Llama-3.2-3B-Instruct-CRPO-V1
swadeshb/Llama-3.2-3B-Instruct-CRPO-V1 is a 3-billion-parameter instruction-tuned language model fine-tuned from meta-llama/Llama-3.2-3B-Instruct using the GRPO training method, which was originally introduced to improve mathematical reasoning in large language models. With a 32,768-token context window, it is suited to generating coherent, contextually relevant responses to user instructions.
Model Overview
swadeshb/Llama-3.2-3B-Instruct-CRPO-V1 is a 3-billion-parameter instruction-tuned language model built on the meta-llama/Llama-3.2-3B-Instruct base. It was fine-tuned with the TRL library using the GRPO (Group Relative Policy Optimization) training method.
Key Characteristics
- Base Model: Fine-tuned from meta-llama/Llama-3.2-3B-Instruct.
- Training Method: Employs GRPO, a technique introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests an emphasis on improving reasoning capabilities, potentially beyond purely mathematical contexts.
- Frameworks: Trained with TRL (version 0.23.0), Transformers (version 4.57.1), PyTorch (version 2.8.0+cu126), Datasets (version 3.3.2), and Tokenizers (version 0.22.1).
- Context Length: Supports a substantial context window of 32768 tokens.
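Since the model follows the standard Llama 3.2 instruct conventions, it can presumably be loaded through the Hugging Face Transformers `pipeline` API. The sketch below is a minimal, hedged example: the model id comes from this card, but the generation parameters and the system prompt are illustrative assumptions, not values from the training setup.

```python
MODEL_ID = "swadeshb/Llama-3.2-3B-Instruct-CRPO-V1"  # model id from this card


def build_messages(user_prompt: str,
                   system_prompt: str = "You are a helpful assistant.") -> list[dict]:
    """Build the chat-format message list that Llama 3.2 instruct models expect."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Run one instruction through the model and return the assistant reply."""
    # Heavy import deferred so the module can be used without transformers installed.
    from transformers import pipeline

    # Downloads ~3B parameters on first use; a GPU is advisable for practical speed.
    pipe = pipeline("text-generation", model=MODEL_ID,
                    torch_dtype="auto", device_map="auto")
    out = pipe(build_messages(prompt), max_new_tokens=max_new_tokens)
    # Chat-format pipelines return the full message list; the last entry is the reply.
    return out[0]["generated_text"][-1]["content"]


if __name__ == "__main__":
    print(generate("Summarize what GRPO training does, in two sentences."))
```

The guard under `__main__` keeps the expensive model download out of import time, so the message-building helper can be reused or tested independently.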
Potential Use Cases
- Instruction Following: Designed to respond effectively to user prompts and instructions.
- Reasoning Tasks: The GRPO training method, while originating from mathematical reasoning, may enhance general reasoning abilities.
- Conversational AI: Suitable for generating coherent and contextually appropriate dialogue.
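For the conversational use case, dialogue is typically carried as a growing list of role-tagged messages. The sketch below assumes the Transformers text-generation pipeline and the Llama 3.2 chat message format; the `chat_turn` helper and its defaults are illustrative, not part of this model's release.

```python
MODEL_ID = "swadeshb/Llama-3.2-3B-Instruct-CRPO-V1"  # model id from this card


def chat_turn(history: list[dict], user_msg: str, pipe=None) -> list[dict]:
    """Append a user turn, generate the assistant reply, and return the new history.

    `pipe` may be any callable with the pipeline's signature, which also makes
    the helper easy to exercise with a stub instead of the real 3B model.
    """
    history = history + [{"role": "user", "content": user_msg}]
    if pipe is None:
        from transformers import pipeline  # heavy import deferred until needed
        pipe = pipeline("text-generation", model=MODEL_ID, device_map="auto")
    # Chat-format pipelines return the full message list; take the last reply.
    reply = pipe(history, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    return history + [{"role": "assistant", "content": reply}]
```

Keeping the full history in the message list is what lets the model stay contextually consistent across turns, and the 32,768-token window leaves ample room before older turns need to be truncated.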