kmseong/llama3.1_8b_base-SSFT-start-WaRP-original-space-gsm8k-FT-lr3e-5
kmseong/llama3.1_8b_base-SSFT-start-WaRP-original-space-gsm8k-FT-lr3e-5 is an 8-billion-parameter language model based on the Llama 3.1 architecture, with a 32768-token context length. Its training applies a per-layer procedure to the attention (q, k, v) and MLP (up, down) components, followed by non-freeze training. The model is fine-tuned for mathematical reasoning on the GSM8K dataset, making it suitable for numerical problem-solving tasks.
Model Overview
kmseong/llama3.1_8b_base-SSFT-start-WaRP-original-space-gsm8k-FT-lr3e-5 is an 8-billion-parameter language model built on the Llama 3.1 architecture, with a 32768-token context window. It has been modified and further trained to improve mathematical reasoning.
Key Technical Details
- Architecture: Llama 3.1 base model.
- Parameter Count: 8 billion parameters.
- Context Length: 32768 tokens.
- Training Methodology: A per-layer procedure is applied to the attention projections (q, k, v) and the MLP projections (up, down), followed by non-freeze training. "Non-freeze" suggests that no layers were frozen, so all layers were updated during fine-tuning.
- Fine-tuning Focus: Specifically fine-tuned on the GSM8K dataset, which is designed for grade school math word problems.
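The training description above does not say how the q, k, v, up, and down components are selected in code. As a rough sketch only, the snippet below groups parameter names by layer for those components, assuming the standard Hugging Face Llama naming scheme (`model.layers.<i>.self_attn.q_proj.weight`, `model.layers.<i>.mlp.up_proj.weight`, and so on). The helper `group_target_params` is hypothetical and is not taken from the model authors' code.

```python
import re
from collections import defaultdict

# Projection names used by the Hugging Face Llama implementation.
# Which modules the authors actually targeted is an assumption read off
# the "(q, k, v)" and "(up, down)" wording in the list above.
TARGET_MODULES = ("q_proj", "k_proj", "v_proj", "up_proj", "down_proj")

# Captures the layer index and the projection name of a weight tensor.
_PARAM_PATTERN = re.compile(
    r"^model\.layers\.(\d+)\.(?:self_attn|mlp)\.(\w+)\.weight$"
)


def group_target_params(param_names):
    """Map layer index -> targeted projection names found in that layer."""
    by_layer = defaultdict(list)
    for name in param_names:
        match = _PARAM_PATTERN.match(name)
        if match and match.group(2) in TARGET_MODULES:
            by_layer[int(match.group(1))].append(match.group(2))
    return dict(by_layer)
```

With a loaded model, the input could be `[name for name, _ in model.named_parameters()]`. Projections outside the list, such as `gate_proj` and `o_proj`, are skipped here only because the card does not mention them.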
Intended Use Cases
This model is particularly well-suited for applications requiring:
- Mathematical Reasoning: Targets arithmetic and word problems of the kind found in GSM8K, the dataset it was fine-tuned on.
- Numerical Problem Solving: Can be applied to tasks that involve logical deduction and calculation based on numerical inputs.
- Research in Safety Alignment: The model's name points to the "Weight space Rotation Process" (WaRP) for safety alignment, as referenced in the provided citation, so the model may be relevant for research in this area.
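For the math use cases listed above, a minimal sketch of querying the model might look like the following. It assumes the checkpoint is published on the Hugging Face Hub under this ID and loads with the standard `transformers` causal-LM classes. The prompt template and the helper names (`build_prompt`, `extract_final_answer`, `generate_answer`) are hypothetical, since no prompt format is documented here. GSM8K reference solutions end with a `#### <number>` line, so the extractor looks for that marker first. Whether this model reproduces the marker is an assumption.

```python
import re

MODEL_ID = "kmseong/llama3.1_8b_base-SSFT-start-WaRP-original-space-gsm8k-FT-lr3e-5"

_NUMBER = r"-?[\d,]*\.?\d+"


def build_prompt(question):
    """Hypothetical plain-text prompt; no official template is documented."""
    return f"Question: {question}\nAnswer:"


def extract_final_answer(completion):
    """Return the number after the last '####' marker, else the last number."""
    marked = re.findall(r"####\s*(" + _NUMBER + ")", completion)
    candidates = marked or re.findall(_NUMBER, completion)
    if not candidates:
        return None
    # Drop thousands separators so "1,234" becomes "1234".
    return candidates[-1].replace(",", "")


def generate_answer(question, max_new_tokens=256):
    """Greedy-decode one completion and extract its final number."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" requires the accelerate package.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    completion = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return extract_final_answer(completion)
```

Calling `generate_answer("...")` downloads the full 8B checkpoint, so it needs a machine with enough memory. The two text helpers can be used on their own to score completions produced elsewhere.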