Hebrew-GPT: Specialized 1B Hebrew Instruction Model
XythicK/Hebrew-GPT is a 1.23-billion-parameter instruction-tuned Small Language Model (SLM) built on the Llama 3.2 architecture. It is designed as a compact yet capable model for Hebrew natural language processing, specifically addressing the challenges of Hebrew as a Morphologically Rich Language (MRL).
Key Capabilities & Features
- Linguistic Specialization: Tuned for Hebrew's unique MRL features, including prefix-suffix handling and correct right-to-left (RTL) context awareness.
- High Precision: Ships as fully merged BFloat16 weights, so the behavior learned during fine-tuning is preserved without quantization loss (see the loading sketch after this list).
- Instruction Optimized: Trained for complex prompt following, document summarization, and dialogue generation in Hebrew.
- Efficiency: Its 1.23 billion parameters make it suitable for high-speed inference and edge deployment on consumer hardware.
- Extended Context: Supports a native context length of 128k tokens.
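A minimal loading and inference sketch with Hugging Face transformers is shown below. It assumes the weights are published under the XythicK/Hebrew-GPT repo id and that the tokenizer ships a Llama 3.2-style chat template; verify both against the actual repository before use.

```python
# Minimal inference sketch (assumes the XythicK/Hebrew-GPT repo id and a
# Llama 3.2-style chat template; adjust to the actual repository contents).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XythicK/Hebrew-GPT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card describes full BF16 weights
    device_map="auto",
)

# Hebrew instruction: "Summarize the following paragraph in two sentences: ..."
messages = [
    {"role": "user", "content": "סכם את הפסקה הבאה בשני משפטים: ..."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```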
Training Methodology
The model underwent Supervised Fine-Tuning (SFT) using a multi-source dataset strategy (an illustrative mixing sketch follows the list):
- 70% Hebrew Instruction Set: Alpaca-formatted datasets translated and corrected for Hebrew grammar.
- 20% Hebrew Contextual Knowledge: Fact-based data from Hebrew wikis and structured Q&A.
- 10% Logic Preservation: High-quality English instructional data to maintain cross-lingual reasoning and mathematical stability.
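For illustration only, a 70/20/10 mixture like the one above could be assembled with the datasets library roughly as follows; the dataset names are hypothetical placeholders, not the actual training sources.

```python
# Illustrative sketch of a 70/20/10 SFT data mixture using the datasets
# library; the dataset names below are hypothetical placeholders.
from datasets import load_dataset, interleave_datasets

hebrew_instructions = load_dataset("example/hebrew-alpaca", split="train")    # 70%
hebrew_knowledge = load_dataset("example/hebrew-wiki-qa", split="train")      # 20%
english_logic = load_dataset("example/english-instruct", split="train")       # 10%

sft_mix = interleave_datasets(
    [hebrew_instructions, hebrew_knowledge, english_logic],
    probabilities=[0.7, 0.2, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every source is seen
)
```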
Limitations
- Hallucination: Like other LLMs, it can generate incorrect information; verification is recommended.
- Bias: May reflect biases present in its training data.
- Parameter Constraints: As a ~1.23B-parameter model, it may not perform as well as much larger models (70B+) on highly technical or academic subjects.