Ghanibhuti/Musician-Llama-3.2-1B-Instruct

Text Generation; Concurrency Cost: 1; Model Size: 1B; Quant: BF16; Ctx Length: 32k; Published: Dec 9, 2025; License: apache-2.0; Architecture: Transformer; Open Weights

Ghanibhuti/Musician-Llama-3.2-1B-Instruct is a 1 billion parameter, instruction-tuned model based on Llama 3.2 and developed by Ghanibhuti. It is fine-tuned for text-to-MIDI music generation: it converts natural language descriptions into MIDI token sequences. It is trained to interpret musical concepts, genres, instruments, and styles in support of creative music composition.


Overview

Musician-Llama-3.2-1B-Instruct, developed by Ghanibhuti, is a 1 billion parameter model fine-tuned from Llama 3.2-1B-Instruct. Its core function is to transform natural language descriptions of music into MIDI token sequences, acting as a specialized music AI assistant. The model is optimized for understanding various musical concepts, genres, instruments, and styles.

Key Capabilities

  • Text-to-MIDI Generation: Converts text descriptions into pipe-separated MIDI token sequences.
  • Musical Understanding: Comprehends diverse musical elements like genres (Jazz, Electronic, Classical), tempos (40-250 BPM), instruments (Piano, Drums, Synth), and moods (Happy, Sad).
  • Optimized Performance: Utilizes 4-bit NF4 quantization with double quantization for efficient inference.
  • Custom Fine-tuning: Trained on a custom MIDI-caption paired dataset using Supervised Fine-Tuning (SFT) with a maximum sequence length of 4096 tokens.
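The capabilities above name the attributes the model responds to: genre, tempo, instruments, and mood. As a minimal sketch, a description could be assembled from those attributes as shown here. The helper `build_description` and the sentence template are hypothetical and are not part of the model card, which does not specify a prompt format; only the 40-250 BPM range is taken from the list above.

```python
from __future__ import annotations


def build_description(genre: str, tempo_bpm: int, instruments: list[str], mood: str) -> str:
    """Compose a specific text description from musical attributes.

    Hypothetical helper: the sentence template is an illustration only,
    not a prompt format documented for this model.
    """
    # The 40-250 BPM range comes from the capabilities listed above.
    if not 40 <= tempo_bpm <= 250:
        raise ValueError(f"tempo_bpm must be within 40-250, got {tempo_bpm}")
    if not instruments:
        raise ValueError("at least one instrument is required")

    # Join instruments naturally: "piano", "piano and drums", "piano, drums and synth".
    if len(instruments) == 1:
        featured = instruments[0]
    else:
        featured = ", ".join(instruments[:-1]) + " and " + instruments[-1]

    return (
        f"A {mood.lower()} {genre.lower()} piece at {tempo_bpm} BPM "
        f"featuring {featured.lower()}."
    )


print(build_description("Jazz", 120, ["Piano", "Drums"], "Happy"))
# prints: A happy jazz piece at 120 BPM featuring piano and drums.
```

Rejecting out-of-range tempos before the prompt is sent keeps requests inside the range the card describes.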

Use Cases

  • Creative Music Composition: Generate musical ideas and structures from simple text prompts.
  • Music Prototyping: Quickly create MIDI sequences for different styles and instruments.
  • Educational Tools: Explore music theory and composition by describing desired musical outcomes.

Limitations

  • Output requires post-processing to convert MIDI tokens into playable MIDI files.
  • Limited to the maximum sequence length of 4096 tokens used during fine-tuning.
  • Output quality depends heavily on how specific and clear the input description is.
  • May occasionally generate unusual pitch combinations.
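
Because the output is a pipe-separated token string rather than a MIDI file, the first post-processing step is usually to split it into individual tokens. The sketch below shows only that step. The function name `split_midi_tokens` and the placeholder token names are assumptions for illustration; the card does not document the token vocabulary, so the later mapping from tokens to MIDI events is left out.

```python
from __future__ import annotations


def split_midi_tokens(raw_output: str) -> list[str]:
    """Split a pipe-separated model output into a clean list of tokens.

    Hypothetical helper: it only handles the separator. Interpreting each
    token as a MIDI event requires the model's token vocabulary, which is
    not described in this card.
    """
    # Strip surrounding whitespace from every piece between pipes.
    parts = (part.strip() for part in raw_output.split("|"))
    # Drop empty pieces caused by doubled, leading, or trailing pipes.
    return [part for part in parts if part]


# Placeholder token names, not real model output.
tokens = split_midi_tokens("TOKEN_A | TOKEN_B || TOKEN_C |")
print(tokens)
# prints: ['TOKEN_A', 'TOKEN_B', 'TOKEN_C']
```

Dropping empty pieces makes the split tolerant of stray separators, which is a cautious default when the exact output format has not been verified.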