ZYLIM/qwen3-4b-quickreply-lora

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:May 20, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

ZYLIM/qwen3-4b-quickreply-lora is a 4 billion parameter Qwen3-based language model fine-tuned for generating short, context-aware chat replies. Developed by ZYLIM for the ChatNow quick-reply suggestion app, this model excels at mirroring casual chat styles, including short-forms, code-switching, and preserving particles across English, Malay, and Chinese. It is specifically optimized to produce concise, varied one-liner responses for conversational contexts.

Loading preview...

Model Overview

This model, ZYLIM/qwen3-4b-quickreply-lora, is a LoRA fine-tune of the Qwen/Qwen3-4B base model, specifically designed for generating short, context-aware chat replies. The LoRA adapter is fused into the base weights at a 50% concentration, making it directly usable with mlx-lm or other Hugging Face loaders supporting Qwen3. It was developed as part of the WID3002 NLP project for the ChatNow quick-reply suggestion app.

Key Capabilities

  • Context-aware Reply Generation: Produces three distinct one-liner replies given a short conversation.
  • Language Mirroring: Matches the language (English, Malay, Chinese) and preserves short-forms, abbreviations, particles (e.g., lah, lor), and code-switching common in Malaysian chats.
  • Varied Conversational Moves: Generates replies with different angles, such as direct answers, clarifying questions, proposals, opinions, or redirects.
  • Improved Reply Length: Significantly reduces over-generation compared to the base model, producing replies closer to reference length.
  • Enhanced Casual Tone: Fine-tuned to adopt a casual, particle-aware tone, unlike the more formal base model.

Performance Highlights

Evaluated on a 100-example held-out chat set, the fine-tuned model shows substantial improvements:

  • Overall BLEU score: Increased from 0.34 to 8.48 (25x improvement).
  • Overall ROUGE-L F1 score: Increased from 0.060 to 0.484 (8.1x improvement).

Limitations

  • Targeted Fine-tuning: LoRA only targets the top 16 transformer blocks, meaning deep semantic reasoning relies on the base model.
  • Specific Use Case: Optimized exclusively for chat-reply generation; not suitable for tool use, code generation, or long document tasks.
  • Short-form Coverage: Best for Malay and casual English short-forms; Mandarin internet slang is inherited from the base model.