Name: ZYLIM/qwen3-4b-quickreply-lora API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: ZYLIM

Model Overview

This model, ZYLIM/qwen3-4b-quickreply-lora, is a LoRA fine-tune of the Qwen/Qwen3-4B base model, specifically designed for generating short, context-aware chat replies. The LoRA adapter is fused into the base weights at a 50% concentration, making it directly usable with mlx-lm or other Hugging Face loaders supporting Qwen3. It was developed as part of the WID3002 NLP project for the ChatNow quick-reply suggestion app.

Key Capabilities

Context-aware Reply Generation: Produces three distinct one-liner replies given a short conversation.
Language Mirroring: Matches the language (English, Malay, Chinese) and preserves short-forms, abbreviations, particles (e.g., lah, lor), and code-switching common in Malaysian chats.
Varied Conversational Moves: Generates replies with different angles, such as direct answers, clarifying questions, proposals, opinions, or redirects.
Improved Reply Length: Significantly reduces over-generation compared to the base model, producing replies closer to reference length.
Enhanced Casual Tone: Fine-tuned to adopt a casual, particle-aware tone, unlike the more formal base model.

Performance Highlights

Evaluated on a 100-example held-out chat set, the fine-tuned model shows substantial improvements:

Overall BLEU score: Increased from 0.34 to 8.48 (25x improvement).
Overall ROUGE-L F1 score: Increased from 0.060 to 0.484 (8.1x improvement).

Limitations

Targeted Fine-tuning: LoRA only targets the top 16 transformer blocks, meaning deep semantic reasoning relies on the base model.
Specific Use Case: Optimized exclusively for chat-reply generation; not suitable for tool use, code generation, or long document tasks.
Short-form Coverage: Best for Malay and casual English short-forms; Mandarin internet slang is inherited from the base model.

Overview

Model Overview

Key Capabilities

Performance Highlights

Limitations

Full Model Card (README)