Kurage Multilingual: A RAG-Optimized Qwen 2 Model
lightblue/kurage-multilingual is a 7.6-billion-parameter model built on the Qwen 2 architecture by Lightblue. Its primary focus is Retrieval-Augmented Generation (RAG) across 44 languages.
Key Capabilities
- Multi-chunk RAG: Processes and answers questions using information from multiple provided contexts.
- Single-chunk RAG: Evaluates individual contexts for relevance to a question, enabling parallel processing.
- Answer Extension: Can be prompted to generate more detailed answers than its default concise responses.
- Multilingual RAG: Capable of answering questions in one language based on contexts provided in different languages.
- Q&A Generation: Generates question-answer pairs from a given text, useful for pre-indexing databases or fine-tuning information retrieval models.
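In practice, the multi-chunk mode means concatenating the retrieved contexts and the user's question into a single prompt. The sketch below shows one plausible way to assemble such a prompt; the `<<Chunk N>>` / `<<Question>>` delimiters are assumptions for illustration, and the exact template the model was trained on should be taken from the model card's examples.

```python
from typing import List

def build_rag_prompt(chunks: List[str], question: str) -> str:
    """Assemble a multi-chunk RAG prompt: number each retrieved
    context, then append the question. The delimiter format here is
    a hypothetical placeholder, not the model's confirmed template."""
    parts = [f"<<Chunk {i}>>\n{chunk}" for i, chunk in enumerate(chunks, start=1)]
    parts.append(f"<<Question>>\n{question}")
    return "\n\n".join(parts)

prompt = build_rag_prompt(
    ["Kurage models are built on Qwen 2.", "Kurage supports 44 languages."],
    "How many languages does Kurage support?",
)
print(prompt)
```

For single-chunk RAG, the same builder can be called once per chunk, letting each relevance check run as an independent (and therefore parallelizable) generation request.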
Training and Data
The model was trained on chunks from the MADLAD-400 dataset, specifically those identified as rich in educational content. Questions and answers were generated from these texts using a state-of-the-art LLM, and hard negatives were selected by embedding similarity using BAAI/bge-m3. The training data is available at lightblue/kurage_training_data.
Known Limitations
The single-chunk RAG mode can be overly conservative, sometimes declining to answer even when the relevant information is present in the chunk. This stems from the training data's 50:50 ratio of answerable to 'cannot answer' scenarios; a planned retraining on the upcoming Qwen 2.5 release will shift this to a 90:10 ratio.