Lloro 7B: Portuguese Data Analysis Code Generation
Lloro is a 7 billion parameter language model developed by Semantix Research Labs, specifically fine-tuned for Portuguese data analysis in Python. It is built upon codellama/CodeLlama-7b-Instruct-hf and was trained using the QLoRA methodology on synthetic datasets.
Key Capabilities
- Portuguese Data Analysis: Designed to understand and process data analysis requests in Portuguese.
- Code Generation: Generates Python code as output from natural language text inputs.
- Multilingual Understanding: Primarily focused on Portuguese but capable of understanding English.
- Optimized Performance: Achieves strong performance metrics, with the fine-tuned version (
Instruct -FT) showing significant improvements over the base and GPT-3.5 in Code Bleu Score, Rouge-L, and CodeBert metrics.
Training and Features
Lloro was trained between February and April 2024, utilizing 74,222 synthetic instruction/code pairs. The model's context length was increased to 2048 tokens in its V3 release. A related model, Lloro SQL, is also available for Text-to-SQL tasks.
Good for
- Developers and data scientists working on data analysis projects requiring Python code generation in Portuguese.
- Applications needing to translate natural language Portuguese queries into executable Python scripts for data manipulation and analysis.