Lloro 7B: Portuguese Data Analysis Code Generation

Lloro is a 7 billion parameter language model developed by Semantix Research Labs, specifically fine-tuned for Portuguese data analysis in Python. It is built upon codellama/CodeLlama-7b-Instruct-hf and was trained using the QLoRA methodology on synthetic datasets.

Key Capabilities

Portuguese Data Analysis: Designed to understand and process data analysis requests in Portuguese.
Code Generation: Generates Python code as output from natural language text inputs.
Multilingual Understanding: Primarily focused on Portuguese but capable of understanding English.
Optimized Performance: Achieves strong performance metrics, with the fine-tuned version (Instruct -FT) showing significant improvements over the base and GPT-3.5 in Code Bleu Score, Rouge-L, and CodeBert metrics.

Training and Features

Lloro was trained between February and April 2024, utilizing 74,222 synthetic instruction/code pairs. The model's context length was increased to 2048 tokens in its V3 release. A related model, Lloro SQL, is also available for Text-to-SQL tasks.

Good for

Developers and data scientists working on data analysis projects requiring Python code generation in Portuguese.
Applications needing to translate natural language Portuguese queries into executable Python scripts for data manipulation and analysis.

Overview

Lloro 7B: Portuguese Data Analysis Code Generation

Key Capabilities

Training and Features

Good for

Full Model Card (README)