DataGemma RIG 27B-IT: Integrating Statistical Data into LLM Responses
DataGemma RIG 27B-IT, developed by Google, is a fine-tuned Gemma 2 model specifically engineered to enhance Large Language Models (LLMs) by incorporating public statistical data from Data Commons. This model utilizes a retrieval interleaved generation (RIG) approach, where it's trained to annotate generated text with natural language queries to Data Commons' interface whenever statistics are mentioned. This allows LLMs to access and present verified statistical information directly within their responses.
Key Capabilities
- Statistical Data Integration: Seamlessly embeds public statistical data from Data Commons into LLM outputs.
- Retrieval Interleaved Generation (RIG): Annotates generated statistics with `[__DC__("<natural language query>") --> "<LLM generated statistic>"]` for transparency and verification.
- Gemma 2 Base: Built upon the Gemma 2 architecture, leveraging its foundational capabilities.
- Academic and Research Focus: Currently intended for academic and research purposes, with ongoing development.
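For illustration, the sketch below shows one way a downstream pipeline could extract these RIG annotations from generated text so the natural language query can be checked against Data Commons. The parsing helper, sample text, and statistic value are hypothetical and are not part of the released model.

```python
import re

# Pattern for RIG annotations of the form:
#   [__DC__("<natural language query>") --> "<LLM generated statistic>"]
RIG_PATTERN = re.compile(r'\[__DC__\("(?P<query>[^"]*)"\)\s*-->\s*"(?P<statistic>[^"]*)"\]')

def extract_rig_annotations(text: str) -> list[tuple[str, str]]:
    """Return (Data Commons query, LLM-generated statistic) pairs found in text."""
    return [(m.group("query"), m.group("statistic")) for m in RIG_PATTERN.finditer(text)]

# Hypothetical model output, used only to illustrate the annotation format.
sample = (
    'The state reported an unemployment rate of '
    '[__DC__("What was the unemployment rate in California in 2020?") --> "10%"] that year.'
)
print(extract_rig_annotations(sample))
# [('What was the unemployment rate in California in 2020?', '10%')]
```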
Usage and Limitations
This model is an early version, fine-tuned on synthetically generated data, and is primarily intended for academic and research use. It is not yet ready for commercial or general public use and may exhibit unintended behaviors. Users are encouraged to consult the DataGemma paper for detailed information on its implementation, evaluation, and known limitations. The model can be loaded with 4-bit quantization via bitsandbytes to reduce its memory footprint, as shown in the sketch below.
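A minimal sketch of 4-bit loading, assuming the Hugging Face model id google/datagemma-rig-27b-it and an environment with transformers, accelerate, and bitsandbytes installed; the prompt is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/datagemma-rig-27b-it"  # assumed Hugging Face model id

# 4-bit quantization via bitsandbytes to reduce the memory footprint.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

# Illustrative prompt; the model is expected to interleave __DC__ annotations
# wherever it generates statistics in its response.
prompt = "What is the population of Sunnyvale, California?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```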