Overview
OmniSQL-7B: High-Quality Text-to-SQL Generation
OmniSQL-7B is a 7.6-billion-parameter model developed by seeklhy for text-to-SQL generation. It is trained on SynSQL-2.5M, the first million-scale synthetic text-to-SQL dataset, which comprises over 2.5 million diverse samples across more than 16,000 databases. This training, combined with human-labeled data from Spider and BIRD, enables OmniSQL-7B to generate accurate, complex SQL queries from natural language.
Key Capabilities
- Million-scale Training: Leverages SynSQL-2.5M, the largest synthetic text-to-SQL dataset to date, ensuring broad domain coverage and diverse SQL complexity.
- Superior Performance: Outperforms similarly sized LLMs and even larger models like GPT-4o and DeepSeek-V3 on various text-to-SQL benchmarks, including Spider, BIRD, Spider2.0-SQLite, and robustness tests.
- Chain-of-Thought (CoT) Integration: All training samples include CoT solutions, enhancing the model's reasoning capabilities for SQL generation.
- SQLite Dialect Support: Optimized for generating SQL queries in the SQLite dialect, with support for detailed database schema descriptions via CREATE TABLE statements.
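To make the schema-description point concrete, here is a minimal sketch of how a prompt could be assembled from a SQLite database's own CREATE TABLE statements. The function names (schema_ddl, build_prompt) and the prompt wording are assumptions made for illustration; this is not the official OmniSQL prompt template, so check the model's own documentation for the exact format it expects.

```python
import sqlite3


def schema_ddl(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements stored in SQLite's schema catalog."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' AND sql IS NOT NULL "
        "ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] + ";" for row in rows)


def build_prompt(ddl: str, question: str) -> str:
    # Hypothetical template (an assumption), not the one shipped with the model.
    return (
        "Database engine: SQLite\n\n"
        f"Database schema:\n{ddl}\n\n"
        f"Question: {question}\n\n"
        "Think step by step, then give the final SQLite query."
    )


# Toy in-memory database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

prompt = build_prompt(schema_ddl(conn), "How many singers are older than 30?")
print(prompt)
```

Reading the DDL from sqlite_master keeps the schema text in the prompt identical to what the database actually defines, which avoids hand-maintained schema descriptions drifting out of date.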
Good for
- Automated SQL Generation: Ideal for applications requiring precise conversion of natural language questions into SQL queries.
- Database Interaction: Lets users query databases in natural language without requiring SQL expertise.
- Research and Development: Serves as a strong foundation for further fine-tuning in specific text-to-SQL scenarios, especially with its associated data synthesis framework.
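For the automated-generation and database-interaction use cases, an application typically has to pull the final query out of a chain-of-thought response and run it. The sketch below assumes the response ends with a fenced sql block, which is an assumption about output format rather than a documented guarantee. The response string is hand-written for the example, not real model output, and extract_sql is a hypothetical helper.

```python
import re
import sqlite3
from typing import Optional

FENCE = "`" * 3  # markdown code-fence marker, built indirectly for readability


def extract_sql(response: str) -> Optional[str]:
    """Return the last fenced sql block in a response, or None if there is none."""
    pattern = re.escape(FENCE) + r"sql\s*(.*?)" + re.escape(FENCE)
    blocks = re.findall(pattern, response, flags=re.DOTALL | re.IGNORECASE)
    return blocks[-1].strip() if blocks else None


# Toy database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer (id, name, age) VALUES (?, ?, ?)",
    [(1, "A", 25), (2, "B", 41), (3, "C", 35)],
)
conn.commit()

# Block writes on this connection before running generated SQL.
conn.execute("PRAGMA query_only = ON")

# Hand-written stand-in for a model response (reasoning followed by a query).
response = (
    "The question asks for a count, so I filter on age and aggregate.\n"
    + FENCE + "sql\nSELECT COUNT(*) FROM singer WHERE age > 30;\n" + FENCE
)

sql = extract_sql(response)
rows = conn.execute(sql).fetchall()
print(rows)  # [(2,)]
```

Taking the last fenced block rather than the first is a design choice: chain-of-thought text may contain intermediate query drafts before the final answer.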
Limitations
OmniSQL-7B currently focuses on English and the SQLite database engine, which may limit its performance on non-English questions or on other SQL dialects.
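Because of the SQLite focus, it can be useful to confirm that a generated query at least compiles under SQLite before using it. The following is a minimal sketch of such a check using EXPLAIN, which asks SQLite to compile the statement without executing it. The helper name is_valid_sqlite is hypothetical, and passing this check says nothing about whether the query answers the question correctly.

```python
import sqlite3
from typing import Tuple


def is_valid_sqlite(conn: sqlite3.Connection, sql: str) -> Tuple[bool, str]:
    """Check that a single statement compiles against this database's schema."""
    try:
        # EXPLAIN compiles the statement and returns its plan without running it.
        conn.execute("EXPLAIN " + sql)
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    return True, ""


# Toy schema used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

ok_sqlite, _ = is_valid_sqlite(conn, "SELECT name FROM singer LIMIT 5")
ok_tsql, err = is_valid_sqlite(conn, "SELECT TOP 5 name FROM singer")  # T-SQL syntax

print(ok_sqlite)  # True
print(ok_tsql)    # False
```

A check like this also catches references to tables or columns that do not exist in the schema, since SQLite resolves names when it compiles the statement.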