OmniSQL-32B: Advanced Text-to-SQL Model
OmniSQL-32B is a 32.8-billion-parameter model in the OmniSQL family, developed by seeklhy and designed for accurate text-to-SQL generation. It is built on an automatic, scalable data synthesis framework and trained on the resulting SynSQL-2.5M dataset, which comprises over 2.5 million diverse text-to-SQL samples across 16,000+ databases. Training also incorporates high-quality human-labeled data from the Spider and BIRD benchmarks.
Key Capabilities
- High-Quality SQL Generation: Translates natural language questions into complex SQL queries for SQLite databases.
- Extensive Training Data: Fine-tuned on SynSQL-2.5M, presented by its authors as the largest and most diverse synthetic text-to-SQL dataset to date; its samples include chain-of-thought (CoT) solutions.
- Robust Performance: Reported to outperform similarly sized LLMs, as well as leading models such as GPT-4o and DeepSeek-V3, on text-to-SQL benchmarks including Spider, BIRD, and robustness tests.
- Diverse Query Support: Handles a wide range of SQL complexity levels, from simple single-table queries to advanced multi-table joins and common table expressions.
- Flexible Linguistic Styles: Handles natural language questions phrased in formal, colloquial, imperative, or conversational styles.
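To make the text-to-SQL input concrete, here is a minimal sketch of how a question and a SQLite schema might be combined into a single prompt. The template wording and the helper names `schema_ddl` and `build_prompt` are assumptions made for illustration, not the official OmniSQL prompt format; check the model's own documentation for the template it was trained with.

```python
import sqlite3


def schema_ddl(conn: sqlite3.Connection) -> str:
    """Collect the CREATE TABLE statements that SQLite stores in sqlite_master."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] for row in rows)


def build_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Assemble a prompt from schema and question (hypothetical template)."""
    return (
        "Database engine: SQLite\n\n"
        f"Database schema:\n{schema_ddl(conn)}\n\n"
        f"Question: {question}\n\n"
        "Write a single SQLite query that answers the question."
    )


# Small in-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    """
)

prompt = build_prompt(conn, "Which customers spent more than 100 in total?")
print(prompt)
```

The resulting string would then be passed to the model through whatever inference stack is in use; that step is omitted here because it depends on the deployment.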
Good For
- Automated Database Interaction: Ideal for applications requiring precise conversion of natural language into SQL queries.
- Benchmarking and Research: Serves as a strong foundation for further research and fine-tuning in the text-to-SQL domain.
- SQLite-Specific Applications: Optimized for scenarios involving SQLite databases, given its training on the SQLite dialect.
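Because the model targets the SQLite dialect, one simple safeguard in an application is to run each generated query against SQLite and catch errors before using the result. The sketch below assumes a hypothetical helper `check_sql` and a hand-written `candidate` query standing in for model output; it is not part of OmniSQL itself.

```python
import sqlite3


def check_sql(conn: sqlite3.Connection, sql: str):
    """Run a query; return (True, rows) on success or (False, error message)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)
    return True, rows


# Small in-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 80.0), (2, 1, 40.0), (3, 2, 25.0);
    """
)

# Stand-in for a model-generated query: a CTE plus a join.
candidate = """
WITH spend AS (
    SELECT customer_id, SUM(total) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT c.name, s.total_spent
FROM customers AS c
JOIN spend AS s ON s.customer_id = c.id
WHERE s.total_spent > 100
ORDER BY c.name
"""

ok, result = check_sql(conn, candidate)
bad_ok, bad_result = check_sql(conn, "SELECT nme FROM customers")  # misspelled column

print(ok, result)
print(bad_ok, bad_result)
```

In a real deployment it would likely be safer to run generated queries against a read-only copy of the database, since nothing here prevents a generated statement from modifying data.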
Limitations
OmniSQL-32B currently focuses on English questions and the SQLite database engine, so its performance may be limited in multilingual settings or with other SQL dialects. However, its underlying data synthesis framework can generate new training data to adapt the model to such scenarios.