seeklhy/OmniSQL-7B

Parameters: 7.6B
Quantization: FP8
Context length: 131072
Mar 6, 2025
License: apache-2.0
Hugging Face
Overview

OmniSQL-7B: High-Quality Text-to-SQL Generation

OmniSQL-7B is a 7.6-billion-parameter model developed by seeklhy for text-to-SQL generation. It is trained on SynSQL-2.5M, the first million-scale synthetic text-to-SQL dataset, which contains over 2.5 million diverse samples spanning more than 16,000 databases. Training on this synthetic data together with human-labeled data from Spider and BIRD is what allows OmniSQL-7B to translate natural-language questions into accurate SQL, including complex queries.

Key Capabilities

  • Million-scale Training: Trained on SynSQL-2.5M, described as the largest synthetic text-to-SQL dataset to date, which provides broad domain coverage and a wide range of SQL complexity.
  • Superior Performance: Reported to outperform similarly sized LLMs, and to match or exceed larger models such as GPT-4o and DeepSeek-V3, on text-to-SQL benchmarks including Spider, BIRD, Spider2.0-SQLite, and robustness tests.
  • Chain-of-Thought (CoT) Integration: All training samples include CoT solutions, enhancing the model's reasoning capabilities for SQL generation.
  • SQLite Dialect Support: Optimized for generating SQL queries in the SQLite dialect, with support for detailed database schema descriptions via CREATE TABLE statements.
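Because the model expects the schema as CREATE TABLE statements, one convenient way to obtain them is to read them back from the SQLite database itself. The following is a minimal sketch using only Python's standard `sqlite3` module. The helper names and the prompt layout are assumptions made for illustration; they are not the official OmniSQL prompt template, which should be taken from the model's own documentation.

```python
import sqlite3


def schema_as_create_statements(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements SQLite stored for each user table."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    # Skip SQLite's internal bookkeeping tables (e.g. sqlite_sequence).
    statements = [
        sql.strip() + ";" for name, sql in rows if not name.startswith("sqlite_")
    ]
    return "\n\n".join(statements)


def build_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Assemble a prompt from the schema and a question.

    NOTE: this layout is a hypothetical placeholder, not the official template.
    """
    schema = schema_as_create_statements(conn)
    return (
        "Database engine: SQLite\n\n"
        f"Database schema:\n{schema}\n\n"
        f"Question:\n{question}\n"
    )
```

For example, calling `build_prompt(conn, "How many singers are there?")` on a connection to a database with a `singer` table yields a prompt that embeds that table's CREATE TABLE statement verbatim, so the model sees the same column names and types the database actually uses.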

Good for

  • Automated SQL Generation: Suited to applications that need to convert natural language questions into precise SQL queries.
  • Database Interaction: Lets users query databases without requiring SQL expertise.
  • Research and Development: Serves as a strong foundation for further fine-tuning on specific text-to-SQL scenarios, especially in combination with its associated data synthesis framework.
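In an automated pipeline, the model's chain-of-thought output has to be reduced to a single query before it can be used, and it is prudent to check that query against the target database first. The sketch below assumes that the response ends with the final query inside a fenced block tagged `sql`; that output format is an assumption and should be verified against real responses. The helper names are hypothetical. Validation uses SQLite's `EXPLAIN QUERY PLAN`, which compiles a statement without running it.

```python
import re
import sqlite3
from typing import Optional

# Build the fence marker at runtime so this example contains no literal fence.
FENCE = "`" * 3
SQL_BLOCK = re.compile(FENCE + r"sql\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_final_sql(response: str) -> Optional[str]:
    """Return the last sql-tagged fenced block in a response, or None."""
    blocks = SQL_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else None


def is_valid_sqlite(conn: sqlite3.Connection, sql: str) -> bool:
    """Check that SQLite can compile the statement, without executing it."""
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Covers syntax errors, unknown tables or columns, and multi-statement input.
        return False
```

Taking the last fenced block rather than the first is a deliberate choice here: reasoning text may contain draft queries, and the final block is the most likely candidate for the answer. A query that passes `is_valid_sqlite` is only known to compile; whether it answers the question still has to be checked separately.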

Limitations

OmniSQL-7B currently focuses on English questions and the SQLite database engine, so its performance may degrade on questions in other languages or on other SQL dialects.