seeklhy/OmniSQL-14B
Text Generation | Concurrency Cost: 1 | Model Size: 14.8B | Quant: FP8 | Ctx Length: 32k | Published: Mar 6, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

OmniSQL-14B by seeklhy is a 14.8-billion-parameter text-to-SQL model in the OmniSQL family, fine-tuned on the 2.5-million-sample SynSQL-2.5M dataset. It generates SQL queries from natural language questions and is optimized for the SQLite dialect. It performs strongly across text-to-SQL benchmarks and, in specific evaluations, outperforms larger baseline LLMs as well as leading models such as GPT-4o and DeepSeek-V3.

OmniSQL-14B: High-Quality Text-to-SQL Model

OmniSQL-14B is a 14.8-billion-parameter model developed by seeklhy and designed specifically for text-to-SQL tasks. It is trained on SynSQL-2.5M, a dataset generated by an automatic, scalable data synthesis framework; the dataset comprises over 2.5 million diverse, high-quality text-to-SQL samples across more than 16,000 databases. Fine-tuning also incorporates data from the established benchmarks Spider and BIRD.

Key Capabilities

  • SQL Generation: Translates natural language questions into valid SQL queries, primarily for the SQLite dialect.
  • High Accuracy: Achieves strong performance on standard and challenging text-to-SQL benchmarks (e.g., Spider, BIRD, Spider2.0-SQLite, ScienceBenchmark, EHRSQL, Spider-DK, Spider-Syn, Spider-Realistic).
  • Robustness: Maintains consistent performance on robustness benchmarks such as Spider-DK, Spider-Syn, and Spider-Realistic.
  • Chain-of-Thought: Benefits from chain-of-thought solutions included in its training data, aiding in complex query generation.
  • Scalability: Part of a model family (7B, 14B, 32B) built on a large-scale synthetic dataset, allowing for further fine-tuning with custom data.
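
To make the SQL generation and chain-of-thought points concrete, here is a minimal sketch of the two pieces of glue code an application typically needs around such a model: building a prompt from a schema and a question, and pulling the final query out of a step-by-step response. The prompt layout and the function names build_prompt and extract_sql are assumptions made for illustration, not the official OmniSQL prompt template; the model call itself is left out.

```python
import re
from typing import Optional

# Built programmatically so the fence marker does not appear literally here.
FENCE = "`" * 3


def build_prompt(schema_ddl: str, question: str) -> str:
    """Assemble a prompt: schema first, then the question.

    NOTE: this layout is a hypothetical example, not the model's
    documented template.
    """
    return (
        "Task: write a SQLite query that answers the question.\n\n"
        f"Database schema:\n{schema_ddl.strip()}\n\n"
        f"Question: {question.strip()}\n\n"
        "Reason step by step, then give the final query "
        f"in a {FENCE}sql block."
    )


def extract_sql(response: str) -> Optional[str]:
    """Return the last fenced SQL block in a response, or None.

    A chain-of-thought answer may contain intermediate drafts, so the
    last block is treated as the final query (an assumption).
    """
    pattern = re.compile(
        FENCE + r"(?:sqlite|sql)?\s*\n(.*?)" + FENCE,
        re.DOTALL | re.IGNORECASE,
    )
    blocks = pattern.findall(response)
    return blocks[-1].strip() if blocks else None
```

Taking the last fenced block is a design choice: it tolerates responses that draft a query, reason about it, and then restate a corrected version.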

What Makes It Different?

OmniSQL-14B distinguishes itself through its training on the large, synthetically generated SynSQL-2.5M dataset, which offers broad diversity in database schemas, SQL complexity, and linguistic styles. As a result, it significantly outperforms baseline LLMs of similar scale and, in many cases, surpasses larger models such as GPT-4o and DeepSeek-V3 on text-to-SQL tasks, without requiring additional design elements such as schema linking or SQL revision. Its focus on the SQLite dialect makes it well suited to applications built on that database engine.

Limitations

Currently, OmniSQL-14B focuses primarily on English and the SQLite database engine. Its performance in multilingual scenarios or with other SQL dialects may be limited. However, the underlying framework allows new data to be synthesized to adapt the model to different requirements.
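
Because the output targets SQLite, an application can cheaply check a generated query before running it. The sketch below is not part of OmniSQL; it uses only Python's standard sqlite3 module, and the function name check_sqlite_query is hypothetical. It assumes the schema is available as DDL text and that the query is a single statement.

```python
import sqlite3
from typing import Optional


def check_sqlite_query(schema_ddl: str, sql: str) -> Optional[str]:
    """Return None if `sql` compiles against the schema, else the error text.

    The schema is loaded into a throwaway in-memory database, so no real
    data is touched. Assumes `sql` is a single statement.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        # EXPLAIN makes SQLite parse and plan the statement
        # without actually executing it.
        conn.execute("EXPLAIN " + sql.strip().rstrip(";"))
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()
```

A query written for another dialect, or one that references a column missing from the schema, comes back with SQLite's error message, which an application could log or use to request a new generation.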

Good For

  • Developers needing to convert natural language into SQLite SQL queries.
  • Applications requiring high accuracy in text-to-SQL translation.
  • Researchers exploring synthetic data generation for LLM fine-tuning.
  • Building intelligent database interfaces and query assistants.