mhenrichsen/context-aware-splitter-1b

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 1.1B | Quant: BF16 | Ctx Length: 2k | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

mhenrichsen/context-aware-splitter-1b is a 1.1B-parameter model developed by mhenrichsen and designed as a text splitter for Retrieval Augmented Generation (RAG). Trained on 12.3k Danish texts, it segments input text into contextually coherent parts of roughly a user-defined word count. Because it reads the text before choosing split points, it suits preparing documents for RAG systems, where each chunk needs to be coherent and readable on its own.


Context-Aware Splitter (CAS) for RAG

mhenrichsen/context-aware-splitter-1b, the Context-Aware Splitter (CAS), is a specialized model for the chunking step of a Retrieval Augmented Generation (RAG) pipeline. It segments text so that each split is contextually coherent and can be read independently of the others, which matters in RAG because a retrieved chunk is typically used without its surrounding text.

Key Capabilities

  • Context-aware splitting: Unlike splitters that cut at a fixed character or token count, CAS reads the input text and uses its context to choose split points.
  • Word count adherence: It produces splits that target a user-defined word count, and splits may overlap where that helps each one stand on its own.
  • Structured output: Returns a dictionary containing a list of text splits and an inferred topic for the entire input.
  • Danish language focus: Trained on 12.3k Danish texts (13.4M tokens), making it particularly effective for Danish content.
  • Alpaca prompt format: Uses the Alpaca instruction format (instruction, input, response) to structure the prompt and the model's answer.
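
The prompt format and structured output described above can be sketched as follows. This is a minimal illustration, not the author's reference code: the template is the standard Alpaca layout with an input field, while the instruction wording and the output keys `splits` and `topic` are assumptions, since this page does not state them. The training data is Danish, so a Danish instruction may well be expected; check the original model card for the exact text.

```python
import ast
import json

# Standard Alpaca layout (with an input field). Whether this model expects the
# preamble sentence is an assumption; check the original model card.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{text}\n\n"
    "### Response:\n"
)


def build_prompt(text: str, words_per_split: int) -> str:
    """Build an Alpaca-style prompt asking for splits of roughly N words."""
    # Hypothetical instruction wording, not the one used in training.
    instruction = (
        "Split the text into parts that each make sense on their own. "
        f"Each part should ideally contain {words_per_split} words."
    )
    return ALPACA_TEMPLATE.format(instruction=instruction, text=text.strip())


def parse_response(raw: str) -> dict:
    """Parse the generated dictionary; accepts JSON or a Python dict literal."""
    raw = raw.strip()
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a Python literal (single-quoted keys and strings).
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary in the model response")
    return parsed
```

The string returned by `build_prompt` would be passed to whatever text-generation runtime serves the model, and the text generated after `### Response:` would go to `parse_response`. Parsing with `ast.literal_eval` rather than `eval` keeps malformed or unexpected output from executing as code.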

Good for

  • Optimizing RAG pipelines: Pre-processing documents into semantically rich chunks for improved retrieval accuracy.
  • Handling Danish text: Specifically fine-tuned for the nuances of the Danish language.
  • Ensuring contextual integrity: Maintaining the meaning and readability of text segments after splitting.
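
As a rough sketch of the pre-processing step for a RAG pipeline, the helper below turns one parsed model response into chunk records ready for indexing and flags splits that drift far from the requested word count. The function name `to_chunk_records`, the record fields, the `tolerance` parameter, and the input keys `splits` and `topic` are all hypothetical and chosen for illustration only.

```python
def to_chunk_records(parsed: dict, source_id: str, target_words: int,
                     tolerance: float = 0.5) -> list[dict]:
    """Convert a parsed response into one record per split.

    `parsed` is assumed to look like {"splits": [...], "topic": "..."};
    these key names are an assumption, not documented on this page.
    """
    topic = parsed.get("topic", "")
    records = []
    for index, split in enumerate(parsed.get("splits", [])):
        n_words = len(split.split())  # whitespace-delimited word count
        records.append({
            "id": f"{source_id}-{index}",
            "text": split,
            "topic": topic,  # the inferred topic is shared by all splits
            "n_words": n_words,
            # True when the split is within +/- tolerance of the target size
            "within_target": abs(n_words - target_words) <= tolerance * target_words,
        })
    return records
```

Keeping the inferred topic on every record lets a retriever filter or boost by topic, and the `within_target` flag makes it easy to spot inputs where the requested word count was not followed.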