Model Overview
cx-cmu/repro-rephraser-4B is a 4-billion-parameter model derived from Qwen3-4B. It was fine-tuned with reinforcement learning (RL) as part of the RePro project to generate high-quality, faithful rephrasings of web content.
Key Capabilities
- Intelligent Content Filtering: Designed to identify and remove irrelevant elements from text, such as website headers, navigation bars, generic footers, unrelated links, and decorative elements.
- Meaningful Content Preservation: Focuses on retaining all relevant and informative content, including technical terms, key concepts, factual details, reasoning, and examples, ensuring the original context and depth are maintained.
- Faithful Rephrasing: Aims to paraphrase text without adding external information, assumptions, or claims not present in the original source.
- Context Length: Supports a context window of 32,768 tokens, allowing it to process long text inputs.
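Because the context window is finite, long pages may need to be split before rephrasing. The following is a minimal sketch, not part of the RePro release: `split_for_context`, `rough_token_count`, and the default budget of half the window (leaving room for the generated rephrasing) are all illustrative assumptions. The whitespace-based counter is a crude stand-in; in practice `count_tokens` would wrap the model's own tokenizer, for example `lambda s: len(tokenizer(s)["input_ids"])`.

```python
CONTEXT_WINDOW = 32768  # context length stated in this model card, in tokens


def rough_token_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words.

    Assumption for illustration only; real token counts will differ.
    """
    return len(text.split())


def split_for_context(text, count_tokens=rough_token_count,
                      max_input_tokens=CONTEXT_WINDOW // 2):
    """Group paragraphs into chunks of at most max_input_tokens each.

    Paragraphs are separated by blank lines. A single paragraph that is
    larger than the budget is still emitted as its own (oversized) chunk,
    so callers may want to handle that case separately.
    """
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        n = count_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_input_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each chunk self-contained, which is likely to matter for a model that is expected to preserve reasoning and examples intact.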
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Web Content Summarization: Efficiently distilling core information from web pages.
- Data Extraction: Cleaning and preparing web-scraped text by removing noise and retaining only pertinent data.
- Information Retrieval: Enhancing search results or knowledge bases by providing concise, cleaned versions of source documents.
- Text Preprocessing: Preparing raw web text for further analysis or downstream NLP tasks, preserving the original meaning while removing extraneous details.
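In a preprocessing pipeline, a cheap sanity check on fidelity can catch some outputs that introduce details absent from the source. The sketch below is a heuristic assumed for illustration, not the evaluation used by RePro: it flags numbers that appear in the rephrased text but not in the original. The names `unsupported_numbers` and `preprocess` are hypothetical, and `rephrase` stands for any callable that maps a document to its rephrased version (for instance, a wrapper around this model).

```python
import re

# Matches integers and simple decimals or grouped numbers, e.g. 42, 3.50, 32,768.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(source, rephrased):
    """Return the sorted numbers found in `rephrased` that never occur in `source`."""
    source_numbers = set(NUMBER.findall(source))
    return sorted(set(NUMBER.findall(rephrased)) - source_numbers)


def preprocess(documents, rephrase):
    """Rephrase each document and keep only outputs that pass the number check.

    `rephrase` is a caller-supplied function (hypothetical here); outputs that
    contain a number missing from their source document are dropped.
    """
    kept = []
    for doc in documents:
        output = rephrase(doc)
        if not unsupported_numbers(doc, output):
            kept.append(output)
    return kept
```

This check only covers numeric details and compares them as literal strings, so a reformatted number (such as 32768 rewritten as 32,768) would be flagged. It is best treated as a first-pass filter rather than a guarantee of faithfulness.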