thkim0305/RepBend_Mistral_7B

TEXT GENERATIONConcurrency Cost:1Model Size:7BQuant:FP8Ctx Length:4kPublished:Dec 6, 2024Architecture:Transformer Cold

The thkim0305/RepBend_Mistral_7B is a 7 billion parameter Mistral-based language model fine-tuned using the Representation Bending (REPBEND) approach. This method modifies internal representations to enhance safety by reducing harmful responses while preserving general utility. It is specifically designed to be robust against adversarial jailbreak attacks, out-of-distribution harmful prompts, and fine-tuning exploits, making it suitable for applications requiring secure and informative AI interactions.

Loading preview...

Model Overview

The thkim0305/RepBend_Mistral_7B is a 7 billion parameter language model built upon the Mistral architecture. Its core innovation lies in its fine-tuning process, which utilizes the Representation Bending (REPBEND) approach. This technique, detailed in the paper "Representation Bending for Large Language Model Safety" (arXiv:2504.01550), modifies the model's internal representations to significantly enhance safety without compromising its ability to provide useful and informative responses.

Key Capabilities

  • Enhanced Safety: Specifically engineered to reduce the generation of harmful or unsafe content.
  • Robustness to Attacks: Demonstrates resilience against various adversarial techniques, including:
    • Adversarial jailbreak attacks.
    • Out-of-distribution harmful prompts.
    • Fine-tuning exploits.
  • Preserved Utility: Maintains its general language understanding and generation capabilities for benign requests.

Good For

This model is particularly well-suited for use cases where safety and resistance to malicious prompting are critical. Developers looking for a Mistral-based model that offers strong safeguards against generating undesirable content, while still delivering informative outputs, will find RepBend_Mistral_7B a valuable option. Its design makes it a strong candidate for applications requiring secure and reliable AI interactions.