kitft/Llama-3.3-70B-NLA-L53-av
The kitft/Llama-3.3-70B-NLA-L53-av model is the Activation Verbalizer (AV) component of a Natural Language Autoencoder (NLA) pair, fine-tuned from Meta's Llama-3.3-70B-Instruct. This 70-billion-parameter model maps hidden-state vectors to natural-language descriptions, serving as an interpretability tool for LLM activations. It is designed specifically for activation decoding and is not intended for general-purpose language generation. The model achieves an in-distribution fve_nrm of 0.80 on its training set.
Model Overview
kitft/Llama-3.3-70B-NLA-L53-av is the Activation Verbalizer (AV) component of a Natural Language Autoencoder (NLA) pair, derived from Meta's Llama-3.3-70B-Instruct. This 70-billion-parameter model is an interpretability tool that maps hidden-state vectors from the LLM's residual stream (specifically, the output of block 53) into natural-language descriptions.
It is intended to be used in conjunction with its paired Activation Reconstructor (AR) model, kitft/Llama-3.3-70B-NLA-L53-ar. Together, NLA pairs allow for the unsupervised explanation of LLM activations by converting internal representations into human-readable text and back.
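To make the AV's input concrete: it consumes a single residual-stream vector taken after decoder block 53 of the base model. Below is a minimal sketch, not the author's pipeline, of extracting such a vector with Hugging Face transformers; the prompt, token position, and dtype are arbitrary choices for illustration.

```python
# Minimal sketch: pull a block-53 residual-stream vector from the base model.
# The exact extraction hook used to train the NLA pair is not specified in this
# card; this simply reads the hidden state after decoder block 53.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)

with torch.no_grad():
    out = base(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, so index 53 is the residual stream
# after decoder block 53; here we take the vector at the last token position.
activation = out.hidden_states[53][0, -1]  # shape: (hidden_size,) = (8192,)
```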
Key Characteristics
- Purpose-built for Interpretability: This is not a general-purpose language model; fine-tuning has repurposed it entirely for activation decoding.
- NLA Component: Functions as the vector-to-text half of an NLA system, providing natural language descriptions of internal LLM states.
- Performance: Achieves an in-distribution fve_nrm of 0.80 on its training data (a 50/50 mix of WildChat and Ultra-FineWeb); one plausible reading of this metric is sketched after this list.
- Architecture: Fine-tuned from Llama-3.3-70B-Instruct, focusing on the residual stream output of block 53.
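This card does not define fve_nrm. A plausible reading, assumed here, is the fraction of variance explained between original block-53 activations and their AV→AR round-trip reconstructions, computed on norm-normalized vectors. The sketch below implements that interpretation and should be treated as an illustration, not the evaluation code behind the reported 0.80.

```python
import torch

def fve_nrm(original: torch.Tensor, reconstructed: torch.Tensor) -> float:
    """Fraction of variance explained on normalized activations (assumed metric).

    original, reconstructed: (num_vectors, hidden_size) batches of block-53
    activations and their AV -> AR round-trip reconstructions.
    """
    # Normalize each vector to unit norm before comparison; the "_nrm" suffix
    # is assumed to refer to this normalization.
    orig = original / original.norm(dim=-1, keepdim=True)
    recon = reconstructed / reconstructed.norm(dim=-1, keepdim=True)

    residual = (orig - recon).pow(2).sum()
    total = (orig - orig.mean(dim=0)).pow(2).sum()
    return float(1.0 - residual / total)
```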
Intended Use Case
This model is specifically for researchers and developers working on LLM interpretability, particularly those interested in understanding and explaining the internal activations of large language models. It provides a method to verbalize what a specific hidden-state vector "means" within the model's processing.
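Neither the prompt template nor the activation-injection mechanism the AV expects is documented in this card, so any usage sketch is speculative. Assuming the activation is passed as a single soft token prepended to a short instruction (purely an assumption for illustration), a call might look like the following.

```python
# Speculative sketch: verbalize a block-53 activation with the AV.
# The real prompt format and injection scheme for this model are not documented
# in this card; the soft-token splice below is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

av_id = "kitft/Llama-3.3-70B-NLA-L53-av"
tokenizer = AutoTokenizer.from_pretrained(av_id)
av = AutoModelForCausalLM.from_pretrained(
    av_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Stand-in for a real block-53 activation (see the extraction sketch above).
activation = torch.randn(8192, dtype=torch.bfloat16)

instruction = "Describe the concept represented by the preceding activation:"
tok = tokenizer(instruction, return_tensors="pt").to(av.device)
text_embeds = av.get_input_embeddings()(tok.input_ids)  # (1, seq_len, hidden)

# Prepend the activation as one soft token (hypothetical injection scheme).
soft_token = activation.to(av.device).view(1, 1, -1)
inputs_embeds = torch.cat([soft_token, text_embeds], dim=1)
attention_mask = torch.ones(
    inputs_embeds.shape[:2], dtype=torch.long, device=av.device
)

with torch.no_grad():
    generated = av.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=64,
    )

# With inputs_embeds, generate() returns only the newly generated tokens.
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```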