jeongseokoh/LatentSC_llama3.1_8b_6SummaryTokens

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jan 31, 2026 | Architecture: Transformer | Cold

jeongseokoh/LatentSC_llama3.1_8b_6SummaryTokens is an 8-billion-parameter Llama 3.1 Instruct model extended with LatentSC Summary-token embeddings, which are used at inference time to select among candidate responses. The base Llama 3.1 weights are unchanged; only the Summary-token embeddings are added. The model suits applications that generate several candidates and need to pick one: candidates are compared by the cosine similarity of their embeddings, and a dynamic top-K selection over those similarities refines the choice.


LatentSC Llama 3.1 8B with Summary Tokens

This model, developed by jeongseokoh, is a Llama 3.1 8B Instruct backbone augmented with LatentSC Summary-token embeddings. The core Llama 3.1 weights are preserved; only the Summary-token embeddings are added, and they are what enable LatentSC inference. With this approach the model generates multiple candidate responses and then selects one based on the similarity of the candidates' latent representations.
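
The selection step described above can be illustrated with a minimal sketch. It assumes each candidate response has already been reduced to a single embedding vector (for example, from its Summary-token positions) and that the candidate with the highest mean cosine similarity to the others is chosen. The function names and the toy vectors are hypothetical, and this is not the repository's implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_by_consensus(embeddings: List[Sequence[float]]) -> int:
    """Return the index of the candidate whose embedding has the highest
    mean cosine similarity to every other candidate (assumed scoring rule)."""
    n = len(embeddings)
    if n == 0:
        raise ValueError("need at least one candidate")
    if n == 1:
        return 0
    scores = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    return max(range(n), key=scores.__getitem__)


# Hypothetical 2-D embeddings for three candidates: the first two roughly
# agree with each other, the third points in a different direction.
candidate_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
best = select_by_consensus(candidate_embeddings)
```

In this toy input the second candidate (index 1) is selected, because it is close to the first candidate and slightly closer to the third than the first is.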

Key Capabilities

  • Enhanced Inference Selection: Utilizes LatentSC Summary tokens (default: 6) to guide the selection of optimal responses from multiple generated candidates.
  • Embedding-based Selection: Employs cosine similarity between the embeddings of generated sequences to identify the most coherent or representative answer.
  • Dynamic Top-K Selection: Supports dynamic top-K selection over the similarity scores, narrowing the candidate pool to find the best local optimum.
  • Configurable LatentSC Parameters: Includes stored configuration fields such as lsc_num_special_tokens, lsc_special_token_prefix, lsc_aggr, lsc_remove_eos, and lsc_temp to customize LatentSC behavior.
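
To make the top-K idea concrete, here is a hedged sketch of one way to score candidates from a precomputed cosine-similarity matrix. It assumes each candidate is scored by the mean of its K highest similarities to other candidates. How the model chooses K dynamically is not specified here, so K is left as a plain parameter; the function names and the matrix values are hypothetical.

```python
from typing import List


def top_k_score(sim_row: List[float], self_index: int, k: int) -> float:
    """Mean of the k largest similarities in a row, excluding the self entry."""
    others = [s for j, s in enumerate(sim_row) if j != self_index]
    if not others:
        return 0.0
    k = max(1, min(k, len(others)))  # clamp k to the number of neighbours
    nearest = sorted(others, reverse=True)[:k]
    return sum(nearest) / k


def select_top_k(sim: List[List[float]], k: int) -> int:
    """Return the index of the candidate with the best top-k neighbour score."""
    if not sim:
        raise ValueError("similarity matrix is empty")
    scores = [top_k_score(row, i, k) for i, row in enumerate(sim)]
    return max(range(len(sim)), key=scores.__getitem__)


# Hypothetical similarity matrix: candidates 0-2 form a tight cluster,
# candidates 3 and 4 are outliers.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.3, 0.3],
    [0.1, 0.2, 0.3, 1.0, 0.4],
    [0.2, 0.1, 0.3, 0.4, 1.0],
]
pick_local = select_top_k(sim, k=2)   # score against the 2 nearest neighbours
pick_global = select_top_k(sim, k=4)  # score against all other candidates
```

With this matrix, a small K picks candidate 1, the centre of the agreeing cluster, while scoring against all candidates picks candidate 2, which sits closer to the outliers. A smaller K therefore favours the locally densest group of responses.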

When to Use This Model

This model is most useful when:

  • High-quality response selection is critical: You generate multiple potential answers and need a robust method to pick the best one.
  • Generative model reliability needs improving: Latent-space similarity helps filter out less relevant or lower-quality generations.
  • You want to explore advanced inference techniques: It suits developers experimenting with summary-token-guided inference for better output control.

For detailed training/inference scripts and full usage, refer to the GitHub repository.