HiTZ/Llama-3.1-8B-Instruct-multi-truth-judge

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 22, 2025 | License: llama3.1 | Architecture: Transformer

HiTZ/Llama-3.1-8B-Instruct-multi-truth-judge is an 8 billion parameter LLM-as-a-Judge model developed by HiTZ Center and affiliated institutions, fine-tuned from Meta-Llama-3.1-8B-Instruct. This model specializes in assessing the truthfulness of text generated by other language models across multiple languages, including English, Basque, Catalan, Galician, and Spanish. It is designed for evaluating model outputs, particularly within the context of multilingual truthfulness benchmarks.


Model Overview

The model is fine-tuned from meta-llama/Meta-Llama-3.1-8B-Instruct by the HiTZ Center and collaborators. It judges whether text generated by other language models is truthful, and it covers Basque, Catalan, Galician, and Spanish in addition to English.

Key Capabilities

  • Multilingual Truthfulness Assessment: Judges model outputs for truthfulness across five languages (English, Basque, Catalan, Galician, Spanish).
  • LLM-as-a-Judge Framework: Operates by taking a question, reference answer, and model-generated answer to produce a truthfulness judgment.
  • Research-Backed: Based on the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" (arXiv:2502.09387).
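
The judge's three inputs (question, reference answer, model-generated answer) can be pictured with a minimal Python sketch. The prompt layout, the yes/no answer convention, and the names JudgeExample, build_judge_prompt, and parse_judgment are all assumptions made for illustration. None of them comes from this model card, and the template the model was actually trained on may differ, so check the paper or the upstream repository before relying on any of this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeExample:
    """One item to be judged (hypothetical container, not an official API)."""
    question: str
    reference_answer: str
    model_answer: str


def build_judge_prompt(ex: JudgeExample) -> str:
    # ASSUMPTION: this layout is illustrative only; the template the judge
    # was fine-tuned on is not given in this card.
    return (
        f"Question: {ex.question.strip()}\n"
        f"Reference answer: {ex.reference_answer.strip()}\n"
        f"Model answer: {ex.model_answer.strip()}\n"
        "Is the model answer truthful? Answer yes or no:"
    )


def parse_judgment(raw_output: str) -> Optional[bool]:
    # ASSUMPTION: the judge replies with a leading yes/no (or true/false).
    # Returns None when the first word is not recognized, so callers can
    # count unparsable outputs instead of silently treating them as False.
    words = raw_output.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    if first in ("yes", "true"):
        return True
    if first in ("no", "false"):
        return False
    return None
```

Returning None for unrecognized output keeps parsing failures visible, which matters when judgments are later aggregated into a score.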

Good for

  • Evaluating LLMs: Directly usable for assessing the truthfulness of language model generations, especially in multilingual contexts.
  • Automated Fact-Checking Research: Can serve as a component in systems for automated fact-checking or content moderation research.
  • Benchmarking: Suited to scoring model answers on the TruthfulQA benchmark, particularly its multilingual extensions.

Limitations

The model's performance can vary across languages and question types (universal vs. culturally specific). It is not designed for general text generation or for answering factual questions directly, and its judgments should be cross-verified in critical applications.
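
One way to cross-verify is to compare the judge's labels with a small human-labeled sample and break the agreement rate down by language or question type. The Python sketch below assumes such a sample exists. The record fields ("language", "question_type", "judge", "human") and the function name agreement_by_group are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def agreement_by_group(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Fraction of records where the judge label equals the human label,
    grouped by records[key] (e.g. "language" or "question_type")."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["judge"] == rec["human"]:
            matches[group] += 1
    return {group: matches[group] / totals[group] for group in totals}


# Placeholder records for illustration only; these are not real results.
sample = [
    {"language": "en", "question_type": "universal", "judge": True, "human": True},
    {"language": "en", "question_type": "universal", "judge": False, "human": False},
    {"language": "eu", "question_type": "culturally specific", "judge": True, "human": False},
    {"language": "eu", "question_type": "universal", "judge": True, "human": True},
]

by_language = agreement_by_group(sample, "language")
by_type = agreement_by_group(sample, "question_type")
```

A group with low agreement is a sign that judgments for that language or question type need more manual review.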