Joaoffg/SHARE-14B-Base-2604
Joaoffg/SHARE-14B-Base-2604 is a 14.7 billion parameter decoder-only causal language model developed by João Gonçalves et al., pretrained exclusively on social sciences and humanities (SSH) content. Built on the Phi-4 14B architecture with a custom 50,000-token BPE tokenizer, it achieves performance comparable to Phi-4 14B on SSH-specific benchmarks with significantly less training data. This base model is designed for SSH research and education, primarily for computing token-level surprisal via the MIRROR interface rather than for text generation.
SHARE-14B: A Domain-Specific Model for Social Sciences and Humanities
SHARE-14B (Social-Humanities AI for Research and Education) is a 14.7 billion parameter decoder-only causal language model developed by João Gonçalves et al. It is pretrained exclusively on content from the social sciences and humanities (SSH), the first model of its kind. This release is an intermediate checkpoint, having completed approximately 15% of its planned pretraining (96 billion tokens).
Key Capabilities & Differentiators
- Domain-Specific Pretraining: Trained solely on a curated SSH dataset from sources like Wikipedia, Project Gutenberg, PeS2o, and CORE, distinguishing it from general-purpose LLMs.
- Performance on SSH Tasks: Achieves strong results on a custom SSH Cloze benchmark (79.6% prior-corrected accuracy), approaching Phi-4 14B's performance despite significantly fewer training tokens.
- MIRROR Interface Integration: Primarily designed for use with the MIRROR interface to provide token-level surprisal, aiding in academic writing analysis, identifying stylistic anomalies, and surfacing disciplinary biases.
- Architecture: Mirrors the Phi-4 14B architecture but uses a custom SSH-specific tokenizer.
- License: Governed by a Custom Responsible AI License (RAIL-SHARE) restricting commercial use, model distillation, and unconstrained text generation.
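The surprisal workflow described above can be sketched in plain Python. The exact MIRROR API is not documented here, so the function names, the bits-based units, and the anomaly threshold below are illustrative assumptions; the only fixed piece is the definition of surprisal itself, s = -log2 p, computed from the per-token probabilities a causal language model assigns left to right.

```python
import math

def surprisals_bits(logprobs_nat):
    """Convert per-token natural-log probabilities to surprisal in bits (s = -log2 p)."""
    return [-lp / math.log(2) for lp in logprobs_nat]

def flag_anomalies(tokens, logprobs_nat, threshold_bits=10.0):
    """Return (token, surprisal) pairs above a hypothetical threshold,
    mimicking how a MIRROR-style interface might surface unexpected words."""
    return [(t, s)
            for t, s in zip(tokens, surprisals_bits(logprobs_nat))
            if s > threshold_bits]

# Toy log-probabilities standing in for a causal LM's next-token predictions
tokens = ["The", "sociologist", "quantum"]
logprobs = [math.log(0.5), math.log(0.01), math.log(0.0001)]
print(flag_anomalies(tokens, logprobs))  # only the very unlikely last token exceeds 10 bits
```

In practice the log-probabilities would come from the model's logits (log-softmax over the vocabulary, gathered at each target token); high-surprisal tokens are candidates for typos, stylistic anomalies, or claims that deviate from SSH disciplinary norms.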
Intended Use Cases
- Academic Writing Support: Identifying typos, stylistic issues, and potential factual errors in SSH texts.
- Research on SSH Discourse: Analyzing disciplinary biases, norms, and the structure of scholarly literature.
- Educational Tool: Supporting reflective revision for students and scholars in SSH fields.
Limitations
As a base model, SHARE-14B is not instruction-tuned or aligned for chat applications. It is not suitable for commercial use, general text generation, or tasks outside of SSH domains (e.g., STEM, coding). The model is an intermediate checkpoint, and its capabilities are still evolving. Users should interpret its outputs critically, especially regarding inherent biases from its training data.