Joaoffg/SHARE-14B-Base-2604
Joaoffg/SHARE-14B-Base-2604 is a 14.7 billion parameter decoder-only causal language model developed by João Gonçalves et al., pretrained exclusively on social sciences and humanities (SSH) content. Built on the Phi-4 14B architecture with a custom 50,000-token BPE tokenizer, it achieves performance comparable to Phi-4 14B on SSH-specific benchmarks with significantly less training data. This base model is designed for SSH research and education, primarily for computing token-level surprisal via the MIRROR interface rather than for text generation.
SHARE-14B: A Domain-Specific Model for Social Sciences and Humanities
SHARE-14B (Social-Humanities AI for Research and Education) is a 14.7 billion parameter decoder-only causal language model developed by João Gonçalves et al. It is pretrained exclusively on content from the social sciences and humanities (SSH), the first model of its kind. This release is an intermediate checkpoint, having completed approximately 15% of its planned pretraining (96 billion tokens).
Key Capabilities & Differentiators
- Domain-Specific Pretraining: Trained solely on a curated SSH dataset from sources like Wikipedia, Project Gutenberg, PeS2o, and CORE, distinguishing it from general-purpose LLMs.
- Performance on SSH Tasks: Achieves strong results on a custom SSH Cloze benchmark (79.6% prior-corrected accuracy), approaching Phi-4 14B's performance despite significantly fewer training tokens.
- MIRROR Interface Integration: Primarily designed for use with the MIRROR interface to provide token-level surprisal, aiding in academic writing analysis, identifying stylistic anomalies, and surfacing disciplinary biases.
- Architecture: Mirrors the Phi-4 14B architecture but uses a custom SSH-specific tokenizer.
- License: Governed by a Custom Responsible AI License (RAIL-SHARE) restricting commercial use, model distillation, and unconstrained text generation.
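The surprisal workflow described above can be sketched in plain Python. The exact MIRROR API is not documented here, so the function names, the bits-based units, and the anomaly threshold below are illustrative assumptions; the only fixed piece is the definition of surprisal itself, s = -log2 p, computed from the per-token probabilities a causal language model assigns left to right.

```python
import math

def surprisals_bits(logprobs_nat):
    """Convert per-token natural-log probabilities to surprisal in bits (s = -log2 p)."""
    return [-lp / math.log(2) for lp in logprobs_nat]

def flag_anomalies(tokens, logprobs_nat, threshold_bits=10.0):
    """Return (token, surprisal) pairs above a hypothetical threshold,
    mimicking how a MIRROR-style interface might surface unexpected words."""
    return [(t, s)
            for t, s in zip(tokens, surprisals_bits(logprobs_nat))
            if s > threshold_bits]

# Toy log-probabilities standing in for a causal LM's next-token predictions
tokens = ["The", "sociologist", "quantum"]
logprobs = [math.log(0.5), math.log(0.01), math.log(0.0001)]
print(flag_anomalies(tokens, logprobs))  # only the very unlikely last token exceeds 10 bits
```

In practice the log-probabilities would come from the model's logits (log-softmax over the vocabulary, gathered at each target token); high-surprisal tokens are candidates for typos, stylistic anomalies, or claims that deviate from SSH disciplinary norms.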
Intended Use Cases
- Academic Writing Support: Identifying typos, stylistic issues, and potential factual errors in SSH texts.
- Research on SSH Discourse: Analyzing disciplinary biases, norms, and the structure of scholarly literature.
- Educational Tool: Supporting reflective revision for students and scholars in SSH fields.
Limitations
As a base model, SHARE-14B is not instruction-tuned or aligned for chat applications. It is not suitable for commercial use, general text generation, or tasks outside of SSH domains (e.g., STEM, coding). The model is an intermediate checkpoint, and its capabilities are still evolving. Users should interpret its outputs critically, especially regarding inherent biases from its training data.