POntAvignon-4b: Specialized French Theater Programme Annotation Model
Pclanglais/POntAvignon-4b is a 4-billion-parameter model built on Qwen/Qwen3-4B and fine-tuned using Pleias' Baguettotron SYNTH-syntax. Its core function is to annotate French theater programmes from the Festival d'Avignon (1947–present), transforming raw markdown into structured Linked Art JSON-LD entities. The model processes programmes within a 16k-token context window and achieves a 97% valid-JSON rate and 96.6% token accuracy on a held-out test set.
Key Capabilities
- Structured Data Extraction: Extracts 7 distinct Linked Art entity types (e.g., PropositionalObject for abstract works, Activity for productions/performances, LinguisticObject for source texts).
- Chain-of-Thought Reasoning: Employs <think> tags to generate dense reasoning traces, explicitly naming tasks, engaging with document structure, and resolving ontological boundaries before outputting JSON-LD.
- French Theatrical Expertise: Handles French theatrical vocabulary, BnF role mapping, and historical typographic conventions.
- Ontology Alignment: Targets the Linked Art Performing Arts extension (v0.9), incorporating BnF role vocabulary, deterministic content-derived IDs, and source attribution for every extracted fact.
- Robust Training: Trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), using a mix of Claude Sonnet and Gemma 12B backreasoning for trace generation.
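The output shape described in this list (a <think> reasoning trace followed by JSON-LD) can be separated with a few lines of Python. This is a minimal sketch, not the model's official post-processing: the function name split_trace_and_jsonld and the sample string are hypothetical, and it assumes the trace is closed by a single </think> tag with the JSON-LD payload following it.

```python
import json
import re

# Assumption: the model emits "<think>...</think>" followed by a JSON-LD payload.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trace_and_jsonld(raw_output: str):
    """Return (reasoning_trace, parsed_json_or_None) for one model output."""
    match = THINK_RE.search(raw_output)
    trace = match.group(1).strip() if match else ""
    # Everything after the closing tag (or the whole string) is treated as JSON.
    payload = raw_output[match.end():] if match else raw_output
    try:
        return trace, json.loads(payload.strip())
    except json.JSONDecodeError:
        # Invalid or missing JSON is reported as None rather than raising.
        return trace, None


# Hypothetical output, for illustration only.
sample = (
    "<think>Task: annotate the production block.</think>\n"
    '{"type": "Activity", "_label": "Example production"}'
)
trace, entity = split_trace_and_jsonld(sample)
print(trace)
print(entity["type"])
```

Returning None for unparseable payloads keeps the caller in control of how to handle the small share of outputs that are not valid JSON.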
Good for
- Digital Humanities Research: Ideal for researchers working with historical French theater archives, particularly those from the Festival d'Avignon.
- Knowledge Graph Construction: Facilitates the creation of structured knowledge graphs for performing arts by converting unstructured programme data into Linked Art JSON-LD.
- Specialized NLP Tasks: Suited to narrow information extraction tasks that require domain knowledge and multi-step reasoning.
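For knowledge graph construction, a common first step is to index extracted entities by their type. The sketch below assumes the model's output has already been parsed into a list of dictionaries that each carry the Linked Art "type" and "id" keys; the sample entities, their URIs, and the function name index_by_type are invented for illustration.

```python
from collections import defaultdict


def index_by_type(entities):
    """Group entity ids by their Linked Art `type` value."""
    index = defaultdict(list)
    for entity in entities:
        # Entities without a type are bucketed under "Unknown" (an assumption).
        index[entity.get("type", "Unknown")].append(entity.get("id"))
    return dict(index)


# Invented entities, for illustration only.
entities = [
    {"id": "urn:example:work-1", "type": "PropositionalObject"},
    {"id": "urn:example:prod-1", "type": "Activity"},
    {"id": "urn:example:prod-2", "type": "Activity"},
]
index = index_by_type(entities)
print(index["Activity"])
```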
Limitations
- Specialized Scope: Primarily trained on Festival d'Avignon programmes; performance on other festivals or non-French theatrical traditions may vary.
- Language Dependency: French-centric in its understanding of text, roles, and conventions.
- Contextual Truncation: Large cast/crew lists may be truncated when a programme approaches the 16k-token context limit.
- Date Inference: Relies on filenames for year inference if not explicitly stated in the programme.
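The filename fallback for dates could look like the sketch below. The actual filename convention is not described here, so the pattern (a standalone four-digit year from 1947 onward) and the example filename are assumptions, and infer_year is a hypothetical helper rather than part of the model's tooling.

```python
import re
from typing import Optional

# Assumption: the year appears as 4 digits not adjacent to other digits.
YEAR_RE = re.compile(r"(?<!\d)(19[4-9]\d|20\d{2})(?!\d)")


def infer_year(filename: str) -> Optional[int]:
    """Return the first plausible festival year found in a filename, or None."""
    for match in YEAR_RE.finditer(filename):
        year = int(match.group(1))
        # The festival began in 1947, so earlier years are skipped.
        if year >= 1947:
            return year
    return None


# Hypothetical filename, for illustration only.
print(infer_year("avignon_1987_programme.md"))
```

Returning None when no year is found makes the missing-date case explicit, so the caller can decide whether to skip the programme or ask for the year.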