Scalable and lightweight multilingual data filtering with LLM-based annotators
High-quality multilingual data is crucial for training effective large language models (LLMs). JQL (Judging Quality Across Languages) is a scalable and lightweight multilingual data-filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.
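To make the distillation idea concrete, here is a minimal sketch: a strong teacher LLM assigns quality scores to documents, and a small regression head on top of a frozen multilingual embedding model is trained to reproduce those scores. The encoder choice, scoring scale, head architecture, and training details below are illustrative assumptions, not JQL's exact setup.

```python
# Illustrative sketch only: distill teacher-LLM quality judgments into a
# lightweight annotator (frozen multilingual encoder + small regression head).
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Frozen multilingual encoder (hypothetical choice; JQL's backbone may differ).
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

class QualityHead(nn.Module):
    """Lightweight annotator: maps a document embedding to a scalar quality score."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)

# Toy training pairs: documents with teacher-LLM quality scores (fabricated here,
# assuming a 0-5 educational-value scale).
docs = ["Ein gut strukturierter Lehrtext über Photosynthese.", "click here!!! free $$$"]
teacher_scores = torch.tensor([4.5, 0.5])

# Embeddings come from the frozen encoder; only the head is trained.
embs = torch.tensor(encoder.encode(docs))
head = QualityHead(embs.shape[-1])
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for _ in range(100):  # regress the head onto the teacher's judgments
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(embs), teacher_scores)
    loss.backward()
    opt.step()
```

Because the encoder is multilingual and stays frozen, the same trained head can score documents in languages the teacher never annotated, which is what makes the annotators cross-lingual and cheap to run at corpus scale.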
Overall, JQL improves data quality, retains more tokens than heuristic filtering baselines, and generalizes to languages unseen during training, enabling cost-efficient multilingual pretraining data curation at scale.
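Continuing the sketch above, filtering then reduces to scoring each document with the trained head and keeping those above a cut-off. The threshold value and keep-above policy here are assumptions for illustration, not the paper's calibration.

```python
# Hedged usage sketch (reuses `encoder` and `head` from the previous block):
# score a corpus cheaply and keep only documents above a quality threshold.
corpus = ["A well-written encyclopedia article.", "spam spam spam buy now"]
with torch.no_grad():
    scores = head(torch.tensor(encoder.encode(corpus)))

threshold = 3.0  # hypothetical cut-off on the assumed 0-5 quality scale
kept = [doc for doc, s in zip(corpus, scores.tolist()) if s >= threshold]
```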
If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
@article{ali2025jql,
  title={Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models},
  author={Ali, Mehdi and Brack, Manuel and Lübbering, Max and Wendt, Elias and Khan, Abbas Goher and Rutmann, Richard and Jude, Alex and Kraus, Maurice and Weber, Alexander Arno and Stollenwerk, Felix and Kaczér, David and Mai, Florian and Flek, Lucie and Sifa, Rafet and Flores-Herr, Nicolas and Köhler, Joachim and Schramowski, Patrick and Fromm, Michael and Kersting, Kristian},
  journal={arXiv preprint arXiv:2505.22232},
  year={2025}
}