🦊 JQL: Judging Quality across Languages

Scalable and lightweight multilingual data filtering with LLM-based annotators

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Mehdi Ali1,2†, Manuel Brack3,5†, Max Lübbering1,2†, Elias Wendt5†, Abbas Goher Khan1†, Richard Rutmann1,2, Alex Jude2, Maurice Kraus5, Alexander Arno Weber1,2, David Kaczér1, Florian Mai1, Lucie Flek1, Rafet Sifa1,2, Nicolas Flores-Herr2, Joachim Köhler1,2, Patrick Schramowski3,4,5, Michael Fromm1,2, Kristian Kersting3,4,5
1Lamarr Institute, 2Fraunhofer IAIS, 3DFKI SAINT, 4Hessian AI, 5Computer Science Department, TU Darmstadt

High-quality multilingual data is crucial for training effective large language models (LLMs). JQL (Judging Quality across Languages) is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.

Overall, JQL improves data quality, retains more tokens than heuristic filters, and generalizes to languages unseen during training. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale.

🧩 Main Pipeline Steps

Figure 1: Overview of the JQL pipeline
  1. 📋 Ground Truth Creation: Human annotators label monolingual documents based on a structured instruction prompt. These documents are translated into all target languages to create a multilingual gold-standard dataset.
  2. 🤖 LLM-as-a-Judge Selection & Data Annotation: Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and the top-performing models are used to produce synthetic annotations at scale (see the judge sketch after this list).
  3. 🪶 Lightweight Annotator Training: Compact regression heads are trained on frozen multilingual embeddings to create efficient, high-throughput annotators (see the training sketch below).
  4. 🚀 Scalable Data Filtering: The trained annotators score large-scale pretraining corpora, and documents are kept or dropped based on quantile thresholds (see the filtering sketch below).
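
For step 2, here is a minimal sketch of scoring a document with an LLM judge via the Hugging Face `transformers` chat pipeline. The model name, rubric wording, and 0–5 scale are illustrative placeholders, not JQL's exact instruction prompt:

```python
# Sketch only: model name, prompt, and 0-5 scale are illustrative assumptions.
from transformers import pipeline

# Hypothetical judge model; JQL evaluates several strong multilingual LLMs.
judge = pipeline("text-generation", model="google/gemma-2-9b-it")

def judge_score(document: str) -> str:
    """Ask the judge model for a single quality score for one document."""
    messages = [
        {"role": "user",
         "content": "Rate the educational quality of the following document "
                    "on a scale from 0 to 5. Answer with a single number.\n\n"
                    + document},
    ]
    # Chat-style pipelines return the full conversation; the last message
    # is the assistant's reply.
    out = judge(messages, max_new_tokens=5)
    return out[0]["generated_text"][-1]["content"].strip()
```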
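
For step 3, here is a minimal training sketch, assuming a sentence-transformers embedding backbone and mean-squared-error regression against the LLM judges' scores. The embedding model, head architecture, and score range are assumptions for illustration, not the exact JQL configuration:

```python
# Sketch only: backbone, head size, and score range are illustrative assumptions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-large")  # hypothetical backbone
for p in embedder.parameters():
    p.requires_grad = False  # backbone stays frozen; only the head is trained

# Compact regression head: embedding -> scalar quality score.
head = nn.Sequential(
    nn.Linear(embedder.get_sentence_embedding_dimension(), 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(documents: list[str], teacher_scores: torch.Tensor) -> float:
    """One gradient step on a batch of (document, LLM-judge score) pairs."""
    with torch.no_grad():  # no gradients through the frozen backbone
        emb = embedder.encode(documents, convert_to_tensor=True)
    pred = head(emb).squeeze(-1)
    loss = loss_fn(pred, teacher_scores.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the backbone is frozen, embeddings can be precomputed once and reused, which is what keeps the annotators lightweight enough for high-throughput annotation.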
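
For step 4, here is a minimal sketch of quantile-based filtering; the 0.7 cutoff is an arbitrary example, since the quantile is a tunable, corpus-specific choice:

```python
# Sketch only: the 0.7 quantile is an example value, not a fixed JQL setting.
import numpy as np

def filter_by_quantile(documents: list[str], scores: list[float],
                       q: float = 0.7) -> list[str]:
    """Keep documents whose annotator score clears the q-quantile threshold."""
    threshold = np.quantile(scores, q)
    return [doc for doc, s in zip(documents, scores) if s >= threshold]
```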

📊 Results

📁 Available Artifacts

📜 Citation

If you use JQL, the annotations, or the pretrained annotators, please cite the paper:

@article{ali2025jql,
  title={Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models},
  author={Ali, Mehdi and Brack, Manuel and Lübbering, Max and Wendt, Elias and Khan, Abbas Goher and Rutmann, Richard and Jude, Alex and Kraus, Maurice and Weber, Alexander Arno and Stollenwerk, Felix and Kaczér, David and Mai, Florian and Flek, Lucie and Sifa, Rafet and Flores-Herr, Nicolas and Köhler, Joachim and Schramowski, Patrick and Fromm, Michael and Kersting, Kristian},
  journal={arXiv preprint arXiv:2505.22232},
  year={2025}
}