Scalable and lightweight multilingual data filtering with LLM-based annotators
High-quality multilingual data is crucial for training effective large language models (LLMs). JQL (Judging Quality Across Languages) is a scalable and lightweight multilingual data-filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.
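To make the distillation idea concrete, here is a minimal sketch: a strong teacher LLM assigns quality scores to documents, and a small regression head on top of a frozen multilingual embedding model is trained to reproduce those scores. The encoder choice, scoring scale, head architecture, and training details below are illustrative assumptions, not JQL's exact setup.

```python
# Illustrative sketch only: distill teacher-LLM quality judgments into a
# lightweight annotator (frozen multilingual encoder + small regression head).
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Frozen multilingual encoder (hypothetical choice; JQL's backbone may differ).
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

class QualityHead(nn.Module):
    """Lightweight annotator: maps a document embedding to a scalar quality score."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)

# Toy training pairs: documents with teacher-LLM quality scores (fabricated here,
# assuming a 0-5 educational-value scale).
docs = ["Ein gut strukturierter Lehrtext über Photosynthese.", "click here!!! free $$$"]
teacher_scores = torch.tensor([4.5, 0.5])

# Embeddings come from the frozen encoder; only the head is trained.
embs = torch.tensor(encoder.encode(docs))
head = QualityHead(embs.shape[-1])
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

for _ in range(100):  # regress the head onto the teacher's judgments
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(embs), teacher_scores)
    loss.backward()
    opt.step()
```

Because the encoder is multilingual and stays frozen, the same trained head can score documents in languages the teacher never annotated, which is what makes the annotators cross-lingual and cheap to run at corpus scale.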
Overall, JQL improves data quality, retains more tokens than heuristic filtering baselines, and generalizes to languages unseen during training, enabling cost-efficient multilingual pretraining data curation at scale.
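Continuing the sketch above, filtering then reduces to scoring each document with the trained head and keeping those above a cut-off. The threshold value and keep-above policy here are assumptions for illustration, not the paper's calibration.

```python
# Hedged usage sketch (reuses `encoder` and `head` from the previous block):
# score a corpus cheaply and keep only documents above a quality threshold.
corpus = ["A well-written encyclopedia article.", "spam spam spam buy now"]
with torch.no_grad():
    scores = head(torch.tensor(encoder.encode(corpus)))

threshold = 3.0  # hypothetical cut-off on the assumed 0-5 quality scale
kept = [doc for doc, s in zip(corpus, scores.tolist()) if s >= threshold]
```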
If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
@article{ali2025jql,
  title={Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models},
  author={Ali, Mehdi and Brack, Manuel and Lübbering, Max and Wendt, Elias and Khan, Abbas Goher and Rutmann, Richard and Jude, Alex and Kraus, Maurice and Weber, Alexander Arno and Stollenwerk, Felix and Kaczér, David and Mai, Florian and Flek, Lucie and Sifa, Rafet and Flores-Herr, Nicolas and Köhler, Joachim and Schramowski, Patrick and Fromm, Michael and Kersting, Kristian},
  journal={arXiv preprint arXiv:2505.22232},
  year={2025}
}