MORPHOLOGICALITY AND CORPUS-BASED TAGGING MODELS IN TURKIC LANGUAGES: A PROJECT FOR THE CREATION OF A CORPUS FOR THE KARAKALPAK LANGUAGE
DOI:
https://doi.org/10.5281/zenodo.16757516
Keywords:
Karakalpak language; morphological analysis; corpus linguistics; Turkic languages; agglutinative languages; morphological tagging; low-resource languages; NLP; transfer learning; morphological tags.
Abstract
As linguistics undergoes digital transformation and natural language processing (NLP) technologies advance rapidly, developing morphological resources for low-resource languages has become a crucial task in applied linguistics. This study explores how morphological corpora and tagging models can be designed for Karakalpak, an agglutinative Turkic language that remains underrepresented in digital linguistic repositories.