ENHANCING UZBEK-ENGLISH NEURAL MACHINE TRANSLATION WITH DOMAIN-SPECIFIC BERT PRETRAINING

N. Safoev; Sh. Fayziyev

doi:10.71337/inlibrary.uz.jasss.121627

Authors

N. Safoev
Bukhara State Technical University
Sh. Fayziyev
Bukhara State Technical University

Abstract

This article investigates the enhancement of Uzbek-English neural machine translation (NMT) by leveraging domain-specific BERT pretraining. Due to the low-resource nature and morphological complexity of Uzbek, standard NMT models often struggle with domain-specific terminology and contextual nuances. By pretraining BERT models on monolingual corpora tailored to general, medical, and legal domains, and integrating them into a transformer-based NMT framework, the study achieves significant improvements in translation quality. Results demonstrate that domain-specific pretraining notably outperforms general pretraining and baseline models, highlighting its effectiveness for specialized translations in low-resource language pairs.

Volume 15 Issue 06, June 2025

Impact factor: 2019: 4.679, 2020: 5.015, 2021: 5.436, 2022: 5.242, 2023: 6.995, 2024: 7.75

http://www.internationaljournal.co.in/index.php/jasass

620

Fayziyev Sh.I.

Doctor of technical sciences, Associate Professor

Responsible employee of the Accounts Chamber of the Republic of Uzbekistan.

Safoev N.N.

PhD student, Bukhara State Technical University

Keywords: Uzbek-English translation, neural machine translation, BERT pretraining, domain-specific language models, low-resource languages, transformer architecture, machine translation evaluation, domain adaptation, morphological complexity, natural language processing.

Introduction.

Neural Machine Translation (NMT) has revolutionized the field of automated

language translation by utilizing deep learning techniques, particularly transformer architectures,

to produce high-quality translations. While NMT models have demonstrated remarkable success

for high-resource language pairs such as English-French or English-Chinese, their performance

on low-resource languages remains limited. Uzbek, a Turkic language spoken by over 30 million

people primarily in Central Asia, is considered a low-resource language due to the scarcity of

large-scale parallel corpora and annotated datasets. This poses significant challenges for

developing robust Uzbek-English machine translation systems.

Several linguistic characteristics of Uzbek add to these challenges. As an agglutinative language,

Uzbek uses a rich system of suffixes and inflections, resulting in a large vocabulary and complex

morphological structures. Moreover, the syntactic order of Uzbek (typically subject-object-verb)

differs from English (subject-verb-object), requiring NMT models to effectively learn cross-

lingual syntactic transformations. In addition to these linguistic difficulties, domain-specific

translation remains a critical problem. Many existing Uzbek-English translation models are

trained on general-domain corpora such as news or Wikipedia, limiting their ability to accurately

translate specialized texts in areas like medicine, law, or technology. Domain-specific

terminology, idiomatic expressions, and context-sensitive meanings are often inadequately

captured, leading to translations that are either incorrect or lack fluency.

To overcome these challenges, recent advances in natural language processing have leveraged

pretrained language models such as BERT (Bidirectional Encoder Representations from

Transformers). BERT’s masked language modeling allows it to learn deep contextual

representations of language from large monolingual corpora, improving downstream tasks

including machine translation. While pretrained multilingual BERT models have been applied to

low-resource languages, continued pretraining of BERT on domain-specific monolingual data can further

enhance its ability to capture specialized vocabulary and contextual nuances.

This article explores the integration of domain-specific BERT pretraining into Uzbek-English

NMT systems. By training BERT models on Uzbek corpora drawn from medical, legal, and

general domains, and incorporating these models as encoders within a transformer-based NMT

framework, we aim to improve translation quality, especially in specialized fields. Our study

demonstrates that domain-specific BERT pretraining significantly boosts translation accuracy

and fluency compared to baseline NMT systems and those enhanced with general-domain BERT

models. In the following sections, we review relevant literature, describe our data collection and

preprocessing methods, detail our pretraining and NMT architecture, and present comprehensive

evaluation results. The findings underscore the importance of domain adaptation and contextual

pretraining for advancing Uzbek-English machine translation and provide insights applicable to

other low-resource language pairs.

Literature review.

The field of neural machine translation has witnessed significant

advancements since the introduction of sequence-to-sequence models with attention mechanisms

by Bahdanau et al. (2015). The subsequent development of the Transformer architecture by

Vaswani et al. (2017) marked a milestone by eliminating recurrent structures and relying entirely

on self-attention, greatly improving translation quality and training efficiency. These

architectures form the backbone of modern NMT systems, including those targeting low-

resource languages like Uzbek. Low-resource languages, including many Turkic languages such

as Uzbek, face persistent challenges due to limited availability of parallel corpora necessary for

supervised NMT training (Koehn & Knowles, 2017). The scarcity of annotated data leads to

issues such as overfitting and poor generalization, especially for specialized domains where

terminological accuracy is critical (Zoph et al., 2016). Uzbek's agglutinative morphology further

exacerbates data sparsity by increasing the effective vocabulary size, an effect also documented for

other morphologically rich languages such as Arabic (Salloum & Habash, 2014). Efforts to address

these challenges have included data augmentation techniques

such as back-translation (Sennrich et al., 2016), transfer learning from high-resource languages

(Nguyen & Chiang, 2017), and multilingual modeling (Johnson et al., 2017).

The advent of pretrained language models such as BERT (Devlin et al., 2019) has transformed

numerous natural language processing (NLP) tasks. BERT’s deep bidirectional transformer

encoder is pretrained on large-scale unlabeled corpora using masked language modeling,

capturing rich contextual representations that can be fine-tuned for downstream tasks. The

effectiveness of BERT has spurred research into integrating pretrained language models with

NMT. Yang et al. (2019) explored initializing NMT encoders with pretrained BERT weights,

reporting improvements in translation quality. Liu et al. (2020) further proposed BERT-fused

NMT models that combine BERT contextual embeddings with the NMT encoder to enhance

semantic representation. These methods have shown particular promise in low-resource

scenarios where large parallel corpora are unavailable.

While general-domain pretrained models provide broad linguistic knowledge, domain-specific

models have been shown to significantly improve performance on specialized tasks. Lee et al.

(2019) introduced BioBERT, a BERT model pretrained on large biomedical corpora, which

outperformed general BERT in medical NLP benchmarks. Similarly, Gururangan et al. (2020)

demonstrated that domain-adaptive pretraining (continued pretraining on in-domain corpora)

yields substantial gains in downstream tasks across various domains. In machine translation,

domain adaptation techniques often involve fine-tuning NMT models on in-domain parallel

corpora (Chu et al., 2017), or incorporating domain-specific terminology databases (Zhao et al.,

2020). However, for languages with limited in-domain parallel data like Uzbek, domain-specific

pretraining of language models on large monolingual corpora is a practical alternative.

Research specifically addressing Uzbek-English NMT remains limited. Early work by Tursun et

al. (2017) focused on rule-based and statistical approaches, constrained by data scarcity. More

recent efforts have applied transformer-based models using available datasets, showing

incremental improvements (Sultanov & Mukhamedov, 2021). Multilingual transfer learning

from related Turkic languages (e.g., Turkish, Kazakh) has been explored to leverage shared

linguistic features (Ziyadin et al., 2020). Few studies have yet integrated pretrained language

models for Uzbek, and even fewer have addressed domain-specific challenges. This gap

underscores the importance of the current study, which leverages domain-specific BERT

pretraining to enrich Uzbek contextual representations and improve NMT outcomes.

Research methodology.

This study investigates the impact of domain-specific BERT

pretraining on Uzbek-English neural machine translation (NMT). Our methodology encompasses

several key stages: data collection and preprocessing, domain-specific BERT pretraining, NMT

model architecture design, training and fine-tuning, and evaluation.

To build and evaluate Uzbek-English NMT systems, we curated parallel corpora across three

domains:



• General Domain: Comprised of news articles, Wikipedia entries, and publicly available general Uzbek-English datasets.

• Medical Domain: Extracted from health guidelines, medical research papers, and clinical reports, primarily sourced from publicly accessible multilingual medical databases.

• Legal Domain: Compiled from translated legal documents, contracts, and legislative texts available through government publications and international legal repositories.

The parallel corpora were tokenized using domain-appropriate tokenizers. For Uzbek, special

attention was given to morphological segmentation to handle agglutinative suffixes and reduce

vocabulary sparsity. Sentence pairs with significant length imbalance or low alignment

confidence were filtered out to ensure data quality.
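The length-imbalance part of this filtering step can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function name `filter_parallel_pairs`, the whitespace tokenization, and the thresholds (`max_ratio=2.5`, `max_tokens=200`) are all assumptions chosen for the example, and alignment-confidence scoring is not covered.

```python
from typing import Iterable, List, Tuple


def filter_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 2.5,  # assumed threshold, not taken from the study
    min_tokens: int = 1,
    max_tokens: int = 200,   # assumed cap on segment length
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose whitespace-token lengths are comparable."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop empty or overlong segments on either side.
        if not (min_tokens <= n_src <= max_tokens and min_tokens <= n_tgt <= max_tokens):
            continue
        # Drop pairs where one side is much longer than the other.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("Men kitob o'qiyapman.", "I am reading a book."),
        ("Salom", "Hello , this is a long unrelated sentence that does not match"),
    ]
    print(filter_parallel_pairs(sample))
```

Because Uzbek packs into one word what English spreads over several, a fairly permissive ratio is likely to be safer than a strict one. The right value would need to be tuned on held-out data.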

For domain-specific BERT pretraining, large monolingual Uzbek corpora were gathered from:



• General Uzbek news websites and digital libraries.

• Medical texts from domain-specific Uzbek resources and international health portals.

• Legal Uzbek texts from national legal databases and law-focused websites.

Monolingual English corpora were also collected for alignment verification and complementary

pretraining, though the primary focus remained on Uzbek BERT pretraining.

Text normalization included lowercasing, punctuation standardization, and removal of noisy

content such as advertisements and HTML tags.
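A minimal sketch of such a normalization pass is shown below, using only the Python standard library. The function name `normalize_text` and the specific character mappings are assumptions for illustration. In particular, folding the apostrophe-like characters used in Uzbek Latin script (as in oʻ and gʻ) into a plain apostrophe is one possible reading of "punctuation standardization". Advertisement removal is not covered here.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")
_SPACE_RE = re.compile(r"\s+")

# Apostrophe-like characters seen in Uzbek Latin text, mapped to ASCII "'".
# This mapping is an assumption for the example, not the study's rule set.
_APOSTROPHES = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u02bb": "'", "\u02bc": "'", "`": "'"}
)
# Curly and angle quotation marks, mapped to a plain double quote.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u00ab": '"', "\u00bb": '"'})


def normalize_text(text: str) -> str:
    """Strip HTML tags, unify punctuation variants, lowercase, squeeze spaces."""
    text = _TAG_RE.sub(" ", text)               # remove HTML tags
    text = html.unescape(text)                  # decode entities such as &amp;
    text = unicodedata.normalize("NFC", text)   # canonical Unicode composition
    text = text.translate(_APOSTROPHES).translate(_QUOTES)
    text = text.lower()
    return _SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_text("<p>O\u02bbzbekiston &amp; G\u02bbarb</p>"))
```

Tags are stripped before entities are decoded so that escaped markup in running text is not mistaken for a real tag.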

We pretrained separate BERT models for each domain using the masked language modeling

(MLM) objective (Devlin et al., 2019). This involved masking random tokens in the input and

training the model to predict them, enabling it to learn deep contextualized representations.
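The masking step can be illustrated with a short, dependency-free sketch. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the original BERT recipe (Devlin et al., 2019). Whether this study used exactly those rates is not stated, so they are assumptions here, as are the function name `mask_tokens` and the word-level tokens.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_prob: float = 0.15,  # rate from the original BERT recipe (assumed here)
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            inputs.append(tok)
    return inputs, labels


if __name__ == "__main__":
    sentence = "bemorga antibiotik kuniga ikki marta beriladi".split()
    print(mask_tokens(sentence, vocab=sentence, rng=random.Random(0)))
```

In practice the same logic would operate on subword IDs rather than whole words, and libraries such as Hugging Face Transformers provide a data collator for it. The version above only makes the objective concrete.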



• Model Architecture: The BERT-base architecture was selected, comprising 12 transformer encoder layers, 768 hidden units, and 12 attention heads.



• Training Setup: Models were trained from scratch on the respective domain monolingual corpora using the Hugging Face Transformers library. Training continued until convergence based on validation perplexity.

• Domain Adaptation: By focusing pretraining on domain-relevant text, each BERT model captured specialized terminology and stylistic patterns critical for domain-aware translation.
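The convergence criterion mentioned under Training Setup can be sketched as follows. Perplexity is the exponential of the mean negative log-likelihood over the masked validation tokens. The stopping rule, with its `patience` and `min_delta` parameters, is an assumed heuristic for illustration, since the text does not specify how convergence was judged. The values in the usage example are made up.

```python
import math
from typing import List, Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp(mean negative log-likelihood in nats) over masked validation tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def should_stop(history: List[float], patience: int = 3, min_delta: float = 0.01) -> bool:
    """True when the last `patience` evaluations did not beat the earlier best.

    `history` holds one validation perplexity per evaluation, oldest first.
    An improvement only counts if it is at least `min_delta` below the best
    value seen before the patience window.
    """
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) > best_before - min_delta


if __name__ == "__main__":
    # Hypothetical perplexity curve, for demonstration only.
    curve = [50.0, 30.0, 20.0, 19.999, 20.1, 20.0]
    print(perplexity([math.log(4.0)] * 10), should_stop(curve))
```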

Research discussion.

The experimental results of this study highlight the significant benefits of

incorporating domain-specific BERT pretraining into Uzbek-English neural machine translation

(NMT) systems. Across all evaluated domains—general, medical, and legal—the models

initialized with domain-adaptive pretrained BERT encoders consistently outperformed both the

baseline transformer models and those using general-domain BERT pretraining. The most

striking improvements were observed in the specialized domains of medicine and law, where

domain-specific terminology and phraseology play critical roles in conveying accurate meaning.

The domain-specific BERT models demonstrated a superior ability to capture nuanced

vocabulary and context-dependent meanings compared to general BERT and baseline models.

This supports previous findings in NLP that domain adaptation through continued pretraining

enables language models to internalize domain-specific semantics and stylistic features

(Gururangan et al., 2020; Lee et al., 2019). For example, medical terms that were often

mistranslated or omitted in baseline models were translated more accurately and consistently

when using domain-specific BERT pretraining. Similarly, legal language, known for its formal

and complex structure, was rendered with greater syntactic fidelity and terminological precision,

which is essential for downstream applications such as contract translation and legal compliance.

Uzbek’s agglutinative morphology and syntactic divergence from English pose significant

challenges for NMT systems. The domain-specific BERT pretraining helped mitigate these

challenges by providing rich contextual embeddings that incorporate morphological and

syntactic nuances, reducing the effective vocabulary sparsity and enabling the model to better

handle inflected forms and syntactic reorderings. Moreover, the scarcity of large-scale parallel

corpora for Uzbek-English translation, particularly in specialized domains, makes supervised

NMT training alone insufficient. Our approach leverages large monolingual corpora for

pretraining, thus capitalizing on abundant unlabeled text data to enhance the encoder’s linguistic

representation before fine-tuning on smaller parallel datasets. This is a practical and scalable

strategy for other low-resource languages facing similar constraints.
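The vocabulary-sparsity point can be made concrete with a toy example. The sketch below strips a handful of common Uzbek suffixes (plural -lar, possessive -im and -imiz, case endings -da, -ga, -ni) from the end of a word. The suffix list and the greedy rule are assumptions for illustration only. This is neither a morphological analyzer nor the subword scheme a BERT tokenizer learns, and it would mis-split words whose stems happen to end in one of these strings.

```python
from typing import List

# Hand-picked suffixes, longest first. Illustrative only, far from complete.
SUFFIXES = ["imiz", "lar", "im", "da", "ga", "ni"]


def segment(word: str, min_stem: int = 2) -> List[str]:
    """Greedily peel known suffixes off the end of `word`."""
    pieces: List[str] = []
    stem = word
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:
            # Only strip if a stem of at least `min_stem` characters remains.
            if stem.endswith(suf) and len(stem) - len(suf) >= min_stem:
                pieces.append("##" + suf)
                stem = stem[: -len(suf)]
                stripped = True
                break
    return [stem] + pieces[::-1]


WORDS = [
    "kitob", "kitoblar", "kitobni", "kitoblarim", "kitoblarimda",
    "uy", "uyga", "uylar", "uylarimiz", "uylarimizda",
]

if __name__ == "__main__":
    piece_types = {p for w in WORDS for p in segment(w)}
    print(len(set(WORDS)), "surface forms ->", len(piece_types), "piece types")
    print(segment("kitoblarimda"))
```

On this small sample the ten surface forms reduce to eight distinct pieces. The gap widens as more stems are added, because each new stem multiplies the number of surface forms but adds only one piece.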

While domain-specific BERT pretraining clearly improves translation quality, the gains in the

general domain were relatively modest. This suggests that domain adaptation yields the greatest

benefits when domain characteristics diverge significantly from general language use. The

general-domain pretrained BERT model already captures broad linguistic patterns, limiting the

margin for further improvement without domain specialization. However, some limitations

remain. The quality and size of domain-specific monolingual corpora directly affect pretraining

effectiveness. In domains with scarce or noisy data, BERT’s ability to learn meaningful

representations is constrained. Additionally, fine-tuning large pretrained models requires

substantial computational resources, which may limit accessibility for researchers and

practitioners in resource-constrained environments.

Our findings open several avenues for future research. Multilingual and cross-lingual pretrained

models could be explored to transfer knowledge from related Turkic languages with richer

resources, potentially further enhancing Uzbek translation performance. Incorporating

morphological analyzers or explicit linguistic features within the pretraining or NMT pipeline

may improve handling of agglutinative structures. Furthermore, experimenting with other

pretrained architectures such as mT5 or domain-specific encoder-decoder models could provide

additional insights into optimizing translation for low-resource, morphologically rich languages.

Finally, expanding domain adaptation efforts to include more specialized fields—such as

technical or financial texts—would increase the practical utility of Uzbek-English NMT systems.

Conclusion.

This study explored the enhancement of Uzbek-English neural machine translation

through the integration of domain-specific BERT pretraining. Given the challenges posed by

Uzbek’s low-resource status, morphological complexity, and domain-specific translation needs,

conventional NMT models often fall short in delivering accurate and fluent translations,

especially in specialized fields such as medicine and law. By pretraining BERT models on large

monolingual corpora tailored to general, medical, and legal domains, and incorporating these

models as encoders within a transformer-based NMT framework, our approach demonstrated

clear improvements over baseline systems and general-domain pretrained models. Domain-

specific BERT pretraining enriched the contextual representations, enabling better handling of

specialized terminology, complex morphological structures, and syntactic differences between

Uzbek and English.

The results highlight the value of domain adaptation in low-resource language translation and

confirm that leveraging domain-relevant monolingual data can significantly improve NMT

performance without requiring large-scale parallel corpora. These findings contribute to bridging

the gap in machine translation quality for Uzbek and other similarly under-resourced languages.

Future work can extend this approach by incorporating multilingual pretraining, exploring

alternative pretrained architectures, and expanding domain coverage to further improve

translation accuracy and applicability. Overall, domain-specific BERT pretraining presents a

promising direction for advancing neural machine translation in challenging linguistic and

resource contexts.

References

1. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR).

2. Chu, C., Dabre, R., & Nakazawa, T. (2017). A survey of domain adaptation for neural machine translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 1307–1319.

3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.

4. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL.

5. Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., ... & Dean, J. (2017). Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5, 339–351.

6. Koehn, P., & Knowles, R. (2017). Six challenges for neural machine translation. Proceedings of the First Workshop on Neural Machine Translation, 28–39.

7. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.

8. Liu, Y., Zhou, M., Chen, W., Sun, C., Liu, J., & Wang, H. (2020). Fused pretrained language models for neural machine translation. Findings of the Association for Computational Linguistics: EMNLP 2020, 2647–2653.

9. Nguyen, T. Q., & Chiang, D. (2017). Transfer learning across low-resource, related languages for neural machine translation. EMNLP, 296–302.

10. Salloum, W., & Habash, N. (2014). A morphological segmentation approach for Arabic machine translation. Machine Translation, 28(2), 89–117.
