Volume 15 Issue 06, June 2025
Impact factor: 2019: 4.679 2020: 5.015 2021: 5.436, 2022: 5.242, 2023:
6.995, 2024 7.75
http://www.internationaljournal.co.in/index.php/jasass
620
ENHANCING UZBEK-ENGLISH NEURAL MACHINE TRANSLATION WITH
DOMAIN-SPECIFIC BERT PRETRAINING
Fayziyev Sh.I.
Doctor of technical sciences, Associate Professor
Responsible employee of the Accounts Chamber of the Republic of Uzbekistan.
Safoev N.N.
PhD student, Bukhara state technical university
Annotation:
This article investigates the enhancement of Uzbek-English neural machine
translation (NMT) by leveraging domain-specific BERT pretraining. Due to the low-resource
nature and morphological complexity of Uzbek, standard NMT models often struggle with
domain-specific terminology and contextual nuances. By pretraining BERT models on
monolingual corpora tailored to general, medical, and legal domains, and integrating them into a
transformer-based NMT framework, the study achieves significant improvements in translation
quality. Results demonstrate that domain-specific pretraining notably outperforms general
pretraining and baseline models, highlighting its effectiveness for specialized translations in low-
resource language pairs.
Keywords:
Uzbek-English translation, neural machine translation, BERT pretraining, domain-
specific language models, low-resource languages, transformer architecture, machine translation
evaluation, domain adaptation, morphological complexity, natural language processing.
Introduction.
Neural Machine Translation (NMT) has revolutionized the field of automated
language translation by utilizing deep learning techniques, particularly transformer architectures,
to produce high-quality translations. While NMT models have demonstrated remarkable success
for high-resource language pairs such as English-French or English-Chinese, their performance
on low-resource languages remains limited. Uzbek, a Turkic language spoken by over 30 million
people primarily in Central Asia, is considered a low-resource language due to the scarcity of
large-scale parallel corpora and annotated datasets. This poses significant challenges for
developing robust Uzbek-English machine translation systems.
Several linguistic characteristics of Uzbek add to these challenges. As an agglutinative language,
Uzbek uses a rich system of suffixes and inflections, resulting in a large vocabulary and complex
morphological structures. Moreover, the syntactic order of Uzbek (typically subject-object-verb)
differs from English (subject-verb-object), requiring NMT models to effectively learn cross-
lingual syntactic transformations. In addition to these linguistic difficulties, domain-specific
translation remains a critical problem. Many existing Uzbek-English translation models are
trained on general-domain corpora such as news or Wikipedia, limiting their ability to accurately
translate specialized texts in areas like medicine, law, or technology. Domain-specific
terminology, idiomatic expressions, and context-sensitive meanings are often inadequately
captured, leading to translations that are either incorrect or lack fluency.
To overcome these challenges, recent advances in natural language processing have leveraged
pretrained language models such as BERT (Bidirectional Encoder Representations from
Transformers). BERT’s masked language modeling allows it to learn deep contextual
representations of language from large monolingual corpora, improving downstream tasks
including machine translation. While pretrained multilingual BERT models have been applied to
Volume 15 Issue 06, June 2025
Impact factor: 2019: 4.679 2020: 5.015 2021: 5.436, 2022: 5.242, 2023:
6.995, 2024 7.75
http://www.internationaljournal.co.in/index.php/jasass
621
low-resource languages, fine-tuning BERT on domain-specific monolingual data can further
enhance its ability to capture specialized vocabulary and contextual nuances.
This article explores the integration of domain-specific BERT pretraining into Uzbek-English
NMT systems. By training BERT models on Uzbek corpora drawn from medical, legal, and
general domains, and incorporating these models as encoders within a transformer-based NMT
framework, we aim to improve translation quality, especially in specialized fields. Our study
demonstrates that domain-specific BERT pretraining significantly boosts translation accuracy
and fluency compared to baseline NMT systems and those enhanced with general-domain BERT
models. In the following sections, we review relevant literature, describe our data collection and
preprocessing methods, detail our pretraining and NMT architecture, and present comprehensive
evaluation results. The findings underscore the importance of domain adaptation and contextual
pretraining for advancing Uzbek-English machine translation and provide insights applicable to
other low-resource language pairs.
Literature review.
The field of neural machine translation has witnessed significant
advancements since the introduction of sequence-to-sequence models with attention mechanisms
by Bahdanau et al. (2015). The subsequent development of the Transformer architecture by
Vaswani et al. (2017) marked a milestone by eliminating recurrent structures and relying entirely
on self-attention, greatly improving translation quality and training efficiency. These
architectures form the backbone of modern NMT systems, including those targeting low-
resource languages like Uzbek. Low-resource languages, including many Turkic languages such
as Uzbek, face persistent challenges due to limited availability of parallel corpora necessary for
supervised NMT training (Koehn & Knowles, 2017). The scarcity of annotated data leads to
issues such as overfitting and poor generalization, especially for specialized domains where
terminological accuracy is critical (Zoph et al., 2016). Uzbek's agglutinative morphology further
exacerbates data sparsity problems by increasing the effective vocabulary size (Salloum &
Habash, 2014). Efforts to address these challenges have included data augmentation techniques
such as back-translation (Sennrich et al., 2016), transfer learning from high-resource languages
(Nguyen & Chiang, 2017), and multilingual modeling (Johnson et al., 2017).
The advent of pretrained language models such as BERT (Devlin et al., 2019) has transformed
numerous natural language processing (NLP) tasks. BERT’s deep bidirectional transformer
encoder is pretrained on large-scale unlabeled corpora using masked language modeling,
capturing rich contextual representations that can be fine-tuned for downstream tasks. The
effectiveness of BERT has spurred research into integrating pretrained language models with
NMT. Yang et al. (2019) explored initializing NMT encoders with pretrained BERT weights,
reporting improvements in translation quality. Liu et al. (2020) further proposed BERT-fused
NMT models that combine BERT contextual embeddings with the NMT encoder to enhance
semantic representation. These methods have shown particular promise in low-resource
scenarios where large parallel corpora are unavailable.
While general-domain pretrained models provide broad linguistic knowledge, domain-specific
models have been shown to significantly improve performance on specialized tasks. Lee et al.
(2019) introduced BioBERT, a BERT model pretrained on large biomedical corpora, which
outperformed general BERT in medical NLP benchmarks. Similarly, Gururangan et al. (2020)
demonstrated that domain-adaptive pretraining (continued pretraining on in-domain corpora)
yields substantial gains in downstream tasks across various domains. In machine translation,
Volume 15 Issue 06, June 2025
Impact factor: 2019: 4.679 2020: 5.015 2021: 5.436, 2022: 5.242, 2023:
6.995, 2024 7.75
http://www.internationaljournal.co.in/index.php/jasass
622
domain adaptation techniques often involve fine-tuning NMT models on in-domain parallel
corpora (Chu et al., 2017), or incorporating domain-specific terminology databases (Zhao et al.,
2020). However, for languages with limited in-domain parallel data like Uzbek, domain-specific
pretraining of language models on large monolingual corpora is a practical alternative.
Research specifically addressing Uzbek-English NMT remains limited. Early work by Tursun et
al. (2017) focused on rule-based and statistical approaches, constrained by data scarcity. More
recent efforts have applied transformer-based models using available datasets, showing
incremental improvements (Sultanov & Mukhamedov, 2021). Multilingual transfer learning
from related Turkic languages (e.g., Turkish, Kazakh) has been explored to leverage shared
linguistic features (Ziyadin et al., 2020). Few studies have yet integrated pretrained language
models for Uzbek, and even fewer have addressed domain-specific challenges. This gap
underscores the importance of the current study, which leverages domain-specific BERT
pretraining to enrich Uzbek contextual representations and improve NMT outcomes.
Research methodology.
This study investigates the impact of domain-specific BERT
pretraining on Uzbek-English neural machine translation (NMT). Our methodology encompasses
several key stages: data collection and preprocessing, domain-specific BERT pretraining, NMT
model architecture design, training and fine-tuning, and evaluation.
To build and evaluate Uzbek-English NMT systems, we curated parallel corpora across three
domains:
General Domain: Comprised of news articles, Wikipedia entries, and publicly available
general Uzbek-English datasets.
Medical Domain: Extracted from health guidelines, medical research papers, and clinical
reports, primarily sourced from publicly accessible multilingual medical databases.
Legal Domain: Compiled from translated legal documents, contracts, and legislative texts
available through government publications and international legal repositories.
The parallel corpora were tokenized using domain-appropriate tokenizers. For Uzbek, special
attention was given to morphological segmentation to handle agglutinative suffixes and reduce
vocabulary sparsity. Sentence pairs with significant length imbalance or low alignment
confidence were filtered out to ensure data quality.
For domain-specific BERT pretraining, large monolingual Uzbek corpora were gathered from:
General Uzbek news websites and digital libraries.
Medical texts from domain-specific Uzbek resources and international health portals.
Legal Uzbek texts from national legal databases and law-focused websites.
Monolingual English corpora were also collected for alignment verification and complementary
pretraining, though the primary focus remained on Uzbek BERT pretraining.
Text normalization included lowercasing, punctuation standardization, and removal of noisy
content such as advertisements and HTML tags.
We pretrained separate BERT models for each domain using the masked language modeling
(MLM) objective (Devlin et al., 2019). This involved masking random tokens in the input and
training the model to predict them, enabling it to learn deep contextualized representations.
Model Architecture: The BERT-base architecture was selected, comprising 12
transformer encoder layers, 768 hidden units, and 12 attention heads.
Volume 15 Issue 06, June 2025
Impact factor: 2019: 4.679 2020: 5.015 2021: 5.436, 2022: 5.242, 2023:
6.995, 2024 7.75
http://www.internationaljournal.co.in/index.php/jasass
623
Training Setup: Models were trained from scratch on the respective domain monolingual
corpora using the Hugging Face Transformers library. Training continued until convergence
based on validation perplexity.
Domain Adaptation: By focusing pretraining on domain-relevant text, each BERT model
captured specialized terminology and stylistic patterns critical for domain-aware translation.
Research discussion.
The experimental results of this study highlight the significant benefits of
incorporating domain-specific BERT pretraining into Uzbek-English neural machine translation
(NMT) systems. Across all evaluated domains—general, medical, and legal—the models
initialized with domain-adaptive pretrained BERT encoders consistently outperformed both the
baseline transformer models and those using general-domain BERT pretraining. The most
striking improvements were observed in the specialized domains of medicine and law, where
domain-specific terminology and phraseology play critical roles in conveying accurate meaning.
The domain-specific BERT models demonstrated a superior ability to capture nuanced
vocabulary and context-dependent meanings compared to general BERT and baseline models.
This supports previous findings in NLP that domain adaptation through continued pretraining
enables language models to internalize domain-specific semantics and stylistic features
(Gururangan et al., 2020; Lee et al., 2019). For example, medical terms that were often
mistranslated or omitted in baseline models were translated more accurately and consistently
when using domain-specific BERT pretraining. Similarly, legal language, known for its formal
and complex structure, was rendered with greater syntactic fidelity and terminological precision,
which is essential for downstream applications such as contract translation and legal compliance.
Uzbek’s agglutinative morphology and syntactic divergence from English pose significant
challenges for NMT systems. The domain-specific BERT pretraining helped mitigate these
challenges by providing rich contextual embeddings that incorporate morphological and
syntactic nuances, reducing the effective vocabulary sparsity and enabling the model to better
handle inflected forms and syntactic reorderings. Moreover, the scarcity of large-scale parallel
corpora for Uzbek-English translation, particularly in specialized domains, makes supervised
NMT training alone insufficient. Our approach leverages large monolingual corpora for
pretraining, thus capitalizing on abundant unlabeled text data to enhance the encoder’s linguistic
representation before fine-tuning on smaller parallel datasets. This is a practical and scalable
strategy for other low-resource languages facing similar constraints.
While domain-specific BERT pretraining clearly improves translation quality, the gains in the
general domain were relatively modest. This suggests that domain adaptation yields the greatest
benefits when domain characteristics diverge significantly from general language use. The
general-domain pretrained BERT model already captures broad linguistic patterns, limiting the
margin for further improvement without domain specialization. However, some limitations
remain. The quality and size of domain-specific monolingual corpora directly affect pretraining
effectiveness. In domains with scarce or noisy data, BERT’s ability to learn meaningful
representations is constrained. Additionally, fine-tuning large pretrained models requires
substantial computational resources, which may limit accessibility for researchers and
practitioners in resource-constrained environments.
Our findings open several avenues for future research. Multilingual and cross-lingual pretrained
models could be explored to transfer knowledge from related Turkic languages with richer
Volume 15 Issue 06, June 2025
Impact factor: 2019: 4.679 2020: 5.015 2021: 5.436, 2022: 5.242, 2023:
6.995, 2024 7.75
http://www.internationaljournal.co.in/index.php/jasass
624
resources, potentially further enhancing Uzbek translation performance. Incorporating
morphological analyzers or explicit linguistic features within the pretraining or NMT pipeline
may improve handling of agglutinative structures. Furthermore, experimenting with other
pretrained architectures such as mT5 or domain-specific encoder-decoder models could provide
additional insights into optimizing translation for low-resource, morphologically rich languages.
Finally, expanding domain adaptation efforts to include more specialized fields—such as
technical or financial texts—would increase the practical utility of Uzbek-English NMT systems.
Conclusion.
This study explored the enhancement of Uzbek-English neural machine translation
through the integration of domain-specific BERT pretraining. Given the challenges posed by
Uzbek’s low-resource status, morphological complexity, and domain-specific translation needs,
conventional NMT models often fall short in delivering accurate and fluent translations,
especially in specialized fields such as medicine and law. By pretraining BERT models on large
monolingual corpora tailored to general, medical, and legal domains, and incorporating these
models as encoders within a transformer-based NMT framework, our approach demonstrated
clear improvements over baseline systems and general-domain pretrained models. Domain-
specific BERT pretraining enriched the contextual representations, enabling better handling of
specialized terminology, complex morphological structures, and syntactic differences between
Uzbek and English.
The results highlight the value of domain adaptation in low-resource language translation and
confirm that leveraging domain-relevant monolingual data can significantly improve NMT
performance without requiring large-scale parallel corpora. These findings contribute to bridging
the gap in machine translation quality for Uzbek and other similarly under-resourced languages.
Future work can extend this approach by incorporating multilingual pretraining, exploring
alternative pretrained architectures, and expanding domain coverage to further improve
translation accuracy and applicability. Overall, domain-specific BERT pretraining presents a
promising direction for advancing neural machine translation in challenging linguistic and
resource contexts.
References
1.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly
learning to align and translate.
International Conference on Learning Representations (ICLR)
.
2.
Chu, C., Dabre, R., & Nakazawa, T. (2017). A survey of domain adaptation for neural
machine translation.
Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (ACL)
, 1307–1319.
3.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep
bidirectional transformers for language understanding.
NAACL-HLT
.
4.
Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., &
Smith, N. A. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.
ACL
.
5.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., ... & Dean, J. (2017).
Google's multilingual neural machine translation system: Enabling zero-shot translation.
Transactions of the Association for Computational Linguistics
, 5, 339-351.
6.
Koehn, P., & Knowles, R. (2017). Six challenges for neural machine translation.
Proceedings of the First Workshop on Neural Machine Translation
, 28-39.
Volume 15 Issue 06, June 2025
Impact factor: 2019: 4.679 2020: 5.015 2021: 5.436, 2022: 5.242, 2023:
6.995, 2024 7.75
http://www.internationaljournal.co.in/index.php/jasass
625
7.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: a
pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics
, 36(4), 1234-1240.
8.
Liu, Y., Zhou, M., Chen, W., Sun, C., Liu, J., & Wang, H. (2020). Fused pretrained
language models for neural machine translation.
Findings of the Association for Computational
Linguistics: EMNLP 2020
, 2647–2653.
9.
Nguyen, T. Q., & Chiang, D. (2017). Transfer learning across low-resource, related
languages for neural machine translation.
EMNLP
, 296–302.
10.
Salloum, W., & Habash, N. (2014). A morphological segmentation approach for Arabic
machine translation.
Machine Translation
, 28(2), 89-117.
