Хорижий филология
№3, 2018 йил
MORPHOLOGICAL ANALYSIS BY FINITE STATE TRANSDUCER FOR
UZBEK-ENGLISH MACHINE TRANSLATION
Abdurakhmonova Nilufar,
Tashkent State University of Uzbek Language and Literature named after Alisher Navoi
Tuliyev Ulugbek,
National University of Uzbekistan named after Mirzo Ulugbek
Key words: morphological rules, morphophonological rules, automatic morphological analyser, machine translation.
I. Introduction
Machine translation is a process of interaction between human and computer. It depends not only on computational technology but also on an interdisciplinary set of sciences concerned with understanding text. Therefore, when translating between English and Uzbek, the different structures and peculiarities of the two languages make it necessary to study morphological aspects before the translation stage.
Over the last 30 years, numerous studies have been carried out to create technologies for computational morphology. Work on morphological analyzers for Turkic languages began in the 1960s [1]. A morphological analyzer is necessary for machine translation in order to divide words into their components and to identify the grammatical paradigms of the target language.
Uzbek is an agglutinative language, whereas English is an inflectional one. Both languages therefore contain many morphemes. A morpheme is the smallest meaningful unit of a lexeme. Morphemes are of two kinds, stems and affixes. The stem gives the main sense of the lexeme, and the affix adds grammatical or semantic meaning to the word. There are many ways to combine morphemes to create words. Four of these methods are common and play important roles in speech and language processing: inflection, derivation, compounding, and cliticization [2]. In Uzbek the number of possible inflectional affixes is considerably larger than in non-Turkic languages, because nearly all parts of speech can appear in inflected form in context:
Noun: bola+jon+lar+im+dagi+lar+niki+mas+mi+kan+a;
Simple verb: o‘qi+t+tir+ma+yot+gan+lig+i+ni;
Compound verb: mashq qil+dir+ish+ayot+gan+lar;
Verbal compound: ber+dir+tir+ib yubor+ma+yot+gan+dan+mi+kan+a, and so on.
I.I. Morphotactic possibilities in the Uzbek language
Morphotactics plays a central role in morphological parsing. After morphological parsing, the components of the text are analyzed semantically. Consequently, all legal and illegal positions of morphemes have to be taken into account. In Uzbek, the morphotactic order of a word is: (1) prefix + (2) root + (3) derivative affix + (4) lexical affix + (5) grammatical affix, for example (1)ham(2)qishloq(3)lik(4)lar(5)imiz(5)dan. In English the order is: (1) prefix + (2) root + (3) lexical suffix + (4) grammatical suffix, for example (1)co(2)work(3)er(4)s. Although the two models resemble each other, Uzbek grammatical affixes often correspond to prepositions and adverbs in English.
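The slot order described above can be written down as a simple ordering check. The following is a minimal Python sketch, not the authors' implementation: the table `UZBEK_SLOTS`, the slot names and the helper `is_legal_order` are illustrative assumptions, and the only data used is the example word from the text.

```python
# Slot numbers follow the order given in the text:
# (1) prefix, (2) root, (3) derivative affix, (4) lexical affix, (5) grammatical affix.
UZBEK_SLOTS = {
    "prefix": 1,
    "root": 2,
    "derivative": 3,
    "lexical": 4,
    "grammatical": 5,
}

def is_legal_order(tagged_morphemes):
    """Return True if slot numbers never decrease and exactly one root occurs."""
    slots = [UZBEK_SLOTS[slot] for _, slot in tagged_morphemes]
    has_one_root = slots.count(UZBEK_SLOTS["root"]) == 1
    non_decreasing = all(a <= b for a, b in zip(slots, slots[1:]))
    return has_one_root and non_decreasing

# The example from the text: ham+qishloq+lik+lar+imiz+dan
hamqishloq = [
    ("ham", "prefix"), ("qishloq", "root"), ("lik", "derivative"),
    ("lar", "lexical"), ("imiz", "grammatical"), ("dan", "grammatical"),
]

print(is_legal_order(hamqishloq))        # the order from the text is accepted
print(is_legal_order(hamqishloq[::-1]))  # the reversed order is rejected
```

Several grammatical affixes may follow one another (here imiz and dan), which is why the check only requires that slot numbers do not decrease.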
Morphological recognition is one of the main sub-problems of machine translation for Turkic languages. It relies on a morphological dictionary, that is, a database in which linguistic information can be stored.
Sometimes identifying the morphotactic model of a word is problematic, because morphemes can be compounded: yog‘ingarchilik, zargarchilik, paxtachilik. In the first word the affix cannot be broken into parts (yog‘in+garchilik), because there is no word yog‘in+gar. In the second word, zar+gar exists on its own as the name of a profession, so +chilik is attached separately (zar+gar+chilik). The third word is segmented as paxta+chi+lik. As a result, there are three forms of these morphemes: garchilik, gar+chilik, chi+lik.
Therefore we take the length of the string into account when identifying a morpheme in Uzbek. We assume that the longest morpheme has nine letters, as in g+a+r+c+h+i+l+i+k. The linguistic database of Uzbek serves as input to the software for morphological parsing.
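The ambiguity between garchilik, gar+chilik and chi+lik can be made concrete by enumerating every split that a lexicon allows. The sketch below is only an illustration under stated assumptions: `STEMS` and `AFFIXES` are toy lists built from the examples in the text, `MAX_MORPHEME_LEN` encodes the nine-letter bound assumed above, and the function names are hypothetical.

```python
MAX_MORPHEME_LEN = 9  # length of "garchilik", the longest morpheme assumed in the text

STEMS = {"zar", "paxta"}                                # toy stem list (assumption)
AFFIXES = {"garchilik", "gar", "chilik", "chi", "lik"}  # affixes discussed in the text

def split_affixes(rest):
    """Return every way to split `rest` into known affixes of bounded length."""
    if not rest:
        return [[]]
    results = []
    for n in range(1, min(MAX_MORPHEME_LEN, len(rest)) + 1):
        head = rest[:n]
        if head in AFFIXES:
            for tail in split_affixes(rest[n:]):
                results.append([head] + tail)
    return results

def segmentations(word):
    """Return all stem+affix analyses licensed by the toy lexicon."""
    analyses = []
    for n in range(1, len(word) + 1):
        if word[:n] in STEMS:
            for affixes in split_affixes(word[n:]):
                analyses.append([word[:n]] + affixes)
    return analyses

for analysis in segmentations("zargarchilik"):
    print("+".join(analysis))
```

The sketch deliberately overgenerates: it returns three candidate analyses for zargarchilik, and it is the morphological dictionary that has to decide which of them is valid for a given stem.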
Additionally, orthographic rules play an important role in morphological analysis for all agglutinative languages, because the many phonetic changes that occur in words usually require a large number of rules. When certain possessive suffixes are attached, the first vowel counting from right to left (the last vowel of the stem) is deleted, as the following examples show:
burun+im => burnim (u deleted)
shahar+im => shahrim (a deleted)
Another possibility is the epenthesis of a segment under certain phonological conditions. Take, for example, the possessive or the dative case in Uzbek:
obro‘+im => obro‘yim (my reputation);
u+ga => unga (he => him)
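The deletion and epenthesis examples above can be sketched as small rewrite rules that are triggered by lexically listed stems. This is a hedged illustration rather than the authors' rule set: the stem lists contain only the four examples from the text, and the function names `attach_possessive_1sg` and `attach_dative` are hypothetical.

```python
VOWELS = set("aeiou")

# Lexically listed stems (only the examples from the text); a real system
# would read such lists from the morphological dictionary.
VOWEL_DROPPING_STEMS = {"burun", "shahar"}
Y_EPENTHESIS_STEMS = {"obro‘"}
N_EPENTHESIS_STEMS = {"u"}

def attach_possessive_1sg(stem):
    """Attach -im, applying vowel deletion or y-epenthesis where listed."""
    if stem in VOWEL_DROPPING_STEMS:
        # Delete the last vowel of the stem: burun -> burn, shahar -> shahr.
        last_vowel = max(i for i, ch in enumerate(stem) if ch in VOWELS)
        stem = stem[:last_vowel] + stem[last_vowel + 1:]
        return stem + "im"
    if stem in Y_EPENTHESIS_STEMS:
        return stem + "y" + "im"   # obro‘ + im -> obro‘yim
    return stem + "im"             # default: plain concatenation (consonant-final stems)

def attach_dative(stem):
    """Attach -ga, inserting n after the listed pronoun stem (u -> unga)."""
    if stem in N_EPENTHESIS_STEMS:
        return stem + "n" + "ga"
    return stem + "ga"

for form in (attach_possessive_1sg("burun"), attach_possessive_1sg("shahar"),
             attach_possessive_1sg("obro‘"), attach_dative("u")):
    print(form)
```

Listing the affected stems explicitly keeps the rules from applying to words that do not undergo the change; an analyzer would run the same rules in reverse to recover the stem from the surface form.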
Word error rate (WER) is the sum of insertions, deletions, and substitutions normalized by the length of the reference sentence. A slight variant (WERg) normalizes this value by the length of the Levenshtein path, i.e., the sum of insertions, deletions, substitutions, and matches: this ensures that the measure is between zero (when the produced sentence is identical to the reference) and one (when the candidate must be entirely deleted, and all words in the reference must be inserted) [3].
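The two measures defined above can be computed from a single Levenshtein alignment. The following Python sketch is one possible implementation under simple assumptions: sentences are already tokenized into word lists, all edit operations cost 1, the reference is non-empty, and the function names `edit_counts`, `wer` and `werg` are chosen here for illustration.

```python
def edit_counts(reference, candidate):
    """Return (substitutions, deletions, insertions, matches) along one
    optimal Levenshtein path that turns `candidate` into `reference`."""
    n, m = len(reference), len(candidate)
    # cost[i][j] = edit distance between reference[:i] and candidate[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if reference[i - 1] == candidate[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + diff,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Walk back from the end of both sentences and count the operations used.
    subs = dels = ins = matches = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = 0 if reference[i - 1] == candidate[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                if diff:
                    subs += 1
                else:
                    matches += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ins += 1   # a reference word has to be inserted
            i -= 1
        else:
            dels += 1  # a candidate word has to be deleted
            j -= 1
    return subs, dels, ins, matches

def wer(reference, candidate):
    """Edits normalized by the length of the (non-empty) reference."""
    subs, dels, ins, _ = edit_counts(reference, candidate)
    return (subs + dels + ins) / len(reference)

def werg(reference, candidate):
    """Edits normalized by the length of the Levenshtein path."""
    subs, dels, ins, matches = edit_counts(reference, candidate)
    path = subs + dels + ins + matches
    return (subs + dels + ins) / path if path else 0.0

ref = "the cat sat on the mat".split()
hyp = "the cat sat on the the mat".split()
print(round(wer(ref, hyp), 3), round(werg(ref, hyp), 3))  # 0.167 0.143
```

In the usage example one extra word has to be deleted, so WER is 1/6 while WERg is 1/7, because the path also contains the six matches.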
In a parser, morphological analysis of words is an important prerequisite for syntactic analysis. The properties of a word that the parser needs to know are its part-of-speech category and the morphosyntactic information encoded in the particular word form. Another important task is lemmatization, i.e. finding the corresponding dictionary form for a given input word, because for many applications a lemma lexicon is used to provide more detailed syntactic (e.g., valency) and semantic information for deep analysis.
The alternation and adjacency of morphemes have to be analyzed automatically by finite-state transducers. The following scheme shows the morphotactic order of the verb in Uzbek.
[Figure: finite-state diagram of vowel deletion, starting at Begin, for burun+im (b, u, r, u, n, im) and shahar+im (sh, a, h, a, r; h, r)]

[Figure: morphotactic order of the Uzbek verb, ending at End; node labels and affixes as they appear in the diagram]
Verb
Passive: +Il, +l, +n, +In
Together: +sh, +Ish
Before: +guncha, +kuncha, +quncha / +gunIMcha, +KunIMcha, +QunIMcha
After: +Gach, +Kach, +Qach
Purpose: +moqchi
Position: +Ib, +b
Position: +a, +y
Position: +Gancha, +Kancha, +Qancha
Position: +Gudek, +Kudek, +Qudek
Infinitive: +Ish, +sh
Tense Present: +y
Condition: +sa
Infinitive: +Ar, +r
Infinitive: +v, +Uv
Present: +moqda
Present: +yap
Present: +yotir, +Ayotir
Past: +di
Past: +b, +Ib
Past: +Gan, +Kan, +Qan
Person: +man, +san, +miz, +siz, +lar
Negative: +ma
Negative: +mas
Negative: +may
Person: +m, +ng, +ngiz, +lar, +k
-t, -tir, -dir, -ir, -giz, -kiz, -kaz, -qiz, -qaz
Tense Present: +a
Gerund: +Gan, +Kan, +Qan, +YDIgan, +ADIgan, +YOTgan, +AYOTgan
Particles: +mi, +chi, +a, +ya
End
II. Derivational possibilities of Uzbek
Owing to the lack of Uzbek language resources in databases, some problems remain, for example with verbal categories in morphology. In order to analyze morphemes correctly in context, a classification and structure of verbs has to be constructed. Derivation is also productive in Uzbek:
Stem (Noun)     Derivative affix         Part of speech
Gul (flower)    -chi (florist)           Noun
                -dor                     Adj.
                -li (floral)             Adj.
                -siz (without flower)    Adj.
                -chilik                  Noun
                -la (blossom)            Verb
                -don (flowerpot)         Noun
There are some open issues concerning the types of affixes in inflection and derivation. For instance, the derivational diversity of verbs shows the following morphotactic models:
Noun + -a => sana, -an => kuchan, -i => ranji, -ik => ko‘zik, -ir => gapir, -y => kuchay, -ka => iska, -la => gulla, -lan => faxrlan, -lash => ommalash, -lashtir => sahnalashtir, -sit => aybsit, -sira => suvsira, -iq => yo‘liq, -g‘ar => jamg‘ar, -qar => boshqar
Adjective + -a => qiyna, -i => tinchi, -ay => toray, -la => maydala, -lan => shodlan, -lash => osonlash, -lat =>, -lashtir => soxtalashtir, -r => qisqar, -ar => oqar, -si => garangsi, -sin => yotsin, -sira => begonasira, -t => to‘lat, -it => berkit, -iq => namiq
Numeral + -ik => birik, -lan => ikkilan, -lash => birlash
Pronoun + -la => sizla, -si => mensi, -sira => sensira
Adverb + -ik => kechik, -ir => ko‘pir, -ay => ko‘pay, -la => tezla, -lash => birgalash, -sit => kamsit, -chi => ko‘pchi
Imitative words + -a => shildira, -illa => guvilla, -ur => tupur, -ira => yaltira, -la => gumburla, -ra => ma‘ra, -shi => g‘ingshi, -qir => hayqir
Modal words + -la => yo‘qla, -ol => yo‘qol, -ot => yo‘qot
+ modal affixes + -imsira => kulimsiramoq, -inqira => oqarinqiramoq, -kila => tepkilamoq, -qila => chopqilamoq, -gila => yugurgilamoq, -g‘ila => ezg‘ilamoq, -ish => to‘lishmoq, -q => tutaqmoq, -iq => toliqmoq, -k => junjikmoq, -ik => ko‘nikmoq, -la => savalamoq, -ala => quvalamoq, -qi => yulqimoq, -g‘i => to‘zg‘imoq, -a => buramoq
Overall, there are 56 types of lexical affixes that form verbs from other parts of speech. Our lexicon includes 50 000 entries together with their subdivision by categorical parameters.
Some of these affixes are multifunctional and occur as homonyms: they also form other parts of speech such as nouns, adjectives, adverbs and so on. In most cases words may be ambiguous outside of discourse. Therefore, identifying their syntactic position is also crucial for computational analysis. For example, the word och has different senses: och rang (light colour), qorin och (be hungry). Besides, the word “och” occurs as a component of idioms or compound verbs:
Ishtahani och+ib {ber, bo‘l, chiq, ket, ko‘r, qo‘y, tashla} +a {bil, boshla, ol}
Ko‘ngilni och+ib {ber, ko‘r, o‘tir, qo‘y, tashla, yubor} +a {ol}
Sirni och (divulge)
Yo‘l och (open the way)
Fol och (guess)
Gul och (flourish)
Finite-state transducers read their input symbol by symbol; each time they read a symbol, they produce a corresponding output and move to a new state. This improves the processing speed fundamentally: in practice, the processing speed is independent of the size of the rule set [5]. A lexicon compiler is a program that reads sets of morphemes and their morphotactic combinations in order to create a finite-state transducer of a lexicon [6].
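The way a finite-state transducer consumes one symbol, emits an output and moves to a new state can be shown on a very small hand-written machine. The sketch below is an assumption-laden toy, not a compiled lexicon transducer: it maps a lexical string with "+" as morpheme boundary to a surface string and implements a single alternation, q => g‘ before a vowel-initial suffix, as in quloq+im => qulog‘im. The state names and the functions `step` and `transduce` are hypothetical.

```python
VOWELS = set("aeiou")

def step(state, symbol):
    """One transition: read `symbol` in `state`, return (output, next_state)."""
    if state == "START":
        if symbol == "q":
            return "", "Q"             # hold the q until we see what follows
        if symbol == "+":
            return "", "START"         # the morpheme boundary is not written
        return symbol, "START"
    if state == "Q":                   # a q is pending
        if symbol == "+":
            return "", "Q_BOUNDARY"
        if symbol == "q":
            return "q", "Q"            # emit the earlier q, hold the new one
        return "q" + symbol, "START"
    # state == "Q_BOUNDARY": a pending q followed by a morpheme boundary
    if symbol in VOWELS:
        return "g‘" + symbol, "START"  # q => g‘ before a vowel-initial suffix
    if symbol == "q":
        return "q", "Q"
    if symbol == "+":
        return "", "Q_BOUNDARY"
    return "q" + symbol, "START"

# Output emitted when the input ends in a given state.
FINAL_OUTPUT = {"START": "", "Q": "q", "Q_BOUNDARY": "q"}

def transduce(lexical):
    """Run the transducer over the lexical string, symbol by symbol."""
    state, output = "START", []
    for symbol in lexical:
        emitted, state = step(state, symbol)
        output.append(emitted)
    output.append(FINAL_OUTPUT[state])
    return "".join(output)

print(transduce("quloq+im"))   # qulog‘im
print(transduce("quloq+lar"))  # quloqlar
```

Each input symbol triggers exactly one transition, so the running time grows with the length of the word and not with the number of rules, which is the property the paragraph above refers to.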
III. Approaches to morphological analysis
An inflectional form is a combination of a stem with an inflectional affix. Cerstin Mahlow and Michael Piotrowski describe four approaches to restricting the combination of affixes [7]: the naive, affix, stem, and indirection approaches.
Morphological analysis for machine translation includes morphophonological rules as well. For instance, English and Uzbek each have their own rules: big => bigger; quloq (ear) => qulog‘im (my ear).
In the early 1990s there were three types of morphological analyzers for the Tatar language, based on three models: the generative model, the paradigmatic model, and the two-level morphological model [8].
IV. Algorithm for morphological analysis
The earliest algorithms for automatically assigning part-of-speech were based on a two-stage architecture (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971). The first stage used a dictionary to assign each word a list of potential parts-of-speech. The second stage used large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word.
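The two-stage architecture described above can be sketched in a few lines. Everything in the example is an assumption made for illustration: the toy `LEXICON`, the tag names and the two hand-written rules are hypothetical and are far simpler than the rule lists used by the cited systems.

```python
# Stage 1: a toy dictionary that lists every part of speech a word may have.
LEXICON = {
    "och": ["ADJ", "VERB"],   # och rang (light colour) vs. sirni och (divulge)
    "rang": ["NOUN"],
    "sirni": ["NOUN"],
}

def candidate_tags(words):
    """Look up the list of potential tags for each word."""
    return [LEXICON.get(word, ["UNKNOWN"]) for word in words]

# Stage 2: hand-written rules that narrow each list down to a single tag.
def disambiguate(words, candidates):
    tags = []
    for i, options in enumerate(candidates):
        if len(options) == 1:
            tags.append(options[0])
        elif ("ADJ" in options and i + 1 < len(words)
              and candidates[i + 1] == ["NOUN"]):
            tags.append("ADJ")    # hypothetical rule: adjective directly before a noun
        elif "VERB" in options and i == len(words) - 1:
            tags.append("VERB")   # hypothetical rule: the final word is the verb
        else:
            tags.append(options[0])
    return tags

def tag(sentence):
    words = sentence.split()
    return list(zip(words, disambiguate(words, candidate_tags(words))))

print(tag("och rang"))
print(tag("sirni och"))
```

The same ambiguous word receives different tags depending on its syntactic position, which is the kind of decision the second stage exists to make.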
It is well known that machine translation is a hard problem for any language that lacks resources, and it is an especially hard problem for Uzbek. Like other Turkic languages, Uzbek is not rigidly structured, so applying a strict method to it is very difficult. Some of these difficulties have been mentioned above. Given these issues, it is useful to create a method or program that analyzes the parts of the language, that is, one that identifies the type and meaning of the words in a sentence. For this, we should first analyze individual words. Such a program is called a morphoanalyzer. Using this analyzer we can make decisions about words and their meanings, as well as about morphological and other changes in them.
The creation of this analyzer can be divided into several steps:
- identifying the stem of a lexeme;
- identifying the part of speech of the stem;
- parsing all affixes added to the word as tokens, according to the stem;
- identifying the types of all parsed affixes and annotating them.
These steps are not easy either, because many problems arise from a linguistic point of view. For example, to identify the base of a word we need a database of all simple Uzbek words, that is, words that do not include any affixes. Then we have to compare the word with almost all words in the database. There are several ideas for doing this. First, we remove one letter from the end of the word at each step and compare the remainder with all words in the database. In this way we obtain the base by cutting off all affixes at the end of the word. For example: bolalarim (not found) -> bolalari (not found) -> bolalar (not found) -> bolala (not found) -> bolal (not found) -> bola (found; the search finishes). Until we reach “bola” we make six comparisons against all database words that are no longer than nine letters (because “bolalarim” has nine letters; at every step this length decreases by one, which reduces the number of candidate words).
However, if the word has a prefix, as in “serg‘ayratlar”, “noodatiylik”, “beg‘amliging”, this method does not work: serg‘ayrat (not found) -> serg‘ayra (not found) -> serg‘ayr (not found) -> serg‘ay (not found) -> serg‘a (not found) -> serg‘ (not found) -> ser (not found) -> se (not found) -> s (not found; the search finishes unsuccessfully). The reason is that none of the forms obtained by cutting letters from the end matches a word in the database. If we start cutting letters from the beginning of the word instead, the same problem arises.
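The letter-by-letter search described above, including its failure on prefixed words, can be reproduced with a short sketch. The database `SIMPLE_WORDS` is a toy assumption containing two affix-free words, and `find_base_by_stripping` is a hypothetical name.

```python
SIMPLE_WORDS = {"bola", "g‘ayrat"}  # toy database of affix-free words (assumption)

def find_base_by_stripping(word):
    """Drop one letter from the end at each step until a database word is found.

    Returns (base, tried_forms); base is None if the search fails.
    """
    tried = []
    for end in range(len(word), 0, -1):
        candidate = word[:end]
        tried.append(candidate)
        if candidate in SIMPLE_WORDS:
            return candidate, tried
    return None, tried

base, tried = find_base_by_stripping("bolalarim")
print(base, len(tried))                         # bola 6

base, tried = find_base_by_stripping("serg‘ayrat")
print(base)                                     # None
```

The second call fails even though g‘ayrat is in the database, because the prefix ser- keeps every shortened form from matching a stored word.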
Another idea is to use the contains (substring) method of the programming language. To do this, we determine the length of the word; select the database words that are shorter than the given word; search for each of them inside the word; and, if none is found, decrease the length of the selected words and repeat the process until it succeeds. However, in this case the number of combinations to check grows considerably.
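The substring-based search can be sketched as follows. This is one possible reading of the procedure, with assumptions stated plainly: candidates are tried from the longest length downwards, the toy database `SIMPLE_WORDS` is invented for the example, and the function name `find_base_by_contains` is hypothetical.

```python
SIMPLE_WORDS = {"bola", "ola", "ol", "g‘ayrat"}  # toy database (assumption)

def find_base_by_contains(word, simple_words):
    """Search database words inside `word`, longest candidates first.

    Returns the sorted list of database words of the greatest length that
    occur anywhere in `word`, or an empty list if none occurs.
    """
    for length in range(len(word) - 1, 0, -1):
        matches = sorted(w for w in simple_words
                         if len(w) == length and w in word)
        if matches:
            return matches
    return []

print(find_base_by_contains("serg‘ayratlar", SIMPLE_WORDS))  # ['g‘ayrat']
print(find_base_by_contains("bolalarim", SIMPLE_WORDS))      # ['bola']
```

Unlike letter-by-letter stripping, this search finds the base of a prefixed word, but shorter database words such as ola and ol also occur inside bolalarim, which illustrates why the number of combinations grows.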
Despite the problems above, once we obtain the base by some method, we can identify its part of speech. Parsing all the affixes, however, is also not easy. In our approach, morphological analysis from left to right is appropriate for Uzbek. First, the stem is taken according to the part-of-speech database, and then the affixes that follow it are identified. Taking some lexemes and word forms as examples, we obtained the following algorithm in Python.
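# Illustrative data (assumption): the analyzer reads these lists from the lexicon database.
otlar = {"bola"}                             # noun stems
qoshOtYas = {"jon"}                          # first affix list checked
qoshimchalarOt = {"lar", "im", "ga", "mi"}   # second affix list checked
MAX_AFFIX = 10                               # longest affix candidate tried

word = "bolajonlarimgami"

# Step 1: take the longest prefix of the word that is a known noun stem.
k = 0
for i in range(len(word)):
    if word[:i + 1] in otlar:
        k = i + 1
print(word[:k])
word = word[k:]

# Step 2: peel affixes off from left to right, longest candidate first.
k = min(MAX_AFFIX, len(word))
while word:
    if k == 0:
        print("unanalyzed:", word)           # no known affix matches; stop
        break
    if word[:k] in qoshOtYas or word[:k] in qoshimchalarOt:
        print(word[:k])
        word = word[k:]
        k = min(MAX_AFFIX, len(word))
    else:
        k -= 1                               # shorten the candidate and retry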
RESULT:
BOLAJONLARIMGAMI
(to my dear children?)
bola
jon
lar
im
ga
mi
Conclusion
As shown above, the morphotactic model of Uzbek is crucial for analysis. Uzbek verbs have grammatical categories, each of which has to be clarified at the segmentation stage. Segmentation in morphological parsing is a multilevel process, and a number of notable approaches to it exist worldwide. Every grammatical and orthographic rule is important for a finite-state transducer. The present article presents some ways to resolve the morphoanalyzer problem for machine translation.
References:
1. Jurafsky D. Speech and language processing. 2007. – P. 4.
2. Mitkov R. The Oxford handbook of computational linguistics. – P. 62.
3. Cyril Goutte, Nicola Cancedda, Marc Dymetman, and George Foster. Learning Machine Translation. – Cambridge, Massachusetts; London, England, 2009. – P. 6.
4. Raül Canals, Anna Esteve, Alicia Garrido et al. interNOSTRUM: A Spanish-Catalan Machine Translation System / Machine Translation Review, Issue No. 11, December 2000. – PP. 21-25.
5. Krister Lindén, Miikka Silfverberg, and Tommi Pirinen. HFST Tools for Morphology – An Efficient Open-Source Package for Construction of Morphological Analyzers.
6. Computational Morphology in the Framework of the SLIM Theory of Language / State of the Art in Computational Morphology. – Zurich, 2009. – P. 30.
7. Cerstin Mahlow, Michael Piotrowski (eds.). JSLIM – Computational Morphology in the Framework of the SLIM Theory of Language / State of the Art in Computational Morphology. – Zurich, 2009. – P. 15.
8. D. Suleymanov, R. Gilmullin, R. Gataullin. Morphological analysis system of the Tatar language based on the two-level morphological model / Turklang 2017. – Kazan, 2017. – PP. 6-26.
Abdurakhmonova N., Tuliyev U. Morphological analysis by finite state transducer for Uzbek-English machine translation. This article briefly describes the results of the stages of an automatic morphological analyzer for the Uzbek language, which is used in a machine translation system. The paper analyzes the ordering of segments and the rules of Uzbek word-form generation from a morphological perspective.
Abdurakhmonova N., Tuliyev U. Analysis of a morphoanalyzer in English-Uzbek machine translation. This article briefly presents the results of implementing the stages of an automatic morphoanalyzer used in machine translation. The order and rules of the segment units of Uzbek words are also analyzed from a morphological perspective.