Xorijiy lingvistika va lingvodidaktika – Зарубежная лингвистика и лингводидактика – Foreign Linguistics and Linguodidactics
Journal home page: https://inscience.uz/index.php/foreign-linguistics
Speech technology used in speaking foreign languages
National University of Uzbekistan
ARTICLE INFO
Article history:
Received July 2024
Received in revised form 10 August 2024
Accepted 25 August 2024
Available online 25 September 2024
ABSTRACT
Speech-enabled systems for computer-assisted language
learning (CALL) offer numerous advantages for language
learners. These systems address individual learner challenges,
allow learners to practice at their own pace, and make practice
possible beyond the hours when a teacher is available.
They can also store student profiles in
log files, enabling both students and teachers to monitor
progress and identify areas for improvement. Additionally, CALL
systems can help students who find it difficult to practice
speaking in front of others to improve their language skills. In this paper,
we discuss the use of speech technology in CALL applications
designed to improve the spoken language skills of second-language (L2) learners.
2181-3701/© 2024 in Science LLC.
https://doi.org/10.47689/2181-3701-vol2-iss3
This is an open-access article under the Attribution 4.0 International (CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0/deed.ru)
Keywords: speech technology, computer-assisted language learning (CALL), automatic speech recognition (ASR), pronunciation training, non-native speech database, lexical stress, intonation in language learning, speech synthesis, pronunciation error detection, foreign-accented speech recognition.
1 PhD student, National University of Uzbekistan.
INTRODUCTION:
One of the first challenges that learners frequently experience is the pronunciation
of “difficult” phones and phone sequences. Learners of languages like Swedish, Finnish,
and Estonian may face difficulties trying to pronounce specific vowels. Learners of Russian
may struggle to produce palatalized consonants like /l'/, /p'/, and /t'/, as well as
distinguish between palatalized and non-palatalized consonants. When speaking Swedish,
native Russian speakers typically pronounce consonants that are followed by vowels such as ä and
ö with strong palatalization. Learners of French and Polish may have problems with the
pronunciation of nasal vowels.
Mispronunciation
A mispronounced phone may impede comprehensibility, especially if the error
results in a word with a different meaning. In Swedish, ö pronounced like [o] in the
noun mönster (“pattern”) produces another word, monster (“monster”). In Russian,
if the learner mispronounces the final consonant of the noun мать (mat’, “mother”), the
palatalized /t’/, so that it sounds like the non-palatalized /t/, the listener may
interpret the result as the noun мат (mat, “swearing, obscene language”). In many
languages, the difference between spelling and pronunciation can lead to errors, especially
at the beginner level. For example, words and morphs whose pronunciation differs from their
spelling may be pronounced as spelled. Another common mistake is interpreting certain character
combinations based on the spelling rules of another language. Native speakers of
languages using the Latin alphabet may struggle to acquire languages like Russian, Greek,
Arabic, Chinese, or Japanese due to character confusion and pronunciation mistakes.
Lexical stress
Lexical stress in polysyllabic words that students know mostly from their written
form is sometimes placed incorrectly. Hincks (2005) gives some examples of English
words that are often mispronounced by Swedish learners: access, capacity, component, and
contribute, realized with misplaced stress as acCESS, CApacity, COMponent, and CONtribute. In some languages,
stress placement signals morphological features of the word. In Russian, the only difference
between the genitive singular and the nominative or accusative plural of the noun река
(reka, “river”), which are spelled in exactly the same way, is the stress. In the genitive
singular реки (rekì) the second syllable is stressed. The first syllable is stressed in the
nominative/accusative plural of the same noun, реки (rèki). In such cases, incorrectly
placed stress causes grammar errors.
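A CALL exercise could flag such errors by comparing the syllable a learner actually stressed with the syllable that the intended grammatical form requires. The following is a minimal sketch of that comparison, not a description of any existing system: the `STRESS_LEXICON` table and the `check_stress` function are hypothetical names, and the detected stress position is assumed to come from an external stress detector that is not shown.

```python
# Hypothetical mini-lexicon: (written form, grammatical form) -> index of the
# stressed syllable (0-based). Only the example from the text is included.
STRESS_LEXICON = {
    ("реки", "genitive singular"): 1,             # second syllable stressed
    ("реки", "nominative/accusative plural"): 0,  # first syllable stressed
}


def check_stress(word, intended_form, detected_syllable):
    """Compare a detected stress position with the one the intended form needs.

    `detected_syllable` is assumed to be supplied by an external stress
    detector (not part of this sketch). Returns a short feedback string.
    """
    expected = STRESS_LEXICON.get((word, intended_form))
    if expected is None:
        return "unknown word or form"
    if detected_syllable == expected:
        return "stress correct"
    # Look for another form of the same written word that matches the stress
    # the learner produced, to explain the resulting grammar error.
    for (other_word, other_form), index in STRESS_LEXICON.items():
        if other_word == word and index == detected_syllable:
            return f"stress error: this sounds like the {other_form}"
    return "stress error"
```

For example, `check_stress("реки", "genitive singular", 0)` reports that the learner's form sounds like the nominative/accusative plural, which is the kind of feedback that links a pronunciation slip to its grammatical consequence.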
Intonation
Intonation is frequently vital for understanding what the learner says.
In Russian, intonation is widely used in interrogative clauses to clarify their meaning and
distinguish them from declarative ones. Other examples of intonational patterns that may
require practice include the polar question intonation in English and the distinction
between acute and grave word accents in Swedish.
It is important to consider grammar and vocabulary as well when discussing errors in
spoken language. When speaking, the student typically does not have time to plan what to
say, look up words in a dictionary, proofread sentences, or consult a grammar
guide. Because of this, many students find that they make more grammatical and
lexical errors when speaking than when writing.
Rapid speech tempo and discrepancies between spelling and pronunciation can also cause comprehension
issues. Beginning learners of languages such as French and Danish, where
spelling and pronunciation diverge greatly, may find it difficult to follow spoken language.
Many Russian words may be difficult for learners to recognize because of the difference between stressed and
unstressed vowels, devoicing of the final consonant in the utterance, and assimilations not
reflected in the spelling.
Technologies in teaching spoken languages
Speech analysis has been used for teaching intonational patterns to second
language learners since the 1970s. The main principle is that the waveform or pitch
contour of a student’s utterance is displayed visually alongside that of the model
utterance. Research has indicated that audio-visual feedback enhances
learners' understanding of target language prosody and intonation, as well as their segmental
accuracy. Numerous software packages include a speech analysis
component. However, the guidance they give on interpreting the feedback is often insufficient.
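To make the idea of comparing pitch contours concrete, the sketch below estimates a fundamental frequency (F0) value per frame with a plain autocorrelation peak search and then measures how far a learner's contour lies from a model contour. It is a simplified illustration under stated assumptions: the signals are synthetic sine tones, the 75 to 400 Hz search range and 30 ms frames are arbitrary choices, all function names are hypothetical, and real speech would additionally need voicing detection and time alignment of the two utterances.

```python
import math


def synth_tone(f_start, f_end, duration=0.3, sample_rate=8000):
    """Synthesize a sine glide from f_start to f_end Hz (test signal only)."""
    n = round(duration * sample_rate)
    samples, phase = [], 0.0
    for i in range(n):
        freq = f_start + (f_end - f_start) * i / n
        phase += 2 * math.pi * freq / sample_rate
        samples.append(math.sin(phase))
    return samples


def frame_f0(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate the F0 of one frame from the peak of its autocorrelation."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


def pitch_contour(samples, sample_rate=8000, frame_len=240):
    """Return one F0 estimate (Hz) per non-overlapping frame."""
    return [
        frame_f0(samples[start:start + frame_len], sample_rate)
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def contour_distance(learner, model):
    """Mean absolute F0 difference (Hz) over the frames both contours share.

    Both contours are assumed to be non-empty.
    """
    pairs = list(zip(learner, model))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)
```

A display component would plot the two lists returned by `pitch_contour` on the same axes; `contour_distance` gives a single rough number that could accompany the plot, which is one possible way to supplement visual feedback that learners find hard to interpret.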
Hincks (2005) discusses the use of speech synthesis as an interactive tool for
teaching English to young adults who speak Swedish as their first language. In another
study mentioned by Hincks, formant synthesis was successfully used to teach
Cantonese and Mandarin learners distinctions in English vowel quality. Speech
synthesis is nevertheless not widely used in computer-assisted language learning: most developers
currently seem to prefer recordings of natural voices because synthesized speech can often
sound quite artificial.
Handley and Hamel (2004) analyze the requirements of speech synthesis for CALL
and describe an experiment in which utterances generated by a speech synthesizer were
presented to a group of instructors and CALL researchers. The participants
were asked to judge the comprehensibility and acceptability of the utterances for three
different functions of speech synthesis in CALL: reading machine, pronunciation tutor, and
conversational partner. The findings show that the
evaluations of comprehensibility, acceptability, and general suitability differ across the three
functions.
The utterances were judged most comprehensible when used as
a conversational partner and least comprehensible when used as a pronunciation
tutor. Similarly, the output of the speech synthesizer was judged least acceptable
and appropriate for use as a pronunciation tutor, and most acceptable and appropriate for
use as a conversational partner. In their discussion of how to evaluate speech
synthesis for CALL, Handley and Hamel argue that evidence is required to persuade
developers that speech synthesis is appropriate for use in CALL applications.
Automatic speech recognition
Research shows that speech recognition systems designed for native speakers
perform worse on foreign-accented speech. In response, some researchers have
created non-native speech databases for developing speech recognizers tailored to
learners' mother tongues. Neri et al. (2001) conclude that speech recognition provides an
optimal solution to pronunciation learning. Automatic speech recognition also makes it
possible for learners to "converse" with
the computer in a spoken dialogue system, which is useful even though the conversations can
only take place within certain domains. The ability of speech recognition algorithms to
identify accented or mispronounced speech and to offer insightful assessments of
pronunciation quality has been criticized in a few papers published in the
language education field. Neri et al. (2003) examine various reviews of the usability of
speech recognition in pronunciation training and show that some of this criticism is unfounded
and is instead the product of a lack of experience with automatic speech recognition.
According to Wachowicz and Scott (1999), the efficacy of speech recognition-based CALL
systems depends not only on the speech recognizer's capabilities but also on two design factors:
(a) how the language learning activity and feedback are designed; and (b) whether repair
strategies are included to protect against recognizer error.
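One way to picture a repair strategy of the kind named in factor (b) is a rule that chooses the system's next move from the recognizer's confidence in its hypothesis, so that a likely recognition error is not presented to the learner as a pronunciation error. The sketch below is an illustrative assumption rather than the design of any system cited here: the function name, the action labels, and all thresholds are hypothetical.

```python
def choose_repair_action(confidence, attempts,
                         accept_above=0.80, confirm_above=0.50, max_attempts=3):
    """Pick a dialogue move from a recognizer confidence score in [0, 1].

    All thresholds are illustrative assumptions, not values from the literature.
    `attempts` counts how many times the learner has already tried this item.
    """
    if confidence >= accept_above:
        return "accept"              # trust the hypothesis and give feedback on it
    if confidence >= confirm_above:
        return "confirm"             # ask "Did you say ...?" before judging
    if attempts < max_attempts:
        return "ask_to_repeat"       # too uncertain to blame the learner
    return "offer_multiple_choice"   # fall back to a non-speech input mode
```

With these settings a score of 0.65 triggers a confirmation question, while repeated low-confidence attempts eventually switch the activity to a mode that does not depend on the recognizer at all.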
Speech recognition systems can measure the rate at which learners speak, and rate
of speech has been shown to correlate with speaker proficiency. Unfortunately, current
speech recognition systems are not good at handling information contained in the
speaker's prosody. The limitations of the technology imply that the learner's utterances
have to be predictable and that detecting errors is only possible with a limited degree of
detail, which makes it difficult to give the learner corrective feedback.
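Rate of speech can be computed directly from time-aligned recognizer output. The sketch below assumes a hypothetical input format, a list of `(label, start_s, end_s)` tuples such as a forced alignment might produce, and counts one unit per tuple; the units could equally be phones or syllables. It also returns the articulation rate and the number of pauses, the two other temporal measures mentioned later in this article, and the 0.2 s pause threshold is an arbitrary assumption.

```python
def temporal_measures(segments, min_pause=0.2):
    """Compute simple temporal fluency measures from time-aligned segments.

    `segments` is a list of (label, start_s, end_s) tuples in time order;
    this input format is an assumption of the sketch.
    """
    if not segments:
        return {"rate_of_speech": 0.0, "articulation_rate": 0.0, "pauses": 0}
    # Total time includes silent pauses; speech time excludes them.
    total_time = segments[-1][2] - segments[0][1]
    speech_time = sum(end - start for _, start, end in segments)
    # A pause is a gap between consecutive segments of at least min_pause seconds.
    pauses = sum(
        1
        for (_, _, prev_end), (_, next_start, _) in zip(segments, segments[1:])
        if next_start - prev_end >= min_pause
    )
    return {
        "rate_of_speech": len(segments) / total_time if total_time > 0 else 0.0,
        "articulation_rate": len(segments) / speech_time if speech_time > 0 else 0.0,
        "pauses": pauses,
    }
```

The difference between the two rates is the point of the example: a learner who articulates quickly but pauses often will have a high articulation rate and a much lower rate of speech.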
Speech technology using non-native databases
A corpus of English sentences recorded by students with German and Italian as their
mother tongue was collected and annotated by Bonaventura et al. (2000). Work on
the ISLE project aimed at modeling German- and Italian-accented English. The ISLE corpus
of non-native spoken English consists of nearly eighteen hours of annotated speech
from Italian and German learners of English. Other speech databases consisting of
utterances from non-native speakers have also been created, and some researchers
have used them to explore ways to improve the performance of speech recognizers on
non-native speech. Mayfield Tomokiyo (2000), Mayfield Tomokiyo and Jones (2001),
and Mayfield Tomokiyo and Waibel (2001) concentrate on the differences between native
and non-native speech as well as adaptation strategies for processing non-native
utterances. Gero and Giuliani (2004a, 2004b) use two child databases, one with
speech from native English-speaking children and the other with English sentences read
by Italian learners of the same age, to train a speech recognizer and explore potential
ways to enhance recognition performance on non-native speech. Wang et al. (2003)
investigate how various acoustic model adaptation techniques can help
improve speech recognizer performance on non-native speech when only limited
non-native data are available. Morgan and LaRocca (2004) use Gaussian mixture merging
to enhance the performance of an Arabic speech recognizer on non-native data.
Many publications discuss automatic pronunciation error detection, pronunciation
grading, and spoken language tests. Ronen et al. (1997) explore methods for identifying
mispronunciations within language instruction systems. Kim et al. (1997) test
different probabilistic models for generating pronunciation scores for phone segments from
the phonetic alignments produced by an HMM-based speech recognition system. The
speech database used in their experiments includes speech from American students
speaking French and from native speakers of Parisian French. Langlais et al. (1998) discuss
automatic mispronunciation detection in non-native Swedish speech. The database used in their
research consists of speech from 21 non-native speakers from various countries.
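As an illustration of what a probabilistic pronunciation score for a phone segment can look like, the sketch below averages the log posterior probability of the target phone over the frames that the alignment assigns to it. This is one commonly described formulation, offered here as an assumption rather than as the exact model used in any of the studies above: the function name and input format are hypothetical, phone priors are taken to be uniform, and the frame-level log-likelihoods are expected to come from the recognizer's acoustic models.

```python
import math


def log_posterior_score(frame_loglikes, target_phone):
    """Average log posterior of `target_phone` over the frames of one segment.

    `frame_loglikes` is a non-empty list with one dict per frame, mapping each
    phone label to log p(frame | phone). Uniform phone priors are assumed, so
    the posterior is the target likelihood divided by the sum over all phones.
    """
    total = 0.0
    for loglikes in frame_loglikes:
        # Log-sum-exp over all phones, shifted by the maximum for stability.
        peak = max(loglikes.values())
        log_evidence = peak + math.log(
            sum(math.exp(value - peak) for value in loglikes.values())
        )
        total += loglikes[target_phone] - log_evidence
    return total / len(frame_loglikes)
```

Scores are always at most zero; a value close to zero means the target phone's model explains the frames much better than its competitors, while a strongly negative value suggests that the learner produced something closer to another phone.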
Automatic speech recognition algorithms provide the basis of these procedures. Jo et
al. (1998) present a system that uses audio processing and recognition techniques to identify
pronunciation mistakes and give Japanese learners diagnostic feedback. Cucchiarini et al.
(1998a, 1998b) address automatic pronunciation grading for Dutch. Franco
and Neumeyer (1998) present a paradigm for the automatic generation of phonetic
segmentations of learners' speech using HMMs in order to assess pronunciation quality
automatically. These segmentations yield scores for spectral match and duration. To get
the best results, Franco and Neumeyer concentrate on calibrating the various
machine scores. Franco et al. (1998) note that when reading, beginning
language learners frequently pause within words. To address this, they suggest modeling
interword pauses to generate more reliable segmental scores.
The task of predicting the degree of nativeness of the learner's utterances is
addressed by Teixeira et al. (2000). To achieve the best results, they take prosody into account
in addition to the segmental assessment of the speech signal. Cucchiarini et al.
(2000a, 2000b) demonstrate that expert fluency ratings of read speech can be predicted
from automatically calculated temporal measures of speech quality. Rate of speech
appears to be the best predictor; two other important determinants of reading fluency are
the rate at which the speakers articulate the sounds and the total number of pauses.
Imoto et al. (2002) propose a method for assessing sentence stress in English spoken by
Japanese students, with an accuracy rate of 95.1% for native speakers and
84.1% for non-native speakers. Raux and Kawahara (2002a, 2002b) present a technique for
identifying the pronunciation mistakes that are most
detrimental to understanding. They use a probabilistic algorithm to estimate intelligibility
from the error rates of a speech recognition-based system, and they define an error priority
function that shows which errors are most important for intelligibility. Ishida (2004)
proposes a system for
categorizing Mandarin Chinese bisyllabic words according to the suitability of their lexical
tones. This method can be used to help non-native learners
acquire tone pronunciation skills. Truong et al. (2004) use the DL2N1 corpus (Dutch as L2,
Nijmegen Corpus 1), which contains speech from native and non-native speakers of Dutch, to
develop classifiers for three sounds that are frequently mispronounced
by L2 learners of Dutch: /A/, /Y/, and /x/. Tsubota et al. (2002, 2004) present
a method for detecting and diagnosing pronunciation errors in Japanese-speaking learners
of English. Bernstein et al. (2004) describe the development of an automatically scored
spoken Spanish test; Hincks (2005) mentions the commercially successful PhonePass test,
which uses speech recognition to assess the correctness of students’ responses and provides
scores in pronunciation and fluency. Neri et al. (2004) report on a study
conducted to obtain an inventory of segmental errors in the speech of adult learners of
Dutch and discuss setting priorities for pronunciation training.
A few works have been devoted to teaching intonation. Taniguchi and Abberton
(1999) find that interactive visual feedback on the voice fundamental frequency can help
Japanese learners to improve their English intonation. Hardison (2004) discusses
computer-assisted prosody training and two experiments that show the effective
pedagogical application of speech technology; the experiments suggest that audio-visual
training can help learners of French to improve not just their prosody but also their
segmental accuracy. Levis and Pickering (2004) discuss two
discourse-level uses of intonation, the use of intonational paragraph markers and the
distribution of tonal patterns, and how to teach intonation in discourse using speech
visualization technology.
Prosody training and its effects are also examined in Delmonte et al. (1997), Delmonte
(1999), and Herry and Hirst (2002).
The Subarashii system, described in Bernstein et al. (1999), is designed for
beginning students of Japanese. The system analyzes the students' utterances and
responds in a meaningful way in spoken Japanese. Holland et al. (1999) describe a
speech-interactive graphics microworld in which learners speak to an animated agent. Dalby and
Kewley-Port (1999) present a pronunciation training program for adult learners that uses
automatic speech recognition. WebGrader, described in Neumeyer et al. (1998), is a
multilingual pronunciation grading tool
based on speech recognition and pronunciation grading technologies. Speech recognition and
pronunciation scoring technologies are integrated into the Voice Interactive Language
Training System (VILTS), a language-training
prototype that Rypa and Price (1999) describe as designed to improve
speaking and understanding. Kirschning and
Aguas (2000) use speech recognition technology to verify that uttered words in Mexican
Spanish are pronounced correctly.
According to a study published in 2000 by Mayfield Tomokiyo et al., a
speech-recognition-based method has been used effectively to teach non-native speakers of English
how to pronounce the voiced and voiceless interdental fricatives. Mak et al. (2003)
introduce PLASER, a
multimedia application that provides immediate feedback and is intended to teach English
pronunciation to students whose mother tongue is Cantonese Chinese. Speech recognition
is used in the Let's Go Spoken Dialogue System described by Raux and Eskenazi (2004), in
Parling, a CALL system for children presented by Mich et al. (2004), in the SCILL (Spoken
Conversational Interaction for Language Learning) project presented by Ye and Young
(2005), and in the system described by Bianchi et al. (2004). Seneff et al. (2004) discuss
the use of multilingual spoken dialogue systems as an aid to second language acquisition.
Speech recognition is also used by a few commercially available pronunciation training
programs, including Rosetta Stone, Talk to Me and Tell Me More by the French company
Auralog, and TriplePlayPlus and Accent Coach created by Syracuse Language Systems.
Badin et al. (2000) explore the use of speech mapping tools and a virtual talking
head for pronunciation instruction based on audio-visual speech stimuli. Granström (2004)
presents work on developing
a virtual language tutor that can assist with a variety of language learning tasks, including
conversational practice and pronunciation instruction.
REFERENCES:
1. Bonaventura, P., et al. (2000). Collection and annotation of a corpus of non-native English speech. ISLE Project.
2. Bernstein, J., et al. (2004). Development of automatically scored spoken language tests. PhonePass Test.
3. Cucchiarini, C., et al. (1998). Automatic pronunciation grading for Dutch learners.
4. Dalby, J., & Kewley-Port, D. (1999). Pronunciation training for adults using automatic speech recognition.
5. Franco, H., et al. (1998). Automatic generation of phonetic segmentations using HMMs.
6. Handley, Z., & Hamel, M.-J. (2004). Speech synthesis evaluation for computer-assisted language learning.
7. Hardison, D. (2004). Prosody training through audio-visual methods.
8. Hincks, R. (2005). Use of speech synthesis in teaching English to Swedish learners.
9. Holland, V., et al. (1999). Speech-interactive systems for language learning.
10. Kim, H., et al. (1997). Probabilistic models for pronunciation scoring in CALL systems.
11. Langlais, P., et al. (1998). Mispronunciation detection in non-native Swedish speech.
12. Mayfield Tomokiyo, L. (2000). Acoustic modeling of non-native speech.
13. Neri, A., et al. (2001). Speech recognition in pronunciation training.
14. Raux, A., & Eskenazi, M. (2004). Spoken dialogue systems for second language acquisition.
15. Ronen, M., et al. (1997). Mispronunciation detection methods in CALL systems.
16. Truong, K., et al. (2004). Classifiers for detecting common pronunciation errors in Dutch L2 learners.