Authors

  • Maksud Kakhorov
    Assistant, Department of Medialinguistics and Communication, Uzbekistan State World Languages University

DOI:

https://doi.org/10.47689/2181-1415-vol6-iss5/S-pp314-318

Keywords:

corpus, monitor corpus, historical corpus, parallel corpus, sample corpus

Abstract

This article presents a detailed overview of corpus linguistics, emphasizing its role as a methodological framework rather than a traditional linguistic subfield. Corpus linguistics focuses on the empirical analysis of language through large collections of real-world text data, known as corpora. With the advent of advanced digital tools and the increasing availability of electronic texts, corpus approaches have become essential in linguistic research. The article describes various types of corpora, including reference, monitor, parallel, historical, learner, and specialized corpora, and discusses how these resources open new lines of research in grammar, semantics, and language change. It also examines how annotated corpora and powerful search tools facilitate complex, large-scale linguistic analysis that would be difficult or impossible using traditional methods.


Жамият ва инновациялар
Общество и инновации
Society and innovations

Journal home page: https://inscience.uz/index.php/socinov/index

The role of corpus and corpus data in language

Maksud KAKHOROV¹

Uzbekistan State World Languages University

ARTICLE INFO

Article history:
Received April 2025
Received in revised form 15 May 2025
Accepted 25 May 2025
Available online 15 June 2025

Keywords: corpus, monitor corpus, historical corpus, parallel corpus, sample corpus.

ABSTRACT

This article provides an in-depth overview of corpus linguistics, emphasizing its role as a
methodological framework rather than a traditional linguistic subfield. Corpus linguistics
focuses on the empirical analysis of language through large collections of real-world text
data, known as corpora. With the advent of advanced digital tools and the increasing
availability of electronic texts, corpus-based approaches have become essential in linguistic
research. The article outlines various types of corpora, including reference, monitor,
parallel, historical, learner, and specialized corpora, and discusses how these resources
enable new lines of inquiry in grammar, semantics, and language change. It also explores how
annotated corpora and powerful search tools facilitate complex, large-scale linguistic
analyses that would be difficult or impossible using traditional methods.

2181-1415/© 2025 in Science LLC.
DOI: https://doi.org/10.47689/2181-1415-vol6-iss5/S-pp314-318
This is an open access article under the Attribution 4.0 International
(CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0/deed.ru)

The importance of corpus and corpus data in language

ABSTRACT

Keywords: corpus, monitor corpus, historical corpus, parallel corpus, sample corpus.

This article offers an in-depth theoretical overview of corpus linguistics, treating it as a
methodological framework rather than a traditional linguistic subfield. Corpus linguistics is
concerned with the empirical analysis of language through large collections of real-world text
data known as corpora. With the development of modern digital tools and the wide availability
of electronic texts, corpus-based approaches are gaining importance in linguistic research.
The article describes various types of corpora, including reference, monitor, parallel,
historical, learner, and specialized corpora, and discusses how these resources create new
opportunities for the study of grammar, semantics, and language change. It also shows how
annotated corpora and powerful search tools facilitate complex, large-scale linguistic
analyses that would be difficult or impossible to carry out using traditional methods.

¹ Assistant, Department of Medialinguistics and Communication, Uzbekistan State World Languages University.
E-mail: maqsudqahhorov19@gmail.com

Special Issue 05 (2025) / ISSN 2181-1415



First, it is important to answer the question: what is corpus linguistics? It is quite
distinct from most other topics one might study in linguistics, as it is not directly about
the study of any particular aspect of language. Rather, it is an area that focuses on a set of
procedures, or methods, for studying language (although at least one major school of corpus
linguistics does not agree with the characterization of corpus linguistics as a methodology).
The procedures themselves are still developing and remain an unclearly delineated set, though
some of them, such as concordancing, are well established and are viewed as central to the
approach. Importantly, the development of corpus linguistics has also spawned, or at least
facilitated the exploration of, new theories of language: theories which draw their
inspiration from attested language use and the findings drawn from it.

With its emphasis on language as a collection of data, corpus linguistics has established
itself in the academic world, and the number of language disciplines that use some or all of
the methods of corpus analysis has increased dramatically. A search for the term “corpus
linguistics” in the published materials accessible online through Google Books makes it
evident how its usage has grown. In the 1800s the term “corpus” may have been used in a very
different sense, for example to refer to the complete works of an author such as Shakespeare;
the term “corpus linguistics” itself was first used by Aarts and Meijs in 1984. The link
between the rise in the literature of a term coined to describe the new field and the practice
of corpus analysis is technology. Advances in word processing technology have resulted in the
availability of potentially vast repositories of electronic text data. Even the possibility of
using the World Wide Web as a “corpus” has become a reality, though one to be taken with
certain caveats. This situation is in marked contrast with the early days of concentrated
corpus development for language analysis. Access to electronically available language samples
via the World Wide Web means that the sheer scope of what corpus analysis techniques can be
applied to is remarkable, though not without attendant issues of copyright and ethics.

There are different types of corpora. A sample corpus, also known as a general or reference
corpus, is usually monolingual and aims to capture the features of a language variety (e.g.,
American English, Irish English) as used in normal, everyday situations. Such corpora tend to
be “snapshots” of a language, given that they are usually collected at a particular point in
time, e.g., between 1980 and 1990.

A monitor corpus is a sample or general corpus that is continually added to in order to keep
the language data it contains current.

A parallel corpus consists of two or more corpora of the same texts in different languages,
produced by translation, which can be compared side by side, often line by line.

A historical corpus, also known as a diachronic corpus, contains texts from different,
specified periods of time. It can be used to identify features of language in use at a given
time, but also to track changes in language use over time. Often, building such a corpus
involves digitizing texts that did not originally exist in electronic format.

A learner corpus consists of texts gathered to represent the features of learner language,
i.e., language produced by non-native speakers learning a foreign language. The goal of
gathering a corpus like this is usually to inform teaching and learning processes and
materials.

Specialized corpora aim to capture a specific type of language use, in order to describe, in
highly contextualized terms, language use in that domain.
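To make the idea of tracking change with a historical corpus more concrete, the following is a minimal Python sketch rather than a description of any particular corpus tool. It assumes a toy corpus in which texts are grouped by period; the period labels, the sentences, and the helper names (`tokenize`, `relative_frequency`, `track`) are all invented for illustration. Frequencies are normalized per 1,000 tokens so that periods of different size can be compared.

```python
import re
from collections import Counter

# Toy "historical corpus": texts grouped by period. The periods and sentences
# are invented for this sketch and are not drawn from any real corpus.
HISTORICAL_CORPUS = {
    "1900-1950": ["You must go now.", "We must not delay, and they must wait."],
    "1951-2000": ["You need to go now.", "We must not delay, but they need to wait."],
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(texts, word, per=1000):
    """Occurrences of `word` per `per` tokens across the given texts."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * per / len(tokens)

def track(corpus, word):
    """Map each period to the normalized frequency of `word` in that period."""
    return {period: relative_frequency(texts, word) for period, texts in corpus.items()}

for period, freq in track(HISTORICAL_CORPUS, "must").items():
    print(f"{period}: {freq:.1f} per 1,000 tokens")
```

With only a handful of invented sentences the numbers mean nothing linguistically; the point is only the shape of the comparison, in which the same query is run over each period and the normalized results are set side by side.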

If we consider the range of research questions that a corpus on its own allows us to address,
we can imagine it as covering a subset of all the research questions that a linguist might
ask. That subset will overlap with the subset of questions that a linguist can ask without a
corpus, but it is almost certainly greater in size than that set. This is because with corpus
data we can take new approaches to several areas, including grammatical description and even
linguistic theory. Moreover, the set of questions that may be readily addressed using a corpus
grows significantly as suitable tools for interrogating that data become available. It grows
larger still if the corpus data at our disposal is suitably annotated with linguistic
analyses, in such a way that linguistically motivated queries can be undertaken rapidly and
accurately. For example, without a corpus we could certainly examine, and seek to describe,
the use of non-finite verbs in English; many scholars have done so, for instance O’Dwyer
(2006: 58–9). However, the number of examples of non-finite verbs on which we could base our
investigation would remain relatively small, being limited by the hand-and-eye techniques that
we would need to find them. With a suitable corpus of English, we would have at our disposal
many thousands of words with which to conduct our study. Still, in the absence of tools to
search the data rapidly, the exploitation of the data would be slow and prone to error,
although we would probably be able to study a greater number of examples somewhat more
effectively than we would without the corpus. A search tool that can quickly extract examples
of particular words from the corpus would greatly improve the accuracy of our searches (though
spelling errors in the data itself might prevent the searches from being wholly error-free).
Furthermore, the time taken by the searches would fall dramatically. And, crucially, corpora
allow access to reliable information regarding frequency. In the absence of corpus data, even
trained linguists find it very difficult to come up with reliable estimates of frequency in
language.
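The kind of search tool described above can be illustrated with a minimal Python sketch of a keyword-in-context (KWIC) concordancer and a word-form frequency count. This is an assumption-laden toy, not the interface of any existing concordancing program: the sample sentence is invented, and `tokenize` and `concordance` are hypothetical helper names.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of tokens shown on each side of the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text standing in for a corpus.
sample = "She kept walking. Walking is healthy, and he enjoys walking to work."
tokens = tokenize(sample)

# Frequency of every word form in the sample.
freq = Counter(tokens)
print("walking:", freq["walking"])

# Concordance lines with the keyword aligned in the middle.
for left, kw, right in concordance(tokens, "walking"):
    print(f"{left:>20} [{kw}] {right}")
```

Because the search matches exact word forms, a misspelled token such as "wakling" would simply be missed, which is the limitation the text notes about spelling errors in the data.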

Being able to search for and extract frequencies of different word forms or phrases gets us a
long way. But it does not give us all the tools we need for every sort of research question.
We can search for ‘walking’, ‘having walked’, or ‘to walk’, but we cannot search for ‘every
non-finite verb’. Searching for each non-finite form of every verb in English would take a
very long time indeed. Quantifying the relative frequency of, say, nouns and adverbs in
English would take even longer, to the extent that an investigation of these features based on
corpus data is effectively impractical if we use searches based on word form alone. However,
if our corpus already has annotations that show the part of speech of each word in the corpus,
then, armed with a search tool that understands these annotations, we can fashion a query to
extract the information we want (the frequency of nouns, or a concordance of all non-finite
verbs) rapidly and reliably. So the combination of corpus, search tool, and corpus annotation
makes it possible to explore research questions that would be almost unimaginable otherwise.
As studies such as Leech and Mukherjee show, even with something as apparently mundane as a
non-finite verb form, a large-scale investigation can reveal features of its use that have
escaped linguists who have relied on intuition or small numbers of examples alone. Indeed,
Leech’s Meaning and the English Verb is a telling example of this. Between its original
publication in 1971 and its third edition in 2004, the increasing availability of corpora and
tools to search them led to parts of the work, especially the chapter on modal verbs, being
revised substantially. Two good examples are Leech’s identification of emergent modal
constructions such as ‘need to’ and ‘had better’, and his discussion of ‘would’ as a pure
hypothesis in expressions such as ‘we would think’ and ‘one would expect’. In both cases, it
was the data available to Leech that led to his identification of the growing salience of
these features in English usage, and hence to the modification of his earlier work. It is
difficult to see how such observations could be made reliably on the basis of any other sort
of evidence.
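The annotation-aware query described above can be sketched in a few lines of Python. The sketch assumes a toy corpus that has been tagged by hand with Penn Treebank-style part-of-speech tags; the sentences, the tags assigned to them, and the function names are illustrative assumptions, not output from a real tagger or corpus. It also assumes that the tags VB, VBG, and VBN approximate "non-finite verb", which over-counts slightly, because in that tagset VB also covers imperatives.

```python
from collections import Counter

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
# Sentences and tags were written by hand for this sketch.
TAGGED_CORPUS = [
    [("she", "PRP"), ("wants", "VBZ"), ("to", "TO"), ("walk", "VB"), ("daily", "RB")],
    [("having", "VBG"), ("walked", "VBN"), ("far", "RB"), ("he", "PRP"), ("rested", "VBD")],
    [("walking", "VBG"), ("helps", "VBZ"), ("people", "NNS")],
]

# Assumption: base forms, -ing forms, and past participles stand in for
# "non-finite verb". A real study would need a more careful definition.
NON_FINITE_TAGS = {"VB", "VBG", "VBN"}

def non_finite_verbs(corpus):
    """Return (sentence number, word, tag) for every token with a non-finite tag."""
    hits = []
    for sent_no, sentence in enumerate(corpus):
        for word, tag in sentence:
            if tag in NON_FINITE_TAGS:
                hits.append((sent_no, word, tag))
    return hits

def tag_frequencies(corpus):
    """Count how often each part-of-speech tag occurs in the corpus."""
    return Counter(tag for sentence in corpus for _, tag in sentence)

def noun_count(corpus):
    """Total tokens whose tag starts with NN (singular, plural, proper nouns)."""
    return sum(n for tag, n in tag_frequencies(corpus).items() if tag.startswith("NN"))

print(non_finite_verbs(TAGGED_CORPUS))
print("nouns:", noun_count(TAGGED_CORPUS))
```

A single pass over the tags retrieves every non-finite form regardless of the verb involved, which is exactly what a word-form search cannot do without listing each form of each verb in advance.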


REFERENCES:

1. Aarts, J. and Meijs, W. (eds.) 1984. Corpus Linguistics: Recent Developments in the Use of
Computer Corpora in English Language Research. Amsterdam: Rodopi.

2. McEnery, T. and Hardie, A. 2012. Corpus Linguistics: Method, Theory and Practice.
Cambridge: Cambridge University Press.

3. O’Dwyer, B. 2006. Modern English Structures: Form, Function and Position. New York:
Broadview Press.

4. Alderson, J. C. 1996. ‘Do corpora have a role in language assessment?’, in J. Thomas and
M. Short (eds.) Using Corpora in Language Research, pp. 248–59. London: Longman.

5. Leech, G. 1971. Meaning and the English Verb. London: Longman.

6. Leech, G. 1992. ‘Corpora and theories of linguistic performance’, in J. Svartvik (ed.)
Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82, Stockholm, 4–8 August
1991, pp. 105–22. Berlin: Mouton de Gruyter.

7. Mukherjee, J. 2004. ‘Corpus data in a usage-based cognitive grammar’, in K. Aijmer and
B. Altenberg (eds.) The Theory and Use of Corpora: Papers from the Twenty-Third International
Conference on English Language Research on Computerized Corpora (ICAME 23), pp. 85–100.
Amsterdam: Rodopi.

8. Leech, G. and Fallon, R. 1992. ‘Computer corpora: what do they tell us about culture?’,
ICAME Journal 16: 29–50.

9. Leech, G., Hundt, M., Mair, C. and Smith, N. 2009. Change in Contemporary English: A
Grammatical Study. Cambridge: Cambridge University Press.
