CORPUS LINGUISTICS AND THE BRITISH NATIONAL CORPUS (BNC)
Nazarova Lola Maqsadilla qizi
Graduate student of Termiz University of Economics and Service
+99890 608 70 56
https://doi.org/10.5281/zenodo.11643499
Abstract. This article introduces corpus linguistics, one of the fastest-growing methodologies
in contemporary linguistics, and the British National Corpus (BNC). In a conversational format,
it answers a few questions that corpus linguists regularly face from linguists who have not yet
used corpus-based methods, and it reviews the history, evaluation and features of the BNC.
Keywords: corpus, corpus linguistics, methodology, classroom context, methodology of corpus
linguistics, quantitative and qualitative analysis, British National Corpus (BNC),
orthographic transcription.
Aннотация. В этой статье вы можете получить некоторую информацию о
Корпусной лингвистике и Британском национальном корпусе (BNC), а также о самых
быстрорастущих методологиях в современной лингвистике. В разговорном формате эта
статья отвечает на несколько вопросов, с которыми регулярно сталкиваются корпусные
лингвисты от лингвистов, которые до сих пор не использовали корпусные методы, а
также на историю, оценку и особенности Британского национального корпуса (BNC) в
лингвистике.
Ключевые слова: Корпус, корпусная лингвистика, методология, контекст занятия,
методология корпусной лингвистики, количественный и качественный анализ, Британский
национальный корпус (BNC), орфографические транскрипции и другие.
Corpus linguistics is the study of language based on large collections of "real life" language
use stored in corpora (or corpuses)—computerized databases created for linguistic research. It is
also known as corpus-based studies.
Corpus linguistics is viewed by some linguists as a research tool or methodology and by
others as a discipline or theory in its own right. Sandra Kübler and Heike Zinsmeister state in their
book, "Corpus Linguistics and Linguistically Annotated Corpora," that "the answer to the question
whether corpus linguistics is a theory or a tool is simply that it can be both. It depends on how
corpus linguistics is applied." Although the methods used in corpus linguistics were first adopted
in the early 1960s, the term itself did not appear until the 1980s.
"Corpus linguistics is...a methodology, comprising a large number of related methods
which can be used by scholars of many different theoretical leanings. On the other hand, it cannot
be denied that corpus linguistics is also frequently associated with a certain outlook on language.
At the centre of this outlook is that the rules of language are usage-based and that changes occur
when speakers use language to communicate with each other. The argument is that if you are
interested in the workings of a particular language, like English, it is a good idea to study language
in use. One efficient way of doing this is to use corpus methodology...."
"In the context of the classroom the methodology of corpus linguistics is congenial for
students of all levels because it is a 'bottom-up' study of the language requiring very little learned
expertise to start with. Even the students that come to linguistic enquiry without a theoretical
apparatus learn very quickly to advance their hypotheses on the basis of their observations rather
than received knowledge, and test them against the evidence provided by the corpus."
"To make good use of corpus resources a teacher needs a modest orientation to the routines
involved in retrieving information from the corpus, and—most importantly—training and
experience in how to evaluate that information."
"Quantitative techniques are essential for corpus-based studies. For example, if you wanted
to compare the language use of patterns for the words big and large, you would need to know how
many times each word occurs in the corpus, how many different words co-occur with each of these
adjectives (the collocations), and how common each of those collocations is. These are all
quantitative measurements....
"A crucial part of the corpus-based approach is going beyond the quantitative patterns to
propose functional interpretations explaining why the patterns exist. As a result, a large amount of
effort in corpus-based studies is devoted to explaining and exemplifying quantitative patterns."
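The quantitative measurements described in this passage can be illustrated with a minimal Python sketch. It counts how often big and large occur, how many different words follow each adjective, and how common each of those collocations is. The sample text and the helper names tokenize and right_collocates are invented for illustration and are not BNC data; a real study would read corpus files and would normally use a wider collocation window.

```python
from collections import Counter
import re

# Toy sample standing in for corpus text (invented for illustration, not BNC data).
SAMPLE = (
    "A large number of people saw the big dog. "
    "The big house had a large number of rooms. "
    "A large amount of money bought a big car."
)


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def right_collocates(tokens, node):
    """Count the word immediately following each occurrence of `node`."""
    return Counter(
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == node
    )


tokens = tokenize(SAMPLE)
frequency = Counter(tokens)  # how many times each word occurs

for adjective in ("big", "large"):
    collocates = right_collocates(tokens, adjective)
    # Print: word, its frequency, number of distinct collocates, collocate counts.
    print(adjective, frequency[adjective], len(collocates), collocates.most_common())
```

In this toy sample both adjectives occur equally often, but large is followed by quantity nouns (number, amount) while big is followed by concrete nouns. Explaining such a difference is the qualitative step that the quoted passage goes on to describe.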
"In corpus linguistics quantitative and qualitative methods are extensively used in
combination. It is also characteristic of corpus linguistics to begin with quantitative findings, and
work toward qualitative ones. But...the procedure may have cyclic elements. Generally it is
desirable to subject quantitative results to qualitative scrutiny—attempting to explain why a
particular frequency pattern occurs, for example. But on the other hand, qualitative analysis
(making use of the investigator's ability to interpret samples of language in context) may be the
means for classifying examples in a particular corpus by their meanings; and this qualitative
analysis may then be the input to a further quantitative analysis, one based on meaning...."
The British National Corpus (BNC) is a 100-million-word text corpus of samples of written
and spoken English from a wide range of sources. The corpus covers British English of the late
20th century from a wide variety of genres, with the intention that it be a representative sample of
spoken and written British English of that time. It is used in corpus linguistics as a source of
data for analysing language use.
The project to create the BNC involved the collaboration of three publishers (Oxford
University Press as the lead collaborator, together with Longman and W. & R. Chambers), two
universities (the University of Oxford and Lancaster University), and the British Library. The
creation of the BNC started in 1991 under the management of the BNC consortium, and the project
was finished by 1994. No new samples have been added since 1994, but the BNC
underwent slight revisions before the release of the second edition, BNC World (2001), and the
third edition, BNC XML Edition (2007).
The BNC was the vision of computational linguists whose goal was a corpus of modern (at
the time of building the corpus), naturally occurring language in the form of speech and text or
writing that could be analyzed by a computer. Hence, it was compiled as a general corpus to pave
the way for automatic search and processing in the field of corpus linguistics. One of the ways the
BNC was to be differentiated from existing corpora at that time was to open up the data not just to
academic research, but also to commercial and educational uses.
The corpus was restricted to just British English, and was not extended to cover World
Englishes. This was partly because a significant portion of the cost of the project was being funded
by the British government, which was logically interested in supporting documentation of its own
linguistic variety. Because of its potentially unprecedented size, the BNC required funds from the
commercial and academic institutions as well. In turn, BNC data then became available for
commercial and academic research.
The BNC is a monolingual corpus, as it records samples of language use in British English
only, although occasionally words and phrases from other languages may also be present. It is a
synchronic corpus, as only language use from the late 20th century is represented; the BNC is not
meant to be a historical record of the development of British English over the ages. From the
beginning, those involved in the gathering of written data sought to make the BNC a balanced
corpus, and hence looked for data in various mediums.
90% of the BNC is samples of written language use. These samples were extracted from
regional and national newspapers, published research journals or periodicals from various
academic fields, fiction and non-fiction books, other published material, and unpublished material
such as leaflets, brochures, letters, essays written by students of differing academic levels,
speeches, scripts, and many other types of texts.
The remaining 10% of the BNC is samples of spoken language use. These are presented
and recorded in the form of orthographic transcriptions. The spoken corpus consists of two parts.
One part is demographic, containing the transcriptions of spontaneous natural conversations
produced by volunteers of various age groups and social classes, originating from different
regions. These conversations were produced in different situations, ranging from formal business or
government meetings to conversations on radio shows and phone-ins. The aim was to account for
both the demographic distribution of spoken language and linguistically significant
variation due to context.
The other part involves context-governed samples such as transcriptions of recordings
made at specific types of meeting and event. All the original recordings transcribed for inclusion
in the BNC have been deposited at the British Library Sound Archive. The majority of the
recordings are freely available from the Oxford University Phonetics Laboratory.
The nature of the BNC as a large mixed corpus renders it unsuitable for the study of highly
specific text-types or genres, as any one of them is likely to be inadequately represented and may
not be recognisable from the encoding. For example, there are very few business letters and service
encounters in the BNC, and those wishing to explore their specific conventions would do better to
compile a small corpus including only texts of those types.
There are two general ways in which corpus material can be used in language
teaching. Firstly, publishers and researchers could use corpus samples to create language-learning
references, syllabuses and other related tools or materials. For example, the BNC was used by a
group of Japanese researchers as a tool in their creation of an English-language–learning website
for learners of English for specific purposes (ESP). The website enabled English-language learners
to download frequently heard and used sentence patterns, and then base their own usage of the
English language on these sentence patterns. The BNC served as the source from which the
frequently used expressions were extracted. In using this website, users thus relied on reference
samples from the BNC to guide them in their learning of the English language. Such creation of
materials that facilitate language-learning typically involves the use of very large corpora
(comparable to the size of the BNC), as well as advanced software and technology. A large amount
of money, time, and expertise in the field of computational linguistics are invested in the
development of such language-learning material.
Secondly, the analysis of the corpus can be incorporated directly into the language teaching
and learning environment. With this method, language learners are given the opportunity to
categorize language data from the corpus and subsequently form conclusions about the patterns
and features of their target language from their categorizations. This method involves a greater
amount of work on the part of the language learner and is referred to as “data-driven learning” by
Tim Johns. The corpus data used for data-driven learning is relatively small, and consequently
the generalisations made about the target language may be of limited value.
In general, the BNC is useful as a reference source for the purposes of producing and perceiving
text. It can be used as a reference source when studying the use of individual words in various
contexts, so that learners become familiar with the different ways to use particular words in
suitable contexts. Other than language-related information, encyclopedic information is also
found in the BNC. Learners perusing data from the BNC are also introduced to British cultural
features and stereotypes.
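One common way to study an individual word in its various contexts is a keyword-in-context (KWIC) concordance, which lines up every occurrence of the word with a few words on either side. The following minimal Python sketch shows the idea. The sample sentences and the function name kwic are invented for illustration and are not drawn from the BNC or from any particular concordancing tool.

```python
import re

# Invented sentences for illustration; they are not taken from the BNC.
SAMPLE = (
    "She made a strong case for reform. "
    "He drinks strong tea every morning. "
    "There was a strong wind on the coast."
)


def kwic(text, node, width=2):
    """Return (left, node, right) context tuples for each occurrence of `node`."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


concordance = kwic(SAMPLE, "strong")
for left, node, right in concordance:
    # Right-align the left context so the node word lines up in a column.
    print(f"{left:>15} | {node} | {right}")
```

A learner reading the aligned lines can sort the hits into groups (for example, strong with abstract nouns versus strong with drinks or weather) and draw conclusions from those groupings, which is the kind of categorisation that data-driven learning relies on. This simple sketch ignores sentence boundaries when collecting context.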
REFERENCES
1. Bednarek, Monika. 2008. Semantic preference and semantic prosody re-examined. Corpus Linguistics and Linguistic Theory 4(2), 119–40.
2. Behrens, Heike (ed.). 2008. Corpora in language acquisition research: history, methods, perspectives. Amsterdam, Philadelphia: John Benjamins.
3. Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work. John Benjamins.
4. Leech, Geoffrey, Marianne Hundt, Christian Mair, and Nicholas Smith. 2012. Change in Contemporary English: A Grammatical Study. Cambridge University Press.
5. Sinclair, John McHardy. 2004. How to Use Corpora in Language Teaching. John Benjamins.
6. Kübler, Sandra, and Heike Zinsmeister. 2015. Corpus Linguistics and Linguistically Annotated Corpora. Bloomsbury.
7. Language and Linguistics Compass 3 (2009), 1–17. doi:10.1111/j.1749-818x.2009.00149.x. Blackwell Publishing Ltd.
