Corpora and Representativeness
3-4 May 2018 Nanterre (France)

Invited speakers

This year, we are very pleased to have two invited speakers with us:



  • Thomas Egan, Professor Emeritus, Inland Norway University of Applied Sciences 




“Some perils and pitfalls of non-representativeness”


After some general introductory remarks on representativeness, I revisit in this paper some of the arenas where I myself have encountered problems in connection with instances of non-representativeness, illustrating the discussion with examples from corpora both small and large, both historical and contemporary, and both monolingual and contrastive. I discuss some text types that compilers of general corpora would be wise to steer clear of, before going on to describe some less immediately obvious phenomena that lie in wait to mislead the unwary researcher.


  • Dawn Knight, School of English, Communication and Philosophy, Cardiff University


“Representativeness in CorCenCC: corpus design in minoritised languages”

Corpus design and construction in a minoritised language context pose interesting challenges, but also present opportunities not always available to developers of corpora for larger languages. During this presentation, I will examine these challenges and opportunities in more detail, with reference to the design and construction of the ESRC/AHRC-funded CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – National Corpus of Contemporary Welsh) corpus. Its ambition to create a large-scale open-source corpus of contemporary Welsh, one with a functional design informed, from the outset, by representatives of all engaged academic and community user groups, makes CorCenCC a highly relevant point of reference for the lines of discussion that will be presented.

The construction of CorCenCC began in 2016 and, when complete in 2019, it will be the first general corpus of the Welsh language. It will include data from a range of discourse contexts (from formal contexts, e.g. political documents, televised interviews and formal letters, to less formal ones, e.g. informal emails, phone calls and text messages) and geographical locations in Wales. Data will also be sampled from a range of speakers and users of Welsh: from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect Wales’ diversity not only of text types but also of Welsh speakers themselves. This composition will allow users to make generalised observations about language use (i.e. not restricted to a specific discourse context or domain). CorCenCC will contain 10 million words by the end of the project, comprising 4 million each from spoken and written discourse and 2 million from digitally mediated discourse (e-language).

Defining and maintaining ‘balance’ and ‘representativeness’ in the design and construction of CorCenCC, and ensuring that all contemporary speakers and users are represented, including those in language communities located in areas and domains with different densities of both users and usage, is a major challenge. This challenge is compounded when using unplanned, spontaneous contributions such as those obtained through crowdsourcing, an approach that the data collection process for CorCenCC is pioneering. These challenges will be unpacked within this presentation. Particular attention will be given to how we are ensuring that the design frame for CorCenCC truly reflects the current social, cultural and geographical factors bearing upon contemporary Welsh as well as the domains in which it is used, with a view to providing guidelines for good practice that can be adapted to other minoritised language contexts. Metadata plays a key role in organising the ways in which a language corpus can be meaningfully analysed, so I will also demonstrate a streamlined, searchable database system for recording information about data collection and metadata as part of this discussion.

The presentation will end with an examination of some of the potential applications of CorCenCC and how priorities and engagement within a minoritised language context differ from those in major languages.


Dr Dawn Knight is a Reader in Applied Linguistics at the Centre for Language and Communication Research (CLCR), Cardiff University. Her research interests lie predominantly in the areas of corpus linguistics, discourse analysis, e-language, multimodality and the sociolinguistic contexts of communication. Dawn is currently leading a major multi-institutional team of academics, software engineers and Welsh language experts planning to construct the large-scale, open-source National Corpus of Contemporary Welsh (CorCenCC). The creation of CorCenCC is community-driven, with impact being generated through a user-informed design that harnesses opportunities afforded by mobile technologies, specifically crowdsourcing and community collaboration.





