Corpora and Representativeness

AFLiCo JET 2018

3-4 May 2018 Nanterre (France)

SCIENTIFIC STATEMENT

The French Cognitive Linguistics Association (AFLiCo) invites you to submit paper proposals for its next workshop, AFLiCo JET 2018. AFLiCo workshops provide a forum for high-quality research in cognitive linguistics and, more generally, usage-based approaches to language. The topic of this year’s workshop is “corpora and representativeness”.

With the advent of corpus linguistics, the use of corpora has become central in linguistics. One underlying assumption is that the corpus is representative of the linguistic phenomenon under scrutiny. Of course, corpus representativeness itself is a methodological construct (Leech 2006, Habert 2010): language corpora are tools constructed by linguists, and their structural limitations constrain and condition the validity of linguistic findings.

Here is an open-ended list of issues that we wish to address in the workshop:

What does it mean for a corpus to represent language use, and what are the relevant criteria?
To what extent does representativeness rely on intuition, since it cannot be fully gauged empirically?
Because a corpus cannot be representative of all features of language use, how can we address bias in sampling?
Does representativeness necessarily entail balance?
Can the design of a corpus be totally free from any form of theorization?

Solutions to these complex issues may be reflected in the development and use of different types of corpora.

The representativeness of written corpora may rely on a variety of features. According to Biber (1993: 244), “[r]epresentativeness refers to the extent to which a sample includes the full range of variability in a population.” Variability can be defined as the interaction between situational parameters (e.g. format, setting, author, addressee, purposes, topics) and linguistic, distributional parameters (e.g. frequencies of word classes). Sampling can be based on extralinguistic (sociological, demographic) criteria (Crowdy 1993). Balance, i.e. a proportion of sampled elements that reflects their frequency in the targeted language, is claimed to characterize some corpora (e.g. the Brown Corpus (Francis & Kučera 1979) and the Lancaster-Oslo-Bergen corpus (Johansson et al. 1978)), though it is not a prerequisite.

Although increasingly large corpora, including monitor corpora, can be compiled from the Web (Baroni et al. 2009), large size is not necessarily a priority. “Big is beautiful” in the realm of corpora is, perhaps, a “delusion” (Svartvik 1992: 10). Large corpora are often presented as an ideal but, in practice, “small” corpora can go a long way in such domains as English language teaching (Ghadessy, Henry, and Roseberry 2001), the study of metaphors (Cameron and Deignan 2003), and dialectology (Hollmann and Siewierska 2007; Boas and Schuchard 2012). Parallel corpora, i.e. collections of original texts and their translations in one or more languages, are particularly useful in areas of research such as contrastive linguistics, translation studies and computational linguistics (Kenning 2010), but their alleged lack of representativeness has called for inventive ways of using them (Nádvorníková 2017).

In the area of spoken corpora, collecting data that represents the variability of the multiple dimensions of speech (phonology and phonetics, prosody, gesture) remains a challenge. Collecting, transcribing, annotating and analysing data is a slow and sometimes complicated task. Although phonological and prosodic annotations can be partially systematized (Bertrand et al. 2008), technological advances are yet to be made in the automatic recognition of speech and gesture in interactional contexts. Automatic motion capture technologies for gesture research are promising (Priesters & Mittelberg 2013, Guez et al. 2013) but still at an early stage of development. As part of initiatives such as the TGIR Huma-Num Multi-Com – CORLI Consortium, multimodality researchers collaborate to develop shared, harmonised practices for the collection, transcription and archiving of spoken corpora.

References

Baroni, Marco et al. (2009). “The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora.” In: Language Resources and Evaluation 43.3, pp. 209–226.

Biber, Douglas (1993). “Representativeness in Corpus Design.” In: Literary and Linguistic Computing 8.4, pp. 241–257.

Boas, Hans C. and Sarah Schuchard (2012). “A corpus-based analysis of preterite usage in Texas German.” In: Proceedings of the 34th Annual Meeting of the Berkeley Linguistics Society.

Cameron, Lynne and Deignan, Alice. (2003). Combining Large and Small Corpora to Investigate Tuning Devices Around Metaphor in Spoken Discourse. Metaphor and Symbol, 18(3): 149-160.

Crowdy, Steve (1993). “Spoken Corpus Design.” In: Literary and Linguistic Computing 8.4, pp. 259–265.

Francis, W. Nelson and Henry Kučera (1979). Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Department of Linguistics. Brown University. URL: http://www.hit.uib.no/icame/brown/bcm.html.

Guez, J., D. Boutet, C.-W. Hsieh, M.-H. Tramus, C.-Y. Chen, C. Vincent, C. Chabalier, J. Châteauvert, J. Lubek, F. Catteau, I. Renna, S. Delacroix (2013). “Le projet CIGALE (Capture et Interaction avec des Gestes Artistiques, Langagiers et Expressifs) : une plateforme transdisciplinaire de création et d'exploration du sens.” International Conference Le sujet digital : inscription, excription, téléscription, November 2013, Saint-Denis, France.

Habert, B. (2010). Des corpus représentatifs : de quoi, pourquoi, comment ?. In M. Bilger (ed.), Linguistique sur corpus – études et réflexions, Perpignan : Presses Universitaires de Perpignan.

Hollmann, Willem B. and Anna Siewierska (2007). “A construction grammar account of possessive constructions in Lancashire dialect: some advantages and challenges.” In: English Language and Linguistics 11.2, pp. 407–424. DOI: 10.1017/S1360674307002304.

Johansson, Stig, Geoffrey Leech, and Helen Goodluck (1978). Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Department of English. University of Oslo. URL: http://clu.uni.no/icame/manuals/LOB/INDEX.HTM

Kenning, M.-M. (2010). What are parallel and comparable corpora and how can we use them? In A. O’Keeffe and M. McCarthy (eds.), The Routledge Handbook of Corpus Linguistics. London: Routledge.

Leech, G. (2006). New resources, or just better old ones? The Holy Grail of representativeness. Language and Computers, 59 (1), 133–149.

Nádvorníková, O. (2017). Pièges méthodologiques des corpus parallèles et comment les éviter, Corela [Online], HS-21 | 2017, Online since 20 February 2017, http://corela.revues.org/4810 ; DOI : 10.4000/corela.4810

Priesters, M. A. & I. Mittelberg. (2013). Individual differences in speakers' gesture spaces: Multi-angle views from a motion-capture study. Proceedings of the Tilburg Gesture Research Meeting (TiGeR), June 19-21, 2013.

Svartvik, J. (1992). Corpus linguistics comes of age, in J. Svartvik (ed.) Directions in Corpus Linguistics, Proceedings of Nobel Symposium 82, Berlin and New York: Mouton de Gruyter, 7-16.
