CorCenCC – National Corpus of Contemporary Welsh

I CorCenCC¹

Rhoddwn ein sgyrsiau’n raddol, ein haraith
A mân eiriau’r heol;
O gadw’r stôr ddigidol
Hawliwn hwy, a’u galw’n ôl.

Welcome to CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes) – the National Corpus of Contemporary Welsh. CorCenCC is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and indeed anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication. Click on the Explore or Download tabs to access the corpus and start to investigate the Welsh language as it is actually used. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels. To find out more about the CorCenCC project, corpus and Y Tiwtiadur, click on the questions below.

An overview of what CorCenCC is, and how it can be used, can be found here. To read the CorCenCC project report, click here.

Version 2.0.0 of the CorCenCC query tools was developed by Laurence Anthony. For more information on Laurence’s other work (including a range of corpus-based tools), visit here.

What is a corpus?

A corpus is an electronic database of words. It is different from a dictionary, because when a user searches for a word in a corpus, rather than seeing a definition they see examples of the word in excerpts from a variety of texts (which might include conversations, books, blog entries, etc.), exactly as they were used by the original author or speaker. Users can also find out, for instance, how frequently a specific word is used, or what are the most frequently used words in specific kinds of communication (or across the entire corpus). This provides researchers with evidence of how language is actually used (rather than how we intuitively think it’s used). It also enables the creation of tailored texts or materials to help with language learning. Every word in a corpus is tagged with, for example, grammatical information (i.e. part of speech – noun, verb, etc.) and semantic information (relating to themes and topics), and information is provided about where each language excerpt is from (e.g. text type, speaker location). This makes a corpus a valuable electronic tool which allows us to explore and to better understand our language.
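The two kinds of lookup described above, seeing a word in its original context and counting how often words occur, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sample sentences, the `SAMPLES` structure and the function names are invented for this example and are not taken from CorCenCC or its query tools.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: invented sentences, each with a text-type label.
SAMPLES = [
    {"text": "Mae hi'n braf heddiw", "type": "spoken"},
    {"text": "Roedd hi'n braf ddoe hefyd", "type": "written"},
    {"text": "Bydd hi'n oer yfory", "type": "electronic"},
]

def tokenize(text):
    """Lower-case the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def concordance(samples, keyword, window=2):
    """Return (left context, keyword, right context, text type) for each hit."""
    hits = []
    for sample in samples:
        tokens = tokenize(sample["text"])
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right, sample["type"]))
    return hits

def frequencies(samples):
    """Count how often each token occurs across all samples."""
    return Counter(token for s in samples for token in tokenize(s["text"]))

hits = concordance(SAMPLES, "braf")
freq = frequencies(SAMPLES)

for left, word, right, text_type in hits:
    print(f"{left:>12} [{word}] {right:<12} ({text_type})")
print(freq.most_common(3))
```

A real corpus query tool would also use the part-of-speech and semantic tags attached to each word, which this sketch leaves out.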

What is CorCenCC?

The CorCenCC corpus contains over 11 million words from written, spoken and electronic (online, digital texts) Welsh language sources, taken from a range of genres, language varieties (regional and social) and contexts. The contributors to CorCenCC are representative of the more than half a million Welsh speakers in the country.

The creation of CorCenCC was a community-driven project, which offered users of Welsh an opportunity to be proactive in contributing to a Welsh language resource that reflects how Welsh is currently used.

To make CorCenCC as representative as possible, the project team decided on a framework for collecting language samples. Extracts were collected from sources including, for example, journals, emails, sermons, road signs, TV programmes, meetings, magazines and books. Conversations were recorded by the research team, and a specially designed crowdsourcing app enabled Welsh speakers in the community to record and upload samples of their own language use to the corpus. The published corpus therefore contains data from Welsh speakers from all kinds of backgrounds, abilities and contexts, capturing how Welsh is truly used today across the country.

The CorCenCC team consulted potential users of the corpus at all stages of development. This informed the content and design of the corpus, maximising its value to a wide range of user groups, from teachers and learners to academic researchers, translators, publishers, policy-makers, language technology developers and others.

Who built CorCenCC?

CorCenCC is an interdisciplinary and multi-institutional project that was funded by the ESRC and AHRC (Grant Ref ES/M011348/1). The CorCenCC project involved 4 academic institutions (Cardiff, Swansea, Lancaster and Bangor Universities), and an international team of researchers, consultants and advisors representing community, industry and academic stakeholders.

The project was led by Dawn Knight, at the Centre for Language and Communication Research, Cardiff University. The full project team comprised:

1 Principal Investigator (PI – Dawn Knight), 2 Co-Investigators (CIs – Steve Morris and Tess Fitzpatrick), who, with the PI, designed the project and form the CorCenCC Management Team, and a total of 7 other CIs and 8 Research Assistants/Associates over the course of the project. In addition, there were 11 advisory board members, 6 consultants, 2 PhD students, 4 undergraduate summer placement students, 4 professional service support staff, 4 project ambassadors and 2 project volunteers. For more details visit the People tab of this website.

What is Y Tiwtiadur?

Y Tiwtiadur is a collection of data-driven teaching and learning tools designed to supplement Welsh language learning at all ages and levels. Y Tiwtiadur contains four distinct corpus-based exercises:

- a Gap Filling (Cloze) exercise allowing teachers to delete words from a text at specified intervals (e.g. every 7th word) to encourage or assess learners’ comprehension abilities and prediction strategies;
- a Vocabulary Profiler exercise that enables the grading of texts by word frequency;
- a Word Identification exercise testing learners’ ability to guess a word in context; and
- a Word-in-Context exercise that facilitates intensive work on a specified vocabulary item.
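As a rough illustration of the Gap Filling idea in the list above (deleting words at a fixed interval), a minimal Python sketch might look like the following. It is not Y Tiwtiadur's implementation: the function name `make_cloze`, its parameters and the placeholder text are assumptions made for this example, and it simply splits on whitespace, so punctuation attached to a word is blanked along with it.

```python
def make_cloze(text, interval=7, blank="_____"):
    """Blank out every `interval`-th word; return the gapped text and the answers."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    words = text.split()
    answers = []
    # Indices interval-1, 2*interval-1, ... are the 7th, 14th, ... words.
    for i in range(interval - 1, len(words), interval):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers

# Placeholder text (number words make the deleted positions easy to check).
source = ("one two three four five six seven eight nine ten "
          "eleven twelve thirteen fourteen fifteen")
gapped, answers = make_cloze(source, interval=7)
print(gapped)
print(answers)
```

Returning the removed words alongside the gapped text gives a teacher an answer key, and changing `interval` makes the exercise denser or sparser.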

The tools in Y Tiwtiadur all use information from the 11-million-word CorCenCC corpus. All the language in the corpus is from real-life communication, so the word frequencies and the language samples in Y Tiwtiadur reflect how Welsh is really used across a range of data types, from different speakers/contributors, in different situations, and discussing a range of topics. Some of the tools give options to work with specific sections of data, based on topic or data type, for example.
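One common way to grade a text by word frequency is to sort its words into frequency bands using a ranked word list. The sketch below shows that general technique only; it is not taken from Y Tiwtiadur. The ten-word `RANKED` list, the band size and the function names are hypothetical, and the ordering is invented for illustration rather than drawn from CorCenCC frequency data.

```python
from collections import Counter

# Hypothetical frequency-ranked word list, most frequent first (invented order).
RANKED = ["y", "yn", "a", "i", "mae", "o", "ar", "hi", "braf", "heddiw"]

def build_bands(ranked, band_size):
    """Map each listed word to a 1-based band: band 1 holds the most frequent words."""
    return {word: rank // band_size + 1 for rank, word in enumerate(ranked)}

def profile(text, ranked=RANKED, band_size=5):
    """Count how many words of the text fall in each band (None means off-list)."""
    bands = build_bands(ranked, band_size)
    return Counter(bands.get(word) for word in text.lower().split())

counts = profile("Mae hi yn braf heddiw iawn")
total = sum(counts.values())
for band, count in sorted(counts.items(), key=lambda kv: (kv[0] is None, kv[0] or 0)):
    label = f"band {band}" if band is not None else "off-list"
    print(f"{label}: {count}/{total} words")
```

A text whose words fall mostly in the first band relies on very common vocabulary, while a large off-list share suggests that learners will meet many less frequent words.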

How do I cite CorCenCC?

- To cite the CorCenCC corpus:
  - Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., Thomas, E-M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey-Walsh, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M. and Scannell, K. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh. Cardiff University, http://doi.org/10.17035/d.2020.0119878310

- To cite the CorCenCC project report:
  - Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E. M. (2020). The National Corpus of Contemporary Welsh: Project Report | Y Corpws Cenedlaethol Cymraeg Cyfoes: Adroddiad y Prosiect. arXiv:2010.05542, October 2020.
- CorCenCC’s infrastructure and crowdsourcing app:
  - Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
- CorCenCC’s part-of-speech (POS) tagger ‘CyTag’:
  - Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
- CorCenCC’s semantic tagger ‘CySemTagger’:
  - Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  - Piao, S., Rayson, P., Knight, D., Watkins, G. and Donnelly, K. (2017). Towards a Welsh Semantic Tagger: Creating Lexicons for A Resource Poor Language. In Proceedings of The Corpus Linguistics 2017 Conference, held from 24-28 July 2017 at University of Birmingham, Birmingham, UK.
- CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’:
  - Davies, J., Thomas, E-M., Fitzpatrick, T., Needs, J., Anthony, L., Cobb, T. and Knight, D. (2020). Y Tiwtiadur. [Digital Resource]. Available at: https://ytiwtiadur.corcencc.org
- CorCenCC’s word frequency lists ‘Yr Amliadur’:
  - Knight, D., Morris, S., Tovey-Walsh, B., Fitzpatrick, T. and Anthony, L. (2020). Yr Amliadur: Frequency Lists for Contemporary Welsh. Cardiff University, http://doi.org/10.17035/d.2020.0120164107

¹ This englyn was written for the launch of CorCenCC on 24 February 2017 by poet and project advisory group member, Dr Emyr Davies.