There are different ways in which CorCenCC can be accessed. Click the links below to:
- Access full corpus
- Access word frequency lists
- Explore CorCenCC online
Access full corpus
The published CorCenCC dataset contains 13,487,210 tokens (circa 11 million words). A token is the smallest unit in a corpus. Tokens comprise words (i.e. items starting with a letter of the alphabet) and nonwords (i.e. items starting with a character that is not a letter of the alphabet).
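The distinction between words and nonwords can be illustrated with a minimal sketch. It assumes the text has already been split into tokens, and it uses Python's Unicode-aware `str.isalpha` as a stand-in for "a letter of the alphabet". The function names and the sample tokens are hypothetical, and the rule applied when the corpus itself was counted may differ in detail.

```python
def is_word(token: str) -> bool:
    """True if the token starts with a letter (Unicode-aware, so 'ŷ' counts)."""
    return bool(token) and token[0].isalpha()


def count_tokens(tokens: list[str]) -> dict[str, int]:
    """Count all tokens, split into words and nonwords."""
    words = sum(1 for token in tokens if is_word(token))
    return {"tokens": len(tokens), "words": words, "nonwords": len(tokens) - words}


# Hypothetical, already-tokenised sample (not taken from the corpus).
sample = ["Mae", "hi", "yn", "braf", "heddiw", ".", "2021", "!"]
print(count_tokens(sample))  # {'tokens': 8, 'words': 5, 'nonwords': 3}
```

On this reading, punctuation marks and numerals count towards the token total but not towards the word total, which is why the token count is higher than the word count.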
The data in CorCenCC represents a wide range of contexts, genres and topics. For a detailed breakdown of this composition, see Knight, Morris and Fitzpatrick (2021). The data has, as far as possible, been anonymised using a combination of manual and automated techniques, and has been fully tagged for part-of-speech (POS) and semantic category. The POS and semantic tagging was carried out using the CyTag and SemCyTag tools, available from CorCenCC’s GitHub site.
To request a copy of the CorCenCC corpus, please click here.
The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and are thus freely available for use by professional communities and individuals with an interest in language. When reporting information derived from the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged. Citation details are available here. Full documentation for this corpus, including details of the CorCenCC transcription conventions, metadata descriptors and corpus taxonomy, is also available on CorCenCC’s GitHub site.
Existing corpus analysis tools can be used to carry out some basic analyses of CorCenCC (although please note that they may not support all functionalities for Welsh-language data). Such tools include AntConc, Wmatrix, CQPweb and #LancsBox, all of which are freely available.
Access word frequency lists
A range of word frequency lists from the CorCenCC corpus (Yr Amliadur) are available here. These include:
- Top 100 words in CorCenCC (rank ordered list)
- Top 1000 words in CorCenCC (ordered alphabetically)
- Top 100 lemmas in CorCenCC (rank ordered list)
- Top 1000 lemmas in CorCenCC (ordered alphabetically)
- Top 100 lemmas in CorCenCC (open-class words only)
- Top 1000 words in CorCenCC (open-class words only; ordered alphabetically)
- Top 500 nouns in CorCenCC (rank ordered list)
- Top 500 verbs in CorCenCC (rank ordered list)
- Top 500 adjectives in CorCenCC (rank ordered list)
- Top 50 adverbs in CorCenCC (rank ordered list)
- Top 50 interjections in CorCenCC (rank ordered list)
- Top 100 open-class words in the written component of CorCenCC (rank ordered list)
- Top 100 open-class words in the spoken component of CorCenCC (rank ordered list)
- Top 100 open-class words in the e-language component of CorCenCC (rank ordered list)
Click here to request a copy of the full frequency lists. These frequency lists include those listed above, in addition to the following:
- All frequency data, sorted alphabetically (Excel file)
- All frequency data, in frequency order (Excel file)
- The 5000 most frequent words, with separate sheets for each 500-word frequency band (Excel file)
These word frequency lists show which words and lemmas are most often used in the Welsh language (generally and within/across specific modes of communication). Details on how to cite these lists are available here.
Explore CorCenCC online
A beta version of CorCenCC’s bilingual corpus query tools, with complete user guide, is available through the Explore tab of this website. This includes the following functionalities:
- Simple Query: used to explore any word and/or lemma form in the corpus, and one or more part-of-speech (POS) tags, mutation types, or semantic category tags of a specific word and/or lemma. A randomised selection of results is presented in a KWIC (Key Word in Context) output. Results can then be filtered by mode, geographical area, context, genre, topic, target audience and source.
- Full Query: used to search for longer sequences of patterns (multi-word expressions) separated by spaces, using CorCenCC’s bespoke query syntax. Results are presented in a KWIC (Key Word in Context) output, which can be filtered according to mode, geographical area, context, genre, topic, target audience and source.
- Frequency List: produces a list of words or lemmas in the corpus, ranked according to frequency of occurrence.
- N-Gram Analysis: lists patterns of n-grams/clusters of 2-7 words, lemmas, or POS in the corpus, ranked according to frequency of occurrence.
- Keyword Analysis: displays words that are unusually frequent in one sub-set of the corpus compared with a different ‘reference’ sub-set of the corpus.
- Collocation Analysis: displays information on the relationships between word types that appear together within a given context window. [Functionality available soon]
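To make three of these functionalities more concrete, the following is a minimal sketch of a frequency list, an n-gram count and a KWIC display over a list of tokens. It is not the implementation behind the online query tools: the function names and the sample tokens are hypothetical, and the sketch works on plain lower-cased tokens rather than on lemmas, POS tags or CorCenCC’s bespoke query syntax.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return (n-gram, count) pairs for sequences of n tokens, most frequent first."""
    if not 2 <= n <= 7:  # the range of n-gram lengths described above
        raise ValueError("n must be between 2 and 7")
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams).most_common()


def kwic(tokens, node, window=3):
    """Return (left context, node, right context) for every hit of `node`."""
    hits = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Hypothetical sample, not taken from the corpus.
tokens = "mae hi yn braf ac mae hi yn oer".split()

print(frequency_list(tokens)[:3])
print(ngrams(tokens, 3)[:1])
for left, node, right in kwic(tokens, "yn", window=2):
    print(f"{left:>10} [{node}] {right}")
```

Unlike the Simple Query, which presents a randomised selection of results, this sketch returns every hit in corpus order.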
CorCenCC’s accompanying pedagogic tools are available through the Y Tiwtiadur tab of this website.
All data in CorCenCC has been fully tagged for part-of-speech (POS) and semantic category. These tags are fully searchable within the corpus and, in the case of Simple and Full Queries, POS tags are also colour coded to ease the examination of patterns in query results. All data is also categorised according to its context of use, genre, topic etc., enabling users to examine patterns within/across specific types of text and demographic information in the corpus. Details of the tags and taxonomies used are available in the user guide on the main query tools page and via CorCenCC’s GitHub site.
Results from analyses using the query tools may contain tags where data has been anonymised, or (for spoken data) where transcription conventions have been used. Anonymisation tags include:
- Personal names: <anon> enwg1 </anon> (first male name); <anon> enwb1 </anon> (first female name)
- Phone numbers: <anon> Rhif ffôn </anon>
- Email addresses: <anon> cyfeiriad e-bost </anon>
- Personal addresses: <anon> cyfeiriad </anon>
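When working with exported query results, these tags can be located with a regular expression. The sketch below is one possible approach rather than part of the CorCenCC tools. The function names are hypothetical, and the pattern for personal names (`enwg` or `enwb` followed by a number) is an assumption generalised from the examples listed above.

```python
import re

# Matches <anon> ... </anon>, with or without spaces inside the tag.
ANON = re.compile(r"<anon>\s*(.*?)\s*</anon>")

# Fixed placeholders, as listed above.
FIXED_LABELS = {
    "Rhif ffôn": "phone number",
    "cyfeiriad e-bost": "email address",
    "cyfeiriad": "personal address",
}


def classify(placeholder: str) -> str:
    """Map an anonymisation placeholder to a description of what it replaced."""
    # Assumption: names are numbered, e.g. enwg1, enwb3.
    if re.fullmatch(r"enwg\d+", placeholder):
        return "male name"
    if re.fullmatch(r"enwb\d+", placeholder):
        return "female name"
    return FIXED_LABELS.get(placeholder, "unknown")


def find_anonymised(text: str) -> list[tuple[str, str]]:
    """Return (placeholder, description) for every anonymisation tag in the text."""
    return [(m.group(1), classify(m.group(1))) for m in ANON.finditer(text)]


line = "<S1> Ti’n cofio hyna <anon>enwb3</anon>?"
print(find_anonymised(line))  # [('enwb3', 'female name')]
```

Placeholders that the sketch does not recognise are labelled "unknown" rather than guessed at, since the full set of anonymisation tags is documented in the user guide, not here.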
Spoken data was transcribed using CorCenCC’s bespoke transcription conventions. Examples include:
<S4> Rydym ni yn defnyddio ein trwyna’ i arogli. <arogli i mewn yn sydyn> Pan ym mae ‘da fi anwyd mae fy nhrwyn i’n mynd yn goch ac <=> mae </=> mae fel yn rhedag trwy’r amser.
Here, <S4> identifies the speaker, and <=> mae </=> marks a word that is repeated in the conversation.
<S1> Boeth. A’r hen athrawon ‘na’n mynd fyny ac i lawr yn mynd <griddfan>.
<S2> <Chwerthin>. Gwrando ar y+
<S1> Ti’n cofio hyna <anon>enwb3</anon>?
<S2> +Gwrando ar y cloc yn tician.
Here, ‘+’ marks the point at which one speaker interrupts another, so that the two talk at the same time: the turn of S2 breaks off at ‘+’ and resumes at the next ‘+’. The use of <anon>enwb3</anon> signals that a personal name has been anonymised. Finally, <Chwerthin> indicates that the speaker is laughing and <griddfan> indicates a groan.
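The conventions illustrated in these examples can be picked out of a transcript line programmatically. The following is a minimal sketch covering only the markers shown above (speaker labels, repeated words and the ‘+’ interruption marker). The function name and the returned field names are hypothetical, the speaker pattern (`S` followed by a number) is an assumption based on the examples, and the full set of conventions is described in the CorCenCC documentation.

```python
import re

SPEAKER = re.compile(r"^<(S\d+)>\s*(.*)$")  # assumption: labels look like <S1>, <S2>, ...
REPEAT = re.compile(r"<=>\s*(.*?)\s*</=>")  # <=> word </=> marks a repeated word


def parse_turn(line: str):
    """Split a transcript line into its speaker label, text and a few markers."""
    match = SPEAKER.match(line.strip())
    if match is None:
        return None  # not a speaker turn
    speaker, text = match.groups()
    return {
        "speaker": speaker,
        "text": text,
        "repeats": REPEAT.findall(text),
        "interrupted": text.endswith("+"),  # turn breaks off at '+'
        "resumes": text.startswith("+"),    # turn picks up again after '+'
    }


# Lines from the examples above (the <S4> line is shortened).
lines = [
    "<S2> <Chwerthin>. Gwrando ar y+",
    "<S1> Ti’n cofio hyna <anon>enwb3</anon>?",
    "<S2> +Gwrando ar y cloc yn tician.",
    "<S4> ac <=> mae </=> mae fel yn rhedag trwy’r amser.",
]
for line in lines:
    print(parse_turn(line))
```

Returning `None` for lines without a speaker label keeps the sketch from guessing at material it does not understand, such as metadata lines in an exported file.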