Job Advert (November 2018) – Research Assistant

‘CorCenCC – Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction’

Applications are invited for a Research Assistant to join an inter-disciplinary and multi-institutional project that will create a large scale, open source corpus of contemporary Welsh language (CorCenCC). The project aims to redefine the scope, design and infrastructure of corpus development methodology, and to create a major Welsh language resource for use in community, commercial, educational and governmental settings.

The successful applicant will have a background in language studies, applied linguistics, and/or Welsh language. Excellent communication skills in written and spoken Welsh and English are essential, as is the ability to communicate professionally via the webpage, social networking technologies and the media.  The postholder will take a central role in the construction of the corpus, with a focus on collecting and processing written and oral language data.

This 3½ year project is funded by the Economic and Social Research Council (ESRC) and Arts and Humanities Research Council (AHRC). The project is led by Dawn Knight, at the Centre for Language and Communication Research, Cardiff University. The academic project team also includes Irena Spasic and Jonathon Morris (Cardiff University); Tess Fitzpatrick, Steve Morris and Alex Lovell (Swansea); Paul Rayson (Lancaster University) and Enlli Thomas (Bangor University).

This is a full-time post for a fixed term period of 8 months (1 January 2019 to 31 August 2019), based at Cardiff University.

Informal enquiries can be made to Dr Dawn Knight.

The successful applicant will be expected to take up the post from 1 January 2019 or as soon as possible thereafter. Interviews will take place w/c 3 December 2018.

Salary: £27,025 – £31,302 per annum (Grade 5)

Date advert posted: Monday, 5 November 2018

Closing date: Thursday, 22 November 2018

To apply, or for more information, click here.


News (2018)

App now on Android! (August 2018)

Remember our crowdsourcing app? To complement the readily available iOS version, we’ve recently been working on creating more ways for Welsh speakers to contribute their data to the corpus. Now you can also contribute data to us via an overhauled web version, or on Android platforms!

One of our goals on CorCenCC is to include data directly from the community in the corpus, in order to ensure that contemporary Welsh is well-represented in the 10 million words. The app gives Welsh speakers the opportunity to record conversations between themselves and others across a range of contexts so that they can be included in the final corpus, and to upload metadata that helps us to categorise such contributions. Crowdsourced data is a relatively new direction, and so we see this as an exciting way to gather contemporary Welsh in addition to more traditional language data collection methods (which you’ll be familiar with if you’ve already seen CorCenCC team members out and about).

Of course, the key ingredient in all of this is your data. Whether it’s chatting with your friends in the pub or the café, your family at the dinner table, or anything else you can think of that you’d like to share, your Welsh is important to us – and we want to see as much of it as possible reflected in CorCenCC and available to the community to see and to learn from. So please do consider taking part as a contributor, or helping us spread the word about the availability of the app. CorCenCC will be a resource for all, so let’s make sure that it reflects all of our Welsh. Our crowdsourcing app is one way to do that.

If you’re interested in contributing via the app, you can register and contribute online here, or download our iOS or Android versions to your device. Please also feel free to email us with any app-based questions.


Whole Project Meeting Review (28/03/18)

On 28th March, 24 members of the CorCenCC team, from the CMT (CorCenCC Management Team) to the PAG (Project Advisory Group), descended on a sun-filled Cardiff for the second annual CorCenCC Whole Project Team Meeting.

The first part of the meeting focused on saying farewell to members old, welcoming members new, and catching up on developments across Work Packages (WP) over the last twelve months. This included working demonstrations of the CorCenCC part-of-speech (POS) tagger, CyTag, and of an early demonstrator of the CorCenCC query tool. A more user-friendly version of the query tool will be on general release in due course – keep checking the website, Twitter and Facebook feeds for more news on this!

The afternoon focused more on our future users and advisory board members: reflecting on the key risks and challenges faced by the project team, how we might engage future users of the corpus, and how we can sustain and extend the fabulous work that is being carried out on CorCenCC into the future.

I am sure that the team will agree that it was an invaluable and motivating meeting: one which allowed us to congratulate all members of the team on their hard work so far, but also to whet people’s appetites for the exciting developments that are soon to come. Watch this space!

Many thanks to all of those who attended the meeting in person and helped to make it such a success – and to those who joined us (bilingually) from afar.

WordNet Cymru

Last month, three of the CorCenCC team members – Irena Spasic, Steven Neale and Dawn Knight – completed work on the WordNet Cymraeg project, which has been underway over a 3-month period in parallel with CorCenCC. WordNet Cymraeg is a lexical database of Welsh content words (nouns, verbs, adjectives and adverbs) grouped together as sets of synonyms, which are then linked to each other according to various lexical and semantic relationships. It follows the same methodology as WordNets in other languages, which have been crucial resources for determining the meaning of words in natural language processing tasks such as word sense disambiguation and text summarisation.

WordNet Cymraeg has been developed over a 3-month period, for which we were very pleased to be funded by the Welsh Government as part of their Grant Cymraeg 2050 scheme. In line with emerging trends in constructing WordNets automatically, we’ve leveraged bilingual dictionary information provided by our friends at the GPC (Geiriadur Prifysgol Cymru) to translate words from the English WordNet to Welsh, and then organised those words into Welsh synonym sets based on the original WordNet structure in English. We’re really happy with our resulting Welsh WordNet, which covers about 67% of what are considered to be the ‘core’ synonym sets for a new language – those 5,000 or so concepts that are the most common, and have the most relationships to other synonym sets.
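For readers curious about how this kind of dictionary-based construction works, here is a minimal sketch in Python. It is not the project's actual code: the synset IDs, the three-entry dictionary and the function names are all hypothetical, chosen only to illustrate the idea of translating English synset members into Welsh, keeping the English synset structure, and measuring how many 'core' synsets end up covered.

```python
from typing import Dict, List, Set

# Toy English synsets: synset ID -> English lemmas.
# (Illustrative IDs only; an assumption, not real WordNet data.)
ENGLISH_SYNSETS: Dict[str, List[str]] = {
    "dog.n.01": ["dog", "domestic dog"],
    "animal.n.01": ["animal", "beast"],
    "run.v.01": ["run"],
}

# Toy hypernym links between English synsets (source -> targets).
HYPERNYMS: Dict[str, List[str]] = {"dog.n.01": ["animal.n.01"]}

# Toy English -> Welsh dictionary standing in for real bilingual data.
EN_TO_CY: Dict[str, List[str]] = {
    "dog": ["ci"],
    "animal": ["anifail"],
    "beast": ["bwystfil", "anifail"],
}


def build_welsh_synsets(
    synsets: Dict[str, List[str]], dictionary: Dict[str, List[str]]
) -> Dict[str, Set[str]]:
    """Translate each lemma and group the results under the original synset ID."""
    welsh: Dict[str, Set[str]] = {}
    for synset_id, lemmas in synsets.items():
        translations: Set[str] = set()
        for lemma in lemmas:
            translations.update(dictionary.get(lemma, []))
        if translations:  # skip synsets with no translatable member
            welsh[synset_id] = translations
    return welsh


def project_relations(
    relations: Dict[str, List[str]], welsh: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Keep only the links whose source and target both exist in Welsh."""
    projected: Dict[str, List[str]] = {}
    for source, targets in relations.items():
        kept = [t for t in targets if t in welsh]
        if source in welsh and kept:
            projected[source] = kept
    return projected


def coverage(welsh: Dict[str, Set[str]], core_ids: Set[str]) -> float:
    """Fraction of the given core synset IDs that have a Welsh synset."""
    return sum(1 for synset_id in core_ids if synset_id in welsh) / len(core_ids)


if __name__ == "__main__":
    welsh_synsets = build_welsh_synsets(ENGLISH_SYNSETS, EN_TO_CY)
    print(welsh_synsets)
    print(project_relations(HYPERNYMS, welsh_synsets))
    print(coverage(welsh_synsets, set(ENGLISH_SYNSETS)))
```

In this toy example the verb synset has no dictionary entry, so it is simply left out of the Welsh network. A real pipeline would also need to deal with ambiguous translations, which this sketch does not attempt.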

We also had the opportunity to show our work to the funders and to the community as part of the recent Cymru Arloesol event at Tramshed Tech in Cardiff, at which a number of the projects funded by Grant Cymraeg 2050 demonstrated their progress. It was fantastic to be there to see the exciting ways that people are driving the development of Welsh language technology, and for our own Steven Neale to be able to give a presentation on the development of WordNet Cymraeg and the value it offers in that landscape. These are certainly exciting times for the development of technology delivered and available in Welsh, and Welsh natural language processing tools are going to have an important role to play in that.

To find out more about WordNet Cymraeg, visit, or to start using WordNet Cymraeg, files can be found at


News (2017)


The CorCenCC team are pleased to announce that we have been awarded funding from the Welsh Government’s Grant Cymraeg 2050 scheme for work on a project entitled WordNet Cymraeg.

The aim of the project is to automatically construct a WordNet for Welsh, a lexical database in which words are grouped into sets of synonyms (synsets), which are then organised into a network of lexico-semantic relationships. WordNets are widely used in natural language processing (NLP) to support understanding of meaning expressed in written and spoken language. As such, WordNet is vital for language technology applications such as question answering, information retrieval and machine translation.

These technologies are vital for the development of user-friendly interfaces for smartphone and smart home apps, which will drive the use of Welsh-medium digital technology for Cymraeg 2050. By linking the WordNet Cymraeg project to the CorCenCC project, we will re-use its sustainability and engagement plans to increase the visibility and ensure the long-term future of WordNet Cymraeg. Public engagement activities include:

  • A social media campaign to advertise the project and encourage users to test functionalities.
  • Regional roadshows/workshops at schools, libraries and community centres/Mentrau Iaith to raise potential users’ awareness of the WordNet and to provide basic training in its use.

WordNet Cymru will be led by Professor Irena Spasic, working with Dr Dawn Knight and Dr Steven Neale.


17/11/2017 – CorCenCC runs the only Welsh medium event in the 2017 ‘Being Human’ Festival

On Friday 17 November, Jenny Needs and Steve Morris went to the Ty’r Gwrhyd ‘Canolfan Gymraeg’ in Pontardawe to hold the only Welsh medium event at this year’s ‘Being Human’ Festival. This is the UK’s only national festival of the humanities and the only hub the festival has in Wales is in Swansea. The festival is led by the School of Advanced Study, University of London in partnership with the British Academy and the Arts and Humanities Research Council.

The name of the CorCenCC Welsh medium session was “Rho dy Gymraeg i ni / We want your Welsh!” and it was a fantastic opportunity to collect hours of spoken data through experimenting with the ‘Gogglebox’ television programme model and asking participants to give a live reaction to short films.

It was also an opportunity to engage with the public in the Swansea Valley and show them the app (as Jenny is doing in the picture). There was a good – and lively – response to the films from the ‘sofa critics’ and this is definitely a way of collecting data which we will look to use again in the future.


06/11/2017 – CorCenCC Away Day

On the 13th of November, the WP1 team (the PI, Co-Investigators and RAs) met up in Swansea for an away day. This meeting was multifunctional; it gave the team some much needed time to catch up face-to-face, to take stock and positively reflect on the progress we have made so far, and to prioritise and plan the remaining months of the project. The meeting was productive and a good way to introduce new members of the team. To break the ice, our communication skills were tested with interactive communication tasks using images, which we completed in record time. But the main focus was on how to streamline the handling of Big Data, and on identifying the challenges of data collection and possible ways of limiting them. Overall, the outcomes were positive, and we have already started implementing changes to data collection methods, such as the use of a web scraper to automate the extraction of e-language texts.



01/08/2017 – Would you be interested in working as a CorCenCC transcriber?

As you know, we have been busy recording Welsh being spoken up and down the country. Work has begun on transcribing the recordings, but we are now looking for more transcribers – would you be interested? The work is flexible (you can work whenever suits you, and do as many/few hours as you wish) so it is easy to fit in around other activities, and the recordings are interesting and varied – one day you might be transcribing a lecture or sermon, and the next day a lively conversation down the pub! If you’d be interested in joining our team of transcribers, please email for more information.


17/02/2017 – CorCenCC Crowdsourcing App launch

To coincide with the launch of the website, February also witnessed the launch of the first release of the CorCenCC crowdsourcing app. The app is currently available on iOS and an Android version will be released within the next two to four months (keep an eye out for that!).

News of the app release was featured on the websites of all partner institutions, on tech websites and in Y Cymro and the Denbighshire Free Press (amongst others). We are hoping that by spreading the word about the app and project, we can raise people’s awareness of the importance and value of the work, and get as many people as possible involved in contributing data and/or using the corpus when it is finally constructed.


28/02/2017 – Project launch

To celebrate a successful first 12 months of the project, the CorCenCC team hosted a launch event at the Pierhead Building in Cardiff Bay. Scaffolded by a weighty media campaign, which included radio interviews on the BBC’s Good Morning Wales programme (PI Dawn Knight) and BBC Radio Cymru’s Post Cyntaf (Ambassador Nia Parry) and print and online press coverage in various outlets (including the BBC and Mail Online, institutional websites and tech blogs, amongst others), the event aimed to act as a springboard for engaging with the public, policy makers, educators, publishers and the media; raising awareness about the project and encouraging individuals to support the work.


The launch, attended by Alun Davies AM, Minister for Lifelong Learning and Welsh Language, gave guests the chance to find out more about the project, which is a collaboration between Cardiff, Swansea, Lancaster and Bangor universities, and is breaking new ground in creating a large-scale, open access corpus of contemporary Welsh language. Backed by high-profile ambassadors poet Damian Walford Davies, musician and presenter Cerys Matthews, broadcaster Nia Parry and international rugby referee Nigel Owens, CorCenCC is community-driven and uses mobile and digital technologies to enable public collaboration. A demonstration of our new data collection app, which enables Welsh speakers from all walks of life to contribute to the project, was on show at the event. CorCenCC partners and ambassadors also shared their impressions of how the resource will impact on their research, and on the Welsh language community more widely.


Alun Davies, Steve Morris, Dawn Knight, Bethan Jenkins and Tess Fitzpatrick

Minister for Lifelong Learning and the Welsh Language, Alun Davies, said: “I am very pleased to attend the launch of this exciting project today. Not only will this work give us a real record of how Welsh is actually being used, but it will also feed into our aim of developing the role of the Welsh language in technology which will be key if we are to meet our target of a million Welsh speakers by 2050.”


The CorCenCC team

Around 85 people attended the launch and the evening also marked the first time that the majority of the extended CorCenCC team were assembled in the same place together! The launch was sponsored by funds from the British Council, the School of English, Communication and Philosophy at Cardiff University, and the Research Institute for Arts and Humanities at Swansea University – many thanks for your support!

01/03/2017 – Whole Project Team meeting

Hot on the heels of the launch event, we held the first Whole Project Team meeting at Cardiff University on St David’s Day. The meeting, which will take place annually, brings together the CorCenCC Project Team (CPT – which comprises the PI, all CIs, RAs and PhD students), Consultants and all members of the Project Advisory Group, and is a great opportunity for the team to get to know each other a little better (face-to-face) and to discuss ideas and future plans. The aim of the meeting was to provide specific work package (WP) updates, to consider and discuss potential routes to engagement for the project as a whole (concentrating on input mainly from the Project Advisory Group) and to think about how we can best push the boundaries in current corpus research with future developments on CorCenCC.


We would like to say a big thank you to all of you who travelled far and wide to attend this meeting – we all thought it was a very successful and engaging meeting, and it is likely to give us added strength in ideas and motivation to fuel the next steps of development on the project. We are looking forward to having you all back in Cardiff for the meeting in 2018!

