Publications:

  • Knight, Dawn, Steve Morris, Laura Arman, Jennifer Needs and Mair Rees. (2021, in prep.). Blueprints for minoritised language corpus design: a focus on CorCenCC. London: Palgrave.
  • Knight, Dawn, Steve Morris and Tess Fitzpatrick. (2020, in prep.). Corpus Design and Construction in Minoritised Language Contexts: A focus on CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – National Corpus of Contemporary Welsh). London: Palgrave.
  • Knight, Dawn, Fernando Loizides, Steven Neale, Laurence Anthony and Irena Spasić. (2020, accepted). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
  • Corcoran, Padraig, Geraint Palmer, Laura Arman, Dawn Knight and Irena Spasić. (2020, accepted). Word Embeddings in Welsh. Journal of Information Science.
  • Muralidaran, Vignesh, Dawn Knight and Irena Spasić. (2020, accepted). A systematic review of unsupervised approaches to usage-based grammar induction. Natural Language Engineering.
  • Spasić, Irena, David Owen, Dawn Knight and Andreas Arteniou. (2019). Data-driven terminology alignment in parallel corpora. Proceedings of the Celtic Language Technology Workshop 2019, Dublin, Ireland.
  • Piao, Scott, Paul Rayson, Dawn Knight, and Gareth Watkins. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Neale, Steven, Kevin Donnelly, Gareth Watkins and Dawn Knight. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Rayson, Paul. (2018). Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. Proceedings of the Challenges in the Management of Large Corpora workshop at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Rayson, Paul and Scott Piao. (2017). Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds. Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, held at the European Chapter of the Association for Computational Linguistics (EACL) 2017 conference, April 2017, Valencia.
  • Piao, Scott, Paul Rayson, Dawn Archer, Francesca Bianchi, Carmen Dayrell, Mahmoud El-Haj, Ricardo-María Jiménez, Dawn Knight, Michael Křen, Laura Löfberg, Rao Muhammad Adeel Nawab, Jawad Shafi, Phoey Lee Teh and Olga Mudraya. (2016). Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages. Proceedings of the LREC (Language Resources Evaluation) 2016 Conference, May 2016, Portorož, Slovenia.

Keynotes and Conference Presentations:

CorCenCC Tools and Software:

The CorCenCC corpus and its associated tools are open source and freely available via the CorCenCC GitHub site. To access the site, please click here.

Please cite these outputs as follows:

  • CorCenCC corpus and query tools:
    • Knight, Dawn, Steve Morris, Tess Fitzpatrick, Paul Rayson, Irena Spasić, Enlli Môn Thomas, Alex Lovell, Jonathan Morris, Jeremy Evas, Mark Stonelake, Laura Arman, Joshua Davies, Ignatius Ezeani, Steven Neale, Jennifer Needs, Scott Piao, Mair Rees, Gareth Watkins, Lowri Williams, Vignesh Muralidaran, Bethan Tovey, Laurence Anthony, Tom Cobb, Margaret Deuchar, Kevin Donnelly, Michael McCarthy and Kevin Scannell. (2020). CorCenCC: (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction. [Digital Resource]. Available at: www.corcencc.org/explore
  • Citing the CorCenCC project report:
    • Knight, Dawn, Steve Morris, Tess Fitzpatrick, Paul Rayson, Irena Spasić and Enlli Môn Thomas. (2020). Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh – A community driven approach to linguistic corpus construction: Project Report. Published online at: [Details coming soon]
  • CorCenCC’s infrastructure and crowdsourcing app:
    • Knight, Dawn, Fernando Loizides, Steven Neale, Laurence Anthony and Irena Spasić. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV). [Details coming soon]
  • CorCenCC’s part-of-speech (POS) tagger ‘CyTag’:
    • Neale, Steven, Kevin Donnelly, Gareth Watkins and Dawn Knight. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC 2018). Miyazaki, Japan. May 7-12, 2018.
  • CorCenCC’s semantic tagger ‘CySemTagger’:
    • Piao, Scott, Paul Rayson, Dawn Knight and Gareth Watkins. (2018). Towards A Welsh Semantic Annotation System. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan.
    • Piao, Scott, Paul Rayson, Dawn Knight, Gareth Watkins and Kevin Donnelly (2017). Towards a Welsh Semantic Tagger: Creating Lexicons for A Resource Poor Language. In Proceedings of The Corpus Linguistics 2017 Conference, held from 24-28 July 2017 at University of Birmingham, Birmingham, UK.
  • CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’:
    • Davies, Joshua, Enlli Môn Thomas, Tess Fitzpatrick, Jennifer Needs, Laurence Anthony, Thomas Michael Cobb and Dawn Knight. (2020). Y Tiwtiadur. [Digital Resource]. Available at: www.corcencc.org/Y-Tiwtiadur

 

Satellite projects and software

Below are details of all externally funded satellite projects of CorCenCC:

Start Date | Funder | Amount | Description [with PI]
Feb 2017 | British Council | £2,000 | Funding to support the public launch of the CorCenCC project at the Pierhead Building, Cardiff. [Knight]
Oct 2017 | Welsh Government | £24,992 | Competitive commission from Welsh Government to provide a rapid evidence assessment of effective second language teaching approaches and methods. For more information, click here. [Fitzpatrick]
Jan 2018 | Cymraeg 2050 2017-2018 Grant Scheme (GC2050/17-18/20) | £19,964 | A project focused on automatically constructing a WordNet for Welsh: a lexical database in which words are grouped into sets of synonyms (synsets), which are then organised into a network of lexico-semantic relationships. To access the WordNet Cymru website, click here. [Spasić]
Jan 2018 | Welsh Joint Education Committee (WJEC) | £1,968 | Research grant (including intramural programme) to complete work on producing a B1 core vocabulary for Welsh for Adults (Canolradd level). For more information, click here. [Morris]
Jan 2019 | Welsh Government Technology Funding | £20,000 | Funding to support the development of a Welsh language stemmer. For more information, click here. [Spasić]
Aug 2019 | Welsh Government Technology Funding | £90,000 | Project entitled ‘Welsh language processing infrastructure: Welsh word embeddings’. The project focused on word embeddings for Welsh (primarily on creating a lexicon and Welsh word and term embeddings) and contributes to the Welsh Language Technology Action Plan’s aim to ‘promote Welsh language technology and coding resources to teachers and children and others’. For more information, click here. [Spasić]
May 2020 | Welsh Government Technology Funding | £90,000 | Project entitled ‘Learning English-Welsh bilingual embeddings and applications in text categorisation’. This project aims to extend the results of the previous one by creating cross-lingual representations of words in a joint embedding space for Welsh and English. [Knight]

CorCenCC newsletter (archive)

Click below to view the archived editions of the newsletters that were published during the CorCenCC project:
