Publications:

  • Knight, D., Morris, S., Arman, L., Needs, J. and Rees, M. (2021, in prep.). Blueprints for minoritised language corpus design: a focus on CorCenCC. London: Palgrave.
  • Knight, D., Morris, S. and Fitzpatrick, T. (2021, in prep.). Corpus Design and Construction in Minoritised Language Contexts: A focus on CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – National Corpus of Contemporary Welsh). London: Palgrave.
  • Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
  • Corcoran, P., Palmer, G., Arman, L., Knight, D. and Spasić, I. (2020, accepted). Word Embeddings in Welsh. Journal of Information Science.
  • Muralidaran, V., Knight, D. and Spasić, I. (2020, accepted). A systematic review of unsupervised approaches to usage-based grammar induction. Natural Language Engineering.
  • Spasić, I., Owen, D., Knight, D. and Artemiou, A. (2019). Data-driven terminology alignment in parallel corpora. Proceedings of the Celtic Language Technology Workshop 2019, Dublin, Ireland.
  • Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Rayson, P. (2018). Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. Proceedings of the Challenges in the Management of Large Corpora workshop at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • Rayson, P. and Piao, S. (2017). Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds. Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications held at the European Chapter of the Association for Computational Linguistics 2017 (EACL) conference, April, Valencia.
  • Piao, S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R-M., Knight, D., Křen, M., Löfberg, L., Nawab, R. M. A., Shafi, J., Teh, P-L., and Mudraya, O. (2016). Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages. Proceedings of the LREC (Language Resources Evaluation) 2016 Conference, May 2016, Portorož, Slovenia.

Keynotes and Conference Presentations:

CorCenCC Tools and Software:

The CorCenCC corpus and its associated tools are open source and freely available via the CorCenCC GitHub site.

Please cite these outputs as follows:

  • CorCenCC corpus:
    • Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., Thomas, E-M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M. and Scannell, K. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh. [Digital Resource]. Available at: www.corcencc.org/explore
  • The CorCenCC project report:
    • Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E-M. (2020). Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh – A community driven approach to linguistic corpus construction: Project Report. Published online at: [Details coming soon]
  • CorCenCC’s infrastructure and crowdsourcing app:
    • Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV).
  • CorCenCC’s part-of-speech (POS) tagger ‘CyTag’:
    • Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
  • CorCenCC’s semantic tagger ‘CySemTagger’:
    • Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan.
    • Piao, S., Rayson, P., Knight, D., Watkins, G. and Donnelly, K. (2017). Towards a Welsh Semantic Tagger: Creating Lexicons for A Resource Poor Language. In Proceedings of The Corpus Linguistics 2017 Conference, held from 24-28 July 2017 at University of Birmingham, Birmingham, UK.
  • CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’:
    • Davies, J., Thomas, E-M., Fitzpatrick, T., Needs, J., Anthony, L., Cobb, T. and Knight, D. (2020). Y Tiwtiadur. [Digital Resource]. Available at: www.corcencc.org/Y-Tiwtiadur

 

Satellite projects and software

Below are details of all externally funded satellite projects of CorCenCC:

| Start Date | Funder | Amount | Description [with PI] |
| --- | --- | --- | --- |
| Feb 2017 | British Council | £2,000 | Funding to support the public launch of the CorCenCC project at the Pierhead Building, Cardiff [Knight] |
| Oct 2017 | Welsh Government | £24,992 | Competitive commission from Welsh Government to provide a rapid evidence assessment of effective second language teaching approaches and methods. [Fitzpatrick] |
| Jan 2018 | Cymraeg 2050 2017-2018 Grant Scheme (GC2050/17-18/20) | £19,964 | A project focused on automatically constructing a WordNet for Welsh, a lexical database in which words are grouped into sets of synonyms (synsets), which are then organised into a network of lexico-semantic relationships. The project has an accompanying WordNet Cymru website. [Spasić] |
| Jan 2018 | Welsh Joint Education Committee (WJEC) | £1,968 | Research grant (including intramural programme) to complete work on producing a B1 core vocabulary for Welsh for Adults (Canolradd level). [Morris] |
| Jan 2019 | Welsh Government Technology Funding | £20,000 | Funding to support the development of a Welsh language Stemmer. [Spasić] |
| Aug 2019 | Welsh Government Technology Funding | £90,000 | Project entitled: ‘Welsh language processing infrastructure: Welsh word embeddings’. The project focused on word embeddings for Welsh (primarily on creating a lexicon and Welsh word and term embeddings) and contributes to the Welsh Language Technology Action Plan’s aim to ‘promote Welsh language technology and coding resources to teachers and children and others’. [Spasić] |
| May 2020 | Welsh Government Technology Funding | £90,000 | Project entitled: ‘Learning English-Welsh bilingual embeddings and applications in text categorisation’. This project aims to extend the results of the previous one by creating cross-lingual representations of words in a joint embedding space for Welsh and English. [Knight] |

CorCenCC newsletter (archive)

Click below to view the archived editions of the newsletters that were published during the CorCenCC project:
