Language Technology for Icelandic 2018-2022

This project is part of a collaborative effort funded by the Icelandic government to make Icelandic available for use in today’s technological environment. Automatic Speech Recognition, Text-to-Speech and Machine Translation are three of the five core projects defined in the report Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022). The collaboration, called Samstarf um íslenska máltækni (SÍM), comprises nine companies and organizations specializing in linguistics and natural language processing: Reykjavik University, University of Iceland, Árni Magnússon Institute for Icelandic Studies, Blindrafélagið (BIAVI), Ríkisútvarpið (RÚV, national radio), Creditinfo (media monitoring), Tiro ehf., Grammatek ehf. and Miðeind ehf.
Work on the project formally started on 1 October 2019, with a total duration of five years approved by the Icelandic parliament.
Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)
Projects:
“Machine Translation (MT) of Icelandic and English”
“Automatic speech recognition (ASR) for Icelandic”
“Text-to-Speech (TTS) for Icelandic speech synthesis”
“Language resources and tools”
Timeline: October 2019 – October 2024.
Paper: Language Technology Programme
Text-to-Speech (TTS) for Icelandic speech synthesis

The Language and Voice Lab (LVL) is responsible for developing text-to-speech synthesis for Icelandic in a way that makes it possible to produce multiple different voices. LVL will create an environment and language resources that will be released so that players in the market can quickly and simply build synthetic voices for end users. LVL will also ensure that the speech synthesis solutions developed can be integrated into software wherever, for example, automatic reading aloud or voice answering is needed.
Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)
Contributors: Atli Thor Sigurgeirsson, Þorsteinn Daði Gunnarsson
Timeline: First iteration: October 2019 – October 2020.
Research paper: Manual Speech Synthesis Data Acquisition
Code: LOBE
Automatic speech recognition (ASR) for Icelandic

The Language and Voice Lab is responsible for developing automatic speech recognition for Icelandic within the Language Technology for Icelandic project. The aim of developing ASR within the project is to enable people who design and develop voice-based user interfaces to add Icelandic easily. An open environment will be established for the development of speech recognition systems, and recipes for common usage will be made open and accessible.
Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)
Contributors: Helga Svala Sigurðardóttir, Jón Guðnason, Judy Fong, Þorsteinn Daði Gunnarsson, Michal Borský, Ragnheiður Kr. Þórhallsdóttir, Carlos Mena, Caitlin Richter, Ragnar Pálsson
Timeline: October 2019 – October 2022.
News, Voice Donation Platform, Code: Gáfu 1.500 raddsýni fyrir hádegi (in Icelandic; "Donated 1,500 voice samples before noon"), Samromur.is, Broad Data Prep repository, Samromur paper, Punctuation models, Speaker Diarization recipes
Machine Translation (MT) of Icelandic and English

The goal of machine translation is to translate text or speech between two or more natural languages. In this project the goal is to implement a baseline statistical machine translation system that translates between Icelandic and English in both directions. The project is part of the core machine translation project (V3) within the Icelandic National Language Technology Programme, defined in the Language Technology for Icelandic 2018-2022 project plan. We leverage the newly released ParIce corpus, a parallel corpus of 3.5M Icelandic and English translation segments.
Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)
Contributors: Steinþór Steingrímsson, Haukur Páll Jónsson, Hrafn Loftsson, Luke O’Brien
Timeline: October 2019 – Summer 2020.
Support Tools
The LVL has been developing several support tools as part of the Language Technology Programme for Icelandic. The main ones are tools for performing named entity recognition, part-of-speech tagging, lemmatization, and parsing. Furthermore, emphasis has been on pre-training various types of language models that can be fine-tuned for downstream tasks.
Microservices at your service: bridging the gap between NLP research and industry
This project aims to increase inclusiveness and accessibility for the EU languages by making natural language processing (NLP) tools freely and openly available on the European Language Grid (ELG) platform. The project will make the NLP tools more accessible to a larger audience of software developers through:
- identifying relevant and interesting NLP tools. The tools will be identified via a bottom-up search on the software platforms, as well as by contacting the research institutions;
- conducting a survey and collecting standard or available test data sets for NLP tasks;
- testing the collected tools on the existing test data and selecting them based on metric performance and language coverage;
- dockerising the tools and exposing an industry-standard API for each service;
- sharing the docker images via the ELG platform.
The project targets the following languages: Finnish, Swedish, Norwegian, Spanish, Portuguese, Icelandic, Faroese, Lithuanian, Latvian and Estonian.
Funding: CEF Telecom
Contributors: Bjarni Bjarkason, Jökull Snær Gylfason, the University of Tartu (Estonia), Gradient (Spain), and Lingsoft (Finland).
Timeline: Mar. 2021 – Feb. 2023
Code: https://github.com/cadia-lvl/Icelandic-NER-API (any of the API repos there)
National Language Technology Platform (NLTP)
In this project, the most advanced language technology (LT) tools and solutions will be united in a novel, artificial-intelligence-driven National Language Technology Platform (NLTP). By tightly integrating mature, state-of-the-art LT technologies and services developed in CEF AT and other European and national programmes, the NLTP will provide public administrations, SMEs and the general public with an efficient way to ensure multilingual access to online services, websites, documents and information, removing language barriers, increasing accessibility and fostering cross-border services.
The translation and speech processing services available in the platform will give public administration entities, their employees, SMEs and the public convenient and secure access to high-quality tools for translating and making accessible a wide array of content, including confidential documents, across all the languages of the Digital Single Market. In doing so, the platform aims to enable the vision of language parity and the full multilingualism enshrined in the European Charter of Fundamental Rights in an efficient, cost-effective and equitable manner.
Funding: CEF Telecom
This project is in collaboration with Culture Information Systems Centre (Latvia), Malta Information Technology Agency, Office of the State Advocate (Malta), University of Malta, University of Tartu (Estonia), Central State Office for the Development of Digital Society (Croatia), and University of Zagreb (Croatia).
Timeline: April 2021 – March 2023
Project website: https://www.nltp-info.eu/
Spoken Dialogue Framework for Icelandic
The spoken dialogue framework enables users to communicate with computers and other devices with their voice in Icelandic. The goal of this project is to develop and provide an open development framework for Icelandic spoken dialogue. The framework will feature automatic speech recognition (ASR), language understanding questions, text-to-speech synthesis (TTS), as well as several other language modules. Several of these modules are already in development as part of the five-year Language Technology Programme for Icelandic, while others will be new developments or areas for end users. This project will be developed and tested in collaboration with industry partners (Grammatek ehf. and Tiro ehf.) as well as the open sector.
Funding: Strategic Research and Development Programme for Language Technology
Contributors: Caitlin Richter, Ragnar Pálsson, Tiro, Grammatek
Using Machine Learning Models for Clinical Diagnoses
The goal is to examine the feasibility of using automatic models for clinical analyses. The project consists of two sub-goals. The first sub-goal is to develop a model based on deep neural networks that uses data from the Icelandic healthcare system. The second sub-goal is to develop a prediction model for clinical diagnoses. The dataset will come from the capital region’s healthcare clinics, and a portion of it will be manually annotated by clinical experts. This project will be developed jointly by LVL and Heilsugæsla, the health clinics.
Funding: Strategic Research and Development Programme for Language Technology
Contributors: Hlynur Davíð Hlynsson, Hrafn Loftsson
Computer-Assisted Pronunciation Training in Icelandic
Language technology can be used to make teaching easier and more fun. It is important for small languages like Icelandic to gain more users, and language learning and teaching are an important step toward that. Computer-assisted pronunciation training (CAPT) makes it easier to teach more students simultaneously and automatically. This training will be integrated with the Icelandic Online system used in the Icelandic as a second language program at the University of Iceland.
Funding: Strategic Research and Development Programme for Language Technology
Contributors: Caitlin Richter, Ragnar Pálsson, Þorsteinn Daði Gunnarsson, Tiro ehf., the Árni Magnússon Institute, and the University of Iceland.
Automatic Entity Linking for Icelandic
Named Entity Disambiguation (NED) is the task of mapping named entities (NEs), such as names of persons, locations and companies, from an input text document to corresponding unique entities in a target Knowledge Base (KB).
The goals of this project are: 1) to create Icelandic training material for NED-related research and development, 2) to create a comprehensive, open knowledge graph of Icelandic NEs, and 3) to develop a fast and accurate entity linker for Icelandic.
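To make the NED task described above concrete, here is a minimal Python sketch. It is not the project's entity linker: the knowledge base entries, identifiers (`KB_VOLCANO`, `KB_COMPANY`), alias table and the word-overlap scoring are all hypothetical toy choices made for illustration only. The sketch looks up candidate entities for a mention and picks the one whose context words overlap most with the surrounding sentence.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kb_id: str          # hypothetical identifier, not taken from any real KB
    label: str          # human-readable name of the entity
    context: frozenset  # toy set of words expected near mentions of the entity


# Toy knowledge base (assumption: two entities share the surface form "Hekla").
KB = {
    "KB_VOLCANO": Entity("KB_VOLCANO", "Hekla (volcano)",
                         frozenset({"volcano", "erupted", "eruption", "ash"})),
    "KB_COMPANY": Entity("KB_COMPANY", "Hekla (company)",
                         frozenset({"company", "cars", "dealership", "sells"})),
}

# Alias table: lower-cased surface form -> candidate entity identifiers.
ALIASES = {"hekla": ["KB_VOLCANO", "KB_COMPANY"]}


def link(mention, sentence):
    """Return the kb_id of the best candidate for `mention`, or None."""
    candidates = ALIASES.get(mention.lower(), [])
    if not candidates:
        return None  # mention is not covered by the toy alias table
    words = set(re.findall(r"\w+", sentence.lower()))
    # Score each candidate by context-word overlap; ties go to the first one.
    return max(candidates, key=lambda cid: len(KB[cid].context & words))


print(link("Hekla", "Hekla erupted and covered the area in ash"))  # KB_VOLCANO
print(link("Hekla", "Hekla sells cars in Reykjavík"))              # KB_COMPANY
```

A practical entity linker would replace the hand-written alias table and overlap score with candidates drawn from a full knowledge graph and a learned ranking model, but the two steps shown here (candidate generation, then disambiguation by context) are the same.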
Funding: Strategic Research and Development Programme for Language Technology
Question-Answering
In this project, we develop a novel information-seeking procedure to crowdsource the creation of Question-Answering datasets. The crowd workers use an app, https://gameqa.app, to ask questions, mark correct answers, and perform quality control. The answer candidates are searched for in multiple sources, as opposed to only in Wikipedia as is most common in previous methods.