Summer jobs at LVL

We want you at LVL

Automatic translation from English to Icelandic below, with [human notes].

In collaboration with Reykjavík University, LVL is looking to hire 10 students full-time over the summer.

The duration is two months from the 10th of June and will consist, among other things, of gathering recordings for Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) as well as transcribing speech recordings for other applications. You will go out into the world, meet other people and get a glimpse into the work which is being done to digitise Icelandic.

Application deadline is the 5th of June.

For further information and how to apply (in Icelandic) see here.

Í samstarfi við Háskólann í Reykjavík leitast LVL við að ráða 10 nemendur í fullu starfi yfir sumarið.

Tímalengdin er tveir mánuðir frá 10. júní og mun m.a. fela í sér upptökur fyrir sjálfvirka talfærslu [talgreiningu] (ASR) og textatal [talgervingu] (TTS) ásamt því að þýða talupptökur fyrir önnur forrit [fyrir önnur verkefni]. Þú munt fara út í heiminn, kynnast öðru fólki og láta sjá þig í verkinu sem er unnið til að stafvæða íslensku [taka þátt í vinnunni að gera íslensku stafræna].

Umsóknarfrestur er 5. júní næstkomandi.

Frekari upplýsingar og hvernig á að nota [sækja um] (á íslensku) sjá hér.

Role models?

Sjá íslenska þýðingu neðar.

Will the Icelandic Language Technology program and our efforts become role models for other similar languages? Read about RU’s take on the work we are participating in and see us in action (picture taken pre-Covid).

Gæti íslenska máltækniáætlunin orðið leiðarljós annarra lítilla málsvæða? Hérna er umfjöllun HR um starfsemi okkar (Myndir teknar fyrir Covid).

A large milestone in Named Entity Recognition for Icelandic!

Progress in NER celebrated with a suitable cake.

Last Tuesday Svanhvít and Ásmundur completed the first stage of the Named Entity Recognition project for Icelandic. They finished the daunting task of labeling all named entities in a text corpus of 1 million tokens (MIM-GOLD) with the following categories: Person, Location, Organization, Miscellaneous, Money, Percent, Time, and Date.

To assist with the task, they first preprocessed the corpus using regular expressions to catch some cases and then verified and completed the labeling using the brat rapid annotation tool. Their next task will be to create a few baseline NER tagging systems using the labelled dataset.
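As a rough illustration of the regular-expression pre-labeling step, here is a minimal Python sketch. The patterns, the `pre_label` function, and the example sentence are all hypothetical; the post does not show the expressions the team actually used. The sketch only proposes candidate spans for three of the categories, which a human annotator would then verify and complete in a tool such as brat.

```python
import re

# Hypothetical patterns for illustration only; the expressions actually
# used in the project are not given in the post.
MONTHS = (
    "janúar|febrúar|mars|apríl|maí|júní|júlí|ágúst|"
    "september|október|nóvember|desember"
)
PATTERNS = [
    ("Percent", re.compile(r"\d+(?:[.,]\d+)?\s?%")),
    # "krónur" is listed before "kr" so the longer form wins.
    ("Money", re.compile(r"\d+(?:[.,]\d+)*\s?(?:krónur|kr\.?|ISK)")),
    ("Date", re.compile(r"\d{1,2}\.\s(?:" + MONTHS + r")(?:\s\d{4})?")),
]


def pre_label(text):
    """Return candidate entities as (start, end, label, surface), sorted by position."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


if __name__ == "__main__":
    sentence = "Verðið hækkaði um 5,5% og er nú 1.200 kr. frá 10. júní 2019."
    for start, end, label, surface in pre_label(sentence):
        print(start, end, label, surface)
```

Patterns like these only catch regular surface forms such as amounts, percentages and dates. Categories such as Person or Organization still depend on manual annotation.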

The dataset will be publicly available this spring.

First milestone in the Language Technology for Icelandic project

The LVL team celebrating the first milestone in the Language Technology for Icelandic project. Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir and Steinþór Steingrímsson are missing from the picture.

Last week we celebrated achieving the first milestone in the Language Technology for Icelandic project with a cake!

After a lot of hard work over the past few months, we achieved the first milestone in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and Machine Translation (MT).

In ASR, the focus has mostly been on creating and gathering data. So far, 55,000 utterances donated by adults have been collected via the crowd-sourcing platform (based on Common Voice), with plans to reach 100,000 utterances by the next milestone. The process is being extended to include younger voices in collaboration with schools and authorities. Today we started working with Öldutúnsskóli in Hafnarfjörður. The goal is to reach 80,000 utterances from young voices by the next milestone. Additionally, data has been gathered from RÚV (audio, video and subtitles) and CreditInfo (transcriptions). Along with data gathering, the team is also developing tools to post-process Icelandic ASR text for better readability.

In TTS, we successfully created a voice recording client (LOBE) and three reading scripts in order to collect high-quality speech and corresponding text data. The reading scripts were created from Risamálheild and seek to maximize diphone coverage. So far 20 hours have been collected from two speakers, one male and one female. The aim is to finish collecting 20 hours from each speaker early this year. From the collected data, two TTS prototypes have been created in Ossian, which extends the Merlin back-end. The current prototypes are quite naive, but we have integrated a grapheme-to-phoneme model for Icelandic into them.
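The post does not say how the reading scripts were selected. One common way to maximize diphone coverage is greedy selection: repeatedly pick the sentence that adds the most diphones not yet covered. The Python sketch below assumes each candidate sentence already has a phoneme transcription; the function names and toy data are hypothetical and not taken from LOBE or the project code.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in one transcription."""
    return set(zip(phonemes, phonemes[1:]))


def select_sentences(candidates, max_sentences):
    """Greedily choose sentences that add the most uncovered diphones.

    candidates maps a sentence id to its phoneme list (assumed input format).
    Returns the chosen ids in selection order and the set of covered diphones.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # Pick the sentence with the largest number of new diphones.
        best = max(remaining, key=lambda s: len(diphones(remaining[s]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # No remaining sentence adds anything new.
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


if __name__ == "__main__":
    # Toy transcriptions, not real Icelandic phoneme sequences.
    toy = {
        "a": ["h", "a", "l", "o"],
        "b": ["h", "a"],
        "c": ["l", "o", "s"],
    }
    print(select_sentences(toy, max_sentences=3))
```

Greedy selection does not guarantee the smallest possible script, but it is simple and makes it easy to cap the script at a fixed number of sentences.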

In MT, we successfully created a phrase-based statistical machine translation system using the open source tool Moses. Our collaborators at Miðeind created neural machine translation systems based on BiLSTMs and Transformers. The models were trained on the newly available English-Icelandic parallel corpus, ParIce. The systems were then evaluated with respect to training time, throughput and BLEU score. The code and systems are freely available but are still under development. In milestone two we will continue to develop the systems and adapt them to the specific needs of the Icelandic language.

Conference – Er íslenskan góður „bisness“?

Tomorrow, 16th of October 2019, there will be a conference on Icelandic language technology. The conference will take place at Veröld – hús Vigdísar and starts at 8:00.

A number of people affiliated (past and present) with LVL will be giving talks there, including:

  • David Erik Mollberg, Ólafur Helgi Jónsson, Viktor Sveinsson and Sunneva Þorsteinsdóttir – students at HÍ and RU, who will launch an open speech data collection initiative for Icelandic.
  • Anna Björk Nikulásdóttir – project manager at SÍM (Samstarf um íslenska máltækni – Collaboration on Icelandic Language Technology) and CEO of Grammatek will talk about tools in language technology.
  • Hrafn Loftsson – docent at the School of Computer Science at RU will talk about automatic text summarization for Icelandic.

The conference focuses on the importance of Icelandic language technology for academia and industry and is open for anyone to attend.

For more details on the conference and the speaker list, see the Facebook event (in Icelandic).