Helpdesk for Language Resources

27 questions found

When is textual data open data? 1 answer answered

Do you have practical guidelines that clarify when language data can be considered open data? How and when does the PSI apply?

by The ELRC Helpdesk Team at Mar 20, 2019 Categories Legal
Is GDPR relevant to newswire text corpora? 2 answers answered
In an industry-oriented research project funded by the European Regional Development Fund, our research lab is creating a multilayer text corpus of the Latvian language. The text corpus is being annotated at several layers of syntactic and semantic analysis: treebank, named entities, coreferences, frame semantics, etc. The project proposal was assessed and approved (receiving top scores) by EU experts.

According to the terms of the funding program, the results of the project must have the potential to be commercialized (the so-called knowledge and technology transfer). To meet this requirement, our plan (according to the project proposal) is to distribute the language resource ("data set") with a dual licence:
1. CC BY-NC-SA 4.0 for non-commercial use;
2. individual licence agreements for commercial use (with the same terms and for the same symbolic fee for all commercial users).
The full data set (a work-in-progress version) is already publicly available on GitHub (https://github.com/LUMII-AILab/FullStack); commercial users are expected to apply for a commercial licence in good faith. We are also planning to distribute the data set via the ELRA and LDC catalogues, as well as META-SHARE and CLARIN, and ELRC (if relevant).

The first potential customer, a language technology company from the USA, has contacted us, and they are willing to sign a commercial licence agreement. We have prepared a draft licence agreement (I have attached it just in case); however, its approval has stalled on my side.

Since the text corpus is partially based on public newswire texts (~60%), my administration has consulted some local GDPR experts, because the random newswire text units (mostly random isolated paragraphs from random articles) contain mentions of random persons. These local experts have concluded that it is most likely illegal (w.r.t. GDPR Article 6 "Lawfulness of processing") to distribute such language resources for commercial use, so my administration says "no".

No common-sense arguments have been helpful so far:
- Neither other research groups nor commercial companies will re-distribute the language resource together with their prototypes and products. They will derive neural, statistical, or rule-based language models from the original language resource. The derived language models will be their intellectual property. The derived language models will not contain any IPR or GDPR subjects of the original language resource.
- I have consulted researchers from other European universities who are working on similar language resources. None of them has faced such an obstacle. However, they suggested a reasonable approach: there has to be an option for the personal data subjects to have their data removed (or anonymised), should they request it. Note that we cannot conduct mass-anonymisation of the text corpus, because it will also be used for training automatic named entity recognizers.
I'm also wondering whether text corpora are relevant at all w.r.t. the term "processing (of personal data)"? In the corpus development, we are not intentionally collecting data and facts about particular persons. Neither we nor the users of this data set will use it as a personal data source (apart from the general machine learning task of named entity recognition).

Is it indeed the case that GDPR does not allow us to distribute this language resource for commercial use, i.e., for deriving language models for commercial applications?

Best regards,

Normunds Gruzitis
Head of Artificial Intelligence Lab
Institute of Mathematics and Computer Science
University of Latvia
by Normunds Gruzitis at Aug 08, 2018 Categories Legal
eTranslation’s privacy policy 1 answer answered closed

The texts that we need to translate are confidential. What happens to the texts that were submitted for translation? Do you keep them or are they deleted afterwards?

by The ELRC Helpdesk Team at Jun 21, 2018 Categories Other
Who is behind ELRC? 1 answer answered closed

Is the ELRC a company or a part of the European Commission? How are your operations funded and what does your organization look like? I have tried to find this information on your website. I want to know before I contribute texts, etc. I work for a government agency.

by Anders M. at Apr 09, 2018 Categories General management
Copyright and privacy protection of LR for sharing 1 answer answered closed
There are several legal questions and concerns for translators who may want to share their translation memories (TMs), in particular how to get the TMs (language resources) legally "ready" for sharing. These include the following issues:
1. How should TMs be prepared, given that they typically contain names, phone numbers and other personal information? Also, the metadata typically contains information about the translator in charge (personal data). How should this be dealt with?
2. How can privacy protection be implemented in such cases?
3. What is the copyright situation of the TMX files? How is this ensured?
by ELRC Secretariat (on behalf of BMI) at Feb 15, 2018 Categories Legal