Is GDPR relevant to newswire text corpora?

In an industry-oriented research project funded by the European Regional Development Fund, our research lab is creating a multilayer text corpus of the Latvian language. The text corpus is being annotated at several layers of syntactic and semantic analysis: treebank, named entities, coreferences, frame semantics, etc. The project proposal was assessed and approved (receiving top scores) by EU experts.

According to the terms of the funding program, the results of the project must have the potential to be commercialized (the so-called knowledge and technology transfer). To meet this requirement, our plan (according to the project proposal) is to distribute the language resource ("data set") with a dual licence:

  1. CC BY-NC-SA 4.0 for non-commercial use;
  2. individual licence agreements for commercial use (with the same terms and for the same symbolic fee for all commercial users).

The full data set (a work-in-progress version) is already publicly available on GitHub (https://github.com/LUMII-AILab/FullStack); applying for a commercial licence is therefore expected to be a matter of good faith. We are also planning to distribute the data set via the ELRA and LDC catalogues, as well as META-SHARE, CLARIN, and ELRC (if relevant).

The first potential customer, a language technology company from the USA, has contacted us, and they are willing to sign a commercial licence agreement. We have prepared a draft licence agreement (I have attached it just in case); however, its approval has stalled on my side.

Since the text corpus is partially based on public newswire texts (~60%), my administration has consulted some local GDPR experts, because the random newswire text units (mostly random isolated paragraphs from random articles) contain mentions of random persons. These local experts have concluded that it is most likely illegal (w.r.t. GDPR Article 6 "Lawfulness of processing") to distribute such language resources for commercial use, so my administration says "no".

No common-sense arguments have been helpful so far:

  • Neither other research groups nor commercial companies will re-distribute the language resource together with their prototypes and products. They will derive neural, statistical, or rule-based language models from the original language resource. The derived language models will be their intellectual property. The derived language models will not contain any IPR or GDPR subjects of the original language resource.
  • I have consulted researchers from other European universities who are working on similar language resources. None of them has faced such an obstacle. However, they suggested a reasonable approach: there has to be an option for the personal data subjects to have their data removed (or anonymised), should they request it. Note that we cannot conduct mass-anonymisation of the text corpus, because it will also be used for training automatic named entity recognizers.

I'm also wondering whether text corpora are relevant at all w.r.t. the term "processing (of personal data)"? In the corpus development, we are not intentionally collecting data and facts about particular persons. Neither we nor the users of this data set will use it as a personal data source (apart from the general machine learning task of named entity recognition).

Is it indeed the case that the GDPR does not allow us to distribute this language resource for commercial use, i.e., for deriving language models for commercial applications?

Best regards,

Normunds Gruzitis
Head of Artificial Intelligence Lab
Institute of Mathematics and Computer Science
University of Latvia

 

Discussion

  • Dear Normunds,
     
    Let me briefly discuss copyright before I move to personal data: I assume you have taken good care of IPR clearance and you are sure that you have all the necessary rights and permissions (regarding primary data) to distribute the corpus in the intended manner (i.e. under the CC BY-NC-SA 4.0 license and under bespoke commercial licenses)?
     
    Now, regarding personal data: the concept of personal data is indeed extremely broad; if 'random people' are mentioned in the corpus, then it contains personal data.
    The concept of processing is just as broad and what your team is doing also qualifies as 'processing'. See art. 4 of the GDPR for the definitions of both terms.
     
    From the legislator's perspective, the solution to your problems should be anonymisation of the corpus (at least before it's distributed), but you mention this is not an option, which we understand.
     
    In some circumstances, the GDPR does allow you to process unanonymised personal data for research purposes without consent of the data subjects (cf. art. 89), provided that you apply 'appropriate safeguards for the rights and freedoms of data subjects'. Nothing in the GDPR (unlike in the Copyright Directive) says that 'research' should be 'non-commercial'; nevertheless, it has to be 'research'. Since the GDPR is fairly new, this is still largely a grey area -- the Latvian legislator might have provided you with some guidelines regarding the processing of personal data for research purposes in the Latvian law accompanying the GDPR (if such law exists). It is not unlikely that a number of uses that you envisage for research purposes may be covered by this exception.
     
    Furthermore, please be informed that lawfulness (the art. 6 you mention) is not the only obligation you have to comply with under the GDPR; remember that there are also obligations of transparency, purpose and storage limitation, security etc.
     
    Even if we put all this aside, as far as transferring data to a US-based company is concerned, the situation is complicated (since the US is not approved by the Commission as a country that 'ensures an adequate level of protection' -- see art. 45 of the GDPR). As ELRC Helpdesk, we are not really in a position to advise you on distributing your data to the United States for commercial use, nor to contradict what your in-house lawyers told you. It could be much simpler if you considered an appropriate anonymisation technique (e.g. replacing named entities with other words belonging to the same category in a randomised manner).
     
    However, if you want to distribute the corpus via ELRC-Share, the CC BY-NC-SA 4.0 license is satisfactory from our point of view.
     
    Kind regards,
    ELRC Helpdesk
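
The anonymisation technique mentioned in the reply above (replacing named entities with other words of the same category in a randomised manner) could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the token/span input format, the `REPLACEMENTS` pools, and the `pseudonymise` function are hypothetical and are not taken from the FullStack corpus or from any ELRC tool. The sketch keeps the entity labels and re-aligns the spans, so the result could still serve as named entity training data; it does not address whether the output would count as anonymised under the GDPR.

```python
import random

# Hypothetical surrogate pools per entity category (illustrative values only).
REPLACEMENTS = {
    "PERSON": ["Alice Jones", "Bob Brown", "Carol"],
    "ORG": ["Globex", "Initech Ltd"],
}


def pseudonymise(tokens, spans, rng):
    """Replace each annotated entity with a random surrogate of the same category.

    tokens: list of str
    spans:  non-overlapping (start, end, label) triples, end exclusive
    rng:    a random.Random instance, so runs can be made reproducible
    Returns (new_tokens, new_spans), with spans re-aligned to the new tokens.
    """
    new_tokens, new_spans = [], []
    mapping = {}  # same original mention -> same surrogate within this text
    pos = 0
    for start, end, label in sorted(spans):
        # Copy the untouched tokens between the previous entity and this one.
        new_tokens.extend(tokens[pos:start])
        original = " ".join(tokens[start:end])
        pool = REPLACEMENTS.get(label)
        if pool is None:
            surrogate = original  # categories without a pool are left as-is
        else:
            key = (label, original)
            if key not in mapping:
                mapping[key] = rng.choice(pool)
            surrogate = mapping[key]
        new_start = len(new_tokens)
        new_tokens.extend(surrogate.split())
        new_spans.append((new_start, len(new_tokens), label))
        pos = end
    new_tokens.extend(tokens[pos:])
    return new_tokens, new_spans


if __name__ == "__main__":
    tokens = ["Yesterday", "John", "Smith", "visited", "Acme", "."]
    spans = [(1, 3, "PERSON"), (4, 5, "ORG")]
    print(pseudonymise(tokens, spans, random.Random(0)))
```

Repeated mentions of the same entity within one text receive the same surrogate, which is a design choice intended to keep coreference annotations meaningful after replacement.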