View a Ticket

Discussion

Dear Normundus,

Let me briefly discuss copyright before I move to personal data: I assume you have taken good care of IPR clearance and you are sure that you have all the necessary rights and permissions (regarding primary data) to distribute the corpus in the intended manner (i.e. under the CC BY-NC-SA 4.0 license and under bespoke commercial licenses)?

Now, regarding personal data: the concept of personal data is indeed extremely broad; if 'random people' are mentioned in the corpus, then it contains personal data.

The concept of processing is just as broad and what your team is doing also qualifies as 'processing'. See art. 4 of the GDPR for the definitions of both terms.

From the legislator's perspective, the solution to your problems should be anonymisation of the corpus (at least before it's distributed), but you mention this is not an option, which we understand.

In some circumstances, the GDPR does allow you to process unanonymised personal data for research purposes without consent of the data subjects (cf. art. 89), provided that you apply 'appropriate safeguards for the rights and freedoms of data subjects'. Nothing in the GDPR (unlike in the Copyright Directive) says that 'research' should be 'non-commercial'; nevertheless, it has to be 'research'. Since the GDPR is fairly new, this is still largely a grey area -- the Latvian legislator might have provided you with some guidelines regarding the processing of personal data for research purposes in the Latvian law accompanying the GDPR (if such law exists). It is not unlikely that a number of uses that you envisage for research purposes may be covered by this exception.

Furthermore, please note that lawfulness (the art. 6 you mention) is not the only obligation you have to comply with under the GDPR; remember that there are also obligations of transparency, purpose and storage limitation, security, etc.

Even if we put all this aside, as far as transferring data to a US-based company is concerned, the situation is complicated (since the US is not approved by the Commission as a country that 'ensures an adequate level of protection' -- see art. 45 of the GDPR). As ELRC Helpdesk, we are not really in a position to advise you on distributing your data to the United States for commercial use, nor to contradict what your in-house lawyers told you. It could be much simpler if you considered an appropriate anonymisation technique (e.g. replacing named entities with other words belonging to the same category in a randomised manner).
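
To illustrate the kind of replacement technique mentioned above, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the entity spans are assumed to come from whatever named-entity recogniser you already use, and the function name `replace_entities`, the `REPLACEMENTS` pools and the category labels are hypothetical, not part of any existing tool. Whether the output is sufficiently anonymised for your purposes remains a question for your own assessment.

```python
import random

# Hypothetical replacement pools, one per entity category (assumption:
# in practice these would be much larger and suited to the corpus language).
REPLACEMENTS = {
    "PERSON": ["Anna Ozola", "Peter Berg", "Maria Kalna"],
    "ORG": ["Acme Ltd", "Northwind Group"],
    "LOC": ["Springfield", "Riverton"],
}


def replace_entities(text, entities, rng=None):
    """Replace each annotated entity with a random item of the same category.

    `entities` is a list of (start, end, category) character offsets,
    assumed to be non-overlapping and produced by an external NER step.
    Repeated mentions of the same entity get the same replacement, so the
    text stays coherent within one document.
    """
    rng = rng or random.Random()
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, category in sorted(entities):
        original = text[start:end]
        key = (category, original)
        if key not in mapping:
            # Never pick the original string itself as its own replacement.
            candidates = [c for c in REPLACEMENTS[category] if c != original]
            mapping[key] = rng.choice(candidates)
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(mapping[key])        # the substituted entity
        cursor = end
    pieces.append(text[cursor:])           # untouched tail of the text
    return "".join(pieces)


if __name__ == "__main__":
    sample = "John Smith met John Smith in Riga."
    spans = [(0, 10, "PERSON"), (15, 25, "PERSON"), (29, 33, "LOC")]
    print(replace_entities(sample, spans))
```

The mapping is kept per call, so the same person is replaced consistently within one document but differently across documents, which makes it harder to link mentions across the corpus.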

However, if you want to distribute the corpus via ELRC-Share, the CC BY-NC-SA 4.0 license is satisfactory from our point of view.

Kind regards,

ELRC Helpdesk
- by The ELRC Helpdesk Team
- at 7 years, 5 months ago

Helpdesk for Language Resources

Is GDPR relevant to newswire text corpora?
