PAPAYA (PlAtform for PrivAcy preserving data Analytics) is a research project funded by the European Union under the Horizon 2020 programme. It aims to research and develop new data mining and data analysis techniques that are both privacy-preserving for the end users who share their data and cost-effective for the companies that need to run analytical queries on them.
For this purpose, a versatile, user-friendly platform will be developed, one which provides the users with a clear display and an integrated management of the shared data in different specific contexts and for different purposes.
The project focuses on two scenarios that are transforming at ever-increasing speed, thanks to new technologies, and in which, typically, both the respect for the user’s privacy and the relevance of the information acquired via data analysis are crucial: healthcare analytics and web analytics.
The project brings together MediaClinics Italia, high-profile European research centers (Eurecom, Karlstad University) and centers of technological excellence (IBM Israel, Orange, Atos).
The valuable insights that can be inferred by analysing data generated and collected from a variety of devices and applications are transforming businesses and are therefore one of the key motivations for organisations to adopt such technologies. Nevertheless, the data being analysed and processed are highly sensitive and put at risk the privacy of the users who chose to share them, knowingly or unknowingly.
The European General Data Protection Regulation (GDPR) represents a major challenge for companies (especially small and medium-sized enterprises), as they are required to build privacy by design into their systems and to adopt Privacy Enhancing Technologies that, on the one hand, protect data to ensure their clients’ privacy and, on the other, still allow the data to be processed in a way that keeps them meaningful and useful.
The PAPAYA project aims to address the privacy concerns that arise when data analytics tasks are performed by untrusted third-party data processors. Since these tasks may be performed obliviously on protected data (i.e. encrypted data), PAPAYA will design and develop dedicated privacy-preserving data analytics modules that enable data owners to extract valuable information from this protected data while remaining cost-effective and accurate.
The solutions developed under the PAPAYA project are designed from the analysis of four different real-world, restricted settings:
This scenario defines a setting where the data owner applies data analytics primitives to her sensitive data. Because of the computational burden of these operations, the data analytics tasks are offloaded to a third-party data processor, such as a cloud server.
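As a rough illustration of this outsourcing pattern, the sketch below uses a toy additively homomorphic scheme in the style of Paillier encryption. The source does not state which cryptographic primitives PAPAYA uses for this setting, so the choice of scheme, the tiny hard-coded primes, and the function names (`encrypt`, `add_encrypted`, `decrypt`) are all assumptions made for the example. The parameters are far too small to be secure.

```python
import math
import secrets
from typing import List

# Toy key material: tiny fixed primes chosen for readability (NOT secure).
P, Q = 1009, 1013
N = P * Q                      # public modulus
N_SQ = N * N
# Private key: LAMBDA = lcm(P - 1, Q - 1) and its inverse modulo N.
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)        # modular inverse, requires Python 3.8+


def encrypt(message: int) -> int:
    """Data owner: encrypt an integer 0 <= message < N with the public key."""
    while True:
        r = secrets.randbelow(N - 1) + 1   # random r in [1, N - 1]
        if math.gcd(r, N) == 1:
            break
    # With generator g = N + 1, g**message mod N^2 equals 1 + message * N.
    return ((1 + message * N) * pow(r, N, N_SQ)) % N_SQ


def add_encrypted(ciphertexts: List[int]) -> int:
    """Third-party processor: add the plaintexts without seeing them.

    Multiplying ciphertexts modulo N^2 adds the underlying messages modulo N.
    """
    result = 1
    for c in ciphertexts:
        result = (result * c) % N_SQ
    return result


def decrypt(ciphertext: int) -> int:
    """Data owner: recover the plaintext with the private key."""
    u = pow(ciphertext, LAMBDA, N_SQ)
    return ((u - 1) // N * MU) % N


# The owner encrypts locally, the processor aggregates, the owner decrypts.
readings = [120, 80, 45]
encrypted_total = add_encrypted([encrypt(v) for v in readings])
total = decrypt(encrypted_total)
```

In this sketch the processor only ever handles ciphertexts, and the result is correct as long as the true sum stays below `N`.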
This case addresses the scenario where multiple data owners collaboratively process a large dataset containing data from all the different data owners and derive some relevant information such as a global machine learning model. In this setting, each owner’s data privacy is protected from the others. Only the outcome of the privacy preserving analytics is known by all the data owners. This will help small entities to augment their dataset and hence have more accurate information while remaining compliant with the GDPR.
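One simple way to picture a computation in which only the outcome is revealed is a secure sum based on additive secret sharing. The sketch below is an illustration under stated assumptions, not PAPAYA's actual protocol: the modulus, the function names (`share`, `secure_sum`), and the honest-but-curious behaviour of all parties are hypothetical choices, and the whole exchange is simulated in a single process.

```python
import secrets
from typing import List

MODULUS = 2**61 - 1  # assumed to exceed any sum the owners will compute


def share(value: int, n_parties: int) -> List[int]:
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values: List[int]) -> int:
    """Simulate data owners jointly computing the sum of their private values."""
    n = len(private_values)
    # distributed[i][j] is the share that owner i sends to owner j.
    distributed = [share(v, n) for v in private_values]
    # Each owner j adds up the shares it received and publishes only that total.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Combining the published partial sums yields the global result.
    return sum(partial_sums) % MODULUS


# Three hypothetical owners, each holding a private local count.
joint_total = secure_sum([12, 30, 7])
```

Each individual share is uniformly random, so an owner who receives one share from a peer learns nothing about that peer's value from it. Only the final total is meaningful to everyone.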
In this scenario, the data comes from a single source that protects it from being read by a third party. However, the data owner allows a third party to perform analytical tasks over its encrypted data, provided that the third party learns only the analytics result. The data source can be a single user or a privacy-preserving aggregator of data coming from different users. The latter case addresses Article 89 of the GDPR, under which processing encrypted data for statistical purposes should be done in a privacy-preserving way whenever possible. In this scenario, special care will be given to the leakage generated by the analytical process. For example, multiple queries should not allow the querier to recover sensitive information from the database.
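The leakage from multiple queries can be made concrete with a differencing attack: two aggregate queries that look harmless on their own can isolate one person's value when combined. The sketch below uses made-up records and a deliberately naive auditing rule. The class `AuditedDatabase`, the `min_difference` parameter, and the data are hypothetical, and the rule is only one conceivable countermeasure, not the mechanism used in PAPAYA.

```python
# Hypothetical records held by the data source.
ROWS = [
    {"name": "alice", "value": 3},
    {"name": "bob", "value": 5},
    {"name": "carol", "value": 9},
    {"name": "dave", "value": 4},
]


def sum_query(rows, predicate):
    """Unprotected aggregate: sum of `value` over the rows matching `predicate`."""
    return sum(row["value"] for row in rows if predicate(row))


# Differencing attack: two aggregates whose query sets differ by one record.
everyone = sum_query(ROWS, lambda r: True)
all_but_carol = sum_query(ROWS, lambda r: r["name"] != "carol")
leaked_value = everyone - all_but_carol  # carol's private value


class AuditedDatabase:
    """Naive auditor: refuse a query whose row set is too close to an earlier one."""

    def __init__(self, rows, min_difference=2):
        self._rows = rows
        self._min_difference = min_difference
        self._answered = [frozenset()]  # the empty set also blocks tiny queries

    def sum_query(self, predicate):
        selected = frozenset(
            i for i, row in enumerate(self._rows) if predicate(row)
        )
        for previous in self._answered:
            # Symmetric difference = rows that only one of the two queries covers.
            if 0 < len(selected ^ previous) < self._min_difference:
                raise PermissionError("query would isolate too few records")
        self._answered.append(selected)
        return sum(self._rows[i]["value"] for i in selected)
```

With the auditor in place, the first query is answered but the second one, which differs from it by a single record, is refused. A rule this simple can still be circumvented by combining several queries, which is why the leakage of the whole analytical process has to be analysed.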
In this setting, the data to be analysed comes from different sources and is queried by a third party. Neither the server nor the querier sees the collected data in the clear, but only in encrypted form, thus achieving end-to-end privacy.
Based on the analysis of these scenarios, software modules and cryptographic primitives tailored to the different contexts are implemented and integrated into an all-in-one platform for use by citizens.
MediaClinics intends to apply the results of the project in the digital health sector: the innovative data mining and data analysis techniques developed will allow more precise and deeper epidemiological research, and may prove crucial for the development of machine learning techniques to be used as a diagnostic support tool in medical centers.
PAPAYA will also make the underlying cryptographic primitives used by the system available to specific classes of users (e.g. IT developers), to allow the design and development of additional new technologies that safeguard citizens' right to privacy.