Big data in health

Technical, human and ethical challenges

In the field of health, big data refers to all sociodemographic and health data available from the different sources which collect them for various reasons. Use of these data offers numerous advantages: identification of disease risk factors, support for diagnosis, for the choice of treatments and for monitoring treatment efficacy, pharmacovigilance, epidemiology, etc. However, it still gives rise to numerous technical and human challenges and just as many ethical issues.


Report drawn up in collaboration with Rodolphe Thiebaut, Director of the Translational Medicine Statistics Team (Inria/Inserm Unit 1219), lecturer at Institut de santé publique d’épidémiologie et de développement (ISPED - Public Health Epidemiology and Development Institute, Bordeaux), Director of the Clinical and Epidemiological Research Support Unit at CHU de Bordeaux, and scientist at the Vaccine Research Institute (Créteil).

Understanding the importance of big data in health

In health, as in many other domains, technological progress has dramatically increased the quantity of data collected every moment. For example, while it took ten years to obtain the first human genome sequence, completed in 2003, the same result can now be achieved in less than one day. This acceleration in technology has led to an exponential increase in the volume of available data. This is a godsend for medical research, where big data is an almost inexhaustible source of new knowledge, crucial to innovation and medical progress.

Vast sources and types of data

France has approximately 260 public databases in the field of health, and the Epidémiologie‐France website lists up to 500 medical and economic databases, cohorts, registries and ongoing studies.

Medical-administrative databases

These databases offer highly exhaustive and objective data on a large population scale, with few people lost to follow-up. These are major advantages compared to information which can be collected during short- or medium-term studies, conducted in specific or limited populations, often based on declarations by participants.

The most extensive medical-administrative database is the SNIIRAM (French national inter-scheme information system on health insurance). This database contains all reimbursements issued by the French national health insurance scheme for each contributor, throughout their lives (laboratory tests, medications, ambulances, appointments with the dates and names of the health care professionals consulted, disease codes in certain cases, etc.). This system allows these reliable data to be monitored in the long term.

Numerous other medical-administrative databases exist, such as the ATIH (technical hospitalization data agency) database and pension fund databases (including CNAV). There are also a number of databases managed by research centers, notably the CépiDc database (Inserm), which has recorded the medical causes of death in France since 1968.

Who has access to SNIIRAM data?

This database is currently accessible to health agencies and not-for-profit public research organizations. In 2013, approximately fifty scientists searched this database regularly, with more than 17,000 queries, i.e., 30% more than the previous year.
A decree issued by the French Ministry of Health prohibiting for-profit organizations (insurance companies, pharmaceutical companies, etc.) from accessing this database was found to be illegal by the Council of State, which has requested that it be annulled by the end of 2016. Consequently, all organizations wishing to conduct a study in the public interest will soon be able to access these data, and requests for access are expected to increase dramatically over the next few years.



Cohorts

A cohort is a group of people sharing a number of characteristics, followed up by scientists over a varying period of time so as to identify the onset of health events (disease or physical dysfunction) and the associated risk or protective factors.

Research organizations set up large cohorts, including up to several tens of thousands of participants, followed up for several years. This is the case for the Constances, I-Share, MAVIE and NutriNet-Santé cohorts, for example, created in partnership with Inserm. The Constances cohort, currently being created, will ultimately include 200,000 adults aged 18 to 69 years, attending Social Security health clinics. The I-Share cohort will include 30,000 university students, followed up for 10 years. The MAVIE observational study monitors everyday accidents among over 25,000 Internet volunteers. NutriNet-Santé collects a multitude of data on lifestyle, health and dietary habits from 500,000 French people.

All data collected enable epidemiological studies and surveillance to be performed, with a potentially major impact in terms of public health.

Clinical studies

Public laboratories also conduct numerous clinical research studies in specific patient populations, analyzing their risk profiles and health statuses. The volume of data collected from each patient is constantly increasing: hundreds of data items are now collected per individual, compared with only ten or so a few years ago.

Dozens of clinical, biological, imaging and genetic parameters are routinely collected in oncology. This is also the case for vaccine development. For example, in the DALIA clinical trial conducted by the Vaccine Research Institute, which aims to evaluate a therapeutic vaccine for HIV, the immune cells of all patients were counted by recognition of surface markers, and their function was tested. The protocol generated approximately 800 measurements per patient and per visit, not including the study of the gene expression of numerous markers (47,000 probes/patient/visit) and high-throughput sequencing of the virus itself.


Screen capture and checking of patient data collection and consistency © Inserm/P. Delapierre

Connected health devices

Connected health devices also generate vast quantities of data that can be transferred and shared: these devices record the number of steps taken, heart rate, blood glucose levels, blood pressure, etc. These data are usually stored and managed by the web giants, or GAFAM: Google, Apple, Facebook, Amazon and Microsoft.

Challenges facing research

Major technical challenges

The enormous volumes of data now available have given rise to technical challenges in terms of storage and processing capacity. Increasingly complex statistical and computer algorithms and programs are proving necessary.

Research organizations are all equipped with storage servers and supercomputers. These are shared platforms in the majority of cases, given their cost. This is the case for the Mésocentre de calcul intensif aquitain (MCIA, Bordeaux), for instance, shared by Université de Bordeaux, CNRS, Inra, Inria and Inserm in the region. Another example is Platine, a European immunomonitoring platform in Lyon, managed by several biotech firms, together with the Centre Léon Bérard cancer center and Inserm. This aims to assist the therapeutic decision-making process for physicians in oncology and infectious diseases, by analyzing patients' initial immunological status.

Another problem is the somewhat fragmented nature of big data. The data collected are increasingly diverse, due to:

  • their nature (genomic, physiological, biological, clinical, social, etc.),
  • their format (text, numerical values, signals, 2D and 3D images, genomic sequences, etc.),
  • their dispersed distribution within several information systems (hospital groups, research laboratories, public databases, etc.).

In order to be processed and analyzed, this complex information needs to be acquired in a structured, coded manner before being entered in databases or data warehouses. A number of standards are being designed, such as i2b2 (Informatics for Integrating Biology and the Bedside), developed in Boston and now used at university hospitals in Rennes and Bordeaux, and at Hôpital Européen Georges Pompidou (Paris). This system was, for instance, used to identify and quantify the increased risk of myocardial infarction in patients on Avandia, and contributed to the withdrawal of this medicinal product from the market.
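To illustrate what acquiring data "in a structured, coded manner" can mean in practice, here is a minimal sketch in Python. It is not the i2b2 data model: the `Observation` record, the code prefixes (`LAB:`, `DEV:`) and the two export formats are hypothetical, invented for this example. The idea is simply that records arriving in different formats are converted into one common coded form before being searched.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One coded fact about one patient (hypothetical common format)."""
    patient_id: str  # pseudonymous identifier, never a name
    code: str        # hypothetical concept code, e.g. "LAB:GLUCOSE"
    day: date
    value: float
    unit: str


def from_lab_export(row: dict) -> Observation:
    # Hypothetical laboratory export: all values arrive as text.
    return Observation(
        patient_id=row["pid"],
        code="LAB:" + row["test"].upper(),
        day=date.fromisoformat(row["date"]),
        value=float(row["result"]),
        unit=row["unit"],
    )


def from_device_export(row: dict) -> Observation:
    # Hypothetical connected-device export: ISO timestamp, numeric value.
    return Observation(
        patient_id=row["user"],
        code="DEV:" + row["metric"].upper(),
        day=date.fromisoformat(row["timestamp"][:10]),
        value=float(row["bpm"]),
        unit="beats/min",
    )


def query(warehouse: list, patient_id: str, code_prefix: str) -> list:
    """Return one patient's observations whose code starts with code_prefix."""
    return [
        obs for obs in warehouse
        if obs.patient_id == patient_id and obs.code.startswith(code_prefix)
    ]
```

Once every source is mapped to the same record type, a single query function can search laboratory and device data alike, which is the practical benefit of a shared standard.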

Thanks to these standards, hospitals and health centers are better equipped to compile all collected data (pharmacy, laboratory test, imaging, genomic, medical and economic and clinical data, etc.) in biomedical warehouses that scientists can search via web interfaces. Numerous research teams are also working on integrated platforms, to match databases and aggregate their data with cohort data. For example, the Hygie project, conducted by the Institut de recherche et de documentation en économie de la santé (Health Economics Documentation and Research Institute), is working on matching the SNIIRAM and SNGC (national pension insurance occupational management system) databases. The aim is to create an information system on daily social security benefits based on a sample of 800,000 individuals, to be added to the CONSTANCES cohort files.
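Matching two databases can be pictured with a short sketch. The example below is in Python and rests on an assumption not stated in this report: that both sources carry the same pseudonymous identifier for each person (called `pseudo_id` here, a hypothetical field name). The actual matching procedures used by projects such as Hygie are not described here and are likely more involved.

```python
def link_records(claims: list, pensions: list, key: str = "pseudo_id"):
    """Match each claim record to the pension record sharing the same key.

    Returns (linked, unmatched): merged records for claims that found a
    match, and the claims for which no pension record exists.
    """
    # Index one source by the shared key so each lookup is constant time.
    pension_by_key = {row[key]: row for row in pensions}

    linked, unmatched = [], []
    for claim in claims:
        match = pension_by_key.get(claim[key])
        if match is None:
            unmatched.append(claim)
        else:
            # Claim fields take precedence if both sources share a field name.
            linked.append({**match, **claim})
    return linked, unmatched


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    claims = [{"pseudo_id": "a1", "benefit_days": 12},
              {"pseudo_id": "b2", "benefit_days": 3}]
    pensions = [{"pseudo_id": "a1", "sector": "retail"}]
    linked, unmatched = link_records(claims, pensions)
    print(len(linked), "linked,", len(unmatched), "unmatched")
```

Keeping the unmatched records separate, rather than silently dropping them, lets the analyst check how complete the linkage is before drawing conclusions.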

In practice

When a scientist wishes to start a study based on the use of big data, s/he starts by identifying useful databases and submits a request for specific access to the teams or organizations which hold these data. S/he will then need to work with numerous skilled personnel to conduct meta-analyses bringing together all of these data. In the DALIA trial, for example, the analysis of the results required the contribution of approximately fifty individuals from various disciplines: clinicians, immunologists, biologists, virologists, laboratory technicians, clinical research assistants, database administrators, biostatisticians and bioinformaticians.


Big data, how useful is it?

Big data is of interest to numerous stakeholders in the health domain (companies, not-for-profit or for-profit research organizations, scientists, physicians, industrialists, etc.), as it enables considerable medical progress.

Optimizing prevention and treatment of disease

Multidimensional data collected long term from large populations allow risk factors to be identified for certain diseases, such as cancer, diabetes, asthma, and neurodegenerative diseases. These factors are then used to create preventive messages, and to set up programs intended for the populations at risk.

Big data also enables the development of diagnosis support systems and instruments allowing tailored treatments. These systems are based on processing vast volumes of individual clinical data. For instance, the IBM Watson supercomputer is able to analyze the genomic sequencing results of cancer patients, compare the data obtained with those already available, and thus propose a tailored therapeutic strategy within a few minutes. Without this instrument, this analytical process takes several weeks. The clinics and hospitals concerned are entering into partnerships with IBM, which holds this supercomputer and provides the results.

Big data can also be used to verify treatment efficacy. For example, in the vaccine domain, clinicians now measure hundreds of parameters during clinical trials (cell counts, cell function, expression of the genes concerned, etc.), whereas a few years ago they were limited to the concentration of the antibody concerned. Ultimately, these changes, the big data generated, and the ability to analyze them could make it possible to determine whether immunization has been successful after only an hour, using a microdroplet of blood.

Predicting epidemics

Having access to vast data on the health status of individuals in a given region makes it possible to identify the increased incidence of diseases or harmful behaviors, and to alert the health authorities.

For example, the HealthMap site aims to predict the onset of epidemics, using data originating from numerous sources. Developed by American epidemiologists and computer scientists in 2006, this site operates by collecting reports issued by health departments and public organizations, official reports, web data, etc. This information is constantly updated so as to identify health threats and to alert populations. We should also mention the GLEAM simulator, designed to predict the spread of a specific epidemic by processing air transport data.

In France, the Sentinel network has been monitoring several infectious diseases and issuing alerts on epidemics since 1984, thanks to the contribution of 1,300 primary care practitioners and a hundred or so pediatricians throughout the country. At least once a week, these practitioners report the number of cases observed for seven transmissible diseases (acute diarrhea, Lyme disease, mumps, influenza-like illness, male urethritis, chickenpox and shingles), together with suicide attempts. These data are transferred, via a secure network, to Institut Pierre Louis d’Épidémiologie et de Santé Publique France, in collaboration with Institut de Veille Sanitaire (InVS).
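The principle of turning weekly reports from a sample of practitioners into an alert can be sketched as follows. This is an illustrative Python example only: the extrapolation rule, the fixed threshold and all figures used with it are assumptions made for the example, not the statistical method actually used by the Sentinel network.

```python
def estimated_incidence_per_100k(cases_reported: int,
                                 reporting_doctors: int,
                                 total_doctors: int,
                                 population: int) -> float:
    """Extrapolate cases seen by a sample of doctors to a whole region.

    Assumption for this sketch: doctors in the sample see, on average,
    as many cases as doctors outside it.
    """
    if reporting_doctors <= 0 or population <= 0:
        raise ValueError("reporting_doctors and population must be positive")
    mean_cases_per_doctor = cases_reported / reporting_doctors
    estimated_cases = mean_cases_per_doctor * total_doctors
    return estimated_cases * 100_000 / population


def weeks_above_threshold(weekly_rates: dict, threshold: float) -> list:
    """Return the weeks whose estimated incidence exceeds a fixed threshold."""
    return [week for week, rate in weekly_rates.items() if rate > threshold]


if __name__ == "__main__":
    # Invented figures, for illustration only.
    rates = {
        "week 1": estimated_incidence_per_100k(20, 100, 1_000, 1_000_000),
        "week 2": estimated_incidence_per_100k(90, 100, 1_000, 1_000_000),
    }
    print(weeks_above_threshold(rates, threshold=50.0))
```

A fixed threshold is the simplest possible rule; real surveillance systems typically compare each week with a seasonal baseline, which this sketch does not attempt.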

Improving pharmacovigilance

Long-term analysis of data obtained from cohorts or medical and economic databases enables numerous phenomena to be monitored and, in particular, allows treatments to be correlated with the onset of health events. This practice makes it possible to identify serious adverse events and issue alerts for certain risks. In 2013, the SNIIRAM database made it possible to study the risk of stroke and myocardial infarction among women using third-generation birth control pills.


Georgios Gropetis, Head of the Calculation Center at UMR-S 707 (Sentinel network) © Inserm/P. Latron

Between data protection and advances in research: the ethical challenges of big data

During clinical trials, consent is required before any health data are collected. Likewise, any scientists or clinicians using health care data must notify the patient concerned and submit a declaration to the CNIL (French Data Protection Agency). However, other data are collected without the knowledge of the individuals concerned, particularly from keyword searches or during data transfer from connected devices. This clearly raises ethical issues as to whether individuals wish to share these data with third parties, and as to the protection of anonymity.

Numerous other questions are also raised: should all data be stored? Should they be shared? Who should manage them, and what are the conditions for sharing? How can we ensure that Google, Apple, Facebook and Amazon do not take over some of these data? The stakes are serious: risk of disclosure of private information and consequences for social life, loss of confidence in the authorities and in the confidentiality of research, aggressive advertising, etc. These problems are regularly reviewed by ethics committees, including the National Consultative Ethics Committee in France.

The authorities have also examined this issue: the health system reform law, promulgated on January 26, 2016, provides for access to aggregate health data to be opened up, for research purposes, studies or evaluation of public interest, to all citizens, health care professionals or organizations (public or private) contributing to the functioning of the health care system and treatment. This opening up of data is subject to several conditions:

  • the data must not enable the persons concerned to be identified (the law drastically restricts access to personal data that may be used to identify a person),
  • the research must not lead to the promotion of products intended for health care professionals or medical institutions, or give rise to exclusion of cover in insurance policies, or changes to contributions or insurance premiums.

To have access to these data, any research or study organizations wishing to conduct a public interest project must submit their project to the Institut national des données de santé (National Institute for Health Data), whose members include State representatives, national health insurance scheme users, and public/private health data consumers and producers. The study protocol should then be validated by a scientific board, before the CNIL issues its decision on the aspects relating to the protection of privacy. However, in June 2016, the implementing decrees for this new organizational system had not yet been published.


Inserm and the National Health Data System

The health system reform law of January 2016 provides for the creation of the Système national des données de santé (SNDS - National System for Health Data). This system will notably consist of:

  • national health insurance scheme data (SNIIRAM),
  • hospital data (PMSI database),
  • causes of death (CépiDc-Inserm).

Governance of this system is expected to include data producers, including Inserm. In more practical terms, Inserm is expected to act as the data extraction and provision operator for processing taking place for research purposes.


In April 2016, Marisol Touraine, Minister for Social Affairs and Health, launched an online national consultation on big data in the field of health. The objective was to allow all French people to give their opinion on the desired objectives for patients, health care professionals, industries, insurers and the authorities, as well as the acceptable conditions for the use of health data. The conclusions are expected by the end of 2016. Scientists are calling for these data to be opened up fairly broadly, with simplified access. Their aim is to accelerate research via adapted technical platforms allowing high levels of security (only collecting data of potential interest to the research subject, isolating identifying data, encoding certain information, limiting access to and copying of information, etc.).
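Some of the security measures listed above (collecting only the data needed, isolating identifying data, encoding certain information) can be illustrated with a short Python sketch. The field names, the choice of a keyed hash (HMAC-SHA256) and the function names are assumptions made for this example; the report does not specify which techniques the platforms use, and a real system would also need key management, access control and legal review.

```python
import hashlib
import hmac

# Hypothetical list of fields that directly identify a person.
IDENTIFYING_FIELDS = {"name", "national_id", "address"}


def pseudonym(national_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash.

    The same identifier and key always give the same pseudonym, so records
    can still be grouped per person, but the identifier cannot be looked up
    from the pseudonym without the key.
    """
    return hmac.new(secret_key, national_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def prepare_for_research(record: dict, secret_key: bytes,
                         fields_needed: list) -> dict:
    """Keep only the requested non-identifying fields, plus a pseudonym."""
    extract = {"pseudo_id": pseudonym(record["national_id"], secret_key)}
    for field in fields_needed:
        if field in IDENTIFYING_FIELDS:
            raise ValueError(f"identifying field requested: {field}")
        extract[field] = record[field]
    return extract


if __name__ == "__main__":
    # Invented record, for illustration only.
    record = {"name": "Jane Doe", "national_id": "0000000000000",
              "address": "1 rue Exemple", "birth_year": 1980,
              "diagnosis_code": "E11"}
    print(prepare_for_research(record, b"demo-key-not-for-real-use",
                               ["birth_year", "diagnosis_code"]))
```

Note that removing direct identifiers does not by itself guarantee anonymity: a combination of remaining fields may still single out a person, which is one reason access is also limited and supervised.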