
Big Data in Genomics: Challenges and Solutions

Dec. 12, 2012
Fig. 1: Big Data in Genomics. Schematic representation of a pipeline from data generated using NGS to data “translation” for clinicians and researchers. Data is generated by next-generation DNA sequencers (omics data such as genomes, transcriptomes, exomes, epigenomes, and other similar information), transferred to the “cloud” or to internal servers, then analyzed and visualized using different solutions available in the market. Finally, after a deep analysis for biomarkers and drug targets associated with specific disease phenotypes, the data is translated into a short report for clinicians and researchers. Genome variants are also identified by comparing different samples, generating high-quality interpretation based on our current knowledge. This type of pipeline will ease the implementation and application of different types of omics data in the clinic and in research. Between data transfer, storage, and visualization, patient data needs to be secured by encryption. Some solutions for both medical and scientific data security have been developed recently, but this is a new area of study in biomedical informatics and big challenges lie ahead. Abbreviations: COMP, Computer; E.G.s, Examples; IGV, Integrated Genome Viewer; NGS, Next Generation Sequencing. Image Design: Eduardo Braga Ferreira Junior
Table 1: Examples of companies and Institutions that provide solutions to generate, store, analyze, ...

Pipelines to deal with increasing amounts of genomics data will be needed to store, transfer, analyze, and visualize the data and to generate "short" reports for researchers and clinicians (see figure 1). Cloud computing could make an entirely new genomics industry possible, transforming the way we approach research and medicine in the life sciences. However, one downside of cloud computing is keeping the data private.
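The shape of such a pipeline can be sketched as a chain of stages, each consuming the previous stage's output. All stage names and the toy "analysis" below are illustrative assumptions, not a real NGS workflow, which would invoke sequencer software, cloud storage APIs, and variant callers at each step:

```python
# Illustrative sketch of a genomics data pipeline (stage names are
# hypothetical; real stages would be far more complex).

def store(raw_reads):
    """Persist raw sequencer output (here: just pass it through)."""
    return raw_reads

def analyze(reads):
    """Toy 'analysis': count occurrences of each base."""
    counts = {}
    for base in reads:
        counts[base] = counts.get(base, 0) + 1
    return counts

def report(counts):
    """Condense the analysis into a short clinician-facing summary."""
    total = sum(counts.values())
    return {base: round(n / total, 2) for base, n in counts.items()}

pipeline = [store, analyze, report]

data = "ACGTACGTAA"
for stage in pipeline:
    data = stage(data)

print(data)  # base-composition summary: {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2}
```

The point of the chained structure is that each stage can be swapped out independently, which mirrors how commercial pipeline solutions let labs mix storage, analysis, and visualization components.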

The Coming Age of Data-driven Science and Medicine
Understanding how the underlying systems in living organisms operate will require integrating the many layers of biological information that high-throughput technologies are generating. The complexity of the data generated in scientific projects will only increase as we continue to isolate and sequence individual cells and organisms while lowering the costs to generate and analyze these data, such that hundreds of millions of samples can be profiled. Sequencing DNA, RNA, the epigenome, and other omics from numerous cells in different individuals will take us to the exabyte (10¹⁸ bytes) data scale in the next 5 years or so [7]. Integrating all these data will demand high-performance computational environments like those at big genome centers [7]. Hardware and software infrastructures tailored to big data in the life sciences will become more common in the years to come.
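A back-of-envelope calculation shows how quickly the exabyte scale is reached. The coverage depth, bytes-per-base, and cohort size below are illustrative assumptions, not figures from the article:

```python
# Rough arithmetic for genomic data volumes.
# Assumptions (illustrative): ~3.2e9 bases per human genome, 30x sequencing
# coverage, ~1 byte per base after compressing reads and quality scores.

BASES_PER_GENOME = 3.2e9
COVERAGE = 30
BYTES_PER_BASE = 1

bytes_per_sample = BASES_PER_GENOME * COVERAGE * BYTES_PER_BASE
print(f"One 30x genome: ~{bytes_per_sample / 1e9:.0f} GB")  # ~96 GB

# Scaling to the "hundreds of millions of samples" the text projects:
samples = 100e6
total_bytes = bytes_per_sample * samples
print(f"{samples:.0e} samples: ~{total_bytes / 1e18:.0f} exabytes")  # ~10 EB
```

Even under these conservative assumptions, a hundred-million-sample cohort lands around ten exabytes, which is why the text argues for genome-center-class infrastructure.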

Importantly, data-driven medicine will enable the discovery of new treatment options based on multi-modal molecular measurements on patients and on learning from trends in differential diagnosis, prognosis, and prescription side effects in clinical databases [8]. Combining omics data with clinical information from patients will generate new scientific knowledge that could be applied in the clinic to help in patient care [8]. In addition, medical informatics, represented by patients' electronic medical records (EMRs), and personalized therapies will enable targeted treatments for specific diseases. Thus, it is tempting to imagine how differently both scientific inquiry and patient care would be performed if large amounts of genomic and clinical data were collected in Big Data repositories and shared by health care professionals (fig. 1).


Challenges and Solutions
These revolutionary changes in Big Data generation and acquisition create profound challenges for storage, transfer and security of information. Indeed, it may now be less expensive to generate the data than it is to store it. One example of this issue is the National Center for Biotechnology Information (NCBI). The NCBI has been leading Big Data efforts in biomedical science since 1988, but neither the NCBI nor anyone in the private sector has a comprehensive, inexpensive, and secure solution to the problem of data storage (even though companies with different solutions are starting to appear as shown in table 1). These capabilities are beyond the reach of small laboratories or institutions, posing several challenges for the future of biomedical research.

Another challenge is transferring data from one location to another; currently, this is mainly done by shipping external hard disks through the mail. An interesting solution for data transfer is BioTorrents, which allows open-access sharing of scientific data using peer-to-peer file-sharing technology [9]. Torrents were originally designed to facilitate distribution of large amounts of data over the internet, and this solution could be applied to biomedicine [9].
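The core idea that makes torrents suitable for large scientific data sets is piece-wise integrity checking: a file is split into fixed-size pieces, each with a recorded hash, so a receiving peer can verify every chunk independently regardless of which peer served it. A minimal sketch of that mechanism (the 4-byte piece size and sample data are assumptions for illustration; real torrents use pieces of e.g. 256 KiB and richer metadata):

```python
import hashlib

# BitTorrent-style integrity checking: hash each fixed-size piece of the
# data so any chunk received from any peer can be verified on its own.
# (Sketch only; real .torrent metadata also carries tracker URLs, piece
# length, and file names.)

PIECE_SIZE = 4  # bytes; illustrative, real torrents use much larger pieces

def piece_hashes(data, piece_size=PIECE_SIZE):
    """Return the SHA-1 hash of each piece of the data."""
    return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]

def verify(received, expected_hashes, piece_size=PIECE_SIZE):
    """Check a received copy against the published piece hashes."""
    return piece_hashes(received, piece_size) == expected_hashes

genome_chunk = b"ACGTACGTACGT"
hashes = piece_hashes(genome_chunk)

print(verify(genome_chunk, hashes))     # True: transfer intact
print(verify(b"ACGTACGAACGT", hashes))  # False: a piece was corrupted
```

Because verification happens per piece, a corrupted or malicious chunk can be re-requested without re-downloading the whole data set, which matters when files run to terabytes.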

Security and privacy of data from individuals are also concerns. Possible solutions include better security systems with advanced encryption algorithms, like those used by banks in the financial sector to protect their clients' privacy [10]. In addition, a new generation of consent forms that specifically allow study participants or patients to openly share the data generated on them with researchers will be needed [10]. The use of "in house" hardware solutions instead of cloud computing could also ease the implementation of big data with stronger information protection. One example is the knoSYS100 system that Knome is implementing (table 1). These are just some of the solutions that could be applied to overcome big data privacy challenges; others will emerge in the near future.
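To make the encryption idea concrete, the sketch below shows symmetric encryption of a patient record in its simplest possible form, an XOR one-time pad. This is for illustration only: production systems in banking or medicine use vetted ciphers such as AES (available in Python through the third-party `cryptography` package) together with key management and audit trails, and the sample record is invented:

```python
import secrets

# Minimal illustration of symmetric encryption for patient data at rest
# or in transit. XOR with a random one-time key is NOT production
# cryptography; it only demonstrates the encrypt/decrypt round trip.

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the same function encrypts and decrypts.
    return bytes(b ^ k for b, k in zip(data, key))

record = b"patient-123:BRCA1 variant detected"   # hypothetical record
key = secrets.token_bytes(len(record))            # one-time random key

ciphertext = xor_cipher(record, key)   # the form stored in the cloud
plaintext = xor_cipher(ciphertext, key)  # recovered only with the key

assert plaintext == record
print(ciphertext != record)  # True: unreadable without the key
```

The design point is that whoever stores or relays the ciphertext (a cloud provider, a courier disk) learns nothing without the key, which is why key custody, not the cipher itself, is usually the hard part of securing clinical data.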

Success in biomedical research dealing with increasing amounts of omics data combined with clinical information will depend on our ability to interpret the large-scale data sets generated by emerging technologies. Private companies such as Microsoft, Oracle, Amazon, Google, Facebook, and Twitter are masters at dealing with petabyte-scale data sets. Science and medicine will need to implement the same type of scalable infrastructure to handle the volumes of data generated by omics technologies. The life sciences will need to adapt to advances in informatics to successfully address the Big Data problems of the next decade.


Keywords: 23andMe, Big Data, Bioinformatics, Biomarker, Cancer, Cloud Computing, DataGenno, Diagnostics, ENCODE, Epigenomics, Fabricio F. Costa, Genomic Enterprise, Genomics, Illumina, Life Science, Nano, Nanotechnology, NCBI, Next Generation Sequencing, NGS, Oncology, Personal Genomics, Screening, Science


Children’s Memorial Research Center
2300 Children's Plaza
Chicago, IL 60614

