Big Data in Genomics: Challenges and Solutions
- Fig. 1: Big Data in Genomics. Schematic representation of a pipeline from data generated using NGS to data “translation” for clinicians and researchers. Data is generated by Next-Generation DNA Sequencers (omics data such as genomes, transcriptomes, exosomes, epigenomes, and other types of similar information), transferred to the “cloud” or internal servers, and analyzed and visualized using different solutions that are available on the market. Finally, after a deep analysis for biomarkers and drug targets associated with specific disease phenotypes, the data is translated into a short report for clinicians and researchers. Genome variants are also identified by comparing different samples, generating high-quality interpretation based on our current knowledge. This type of pipeline will ease the implementation and application of different types of omics data in the clinic and in research. Between data transfer, storage and visualization, patient data needs to be secured by encryption. Some solutions for both medical and scientific data security have been developed recently, but this is a new area of study in biomedical informatics and big challenges lie ahead. Abbreviations: COMP, Computer; E.G.s, Examples; IGV, Integrative Genomics Viewer; NGS, Next Generation Sequencing. Image Design: Eduardo Braga Ferreira Junior
- © rangizzz - Fotolia.com
- Table 1: Examples of companies and institutions that provide solutions to generate, store, analyze, and visualize omics and clinical data.
In the age of information-driven technologies, are the life sciences prepared for a big data revolution?
Every era has its technological breakthroughs. The widespread use of computers and the internet at the beginning of the 21st century has changed the way we approach and search for information. The emergence of social networks (such as Facebook, Twitter and LinkedIn) and "cloud" solutions for data storage, together with rapidly increasing processor speeds, has changed the way we generate information. The life sciences have been strongly affected by the generation of large data sets, specifically by an overload of omics information (genomes, transcriptomes, epigenomes and other omics data from cells, tissues and organisms). DNA sequencing machines, which are smaller but generate piles of data faster and at lower cost, have changed science and medicine in ways never seen before. The current era is beginning to look like the era of "big data", a term that refers to the explosion of available information that is a byproduct of the digital revolution. However, with biomedical data accumulating in computers and servers around the world, concerns over the privacy and security of patient data are emerging.
Next-Generation Sequencing (NGS) platforms that use semiconductors or nanotechnology have exponentially increased the rate of biological data generation in the last two years. While the first human genome was a $3 billion project that took more than a decade and was completed in 2003, we are now close to being able to sequence and analyze an entire genome in a few hours for less than a thousand dollars. The decrease in costs has enabled the generation of information at the petabyte (10^15 bytes) scale. However, even though both computers and the internet have become faster, we lack the computational infrastructure needed to securely generate, maintain, transfer, and analyze large-scale information in the life sciences and to integrate omics data with other data sets, such as clinical data from patients (mainly from Electronic Medical Records, or EMRs). This article gives a short overview of the challenges posed by big data production, transfer and analysis.
In addition, the changing landscape of privacy and personal information in the era of big data will be discussed.
How Did Big Data Become so Big?
Big Data has affected several unrelated sectors of society, including communications, media, medicine, and scientific research, among others. In science, for example, the time and cost of sequencing genomes were reduced by a factor of 1 million in less than 10 years. Today, personal genomes can be sequenced and mapped faster for a few thousand dollars. Personal genomics is a key enabler for predictive medicine, where a patient's genetic profile can be used to determine the most appropriate medical treatment. The ENCODE Project offers a useful perspective on Big Data generation and analysis: different research groups participated, and the project provides an elaborate framework for personal genomics. Projects such as ENCODE have produced piles of data, illustrating how Big Data is becoming integral to scientific research. Indeed, science today is increasingly "social", especially in fields such as genomics in which huge amounts of data are generated. ENCODE is a good training ground for researchers in the big scientific enterprises that will become increasingly common. In these types of projects, tons of data are generated, stored, transferred and analyzed (see figure 1 for a complete overview).
With the increased need to store the data and information generated by big projects, computational solutions such as cloud-based computing have emerged. Cloud computing is the only storage model that can provide the elastic scale needed for DNA sequencing, whose rate of technological advancement could now exceed Moore's Law. Cloud solutions from different companies are already in use, but they pose several challenges, particularly for the security and privacy of personal medical and scientific data (fig. 1, table 1). Perhaps their greatest advantage is the ability to offer a broad platform for developing new analysis and visualization tools, as well as a software service for using these tools on shared data sets in a secure and collaborative workspace. Some companies already offer such solutions (table 1). There is also an opportunity for an equivalent of the App Store or Google Play specifically for genomics tools, for which hundreds of specialty applications could be developed. Companies such as Illumina and 23andMe already offer an open platform for developers, and more companies will implement APIs (Application Programming Interfaces) in their services. However, solutions to overcome data privacy issues will be crucial.
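The comparison with Moore's Law can be made concrete with a little arithmetic. The sketch below uses the article's own figure (a 1-million-fold reduction in under 10 years) and assumes, for illustration only, that Moore's Law means a doubling every two years; that doubling period is a common rule of thumb, not a number taken from this article.

```python
import math

def fold_change(years: float, doubling_time_years: float) -> float:
    """Improvement factor after `years` for a fixed doubling time."""
    return 2 ** (years / doubling_time_years)

def implied_doubling_time(years: float, factor: float) -> float:
    """Doubling (or halving) time implied by a total improvement `factor`."""
    return years / math.log2(factor)

YEARS = 10                      # period quoted in the article
SEQUENCING_FACTOR = 1_000_000   # cost/time reduction quoted in the article
MOORE_DOUBLING_YEARS = 2.0      # assumption: rule-of-thumb doubling period

moore_factor = fold_change(YEARS, MOORE_DOUBLING_YEARS)
sequencing_halving_years = implied_doubling_time(YEARS, SEQUENCING_FACTOR)

print(f"Moore's Law over {YEARS} years: about {moore_factor:.0f}-fold")
print(f"Implied sequencing halving time: about {sequencing_halving_years * 12:.0f} months")
```

Under these assumptions, processors improve roughly 32-fold in a decade, while the quoted sequencing improvement corresponds to a halving of cost about every six months.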
Pipelines will be needed to store, transfer, analyze and visualize the increasing amounts of genomics data, and to generate "short" reports for researchers and clinicians (see figure 1). In fact, cloud computing could make possible an entirely new genomics industry, one that transforms the way we approach research and medicine. However, one of the downsides of cloud computing is the difficulty of keeping the data private.
The Coming Age of Data-driven Science and Medicine
Understanding how the underlying systems in living organisms operate will require the integration of the many layers of biological information that high-throughput technologies are generating. The complexity of the data generated in scientific projects will only increase as we continue to isolate and sequence individual cells and organisms while lowering the costs of generating and analyzing this data, such that hundreds of millions of samples can be profiled. Sequencing DNA, RNA, the epigenome and other omics from numerous cells in different individuals will take us to the exabyte (10^18 bytes) data scale in the next 5 years or so. Integrating all this data will demand high-performance computational environments like those at big genome centers. Integrated hardware and software infrastructures tailored to big data in the life sciences will become more common in the years to come.
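A back-of-the-envelope calculation shows how quickly sample counts of this size reach the exabyte (10^18 bytes) scale. The per-sample size below is a hypothetical round number chosen for illustration; real values depend on the assay, coverage and compression, and none of these figures come from the article.

```python
PETABYTE = 10**15
EXABYTE = 10**18

# Assumption for illustration only: 100 GB of data per profiled sample.
BYTES_PER_SAMPLE = 100 * 10**9
# "Hundreds of millions of samples"; 100 million is used as a lower bound.
SAMPLES = 100 * 10**6

def total_bytes(samples: int, bytes_per_sample: int) -> int:
    """Total storage needed for `samples` samples of equal size."""
    return samples * bytes_per_sample

def samples_to_reach(target_bytes: int, bytes_per_sample: int) -> int:
    """Smallest number of samples whose total size reaches `target_bytes`."""
    return -(-target_bytes // bytes_per_sample)  # ceiling division

total = total_bytes(SAMPLES, BYTES_PER_SAMPLE)
samples_needed = samples_to_reach(EXABYTE, BYTES_PER_SAMPLE)

print(f"{SAMPLES:,} samples need {total / EXABYTE:.0f} EB ({total / PETABYTE:,.0f} PB)")
print(f"One exabyte is reached after {samples_needed:,} samples")
```

With these assumed numbers, 10 million samples already fill one exabyte, and 100 million samples need ten.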
Importantly, data-driven medicine will enable the discovery of new treatment options based on multimodal molecular measurements on patients and on learning from trends in differential diagnosis, prognosis and prescription side effects in clinical databases. The combination of omics data with clinical information from patients will yield new scientific knowledge that could be applied in the clinic to help in patient care. In addition, medical informatics, represented by patients' EMRs, together with personalized therapies will enable the application of targeted treatments for specific diseases. Thus, it is tempting to imagine how differently both scientific inquiry and patient care would be performed if large amounts of genomic and clinical data were collected in Big Data repositories and shared by health care professionals (fig. 1).
Challenges and Solutions
These revolutionary changes in Big Data generation and acquisition create profound challenges for storage, transfer and security of information. Indeed, it may now be less expensive to generate the data than it is to store it. One example of this issue is the National Center for Biotechnology Information (NCBI). The NCBI has been leading Big Data efforts in biomedical science since 1988, but neither the NCBI nor anyone in the private sector has a comprehensive, inexpensive, and secure solution to the problem of data storage (even though companies with different solutions are starting to appear as shown in table 1). These capabilities are beyond the reach of small laboratories or institutions, posing several challenges for the future of biomedical research.
Another challenge is transferring data from one location to another; today this is mainly done by shipping external hard disks through the mail. An interesting solution for data transfer is BioTorrents, which uses peer-to-peer file sharing technology to allow open-access sharing of scientific data. Torrents were primarily designed to facilitate the distribution of large amounts of data over the internet, and this solution could be applied to biomedicine.
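One reason torrent-style transfer suits very large files is that the data is split into fixed-size pieces, each with its own hash, so a receiver can verify pieces independently and re-fetch only the damaged ones. The following is a minimal sketch of that idea using SHA-1 piece hashes, as in the BitTorrent protocol; the function names, the piece size and the toy data are hypothetical, and this is not the BioTorrents implementation.

```python
import hashlib

def piece_hashes(data: bytes, piece_size: int) -> list[str]:
    """Split `data` into fixed-size pieces and return one SHA-1 digest per piece."""
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    return [
        hashlib.sha1(data[start:start + piece_size]).hexdigest()
        for start in range(0, len(data), piece_size)
    ]

def corrupted_pieces(received: bytes, expected: list[str], piece_size: int) -> list[int]:
    """Return indices of pieces that are missing or whose hash does not match."""
    actual = piece_hashes(received, piece_size)
    count = max(len(actual), len(expected))
    return [
        i for i in range(count)
        if i >= len(actual) or i >= len(expected) or actual[i] != expected[i]
    ]

# Toy stand-in for a sequence file: 4000 bytes, split into 1024-byte pieces.
original = b"ACGT" * 1000
PIECE_SIZE = 1024
published = piece_hashes(original, PIECE_SIZE)

# Simulate one byte damaged in transit (byte 2500 lies in piece 2).
damaged = bytearray(original)
damaged[2500] = ord("N")

bad = corrupted_pieces(bytes(damaged), published, PIECE_SIZE)
print(f"{len(published)} pieces, corrupted piece indices: {bad}")
```

Only the piece containing the damaged byte fails verification, so only that piece would need to be requested again from a peer rather than the whole file.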
The security and privacy of data from individuals are also a concern. Possible solutions include better security systems with advanced encryption algorithms, like the ones banks use to secure their clients' privacy. In addition, a new generation of consent forms will be needed that specifically allow study participants or patients to openly share the data generated on them with researchers. The use of "in house" hardware solutions instead of cloud computing could also ease the implementation of big data with more information protection. One example is the system that Knome is implementing, named knoSYS100 (table 1). These are just some of the solutions that could be applied to overcome the challenges of big data privacy, and others will emerge in the near future.
Success in biomedical research that combines increasing amounts of omics data with clinical information will depend on our ability to interpret the large-scale data sets generated by emerging technologies. Private companies such as Microsoft, Oracle, Amazon, Google, Facebook and Twitter are masters at dealing with petabyte-scale data sets. Science and medicine will need to implement the same type of scalable structure to deal with the volumes of data generated by omics technologies. The life sciences will need to adapt to advances in informatics to successfully address the Big Data problems they will face in the next decade.
I would like to thank both Kelly Arndt and Steve Iannaccone for their thoughtful input and for critically reading this article.
 Costa F.F.: Drug Discovery Today. In press (2012)
 Rothberg J. M. et al.: Nature 475(7356): 348-352 (2011)
 Clarke J. et al.: Nat Nanotechnol. (4): 265-70 (2009)
 Birney E.: Nature. 489(7414): 49-51 (2012)
 Schadt E.E. et al.: Nature Reviews Genetics. (9): 647-657 (2010)
 Shah N.H. and Tenenbaum J.D.: J Am Med Inform Assoc. 19(e1): e2-e4 (2012)
 Langille M.G. and Eisen J.A.: PLoS One. 5(4): e10071 (2010)
 Schadt E.E.: Mol Syst Biol. 8: 612 (2012)