On June 13, 2014, the journal Science published an article titled "Big Biological Impacts from Big Data," written by Mike May, a science and technology publishing consultant for the American Association for the Advancement of Science (AAAS). Because big data is such a hot concept at the moment, this paper summarizes that article. It first lays out the three levels of meaning of big data and then analyzes and interprets each of them. As genomic data continue to accumulate, many organizations have become aware of the promise of using big data.
The article lists methods and tools that organizations have developed, or are developing, to analyze big data. Biodatomics, for example, has developed BioDT software that analyzes data more than 100 times faster than traditional software; the computing system developed by ACD/Labs in Toronto, Canada, can integrate many data formats when processing big data; IBM's Almaden Research Center in California has developed a text-mining tool; and NuMedii, working with Thomson Reuters, pursues drug repurposing based on big data. Beyond these three meanings, the article also argues that big data should include "complexity," and cites the REFS analysis platform developed by GNS Healthcare of Massachusetts to address the complexity of data. Ultimately, the article argues that all efforts to develop big data should be directed so that big data can contribute to the future development of biology and medicine.
Big data and the life sciences
Big data is one of the hottest concepts at the moment, and also one of the most easily misunderstood. As the name suggests, big data means a lot of data, but that is only the literal meaning. Broadly, big data carries three layers of meaning (the three Vs): the large amount of data (volume), the speed at which data are processed (velocity), and the variability of the sources (variability). These are the defining features of information that must be analyzed with big-data tools.
Keith Crandall, director of the Computational Biology Institute at George Washington University in the United States, says that while biologists are spending enormous energy collecting data, the real bottleneck in biology now lies in analyzing big data. For example, in August 2002, sequencing the first complete human genome occupied experts from about 20 research institutions, required purpose-built infrastructure, took 13 years, and cost 3 billion dollars for 3 billion nucleotides. Today, sequencing one person's genome costs only about 1,000 dollars, and a sequencing facility can produce more than 320 genomes a week. As the volume, velocity, and variability of these data keep growing, researchers have begun to develop new methods of analyzing the information.
In the life sciences, data come from many sources and take many forms: genome sequences, molecular pathways, different populations, and more. The challenge is how to handle such complex information; if researchers can solve that problem, the data become a potential treasure trove. The field is now looking for tools and technologies that can analyze big data and translate it into a better understanding of basic life-science mechanisms, and into applications of those analyses to population health.
(1) "Quantity" continued to increase
Pharmaceutical companies began storing data decades ago. Johnson, an associate director at Merck Research Laboratories in Boston, USA, says Merck has been organizing clinical trials involving tens of thousands of patients for years and can pull the information it needs from millions of patient records. The company now also runs next-generation sequencing, and each sample can produce megabytes of data. Even large pharmaceutical companies need help in the face of data at this scale. Bryn Roberts of Roche in Switzerland, for example, notes that the research data Roche accumulated over a century amount to more than twice the data produced in a single large-scale screen of hundreds of cancer cell lines in 2011-2012. The research team led by Roberts expects to mine more valuable information from these stored data, so it partnered with PointCross, a California company, to build a platform flexible enough to search 25 years of Roche data. These data, covering thousands of compounds, can be mined with today's knowledge to help discover new drugs.
To handle large amounts of data, an individual biological researcher does not need company-scale equipment to process the results. For example, the Ion Personal Genome Machine from Life Technologies (now part of Thermo Fisher Scientific) can sequence up to 2 gigabases in 8 hours and is small enough to operate in a researcher's own laboratory. Life Technologies also offers a larger instrument that can sequence up to 10 gigabases in 4 hours.
However, next-generation sequencing brings both benefits and problems to life-science researchers in academia and industry. As Crandall complains, they cannot study so many genomes effectively unless computer systems are developed that can analyze such large amounts of data. His team therefore collaborated with W. Evan Johnson, an assistant professor at the Boston University School of Medicine, to develop software that analyzes the data produced by next-generation sequencing (NGS) platforms, translating gigabases of DNA into gigabytes of computer data. The software compares DNA samples with reference genomes to identify pathogens. Crandall says each sample takes about 20,000 megabytes of storage, and there are thousands of such samples, so every analysis produces a considerable amount of data.
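To illustrate the kind of comparison such software performs, the sketch below matches short sequencing reads against reference genomes by counting shared k-mers and assigns each read to the best-matching genome. This is only a toy illustration of the general idea, not the software described above; the genome sequences, reads, and k-mer length are invented.

```python
from collections import Counter

def kmers(seq, k=21):
    """Return the set of all k-length substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_reads(reads, references, k=21):
    """Assign each read to the reference genome sharing the most k-mers with it.

    `references` maps an organism name to its (toy) genome sequence.
    Reads with no k-mer hits are counted as 'unclassified'.
    """
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    tally = Counter()
    for read in reads:
        read_k = kmers(read, k)
        best, best_hits = "unclassified", 0
        for name, rk in ref_kmers.items():
            hits = len(read_k & rk)
            if hits > best_hits:
                best, best_hits = name, hits
        tally[best] += 1
    return tally

# Toy example: two short "reference genomes" and a handful of reads.
refs = {
    "pathogen_A": "ATGGCGTACGTTAGCTAGCGATCGATCGTACGATCGTAGCTAGCTAGGATCCGAT",
    "host":       "TTGACCGGTAACGGTTACCGGATCCAGGTTACGGATCCAGGTTAACCGGTTAAGG",
}
reads = ["GCGTACGTTAGCTAGCGATCGATCG", "CCGGTAACGGTTACCGGATCCAGGT"]
print(classify_reads(reads, refs, k=15))
```

Real pipelines work with billions of reads and full reference genomes, which is exactly why the gigabases-to-gigabytes translation and the storage figures quoted above matter.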
In fact, such large quantities of data are genuinely useful for health care, because researchers must design their experiments to account for human diversity. Chas Bountra, a professor of translational medicine at the University of Oxford, says conclusions drawn from 500,000 people are far more convincing than conclusions drawn from 10.
Other researchers expect genomic data to have an ever-larger effect on health care. Genetic information can reveal biomarkers, indicators of certain diseases (some molecules appear only in certain types of cancer). Genomics gives people a powerful basis for understanding disease, says Gil McVean, a professor at the Wellcome Trust Centre for Human Genetics at the University of Oxford, UK. Genomics can identify biomarkers associated with a particular type of disease, and treatment can then be targeted at those markers: if a particular molecule is driving a cancer, for example, that molecule can be targeted to treat the cancer. To put this idea into practice, McVean's team is using 33 million dollars donated by Li Ka-shing to the University of Oxford to create the Li Ka Shing Centre for Health Information and Discovery, which will house a big-data research institute. McVean says the center will combine analytical data processing with genomic research, so it can tackle the challenges of both collecting and analyzing big data.
(2) High-speed analysis
The second V, velocity, refers to processing and analyzing data at high speed. Researchers need to work quickly to analyze the ever-growing volume of data.
In the past, analyzing gene-related data was a bottleneck. Alan Taffel of BioDatomics in Maryland argues that traditional analysis platforms actually constrain researchers' productivity because they are hard to use and depend on bioinformatics specialists, making the work inefficient: analyzing a large DNA dataset often takes days or weeks.
In response, Biodatomics developed its BioDT software, which provides more than 400 tools for analyzing genomic data. Integrating these tools into one package makes them easy for researchers to use; the software runs on any desktop computer and is also available through the cloud. It processes information more than 100 times faster than traditional systems: analyses that used to take a day or a week now take only minutes or hours.
Some experts believe sequencing calls for entirely new computational tools. Jaroslaw Zola, an associate professor of electrical and computer engineering at Rutgers University in New Jersey, says next-generation sequencing requires novel computing strategies for handling data from a variety of sources, covering how the data are stored, how they are converted, and how they are analyzed. This means biological researchers need to learn to use cutting-edge computing technology. At the same time, Zola argues, the pressure should be on information-technology specialists to develop methods that domain experts can master easily, hiding the complexity of the algorithms, software, and hardware architectures while preserving efficiency. Zola's team is currently developing such new algorithms.
(3) Variability
First, biological laboratories typically have many instruments, each producing data in its own file format. The computing system developed by ACD/Labs in Toronto, Canada, therefore integrates a wide range of data formats when handling big data. ACD/Labs' director of global strategy says the system supports more than 150 file formats produced by different instruments, which makes it easier to pool multiple kinds of data in a single environment, such as the company's Spectrus database. That database can be accessed through a client application or a web page.
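As a rough illustration of what pooling many instrument formats into one environment involves, the sketch below registers one parser per file type and normalizes everything into a single record. The `Measurement` record, field names, and file layouts are hypothetical; they are not ACD/Labs' actual formats or schema.

```python
import csv
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List

@dataclass
class Measurement:
    """A common record that every instrument-specific parser normalizes to."""
    sample_id: str
    quantity: str
    value: float
    units: str

def parse_csv(path: Path) -> List[Measurement]:
    # Hypothetical CSV layout: columns sample_id, quantity, value, units.
    with path.open(newline="") as fh:
        return [Measurement(r["sample_id"], r["quantity"], float(r["value"]), r["units"])
                for r in csv.DictReader(fh)]

def parse_json(path: Path) -> List[Measurement]:
    # Hypothetical JSON layout: a list of objects with the same four fields.
    records = json.loads(path.read_text())
    return [Measurement(r["sample_id"], r["quantity"], float(r["value"]), r["units"])
            for r in records]

# Registry mapping file extensions to parsers; a real system would cover
# many more vendor-specific formats.
PARSERS: Dict[str, Callable[[Path], List[Measurement]]] = {
    ".csv": parse_csv,
    ".json": parse_json,
}

def load_any(path: Path) -> List[Measurement]:
    """Dispatch to the right parser so all data land in one common form."""
    try:
        parser = PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"No parser registered for {path.suffix}")
    return parser(path)
```

The design point is simply that every new instrument format only needs one more parser entry; downstream analysis sees a single, uniform record type.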
Big biological data also exhibit new kinds of variability. Definiens, a company based in Germany, for example, analyzes what it calls tissue phenomics: information about the structure of a tissue or organ sample, including cell size, shape, absorbed stain, and substances associated with the cells. These data can serve many kinds of studies, such as tracking how the characteristics of cells change during development, determining the effects of environmental factors on an organism, or measuring the effects of drugs on the cells of particular organs or tissues.
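A minimal sketch of per-cell feature extraction of this kind is shown below, computing area, centroid, and mean stain intensity for each labeled cell in a segmentation mask. The toy mask, stain image, and chosen features are illustrative only and do not represent Definiens' pipeline.

```python
import numpy as np

def cell_features(label_mask, stain_image):
    """Compute simple per-cell features from a segmentation.

    label_mask:  2-D integer array; 0 is background, each cell has a unique
                 positive label.
    stain_image: 2-D float array of absorbed-dye intensity, same shape.
    Returns a dict: label -> (area_in_pixels, centroid_row, centroid_col,
                              mean_stain_intensity).
    """
    features = {}
    for label in np.unique(label_mask):
        if label == 0:
            continue
        rows, cols = np.nonzero(label_mask == label)
        features[int(label)] = (
            rows.size,                              # area in pixels
            float(rows.mean()),                     # centroid row
            float(cols.mean()),                     # centroid column
            float(stain_image[rows, cols].mean()),  # mean stain intensity
        )
    return features

# Toy 6x6 image with two "cells" and a synthetic stain channel.
mask = np.zeros((6, 6), dtype=int)
mask[1:3, 1:3] = 1          # cell 1: 2x2 pixels
mask[3:6, 3:6] = 2          # cell 2: 3x3 pixels
stain = np.linspace(0.0, 1.0, 36).reshape(6, 6)
print(cell_features(mask, stain))
```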
Structured data, such as a data table, cannot capture everything, such as the course of a drug treatment or a biological process. In fact, much information about living organisms exists in unstructured form, and there are thousands of ways to describe a biological process. Merck's Johnson likens it to a journal article: it is hard to mine data out of free text.
Ying Chen, a researcher at IBM's Almaden Research Center in California, has been working on text-mining tools for several years. Her team's platform, now offered as the Accelerated Drug Discovery Solution, brings together patents, scientific literature, and basic chemical and biological knowledge (such as the mechanisms by which chemicals interact with molecules); it contains the structures of more than 16 million compounds and covers nearly 7,000 diseases. Using the system, researchers can search for compounds that might be useful for treating a given disease.
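The simplest building block of such text mining is counting how often a compound and a disease are mentioned in the same sentence across a document collection. The sketch below shows that idea with invented snippets and naive string matching; a production platform relies on far more sophisticated entity recognition and relation extraction.

```python
import re
from collections import Counter
from itertools import product

def cooccurrence_counts(documents, compounds, diseases):
    """Count sentences in which a compound name and a disease name co-occur.

    A real platform works over millions of patents and papers with proper
    entity recognition; here we simply lower-case the text and split on
    sentence-ending punctuation.
    """
    counts = Counter()
    for doc in documents:
        for sentence in re.split(r"[.!?]", doc.lower()):
            for compound, disease in product(compounds, diseases):
                if compound in sentence and disease in sentence:
                    counts[(compound, disease)] += 1
    return counts

# Invented snippets standing in for abstracts or patent text.
docs = [
    "Compound X reduced tumor growth in a melanoma model. Compound Y was inactive.",
    "We report that compound x inhibits a kinase implicated in melanoma and psoriasis.",
]
print(cooccurrence_counts(docs, compounds=["compound x"],
                          diseases=["melanoma", "psoriasis"]))
```

Pairs with unusually high co-occurrence counts become candidate compound-disease links for a human expert to review.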
Other companies are working to mine existing resources to uncover the biological mechanisms of disease and, on that basis, to find ways to treat it. NuMedii, a Silicon Valley company working with Thomson Reuters, is dedicated to finding new uses for existing drugs, also known as drug repurposing. NuMedii's chief scientist, Craig Webb, says the company uses genomic databases, integrated knowledge sources, and bioinformatics methods to discover new uses for drugs quickly. It then designs clinical trials that build on the drugs' established safety record in their original use, which makes development fast and relatively inexpensive. Webb describes one company project in which researchers collected gene expression data from more than 2,500 ovarian cancer samples and combined several computer algorithms to predict whether existing drugs could treat ovarian cancer, or particular molecular subtypes of it.
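One widely used computational idea behind drug repurposing is to rank drugs whose gene expression signatures most strongly reverse (anti-correlate with) a disease signature. The sketch below illustrates that general idea with made-up signatures and plain Pearson correlation; it is not NuMedii's actual method or data.

```python
import numpy as np

def rank_drugs_by_reversal(disease_signature, drug_signatures):
    """Rank drugs by how strongly their signature reverses the disease's.

    disease_signature: 1-D array of log fold-changes (disease vs. normal) per gene.
    drug_signatures:   dict of drug name -> 1-D array over the same genes.
    A more negative correlation means a stronger predicted reversal.
    """
    scores = {}
    for drug, sig in drug_signatures.items():
        scores[drug] = np.corrcoef(disease_signature, sig)[0, 1]
    return sorted(scores.items(), key=lambda kv: kv[1])  # most negative first

# Made-up signatures over five genes.
disease = np.array([2.1, -1.5, 0.8, 1.9, -0.7])
drugs = {
    "drug_A": np.array([-1.8, 1.2, -0.5, -2.0, 0.9]),   # roughly reverses the disease
    "drug_B": np.array([1.5, -1.0, 0.7, 1.4, -0.3]),    # mimics the disease
    "drug_C": np.array([0.1, 0.2, -0.1, 0.0, 0.05]),    # little relation
}
print(rank_drugs_by_reversal(disease, drugs))
```

Top-ranked candidates would then move into trials designed around the drugs' existing safety profiles, which is what makes repurposing comparatively fast and cheap.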
(4) Complexity
Stephen Cleaver, an executive director at the Novartis Institutes for BioMedical Research (NIBR), adds complexity to the three Vs. He argues that the analysis becomes complicated as pharmaceutical researchers move from individual patients to particular groups of patients and then integrate the data. In health care, the complexity of big-data analysis increases further because so many kinds of information are combined: genomic data, proteomic data, cell signaling, clinical research, and even data from environmental studies.