In February 1977, Frederick Sanger and his colleagues published the first complete genome sequence of an organism: the 5,375 nucleotides of the phage phiX174. It soon became clear that genome-scale research would grow ever more tedious as scientists tackled more complex species. Fortunately, a solution was already taking shape. Just four months later, a small new company in Cupertino, California, began selling the Apple II to electronics enthusiasts, and scientists quickly discovered that this relatively affordable new computing system was ideal for storing and analyzing genetic data.
Today, molecular biology is inseparable from computers. Highly automated sequencing instruments generate terabytes of new data every day, yet researchers can still routinely search huge online databases for new links between genes. Indeed, a whole new scientific discipline, "bioinformatics," has sprung up to organize and study the growing flood of biological information.
Many research institutes have established dedicated computing centers to cope with the data deluge. Recently, however, bioinformatics specialists have begun to borrow another strategy from the computer industry to avoid ever-greater spending: cloud computing, also called distributed computing. Instead of storing and analyzing data locally, cloud-based systems farm out the most intensive work to hundreds of remote servers. Early adopters of cloud computing in genomics had to write their own software, but computer scientists and server companies are now designing more user-friendly interfaces to bring the technology to a wider audience.
Computing without limits
The most obvious argument for cloud computing is the sheer volume of new sequencing data. "Our institution isn't even that large, and we produce about a terabyte of data a day," said Michael Schatz, assistant professor of quantitative biology at Cold Spring Harbor Laboratory in New York. That is enough to fill a desktop computer's entire hard drive in just a few days.
Globally, DNA sequencing instruments produce roughly 15 petabytes (PB) of data a year, and the total is still growing rapidly; 1 PB is 1,000 terabytes, Schatz explains. Burned onto DVDs, those 15 PB would make a stack of discs about 2.5 miles high, and that is just the raw data. Microscope images and other phenotypic information from the same experiments can multiply the storage problem several times over.
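As a rough check on that figure (a back-of-envelope sketch assuming standard 4.7-gigabyte single-layer DVDs about 1.2 millimetres thick): 15 PB divided by 4.7 GB comes to roughly 3.2 million discs, and 3.2 million discs times 1.2 mm is about 3.8 kilometres, or roughly 2.4 miles of stacked plastic.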
Fortunately, a few companies with deep pockets and deep computational experience have already solved data problems of this scale. Google, for example, collects and processes tens of petabytes of user information. "They deal with more data in a single day than the entire world's sequencing output over the past year," Schatz said.
To meet that demand, Google uses cloud computing techniques to assign jobs to hundreds of servers in a worldwide "cloud." Through distributed computing services such as Amazon's EC2, researchers can get similar capacity cheaply and conveniently: anyone can rent a comparably large server "cloud."
Before rushing into cloud computing, however, researchers should assess their needs and their local resources. Scientists who do not need to share data with distant collaborators may find their own institution's computing services faster and cheaper than a remote cloud system. Schatz suggests a rule of thumb: "If you have more than a few hundred terabytes of data, and partners to share it with, then a cloud computing platform is the most appropriate."
Research institutions that lack a dedicated computing center are also drawn to the cloud. "Traditionally, you would build a big data center and buy a lot of equipment, but that's not only expensive, the machines also sit idle most of the time. The nice thing about cloud computing is that you pay the service charge only when you use it, and spend nothing the rest of the time," said Richard Holland, chief business officer of Eagle Genomics in the United Kingdom.
Another "cloud"
Beyond access to large numbers of remote servers, a typical cloud computing service also provides basic software. Much of the cloud computing industry now relies on free, open-source tools such as the widely used Apache server software and the Apache Hadoop framework. The former handles the low-level communication between servers and the network, while the latter takes complex computational tasks and distributes them efficiently across thousands of servers.
Web companies originally developed this architecture to meet their own needs; Hadoop helps handle everything from the world's Facebook photos to Yahoo! searches. In 2009, however, Schatz and his colleagues began applying it to genomic data, and Hadoop has since become the first choice for cloud-based bioinformatics. "In the life sciences, it is the de facto standard for analyzing hundreds of terabytes or a petabyte of data at a time," Schatz said.
One of Hadoop's great advantages is its simplicity, at least for scientists familiar with computer programming. "Knowing Java programming is enough to run large-scale analysis tasks on very large clusters; that is a big advantage of Hadoop," said Jens Dittrich, a professor of information systems at Saarland University in Saarbrücken, Germany. Programmers do not have to keep track of which processor is doing what; they can write algorithms as if for a single machine, and Hadoop handles the complicated low-level work of spreading the program across thousands of servers.
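What does that programming model look like in practice? The following is a minimal sketch, not code from any of the groups mentioned here, of a Hadoop Streaming job (a standard Hadoop feature that lets the mapper and reducer be written in any language, not just Java). It counts how often every k-mer appears in a collection of sequencing reads; the choice of k = 21 and the assumption of one DNA read per input line are purely illustrative.

#!/usr/bin/env python3
# mapper.py -- emit "<k-mer><TAB>1" for every k-mer in each input read.
# Assumes one read per input line (an illustrative simplification).
import sys

K = 21  # illustrative k-mer length
for line in sys.stdin:
    read = line.strip().upper()
    for i in range(len(read) - K + 1):
        print(read[i:i + K] + "\t1")

#!/usr/bin/env python3
# reducer.py -- sum the counts for each k-mer; Hadoop delivers the mapper
# output to the reducer already sorted by key.
import sys

current, count = None, 0
for line in sys.stdin:
    kmer, n = line.rstrip("\n").split("\t")
    if kmer != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = kmer, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))

A typical submission (the jar's name and location vary by installation) looks something like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input reads -output kmer_counts. Hadoop then splits the input across the cluster, runs the mapper on every chunk, shuffles and sorts the intermediate pairs, and feeds them to the reducers; the two small scripts never need to know which machines did the work.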
Still, cloud computing in general, and Hadoop in particular, has drawbacks. To analyze data in the cloud, researchers must first get the data there, and even over a fast connection, uploading terabytes can take many hours. And because Hadoop lacks the sophisticated indexing systems used in many databases, it is inefficient for some types of analysis. In a well-indexed architecture, a program can jump straight to a specific piece of data, which is essential for certain queries; a system with no index has to scan the entire dataset, which often takes far longer.
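As a rough illustration of the upload problem (assuming a dedicated 100-megabit-per-second connection), moving a single terabyte means pushing 8 x 10^12 bits at 10^8 bits per second, about 80,000 seconds, or most of a day; even at a full gigabit per second the transfer still takes more than two hours.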
Dittrich and his colleagues have recently begun tackling both problems. Their newly developed aggressive indexing system for Hadoop builds several indexes over a dataset while it is being uploaded to the cloud, turning otherwise wasted computing time into a tool for optimizing later analyses. The indexes can speed up processing considerably; for some research problems the gain is as much as a hundredfold. "Frankly, this is not the final answer, it depends on the analytical task... but for most tasks we've done very well," Dittrich said.
Even as new techniques make Hadoop more powerful, experts in the field stress that it will never be a universal solution. Dittrich and Schatz both point out that cloud-based systems are well suited to some biological questions and poorly suited to others. Aligning sequencing reads, identifying genetic variants, and classifying samples by their RNA expression patterns are good targets for cloud computing, because they all involve searching large datasets for information about individual fragments. Modeling a metabolic pathway, by contrast, means complex calculations on a small dataset, so a local computing system is more appropriate.
Big data for other people
Hadoop is of little use to biologists who are not accustomed to writing their own computer programs. Several companies are now targeting those scientists with user-friendly interfaces for analyzing data in the cloud.
"There are various types of clouds. Eagle's Holland said. From the most basic server leasing protocol (also known as "infrastructure as a service") to a comprehensive architecture of application services or "Software as a service" (software as services, SaaS), readily available. SaaS, service companies provide cloud infrastructure, data storage, and bio-information software. In many cases, researchers can send their sequencing results directly to the company and then perform common types of analysis in a point-and-click Network environment. Now, in San Diego, California, Illumina and other sequencing companies are offering their own SaaS systems, and a number of start-ups are starting to explore the new market.
Each service company has its own approach. Eagle Genomics, for example, chains pre-built programs together to tailor software to each user. "People usually come to us and say, 'We need to build an analysis pipeline for SNP prediction or mutation localization,'" Holland says; the company then takes published algorithms and "integrates them together to form a... workflow that can answer those questions." Researchers can then use the custom pipeline to analyze their data on cloud servers, and more experienced users can inspect or modify the underlying computer code themselves.
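To illustrate what such a stitched-together workflow can look like, here is a minimal sketch of a generic SNP-calling pipeline assembled from widely used, published open-source tools (bwa, samtools, and bcftools). It is emphatically not Eagle Genomics' actual software, and the file names ref.fa, reads.fq, and so on are placeholders.

#!/usr/bin/env python3
# Illustrative SNP-calling workflow chained together from published tools.
# Not any vendor's real pipeline; file names are placeholders.
import subprocess

def run(cmd):
    print("+ " + cmd)                            # log each step as it runs
    subprocess.run(cmd, shell=True, check=True)  # stop if any step fails

run("bwa index ref.fa")                                      # index the reference genome
run("bwa mem ref.fa reads.fq | samtools sort -o aln.bam -")  # align the reads and sort them
run("samtools index aln.bam")                                # index the sorted alignments
run("bcftools mpileup -f ref.fa aln.bam | bcftools call -mv -o variants.vcf")  # call variants

In a SaaS setting, steps like these run on the provider's cloud servers behind a point-and-click interface rather than as a script on the researcher's own machine.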
Researchers who want an even more convenient portal to the cloud can turn to companies offering general-purpose software for routine problems. "Biologists can use a lot of the functionality on our servers simply by logging in from a web browser and clicking a button," said Andreas Sundquist, CEO and co-founder of DNAnexus, a SaaS provider in Mountain View, California.
Although SaaS companies often develop their own proprietary code and user interfaces, scientists should still ask about the underlying algorithms when buying cloud services. "Researchers are actually a conservative bunch; they prefer algorithms that have been published, tested by peer review, and widely understood, and they don't tend to experiment with new techniques on important data," Holland said.
Fortunately, most new bioinformatics companies are willing to discuss their systems. "All of the algorithms currently integrated into Spiral are peer reviewed; we understand very well that people want to use open source," said Adina Mangubat, CEO of Spiral Genetics in Seattle, Washington. For ease of use, Spiral wraps its own user interface and data-handling layer around the published algorithms. Other companies in the field have followed suit, and most SaaS leases allow researchers direct access to the underlying software code.
Cloud cover
Cloud computing is still relatively new, and researchers in some fields remain skeptical of it, especially pharmaceutical and biomedical scientists, who handle sensitive proprietary data and patient information. "People certainly feel that a local cluster is easier to control than a cloud environment," Mangubat said.
There may be little reason for that concern. Studies have shown that three quarters of recent medical data-security incidents in the United States were caused by clinicians losing laptops or portable storage devices. "If they were using a cloud... stealing a laptop wouldn't be a big problem, because you wouldn't have put the patient data on a notebook in the first place," Sundquist said.
Indeed, because banks, governments, and e-commerce companies have already moved their data into cloud storage, the security systems around server facilities have become quite mature. Companies targeting the medical research market also pay close attention to data-security laws. "One of our fundamental principles is to ensure that we have the enterprise-level security controls and features necessary for clinical and diagnostic operations," Sundquist said.
Even scientists who rent bare cloud infrastructure and write their own algorithms can expect security. Mangubat points out that the popular Amazon EC2 rental cloud already complies with the physical-security requirements for medical data, so the researcher's own software is the only potential weak point.
Fuzzy storage
Another common concern about cloud computing is data archiving, something researchers should ask about before signing a server lease. If a SaaS company folds, or researchers decide to move to a different system, the lease should spell out a clear path for getting the data back out. "We offer services where you can burn everything to disc or have a big pile of hard drives shipped to you; you're not 'married' to one cloud for life," Mangubat said.
For day-to-day storage, though, the cloud can protect against accidents and local disasters, because cloud services typically replicate data in multiple locations. "Maybe one data center gets hit by a meteor and another is buried by a volcano, but you can still get another copy of your data," Sundquist explained.
Cloud storage can also help with a chronic problem of digital archiving. Data stored on standard floppy disks a few decades ago, for example, is often unreadable today because the drives and operating systems are obsolete. In cloud storage, staff continually migrate data onto new media, and version-control systems retain older releases of the software, so years later researchers should still be able to recover both the data and the tools to analyze them.
Not everyone is satisfied with that solution, however. "Anything you can overwrite is not an archive," Dittrich said. To keep valuable sequence data from being destroyed by program or human error, he recommends keeping an extra backup on a different medium. "A good way to make a backup is to use a medium that can only be written once; a non-rewritable DVD is a good idea. You can only burn it once, and you can never overwrite it," he said.
But as the petabytes continue to pile up, some experts say the ultimate storage system for genomic data may be DNA itself, closing the loop between computers and living things. The idea is that it may eventually be cheaper and faster to re-sequence a stored biological sample than to pull the original sequence data out of an archive. "Right now, DNA sequencing takes days and costs too much, but looking ahead... if sequencing becomes more or less instantaneous, it could become a data storage medium," Schatz said.