In this applied-science case study, we describe our experience porting a commercial, HPC-based genome analysis application to a cloud environment, outlining the key infrastructure decisions we made and the process of turning a pure HPC-style design into a cloud-oriented big-data design.
The goal of the project was to deliver a commercial genome analysis application that scales well while keeping costs under control. The application was originally designed to run in-house on High-Performance Computing (HPC)-class infrastructure, and that infrastructure is approaching its capacity limits just as the volume of analysis is expected to increase rapidly. As a result, we decided to try porting the application to a cloud environment.
In addition, time constraints prevented us from redesigning the original application, allowing only minor changes to how it is organized. We begin by briefly outlining the computational genomics problem from an IT perspective.
The general approach to determining the genome sequence of an organism involves the following steps:
1. Shear multiple copies of the original genome (usually 30 to 60) into a large number of randomly overlapping fragments of fixed length (for example, 30 to 200 base pairs).
2. Read the sequence of each short fragment, which produces a large number of small files.
3. Use the previously known genome of the organism (the "reference genome") to make a best guess about the location of each sequenced fragment on the reference genome. This is a reasonable approach because the genomes of individuals within a species usually differ very little.
4. Use statistical methods to determine the most likely base pair at each position of the reassembled genome. For data-compression purposes, the result can be expressed as deltas against the reference: single nucleotide polymorphisms (mutations at a given position), or insertions and deletions (indels), which change the overall length of the genome.
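To make the delta representation in the last step concrete, the following minimal Python sketch reconstructs a sample sequence from a reference plus a list of SNP and indel records. It is purely illustrative and not part of the application; the variant tuple format and names are hypothetical.

```python
# Minimal sketch: instead of storing the full reassembled genome, store only
# its differences from the reference (SNPs and indels) and rebuild on demand.
# The variant format below is hypothetical and purely illustrative.

REFERENCE = "ACGTACGTAC"

# Each variant: (position in reference, type, payload)
VARIANTS = [
    (2, "snp", "T"),   # reference base at position 2 replaced by T
    (5, "ins", "GG"),  # GG inserted after position 5 (lengthens the genome)
    (7, "del", 1),     # 1 base deleted at position 7 (shortens the genome)
]

def apply_variants(reference: str, variants) -> str:
    """Rebuild the sample sequence from the reference plus its deltas."""
    out = []
    i = 0
    for pos, kind, payload in sorted(variants):  # process in reference order
        out.append(reference[i:pos])             # unchanged stretch
        if kind == "snp":
            out.append(payload)                  # substitute one base
            i = pos + 1
        elif kind == "ins":
            out.append(reference[pos] + payload) # keep base, then insert
            i = pos + 1
        elif kind == "del":
            i = pos + payload                    # skip deleted bases
    out.append(reference[i:])                    # trailing unchanged stretch
    return "".join(out)

if __name__ == "__main__":
    print("reference:", REFERENCE)
    print("sample:   ", apply_variants(REFERENCE, VARIANTS))
```

Storing only the deltas is what keeps the final result small relative to the raw sequencing input discussed next.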
Because genomes are very large (for example, the human genome is 3.3 billion base pairs long), this secondary analysis poses significant computational and data challenges. When read-quality data is included, each incoming base pair is encoded in roughly one byte of information. An incoming dataset covering the genome 60 times (60 complete copies of the DNA) therefore contains approximately 3.3 × 10^9 × 60 bytes, or about 200 GB of data; processing it requires substantial aggregate I/O bandwidth between the CPUs and the storage medium, along with roughly 500 to 2,500 core-hours of computation.
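These figures can be restated as a quick back-of-the-envelope calculation. The script below simply repeats the arithmetic from the paragraph above; the core count used to convert core-hours into wall-clock time is an arbitrary assumption for illustration.

```python
# Back-of-the-envelope sizing from the figures above (assumptions: ~1 byte
# per base pair including read-quality data, 60x coverage of a 3.3-gigabase
# genome, and 500-2,500 core-hours of compute per dataset).

GENOME_LENGTH_BP = 3.3e9   # human genome, base pairs
COVERAGE = 60              # copies of the genome sequenced
BYTES_PER_BP = 1           # sequence plus quality, roughly one byte per base

dataset_bytes = GENOME_LENGTH_BP * COVERAGE * BYTES_PER_BP
print(f"input dataset: ~{dataset_bytes / 1e9:.0f} GB")  # ~198 GB, i.e. ~200 GB

CORES = 64                 # hypothetical total core count, for illustration only
for core_hours in (500, 2500):
    print(f"{core_hours} core-hours on {CORES} cores: "
          f"~{core_hours / CORES:.1f} wall-clock hours")
```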
In the past, problems of this kind fell within the domain of supercomputers or HPC: a large, fast central file system holds the input dataset, and a large farm of stateless servers performs the computation against it.
While processing a single such dataset may seem manageable, handling tens of thousands of them within a reasonable time span is a challenge. One limiting factor is the capital investment needed to build and operate a sufficiently large number of processing systems, which keeps growing as the volume of genomic sequencing grows.
For this reason, cloud computing has become an attractive model. It provides large amounts of computing power at variable pricing: servers can be rented as needed and returned when they are no longer required. However, to take full advantage of the cloud, the following challenges must be overcome:
- Data needs to be delivered efficiently across the WAN into the cloud, using an appropriate toolset.
- An appropriate combination of cloud storage product types must be selected, because the cloud does not offer fast, expensive HPC-style storage.
- Job orchestration must take the storage structure and its scaling behavior into account.
- The basic horizontal cloud scale-out pattern must be reflected in the infrastructure (see the sketch after this list).
- Where possible, the best combination of cloud hardware, software, and virtualization must be chosen.
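As a rough illustration of the horizontal scale-out pattern mentioned in the list, the following Python sketch partitions the many small read files into chunks and fans them out to parallel workers. A local process pool stands in for a farm of cloud VMs, and chunk_reads and align_chunk are hypothetical placeholders rather than the application's real interfaces.

```python
# Toy illustration of horizontal scale-out: split the input read files into
# independent chunks, fan them out to workers, and merge the results. A local
# process pool stands in for a farm of cloud VMs.

from concurrent.futures import ProcessPoolExecutor

def chunk_reads(read_files, chunk_size):
    """Group the many small read files into fixed-size work items."""
    for i in range(0, len(read_files), chunk_size):
        yield read_files[i:i + chunk_size]

def align_chunk(chunk):
    """Placeholder for aligning one chunk of reads against the reference."""
    return {"files": len(chunk), "aligned": True}

def run(read_files, workers=8, chunk_size=100):
    chunks = list(chunk_reads(read_files, chunk_size))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(align_chunk, chunks))
    # In the cloud version, each chunk would be shipped to a worker VM and the
    # results written to shared or object storage before the merge step.
    return results

if __name__ == "__main__":
    fake_files = [f"reads_{i:05d}.fastq" for i in range(1000)]
    print(len(run(fake_files)), "chunks processed")
```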
The remainder of this article is structured as follows:
- Introduction: Provides a general overview of related work and other background.
- Part 2: Introduces the basics of IBM® SmartCloud™ Enterprise.
- Part 3: Introduces the system infrastructure we chose for the ported system.
- Part 4: Presents the results and compares the current infrastructure with alternatives.
- Part 5: Discusses the results and lessons learned.
- Conclusion: Describes potential directions for future work and summarizes the main points of this article.
Several cloud vendors now provide large amounts of computing capacity on a pay-as-you-go basis. In some cases, customers can even choose the underlying hardware to match their workload.
Accordingly, recent years have seen examples of genomic workflows successfully designed to run in the cloud. Researchers have either run their processing on a cloud system (such as Amazon Elastic Compute Cloud, Amazon EC2) "as is" or have gone further and reworked their applications to take advantage of the power of cloud computing.
Our work differs in the following fundamental ways:
- Our ability to change the original application within the project timeline was limited. While this constraint certainly introduces flaws into the design, we think it is a very common one.
- Because total cost is the primary focus, we tracked every byte of data, tried to ensure it travels the shortest possible distance, and concentrated on performance by identifying and eliminating bottlenecks, which allowed us to carry out a nearly end-to-end design.
- All of the work presented in this article was done on IBM SmartCloud Enterprise. Because we maintain close contact with the teams that support and develop IBM SmartCloud Enterprise, we had the unusual opportunity to view the cloud infrastructure as a white box and to use that information as an operational guide.
- We were able to experiment with tuning cloud configurations to better support such data-intensive workloads in the future.
In short, many data-intensive applications are written with custom, expensive supercomputers or HPC clusters in mind. We hope that our experience and decision-making process will help those who are trying to improve the scalability, and reduce the processing costs, of the tasks their applications support.
Introducing the Cloud environment
IBM SmartCloud Enterprise is a virtualized infrastructure as a service (IaaS) offering that lets users rent resources on a pay-as-you-go basis, with most resources priced by the hour. Its compute nodes use Kernel-based Virtual Machine (KVM) as the hypervisor and direct-attached hard disks to give its virtual machines (VMs) cost-effective ephemeral storage. In addition, IBM SmartCloud Enterprise provides network-attached block storage that can be attached to one VM at a time. A VM connects to block storage and to other virtual machines at 1 Gbps.
IBM SmartCloud Enterprise is made up of a number of data centers around the world, called pods. In many cases, a deployable topology needs to be installed on only a single pod (namely, the pod closest to the data source). However, multi-pod topologies are not uncommon, because data may come from all over the world and the effective capacity of each pod varies.
IBM SmartCloud Enterprise offers several VM types, from 32-bit Copper to 64-bit Platinum, each with a different ephemeral-storage allocation and hourly rate. Persistent (block) cloud storage is also available in several size increments and is billed both by capacity and by the number of I/O operations. Data transfer into and out of IBM SmartCloud Enterprise also incurs traffic-based charges.
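Because total cost is a primary focus of this project, it helps to keep a simple cost model in mind that covers exactly these billing dimensions: VM hours, block-storage capacity and I/O operations, and data transfer. The sketch below expresses such a model; every rate in it is a hypothetical placeholder, not an actual IBM SmartCloud Enterprise price.

```python
# Simplified per-dataset cost model over the billing dimensions described
# above. All rates are hypothetical placeholders, not real prices.

RATE_VM_HOUR = 0.50            # $/VM-hour (hypothetical)
RATE_STORAGE_GB_MONTH = 0.10   # $/GB-month of block storage (hypothetical)
RATE_IO_MILLION_OPS = 0.10     # $/million I/O operations (hypothetical)
RATE_TRANSFER_GB = 0.12        # $/GB transferred in or out (hypothetical)

def dataset_cost(vm_hours, storage_gb, months, io_ops, transfer_gb):
    """Sum the four billing dimensions for one dataset."""
    return (vm_hours * RATE_VM_HOUR
            + storage_gb * months * RATE_STORAGE_GB_MONTH
            + io_ops / 1e6 * RATE_IO_MILLION_OPS
            + transfer_gb * RATE_TRANSFER_GB)

# Example: 40 VMs for 24 hours, 200 GB of block storage held for half a month,
# 500 million I/O operations, and 200 GB transferred into the cloud.
print(f"${dataset_cost(40 * 24, 200, 0.5, 5e8, 200):.2f}")
```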