How NoSQL deals with biomedical big data
Release time: 2012.05.31 12:24 Source: Zhongguancun Author: Shelanjing
One important trait that distinguishes big data from merely massive data is the need to process large amounts of mixed-structure data. The biomedical field has a great deal of such data. At the fourth China Cloud Computing Conference, Wang Yu, associate researcher at the Institute of Health Service and Medical Intelligence at the Military Academy of Medicine, described how he uses NoSQL to process biomedical big data. Wang Yu said that biomedical big data applications span health management data and massive sequencing data, and that managing, integrating, and analyzing this data are the main challenges.
Wang Yu said that biomedicine combines medicine, biology, engineering, and information technology disciplines. Built on information technology, it aims to link up the research process so that the results of basic medical research, such as genetic engineering, feed more effectively into drug research and development. A chart of five years of literature on cancer research and drug target gene research shows that the field has become an important area of information technology research, and that it is now feeling the shock of big data. The first source of big data is high-throughput sequencing. Personalized diagnosis and treatment relies on genetic differences between individuals to guide personalized medication and make diagnosis and treatment more targeted. The process is complex, and a cost of 3 billion dollars was cited.
Four sources of big data
Sequencing technology has developed rapidly since 2005, with sequencing capacity doubling every five months. In the chart, the blue line shows the storage trend and the red line shows the growth in sequencing capacity. On this trend, it is predicted that by 2015 one million people around the world will have had their own genomes sequenced. If this data can better guide individualized treatment and medication with the help of biotechnology, the effect on human health and medicine will be immeasurable.
We can see a gradual increase in computational power and sequencing capabilities.
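To get a feel for what a five-month doubling period implies, the short sketch below compares it with a slower doubling period. The five-month figure is the one quoted in the talk; the 18-month period for compute is only an assumed, Moore's-law-style value chosen for comparison, and the function and constant names are illustrative.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after elapsed_months, given a fixed doubling period."""
    return 2.0 ** (elapsed_months / doubling_months)


# Doubling period quoted in the talk for sequencing capacity.
SEQUENCING_DOUBLING_MONTHS = 5
# Assumption for comparison only (not from the talk): a Moore's-law-style period.
COMPUTE_DOUBLING_MONTHS = 18

for years in (1, 3, 5):
    months = 12 * years
    seq = growth_factor(months, SEQUENCING_DOUBLING_MONTHS)
    cpu = growth_factor(months, COMPUTE_DOUBLING_MONTHS)
    print(f"{years} yr: sequencing x{seq:,.1f}, compute x{cpu:,.1f}")
```

Under these assumptions, five years of five-month doublings is 2^12, a 4096-fold increase, while the assumed 18-month period gives only about a 10-fold increase over the same time. That widening gap is one way to read the claim that sequencing data is outgrowing the capacity to process it.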
The second source of big data is drug development, which has advanced alongside biology. The drug research and development model involves studying a disease such as cancer, finding drug targets, and screening compounds, running from basic research at the front end through to the later stages. It is a rather data-intensive process, and even for small and medium-sized enterprises the data volume reaches the terabyte range.
The third source is clinical medicine. Clinical records, laboratory data, and similar data combine to make the data held by medical institutions grow very fast; the University of Pittsburgh Medical Center (UPMC) in the United States was cited as reaching two TB.
The fourth source of big data is health management. Mobile medicine has been a hot area over the past two years; a business survey said the market will reach 1.4 billion U.S. dollars, 10 times its 2010 level. Portable physiological monitoring devices have spread widely with the growth of the mobile Internet, as have Web 2.0 health services and health networks that hold users' personal health information. If all of these are connected to the Internet, the volume of data will be immeasurable. With the mobile Internet figure put at 800 million, this is clearly an important future source of big data.
These are the four major data sources in the biological field. They are not isolated from one another: biologists want the data to be integrated, mined, and analyzed to support clinical decisions. Achieving that goal raises many challenges in big data management and analysis, and these challenges are very difficult to solve. Some pioneers, innovators, and companies are now trying to address them with cloud computing, and have achieved preliminary results. By releasing basic solutions as services on cloud platforms, they allow ordinary small and medium-sized scientific research institutions and institutes to use these open services, stand on the shoulders of others, and move forward.
Four aspects of biological big data applications
The following focuses on four areas in which cloud-based big data applications in biology have had an impact: gene sequencing, clinical medicine, drug research and development management, and health management. The first case is Crossbow, pipeline software for whole-genome analysis. Previously, the analysis of one person's genome was completed on a single server; Crossbow shortens the time by running on Hadoop on Amazon's cloud platform. As a result, the job is now compressed to less than 3 hours on 32 CPU cores, at a total cost of less than 100 dollars. It is one part of a larger body of work.
Since the Crossbow project began, companies have also become involved in using cloud computing to speed up DNA data analysis. One of the more important is DNAnexus. Laboratories produce human genome data on sequencing instruments, with raw data of between 100 GB and 600 GB, and import it into the company's cloud service platform. The platform provides a flexible and diverse range of sequencing analysis and alignment workflows, manages the data efficiently, presents sequencing results in the form that best suits the user, and supports secure and reliable data sharing with third parties.
The diagram shows its basic business. The company is better known because, last year, Google invested 15 million dollars in it and worked with it on the CPI database. Its sequencing analysis services have so far run on the Amazon platform, reportedly using 10 Amazon CPUs, and will migrate to Google's cloud platform in the future. Besides DNAnexus, research and development investment in the United States is growing very quickly, and other companies are doing similar work, because analysis based on gene sequencing is highly significant both for guiding diagnosis and treatment and for data mining.
The third case is clinical medical data management. Explorys, a U.S. company, provides services to third-party institutions on a private cloud model. These institutions can host their clinical, operational, and financial data on the platform, whose greatest benefit is real-time data analysis. The platform hosts data on 13 million people, about 440 billion pieces of content, at a data scale of around 60 TB, expected to reach 70 TB in 2013. Its technology stack is built on Hadoop.
The fourth application is electronic medical records. Practice Fusion, also a U.S. company, targets small and medium-sized practices in the United States, which can reduce their costs by using its SaaS offering. Its scale was given as 100,000, with 20 million registered patients. It provides functions such as doctor scheduling, patient diagnosis and treatment plans, and appointments, and even offers personal health management for patients.
The fifth application is also based on clinical medicine. The University of Texas MD Anderson Cancer Center, a research center that ranks among the best in the United States, built a private cloud to serve its own hospital's clinical services and analysis needs, providing virtualized resources and dynamically allocated processing capacity. The private cloud currently has 8,000 processors and can support more than three terabytes of data. It carries very diverse workloads, including oncology pathology research, epidemiology, and studies on predicting the causes of disease and on disease models. The center had two considerations in choosing to build a private cloud. On the one hand, large private medical institutions are concerned about their pathology data, which is quite large, involving about 1 billion pieces of data. On the other hand, according to its CIO, discussions with several large providers showed that the service quality guarantees offered by public cloud platforms might not be acceptable, so the center chose to invest in its own private cloud data center.
The sixth case is drug research and development process management, a very long process with very large data volumes. Fujitsu in Japan provides SaaS services for managing research process data, aimed mainly at Japanese small and medium-sized enterprises. In the United States, one company that does this well is AMAG. In 2009 the company moved its IT business entirely to SaaS and bought no servers of its own. It now uses many vendors' SaaS services, including storage: it has 6 TB of storage capacity with Egnyte. The results so far are good, and at present its data security is effectively guaranteed.
The final case is Microsoft's HealthVault, a platform many people will know. Released in 2007, it aims to manage personal and family health information. Data can now be entered manually, uploaded from portable devices, or imported as medical records from third-party institutions. It follows an app-store-like model, providing an open SDK and open interfaces to support integration with third-party applications. It runs on Microsoft's own private cloud and is reportedly migrating to Android. The front end is web-based, and physical monitoring devices connect through a standard interface model.
To summarize the cases above: in biomedical big data applications, those who have attempted big data have mostly built on public or private clouds, and ultimately hope to offer open big data capabilities. At present, big data research is concentrated in Europe and America.
The applications above show that, when manufacturers consider cloud computing for processing big data, their main concerns are security and bandwidth cost. When big data is concentrated locally, the overhead of exchanging it with the cloud often drags down business performance. Many applications migrate to the cloud because the big data itself is migrating into the cloud, and this is especially evident in the biomedical field. Amazon is a good example: it now hosts terabytes of data, including biomedical data, and workflows deployed on Amazon can use that data directly. Hadoop plays a very important role in cloud computing here. As sequencing technology and clinical medical records become widespread and the rapid application of biology becomes the norm, the applications we face are basically all big data applications. Cloud computing provides a good model for big data applications. We should promote the integration and application of medical data, and use a marketplace model to build our own biomedical dataset resources.
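The bandwidth concern can be made concrete with a back-of-the-envelope estimate. The sketch below takes the raw data sizes mentioned for a single human genome (100 GB to 600 GB) and estimates how long an upload would take. The link speeds are assumed values chosen for illustration, not figures from the talk, and the calculation assumes the link is used at its full rate with no protocol overhead.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) over a link at full rate."""
    bits = size_gb * 1e9 * 8                    # gigabytes -> bits
    seconds = bits / (link_mbit_per_s * 1e6)    # megabits per second -> bits per second
    return seconds / 3600


# Raw data sizes per genome as quoted in the article.
GENOME_SIZES_GB = (100, 600)
# Assumed link speeds in Mbit/s (hypothetical, for illustration only).
LINK_SPEEDS_MBIT = (100, 1000)

for size_gb in GENOME_SIZES_GB:
    for link in LINK_SPEEDS_MBIT:
        hours = transfer_hours(size_gb, link)
        print(f"{size_gb} GB over {link} Mbit/s: {hours:.1f} h")
```

Under these assumptions, a 600 GB data set needs more than 13 hours on a 100 Mbit/s link, which is longer than the analysis time reported for Crossbow. This is why it helps when the data already lives in the cloud where the computation runs.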