Data Analysis Frameworks (Traditional Data Analysis Framework, Big Data Analysis Framework)
Medical big data has all the features described in the first section. While big data brings many advantages, its characteristics also create problems for traditional data analysis methods and software. Before the big data era, constrained by the availability of data and by limited computational power, data management and analysis followed a different line of thinking and a different process: research started from a hypothesis and then studied the causal relationships among things, hoping to answer "why".
In the big data era, the emergence of massive data provides a more detailed and comprehensive view of the data from different angles, which arouses people's curiosity and desire to explore: people want to know what the data tells them, not merely whether their conjectures are confirmed by the data. More and more people use big data to mine associations of interest, including dependencies among seemingly unrelated things, and then compare, analyze, generalize, and study them ("why" becomes an option rather than the only ultimate goal). These differences between big data and traditional data lead to different analysis processes, as shown in Figure 1.
Figure 1: Traditional data analysis process versus big data analysis process
Faced with massive data and a different way of thinking about analysis, the management and analysis of big data diverge more and more from traditional data analysis. A single, preset, structured database that answers specific questions is clearly not adequate for large volumes of heterogeneous data and messy problems. The mixed, diverse nature of the data is reflected in several surveys. A SAS survey shows that unstructured data in an organization can account for up to 85% of the total data volume, and that this non-numeric, unstructured data must still be quantified and used in decision analysis (Troester, 2012).
Another survey, conducted in 2013, showed that only 26% of the 461 organizations that provided complete feedback said their big data was structured (Russom, 2013). In addition, the data analyzed in an institution generally does not come from a single source. Alteryx's survey showed that only 6% of the 200 organizations surveyed reported that their data came from a single source; the most common situation was 5-10 sources, with the distribution shown in Figure 2 (Alteryx, 2014).
The same survey also showed that 90% of the surveyed organizations reported data integration problems, 37% reported having to rely on other groups to provide data, and 30% said they could not get the data they wanted; it is generally estimated that 60% to 80% of a data analyst's time is spent in the data preparation and processing phase (Alteryx, 2014).
Figure 2: Distribution of the number of data sources (Alteryx, 2014)
This shows the importance of effective data management, database construction, and a sound data analysis process. The traditional data management process consists of extraction, transformation, and loading (ETL). Through ETL, the data is given a structure appropriate to the intended analysis. The data preparation and analysis process is shown in Figure 3: (1) extract data from one or more sources; (2) cleanse, format, standardize, aggregate, augment, or apply other specific data processing rules; (3) load the processed data into a specific database or store it in a specific file format; (4) analyze the data with appropriate methods.
Figure 3: The ETL-based data preparation and analysis process
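For concreteness, the four ETL steps above can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed implementation; the file names, field names, and cleaning rules are hypothetical placeholders for whatever a particular medical analysis would require.

```python
import csv
import sqlite3

def extract(paths):
    """Step 1: pull raw records from one or more source files (hypothetical CSVs)."""
    for path in paths:
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

def transform(records):
    """Step 2: cleanse, format, and standardize each record."""
    for rec in records:
        age = rec.get("age", "").strip()
        if not age.isdigit():          # drop records that fail a simple validity rule
            continue
        yield (rec["patient_id"].strip(), int(age), rec["diagnosis"].strip().lower())

def load(rows, db_path="warehouse.db"):
    """Step 3: load the processed data into a target database (SQLite here)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS visits (patient_id TEXT, age INTEGER, diagnosis TEXT)")
    con.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Step 4 (analysis) would then query the loaded table with whatever method is appropriate.
    load(transform(extract(["clinic_a.csv", "clinic_b.csv"])))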
The core content of ETL still applies to big data, but the volume and diversity of big data place higher and more complex demands on databases and on data management and processing, so handling the entire data set in a single linear pass becomes very expensive in labor, material, and time.
In addition, the velocity and variability of big data make it less feasible to store the data in a single central database. In this situation, the most popular idea is to divide the data processing: store the data on a number of storage nodes (such as a networked, distributed database), process the data at each node (possibly even performing an initial analysis, with the extent of processing adjusted to the customer's specific problem), then aggregate the results and provide them to one or more databases, from which the appropriate analysis method is chosen to obtain useful results as needed. ETL runs through this entire big data management and analysis process. Figure 4 illustrates the approximate big data management and analysis process and names some of the most common data processing and analysis platform tools.
Figure 4: Big data management and analysis process, with common platform tools
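The divide-then-aggregate idea just described can also be sketched in Python. This is only an illustrative sketch: the partitioning scheme and the per-node summary are invented here, and a real deployment would push this work to its actual storage nodes rather than to local processes.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def process_partition(records):
    """Work done locally at one storage node: an initial, partial summary."""
    return Counter(rec["diagnosis"] for rec in records)

def partition(records, n_nodes):
    """Split the data across n_nodes storage nodes (round-robin for simplicity)."""
    parts = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        parts[i % n_nodes].append(rec)
    return parts

def analyze(records, n_nodes=4):
    """Process each partition in parallel, then aggregate the partial results."""
    with ProcessPoolExecutor(max_workers=n_nodes) as pool:
        partials = pool.map(process_partition, partition(records, n_nodes))
    total = Counter()
    for p in partials:
        total.update(p)
    return total  # aggregated result, ready to load into a central database

if __name__ == "__main__":
    toy_records = [{"diagnosis": d} for d in ["flu", "flu", "cold", "asthma"]]
    print(analyze(toy_records, n_nodes=2))  # Counter({'flu': 2, 'cold': 1, 'asthma': 1})
```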
The Data Warehousing Institute (TDWI) conducted a survey to help people make better decisions when choosing software and hardware for big data analysis. For each big data technology, feature, and user operation, the survey offered three choices: (1) in use now and will continue to be used; (2) will be used within three years; (3) no plan to use. The left side of Figure 5 shows the percentage of respondents choosing each big data analysis platform tool; the right side of Figure 5 shows the potential growth of each platform tool and the percentage of respondents committed to it.
Figure 5: Usage and potential growth of big data analysis platform tools (TDWI survey)
Considering potential growth and commitment together, the survey further divides the big data analysis platform tools into four groups: the first group has moderate commitment and moderate to strong growth potential; the second group has moderate to strong commitment and moderate growth potential; the third group has weak to moderate commitment and modest growth potential; the fourth group has moderate to strong commitment but weak growth potential. Figure 6 shows the distribution of these groups. For reasons of space, this article does not elaborate on each of the listed platform tools; interested readers can consult the literature for a more detailed introduction.
Figure 6: Grouping of platform tools by commitment and growth potential
Figures 5 and 6 show that the most popular platforms and data processing methods are the open-source, free Hadoop and MapReduce. Given their potential growth and the commitment to them, it can be foreseen that Hadoop and MapReduce are driving, and will continue to drive, the processing and application of big data.
Here we briefly introduce the concepts of Hadoop and MapReduce. Hadoop is a distributed data-processing framework written in Java. It provides high-throughput reads and writes to data stored across multiple hardware devices. More importantly, it is highly fault-tolerant for big data and highly available for parallel applications. The Hadoop framework consists of name nodes (NameNode) and data nodes (DataNode). Huge data files are split into smaller blocks that are stored on multiple data nodes, which can be almost any kind of commodity computer hardware.
The attribute information about these files, called metadata, is stored in the name node (NameNode). The NameNode mainly manages the file system namespace and clients' access to files. The structure of the Hadoop framework is shown in Figure 7:
Figure 7: The Hadoop framework structure
When accessing and manipulating data files, the client first contacts the name node to retrieve attribute information about the file's blocks, such as their locations and file names. Based on this attribute information, the client then reads the data blocks directly from the corresponding data nodes. Hadoop itself has redundancy and replication features that ensure data can be recovered without loss when a single hardware storage device fails; for example, each data block has three replicas by default.
In addition, when new data nodes are added to the framework, Hadoop can automatically rebalance the data load across data nodes. Similarly, the name node can be given redundancy and replication, so that the file attribute information can be recovered if the single name node storing it fails.
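The separation of metadata from data and the read path just described can be modeled with a small toy sketch in Python. The class and method names here are invented for illustration and do not correspond to Hadoop's real API; the sketch only mirrors the division of labor between name node, data nodes, and client.

```python
class NameNode:
    """Toy stand-in for the name node: holds only metadata, never the data itself."""
    def __init__(self):
        self.metadata = {}  # filename -> list of (block_id, ids of data nodes holding replicas)

    def add_file(self, filename, block_locations):
        self.metadata[filename] = block_locations

    def locate(self, filename):
        """Return the block ids and the data nodes holding their replicas."""
        return self.metadata[filename]

class DataNode:
    """Toy stand-in for a data node: stores the actual blocks."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

def read_file(name_node, data_nodes, filename):
    """Client: ask the name node where the blocks are, then read them directly from data nodes."""
    content = b""
    for block_id, replica_ids in name_node.locate(filename):
        node = data_nodes[replica_ids[0]]  # any live replica would do
        content += node.blocks[block_id]
    return content

if __name__ == "__main__":
    nn, d0, d1 = NameNode(), DataNode(), DataNode()
    d0.blocks["b1"], d1.blocks["b2"] = b"hello ", b"world"
    nn.add_file("greeting.txt", [("b1", [0]), ("b2", [1])])
    print(read_file(nn, [d0, d1], "greeting.txt"))  # b'hello world'
```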
MapReduce is a programming model for processing big data in parallel. Under the Hadoop framework, the same MapReduce program can be written and run in a variety of languages (Java, Ruby, Python, and so on). The key ideas are three: map, reduce, and parallel processing. We use an example to understand the general working principle of MapReduce. Suppose we have a string of roughly 130 characters (a children's rhyme in the original example), and the task is to count the number of occurrences of each word.
The simplest method is to read each word sequentially, build an in-memory index of the distinct words, and count occurrences: if a word is new, its value is set to 1; if it has been seen before, its value is incremented. Done this way, the time spent grows linearly with the length and complexity of the string. When the string runs to millions of units, as with genomic data, the time required becomes quite astonishing. Parallel processing can save a considerable amount of time. A dictionary-based sketch of the sequential count follows.
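As a point of reference, the sequential count just described takes only a few lines of Python; this is a minimal sketch, with a plain dictionary standing in for the in-memory index.

```python
def sequential_word_count(text):
    """Read each word in order and keep a running tally in a dictionary."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1  # new word -> 1, seen word -> add 1
    return counts
```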
We first split the original file into several smaller blocks, then build the index for each block and attach a value to each word (no accumulation yet, just a single tally per occurrence), then reorder so that identical words are grouped together, and finally use the reduce step to compute each word and its total number of occurrences. Figure 8 shows the steps of this example:
Figure 8: MapReduce word-count example steps
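A minimal Python sketch of these split, map, shuffle/sort, and reduce steps is given below. It imitates the MapReduce flow on a single machine and is not Hadoop's actual API; under Hadoop, the blocks would live on different data nodes and the map and reduce tasks would run in parallel across the cluster.

```python
from itertools import groupby

def map_phase(chunk):
    """Map: emit (word, 1) for every word in one file block, with no accumulation yet."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    """Shuffle/sort: bring identical words together across all blocks."""
    pairs = sorted(pair for block in mapped for pair in block)
    return groupby(pairs, key=lambda kv: kv[0])

def reduce_phase(grouped):
    """Reduce: sum the tallies for each word."""
    return {word: sum(v for _, v in values) for word, values in grouped}

def word_count(text, n_chunks=4):
    """Split the text into blocks, map each block, shuffle, then reduce."""
    words = text.split()
    size = max(1, len(words) // n_chunks)
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return reduce_phase(shuffle([map_phase(c) for c in chunks]))

if __name__ == "__main__":
    print(word_count("big data makes big demands on data analysis"))
    # {'analysis': 1, 'big': 2, 'data': 2, 'demands': 1, 'makes': 1, 'on': 1}
```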