New approaches to Big Data processing and analysis
There are many ways to process and analyze Big Data, but most share some common characteristics. Namely, they take advantage of commodity hardware to enable scale-out, parallel processing; they employ non-relational data storage to handle unstructured and semi-structured data; and they apply advanced analytics and data-visualization technology to convey insights to end users.
Wikibon has identified three Big Data approaches that will reshape the business analytics and data management markets.
Hadoop
Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo, Hadoop was inspired by MapReduce, a framework Google developed in the early 2000s to index the Web. It is designed to handle petabytes and even exabytes of data distributed across multiple nodes in parallel.
Hadoop clusters run on inexpensive commodity hardware, so the hardware can be scaled out without financial strain. Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously refine its core technology. Basic concept: rather than pounding massive data with a single machine, Hadoop breaks Big Data into parts so that each part can be processed and analyzed at the same time.
How Hadoop works
A customer obtains unstructured and semi-structured data from sources such as log files, social media feeds, and internal data stores. The data is broken into "parts," which are loaded into a file system spread across multiple nodes of commodity hardware. The default file storage system in Hadoop is the Hadoop Distributed File System (HDFS). File systems such as HDFS are adept at storing large volumes of unstructured and semi-structured data because they do not require the data to be organized into relational rows and columns.
The "parts" are copied multiple times and loaded into the file system. Thus, if one node fails, the other node contains a copy of the failed node data. The name node acts as a mediator and is responsible for communicating information such as which nodes are available, where some data is stored in the cluster, and which nodes fail.
Once the data is loaded into the cluster, it is ready to be analyzed via the MapReduce framework. The customer submits a "Map" job (usually a query written in Java) to a node known as the job tracker. The job tracker refers to the name node to determine which data it needs to complete the job and where that data resides in the cluster. Once determined, the job tracker submits the query to the relevant nodes. Rather than bringing all the data back to one central location, each node then processes its data simultaneously, in parallel. This is an essential characteristic of Hadoop.
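The following is a minimal word-count-style Mapper, shown only to illustrate the kind of "Map" job the job tracker distributes to each node. The Hadoop MapReduce API used here is standard; the class name and the word-count use case are illustrative assumptions, not the article's own example.

```java
// Illustrative Mapper: each node runs this over its local chunk of the data in parallel.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split each input line into tokens and emit an intermediate (token, 1) pair for each.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```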
When each node finishes processing its assigned job, it stores the results locally. The customer then initiates a "Reduce" job through the job tracker, in which the results of the map phase stored on the individual nodes are aggregated to determine the "answer" to the original query, and the answer is then loaded onto another node in the cluster. The customer can access these results, which can in turn be loaded into any of a number of analytics environments for analysis. The MapReduce job is now complete.
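A matching Reducer sketch is shown below: it aggregates the intermediate output of the map phase to produce the final answer for each key. Again, this is illustrative and hypothetical rather than the article's own example.

```java
// Illustrative Reducer: sums the per-node map output for each key.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TokenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get(); // aggregate intermediate counts from the map phase
        }
        context.write(key, new IntWritable(sum)); // final result for this key
    }
}
```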
Once the MapReduce phase is complete, data scientists and others can apply advanced data-analysis techniques to the processed data. The data can also be modeled and moved from the Hadoop cluster into existing relational databases, data warehouses, and other traditional IT systems for further analysis.
Technical components of Hadoop
The Hadoop stack consists of multiple components, including:
· Hadoop Distributed File System (HDFS): The default storage layer for all Hadoop clusters;
· Name node: The node in a Hadoop cluster that provides information about where data is stored and about node failures.
· Secondary node: A backup of the name node; it periodically replicates and stores the name node's data in case the name node fails.
· Job tracker: The node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data (see the driver sketch after this list).
· Slave nodes: The ordinary nodes of a Hadoop cluster; each slave node stores data and takes data-processing instructions from the job tracker.
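To show how these components fit together from the client's point of view, here is a hypothetical driver that submits the Mapper and Reducer sketched earlier as a single MapReduce job. The configuration keys point the client at the name node and (in classic Hadoop) the job tracker; the exact keys depend on the Hadoop version, and all host names and paths here are made up.

```java
// Hypothetical driver that configures and submits a MapReduce job to the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TokenCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");  // name node (hypothetical host)
        conf.set("mapred.job.tracker", "jobtracker-host:8021"); // job tracker (classic Hadoop key)

        Job job = Job.getInstance(conf, "token count");
        job.setJarByClass(TokenCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(TokenReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/raw"));       // data already loaded into HDFS
        FileOutputFormat.setOutputPath(job, new Path("/data/results")); // results written back to the cluster

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```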
In addition to the above, the Hadoop ecosystem includes many free subprojects. NoSQL data stores such as Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. Besides Java, many MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive, originally developed by Facebook, is an open source data warehouse that allows analytic modeling of data stored in Hadoop.
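Hive exposes that modeling capability through a SQL-like interface. The sketch below assumes a running HiveServer2 endpoint and a hypothetical table, and shows how a Java client might query data stored in Hadoop through Hive's JDBC driver; it is an illustration, not part of the article.

```java
// Hedged sketch: querying Hadoop data through Hive's JDBC driver.
// The host, port, database, credentials, and table are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive's JDBC driver must be on the classpath
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like query into jobs that run over the data in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```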
See also: HBase, Sqoop, Flume and more: Apache Hadoop Defined (http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_more:_apache_hadoop_defined)
Hadoop: Pros and cons
The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semi-structured data, which until now could not be handled any other way, in a cost- and time-effective manner. Because Hadoop clusters can scale to petabytes and even exabytes of data, enterprises no longer have to rely on sample data sets but can process and analyze all relevant data. Data scientists can apply an iterative approach to analysis, continually refining and testing queries to uncover previously unknown insights. Getting started with Hadoop is also inexpensive: developers can download the Apache Hadoop distribution for free and begin experimenting with it in less than a day.
The downside of Hadoop and its myriad components is that they are immature and still developing. As with any young, raw technology, implementing and managing Hadoop clusters and performing advanced analytics on large volumes of unstructured data require significant expertise, skill, and training. Unfortunately, the current shortage of Hadoop developers and data scientists makes it impractical for many enterprises to maintain complex Hadoop clusters and take advantage of them. Further, as Hadoop's many components are improved by the technology community and new components are continually created, there is, as with any immature open source technology, a risk of failure. Finally, Hadoop is a batch-oriented framework, meaning it does not support real-time data processing and analysis.
The good news is that some of the brightest minds in IT are contributing to the Apache Hadoop project, and a new generation of Hadoop developers and data scientists is coming of age. As a result, the technology is advancing rapidly, becoming both more powerful and easier to implement and manage. Vendors, including Hadoop start-ups Cloudera and Hortonworks as well as established IT stalwarts such as IBM and Microsoft, are working to develop enterprise-ready commercial Hadoop distributions, tools, and services, making deployment and management of the technology a practical reality for the traditional enterprise. Other start-ups are working to refine NoSQL (Not Only SQL) data systems that combine Hadoop with near-real-time analytics solutions.
NoSQL
A new style of database called NoSQL (Not Only SQL) has emerged to handle, like Hadoop, large volumes of multi-structured data. However, whereas Hadoop excels at large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (there are some exceptions), at serving up discrete data stored among large volumes of multi-structured data to end users and automated Big Data applications. This capability is sorely lacking in relational database technology, which simply cannot maintain the required performance levels at Big Data scale.
In some cases, NoSQL and Hadoop work in tandem. HBase, for example, is a popular NoSQL database modeled after Google's BigTable that is often deployed on top of HDFS (the Hadoop Distributed File System) to provide low-latency, quick lookups in Hadoop.
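To make that "low-latency lookup" concrete, here is a minimal sketch using the standard HBase Java client to fetch a single row by key from a table stored on HDFS. The API calls are the real HBase client API; the table name, row key, and column names are hypothetical.

```java
// Minimal sketch of a low-latency point lookup with the HBase Java client.
// Table, row key, and column names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("job_results"))) {
            // Fetch a single row by key without scanning the whole data set.
            Get get = new Get(Bytes.toBytes("user#42"));
            Result row = table.get(get);
            byte[] value = row.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("count"));
            System.out.println(value == null ? "no value" : Bytes.toString(value));
        }
    }
}
```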
Currently available NoSQL databases include:
· HBase
· Cassandra
· MarkLogic
· Aerospike
· MongoDB
· Accumulo
· Riak
· CouchDB
· DynamoDB
The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools. Both shortcomings are being addressed by the open source NoSQL communities and a handful of vendors, including DataStax, Sqrrl, 10gen, Aerospike, and Couchbase, that are attempting to commercialize the various NoSQL databases.
Massively parallel analytic databases
Unlike traditional data warehouses, massively parallel analytic databases can quickly ingest large amounts of mainly structured data with minimal data modeling required, and can scale out to accommodate terabytes and even petabytes of data.
Most important for end users, massively parallel analytic databases support near-real-time results for complex SQL queries, also known as interactive querying, which is a significant capability Hadoop lacks. In some cases they support near-real-time Big Data applications. The fundamental characteristics of a massively parallel analytic database include the following (a sketch of such an interactive query appears after the feature list):
Massively parallel processing: As the name implies, massively parallel analytic databases use massively parallel processing to support data ingest, processing, and querying across multiple machines simultaneously. The result is significantly faster performance than traditional data warehouses, which run on a single large machine and are constrained by a single choke point for data ingest.
Shared-nothing architecture: A shared-nothing architecture ensures there is no single point of failure in the analytic database environment. Under this architecture, each node operates independently of the others, so if one machine fails, the others keep running. This is especially important in massively parallel processing environments, where hundreds of machines process data in parallel and the occasional failure of one or more machines is inevitable.
Columnar storage: Most massively parallel analytic databases use a columnar storage structure, whereas most relational databases store and process data in rows. In a columnar environment, a query's "answer" is derived by processing only the columns that contain the necessary data, rather than entire rows, resulting in dramatically faster query results. Columnar storage also means the data does not need to be structured into the neat tables of traditional relational databases.
Advanced data compression: Strong compression allows analytic databases to ingest and store larger volumes of data, and to do so with far fewer hardware resources, than traditional databases. A database with 10-to-1 compression, for example, can shrink 10 terabytes of data down to 1 terabyte. Data encoding, which covers compression and related techniques, is key to scaling efficiently to massive volumes of data.
Commodity hardware: Like Hadoop clusters, most (though certainly not all) massively parallel analytic databases run on off-the-shelf commodity hardware from vendors such as Dell and IBM, allowing them to scale out in a cost-effective manner.
In-memory data processing: Some (but certainly not all) massively parallel analytic databases use dynamic RAM or flash for real-time data processing. Some, such as SAP HANA and Aerospike, run entirely in memory, while others take a hybrid approach, keeping "cold" data on cheaper but slower disk and "hot" data in DRAM or flash.
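As referenced above, the interactive querying these systems provide is typically reached through standard SQL interfaces such as ODBC or JDBC. The following is a hedged sketch of an analyst issuing an aggregate query over JDBC; the JDBC URL, driver, credentials, and schema are all hypothetical and vendor-dependent, and the database itself parallelizes the work across its nodes.

```java
// Hedged sketch: an interactive aggregate SQL query against an MPP analytic database via JDBC.
// The URL, credentials, and schema are hypothetical; the vendor's JDBC driver must be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InteractiveQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:vendor://mpp-cluster:5433/analytics"; // vendor-specific URL (hypothetical)
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             // The database distributes this aggregation across all of its nodes and
             // returns results in near real time.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(revenue) AS total " +
                     "FROM sales WHERE sale_date >= DATE '2012-01-01' " +
                     "GROUP BY region ORDER BY total DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getDouble("total"));
            }
        }
    }
}
```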
Massively parallel analytic databases do have blind spots, however. Most notably, they are not designed to store, process, and analyze large volumes of semi-structured and unstructured data.