We have entered the age of Big Data: IDC's Digital Universe study reports that data is growing faster than Moore's law. This trend signals a shift in how enterprises handle data, with isolated data islands giving way to large server clusters that keep data and computing resources together.
From another perspective, this paradigm shift shows that the pace and volume of data growth demand a new approach to networked computing, and Google is a good example. When Google launched its beta search engine in 1998, Yahoo dominated the market and competitors such as InfoSeek and Lycos were still in the running; within just two years Google had become the leading search provider. It was not until Google published its MapReduce paper in 2004 that the rest of us were given a glimpse of Google's back-end architecture.
Google's architecture showed how the company could index more data, return search results faster, and do so more efficiently and cost-effectively than its competitors. The key shift was to break complex data analysis tasks into simple subtasks executed in parallel on commodity servers: separate processes first map the data and then reduce it to intermediate or final results. This MapReduce framework was eventually adopted by Apache Hadoop for enterprise use.
A Brief History of Hadoop
Inspired by Google's papers, Doug Cutting, an engineer who later joined Yahoo, developed a Java-based MapReduce implementation named Hadoop. In 2006, Hadoop became a subproject of Apache Lucene (the popular full-text search library), and in 2008 it was promoted to a top-level Apache project.
In essence, Hadoop lets large clusters of commodity computers organize, store, search, share, analyze, and visualize data from diverse sources (structured, semi-structured, and unstructured), and it can scale from dozens of servers to thousands, each providing local computation and storage.
Hadoop has two basic components. The first is the Hadoop Distributed File System (HDFS), the primary storage system, which replicates and distributes blocks of source data across the compute nodes of the cluster so they can be analyzed by one or more applications. The second is MapReduce, a software framework and programming model for writing applications that process large amounts of distributed data in parallel.
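To make the programming model concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class name and the input/output paths are illustrative placeholders, and exact job setup varies slightly between Hadoop releases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper reads one block of the input in parallel
  // and emits a (word, 1) pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive at one reducer,
  // which sums them into the final result.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations live in HDFS; paths are supplied as arguments.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input across HDFS blocks, scheduling the map and reduce tasks on the cluster, and shuffling intermediate (word, count) pairs between them; the application code only has to express the map and reduce steps.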
The open source nature of Apache Hadoop creates an ecosystem that keeps improving its functionality, performance, reliability, and ease of use.
Maintaining simplicity and scalability
In an article titled "The Unreasonable Effectiveness of Data", researchers from Google contrasted the simple equations of physics (such as E = mc²) with other disciplines, observing that sciences involving humans rather than elementary particles are better served by simple algorithms applied to large amounts of data.
Simple methods, given enough data, can indeed make sense of a complex natural world and of elusive human behavior, and this is a large part of why Hadoop has become so popular.
The researchers found that relatively simple algorithms, applied to very large datasets, can produce stunning results. One example is scene completion: an algorithm removes an object from a photograph (a car, say) and then "patches" the hole using material drawn from a library of other photos. With only thousands of images in the library the results are poor, but once the library grows to millions of photos the same simple algorithm performs remarkably well. Finding patterns and filling gaps in this way is a common theme in many of today's data analysis applications.
Data analysis also faces another inherent complexity: the growth of unstructured and semi-structured data. Unstructured data such as log files, social media, and video is growing in both volume and importance, and data that once had structure often loses it as it is transformed. Traditional analytical techniques require heavy preprocessing of unstructured and semi-structured data before they can produce results, and if that preprocessing is flawed, the results may be wrong.
Hadoop's ability to take unstructured, semi-structured, and structured data in its original form and produce meaningful results with simple algorithms is currently unmatched. MapReduce lets us analyze data incrementally, without complex data transformations or other preprocessing, and without having to define schemas or consolidate the data in advance.
Price and performance of data analysis
Hadoop not only delivers superior data analysis capabilities and results; it is also more cost-effective than traditional data analysis tools. The reason is that scaling traditional analysis tools tends to follow the 80/20 rule: the initial investment brings large gains, but as datasets grow into big data, the returns diminish.
In stark contrast, Hadoop scales linearly, which is a key factor in effective and cost-efficient data analysis. As datasets grow, the cost of a traditional analysis environment grows disproportionately, and paying ever more for each additional insight becomes daunting. With Hadoop, a cluster of servers with directly attached storage scales linearly with the number and size of the datasets.
These advantages are the main reasons Hadoop has been adopted so rapidly by web-based and data-intensive enterprises.
However, the main challenge facing Hadoop deployments remains its file system. HDFS is append-only (data can only be added to the end of a file), and it requires data to be loaded into the Hadoop cluster before processing, after which the results must be moved out again for use by applications that do not support the HDFS API.
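As an illustration of what working against the HDFS API looks like, here is a minimal sketch using Hadoop's Java FileSystem class. The cluster configuration and the file path are assumptions, and append support depends on the HDFS version and configuration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS in the loaded configuration points at the cluster,
    // e.g. hdfs://namenode:8020 (placeholder).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path logFile = new Path("/data/events.log"); // hypothetical path

    // Files in HDFS are written once and may only be appended to,
    // never updated in place.
    try (FSDataOutputStream out = fs.exists(logFile)
            ? fs.append(logFile)
            : fs.create(logFile)) {
      out.write("new event record\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```

Any application that wants to read or write this data directly must go through this API (or a gateway in front of it), which is the integration barrier described above.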
Another obstacle to deploying Hadoop in larger enterprises is the special effort required to make the environment reliable: the cluster must be monitored constantly so that a single point of failure does not become a disaster, and data must be reloaded into the cluster whenever it is lost.
Breaking through the obstacles
These problems are becoming a thing of the past. The open source community has created a vibrant ecosystem that keeps improving Hadoop, and a number of companies now offer commercial products built on open source Hadoop.
The growing number of commercial Hadoop products is driving wider adoption. These products make Hadoop easier to integrate into the enterprise and deliver enterprise-class performance and reliability. One way they achieve this is by using existing standard communication protocols as the basis for seamless integration between traditional and Hadoop environments.
Is it over or just beginning?
The data analysis paradigm is changing, and that brings real opportunities for businesses. Hadoop lets any enterprise turn the insight this paradigm shift offers into a significant competitive advantage.
Hadoop is without doubt a game-changing technology, and with the introduction of enterprise-class commercial Hadoop products, Hadoop itself is changing. These next-generation solutions are leading the new data analysis paradigm. (Compiled by 邹铮)