Microsoft recently announced the development of an open-source distribution of Hadoop compatible with Windows Server and the Windows Azure platform. IBM announced a new Hadoop-based storage architecture that runs DB2 or Oracle databases as a cluster, enabling applications to support high-performance analytics, data warehousing, and cloud computing. EMC has launched the world's first purpose-built, high-performance Hadoop data processing appliance, the Greenplum HD data computing appliance, giving customers a powerful and efficient way to extract the full value of big data. The search giant Baidu is also using Hadoop; however, for performance and security reasons, Baidu rewrote the Hadoop computation layer when it adopted the Hadoop architecture. At the 2011 OpenWorld conference, Oracle announced its big data appliance, which uses a NoSQL database together with the Hadoop framework and has been successfully commercialized for large-scale analysis. Huawei appears on the list of key Hadoop contributors ahead of Google and Cisco, showing that Huawei is also actively involved in the open-source community. Taobao, Facebook, and others have joined the Hadoop camp as well. Long a behind-the-scenes hero, Hadoop will be used in more and more areas, and a Hadoop storm sweeping the world is just around the corner.
Hadoop and Nutch
As we all know, Nutch is an open-source web search engine implemented in Java. Nutch and Hadoop share the same origin: starting from the 0.x versions, Hadoop was split out of Nutch into an independent open-source subproject in order to meet Nutch's need to crawl and store massive amounts of data. Hadoop is not merely a distributed file system for storage; it is a framework designed to run distributed applications on large clusters of commodity hardware. Hadoop consists of two parts: the distributed file system HDFS and a MapReduce implementation. In short, the core goal of Hadoop is to provide a framework for developing distributed applications. HDFS adopts a master/slave architecture: an HDFS cluster is composed of a single NameNode and a number of DataNodes. HDFS supports a traditional hierarchical file organization, similar to most other file systems, in which users can create directories and create, delete, move, and rename files.
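As a quick illustration of these hierarchical file operations, the sketch below uses the standard Hadoop FileSystem API to create a directory, write a file, rename it, and delete it. The paths and the fs.defaultFS address are made-up placeholders; on a real cluster the address would come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasicOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally configured in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/demo");
        fs.mkdirs(dir);                                        // create a directory

        Path file = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {       // create and write a file
            out.writeUTF("hello hdfs");
        }

        fs.rename(file, new Path("/demo/hello-renamed.txt"));  // move/rename
        fs.delete(new Path("/demo/hello-renamed.txt"), false); // delete (non-recursive)
        fs.close();
    }
}
```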
An application case of Hadoop: a Nutch and Hadoop distributed retrieval framework for massive data search
Nutch is an application built on top of Hadoop. The architecture of a Nutch-based distributed search engine can be divided into the distributed crawler, the distributed file storage system (HDFS), the retrieval service (Searcher), and so on. The workflow of the distributed crawler is as follows: first, the crawler generates from the WebDB a set of URLs to be crawled, called the fetchlist; then the download threads (the fetcher) crawl the pages listed in the fetchlist. In Nutch, the crawl operation is carried out as a series of sub-operations. The files crawled by Nutch are stored in blocks on HDFS. It is worth noting that Nutch's distributed retrieval service does not depend on HDFS: the index blocks that serve retrieval requests are stored in the local file system, not in HDFS.
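To make the generate/fetch/update cycle concrete, here is a deliberately toy, single-process sketch. The in-memory webDb map, the seed URL, and the two-round loop are hypothetical stand-ins for Nutch's internals (real Nutch keeps the WebDB and segments on HDFS and runs the fetchers as distributed MapReduce tasks).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.stream.Collectors;

// A toy illustration of the generate -> fetch -> update cycle described above.
public class ToyCrawlCycle {
    public static void main(String[] args) throws Exception {
        Map<String, Boolean> webDb = new LinkedHashMap<>();   // url -> fetched?
        webDb.put("https://example.org/", false);             // hypothetical seed URL
        HttpClient http = HttpClient.newHttpClient();

        for (int round = 0; round < 2; round++) {
            // Generate: pick the unfetched URLs (the "fetchlist") for this round.
            List<String> fetchList = webDb.entrySet().stream()
                    .filter(e -> !e.getValue())
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());

            // Fetch: download each page; real Nutch writes the content into a segment.
            for (String url : fetchList) {
                HttpResponse<String> resp = http.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                System.out.println(url + " -> " + resp.body().length() + " chars");
                // Update: mark as fetched; link extraction would add new URLs here.
                webDb.put(url, true);
            }
        }
    }
}
```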
Combining the advantages of Nutch and Hadoop, we can build a distributed retrieval architecture that supports massive data search. The main steps are as follows (a small sketch of the segment write in step 2 follows the list):
1. Use Heritrix to crawl web page text;
2. Write the collected data into Nutch segments, which are stored on HDFS;
3. Perform link analysis and text extraction on top of the segments;
4. Build the distributed index distribution and update mechanisms;
5. Use Nutch to provide distributed retrieval.
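A minimal sketch of step 2, assuming the crawled page text is appended as URL-to-text key/value records into a segment file on HDFS using Hadoop's SequenceFile writer. The NameNode address, segment path, and sample record are placeholders, and real Nutch segments have a richer structure than this.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SegmentWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        // One segment file holding URL -> page text records on HDFS.
        Path segment = new Path("/crawl/segments/20240101/part-00000");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(segment),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new Text("https://example.org/"),
                          new Text("<html>...page text fetched by Heritrix...</html>"));
        }
    }
}
```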
The implementation of Hadoop's underlying principles
A typical Hadoop offline analysis system architecture
Real-time data analysis is generally used by enterprises in finance, mobile, and Internet products, which often require analysis over hundreds of millions of records to be returned within a few seconds. Such needs can be met by a well-designed parallel processing cluster built from traditional relational databases, but at a relatively high hardware and software cost. Newer tools for large-scale real-time data analysis include EMC Greenplum, SAP HANA, and so on.
For most applications whose response-time requirements are not so strict, such as offline statistical analysis, machine learning, search engine inverted index computation, and recommendation engine computation, offline analysis should be used: log data is imported into a dedicated analysis platform through data acquisition tools. In the face of massive data, however, traditional ETL tools often fail completely, mainly because the cost of data format conversion is too high and their performance cannot keep up with massive data collection. Massive data acquisition tools built by Internet companies include Facebook's open-source Scribe, LinkedIn's open-source Kafka, Taobao's open-source TimeTunnel, Hadoop's Chukwa, and so on; all of them can collect and transmit hundreds of MB of log data per second and upload it to a central Hadoop system.
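As an example of such a collection pipeline, the sketch below uses the Kafka Java producer client (shown with the current API rather than the early LinkedIn version) to push log lines into a topic, from which a downstream consumer or connector could batch them into HDFS. The broker address, topic name, and log line are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogShipper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each log line becomes one record; a separate consumer (or a
            // connector) would batch these records into HDFS for analysis.
            producer.send(new ProducerRecord<>("access-logs",
                    "web01", "GET /index.html 200 1234"));
        }
    }
}
```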
According to data volume, big data can be divided into three levels: the memory level, the BI level, and the massive level.
The memory level refers to data volumes that do not exceed the total memory of the cluster. Facebook caches up to 320 TB of data in memcached, and a current PC server can hold more than a hundred GB of memory. An in-memory database can therefore be used to keep hot data resident in memory, yielding very fast analysis; this is well suited to real-time analysis workloads. Figure 1 shows a practical and feasible MongoDB analysis architecture.
Figure 1 MongoDB architecture for real-time analysis
MongoDB clusters have some stability problems, which can cause periodic write blocking and master-slave synchronization failures, but MongoDB remains a promising NoSQL option for high-speed data analysis.
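As a concrete illustration of the kind of hot-data analysis described above, the sketch below runs a simple aggregation with the MongoDB Java driver; the connection string, database, collection, and field names are all hypothetical.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.Arrays;

public class HotDataReport {
    public static void main(String[] args) {
        // Hypothetical connection string and namespace.
        try (MongoClient client = MongoClients.create("mongodb://mongos1:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("analytics").getCollection("pageviews");

            // Count page views per page over the hot data kept in memory.
            for (Document doc : events.aggregate(Arrays.asList(
                    Aggregates.group("$page", Accumulators.sum("views", 1))))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```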
The BI level refers to data volumes that are too large for memory but can usually still be analyzed by traditional BI products and specially designed BI databases. Current mainstream BI products offer solutions supporting TB-scale data analysis; they come in a wide variety and are not enumerated here.
The massive level refers to data volumes for which databases and BI products are completely ineffective or cost-prohibitive. There are also many excellent enterprise-class products at this level, but given hardware and software costs, most Internet companies currently use Hadoop's HDFS distributed file system to store the data and MapReduce to analyze it.
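To show what such MapReduce analysis looks like in practice, here is the classic word-count job written against the standard Hadoop MapReduce API, as a minimal sketch; the input and output directories are hypothetical paths passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in a line of the HDFS input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job would be launched with something like `hadoop jar wordcount.jar WordCount /logs/input /logs/output`, where the two paths are hypothetical HDFS directories.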