The web crawler architecture, built on Nutch and Hadoop, is a typical distributed offline batch-processing architecture with excellent throughput and crawl performance, and it offers a large number of configuration and customization options. Because the crawler is only responsible for fetching network resources, a distributed search engine is needed to index and search the fetched resources in real time.

The search engine architecture, built on Elasticsearch, is a typical distributed online real-time interactive query architecture with no single point of failure, high scalability, and high availability. It can index and search large amounts of information in near real time, searching billions of documents and petabytes of data quickly, while providing a full range of options to customize almost every aspect of the engine. It exposes a RESTful API, so all of its functions, including search, analysis, and monitoring, can be invoked over HTTP with JSON. In addition, native client libraries are available for a variety of languages, including Java, PHP, Perl, Python, and Ruby. The crawler structures and extracts the crawled data and submits it to the search engine for indexing, so that it can be queried and analyzed.

Because the search engine is designed for near-real-time complex interactive queries, it does not store the original content of the indexed pages, so a near-real-time distributed database is needed to store the original page content. The distributed database architecture, built on HBase and Hadoop, is a typical distributed online real-time random read/write architecture. It has very strong horizontal scalability, supporting billions of rows and millions of columns; the data submitted by the web crawler can be written in real time, and, working together with the search engine, the original data can be retrieved in real time based on search results.

The web crawler, the distributed database, and the search engine all run on clusters built from commodity hardware. The clusters use a distributed architecture that scales to thousands of machines and has a fault-tolerance mechanism: the failure of some machine nodes causes neither data loss nor the failure of computing tasks. The system is not only highly available, failing over quickly when nodes fail, but also highly scalable: simply adding machines scales it horizontally, increasing storage capacity and computing speed.

The relationship between the web crawler, the distributed database, and the search engine is as follows:
1. After the crawler fetches and parses an HTML page, it puts the parsed data into a buffer queue. Two other threads process this data: one thread saves the data to the distributed database, and the other submits the data to the search engine for indexing (see the write-path sketch below).
2. The search engine processes the user's query and returns the search results to the user; if the user wants to view a snapshot of a web page, the original content of the page is read from the distributed database (see the read-path sketch below).

The overall architecture is shown below:
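To make point 1 concrete, here is a minimal sketch of the write path: the crawler acts as producer, and two consumer threads drain buffer queues, one toward HBase and one toward Elasticsearch. The class, queue sizes, and field names are assumptions for illustration, not code from the article; the actual storage calls are left as placeholders.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlPipeline {

    /** Minimal parsed-page record; the fields are illustrative. */
    static class ParsedPage {
        final String url, title, text, rawHtml;
        ParsedPage(String url, String title, String text, String rawHtml) {
            this.url = url; this.title = title; this.text = text; this.rawHtml = rawHtml;
        }
    }

    // One buffer per consumer so every page reaches both stores.
    private final BlockingQueue<ParsedPage> hbaseQueue = new LinkedBlockingQueue<>(10_000);
    private final BlockingQueue<ParsedPage> indexQueue = new LinkedBlockingQueue<>(10_000);

    /** Called by the crawler after a fetched HTML page has been parsed. */
    public void submit(ParsedPage page) throws InterruptedException {
        hbaseQueue.put(page); // blocks when the buffer is full, throttling the crawler
        indexQueue.put(page);
    }

    /** Starts the two consumer threads described in the article. */
    public void start() {
        startConsumer(hbaseQueue, this::saveToHbase, "hbase-writer");
        startConsumer(indexQueue, this::indexInElasticsearch, "es-indexer");
    }

    private void startConsumer(BlockingQueue<ParsedPage> queue, PageHandler handler, String name) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    handler.handle(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop quietly on shutdown
            }
        }, name);
        t.setDaemon(true);
        t.start();
    }

    interface PageHandler { void handle(ParsedPage page) throws InterruptedException; }

    private void saveToHbase(ParsedPage page) {
        // Put the raw HTML into the HBase "webpage" table, keyed by URL (omitted here).
    }

    private void indexInElasticsearch(ParsedPage page) {
        // Submit a JSON document with url/title/text to Elasticsearch (omitted here).
    }
}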
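Point 2 is the read path: the search engine answers the query, and the page snapshot comes from the distributed database. The sketch below assumes a local Elasticsearch node, an index and an HBase table both named "webpage", the page URL as the HBase row key, and a column family "p" with an "html" column; none of these names come from the article.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SearchAndSnapshot {

    /** Send a full-text query to Elasticsearch over HTTP and return the raw JSON response. */
    static String search(String keyword) throws Exception {
        String body = "{\"query\":{\"match\":{\"text\":\"" + keyword + "\"}}}";
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/webpage/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // hits contain the row keys
        }
    }

    /** Fetch the original HTML of a page from HBase, keyed by URL, for the snapshot view. */
    static String snapshot(String pageUrl) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webpage"))) {
            Result result = table.get(new Get(Bytes.toBytes(pageUrl)));
            return Bytes.toString(result.getValue(Bytes.toBytes("p"), Bytes.toBytes("html")));
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search("hadoop"));
        System.out.println(snapshot("http://example.com/"));
    }
}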
In terms of physical deployment, the crawler cluster, the distributed database cluster, and the search engine cluster can be deployed on the same hardware cluster or deployed separately, forming one to three hardware clusters. The crawler cluster has a dedicated crawler configuration management system responsible for configuring and managing the crawlers, as shown below:
The search engine achieves high performance, high scalability, and high availability through sharding (shard) and replication (replica). Sharding supports large-scale parallel indexing and searching, greatly improving indexing and search performance as well as the capacity for horizontal expansion; replication provides data redundancy, so that the failure of some machines does not affect normal use of the system, ensuring its continuous high availability. The structure of an index with 2 shards and 3 copies of each shard (one primary plus two replicas) is shown below:
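An index with this layout can be created through the RESTful API mentioned earlier by setting number_of_shards to 2 and number_of_replicas to 2 (two replicas plus the primary gives the 3 copies). The sketch below assumes a local node and an index named "webpage"; it is an illustration, not code from the article.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateIndex {
    public static void main(String[] args) throws Exception {
        // 2 shards, each with 2 replicas in addition to the primary.
        String settings = "{\"settings\":{\"number_of_shards\":2,\"number_of_replicas\":2}}";
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/webpage").openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(settings.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 200/201 on success
    }
}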
A complete index is split into two independent parts, shard 0 and shard 1, each of which has two replicas, the gray parts in the figure. In a production environment, as the volume of data grows, machine nodes can simply be added, and the search engine automatically redistributes the shards and their replicas to take advantage of the added hardware; when some nodes are retired, it automatically redistributes them to adapt to the reduced hardware. At the same time, the number of replicas can be changed at any time according to the reliability requirements and storage capacity of the hardware. All of this is dynamic and requires no cluster restart, which is an important guarantee of high availability.
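For example, the replica count of a live index can be raised without any restart through the index settings API. As before, the host and index name are assumptions, and the new value (3) is only an example.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UpdateReplicas {
    public static void main(String[] args) throws Exception {
        // Change only the replica count; the shard count of an existing index stays fixed.
        String body = "{\"index\":{\"number_of_replicas\":3}}";
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/webpage/_settings").openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}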
Web crawler and search engine based on Nutch+Hadoop+HBase+Elasticsearch