The web crawler architecture is a typical distributed offline batch-processing architecture built on Nutch + Hadoop. It offers excellent throughput and crawling performance and provides a large number of configuration and customization options. Because the web crawler only fetches network resources, a distributed search engine is needed to index and search the crawled resources in near real time.
The search engine architecture is built on Elasticsearch, a typical distributed online real-time interactive query architecture with no single point of failure, high scalability, and high availability. It can index and search large amounts of information in near real time, quickly searching billions of documents and petabytes of data, and it exposes comprehensive options that let you customize almost every aspect of the engine. It supports a RESTful API: you can call its functions over HTTP with JSON, including search, analytics, and monitoring. In addition, it provides native client libraries for Java, PHP, Perl, Python, Ruby, and other languages.
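As a minimal sketch of the JSON-over-HTTP interface mentioned above, the following Java snippet runs a full-text query against a hypothetical "webpage" index on a local Elasticsearch node; the index name and field name are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Search a hypothetical "webpage" index using only the RESTful JSON API.
public class SearchExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/webpage/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Query body: match documents whose "content" field contains the keyword.
        String query = "{ \"query\": { \"match\": { \"content\": \"hadoop\" } } }";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(query.getBytes(StandardCharsets.UTF_8));
        }

        // Print the JSON response (hits, scores, and so on).
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```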
The web crawler extracts structured data from the crawled pages and submits it to the search engine for indexing, so that it can be queried and analyzed. Because the search engine is designed for near-real-time complex interactive queries, it does not store the original content of the indexed web pages. Therefore, a near-real-time distributed database is needed to store the original web page content.
The distributed database architecture is built on HBase + Hadoop and is a typical distributed online real-time random read/write architecture. It is highly horizontally scalable, supports billions of rows and millions of columns, can write data submitted by the web crawler in real time, and can work with the search engine to fetch data in real time based on search results.
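The sketch below shows how such real-time writes and reads might look with the standard HBase Java client. It assumes a hypothetical "webpage" table with a "p" column family keyed by URL; these names are illustrative, not part of the original design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// The crawler writes the raw HTML keyed by URL; the snapshot service
// reads it back by the same row key returned from a search hit.
public class WebpageStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webpage"))) {

            // Write the original page content (crawler side).
            Put put = new Put(Bytes.toBytes("http://example.com/"));
            put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("content"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read it back (web page snapshot side).
            Get get = new Get(Bytes.toBytes("http://example.com/"));
            Result result = table.get(get);
            byte[] html = result.getValue(Bytes.toBytes("p"), Bytes.toBytes("content"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```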
The web crawler, distributed database, and search engine all run on clusters built from commodity hardware. The clusters adopt a distributed architecture that can scale out to thousands of machines and include fault-tolerance mechanisms, so the failure of some machine nodes does not cause data loss or computing task failure. Besides high availability, with fast failover when a node fails, the clusters offer high scalability: simply adding machines yields horizontal linear scaling, increasing both data storage capacity and computing speed.
Relationship between web crawlers, distributed databases, and search engines:
1. After the web crawler parses a fetched HTML page, it adds the parsed data to a buffer queue, and two other threads process that data: one thread stores the data in the distributed database, and the other submits it to the search engine for indexing (see the sketch after this list).
2. The search engine processes the user's search conditions and returns the results to the user. If the user views a web page snapshot, the original page content is fetched from the distributed database.
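A minimal sketch of the hand-off in step 1 follows, assuming a simple ParsedPage value object and hypothetical storeToHBase / indexToElasticsearch helpers. Because every page must be both stored and indexed, the dispatcher keeps one buffer queue per consumer thread rather than sharing a single queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Parser thread enqueues parsed pages; one consumer thread writes them to
// the distributed database, another submits them to the search engine.
public class PageDispatcher {

    static class ParsedPage {
        final String url;
        final String html;
        ParsedPage(String url, String html) { this.url = url; this.html = html; }
    }

    private final BlockingQueue<ParsedPage> dbQueue = new LinkedBlockingQueue<>(10000);
    private final BlockingQueue<ParsedPage> indexQueue = new LinkedBlockingQueue<>(10000);

    public void start() {
        // Consumer 1: store the original page content in the distributed database.
        new Thread(() -> drain(dbQueue, this::storeToHBase), "hbase-writer").start();
        // Consumer 2: submit the parsed data to the search engine for indexing.
        new Thread(() -> drain(indexQueue, this::indexToElasticsearch), "es-indexer").start();
    }

    // Called by the crawler after parsing a fetched HTML page.
    public void submit(ParsedPage page) throws InterruptedException {
        dbQueue.put(page);
        indexQueue.put(page);
    }

    private void drain(BlockingQueue<ParsedPage> queue,
                       java.util.function.Consumer<ParsedPage> handler) {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                handler.accept(queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void storeToHBase(ParsedPage page) { /* write page.html keyed by page.url */ }
    private void indexToElasticsearch(ParsedPage page) { /* index the parsed fields */ }
}
```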
The following figure shows the overall architecture:
The crawler cluster, distributed database cluster, and search engine cluster can be deployed on the same hardware cluster or separately, forming one to three hardware clusters.
The web crawler cluster has a dedicated configuration management system for configuring and managing the crawlers, as shown in the following figure:
The search index's shards and replicas provide high performance, high scalability, and high availability. Sharding supports large-scale parallel indexing and search, greatly improving indexing and search performance as well as horizontal scalability. Replication provides redundancy for the data, so the failure of some machines does not affect normal use of the system, ensuring continuously high availability.
The structure of an index with two shards and three copies of each shard is as follows:
A complete index is split into two independent shards, shard 0 and shard 1. Each shard additionally has two replica copies, shown in gray below.
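Reading the layout above as one primary plus two replica copies per shard, the index could be created as sketched below with Elasticsearch's index-creation API; the "webpage" index name is an assumption for illustration.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Create a hypothetical "webpage" index with two primary shards and
// two replica copies per shard (three copies of each shard in total).
public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/webpage");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        String settings =
            "{ \"settings\": { \"number_of_shards\": 2, \"number_of_replicas\": 2 } }";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(settings.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```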
In the production environment, as the data size grows, you only need to add hardware nodes: the search engine automatically rebalances shards onto the new nodes. When nodes are retired, it automatically re-allocates their shards to the remaining nodes. The number of replicas can be changed at any time according to the hardware reliability level and storage capacity. All of this happens dynamically, without restarting the cluster, which is an important guarantee of high availability.
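For example, the replica count of the hypothetical "webpage" index could be raised from 2 to 3 at runtime through the index _settings endpoint, with no node or cluster restart; this sketch assumes the index created above.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Dynamically increase the number of replica copies of an existing index.
public class UpdateReplicasExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/webpage/_settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        String body = "{ \"index\": { \"number_of_replicas\": 3 } }";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```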
Web crawler and search engine based on Nutch + Hadoop + HBase + Elasticsearch