Generally, a single crawler has limited performance and cannot crawl a sufficient number of relevant webpages within a reasonable time. In practice, therefore, crawlers are usually designed as distributed systems: each crawler node crawls the websites close to it, and the results are then integrated and returned to the user. Applying distributed technology to web crawlers not only reduces operating costs but also greatly improves crawler performance. The rapid development of cloud computing today is, in turn, driving the development of distributed technology.
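As a rough illustration of how work can be divided among crawler nodes and the results merged for the user, here is a minimal Python sketch. It assigns URLs to nodes by hashing the host name, which is a simpler stand-in for the proximity-based assignment described above (measuring which node is closest to a site needs network data that is not available here). The function names `assign_node`, `partition_urls`, and `merge_results` are hypothetical and chosen only for this example.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def assign_node(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node index by hashing its host name.

    Assumption: hash-based assignment replaces the proximity-based
    assignment from the text. All pages of one site go to one node.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition_urls(urls, num_nodes):
    """Group URLs by the index of the node responsible for them."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return dict(buckets)


def merge_results(per_node_results):
    """Combine per-node {url: content} dictionaries into a single one."""
    merged = {}
    for result in per_node_results:
        merged.update(result)
    return merged
```

Hashing on the host rather than the full URL keeps every page of a site on the same node, so per-site politeness limits can be enforced locally without coordination between nodes.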
Distributed technology is essentially network-based computer processing technology. A distributed system is a set of processing units that are physically and logically interconnected. Its essence is decentralized control over system-wide resources so that applications can be executed collaboratively. Because such a system does not require any single computer to be very powerful, it can reduce costs. Distributed systems also offer fast access and support for multiple users: each computer in the system can conveniently and quickly access the information files of the other nodes, and the system can serve both the specific needs of local users and those of other users on the network, enabling communication and collaboration between different computers.
Cloud computing developed out of distributed processing, parallel processing, and grid computing, and blends the concepts of virtualization, utility computing, IaaS, PaaS, and SaaS. Its basic principle is to distribute computing tasks across a large number of distributed computers in the cloud and to store data in the cloud, so that enterprises can direct limited resources to the applications they need and thereby reduce operating costs. As a result, small and medium-sized enterprises do not need to purchase a dedicated computer system to meet the needs of an application. They only need to pay service fees to a cloud computing center to receive the corresponding services, while the cloud computing center operates a large-scale cloud that serves many users. In general, cloud computing has the following features: ultra-large-scale computing clusters, virtualization, high reliability, versatility, on-demand service, and low cost.
The development of search engines gave rise to the concept of cloud computing, and cloud computing and distributed technology have in turn profoundly influenced the development of search engines. Google, for example, took the lead in promoting the concept of cloud computing and has remained a leader in the field, which has helped it stay dominant in search for many years. Three core Google technologies form the foundation of its cloud computing services: GFS (Google File System), MapReduce (a distributed computing system), and Bigtable (a distributed storage system). GFS sits at the bottom layer and manages data storage. It splits large files into fixed-size chunks and stores each chunk on multiple servers (three replicas by default), which gives good fault tolerance. MapReduce is a distributed programming model developed by Google for parallel operations on large datasets (larger than 1 TB). Its essence is to divide massive data into small pieces, process them on different servers, and then combine the results and return them to the user. Bigtable is Google's service for storing and accessing semi-structured data. It is a structured distributed storage system built on top of GFS and usable as an input source and output target for MapReduce jobs, which lets Google make the most of existing resources and reduce operating costs while providing services.
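The divide, compute, and combine pattern of MapReduce described above can be sketched in a few lines of Python. This is only a single-machine illustration under stated assumptions: a thread pool stands in for the separate servers, word counting is used as the example job, and the function names (`split_input`, `map_task`, `shuffle`, `reduce_task`, `word_count`) are hypothetical rather than part of any Google or Hadoop API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Divide the input lines into num_splits pieces, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_task(lines):
    """Map phase: emit a (word, 1) pair for every word in this split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce phase: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines, num_splits=3):
    """Run the whole job. Threads stand in for servers (an assumption)."""
    splits = split_input(lines, num_splits)
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        mapped = list(pool.map(map_task, splits))
    groups = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["the crawler fetches pages", "the index stores pages"]))
```

In a real cluster the map and reduce tasks run on different machines and the shuffle moves data over the network, but the three phases keep the same shape as in this sketch.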
Google has not disclosed the detailed internal design of these three technologies, but based on its published papers, open-source counterparts have been implemented under Apache one by one: HDFS corresponds to GFS, HBase corresponds to Bigtable, and Hadoop is Apache's open-source distributed computing framework. Detailed technical documentation for these open-source projects is available on the Internet and can be used to learn them. Some well-known Internet companies in China have successfully applied these technologies and achieved good results.