Search Engine Principles (Basic Principles of Web Spiders)

Abstract: High-performance network robots (web spiders) are the core of the new generation of intelligent Web search engines, and their efficiency directly determines search engine performance. This article analyzes in detail the key technologies and algorithms involved in developing a high-performance network robot, and finally presents the key program classes, as an aid to practical project development.

Keywords: Web; search engine; network robot; Java

1 Research Significance of High-Performance Network Robots

Web search engine technology is a hot and difficult topic in today's network information processing field. The Web can be seen as a huge distributed database whose content grows rapidly. Manually searching and classifying the entire Web would be an enormous undertaking, so search engine technology must rely on network robots to accomplish this task. The network robot studied here is exactly such a specialized program: one that efficiently scans Web sites and retrieves their content. Network robots are a key part of Web search engine technology; a typical search engine consists of five parts: the network robot, the indexer, the searcher, the user interface, and the Internet itself. Simply put, the network robot's function is to automatically find and collect web pages on the Web. As users' requirements keep rising, however, keyword-based search engines can no longer satisfy the demand for more accurate results over a wider search range. The new generation of intelligent search engines therefore requires network robots with higher performance, able to refresh pages more frequently and cover more of the Web. Research on high-performance network robots thus has direct practical significance and important academic value for the development of search engines.

2 Java Socket Programming

The network is a world of clients and servers: almost every program on the network takes part in a conversation between a client process and a server process, and the network robot studied here is the client side of such a client/server program for browsing the Internet. Mention of the Internet naturally brings the Web to mind. The Web is built on HTTP, HTTP is in turn built on TCP/IP, and TCP/IP connections are made through the socket interface; in essence, then, using the Internet means connecting to a TCP/IP network through sockets. Socket programming in Java is very simple. Java defines two classes, Socket and ServerSocket, which are the important classes for network programming in Java. A program written to play the server role uses the ServerSocket class; a program that connects to a server plays the client role and uses the Socket class. The network robot studied here plays the client role (a minimal client sketch is given below, after the introduction to Section 3).

3 Research on Key Technologies

A network robot's workload is unusually heavy and seemingly endless: while the robot is visiting one page it is already collecting the pages to visit next, and after it finishes one site there are always more sites waiting in the queue, so its workload grows explosively. Improving the efficiency of the network robot is therefore very important for a large intelligent search engine. The following technologies are indispensable for developing a high-performance network robot.
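Returning to the Socket class from Section 2, the following is a minimal sketch of a client that fetches one page over a raw TCP connection. The host name is a placeholder, and a real robot would add timeouts and error handling.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class SimpleClient {
        public static void main(String[] args) throws Exception {
            String host = "example.com";    // placeholder host
            try (Socket socket = new Socket(host, 80);
                 PrintWriter out = new PrintWriter(socket.getOutputStream());
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {
                // A minimal HTTP/1.0 request; the server closes the connection when done
                out.print("GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n");
                out.flush();
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // raw response, headers included
                }
            }
        }
    }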
3.1 Multithreading Technology

Mastering multithreaded programming is genuinely difficult for a programmer, and deciding when multithreading is needed, and how to divide the work among threads, is harder still. Multithreading is the ability of an application to run more than one task at a time. The threads run within a single application and share the same memory space, so all threads of a process can easily share global data and resources. A network robot needs to download dozens or even hundreds of web pages, and doing this with a single thread is very inefficient: the program's bottleneck is the time spent waiting for the server to respond after each download request is sent. With a single thread these waits occur one after another, so the total wait is the sum of the waits for every page request. A network robot must therefore use multithreading, which lets the waits for hundreds of pages overlap: a large number of threads allows the robot to wait on many pages at the same time instead of handling them one by one, as the sketch below illustrates.
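A minimal sketch of this idea using a fixed thread pool; the URL list and pool size are illustrative choices, not taken from the article.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelFetcher {
        public static void main(String[] args) {
            // Illustrative seed list; a real robot takes URLs from its job manager
            List<String> urls = List.of("http://example.com/", "http://example.org/");
            ExecutorService pool = Executors.newFixedThreadPool(8); // 8 downloads wait in parallel
            for (String u : urls) {
                pool.submit(() -> {
                    try (InputStream in = new URL(u).openStream()) {
                        byte[] body = in.readAllBytes();    // the network wait happens here
                        System.out.println(u + ": " + body.length + " bytes");
                    } catch (IOException e) {
                        System.err.println(u + ": " + e);   // would go to the error queue
                    }
                });
            }
            pool.shutdown();    // stop accepting jobs; worker threads finish and exit
        }
    }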

3.2 Database Technology

The network robot must track every URL (Uniform Resource Locator) it encounters. Managing this URL list is the robot's job management, which matters greatly for efficiency because the robot must track data for thousands of visited pages. Two approaches are common: memory-based queue management and SQL (Structured Query Language) database queue management. If a robot uses in-memory structures to store and manage the URL lists of large websites, it becomes slow and consumes ever more computer resources, and in the end its efficiency drops sharply. The page lists of large Web sites must therefore be managed and maintained with an SQL-based database queue. Using a database management system (DBMS) to manage large page lists greatly reduces memory usage and improves the robot's running efficiency.

3.3 Database Access Technology

A network robot that adopts SQL-based queue management needs a corresponding database access technology. Java provides JDBC (Java Database Connectivity) classes for accessing a DBMS; the purpose of JDBC is to let a program send SQL statements to the database and specify the data to be returned. Java supports four types of database drivers through which JDBC can access a database: the JDBC-ODBC bridge, partly-Java drivers on top of native database libraries, drivers that go through an intermediate data access server, and pure Java drivers. By effectively combining multithreading, database queues, and JDBC, we can build a high-performance network robot; a JDBC sketch of such a queue follows below.
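A minimal JDBC sketch of an SQL-backed URL queue, mirroring the RobotSQLWorkload methods described later in Section 5. The table schema workload(url VARCHAR PRIMARY KEY, status CHAR(1)), the status codes ('W' waiting, 'R' running, 'C' complete, 'E' error), and the JDBC URL are assumptions, not from the article.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SqlWorkload {
        private final Connection con;

        public SqlWorkload(String jdbcUrl) throws Exception {
            con = DriverManager.getConnection(jdbcUrl);
        }

        // Add a newly discovered URL to the waiting queue.
        public synchronized void addWorkload(String url) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO workload (url, status) VALUES (?, 'W')")) {
                ps.setString(1, url);
                ps.executeUpdate();
            }
        }

        // Take one waiting URL and move it to the processing queue.
        public synchronized String assignWorkload() throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT url FROM workload WHERE status = 'W'");
                 ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;    // waiting queue is empty
                String url = rs.getString(1);
                setStatus(url, 'R');
                return url;
            }
        }

        // Move a processed URL to the completed or error queue.
        public synchronized void completeWorkload(String url, boolean error) throws Exception {
            setStatus(url, error ? 'E' : 'C');
        }

        private void setStatus(String url, char status) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE workload SET status = ? WHERE url = ?")) {
                ps.setString(1, String.valueOf(status));
                ps.setString(2, url);
                ps.executeUpdate();
            }
        }
    }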

4 Design Ideas and Algorithm Analysis

4.1 Webpage Link Types

As a web robot traverses the Internet, it must keep finding the next page to visit. To do this, it must be able to find the links stored in each page it accesses: it parses the page's HTML to find every tag that links to another page, based on the tag's href (hypertext reference) attribute value. A network robot can encounter three types of links: internal links, external links, and other links. An internal link is a hyperlink whose target page lives on the same web server as the page containing the link; an external link points to a page on a different server from the page containing it; other links point to non-page resources, for example an e-mail address. (A Java sketch of this classification is given after the pseudocode below.)

4.2 Program Design Ideas

There are two ways to design a network robot: as a recursive program or as a non-recursive program. The recursive design is conceptually clear and simple, but it has two main problems. First, if the program runs long enough, the recursion stack grows very large and may exhaust memory, terminating the program. Second, recursion and multithreading do not combine well. A high-performance network robot therefore cannot be built on a recursive design; the robot studied here uses a non-recursive one. In the non-recursive approach, the robot starts from a given set of pages to visit, which it adds to the queue of sites it will access. When the robot discovers a new page, it adds the newly found link to the queue instead of calling its own method. After finishing the current page, it takes the next page to process from the queue. In practice the robot uses four queues, each holding URLs in the same processing state:

Waiting queue: URLs waiting to be processed by the robot; newly discovered URLs are added here.
Processing queue: a URL is moved here when the robot starts processing it; once processed, it moves to the error queue or the completed queue.
Error queue: if an error occurs while processing a page, its URL is added here; URLs in the error queue are not processed further.
Completed queue: if a page downloads without error, its URL is added here; URLs in the completed queue are never moved to another queue.

(Figure: URL processing status flow.)

4.3 Algorithm Analysis

Our algorithm design follows the non-recursive idea. The robot starts running when a URL is added to the waiting queue, and it continues working as long as some queue holds a page or the robot is processing one. When the waiting queue is empty and no page is being processed, the robot stops. The basic algorithm is as follows:
    Initialize URLs;                                        // seed the robot with an initial URL set
    Queue enum {WaitQ, RunQ, FinishQ, MistakeQ};            // queue types: waiting, processing, completed, error
    FileText;                                               // storage for downloaded page text
    LinkType enum {InternalLink, ExternalLink, OtherLink};  // link types: internal, external, other
    begin
      for URL in URLs do
        Enqueue(URL, WaitQ);                  // the initial URL set enters the waiting queue
      while WaitQ is not empty do             // while the waiting queue holds a URL
      begin
        Enqueue(Dequeue(WaitQ), RunQ);        // move one URL from the waiting queue to the processing queue
        while RunQ is not empty do            // drain the processing queue
        begin
          CurURL = Dequeue(RunQ);
          Document = Download(CurURL);        // download the page for this URL
          SaveFileText(Document, FileText);   // save the page text
          for URL in Extract(Document) do     // find new links in the downloaded page
            if URL is not in FinishQ then     // skip URLs that are already completed
              if LinkTypeOf(URL) = ExternalLink then
                Enqueue(URL, WaitQ)           // external links join the waiting queue
              else if LinkTypeOf(URL) = InternalLink then
                Enqueue(URL, RunQ);           // internal links join the processing queue
          if download failed then
            Enqueue(CurURL, MistakeQ)         // failed pages go to the error queue
          else
            Enqueue(CurURL, FinishQ);         // finished pages go to the completed queue
        end;
      end;
    end;
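To make the link classification from Section 4.1 concrete, here is a hypothetical Java helper built on java.net.URI. The class and method names are illustrative, not from the article.

    import java.net.URI;

    // Hypothetical helper: classify a discovered link relative to the page it was found on.
    public class LinkClassifier {
        public enum LinkType { INTERNAL, EXTERNAL, OTHER }

        public static LinkType classify(String pageUrl, String href) {
            URI base = URI.create(pageUrl);
            URI target = base.resolve(href);    // resolve relative hrefs against the page
            String scheme = target.getScheme();
            if (scheme == null || !(scheme.equals("http") || scheme.equals("https"))) {
                return LinkType.OTHER;          // e.g. a mailto: link is an "other" link
            }
            // Same server as the containing page -> internal; otherwise external.
            return base.getHost().equalsIgnoreCase(target.getHost())
                    ? LinkType.INTERNAL
                    : LinkType.EXTERNAL;
        }

        public static void main(String[] args) {
            String page = "http://example.com/index.html";
            System.out.println(classify(page, "/about.html"));             // INTERNAL
            System.out.println(classify(page, "http://other.org/"));       // EXTERNAL
            System.out.println(classify(page, "mailto:user@example.com")); // OTHER
        }
    }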

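As a complement to the pseudocode, a compact runnable Java rendering of the non-recursive loop might look as follows, reusing the hypothetical LinkClassifier above; the download and link-extraction stubs are placeholders for the real work.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    public class NonRecursiveCrawler {
        private final Queue<String> waitQ = new ArrayDeque<>();  // waiting queue
        private final Queue<String> runQ  = new ArrayDeque<>();  // processing queue
        private final Set<String> finishQ  = new HashSet<>();    // completed queue
        private final Set<String> mistakeQ = new HashSet<>();    // error queue

        public void crawl(Iterable<String> seeds) {
            for (String url : seeds) waitQ.add(url);             // seed the waiting queue
            while (!waitQ.isEmpty()) {
                runQ.add(waitQ.remove());                        // waiting -> processing
                while (!runQ.isEmpty()) {
                    String cur = runQ.remove();
                    try {
                        String document = download(cur);         // stub: fetch the page
                        for (String link : extractLinks(document)) {
                            if (finishQ.contains(link)) continue;             // already completed
                            LinkClassifier.LinkType t = LinkClassifier.classify(cur, link);
                            if (t == LinkClassifier.LinkType.OTHER) continue; // e.g. mailto:
                            if (t == LinkClassifier.LinkType.EXTERNAL) {
                                waitQ.add(link);                 // external -> waiting queue
                            } else {
                                runQ.add(link);                  // internal -> processing queue
                            }
                        }
                        finishQ.add(cur);                        // processing -> completed
                    } catch (Exception e) {
                        mistakeQ.add(cur);                       // processing -> error
                    }
                }
            }
        }

        private String download(String url) throws Exception { return ""; }  // placeholder
        private Iterable<String> extractLinks(String doc) {                   // placeholder
            return java.util.List.of();
        }
    }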
5 Program Implementation

The network robot is written in Java. Java is an object-oriented language, so the main functions of each module are encapsulated in relatively independent classes, which are connected through interface methods to form a complete system. This structure makes it easy to introduce new methods to refine the system's functions, and to create new classes to extend them. Several key classes that implement the system are described below.

The Robot class. The network robot is implemented mainly through the Robot class. It contains many interface methods that control the robot's operation and that organize and manage the lists of visited and to-be-visited sites. Its main methods are:

    synchronized public void addWorkload(String url);          // adds a job to the job manager
    synchronized public void getWorkload(String url);          // obtains a job from the job manager
    synchronized public boolean foundInternalLink(String url); // discovers and processes internal links
    synchronized public boolean foundExternalLink(String url); // discovers and processes external links
    synchronized public boolean foundOtherLink(String url);    // discovers and processes other links
    synchronized public void processPage(HTTP page);           // processes a web page; the robot's actual work
    synchronized public void robotComplete();                  // called when the robot has no work left
    public void setMaxBody(int mx);                            // sets the maximum body size to download
    public int getMaxBody();                                   // returns the maximum body size to download
    public void run();                                         // starts the robot
    public void halt();                                        // stops the robot

The RobotSQLWorkload class. This class is the robot's job manager; it stores jobs in an SQL database, which lets the job manager handle large sites, and it is central to achieving high performance. Its main methods are:

    synchronized public String assignWorkload();               // takes a URL from the waiting queue and moves it to the processing queue
    synchronized public void addWorkload(String url);          // sends a new URL to the waiting queue
    synchronized public void completeWorkload(String url, boolean error); // moves the URL to the completed queue or the error queue
    protected void setStatus(String url, char status);         // sets the URL status: waiting, running, complete, or error
    synchronized public char getURLStatus(String url);         // returns the URL's status
    synchronized public void clear();                          // clears the job manager's storage

The RobotWorker class. A high-performance network robot must be multithreaded. To split the work into many small tasks, there must be a way to distribute tasks among threads; the basic unit of work is a RobotWorker object. Its main methods are (a sketch of such a worker loop follows below):

    public boolean isBusy();             // returns whether this worker thread is busy or idle
    public void run();                   // when idle, waits for the job manager to assign a job, then marks the thread busy
    protected void processWorkload();    // processes a job from the job manager
    public HTTP getHTTP();               // returns the HTTP object used by this worker
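A minimal sketch of how such a worker might be written, reusing the SqlWorkload sketch from Section 3.3; the bodies here are illustrative assumptions, not the article's actual implementation.

    // Illustrative worker: repeatedly asks the job manager for a URL, downloads it,
    // and reports success or failure. Uses the SqlWorkload sketch from Section 3.3;
    // a real worker would also extract new links and hand them back to the manager.
    public class RobotWorker implements Runnable {
        private final SqlWorkload workload;
        private volatile boolean busy = false;

        public RobotWorker(SqlWorkload workload) { this.workload = workload; }

        public boolean isBusy() { return busy; }

        @Override
        public void run() {
            while (true) {
                String url;
                try {
                    url = workload.assignWorkload();    // waiting -> processing
                } catch (Exception e) {
                    break;                              // job manager unavailable
                }
                if (url == null) break;                 // waiting queue is empty
                busy = true;
                boolean error = false;
                try (java.io.InputStream in = new java.net.URL(url).openStream()) {
                    in.readAllBytes();                  // download the page body
                } catch (Exception e) {
                    error = true;                       // this page failed
                }
                try {
                    workload.completeWorkload(url, error);  // -> completed or error queue
                } catch (Exception ignored) { }
                busy = false;
            }
        }
    }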
6 Summary

Developing high-performance network robots plays a vital role in improving the overall performance of Web search engines, and it is an inevitable requirement of researching and developing the new generation of intelligent search engines. This article has studied the key technologies, programming ideas, and algorithms involved in developing a high-performance network robot, and analyzed in detail several key classes that implement the program's functions. These results are of reference value for developing intelligent Web search engines with independent intellectual property rights. Building more intelligent network robots on top of a concept dictionary, to further improve recall, is one of our main future research directions.
