Search Engine Principles (Basic Principles of Web Spiders) (2)

Source: Internet
Author: User
Tags: website, server

Web spider is a figurative name. If the Internet is compared to a spider's web, then the spider is the crawler that moves across it. A web spider finds pages through their link addresses: starting from one page of a website (usually the homepage), it reads the page's content, extracts the other link addresses it contains, and uses those links to reach the next pages, continuing until every page of the site has been crawled. If the whole Internet is regarded as one website, a web spider can use this principle to crawl all the pages on the Internet.

For a search engine, however, it is practically impossible to crawl every page on the Internet. According to the figures published so far, the largest search engines cover only about 40% of all web pages. One reason is the bottleneck of crawling technology: the crawler cannot traverse every page, because many pages cannot be reached from links on other pages. Another reason lies in storage and processing: if the average page is about 20 KB (including images), 10 billion pages amount to roughly 100 × 2,000 GB. Even if that much can be stored, downloading it is still a problem (at 20 KB per second per machine, it would take on the order of 340 machines downloading continuously for a whole year to fetch all the pages). At the same time, such a large volume of data also hurts efficiency when serving searches. Therefore, the web spiders of many search engines crawl only the important pages, and the main criterion used to judge a page's importance during crawling is its link depth.
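A quick back-of-the-envelope check of those figures (my own arithmetic, not part of the original text):

    10,000,000,000 pages × 20 KB = 200,000,000,000 KB ≈ 200,000 GB = 100 × 2,000 GB
    one machine at 20 KB/s downloads about 20 × 86,400 × 365 KB ≈ 630 GB in a year
    200,000 GB ÷ 630 GB per machine-year ≈ 320 machines running for a full year

which is the same order of magnitude as the roughly 340 machines mentioned above.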

When crawling pages, web spiders generally follow one of two strategies: breadth-first or depth-first (as shown in the accompanying figure). Breadth-first means the spider first crawls all the pages linked from the starting page, then picks one of those pages and crawls everything linked from it, and so on. This is the most commonly used method, because it lets the spider work in parallel and increases crawling speed. Depth-first means the spider follows one link from the starting page, traces that chain of links to its end, and only then moves on to the next starting link. The advantage of this method is that the spider is easier to design. The figure makes the difference between the two strategies clearer.

Because it is impossible to crawl every page, some web spiders set a maximum number of access layers for less important websites. For example, in the figure, A is the starting page and belongs to layer 0; B, C, D, E and F belong to layer 1; G and H belong to layer 2; and I belongs to layer 3. If the spider's access depth is set to 2, page I will never be visited. This is also why some of a website's pages can be found through a search engine while others cannot. For website designers, a flat site structure helps search engines crawl more of the site's pages.

Web spiders also run into encrypted data and page permissions when visiting websites: some pages can only be accessed with a membership account. A website owner can of course use the robots protocol to keep spiders out, but some sites that sell reports, for example, want search engines to be able to find their reports without letting searchers read them for free. Such sites need to give the spider a user name and password; the spider can then crawl those pages with the granted permissions and make them searchable, and when a searcher clicks through to view such a page, the searcher is likewise asked for the corresponding credentials.

2. Website and web spider

Web spiders need to crawl pages, which differs from ordinary browsing: poorly controlled crawling can put an excessive load on a website's server. Every web spider has its own name and identifies itself to the websites it crawls. When fetching a page, the spider sends a request whose User-Agent field identifies it: Google's spider identifies itself as Googlebot, Baidu's as BaiduSpider, and Yahoo's as Inktomi Slurp. If the website keeps access logs, the administrator can tell which search engines' spiders have visited, when they came, and how much data they read. If the administrator finds that a particular spider is causing problems, this identification makes it possible to contact its owner.

When a web spider visits a website, it generally first retrieves a special text file, robots.txt, which is usually placed in the root directory of the web server. The website administrator can use robots.txt to define which directories may not be accessed by any web spider, or which directories may not be accessed by certain spiders. For example, if the administrator does not want the site's executable directories and temporary directories to be indexed by search engines, those directories can be declared off limits. The robots.txt syntax is very simple.
For example, if no restrictions are placed on any directory, the file needs only the following two lines:

User-Agent: *
Disallow:
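By contrast, a robots.txt like the following (a sketch with made-up paths and spider name, purely for illustration) keeps every spider out of an executable directory and a temporary directory, and blocks one particular spider from the whole site:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-Agent: SomeBadSpider
Disallow: /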

Of course, robots.txt is only a convention. If a spider's designers choose not to follow it, the website administrator cannot rely on it to keep the spider away from certain pages; but ordinary web spiders do follow the convention, and the administrator can also use other means to refuse crawlers access to particular pages.
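As a crude sketch of how a spider might honour the convention (this is not from the original article; it assumes the Inet1 control introduced later in this text, and a deliberately simplified reading of robots.txt that ignores per-agent sections):

Function IsDisallowed(strHost, strPath)
    ' Fetch the site's robots.txt and check whether strPath starts with a disallowed prefix.
    Dim strRobots, arrLines, i, strRule
    Inet1.URL = strHost & "/robots.txt"
    strRobots = Inet1.OpenURL                      ' download the robots.txt file
    IsDisallowed = False
    arrLines = Split(strRobots, vbCrLf)
    For i = 0 To UBound(arrLines)
        If LCase(Left(Trim(arrLines(i)), 9)) = "disallow:" Then
            strRule = Trim(Mid(Trim(arrLines(i)), 10))
            If strRule <> "" And InStr(1, strPath, strRule) = 1 Then
                IsDisallowed = True                ' the requested path is covered by a Disallow rule
            End If
        End If
    Next
End Function

A real spider would also respect the User-Agent sections and cache the file instead of re-downloading it for every page.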

When a web spider downloads a page, it parses the page's HTML code. In the head of the code there can be a META tag aimed at robots. Through this tag, the site can tell the spider whether the page should be crawled and whether the links on the page should be followed. For example, a page can declare that it should not be indexed itself but that the links it contains should still be followed, as shown below.
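The standard Robots META syntax for that case (the concrete tag is supplied here as an illustration, since the article's own example did not survive) is:

<meta name="robots" content="noindex,follow">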

The syntax of robots.txt and of the Robots META tag is described in detail in the earlier article "Methods for Prohibiting Search Engines from Indexing".

At present, most websites want search engines to crawl their pages as completely as possible, because that lets more visitors find the site through search. To help its pages be crawled fully, the website administrator can create a site map, i.e. a Sitemap. Many web spiders treat a sitemap.htm file as the entry point for crawling a site: the administrator puts links to all of the site's pages into this file, so the spider can easily crawl the whole site, avoid missing pages, and reduce the load on the server. (Google provides website administrators with an XML Sitemap format.)
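As a rough sketch of what such a file can look like (the URLs below are placeholders, and this is the generic sitemaps.org XML format rather than anything reproduced from the original article):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
  <url>
    <loc>http://www.example.com/products/intro.htm</loc>
  </url>
</urlset>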

3. Web spider content extraction

Search engines build their indexes from text, so they process text files. The pages a web spider captures come in many formats: HTML, images, DOC, PDF, multimedia, dynamic pages and so on. After these files are fetched, the text information must be extracted from them. Extracting this information accurately matters both for the precision of the search engine and for the spider's ability to follow further links correctly.

For formats such as DOC and PDF, which are produced by software from professional vendors, the vendors supply text-extraction interfaces, and the spider can call these plug-ins to pull out the text and other relevant file information easily. HTML documents are different: HTML has its own syntax, in which various markup identifiers express fonts, colors, positions and other layout details, and these identifiers must be filtered out when the text is extracted. Filtering them is not difficult, because the identifiers follow definite rules and each can be handled according to its meaning. While stripping the markup, however, the spider also needs to record certain layout information, such as font size and whether text is a heading or shown in bold, since this helps in judging how important a word is within the page. In addition, an HTML page usually contains, besides the title and body, many advertisement links and links to shared channels that have nothing to do with the body text; these useless links must also be filtered out when the page content is extracted. For example, a site may have a "Product Introduction" channel whose link appears in the navigation bar of every page; if the navigation links are not filtered, a search for "Product Introduction" would return every page of the site, which would obviously produce a great deal of junk results. Filtering invalid links requires analyzing a large number of page-structure patterns, extracting their common features, and filtering them uniformly; important sites with unusual structures still need to be handled individually. This requires the design of the web spider to be extensible.

4. Program architecture of a web spider

Constructing a web spider with ASP

So how can a web spider be built with ASP? The answer is the Internet Transfer Control (ITC). This control, provided by Microsoft, lets an ASP program access Internet resources: with ITC you can fetch web pages, access FTP servers, and even send e-mail. Here we will concentrate on fetching web pages.

A few shortcomings must be described first. First, ASP has no permission to access the Windows registry, which makes the constants and default values that ITC normally stores there unavailable. Generally, you can work around this by telling ITC not to use the default values, which means you must specify a value explicitly for each operation.
Another, more serious, problem is licensing. ASP cannot invoke the License Manager (the Windows facility that ensures components and controls are used legitimately). When the License Manager checks a component's license key and compares it with the Windows registry, the component will not work if they differ. So if you deploy ITC to another computer that lacks the required key, ITC will fail. One solution is to wrap ITC in another VB component, which is packaged and deployed together with ITC and carries the required licensing information with it. This is troublesome, but unfortunately essential.

The following are some examples:

You can use the following code to create an ITC:

Set Inet1 = CreateObject("InetCtls.Inet")
Inet1.Protocol = 4              ' HTTP
Inet1.AccessType = 1            ' direct connection to the Internet
Inet1.RequestTimeout = 60       ' in seconds
Inet1.URL = strURL
strHTML = Inet1.OpenURL         ' grab the HTML page
strHTML now holds the HTML content of the entire page pointed to by strURL. To build a general-purpose web spider, you only need to call the InStr() function to check whether the string you are looking for appears in the page. You can also parse out the URLs in the href attributes, assign them to the URL property of the Internet control, and open the next page. The best way to visit all the links is to use recursion, as sketched below.
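The following is only a rough sketch of that recursive idea, not code from the article. It assumes the Inet1 object created above and an ASP context; the keyword and depth limit are arbitrary, and a real spider would also have to remember visited URLs and resolve relative links.

Sub CrawlPage(strURL, strKeyword, intDepth)
    ' Fetch a page, report whether it contains the keyword, then recurse into its links.
    Dim strHTML, intPos, intQuote, strLink
    If intDepth <= 0 Then Exit Sub
    Inet1.URL = strURL
    strHTML = Inet1.OpenURL                              ' grab the HTML of the page
    If InStr(1, strHTML, strKeyword, vbTextCompare) > 0 Then
        Response.Write strURL & " contains " & strKeyword & "<br>"
    End If
    intPos = InStr(1, strHTML, "href=""", vbTextCompare)
    Do While intPos > 0
        intQuote = InStr(intPos + 6, strHTML, """")      ' closing quote of the link
        If intQuote > intPos + 6 Then
            strLink = Mid(strHTML, intPos + 6, intQuote - intPos - 6)
            If LCase(Left(strLink, 7)) = "http://" Then
                CrawlPage strLink, strKeyword, intDepth - 1   ' follow the link recursively
            End If
        End If
        intPos = InStr(intPos + 6, strHTML, "href=""", vbTextCompare)
    Loop
End Sub

A call such as CrawlPage strURL, "product", 2 would then visit the starting page and the pages it links to, two levels deep.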

It should be noted that although this method is easy to implement, it is not very accurate or powerful. Today many search engines perform additional logical checks, such as counting how often a phrase is repeated on a page and how closely related the surrounding words are; some can even judge the relationship between the search phrase and its context.

Constructing a web spider with VB

The following table describes the hierarchy of websites handled by the spider program and illustrates how it works:
No.   Website                     Level   Parent No.
1     http://www.netfox.cn/       1       0
2     http://www.sina.com.cn/     2       1
3     http://www.cnnic.cn/        2       1
4     http://www.baidu.cn/        3       2
5     http://www.yahoo.cn/        3       2

The spider program starts from the level-1 site (http://www.netfox.cn/), extracts all of the website links it contains, records them in a database (or a large array, etc.), and marks those links as level 2. Once level 2 has been recorded, the spider takes the first of those sites in order (here, the site with serial number 2), extracts all of the links under it into the database, and marks them as level 3; it then does the same, in order, for every remaining level-2 site, likewise marking their links as level 3. When all of level 3 has been recorded, it starts extracting links from the first level-3 site in order, and so on. Note that the program should keep a pointer recording the serial number currently being processed, and a parent-number field can be added to record the inheritance relationships between sites.

Level 1 is the network seed. Here we place the seed on the first layer; one or more seeds can be set as needed. In fact, the level chart makes it clear that lower-level URLs act as seeds for the higher levels: as long as there are one or a few seeds, more can be found through their links, and only then can the spider run indefinitely. Level 2 consists of the links captured through level 1 (the seeds), level 3 of the links captured through level 2, and so on, forming one big tree.

Key code of the spider program: the core is implemented in VB below, and can easily be translated into other languages. For simplicity we do not touch a database here, but store the websites in a two-dimensional array.

Dim web()       ' array storing the websites: web(0,i) = serial number, web(1,i) = URL,
                ' web(2,i) = level, web(3,i) = parent number, web(4,i) = name
Dim pointer     ' pointer recording the seed currently being processed
Dim ID          ' serial number recording how many websites are in the capture area
Dim layer       ' level of the seed currently being processed
Dim running     ' flag indicating whether the spider is running

Private Function NetworkSeed_Set() As Boolean
    ' Sets the network seeds. For demonstration convenience the seeds are put in the array;
    ' of course, they can also be placed directly in a database as needed.
    ReDim web(4, 1000)                      ' reserve room for fields 0-4 and up to 1000 sites (size chosen arbitrarily)
    web(0, 0) = 1                           ' serial number
    web(1, 0) = "http://www.netfox.cn/"     ' website
    web(2, 0) = 1                           ' level
    web(3, 0) = 0                           ' parent number; 0 marks an original network seed
    web(4, 0) = "Netfox network"            ' name
    ' Of course, several original network seeds can be set here
    web(0, 1) = 2
    web(1, 1) = "http://www.aspfaq.cn/"
    web(2, 1) = 1
    web(3, 1) = 0
    web(4, 1) = "ASP technology station"
    ' After the seeds are set, record how many there are; two seeds are set here, so ID = 2
    ID = 2
End Function

Private Sub spider_work()
    ' The spider's working routine: crawls the current site and records its links into the array.
    ' They can be put into a database instead, as needed.
    Dim a
    For Each a In WebBrowser.Document.All
        If UCase(a.tagName) = "A" Then
            If isvalidweb(a.href) Then
                ID = ID + 1
                web(0, ID) = ID            ' record the serial number of the current website
                web(1, ID) = a.href        ' record the current website
                web(2, ID) = layer         ' record the current website's level
                If web(2, pointer) <> layer Then layer = layer + 1
                    ' when the pointer's level differs from the current level,
                    ' the level has increased
                web(3, ID) = pointer       ' record the current website's parent number
                web(4, ID) = a.innerText   ' record the name of the current website
            End If
        End If
    Next
    pointer = pointer + 1
    WebBrowser.Navigate web(1, pointer - 1)   ' after the current seed is captured, move on to the next seed
    If running = False Then                   ' if running is False, stop
        Exit Sub
    End If
End Sub

Private Function spider_init() As Boolean
    ' Spider initialization.
    pointer = 1     ' set the pointer to 1: start from the first serial number
    ID = 2          ' the serial number starts at 2 (the two seeds), so new records follow on
    layer = 0       ' the level starts at 0, indicating the spider's first run
    ' The pointer, serial number and level above can be saved so that a crawl can be resumed later.
    If isvalidweb(web(1, pointer - 1)) Then
        ' check whether the seed is valid; if so, initialization succeeds, otherwise it fails
        running = True
        spider_init = True
        WebBrowser.Navigate web(1, pointer - 1)
    Else
        running = False
        spider_init = False
        Exit Function
    End If
End Function

Private Sub WebBrowser_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Call spider_work()
End Sub

Private Function isvalidweb(href) As Boolean
    ' Decides whether the link is a .cn domain name.
    ' This function can be adapted to capture specific websites or data.
    If InStr(href, "http://www.") > 0 And InStr(href, ".cn/") > 0 And Len(href) < 60 Then
        isvalidweb = True
    Else
        isvalidweb = False
    End If
End Function

Private Sub initcommand_Click()
    ' Init command button
    If spider_init() Then
        MsgBox "The spider initialized successfully and starts running"
    Else
        MsgBox "Spider initialization failed"
    End If
End Sub

Private Sub stopcommand_Click()
    ' Stop command button
    running = False       ' stop running
End Sub

Private Sub runcommand_Click()
    ' Run command button
    running = True        ' continue running
    Call spider_work()    ' the spider's main routine
End Sub

A specific web spider

A specific web spider is comparatively more complicated. As mentioned earlier, a specific spider searches for a particular part of a page, so the relevant information must be known in advance. Let's look at the following HTML:

<html>
<head>
<title>My News Page</title>
<meta name="keywords" content="news, headlines">
<meta name="description" content="The current news headlines.">
</head>
<body>
<!-- Put headlines here -->
... the headline text sits between these two comments ...
<!-- End headlines -->
</body>
</html>

Suppose the part we want is the block between the two comment markers. The following GetText function returns the text lying between a given start tag and end tag:

Function GetText(strText, strStartTag, strEndTag)
    Dim intStart, intEnd
    intStart = InStr(1, strText, strStartTag, vbTextCompare)
    If intStart Then
        intStart = intStart + Len(strStartTag)
        intEnd = InStr(intStart + 1, strText, strEndTag, vbTextCompare)
        GetText = Mid(strText, intStart, intEnd - intStart)
    Else
        GetText = ""
    End If
End Function

Following the earlier example of building the ITC control, you can simply pass "<!-- Put headlines here -->" and "<!-- End headlines -->" as the start and end tag parameters to GetText.
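Putting the pieces together, a call might look like this (the URL is hypothetical and only for illustration):

strURL = "http://www.mynews.example/headlines.asp"
Inet1.URL = strURL
strHTML = Inet1.OpenURL
strHeadlines = GetText(strHTML, "<!-- Put headlines here -->", "<!-- End headlines -->")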

Note that the start and end tags do not have to be actual HTML tags at all; they can be any text delimiters you care to use. In practice it is often hard to find clean HTML markup that neatly delimits the area you want to search, and you may have to make do with whatever markup happens to surround it. For example, your start and end tags might end up looking like this:

strStartTag = "</TD><font face=""Arial"" size=""2""><p><b><u>"
strEndTag = "<p></TD></TR><TD>"

You must make sure that what you search for is unique within the page, so that you extract exactly what you need. You can also search for further links inside the text you get back; but if you do not know the format of those pages in advance, your spider may not bring back anything useful.

Author: singularity, winter Author: Fenglin
Originally published at: Search Engine Optimization SEO Blog
Copyright notice: when reprinting, please credit the author and the original source, and include this statement, in the form of links.
Link: http://blog.5ixb.com/seo/search-engine-spider.html
