Design and Analysis of Web Crawlers in Search Engines

The following describes how to build a web crawler for a search engine and some basic things to watch out for. A web crawler is similar to the offline reading tools you may have used: it still has to connect to the network, otherwise how could it capture anything? So where are the differences?

1) A web crawler is highly configurable.
2) A web crawler can parse the links on a web page.
3) A web crawler can get by with a simple storage configuration.
4) A web crawler can intelligently analyze how often web pages are updated.
5) A web crawler is quite efficient.

In fact, these features are exactly the requirements. So how do you design a crawler, and which steps deserve attention?

1) URL traversal and recording. Larbin does this very well. URL extraction itself is actually very simple, for example:

    cat [what you got] | tr '"' '\n' | gawk '{print $2}' | pcregrep '^http://'

gives you a list of URLs.

2) Multi-process vs. multithreading. Each has its own advantages. These days an ordinary PC, such as the one running booso.com, can easily crawl 5 GB of data in a day, about 200,000 web pages. (A worker-pool sketch follows this list.)

3) Controlling update times. The silliest approach is to have no time-based update weighting at all: crawl everything in one pass, then turn around and crawl it all again. Usually the data fetched this time is compared with the previous copy. If a page has not changed for five consecutive crawls, the interval before it is crawled again is doubled; if a page has changed on each of five consecutive crawls, the interval is cut to 1/2 of the original. Note that efficiency is one of the keys to winning. (A sketch of this rule follows this list.)

4) How deep should you crawl? It depends on your situation. If you have tens of thousands of servers acting as crawlers, I suggest you skip this step. If, like me, you have only one server, you should know the following:

    Page depth    Page count    Page importance
    0             1             10
    1             20            8
    2             600           5
    3             2,000         2
    4 and above   6,000         generally too low to measure

Crawling to level 3 is about enough. Going deeper, first, expands the data volume three to four times; second, the importance really does drop a lot. This is called "sowing dragon seeds and harvesting fleas." (A depth-limited traversal sketch follows this list.)

5) A crawler generally does not fetch the other side's pages directly; it goes out through a proxy, and the proxy helps relieve the load. When the other side's page has not been updated, getting the headers is enough and there is no need to transfer the whole page again, which saves a great deal of network bandwidth. The Apache web server records this as a 304 response, and such responses are generally served from cache. (A conditional-GET sketch follows this list.)

6) When you have a spare moment, please take good care of robots.txt. (A sketch follows this list.)

7) Storage structure. Here everyone has their own wisdom: Google uses its GFS system; if you have 7 or 8 servers I advise NFS; if you have 70 or 80 servers I suggest AFS; and if you have only one server, do as you please. The following snippet is how the news search engine I wrote stores data (the Perl one-liner hex-encodes every character that is not a word character, "-", "." or "@", so a URL becomes a safe file name):

    name=`echo $url | perl -p -e 's/([^\w\-\.\@])/$1 eq "\n" ? "\n" : sprintf("%2.2x",ord($1))/eg'`
    mkdir -p $author
    newscrawl.pl $url --user-agent="news.booso.com+(+http://booso.com)" --outfile=$author/$name
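To illustrate step 2, here is a minimal multithreaded fetcher sketch in Perl (the article itself does not include one). It assumes the core threads and Thread::Queue modules plus LWP::UserAgent from libwww-perl; the url.list file name and the worker count are assumptions made only for illustration, the file standing in for whatever the step-1 pipeline produced.

    #!/usr/bin/perl
    # Minimal worker-pool fetcher: one shared queue of URLs, a few threads,
    # each thread with its own LWP::UserAgent.
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;

    my $workers = 8;                               # arbitrary worker count
    my $queue   = Thread::Queue->new();

    # one URL per line, e.g. the output of the pipeline from step 1
    open my $fh, '<', 'url.list' or die "cannot open url.list: $!";
    chomp(my @urls = <$fh>);
    close $fh;

    $queue->enqueue(@urls);
    $queue->enqueue(undef) for 1 .. $workers;      # one stop marker per worker

    my @threads = map {
        threads->create(sub {
            my $ua = LWP::UserAgent->new(
                agent   => 'news.booso.com (+http://booso.com)',
                timeout => 30,
            );
            while (defined(my $url = $queue->dequeue)) {
                my $res = $ua->get($url);
                if ($res->is_success) {
                    print "ok   $url\n";
                } else {
                    print "fail $url: ", $res->status_line, "\n";
                }
            }
        });
    } 1 .. $workers;

    $_->join for @threads;

A multi-process design would replace the threads with forked workers sharing a job file or database; which is better depends on the machine and on how much per-worker state you need.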
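The doubling/halving rule from step 3 could be sketched as follows; this is not code from the article. It compares an MD5 hash of the freshly fetched content (assumed to be raw bytes) with the hash saved from the previous crawl, and the one-day starting interval and the in-memory %state table are assumptions made purely for illustration.

    # Recrawl scheduling sketch: unchanged 5 times in a row -> double the
    # interval; changed on 5 consecutive crawls -> halve it.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %state;   # per-URL: { hash, interval, unchanged, changed }

    sub schedule_next_crawl {
        my ($url, $content) = @_;
        my $s    = $state{$url} ||= { interval => 24 * 3600, unchanged => 0, changed => 0 };
        my $hash = md5_hex($content);

        if (defined $s->{hash} && $s->{hash} eq $hash) {
            $s->{changed} = 0;
            if (++$s->{unchanged} >= 5) {          # no change for five crawls
                $s->{interval} *= 2;
                $s->{unchanged} = 0;
            }
        } else {
            $s->{unchanged} = 0;
            if (++$s->{changed} >= 5) {            # changed on five consecutive crawls
                $s->{interval} = int($s->{interval} / 2) || 1;
                $s->{changed}  = 0;
            }
        }
        $s->{hash} = $hash;
        return time() + $s->{interval};            # epoch time of the next crawl
    }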
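A depth-limited traversal along the lines of step 4 might look like the sketch below, again not taken from the article. It uses LWP::UserAgent and URI from libwww-perl, takes its seed URLs from the command line, and extracts links with a crude href regex rather than a real HTML parser.

    #!/usr/bin/perl
    # Breadth-first crawl that stops following links beyond depth 3,
    # in line with the depth/importance table in step 4.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $MAX_DEPTH = 3;
    my $ua = LWP::UserAgent->new(
        agent   => 'news.booso.com (+http://booso.com)',
        timeout => 30,
    );

    my %seen;
    my @queue = map { [ $_, 0 ] } @ARGV;          # [ url, depth ]; seeds are depth 0

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $seen{$url}++;

        my $res = $ua->get($url);
        next unless $res->is_success;
        print "depth $depth  $url\n";

        next if $depth >= $MAX_DEPTH;             # limit reached: do not expand links
        my $html = $res->decoded_content // '';
        while ($html =~ /href\s*=\s*"([^"#]+)"/gi) {   # crude link extraction
            my $abs = URI->new_abs($1, $res->base)->as_string;
            push @queue, [ $abs, $depth + 1 ] if $abs =~ m{^https?://};
        }
    }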
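The "only transfer what changed" idea behind the 304 responses in step 5 can be sketched with a conditional GET: send If-Modified-Since carrying the Last-Modified value remembered from the previous crawl. The in-memory %last_mod cache and the helper name fetch_if_changed are illustrative assumptions, not part of the original article.

    # Conditional fetch sketch: a 304 reply means the page has not changed
    # and the body is never transferred.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'news.booso.com (+http://booso.com)');
    my %last_mod;    # url => Last-Modified header saved on the previous crawl

    sub fetch_if_changed {
        my ($url) = @_;
        my @cond = $last_mod{$url} ? ('If-Modified-Since' => $last_mod{$url}) : ();
        my $res  = $ua->get($url, @cond);

        if ($res->code == 304) {
            return undef;                      # unchanged: nothing to download or re-index
        } elsif ($res->is_success) {
            my $lm = $res->header('Last-Modified');
            $last_mod{$url} = $lm if $lm;
            return $res->decoded_content;      # changed (or first fetch): process the body
        }
        return undef;                          # error: skip for now
    }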
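For step 6, here is a sketch of honoring robots.txt with the WWW::RobotRules module from CPAN and LWP::Simple; example.com and the two test URLs are placeholders, not sites from the article.

    # robots.txt check sketch: download a site's robots.txt once, then ask
    # WWW::RobotRules whether each URL may be fetched.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $agent = 'news.booso.com';
    my $rules = WWW::RobotRules->new($agent);

    my $robots_url = 'http://example.com/robots.txt';      # placeholder host
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    for my $url ('http://example.com/news/1.html', 'http://example.com/private/2.html') {
        if ($rules->allowed($url)) {
            print "allowed:    $url\n";                     # safe to crawl
        } else {
            print "disallowed: $url\n";                     # robots.txt says keep out
        }
    }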
The above is a reposted article about the web crawlers used by search engines (that is, search engine spiders). It introduces some common knowledge about spider design and is very helpful for SEO; pay particular attention to the following:

1. The data crawled this time is usually compared with the previous copy. If a page has not changed for five consecutive crawls, the interval before it is crawled again is doubled; if a page has been updated on each of five consecutive crawls, the interval is cut to 1/2 of the original. The frequency of page updates therefore strongly affects how often a search engine spider crawls a site, and the more often a site is crawled, the more of its pages get indexed. The number of indexed pages is the most basic element of SEO.

2. Crawling to level 3 is about enough; going deeper expands the data volume three to four times while the importance really does drop a lot, which is "sowing dragon seeds and harvesting fleas." So keep your website within a three-level directory structure as much as possible. Deeply buried pages put a lot of pressure on search engines; of course, I think Google has enough servers to handle that pressure, but on the other hand, pages more than three directory levels deep are crawled and updated much less frequently. As I mentioned earlier, we have to find a way to keep the physical and logical structures of a website consistent, which is reflected in good URL design; you can check the actual directories of the static pages generated at the front end to judge whether they can be optimized. On website logical structure and URL design, see "SEO is the first element of internal link optimization" and "How to choose between second-level domain names and first-level directories?"
