Design and Analysis of Web Crawlers in Search Engines

The following describes how to build a web crawler for a search engine and some basic things to watch out for. A web crawler is similar to the offline reading tools you may have used: it still has to connect to the network, otherwise how could it capture anything? So where are the differences?

1) A web crawler is highly configurable.
2) A web crawler can parse the links on a web page.
3) A web crawler can get by with a simple storage configuration.
4) A web crawler can intelligently analyze how often web pages are updated.
5) A web crawler is quite efficient.

In fact, these features are exactly the requirements. So how do you design a crawler, and which steps deserve attention?

1) URL traversal and recording. Larbin does this very well. URL extraction itself is actually very simple, for example:

    cat [what you got] | tr '"' '\n' | gawk '{print $2}' | pcregrep '^http://'

gives you a list of URLs.

2) Multi-process vs. multithreading. Each has its own advantages. These days an ordinary PC, such as the one running booso.com, can easily crawl 5 GB of data in a day, about 200,000 web pages. (A worker-pool sketch follows this list.)

3) Controlling update times. The silliest approach is to have no time-based update weighting at all: crawl everything in one pass, then turn around and crawl it all again. Usually the data fetched this time is compared with the previous copy. If a page has not changed for five consecutive crawls, the interval before it is crawled again is doubled; if a page has changed on each of five consecutive crawls, the interval is cut to 1/2 of the original. Note that efficiency is one of the keys to winning. (A sketch of this rule follows this list.)

4) How deep should you crawl? It depends on your situation. If you have tens of thousands of servers acting as crawlers, I suggest you skip this step. If, like me, you have only one server, you should know the following:

    Page depth    Page count    Page importance
    0             1             10
    1             20            8
    2             600           5
    3             2,000         2
    4 and above   6,000         generally too low to measure

Crawling to level 3 is about enough. Going deeper, first, expands the data volume three to four times; second, the importance really does drop a lot. This is called "sowing dragon seeds and harvesting fleas." (A depth-limited traversal sketch follows this list.)

5) A crawler generally does not fetch the other side's pages directly; it goes out through a proxy, and the proxy helps relieve the load. When the other side's page has not been updated, getting the headers is enough and there is no need to transfer the whole page again, which saves a great deal of network bandwidth. The Apache web server records this as a 304 response, and such responses are generally served from cache. (A conditional-GET sketch follows this list.)

6) When you have a spare moment, please take good care of robots.txt. (A sketch follows this list.)

7) Storage structure. Here everyone has their own wisdom: Google uses its GFS system; if you have 7 or 8 servers I advise NFS; if you have 70 or 80 servers I suggest AFS; and if you have only one server, do as you please. The following snippet is how the news search engine I wrote stores data (the Perl one-liner hex-encodes every character that is not a word character, "-", "." or "@", so a URL becomes a safe file name):

    name=`echo $url | perl -p -e 's/([^\w\-\.\@])/$1 eq "\n" ? "\n" : sprintf("%2.2x",ord($1))/eg'`
    mkdir -p $author
    newscrawl.pl $url --user-agent="news.booso.com+(+http://booso.com)" --outfile=$author/$name
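To illustrate step 2, here is a minimal multithreaded fetcher sketch in Perl (the article itself does not include one). It assumes the core threads and Thread::Queue modules plus LWP::UserAgent from libwww-perl; the url.list file name and the worker count are assumptions made only for illustration, the file standing in for whatever the step-1 pipeline produced.

    #!/usr/bin/perl
    # Minimal worker-pool fetcher: one shared queue of URLs, a few threads,
    # each thread with its own LWP::UserAgent.
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;

    my $workers = 8;                               # arbitrary worker count
    my $queue   = Thread::Queue->new();

    # one URL per line, e.g. the output of the pipeline from step 1
    open my $fh, '<', 'url.list' or die "cannot open url.list: $!";
    chomp(my @urls = <$fh>);
    close $fh;

    $queue->enqueue(@urls);
    $queue->enqueue(undef) for 1 .. $workers;      # one stop marker per worker

    my @threads = map {
        threads->create(sub {
            my $ua = LWP::UserAgent->new(
                agent   => 'news.booso.com (+http://booso.com)',
                timeout => 30,
            );
            while (defined(my $url = $queue->dequeue)) {
                my $res = $ua->get($url);
                if ($res->is_success) {
                    print "ok   $url\n";
                } else {
                    print "fail $url: ", $res->status_line, "\n";
                }
            }
        });
    } 1 .. $workers;

    $_->join for @threads;

A multi-process design would replace the threads with forked workers sharing a job file or database; which is better depends on the machine and on how much per-worker state you need.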
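The doubling/halving rule from step 3 could be sketched as follows; this is not code from the article. It compares an MD5 hash of the freshly fetched content (assumed to be raw bytes) with the hash saved from the previous crawl, and the one-day starting interval and the in-memory %state table are assumptions made purely for illustration.

    # Recrawl scheduling sketch: unchanged 5 times in a row -> double the
    # interval; changed on 5 consecutive crawls -> halve it.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %state;   # per-URL: { hash, interval, unchanged, changed }

    sub schedule_next_crawl {
        my ($url, $content) = @_;
        my $s    = $state{$url} ||= { interval => 24 * 3600, unchanged => 0, changed => 0 };
        my $hash = md5_hex($content);

        if (defined $s->{hash} && $s->{hash} eq $hash) {
            $s->{changed} = 0;
            if (++$s->{unchanged} >= 5) {          # no change for five crawls
                $s->{interval} *= 2;
                $s->{unchanged} = 0;
            }
        } else {
            $s->{unchanged} = 0;
            if (++$s->{changed} >= 5) {            # changed on five consecutive crawls
                $s->{interval} = int($s->{interval} / 2) || 1;
                $s->{changed}  = 0;
            }
        }
        $s->{hash} = $hash;
        return time() + $s->{interval};            # epoch time of the next crawl
    }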
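A depth-limited traversal along the lines of step 4 might look like the sketch below, again not taken from the article. It uses LWP::UserAgent and URI from libwww-perl, takes its seed URLs from the command line, and extracts links with a crude href regex rather than a real HTML parser.

    #!/usr/bin/perl
    # Breadth-first crawl that stops following links beyond depth 3,
    # in line with the depth/importance table in step 4.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $MAX_DEPTH = 3;
    my $ua = LWP::UserAgent->new(
        agent   => 'news.booso.com (+http://booso.com)',
        timeout => 30,
    );

    my %seen;
    my @queue = map { [ $_, 0 ] } @ARGV;          # [ url, depth ]; seeds are depth 0

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $seen{$url}++;

        my $res = $ua->get($url);
        next unless $res->is_success;
        print "depth $depth  $url\n";

        next if $depth >= $MAX_DEPTH;             # limit reached: do not expand links
        my $html = $res->decoded_content // '';
        while ($html =~ /href\s*=\s*"([^"#]+)"/gi) {   # crude link extraction
            my $abs = URI->new_abs($1, $res->base)->as_string;
            push @queue, [ $abs, $depth + 1 ] if $abs =~ m{^https?://};
        }
    }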
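The "only transfer what changed" idea behind the 304 responses in step 5 can be sketched with a conditional GET: send If-Modified-Since carrying the Last-Modified value remembered from the previous crawl. The in-memory %last_mod cache and the helper name fetch_if_changed are illustrative assumptions, not part of the original article.

    # Conditional fetch sketch: a 304 reply means the page has not changed
    # and the body is never transferred.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'news.booso.com (+http://booso.com)');
    my %last_mod;    # url => Last-Modified header saved on the previous crawl

    sub fetch_if_changed {
        my ($url) = @_;
        my @cond = $last_mod{$url} ? ('If-Modified-Since' => $last_mod{$url}) : ();
        my $res  = $ua->get($url, @cond);

        if ($res->code == 304) {
            return undef;                      # unchanged: nothing to download or re-index
        } elsif ($res->is_success) {
            my $lm = $res->header('Last-Modified');
            $last_mod{$url} = $lm if $lm;
            return $res->decoded_content;      # changed (or first fetch): process the body
        }
        return undef;                          # error: skip for now
    }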
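For step 6, here is a sketch of honoring robots.txt with the WWW::RobotRules module from CPAN and LWP::Simple; example.com and the two test URLs are placeholders, not sites from the article.

    # robots.txt check sketch: download a site's robots.txt once, then ask
    # WWW::RobotRules whether each URL may be fetched.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $agent = 'news.booso.com';
    my $rules = WWW::RobotRules->new($agent);

    my $robots_url = 'http://example.com/robots.txt';      # placeholder host
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    for my $url ('http://example.com/news/1.html', 'http://example.com/private/2.html') {
        if ($rules->allowed($url)) {
            print "allowed:    $url\n";                     # safe to crawl
        } else {
            print "disallowed: $url\n";                     # robots.txt says keep out
        }
    }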
The above is a reposted article about the web crawlers used by search engines (that is, search engine spiders). It introduces some common knowledge about spider design and is very helpful for SEO; pay particular attention to the following:

1. The data crawled this time is usually compared with the previous copy. If a page has not changed for five consecutive crawls, the interval before it is crawled again is doubled; if a page has been updated on each of five consecutive crawls, the interval is cut to 1/2 of the original. The frequency of page updates therefore strongly affects how often a search engine spider crawls a site, and the more often a site is crawled, the more of its pages get indexed. The number of indexed pages is the most basic element of SEO.

2. Crawling to level 3 is about enough; going deeper expands the data volume three to four times while the importance really does drop a lot, which is "sowing dragon seeds and harvesting fleas." So keep your website within a three-level directory structure as much as possible. Deeply buried pages put a lot of pressure on search engines; of course, I think Google has enough servers to handle that pressure, but on the other hand, pages more than three directory levels deep are crawled and updated much less frequently. As I mentioned earlier, we have to find a way to keep the physical and logical structures of a website consistent, which is reflected in good URL design; you can check the actual directories of the static pages generated at the front end to judge whether they can be optimized. On website logical structure and URL design, see "SEO is the first element of internal link optimization" and "How to choose between second-level domain names and first-level directories?"
