Last week, my boss handed me a task: write a web crawler to scrape data from an industry website. Since shell is the only programming language I know and I have never touched any other, shell was my only option.
A week has passed since then. I went from nothing to a working crawler, with countless ups and downs along the way, which I will not go into here.
Here I would like to share some personal notes on writing web crawlers in shell. If you have ideas of your own, please get in touch so we can exchange them; if not, feel free to take what is useful and discard the rest.
1. You must understand what a web crawler is. I will not copy the definition here, since a quick Google search turns up plenty of explanations.
2. Have a basic understanding of the HTTP protocol: the differences between HTTP 1.0 and 1.1, the HTTP request process, the contents of a request message, and the parts of a URL. The request content is the focus of a web crawler. If the site you want to crawl requires a username and password, the cookie is critical. If the site has anti-leech protection, you need to declare which page you came from, so the Referer header matters. If the site requires you to submit POST data, you need to pay close attention to the form data and the response, and so on. These are only the most important points.
3. With the HTTP protocol covered, how do we send this information to the target site? Normally we operate through a browser, but a crawler must work without a human in the loop. For that we rely on two command-line tools: curl and wget. Personally, I use curl to request page content and wget to download resources such as files, video, and audio. For both commands, the key things to learn are how to pass cookies, how to set the Referer, how to send POST data, how to configure a proxy, and so on. Taking curl as an example: use -b (--cookie) to pass a cookie, -s (--silent) to suppress unnecessary output while fetching the page, and -e (--referer) to specify the URL you claim to have come from. For the full details of curl and wget, consult Google; I will not copy the man pages here. A minimal sketch follows this list.
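To make the above concrete, here is a minimal sketch, assuming a hypothetical site example.com with a login form; the URLs, form fields, cookie file, and output paths are placeholders, not taken from any real site:

#!/bin/bash
# Minimal fetch sketch with curl and wget; everything below is a placeholder example.

target="http://example.com/list.php"      # hypothetical page to crawl
cookie_jar="/tmp/crawler_cookies.txt"     # where curl saves/reads session cookies

# 1. Log in first and save the session cookie (-c writes cookies, -d sends POST form data).
curl -s -c "$cookie_jar" \
     -d "username=myuser&password=mypass" \
     "http://example.com/login.php" > /dev/null

# 2. Request the page we actually want, replaying the cookie (-b) and
#    claiming we came from the site itself (-e) to satisfy anti-leech checks.
curl -s -b "$cookie_jar" \
     -e "http://example.com/index.php" \
     "$target" -o /tmp/page.html

# 3. Download a binary resource with wget, reusing the same cookie file.
wget -q --load-cookies "$cookie_jar" \
     --referer="http://example.com/index.php" \
     -O /tmp/file.zip "http://example.com/download/file.zip"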
With the above knowledge, you can fetch a target site's page content with a single command request. What comes next is filtering the fetched data and speeding up the crawl.
1. Shell is a good fit for filtering and extracting data, as everyone knows. The common text-processing tools in shell programming are grep, sed, and awk, plus helpers such as cut, wc, uniq, and sort. Combined with regular expressions, these tools let us pick out exactly the information we care about. Their usage is beyond the scope of this article, but a small filtering sketch follows this list.
2. Build the overall crawler script. The more fluent you are with shell, the better; the main skill is designing the overall script framework and the combination of, and relationships between, its pieces of logic. A poorly organized script is also hard to troubleshoot.
3. Optimizing the speed of a shell-based crawler depends heavily on the previous point. To speed things up, on one hand reduce unnecessary command invocations, which cuts disk I/O and CPU overhead; on the other hand, use shell's multi-process capabilities to raise the script's overall concurrency.
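As an illustration of the filtering step, here is a minimal sketch that extracts article links from the page fetched earlier; the saved file path and the URL pattern are assumptions for illustration only:

#!/bin/bash
# Minimal filtering sketch: pull article links out of a saved page.
# /tmp/page.html and the URL pattern are hypothetical examples.

page="/tmp/page.html"

# grep -o prints only the matching part; the regex matches href="..." values,
# sed strips the surrounding attribute syntax, and sort -u removes duplicates.
grep -o 'href="[^"]*"' "$page" \
    | sed 's/^href="//; s/"$//' \
    | grep '^http://example\.com/article/' \
    | sort -u > /tmp/article_urls.txt

wc -l /tmp/article_urls.txt    # how many unique article links were found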
OK! Those are my personal tips on shell-based web crawlers. Below are a few additional optimization ideas.
1. Consider where the target site is hosted, domestic or overseas. If it is an overseas site, try to run your crawler on an overseas server as well (you know why); otherwise the speed may be embarrassing. Binding the target site's IP address locally (for example in /etc/hosts) or choosing a good DNS server is also worthwhile.
2. When using shell's multi-process capability, remember to limit the number of concurrent processes. Choose this value carefully: it depends on your server's performance on one hand and on the load the target site can tolerate on the other, and neither can be ignored. Finding a harmonious value takes repeated testing. Keep that in mind! A concurrency sketch follows this list.
3. To keep the crawler easy to extend later, design the framework and variables flexibly. Otherwise the script becomes a dead end that is awkward to build on.
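Here is a minimal sketch of the bounded concurrency mentioned above, using background jobs with a simple process-count limit; the URL list file, output directory, and the limit of 5 are assumptions to be tuned to your server and the target site:

#!/bin/bash
# Minimal bounded-concurrency sketch: fetch URLs in parallel, but never run
# more than $max_jobs curl processes at once. Input file and limit are hypothetical.

url_list="/tmp/article_urls.txt"
max_jobs=5
out_dir="/tmp/pages"
mkdir -p "$out_dir"

while read -r url; do
    # Throttle: wait while the number of running background jobs is at the limit.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        sleep 0.2
    done

    # Fetch one page in the background; name the file after an MD5 hash of the URL.
    (
        name=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
        curl -s -o "$out_dir/$name.html" "$url"
    ) &
done < "$url_list"

wait    # block until every background fetch has finished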
Although shell is very much a procedural language, I still hope to use it flexibly and think about the problem from a higher level. Finally, I personally believe a higher-level language such as Java or Python is the better choice for writing crawlers; but since I am not at that level yet, shell is what I have.
In my next post, I will share my crawler script with you and hope it helps!
This article is from the "Not Only Linux" blog. Please keep this source: http://nolinux.blog.51cto.com/4824967/1550976