Analysis of a Shell web crawler instance


The day before yesterday, I briefly shared some thoughts on writing web crawlers in shell. Today I am posting the code itself to share with fellow 51CTO bloggers. I still love technology, open source, and Linux.

The annotations and the overall design of the script are included in the script itself, where they are explained in place.

#!/bin/bash
# This script is used to grab the data on the specified industry website
# written by sunsky
# mail : [email protected]
# date : 3:06:00
#

root_dir="/dataimg"              # root directory the script works under
tmp_dir="$root_dir/tmp"          # directory for temporary data
url_md5_dir="$root_dir/url_md5"  # directory recording the MD5 values of product detail page URLs
html_dir="$root_dir/html"        # directory storing the downloaded product detail pages
url_md5="$url_md5_dir/md5.$year" # file recording the MD5 values of product detail page URLs
web_url="https://www.redhat.sx/" # home page of the crawled website
report="$root_dir/report"        # file recording the combined information of every collected URL
curl="curl -A 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.102 Safari/537.36' --referer http://www.redhat.sx"
opt0="/dataimg/years"            # year information

if [ `echo $UID` != 0 ]; then
    echo 'Please run this script as root!'
fi

if [ ! -f /dataimg/years ]; then
    echo 'Please provide the year data file; its path must be /dataimg/years.'
fi

if [ ! -d $tmp_dir ]; then
    mkdir -p $tmp_dir
fi

if [ ! -d $url_md5_dir ]; then
    mkdir -p $url_md5_dir
fi

if [ ! -d $html_dir ]; then
    mkdir -p $html_dir
fi

fifo_file="/tmp/$$.fifo"
mkfifo $fifo_file
exec 9<>$fifo_file
rm -f $fifo_file
num=10
for ((i=0; i<$num; i++)); do
    echo
done >&9

while read x1; do
read -u9
{
    url1="${web_url}model/ymmtselects.cfc?method=getmakes&passyear=$x1"
    x1_md5=`echo $url1 | cksum | cut -d' ' -f1`
    opt1="$tmp_dir/${x1_md5}"                                               # brand (make) information
    if ! ls $opt1 >& /dev/null; then
        $curl -s $url1 | awk 'BEGIN{RS="<"}{print $0}' | awk -F'>' '{print $2}' | sed '1,9d' | sed '$d' | grep -v '^$' > $opt1
    fi
    while read x2; do
        x2=`echo $x2 | sed 's# #%20#g'`
        url2="${url1}&passmakename=$x2"
        x2_md5=`echo $url2 | cksum | cut -d' ' -f1`
        opt2="$tmp_dir/${x1_md5}_${x2_md5}"                                 # model information
        if ! ls $opt2 >& /dev/null; then
            $curl -s $url2 | awk 'BEGIN{RS="<"}{print $0}' | awk -F'>' '{print $2}' | sed '1,6d' | sed '$d' | grep -v '^$' > $opt2
        fi
        while read x3; do
            x3=`echo $x3 | sed 's#[[:space:]]#%20#g'`
            url3="${url2}&passmodel=$x3"
            x3_md5=`echo $url3 | cksum | cut -d' ' -f1`
            opt3="$tmp_dir/${x1_md5}_${x2_md5}_${x3_md5}"                   # trim information
            if ! ls $opt3 >& /dev/null; then
                $curl -s $url3 | sed 's#[[:space:]]##g' | awk 'BEGIN{RS="<|=|>"}{print $0}' | egrep '^[0-9]+$' > $opt3
            fi
            while read x4; do
                x4=`echo $x4 | sed 's# #%20#g'`
                url4="${url3}&passvehicleid=$x4"
                x4_md5=`echo $url4 | cksum | cut -d' ' -f1`
                opt4="$tmp_dir/${x1_md5}_${x2_md5}_${x3_md5}_${x4_md5}"     # section/category information
                if ! ls "${opt4}" >& /dev/null; then
                    $curl -s $url4 | awk 'BEGIN{RS="<"}{print $0}' | awk -F'[>;]' '{print $2}' | sed -e '1,3d' -e '$d' -e '/^$/d' > $opt4
                fi
                while read x5; do
                    x5=`echo $x5 | sed 's# #%20#g'`
                    url_list="${web_url}index.cfm?fuseaction=store.sectionsearch&ymmtyears=$x1&ymmtmakenames=$x2&ymmtmodelnames=$x3&ymmttrimnames=$x4&templates=$x5"
                    url_list_md5=`echo "$url_list" | md5sum | awk '{print $1}'`
                    opt5="$tmp_dir/${x1_md5}_${x2_md5}_${x3_md5}_${x4_md5}_${url_list_md5}"   # product detail page URL information
                    if ! grep -q $url_list_md5 "$url_md5"; then
                        $curl -s "$url_list" > "$url_md5_dir/$url_list_md5"
                        num=`grep 'view page' "$url_md5_dir/$url_list_md5" | wc -l`
                        num2=$(($num/2))
                        echo > $opt5
                        grep 'a href="index.cfm?fuseaction=store.partinfo&partnumbe' "$url_md5_dir/$url_list_md5" | cut -d'"' -f2 > $opt5
                        while [ $num2 -ge 2 ]; do
                            url_list=`grep "view page $num2" "$url_md5_dir/$url_list_md5" | awk -F'["]' '{a[$9]=$9} END{for(i in a) print a[i]}'`
                            $curl -s "$url_list" | grep 'a href="index.cfm?fuseaction=store.partinfo&partnumbe' | cut -d'"' -f2 >> $opt5
                            num2=$(($num2-1))
                        done
                        echo $url_list_md5 >> "$url_md5"
                    fi
                    while read x6; do
                        url_detail="${web_url}${x6}"
                        url_detail_md=`echo $url_detail | md5sum | awk '{print $1}'`
                        if ! grep -q $url_detail_md "$url_md5" >& /dev/null; then    # deduplicate by the MD5 of each product detail page URL
                            $curl -s "$url_detail" > "$html_dir/$url_detail_md"
                            label=`grep 'digoal-label' "$html_dir/$url_detail_md" | awk -F'[<>]' '{print $5}'`                           # product label
                            gif_url=`grep -B 10 partinfo "$html_dir/$url_detail_md" | grep -o "https.*gif" | awk '{a=$0} END{print a}'`  # URL of the product image
                            product_id=`grep 'productid' "$html_dir/$url_detail_md" | awk -F'[<>]' '{print $3}'`                         # product part number
                            gifile=${gif_url#*/}                # image URL with the leading "https:/" stripped, e.g. /a/b.gif
                            gif_img="${root_dir}${gifile}"      # absolute local path of the saved image, e.g. /dataimg/a/b.gif
                            u4=`grep -B 10 '<!-- Start opentop -->' "$html_dir/$url_detail_md" | grep JavaScript | awk -F'[<>]' '{print $3}'`
                            ! ls $gif_img >& /dev/null && wget -q -m -k -P "$root_dir" "$gif_url"
                            echo $url_detail_md >> "$url_md5"
                            echo "$(date +%m%d%T)++$x1++$x2++$x3++$u4++$x5++$url_detail++$url_detail_md++$label++$product_id++$gif_img++$url_list" >> "$report"
                        fi
                    done < $opt5   # feed in the product detail page URLs and loop over them
                done < $opt4       # feed in the section/category information and loop over it
            done < $opt3           # feed in the trim information and loop over it
        done < $opt2               # feed in the model information and loop over it
    done < $opt1                   # feed in the brand information and loop over it
    echo >&9
} &
done < $opt0                       # feed in the year information and loop over it

if [ $? -eq 0 ]; then
    echo "-------- finished --------" >> $report
else
    echo "-------- unfinished --------" >> $report
fi
wait
exec 9<&-

OK!

The above is the complete script. Overall, the work falls into two parts: assembling the target URLs and fetching them. Around these two directions, the script mainly uses curl to fetch the data and sed, awk, grep, and cut to extract the pieces of interest.

The target URL can only be reached after several options have been selected in turn, which is why the URL-assembly step comes before the fetch step. For both parts I use nested while loops, so that each level's parameters are reused by the next and the data is mined layer by layer, as the sketch below illustrates.
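As a minimal, hedged sketch of that "fetch, extract the options, reuse them in the next URL" idea: this is not the exact pipeline from the script above, and the endpoint, parameter names, sed/awk field positions, and file names are illustrative assumptions.

#!/bin/bash
# Sketch of assembling URLs level by level and extracting options with text tools.
# The endpoint and parameter names below are hypothetical, not the real redhat.sx ones.
base="https://example.com/selects.cfc?method=getmakes"

# Level 1: fetch the list of makes for one year.
# Splitting the HTML on "<" turns each tag plus its text into one record;
# the second awk keeps only the text that follows the ">".
curl -s "${base}&passyear=2014" |
    awk 'BEGIN{RS="<"}{print}' |
    awk -F'>' '{print $2}' |
    grep -v '^$' > makes.txt

# Level 2: reuse each extracted make in a more specific URL,
# just as the nested while loops reuse x1..x5 in the script.
while read -r make; do
    make_enc=$(echo "$make" | sed 's/ /%20/g')             # URL-encode spaces
    url="${base}&passyear=2014&passmakename=${make_enc}"
    curl -s "$url" | awk 'BEGIN{RS="<"}{print}' | awk -F'>' '{print $2}' | grep -v '^$'
done < makes.txt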

To speed up the crawl while keeping that speed under control, the script uses shell multi-processing together with a duplicate-data check.

The purpose of shell multi-processing is to run more operations in parallel, which shortens the overall completion time and improves crawl efficiency.

Shell multi-processing mainly relies on a loop combined with { } &. If the number of jobs is known in advance, either for or while will do; if it is not, a while loop is the better choice. By embedding { } & inside the loop, the command group inside the braces is sent to the background to run on its own, and as soon as that line has been issued the loop can move on to the next iteration.
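A minimal sketch of the loop plus { } & pattern; the body is just a placeholder sleep, and items.txt is a hypothetical input file with one work item per line:

#!/bin/bash
# Each iteration's command group is pushed into the background with { ...; } &,
# so the loop starts the next iteration without waiting for the current one.
while read -r item; do
    {
        sleep 1                  # placeholder for the real per-item work (curl, parsing, ...)
        echo "finished: $item"
    } &
done < items.txt

wait                             # block until every background group has finished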

The above alone does not limit how many background processes the shell starts. Suppose 10,000 iterations are needed: without any throttling, all 10,000 jobs could be pushed into the background at once, which is fatal for the system; and as a crawler, you also must not hit the target website with too many concurrent requests. For both reasons, the number of shell jobs running in the background at any one time has to be capped. This is implemented with a file descriptor: create a temporary FIFO (named pipe) file, open a file descriptor on it, and write a fixed number of blank lines into it (10 in this article); these lines are the concurrency budget. Then, inside the loop, call read -u9 before the { } & block (9 is the descriptor used in this article) to take one line from descriptor 9. If a line is available, execution continues; if not, the loop blocks there until one of the running jobs writes a line back with echo >&9.
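Here is a self-contained sketch of that descriptor trick, under the same assumptions as before (placeholder job body, hypothetical items.txt); with 10 tokens written into the pipe, at most 10 jobs run at any moment:

#!/bin/bash
# Limit the number of concurrent background jobs with a FIFO used as a token bucket.
fifo="/tmp/$$.fifo"
mkfifo "$fifo"
exec 9<>"$fifo"          # open descriptor 9 read/write on the FIFO
rm -f "$fifo"            # the name can be removed; the open descriptor keeps the pipe alive

for ((i = 0; i < 10; i++)); do   # 10 blank lines = 10 tokens = at most 10 concurrent jobs
    echo
done >&9

while read -r item; do
    read -u9             # take one token; blocks here while 10 jobs are already running
    {
        sleep 1          # placeholder for the real work on "$item"
        echo >&9         # return the token so a waiting iteration can proceed
    } &
done < items.txt

wait
exec 9<&-                # close descriptor 9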

Combining the two mechanisms above gives you controlled shell multi-processing.

The duplicate-data check exists because, while debugging, I found the speed bottleneck to be curl, i.e. the network. If the script is interrupted by an exception and restarted, repeating every curl call greatly increases the total running time. By recording what has already been fetched and checking before each request, the script avoids both the wasted curl time and duplicate data collection. The figure below shows the deduplication logic:

[Figure: data deduplication logic diagram (qq20140914033817.png) - http://s3.51cto.com/wyfs02/M01/49/91/wKioL1QUnWzyFIJZAAMW8rN7OK8488.jpg]
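As a stripped-down sketch of that deduplication idea (the file paths here are illustrative; in the real script the MD5 list lives in $url_md5 and the pages are stored under $html_dir):

#!/bin/bash
# Skip the curl call entirely if a URL's MD5 has already been recorded.
seen="/tmp/crawler/md5.list"         # hypothetical MD5 record file
html_dir="/tmp/crawler/html"         # hypothetical page store
mkdir -p "$(dirname "$seen")" "$html_dir"
touch "$seen"

fetch_once() {
    local url="$1"
    local sum
    sum=$(echo "$url" | md5sum | awk '{print $1}')
    if grep -q "$sum" "$seen"; then
        return 0                         # already fetched: no repeated curl, no duplicate data
    fi
    curl -s "$url" > "$html_dir/$sum"    # fetch the page and store it under its MD5
    echo "$sum" >> "$seen"               # record it so a restarted run will not refetch
}

fetch_once "https://example.com/index.cfm?fuseaction=store.partinfo&partnumber=123"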

The variables used in the script are already commented in detail in the script above, so I will not repeat them here.

I will not go through the remaining details and tricks one by one. If you are interested in shell, feel free to contact me so we can learn from each other and improve together.

This article is from the Not Only Linux blog; please keep this source link: http://nolinux.blog.51cto.com/4824967/1552472
