"Python network crawler four" multi-threaded crawl multiple pictures of Baidu pictures


Recently I watched the goddess's new drama "Nigeru wa Haji da ga Yaku ni Tatsu" ("We Married as a Job!").

Won over by her adorable face, let's crawl some photos of the goddess.

Baidu Images search results for: 新垣結衣 (Aragaki Yui)

1. Downloading the static page

Looking at the page's HTML source, we find that each image comes with four kinds of links:

{"Thumburl": "http://img5.imgtn.bdimg.com/it/u=2243348409,3607039200&fm=23&gp=0. JPG "," Middleurl ":" http://img5.imgtn.bdimg.com/it/u=2243348409,3607039200&fm=21&GP  =0.jpg "," Hoverurl ":" http://img5.imgtn.bdimg.com/it/u=2243348409,3607039200&fm=23 &GP =0.jpg "," Objurl ":" Http://attachments.gfan.com/attachments2/day_110111/1101112033d77a4a8eb2b00eb1.jpg "}

The main difference between them is resolution; objURL points to the source image and is the clearest one. Testing shows the first three have anti-crawler measures: they can be opened in a browser, but refreshing once yields 403 Forbidden, so a crawler cannot fetch them.

The fourth kind, objURL, is the source URL of the image. Fetching it runs into three scenarios:

    1. Normal: continue with the download.
    2. 403 Forbidden: skip it with continue.
    3. An exception occurs: handle it with try/except.
    # coding: utf-8
    from __future__ import print_function
    import os
    import re
    import urllib
    import urllib2

    def getHtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        return html

    def getImg(html):
        reg = r'"objURL":"(.*?)"'  # regular expression
        # the parentheses form a capture group, so findall returns
        # just the image URLs embedded in the page
        imgre = re.compile(reg)
        imglist = re.findall(imgre, html)
        print(len(imglist))
        return imglist

    def downLoad(urls, path):
        index = 1
        for url in urls:
            print("Downloading:", url)
            try:
                res = urllib2.urlopen(url)
                if str(res.getcode())[0] == "4":
                    print("not downloaded successfully:", url)
                    continue
            except Exception:
                print("not downloaded successfully:", url)
                continue
            filename = os.path.join(path, str(index) + ".jpg")
            urllib.urlretrieve(url, filename)  # download remote data directly to a local file
            index += 1

    html = getHtml("http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1484296421424_r&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=新垣结衣&f=3&oq=xinyuanj&rsp=0")
    savepath = "D:\\TestDownload"
    downLoad(getImg(html), savepath)

The urlretrieve method downloads remote data directly to a local file.

    urllib.urlretrieve(url[, filename[, reporthook[, data]]])

    Parameter description:
        url: the remote (or local) URL to fetch
        filename: the local path to save to (if this parameter is omitted, urllib generates a temporary file to hold the data)
        reporthook: a callback function, triggered when the connection to the server is established and after each data block is transferred; we can use this callback to display the current download progress
        data: data to POST to the server

    The method returns a tuple (filename, headers): filename is the local path, and headers is the server's response headers.
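For illustration, here is a minimal sketch (Python 2, matching the code above) of using the reporthook callback to print progress; the URL and file name are placeholders, not part of the original post:

    # coding: utf-8
    import urllib

    def report(block_count, block_size, total_size):
        # called once when the connection opens and after each block arrives
        if total_size > 0:
            percent = min(100, block_count * block_size * 100 / total_size)
            print("downloaded %d%%" % percent)

    # hypothetical URL, for illustration only
    filename, headers = urllib.urlretrieve("http://example.com/sample.jpg",
                                           "sample.jpg", report)
    print(filename)  # the local path
    print(headers)   # the server's response headers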

And just like that, the goddess's pictures are crawled down.

2. Crawling more images

Wait, this is not satisfying yet. Baidu Images pages load dynamically: you have to scroll down for more photos to load, and our first step only crawled 30 images.

Open the browser, press F12, switch to the Network tab, and scroll the page down. The URL in the address bar does not change, yet more images keep appearing, which means the page is exchanging data with the server in the background. That's our guy.

XHR is short for XMLHttpRequest, roughly "extensible hypertext transfer request": XML (Extensible Markup Language), HTTP (Hypertext Transfer Protocol), request. The XMLHttpRequest object can update part of a web page without submitting the entire page to the server. After the page has fully loaded, the client requests data from the server through this object; the server accepts and processes the request, then feeds the data back to the client.

Comparing the request URLs, you can see they are basically identical, differing only a little at the end.

Only the value of pn differs. Testing shows that pn appears to be the offset of the current request (how many images have been requested so far), while rn is the number of images returned per update.

    pn=90&rn=30&gsm=5a&1484307466221=
    pn=120&rn=30&gsm=5a&1484307466223=
    pn=150&rn=30&gsm=78&1484307469213=
    pn=180&rn=30&gsm=78&1484307469214=
    pn=210&rn=30&gsm=b4&1484307553244=

Now we know that by repeatedly requesting this URL and changing its pn value, we can in theory download as many images as we like.

    import itertools
    import urllib

    def getMoreURL(word):
        word = urllib.quote(word)
        url = r"http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&fp=result&queryWord={word}" \
              r"&cl=2&lm=-1&ie=utf-8&oe=utf-8&st=-1&ic=0&word={word}&face=0&istype=2&nc=1&pn={pn}&rn=60"
        # itertools.count: an infinite iterator starting at 0 with step 30
        urls = (url.format(word=word, pn=x) for x in itertools.count(start=0, step=30))
        return urls
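As a quick sanity check that the generator behaves as intended, here is a minimal usage sketch (the search word is only an example; itertools.islice caps the infinite iterator):

    # coding: utf-8
    import itertools

    # take the first three request URLs; pn steps through 0, 30, 60
    for url in itertools.islice(getMoreURL("新垣结衣"), 3):
        print(url)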

Here, the job of urllib.quote is to URL-encode the string.

By the standard, URLs allow only a subset of ASCII characters (letters, digits, and some symbols); other characters (such as Chinese characters) do not conform to the URL standard.
Therefore, using other characters in a URL requires URL encoding.

The query-string part of a URL has the format name1=value1&name2=value2.
If a name or a value itself contains a "&" or "=" symbol, there is a problem, so "&" and "=" inside parameter values must be encoded as well.

URL encoding converts each character that needs encoding into the form %XX. URL encoding is usually based on UTF-8 (which, of course, depends on the browser and platform).
Example:
the character "我" has Unicode code point 0x6211 and UTF-8 encoding 0xE6 0x88 0x91, so its URL encoding is
%E6%88%91
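A quick sketch of the above with urllib.quote (Python 2, matching the code in this post; in Python 3 the function lives in urllib.parse):

    # coding: utf-8
    import urllib

    print(urllib.quote("我"))      # %E6%88%91 (the UTF-8 bytes, percent-encoded)
    print(urllib.quote("a&b=c"))   # a%26b%3Dc ('&' and '=' are encoded too)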

itertools.count(start, step) is an infinite iterator; its two parameters are the start value and the step, so we get an arithmetic progression. Substituting this sequence into the URL template gives us our urls, as the usage sketch above shows.

I tried it and went to parse the data, but objURL was surprisingly not an HTTP image URL. After checking around, it turns out the next thing to do is decoding.

"Objurl": "Ippr_z2c$qazdh3fazdh3fta_z&e3Bi1fsk_z& E3BV54AZDH3FKUFAZDH3FW6VITEJAZDH3F99LWBDUBLLDLNVJAAVMC8C01BA8A8M1MU0VJWAMA_Z&E3B3R2"

3. Page decoding

Referring to a write-up on decoding Baidu Images pages, it turns out the scheme is really just a table: matching keys to values decodes the URL. It is practically a plaintext cipher 0.0 ...

So we add a dictionary to our program and run the decoding in several passes.

An error occurred once during decoding; after careful investigation, the problem was url.translate(char_table):

str's translate method requires the decimal Unicode code point of a single character as its key;
a number in the value is interpreted as a decimal Unicode code point and converted to the corresponding character;
a string can also be used directly as the value.

So you need to add a line:

    char_table = {ord(key): ord(value) for key, value in char_table.items()}
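Putting the pieces together, here is a minimal sketch of the decoding step. The two tables follow the mapping circulated in write-ups on Baidu's objURL obfuscation; treat them as an assumption rather than an official specification:

    # coding: utf-8
    # assumed mapping from the referenced decoding write-up, not an official spec
    str_table = {
        '_z2C$q': ':',
        '_z&e3B': '.',
        'AzdH3F': '/',
    }

    char_table = {
        'w': 'a', 'k': 'b', 'v': 'c', '1': 'd', 'j': 'e', 'u': 'f', '2': 'g',
        'i': 'h', 't': 'i', '3': 'j', 'h': 'k', 's': 'l', '4': 'm', 'g': 'n',
        '5': 'o', 'r': 'p', 'q': 'q', '6': 'r', 'f': 's', 'p': 't', '7': 'u',
        'e': 'v', 'o': 'w', '8': '1', 'd': '2', 'n': '3', '9': '4', 'c': '5',
        'm': '6', '0': '7', 'b': '8', 'l': '9', 'a': '0',
    }
    # translate() wants code points as keys (unicode.translate in Python 2,
    # str.translate in Python 3); hence the extra line discussed above
    char_table = {ord(key): ord(value) for key, value in char_table.items()}

    def decodeURL(url):
        # replace the multi-character tokens first, then map single characters
        for src, dst in str_table.items():
            url = url.replace(src, dst)
        return url.translate(char_table)

Applied to an obfuscated objURL like the one above (as a unicode string in Python 2), decodeURL should return a plain http:// URL.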

"Python network crawler four" multi-threaded crawl multiple pictures of Baidu pictures

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.