[Illustrated Guide] Python Crawler: Build an Automatic Image Downloader in 5 Minutes

Last Update:2016-04-19 Source: Internet

Author: User


Python Crawler in Practice: An Automatic Image Downloader

The earlier posts in this "Python crawler" primer covered a lot of basics, and that is probably enough groundwork. Now let's actually build something small. After all:

Talk is cheap. Show me the code!

Basic steps for making a crawler

Along the way, this small example will teach you the basic steps of making a crawler.

In general, making a crawler involves the following steps:

1. Analyze the requirements (yes, requirements analysis is very important; don't tell me your teacher never taught you that).
2. Analyze the page source with F12 (without F12, do you really expect me to read source code that messy?).
3. Write a regular expression or an XPath expression (the powerful tools introduced earlier).
4. Write the actual Python crawler code.

Result

Run it:

It asks me for a keyword. Let me think, what should I enter? This might reveal a little about my hobbies.

Press Enter.

It has started downloading. Nice! Looking through the downloaded pictures, I suddenly have a lot of new reaction images...

Well, that is roughly what it does.

Requirements analysis

"I want pictures, but I don't want to search the web for them."
"Ideally they would download automatically."
...

Those are the requirements, so let's analyze them. For searching pictures, the most obvious idea is to crawl the results of Baidu Images, so let's take a look at Baidu Images.

It looks basically like this. Quite pretty.

Let's try searching for something. I type one word, "violence", and a series of search suggestions comes up, which says something about what people look for...

Pick any of them and press Enter.

Now we see a lot of pictures. If only we could download all of them. Notice that the keyword appears in the URL.

Let's try replacing the keyword directly in the URL. The page jumps to the new results!

So this URL lets us find images for a specific keyword, which means that in theory we can search for images without opening the web page at all. The next question is how to download them automatically. From what we learned earlier, we can use requests to fetch each image URL and save the response as a .jpg file.

So this project should be doable.

Analyzing the web page

OK, on to the next step: analyzing the page source. First I switch back to the traditional paginated view. Why? Because Baidu Images currently uses a waterfall layout that loads pictures dynamically, which is troublesome to handle. The traditional paginated view is much easier to work with.

Here is another tip: if a site has a mobile version, crawl that rather than the desktop version. The mobile page's code is usually much cleaner, so it is easier to get the content you need.

OK, we have switched back to the traditional version. A page with page numbers is more comfortable to look at.

Right-click and view the page source.

What on earth is this? How is anyone supposed to read it!

This is where F12, the developer tools, comes in. Go back to the previous page and press F12 to bring up the toolbar. We need the two controls in its top-left corner: one selects an element by following the mouse, and the other switches to the mobile view. Both are useful to us. Here we use the first one.

Then click the part of the page whose source you want to see. The code pane below jumps to the matching location automatically. Pretty neat, isn't it?

Copy this address.

Then search for it in the messy page source, and there it is. You can't hide from me now! But here is a puzzle: why does this picture have so many addresses, and which one should we use? We can see thumbURL, middleURL, hoverURL, and objURL.

A little analysis shows that the first two are reduced-size versions, hoverURL is the version displayed when the mouse hovers over the picture, and objURL should be the one we need. If you don't believe it, open each of these URLs: the objURL picture is the largest and sharpest.

Now that we have found where the picture is, let's analyze the code around it. Let me check whether every objURL points to a picture.

They all seem to end in .jpg, so that should be right. The search finds 61 matches, which suggests there should be 61 pictures.
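To make the comparison of the four addresses concrete, here is a minimal Python 3 sketch that pulls each field out of a page fragment and checks the file extension. The SAMPLE fragment and its example.com addresses are invented for illustration, and field_values and all_end_with are hypothetical helper names rather than part of the article's code. The real Baidu markup may differ.

```python
import re

# Hypothetical fragment shaped like the data seen in the page source.
# The example.com addresses are placeholders, not real Baidu output.
SAMPLE = (
    '{"thumbURL":"http://example.com/thumb/1.jpg",'
    '"middleURL":"http://example.com/middle/1.jpg",'
    '"hoverURL":"http://example.com/hover/1.jpg",'
    '"objURL":"http://example.com/full/1.jpg",'
    '"fromURL":"http://example.com/page"}'
)


def field_values(html, field):
    """Return every value stored under the given quoted field name."""
    pattern = '"' + re.escape(field) + '":"(.*?)"'
    return re.findall(pattern, html, re.S)


def all_end_with(urls, suffix=".jpg"):
    """Check that every URL ends with the suffix (case-insensitive)."""
    return all(url.lower().endswith(suffix) for url in urls)


# Show what each of the four fields holds for this record.
for name in ("thumbURL", "middleURL", "hoverURL", "objURL"):
    print(name, field_values(SAMPLE, name))

obj_urls = field_values(SAMPLE, "objURL")
print(len(obj_urls), "objURL entries; all .jpg:", all_end_with(obj_urls))
```

Running the same two helpers on the real page text would give the match count and the extension check that the article does by eye.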

Writing the regular expression

With what we learned earlier, writing the following regular expression should not be hard, right?

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
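Why the pattern uses the non-greedy (.*?) rather than (.*) is easiest to see on a small input. The sketch below (Python 3) runs both variants on a hypothetical two-record fragment. The example.com addresses and the variable names lazy and greedy are made up for this illustration.

```python
import re

# Hypothetical page fragment with two records; the addresses are placeholders.
html = (
    '"objURL":"http://example.com/a.jpg","fromURL":"http://example.com/p1",\n'
    '"objURL":"http://example.com/b.jpg","fromURL":"http://example.com/p2",\n'
)

# The article's pattern: non-greedy, so each match stops at the first '",'.
lazy = re.findall('"objURL":"(.*?)",', html, re.S)

# Greedy variant for comparison: with re.S it runs on to the last '",'.
greedy = re.findall('"objURL":"(.*)",', html, re.S)

print(lazy)         # one entry per picture
print(len(greedy))  # a single oversized match
```

With the non-greedy pattern each picture address comes back as its own list entry. The greedy variant swallows everything between the first objURL and the last quote-comma pair, so it is not usable here.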

Writing the crawler code

Now let's write the crawler code for real. We use two packages here: re for regular expressions and requests. Both were introduced in earlier posts, so go back and read those if you have not!

# -*- coding: utf-8 -*-
import re
import requests

Then we paste in the URL, pass it to requests, and apply the regular expression.

url = 'http://image.baidu.com/search/flip?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1460997499750_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%B0%8F%E9%BB%84%E4%BA%BA'
html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)

In theory there are many pictures, so we need a loop. We print each result to have a look, then use requests to fetch the URL. Because some picture URLs may not open, we add a 10-second timeout.

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 0
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.ConnectionError:
        print('[Error] The current picture cannot be downloaded')
        continue

OK, next we save the pictures. Create a pictures directory in the current directory beforehand. All the pictures go in there, named by number. These lines continue the body of the loop:

    string = 'pictures\\' + str(i) + '.jpg'
    fp = open(string, 'wb')
    fp.write(pic.content)
    fp.close()
    i += 1
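The saving step can also be sketched as a small helper. This is a Python 3 variant under a few assumptions rather than the article's own code: save_image is a hypothetical name, os.path.join replaces the hard-coded Windows backslash so the path works on other systems, and os.makedirs creates the directory so it does not have to exist in advance.

```python
import os


def save_image(content, directory, index):
    """Write image bytes to <directory>/<index>.jpg and return the path."""
    # Create the target folder if needed instead of requiring it up front.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, str(index) + ".jpg")
    # The with-block closes the file even if write() raises.
    with open(path, "wb") as fp:
        fp.write(content)
    return path
```

Inside the download loop this would be called as save_image(pic.content, 'pictures', i), followed by i += 1 as before.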

Here is the complete code:

# -*- coding: utf-8 -*-
import re
import requests

url = 'http://image.baidu.com/search/flip?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1460997499750_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%B0%8F%E9%BB%84%E4%BA%BA'
html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 0
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.ConnectionError:
        print('[Error] The current picture cannot be downloaded')
        continue
    string = 'pictures\\' + str(i) + '.jpg'
    fp = open(string, 'wb')
    fp.write(pic.content)
    fp.close()
    i += 1

Let's run it and see the result. (What, you think this IDE looks fancy? Go install PyCharm; see the earlier article on configuring and using PyCharm!)

OK, we downloaded 58 pictures. Hey, wasn't it supposed to be 61?

Looking at the output, some pictures could not be downloaded during the run.

We also see that some pictures do not display. Opening their URLs confirms that they really are unavailable.

It turns out that Baidu caches some pictures on its own servers, so you can still see them in the search results even though the original links are dead.

OK, automatic downloading is solved. What about searching for pictures by keyword? Just change the URL. Here is the code:
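The gap between the number of matches and the number of saved files can be tracked explicitly. The sketch below (Python 3) is one possible arrangement, not the article's code: fetch_image and download_all are hypothetical names. It catches requests.exceptions.RequestException, the base class that also covers read timeouts and HTTP error statuses, where the article's loop only catches ConnectionError.

```python
def fetch_image(url, timeout=10):
    """Return the image bytes, or None if the request fails for any reason."""
    import requests  # imported lazily so download_all can be tried without it

    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 404 and similar statuses as failures
    except requests.exceptions.RequestException:
        return None
    return response.content


def download_all(urls, fetch=fetch_image):
    """Fetch each URL, returning (saved, failed) so the counts can be compared."""
    saved, failed = [], []
    for url in urls:
        content = fetch(url)
        if content is None:
            failed.append(url)  # dead link: remember it and move on
        else:
            saved.append((url, content))
    return saved, failed
```

Calling download_all(pic_url) and printing len(saved) and len(failed) would show how many of the matched addresses were actually retrievable. The fetch parameter exists only so the loop can be exercised with a stand-in function instead of real network requests.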

word = raw_input("Input key word: ")  # Python 2; on Python 3 use input() instead
url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
result = requests.get(url)
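Concatenating the raw keyword into the URL works in simple cases, but keywords containing spaces or non-ASCII characters are safer when percent-encoded first. Here is a minimal Python 3 sketch using the standard library's urlencode. The helper name build_search_url is hypothetical, and the parameter names and values are copied from the URL in the article, with no claim that Baidu still accepts them.

```python
from urllib.parse import urlencode

BASE_URL = "http://image.baidu.com/search/flip"


def build_search_url(keyword):
    """Build the paginated image-search URL for a keyword."""
    params = {
        "tn": "baiduimage",
        "ie": "utf-8",
        "word": keyword,  # urlencode percent-encodes this as UTF-8
        "ct": "201326592",
        "v": "flip",
    }
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("cat"))
```

For a Chinese keyword the encoded form is the same kind of %E5%B0... sequence that appears in the word= parameter of the hard-coded URL earlier in the article.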

Well, enjoy your first image-download crawler!

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
