Python crawler [1]: bulk downloading girl pictures

Source: Internet
Author: User
Jandan.net hosts plenty of high-quality girl pictures. Today I'll share how to download them in bulk with Python.

Knowledge and tools to understand:

#1 You need to know basic Python syntax. For this article, knowing how to work with lists, for ... in ... loops, and how to define a function is enough. The functions used to crawl web pages, parse them, and save files can be understood as you use them.

#2 You need to install the third-party library beautifulsoup4. Installing with pip is convenient. Recent versions of Python ship with pip. On Windows, press the Windows+X shortcut, open Command Prompt (Administrator), and enter

pip install beautifulsoup4

then press Enter to run it.



A message such as "Successfully installed ..." indicates that the installation is complete, and you can verify it with the quick check below.
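To confirm that beautifulsoup4 is actually available, a quick sanity check you can run:

    import bs4
    print(bs4.__version__)  # prints the installed version, e.g. 4.x.x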

#3 You don't need deep HTML knowledge. However, you do need a browser that can view page source and inspect elements, such as Chrome or Firefox.

(If you don't have pip, search online for how to install it.)

I. Download the web page

To download all the pictures on 2000+ pages, first you have to learn to download a single web page :). The URL we'll practice on is jandan.net/ooxx/page-2397#comments. Open it with Chrome or Firefox, then right-click → View Page Source. The page we see is the browser's rendering of source code written in HTML, JS, CSS, and so on. The addresses of the pictures are contained in that source code, so the first step is to download the HTML code.







[Screenshot: an excerpt of the page's HTML source]

Use Python's built-in urllib.request library to download the web page. urllib.request is an extensible library for opening URLs using a variety of protocols.

import urllib.request

url = 'http://jandan.net/ooxx/page-2397#comments'
res = urllib.request.urlopen(url)

What does urllib.request.urlopen() do? As its name suggests, it opens a URL. It accepts either a str (which is what we pass in here) or a Request object. Its return value is always an object that can work as a context manager and comes with methods such as geturl(), info(), and getcode().

In fact, we don't have to worry about all of that. Just remember that this function takes a URL and returns an object containing all of that URL's information, and we work on that object.
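As a small illustration (a sketch; the exact output depends on the site), urlopen() can be used as a context manager, and the returned object carries metadata about the response:

    import urllib.request

    url = 'http://jandan.net/ooxx/page-2397#comments'
    with urllib.request.urlopen(url) as res:
        print(res.geturl())   # the URL that was actually opened
        print(res.getcode())  # the HTTP status code, e.g. 200
        print(res.info())     # the response headers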


Now read the HTML code out of the res object and assign it to the variable html, using the res.read() method.

html = res.read()

At this point, the variable html holds the page's HTML source code!

Try print(html):




[Screenshot: an excerpt of print(html)'s output, shown as raw bytes]

At this point you'll notice the result is not the same as what right-click → View Page Source shows. It turns out the return value of read() is bytes (the output starts with b'...). What is that? Well, we could actually parse this return value directly and extract the image addresses from it. But if you want the same HTML text you see in the browser, change the previous line of code to:

html = res.read().decode('utf-8')

Then print(html):




[Screenshot: an excerpt of print(html)'s output, now as readable text]

OK! This is because decode('utf-8') decodes the bytes that read() returns as UTF-8 text. Even so, we'll keep using html = res.read(), because the raw bytes also contain the information we need.
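To make the bytes-versus-text distinction concrete, a minimal sketch (note that read() consumes the response, so it can only be called once per urlopen()):

    res = urllib.request.urlopen(url)        # a fresh response
    html_bytes = res.read()                  # bytes: the raw response body
    html_text = html_bytes.decode('utf-8')   # str: the same content, decoded as text
    print(type(html_bytes))                  # <class 'bytes'>
    print(type(html_text))                   # <class 'str'>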

So far we have used only 4 lines of Python code to download the HTML code of the web page http://jandan.net/ooxx/page-2397#comments and store it in the variable html, as follows:

import urllib.request

# Download the web page
url = 'http://jandan.net/ooxx/page-2397#comments'
res = urllib.request.urlopen(url)
html = res.read()

II. Parse the image addresses

Next, parse html using beautifulsoup4.

How do you find the HTML code for a given picture? Right-click on the page → Inspect. The left half of the screen shows the original page; the right half shows the HTML code plus a set of tool buttons.




At the left of the Elements panel is a selection arrow; click it so it turns blue, then click a picture on the left-hand page, and the corresponding code is automatically highlighted in the HTML on the right. That highlighted part is the HTML code for this image! The arrow is used to locate the code corresponding to any element on the page.




Take a closer look at this piece of code:

You can see that the src="//wx2.sinaimg.cn/mw600/66b3de17gy1fdrf0wcuscj20p60zktad.jpg" part is the address of this image; src stands for source. The style attribute after src describes its presentation, which we can ignore. You can experiment right now: put http: in front of the src value and visit http://wx2.sinaimg.cn/mw600/66b3de17gy1fdrf0wcuscj20p60zktad.jpg to see the original picture.


So the value of src is the image link we need. Note that inside the tag, src and the image address, like style and max-width with their values, form key-value pairs. This matters for the method we'll use to extract the image addresses later.

Looking at the code for the other pictures, you can see they all follow the same format: each is wrapped in an <img> tag.

Use BeautifulSoup() to parse the HTML. Besides passing in html, we also pass an 'html.parser' argument, which tells BeautifulSoup() to parse the variable html as HTML. A parser is a syntactic analyzer.

soup = BeautifulSoup(html, 'html.parser')

This line parses html into a soup object, which we can manipulate conveniently. For example, extract only the <img> tags:

result = soup.find_all('img')

This uses the find_all() method.

print(result) shows that result is a list. Each element holds an src-to-image-address key-value pair, but each also contains the <img> markup, the style attribute, and other things we don't need.
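To see the key-value structure directly, you can poke at a single element of result (a sketch; the attribute names assume the page layout described above):

    tag = result[0]
    print(tag.name)        # 'img'
    print(tag.get('src'))  # the image address, e.g. //wx2.sinaimg.cn/...
    print(tag.attrs)       # all of the tag's attributes as a key-value dict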




[Screenshot: an excerpt of print(result)'s output]

Use the get() method to extract the address between the double quotes, and prepend http: to it.

links = []
for content in result:
    links.append('http:' + content.get('src'))

content.get('src') returns the value in content corresponding to the key src, i.e. the address between the double quotes.

links.append() is the usual way to add an element to a list.

print(links) shows that each element of the list is one of the picture addresses that was between the double quotes. For example:




[Screenshot: an excerpt of print(links)'s output]

Open any of those addresses in a browser and you'll see the corresponding picture! That means we're down to the last step: downloading them!

The address-extraction part is complete. The code is fairly concise, as follows:

# Parse the web page
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all('img')
links = []
for content in result:
    links.append('http:' + content.get('src'))
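As an aside, the same extraction can be written as a one-line list comprehension (equivalent result, assuming every <img> tag on the page has an src attribute):

    links = ['http:' + content.get('src') for content in result]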

III. Download the images

Finally, visit the addresses in links and download the pictures!

At the beginning of the file, add:

import os

To hold the downloaded pictures, create a photo folder; the following code creates it in the same directory as the program's .py file.

if not os.path.exists('photo'):
    os.makedirs('photo')
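On Python 3.2 and later, makedirs() can handle the existence check itself, so an equivalent one-liner is:

    os.makedirs('photo', exist_ok=True)  # creates the folder only if it doesn't exist yet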

We know that links is a list, so the natural approach is to loop over it, downloading, naming, and saving the pictures one at a time.



i = 0
for link in links:
    i += 1
    filename = 'photo\\' + 'photo' + str(i) + '.png'
    urllib.request.urlretrieve(link, filename)

i is a counter variable, and i += 1 advances it on each pass through the loop.

filename names the image; more precisely, it is the path of the file that will be created and have the picture written into it. As the assignment shows, 'photo\\' places the file inside the photo folder, 'photo' + str(i) numbers the files in order (the downloads come out as photo1, photo2, photo3, and so on), and '.png' is the file extension. Joining strings with the + sign is common practice in Python.
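As an aside, os.path.join() builds the same path portably (it picks the right separator for any operating system):

    filename = os.path.join('photo', 'photo' + str(i) + '.png')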

This last line does the real work: urllib.request.urlretrieve(link, filename) visits link and saves a copy of what it finds to filename.

Note that we don't have to open() the file ourselves: urlretrieve() creates the file at the given path and writes the image into it.
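For the curious, here is a rough hand-rolled equivalent (a sketch) using urlopen() together with open(). open() takes two arguments, a file path and an open mode; 'wb' means write in binary, since an image is bytes rather than text:

    with open(filename, 'wb') as file:
        file.write(urllib.request.urlopen(link).read())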

Once all three parts are written, click Run! You'll find the photo folder in the same directory as the .py file, holding everything we downloaded.




The complete code is as follows:

import urllib.request
from bs4 import BeautifulSoup
import os

# Download the web page
url = 'http://jandan.net/ooxx/page-2397#comments'
res = urllib.request.urlopen(url)
html = res.read()

# Parse the web page
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all('img')
links = []
for content in result:
    links.append('http:' + content.get('src'))

# Download and save the images
if not os.path.exists('photo'):
    os.makedirs('photo')
i = 0
for link in links:
    i += 1
    filename = 'photo\\' + 'photo' + str(i) + '.png'
    urllib.request.urlretrieve(link, filename)

This small program is written in a procedural style, running from top to bottom with no functions defined, which may be easier to understand for beginners who are just getting started.

Links to the other pages of pictures

In http://jandan.net/ooxx/page-2397#comments, only the number in the middle changes, ranging from 1 to 2xxx.

url = 'http://jandan.net/ooxx/page-' + str(i) + '#comments'

Change the value of i and you can download in bulk. However, some commenters say that frequent visits may get your IP blocked by the site; I haven't verified this, so please try it yourself!
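A sketch of what that bulk loop might look like (assumptions: page numbers simply count up from 1, and a short pause between requests lowers the risk of a ban):

    import time

    for i in range(1, 2398):
        url = 'http://jandan.net/ooxx/page-' + str(i) + '#comments'
        # ... download, parse, and save exactly as in the full program above ...
        time.sleep(1)  # be polite: pause between requests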
