Python crawler [1]: downloading pictures in batches

The sister-image section on the Jandan ("fried egg") site has beautiful, high-quality pictures. Today I will share how to batch download these pictures with python. Before writing this small program I watched a video by master Zhe of Station B (www.bilibili.com/video/av3286793). Thank you ~

Knowledge and tools:

#1 You need to understand the basic syntax of python. For this article, knowing how to work with lists, for ... in ... loops, and how to define a function is enough; with that you can fetch, parse, and save webpage files.

#2 You need to install the third-party library BeautifulSoup4. Installing it with pip is very convenient, and recent versions of python ship with pip. On windows, press the windows + x shortcut to open Command Prompt (administrator) and enter

pip install beautifulsoup4

then press Enter to run it.



If a message such as Successfully installed is displayed, the installation is complete. (A quick verification sketch follows this list.)

#3 No knowledge of html is required. However, you still need a browser that can view page source and inspect elements, such as chrome or firefox.

(If you do not have pip, search for how to install pip.)
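If you want to double-check the installation from inside python itself, a minimal sketch (the printed version number is just an example):

import bs4

# Prints the installed version, e.g. 4.x.x
print(bs4.__version__)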

1. Download the webpage

To download all the images on more than two thousand webpages, you must first learn to download a single webpage :). The url for this exercise is jandan.net/ooxx/page-2397#comments. In chrome or firefox, right-click and choose View page source. The webpage we see is what the browser renders after parsing source code written in html, js, css, and so on; the image addresses are contained in that source code, so the first step is to download the html code.







[Screenshot of the page source; part of the code is truncated.]

Use the python library urllib.request to download the webpage. urllib.request is an extensible library for accessing URLs using a variety of protocols.

import urllib.request

url = 'http://jandan.net/ooxx/page-2397#comments'
res = urllib.request.urlopen(url)

What does the urllib.request.urlopen() function do? As its name suggests, it opens a url. It accepts either a str (which is what we pass) or a Request object. Its return value is always an object that works like a context manager and comes with methods such as geturl(), info(), and getcode().

In fact, we don't need to worry about the details. We just need to remember that this function accepts a web address and returns an object containing all of that page's information, and that we can operate on this object.
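For example, here is a quick sketch of those methods in action (assuming the site responds normally):

import urllib.request

res = urllib.request.urlopen('http://jandan.net/ooxx/page-2397#comments')
print(res.geturl())   # the url that was actually retrieved
print(res.getcode())  # the HTTP status code, e.g. 200
print(res.info())     # the response headers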


Now read the html code out of the res object and assign it to the variable html, using the res.read() method.

html = res.read()

The html source code is now stored in html!

Try print(html):




[Screenshot of the output; part of the code is truncated.]

At this point you'll notice the result is not the same as what right-click → View page source shows. The return value of read() is of type bytes, so it prints with a b'... prefix. What is this? We could parse this return value as-is and still get the image addresses, but if you want html identical to what the browser shows, change the previous line of code to:

html = res.read().decode('utf-8')

Then print(html):




[Screenshot of the decoded output; part of the code is truncated.]

OK! This is because .decode('utf-8') decodes the bytes returned by read() as UTF-8 text. However, we will keep using html = res.read(), because the raw bytes already contain the information we need.
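A minimal sketch of the difference between the two return types (output shown in comments):

import urllib.request

res = urllib.request.urlopen('http://jandan.net/ooxx/page-2397#comments')
raw = res.read()              # bytes: prints with a b'...' prefix
text = raw.decode('utf-8')    # str: the same text the browser's view-source shows
print(type(raw), type(text))  # <class 'bytes'> <class 'str'>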

So far we have used only four lines of python code to download the html of http://jandan.net/ooxx/page-2397#comments and store it in the variable html. As follows:

import urllib.request

# Download the webpage
url = 'http://jandan.net/ooxx/page-2397#comments'
res = urllib.request.urlopen(url)
html = res.read()

2. Parse the addresses

Next, use beautifulsoup4 to parse the html.

How do we find where an image's html code is? Right-click the webpage and choose Inspect. The left half of the window shows the original page, and the right half shows the html code plus a bunch of tool buttons.




At the left of the Elements panel there is a selection arrow; click it so it turns blue, then click a picture in the page on the left, and part of the html code on the right is automatically highlighted. That highlighted part is the html code for the picture! The arrow is used to locate the code corresponding to any element on a webpage.




Take a closer look at this code:

We can see that src="//wx2.sinaimg.cn/mw600/66b3de17gy1fdrf0wcuscj20p60zktad.jpg" is the image's address; src means source. The style attribute after src controls its styling, and you don't need to worry about it. You can try adding http: in front of the src value and visiting the result in a browser.


So the src content is the image link address we need. Note that src and the image address, like style and max-width, are in a key-value relationship. This matters for the method we'll use to extract the image address.

Look at the code for other images and you will see the format is the same: each one is contained in an img tag.

Use BeautifulSoup() to parse the html. Besides html, we also pass the parameter 'html.parser', which tells BeautifulSoup() to parse the variable html as html. (A parser performs syntactic analysis.)

soup = BeautifulSoup(html, 'html.parser')

This line of code parses html into a soup object that is easy to work with. For example, extract only the img tags:

result = soup.find_all('img')

This uses the find_all() method.

print(result) shows that result is a list; each element contains the src/image-address key-value pair, along with some content we don't need.




[Screenshot of the result; part of the code is truncated.]

Use the get method to extract the address inside the double quotation marks, and add http: at the beginning:

links = []
for content in result:
    links.append('http:' + content.get('src'))

content.get('src') fetches the value of the src key in content, that is, the address inside the double quotation marks.

links.append() is the usual method for adding an element to a list.
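Here is a tiny sketch of both methods on a made-up img tag (the address is invented for illustration):

from bs4 import BeautifulSoup

# Parse a single made-up img tag and grab it
tag = BeautifulSoup('<img src="//example.com/a.jpg" style="max-width: 480px">',
                    'html.parser').img
links = []
links.append('http:' + tag.get('src'))
print(links)  # ['http://example.com/a.jpg']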

print(links) shows that each element in the list is one of the image addresses that was inside double quotation marks. For example:




[Screenshot of the list of links; part of the code is truncated.]

Open any of these addresses in a browser and you can see the corresponding picture! YO! That means we're only one step from the goal: download them!

Extracting the addresses is now complete, and the code is quite concise, as shown below:

# Parse the webpage
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all('img')
links = []
for content in result:
    links.append('http:' + content.get('src'))

3. Download the images

Finally, visit the addresses in links one by one and download the images!

Let's start:

import os

First, create a photo folder to store the downloaded images. The following code creates the photo folder in the same location as the program's .py file.

if not os.path.exists('photo'):
    os.makedirs('photo')

We know links is a list, so it's best to use a loop to download, name, and store the images one by one.



i = 0
for link in links:
    i += 1
    filename = 'photo\\photo' + str(i) + '.png'
    urllib.request.urlretrieve(link, filename)

i is a loop counter, and i += 1 increments it on each pass.

filename is the image's file name. In the assignment, 'photo\\' means the file lives in the photo folder, and the 'photo' + str(i) after it numbers the files in order, so after downloading you get photo1, photo2, photo3, and so on; '.png' is the extension. Joining strings with the + sign like this is a common practice in python.

urllib.request.urlretrieve(link, filename) visits the link address, retrieves a copy of the image, and stores it in filename. It creates and writes the file itself, so there is no need to open the file manually first.
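If you'd rather handle the file yourself, here is an equivalent sketch (it reuses links and the photo folder created above); note the 'wb' mode, since image data is binary:

for i, link in enumerate(links, start=1):
    filename = 'photo\\photo' + str(i) + '.png'
    with open(filename, 'wb') as file:  # 'wb' = write binary
        file.write(urllib.request.urlopen(link).read())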

After writing all three parts, hit Run! You'll find the photo folder in the same path as the .py file, containing all the images we downloaded ~




The complete code is as follows:

import urllib.request
from bs4 import BeautifulSoup
import os

# Download the webpage
url = 'http://jandan.net/ooxx/page-2397#comments'
res = urllib.request.urlopen(url)
html = res.read()

# Parse the webpage
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all('img')
links = []
for content in result:
    links.append('http:' + content.get('src'))

# Download and store the images
if not os.path.exists('photo'):
    os.makedirs('photo')

i = 0
for link in links:
    i += 1
    filename = 'photo\\photo' + str(i) + '.png'
    urllib.request.urlretrieve(link, filename)

This small program is written top-to-bottom in a process-oriented style, without defining any functions, which may be easier for beginners.
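For comparison, here is a sketch of the same flow organized into functions; the function names are my own, not from the original program:

import os
import urllib.request

from bs4 import BeautifulSoup

def download_page(url):
    # Fetch the raw html bytes of one page
    return urllib.request.urlopen(url).read()

def extract_links(html):
    # Pull the src attribute out of every img tag
    soup = BeautifulSoup(html, 'html.parser')
    return ['http:' + img.get('src') for img in soup.find_all('img') if img.get('src')]

def save_images(links, folder='photo'):
    # Download each link into the folder as photo1.png, photo2.png, ...
    if not os.path.exists(folder):
        os.makedirs(folder)
    for i, link in enumerate(links, start=1):
        urllib.request.urlretrieve(link, folder + '\\photo' + str(i) + '.png')

save_images(extract_links(download_page('http://jandan.net/ooxx/page-2397#comments')))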

Links to the other sister-image pages

In http://jandan.net/ooxx/page-2397#comments, only the number in the middle changes, ranging from 1 to 2xxx.

url = 'http://jandan.net/ooxx/page-' + str(i) + '#comments'

You can change the value of i to download in batches. However, some comments said that frequent access may get your IP blocked by the website; I haven't looked into this, so please try it yourself!
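As a hedged sketch only (the page range and the one-second delay are my own assumptions, not something I have tested against the site), a batch loop could pause between downloads to be gentler on the server:

import os
import time
import urllib.request

from bs4 import BeautifulSoup

if not os.path.exists('photo'):
    os.makedirs('photo')

count = 0
for page in range(2390, 2398):  # assumed page range; pick your own
    url = 'http://jandan.net/ooxx/page-' + str(page) + '#comments'
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        src = img.get('src')
        if not src:  # skip img tags without a src attribute
            continue
        count += 1
        urllib.request.urlretrieve('http:' + src, 'photo\\photo' + str(count) + '.png')
        time.sleep(1)  # pause so we don't hammer the server
print('Downloaded', count, 'images')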

