Python Crawler Learning Notes: Regular Expressions


Use of Regular Expressions

To learn about Python crawlers, you must first understand the use of regular expressions. Let's take a look at how to use them.

Here the dot (.) is equivalent to a placeholder: it matches any single character. What does that mean? Let's look at an example.

import re

content = "helloworld"
b = re.findall('w.', content)
print b

Note that we first imported the re module. Can you guess the output? Since . stands for any single character, the pattern 'w.' matches 'w' plus the character after it, so the output is of course ['wo'].
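To make the placeholder idea concrete, here is a minimal sketch (the sample string is my own, not from the original note) showing that the dot matches any single character in that position:

# '.' stands for exactly one arbitrary character,
# so 'h.llo' matches hello, hallo and hxllo alike
print re.findall('h.llo', 'hello hallo hxllo')   # ['hello', 'hallo', 'hxllo']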

The * usage is different from the preceding: * matches the previous character any number of times, including zero. See the example.

content = "helloworldhelloworld" b = re.findall('w*',content) print b

The output is ['', '', '', '', '', 'w', '', '', '', '', '', '', '', '', '', 'w', '', '', '', '', '']. Because * also matches zero occurrences, findall produces an empty string at every position that has no 'w' (one per position, plus one at the end of the string) and 'w' wherever a w is found.
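To see where all the empty strings come from, contrast * with +, which requires at least one occurrence and therefore never produces a zero-length match. A quick sketch (my own example, not from the original note):

content = "helloworldhelloworld"
# '+' means one or more, so empty matches are impossible
b = re.findall('w+', content)
print b   # ['w', 'w']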

.* usage: the combination .* is greedy, matching as much content as possible. For example:

content = "helloworldhelloworldworld" b = re.findall('he.*ld',content) print b

It will output ['helloworldhelloworldworld']. Why does it swallow everything from the first he to the last ld instead of stopping at each helloworld? Because the match is greedy: the engine looks for the longest content that satisfies the pattern.

.*? is the opposite of the above: this combination finds the shortest possible qualifying content and puts each match into the list, as shown below.

content = 'xxhelloworldxxxxhelloworldxx'
b = re.findall('xx.*?xx', content)
print b

The output is ['xxhelloworldxx', 'xxhelloworldxx']. The surrounding xx is just noise, though. How can it be removed? It's easy: just add a pair of parentheses. Where should they go?

content = 'xxhelloworldxxxxhelloworldxx'
b = re.findall('xx(.*?)xx', content)
print b
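Now b is ['helloworld', 'helloworld']: the parentheses form a capture group, and findall returns only the captured part of each match. One related detail worth knowing (my own addition, not in the original note): with two or more groups, findall returns a tuple per match. A quick sketch:

content = 'width=1024 height=768'
# with multiple groups, each match becomes a tuple of the captured parts
b = re.findall(r'(\w+)=(\d+)', content)
print b   # [('width', '1024'), ('height', '768')]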

All of the above are cases where the content contains no line breaks. What changes if there is a line break?

content = '''xxhelloworld
xx'''
b = re.findall('xx(.*?)xx', content)
print b

This time the output is an empty list, because . does not match a newline by default. What should we do? When we write a web crawler, the page source is always more than one line; if we could only match within a single line, that would be embarrassing. Of course, there is a solution: the re.S flag (also known as re.DOTALL), which makes . match newlines as well.

content = '''xxhelloworld
xx'''
b = re.findall('xx(.*?)xx', content, re.S)
print b

Along the same lines, there is a very convenient technique for extracting numbers, as shown below.

content = '''xx123456
xx'''
# \d matches a single digit; the backslash was lost in the original post
b = re.findall(r'(\d+)', content, re.S)
print b
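The same \d+ pattern pulls every run of digits out of messier text as well. A small sketch with my own sample string:

content = 'width=1024, height=768, depth=32'
# r'...' raw strings keep the backslash intact
print re.findall(r'\d+', content)   # ['1024', '768', '32']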

Crawling Image Links from the Page Source and Downloading Them

This article is only the first step toward a web crawler, so the explanation stays brief. We will now use regular expressions to implement a manual web crawler. What does manual mean? We copy the source code of a web page and save it in a txt file, then filter out the information we want with a regular expression and download the images.

First, I searched for Linux desktop wallpapers and found a suitable web page (figure omitted).

Right-click to view the page source, press Ctrl+F to search for img src, find the relevant section, copy it, and paste it into a txt file (the code below assumes it is saved as source.txt).
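If you would rather not copy and paste by hand, the requests library introduced below can also save the page source for you. A minimal sketch; the URL is a placeholder for whatever page you found:

import requests

# hypothetical URL; substitute the wallpaper page you actually found
resp = requests.get('http://example.com/linux-wallpapers')
f = open('source.txt', 'w')
f.write(resp.text.encode('utf-8'))   # Python 2: encode unicode before writing
f.close()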

Then we can use the above knowledge to extract the information we want. The source code is as follows:

import re
import requests

f = open('source.txt', 'r')
html = f.read()
f.close()

# NOTE: the rest of this snippet (the pattern and the download loop)
# was cut off in the scraped source; a reconstructed sketch follows below.
pattern = '

First we open the txt file that holds the page source, read it, and close the file stream; then we extract the image links with the regular expression; finally we download each image with the get() method from requests. Note that requests is not bundled with Python: we need to download the package and put it into Python's Lib directory. On the download site, press Ctrl+F and search for the keyword requests; the following page is displayed:
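Since the snippet above was truncated, here is a minimal reconstruction of exactly the steps this paragraph describes: read source.txt, extract the image links with findall, and fetch each one with requests.get(). The pattern and the file-naming scheme are my own assumptions; the original pattern did not survive the scrape:

import re
import requests

# read the page source we saved by hand
f = open('source.txt', 'r')
html = f.read()
f.close()

# assumed pattern: capture whatever sits inside img src="..."
links = re.findall('img src="(.*?)"', html, re.S)

for link in links:
    print 'Downloading:' + link
    pic = requests.get(link)
    # assumed naming scheme: keep the last path segment as the file name
    fp = open(link.split('/')[-1], 'wb')
    fp.write(pic.content)
    fp.close()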

We can see that the download is a file with a .whl suffix. Manually change the suffix to .zip, decompress it, and you will get two directories; copy the one named requests into the Lib directory mentioned above. (These days, pip install requests accomplishes the same thing in one step.)

Now, let's take a look at the running results.

C:\Python27\python.exe E:/PythonCode/20160820/Spider.py
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112732422680200576.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112640070563900918.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112547718465744154.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112455366330382227.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112363014254719641.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112270662197888742.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112178310031994750.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112085957910403853.JPG

Process finished with exit code 0

The download succeeded. Go to the picture directory and take a look at the downloaded images.

Note: when you experiment with page sources yourself, it is best to avoid links that contain Chinese characters; otherwise garbled characters may appear. I have only been learning Python for a short time and am not yet fluent with the fixes for Chinese encoding issues, so I will not cover them here. That is all for this article; if you have comments or questions, feel free to leave a message or contact me directly.
