Python Crawler Learning Notes: Regular Expressions


Use of Regular Expressions

To learn about Python crawlers, you must first understand the use of regular expressions. Let's take a look at how to use them.

Here the dot (.) is equivalent to a placeholder: it matches any single character. What does that mean? Let's look at an example.

import re

content = "helloworld"
b = re.findall('w.', content)
print b

Note that we first imported the re module. Can you guess the output? Since . stands for any single character, the pattern 'w.' matches 'w' plus the character after it, so the output is of course ['wo'].
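To make the placeholder idea concrete, here is a minimal sketch (the sample string is my own, not from the original note) showing that the dot matches any single character in that position:

# '.' stands for exactly one arbitrary character,
# so 'h.llo' matches hello, hallo and hxllo alike
print re.findall('h.llo', 'hello hallo hxllo')   # ['hello', 'hallo', 'hxllo']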

The * usage is different from the preceding: * matches the previous character any number of times, including zero. See the example.

content = "helloworldhelloworld" b = re.findall('w*',content) print b

The output is ['', '', '', '', '', 'w', '', '', '', '', '', '', '', '', '', 'w', '', '', '', '', '']. Because * also matches zero occurrences, findall produces an empty string at every position that has no 'w' (one per position, plus one at the end of the string) and 'w' wherever a w is found.
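To see where all the empty strings come from, contrast * with +, which requires at least one occurrence and therefore never produces a zero-length match. A quick sketch (my own example, not from the original note):

content = "helloworldhelloworld"
# '+' means one or more, so empty matches are impossible
b = re.findall('w+', content)
print b   # ['w', 'w']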

.* usage: the combination .* is greedy, matching as much content as possible. For example:

content = "helloworldhelloworldworld" b = re.findall('he.*ld',content) print b

It will output ['helloworldhelloworldworld']. Why does it swallow everything from the first he to the last ld instead of stopping at each helloworld? Because the match is greedy: the engine looks for the longest content that satisfies the pattern.

.*? is the opposite of the above: this combination finds the shortest possible qualifying content and puts each match into the list, as shown below.

content = 'xxhelloworldxxxxhelloworldxx'
b = re.findall('xx.*?xx', content)
print b

The output is ['xxhelloworldxx', 'xxhelloworldxx']. The surrounding xx is just noise, though. How can it be removed? It's easy: just add a pair of parentheses. Where should they go?

content = 'xxhelloworldxxxxhelloworldxx'
b = re.findall('xx(.*?)xx', content)
print b
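Now b is ['helloworld', 'helloworld']: the parentheses form a capture group, and findall returns only the captured part of each match. One related detail worth knowing (my own addition, not in the original note): with two or more groups, findall returns a tuple per match. A quick sketch:

content = 'width=1024 height=768'
# with multiple groups, each match becomes a tuple of the captured parts
b = re.findall(r'(\w+)=(\d+)', content)
print b   # [('width', '1024'), ('height', '768')]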

All of the above are cases where the content contains no line breaks. What changes if there is a line break?

content = '''xxhelloworld
xx'''
b = re.findall('xx(.*?)xx', content)
print b

This time the output is an empty list, because . does not match a newline by default. What should we do? When we write a web crawler, the page source is always more than one line; if we could only match within a single line, that would be embarrassing. Of course, there is a solution: the re.S flag (also known as re.DOTALL), which makes . match newlines as well.

content = '''xxhelloworld
xx'''
b = re.findall('xx(.*?)xx', content, re.S)
print b

Along the same lines, there is a very convenient technique for extracting numbers, as shown below.

content = '''xx123456
xx'''
# \d matches a single digit; the backslash was lost in the original post
b = re.findall(r'(\d+)', content, re.S)
print b
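The same \d+ pattern pulls every run of digits out of messier text as well. A small sketch with my own sample string:

content = 'width=1024, height=768, depth=32'
# r'...' raw strings keep the backslash intact
print re.findall(r'\d+', content)   # ['1024', '768', '32']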

Crawling Image Links from the Page Source and Downloading Them

This article is only the first step toward a web crawler, so the explanation stays brief. We will now use regular expressions to implement a manual web crawler. What does manual mean? We copy the source code of a web page and save it in a txt file, then filter out the information we want with a regular expression and download the images.

First, I searched for Linux desktop wallpapers and found a suitable web page (figure omitted).

Right-click to view the page source, press Ctrl+F to search for img src, find the relevant section, copy it, and paste it into a txt file (the code below assumes it is saved as source.txt).
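If you would rather not copy and paste by hand, the requests library introduced below can also save the page source for you. A minimal sketch; the URL is a placeholder for whatever page you found:

import requests

# hypothetical URL; substitute the wallpaper page you actually found
resp = requests.get('http://example.com/linux-wallpapers')
f = open('source.txt', 'w')
f.write(resp.text.encode('utf-8'))   # Python 2: encode unicode before writing
f.close()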

Then we can use the above knowledge to extract the information we want. The source code is as follows:

import re
import requests

f = open('source.txt', 'r')
html = f.read()
f.close()

# NOTE: the rest of this snippet (the pattern and the download loop)
# was cut off in the scraped source; a reconstructed sketch follows below.
pattern = '

First we open the txt file that holds the page source, read it, and close the file stream; then we extract the image links with the regular expression; finally we download each image with the get() method from requests. Note that requests is not bundled with Python: we need to download the package and put it into Python's Lib directory. On the download site, press Ctrl+F and search for the keyword requests; the following page is displayed:
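Since the snippet above was truncated, here is a minimal reconstruction of exactly the steps this paragraph describes: read source.txt, extract the image links with findall, and fetch each one with requests.get(). The pattern and the file-naming scheme are my own assumptions; the original pattern did not survive the scrape:

import re
import requests

# read the page source we saved by hand
f = open('source.txt', 'r')
html = f.read()
f.close()

# assumed pattern: capture whatever sits inside img src="..."
links = re.findall('img src="(.*?)"', html, re.S)

for link in links:
    print 'Downloading:' + link
    pic = requests.get(link)
    # assumed naming scheme: keep the last path segment as the file name
    fp = open(link.split('/')[-1], 'wb')
    fp.write(pic.content)
    fp.close()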

We can see that the download is a file with a .whl suffix. Manually change the suffix to .zip, decompress it, and you will get two directories; copy the one named requests into the Lib directory mentioned above. (These days, pip install requests accomplishes the same thing in one step.)

Now, let's take a look at the running results.

C:\Python27\python.exe E:/PythonCode/20160820/Spider.py
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112732422680200576.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112640070563900918.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112547718465744154.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112455366330382227.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112363014254719641.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112270662197888742.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112178310031994750.JPG
Downloading:http://n1.itc.cn/img8/wb/smccloud/fetch/2015/07/04/112085957910403853.JPG

Process finished with exit code 0

The download succeeded. Go to the picture directory and take a look at the downloaded images.

Note: when you experiment with page sources yourself, it is best to avoid links that contain Chinese characters; otherwise garbled characters may appear. I have only been learning Python for a short time and am not yet fluent with the fixes for Chinese encoding issues, so I will not cover them here. That is all for this article; if you have comments or questions, feel free to leave a message or contact me directly.
