Python crawler regular expression techniques and an example of crawling a personal blog
This blog post covers regular expression crawlers in data mining and analysis. It mainly introduces Python regular expression crawlers, describes common regular expression analysis methods, and finally crawls the author's personal blog website as a worked example. I hope this basic article helps you. If there are errors or deficiencies in this article, please forgive me. I have been really busy and haven't written a blog for a long time. Sorry ~
I. Regular Expression
A Regular Expression (Regex or RE), also known as a rule expression, is often used to retrieve and replace text that conforms to a certain pattern. It first defines some special characters and character combinations, then filters text through the combined "rule string" to obtain or match the specific content we want. Regular expressions are flexible, logical, and powerful, and can quickly extract the desired information from a string, but they can be difficult for newcomers.
1. re Module
Python supports regular expressions through the re module. You need to import the library before using regular expressions.
import re
The basic steps for using re are: first compile the string form of the regular expression into a Pattern instance, then use the Pattern instance to process the text and obtain a Match instance, and finally use the Match instance to get the required information. The most commonly used function is findall. Its prototype is as follows:
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags])
This function searches the string and returns all matching substrings as a list.
The flags parameter has three common values (the full name is given in parentheses):
(1) re.I (re.IGNORECASE): matching is case-insensitive.
(2) re.M (re.MULTILINE): multi-line mode; ^ and $ also match at the start and end of each line.
(3) re.S (re.DOTALL): "dot-all" mode; the dot matches any character, including newlines.
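A minimal interactive sketch of how these flags change matching (the sample string is hypothetical):
>>> import re
>>> text = "Hello\nWorld"
>>> print re.findall(r'hello.world', text)      # case-sensitive, and the dot excludes "\n"
[]
>>> print re.findall(r'hello.world', text, re.I | re.S)
['Hello\nWorld']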
The Pattern object is a compiled regular expression. A series of methods provided by Pattern can be used to search the text. Pattern cannot be instantiated directly; it must be constructed with re.compile.
2. compile Method
The re regular expression module includes some common operation functions, such as the compile() function. Its prototype is as follows:
compile(pattern[, flags])
This function creates a Pattern object from a string containing a regular expression and returns it. The flags parameter is the matching mode; multiple flags can be combined with the bitwise OR operator "|", or the mode can be specified inside the regular expression string itself. The Pattern object cannot be instantiated directly and can only be obtained through the compile method.
For example, use a regular expression to obtain the numeric content in a string, as shown below:
>>> import re
>>> string = "A1.45,b5,6.45,8.82"
>>> regex = re.compile(r"\d+\.?\d*")
>>> print regex.findall(string)
['1.45', '5', '6.45', '8.82']
3. match Method
The match method matches pattern from the pos subscript of the string. If the pattern matches completely, a Match object is returned. If the pattern fails to match somewhere along the way, or if endpos is reached before the match is complete, None is returned. The method prototype is as follows:
match(string[, pos[, endpos]]) | re.match(pattern, string[, flags])
The string parameter is the string to match; pos and endpos are subscripts whose default values are 0 and len(string) respectively. The flags parameter specifies the matching mode, as when compiling a pattern.
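A minimal interactive sketch of match (the sample string is hypothetical):
>>> import re
>>> p = re.compile(r'\d+')
>>> print p.match('abc123')             # the string does not start with a digit
None
>>> print p.match('abc123', 3).group()  # start matching at pos 3
123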
4. search Method
The search method is used to find a successfully matching substring anywhere in a string. It matches pattern from the pos subscript of the string; if the pattern can be matched, a Match object is returned. If not, pos is increased by 1 and the match is retried; if pos reaches endpos and still nothing matches, None is returned. The function prototype is as follows:
search(string[, pos[, endpos]]) | re.search(pattern, string[, flags])
The string parameter is the string to match; pos and endpos are subscripts whose default values are 0 and len(string) respectively. The flags parameter specifies the matching mode, as when compiling a pattern.
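By contrast with match, search scans forward through the string; a minimal sketch with the same hypothetical pattern:
>>> import re
>>> p = re.compile(r'\d+')
>>> print p.search('abc123').group()    # scans past 'abc' and finds the digits
123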
5. group and groups Methods
The group([group1, …]) method is used to obtain the substring captured by one or more groups. If multiple parameters are specified, the substrings are returned as a tuple. The groups([default]) method returns the substrings captured by all groups as a tuple, which is equivalent to calling group(1, 2, ..., last). The default argument is the value substituted for groups that captured nothing; it defaults to None.
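A minimal interactive sketch (the e-mail string is a hypothetical example):
>>> import re
>>> m = re.match(r'(\w+)@(\w+)\.com', 'eastmount@csdn.com')
>>> print m.group()        # the whole match
eastmount@csdn.com
>>> print m.group(1, 2)    # multiple arguments return a tuple
('eastmount', 'csdn')
>>> print m.groups()       # all captured groups
('eastmount', 'csdn')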
II. Common Methods for Capturing Network Data Using Regular Expressions
In this section, the author introduces some frequently used techniques for capturing network data with regular expressions. These techniques were summarized during the author's natural language processing and data crawling programming. They may not be very systematic, but they should give readers some ideas for capturing data and for solving practical problems.
1. Capture the content between tags
The HTML language writes websites with tag pairs, consisting of a start tag and an end tag, for example, <html></html>, <title></title>, and <a></a>.
(1) capture the content between title tags
First, crawl the webpage title. The regular expression is '<title>(.*?)</title>', and the code for crawling Baidu's title is as follows:
# coding=utf-8
import re
import urllib

url = "http://www.baidu.com/"
content = urllib.urlopen(url).read()
title = re.findall(r'<title>(.*?)</title>', content)
print title[0]
# Baidu, you will know
The code calls the urlopen() function of the urllib library to open the hyperlink, then uses the findall() function of the regular expression library to find the content between the title tags. Since findall() returns all text matching the regular expression as a list, the first element title[0] is printed. The following is another way to obtain the tag content, using lookbehind and lookahead assertions:
pat = r'(?<=<title>).*?(?=</title>)'
ex = re.compile(pat, re.M | re.S)
obj = re.search(ex, content)
title = obj.group()
print title
# Baidu, you will know
(2) capture the content between hyperlink tags
In HTML, <a href=URL></a> identifies a hyperlink. The test03_08.py file obtains both the complete hyperlinks and the content between the hyperlink tags <a> and </a>.
# coding=utf-8
import re
import urllib

url = "http://www.baidu.com/"
content = urllib.urlopen(url).read()

# get the complete hyperlinks
res = r"<a.*?href=.*?<\/a>"
urls = re.findall(res, content)
for u in urls:
    print unicode(u, 'utf-8')

# get the content between <a> and </a>
res = r'<a.*?>(.*?)</a>'
texts = re.findall(res, content, re.S | re.M)
for t in texts:
    print unicode(t, 'utf-8')
The output is as follows. If print u or print t is used directly, garbled characters may appear, so the unicode(u, 'utf-8') function is called to transcode the text.
# the complete hyperlinks
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">News</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">Map</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">video</a>
...
# the content between <a> and </a>
News
hao123
Map
video
...
(3) capture the content between tr and td tags
Common layouts in webpages are table layouts and div layouts. A table layout commonly uses the tags tr, th, and td: a table row is tr (table row), table data is td (table data), and a table header cell is th (table heading). So how can we capture the content between these tags? The following code obtains the content between them.
Assume that the HTML code is as follows:
<html>
<body>
<table border=1>
  <tr><th>Student ID</th><th>Name</th></tr>
  <tr><td>1001</td><td>Yang Xiuyu</td></tr>
  <tr><td>1002</td><td>Yan Na</td></tr>
</table>
</body>
</html>
The Python code for crawling the corresponding value is as follows:
# coding=utf-8
import re
import urllib

content = urllib.urlopen("test.html").read()  # open a local file

# get the content between <tr> and </tr>
res = r'<tr>(.*?)</tr>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    print m

# get the content between <th> and </th>
for m in texts:
    res_th = r'<th>(.*?)</th>'
    m_th = re.findall(res_th, m, re.S | re.M)
    for t in m_th:
        print t

# directly get the content between <td> and </td>
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    print m[0], m[1]
The output result is as follows: first the content between the tr tags is obtained; then, within that content, the values between <th> and </th>, i.e. "Student ID" and "Name", are extracted; finally, the content between the two <td> tags is obtained directly.
>>>
<th>Student ID</th><th>Name</th>
<td>1001</td><td>Yang Xiuyu</td>
<td>1002</td><td>Yan Na</td>
Student ID
Name
1001 Yang Xiuyu
1002 Yan Na
>>>
2. Capture the parameters in the tag
(1) capture the URL of hyperlink tags
The basic format of an HTML hyperlink is "<a href=URL>link content</a>". The URL link address is obtained as follows:
# coding=utf-8
import re

content = '''
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">News</a>
<a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">Map</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">video</a>
'''
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
urls = re.findall(res, content, re.I | re.S | re.M)
for url in urls:
    print url
The output content is as follows:
>>>
http://news.baidu.com
http://www.hao123.com
http://map.baidu.com
http://v.baidu.com
>>>
(2) capture the URL of image tags
The basic format of the tag used to insert an HTML image is <img src=URL />. The URL of the image is obtained as follows:
import re

content = '''<img src="http://www..csdn.net/eastmount.jpg" />'''
urls = re.findall('src="(.*?)"', content, re.I | re.S | re.M)
print urls
# ['http://www..csdn.net/eastmount.jpg']
The hyperlink corresponding to the image is "http://www..csdn.net/eastmount.jpg". So how can we get the last parameter in the URL, the image file name?
(3) obtain the last parameter in the URL
Split the URL on the "/" character and take the last element:
urls = 'http://www..csdn.net/eastmount.jpg'
name = urls.split('/')[-1]
print name
# eastmount.jpg
The code in this section splits the string on the "/" character and takes the last element, that is, the image name.
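If the URL carried query parameters, strip them before splitting; a minimal sketch (the "?v=1" suffix is a hypothetical example):
url = 'http://www..csdn.net/eastmount.jpg?v=1'  # hypothetical query string
name = url.split('?')[0].split('/')[-1]  # drop the query, then take the file name
print name
# eastmount.jpg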
3. string processing and replacement
When crawling webpage text with a regular expression, you usually need to call the find() function to locate a specified position first and then crawl further. For example, to obtain a table whose class attribute is "infobox", first locate the table and then crawl its content.
start = content.find(r'<table class="infobox"')  # start position
end = content.find(r'</table>')                  # end position
infobox = content[start:end]
print infobox
At the same time, irrelevant content may be captured while crawling and needs to be filtered out. Here it is recommended to combine the replace function with regular expressions. For example, the crawled content is as follows:
# coding=utf-8
import re

content = '''
<tr><td>1001</td><td>Yang Xiuyu<br/></td></tr>
<tr><td>1002</td><td>Yan Na</td></tr>
<tr><td>1003</td><td><B>Python</B></td></tr>
'''
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    print m[0], m[1]
The output is as follows:
>>>
1001 Yang Xiuyu<br/>
1002 Yan Na
1003 <B>Python</B>
>>>
In this case, you need to filter out the redundant strings, such as line breaks (<br/>), non-breaking spaces (&nbsp;), and bold markup (<B></B>).
The filter code is as follows:
# coding=utf-8
import re

content = '''
<tr><td>1001</td><td>Yang Xiuyu<br/></td></tr>
<tr><td>1002</td><td>Yan Na</td></tr>
<tr><td>1003</td><td><B>Python</B></td></tr>
'''
res = r'<td>(.*?)</td><td>(.*?)</td>'
texts = re.findall(res, content, re.S | re.M)
for m in texts:
    value0 = m[0].replace('<br/>', '').replace('&nbsp;', '')
    value1 = m[1].replace('<br/>', '').replace('&nbsp;', '')
    if '<B>' in value1:
        m_value = re.findall(r'<B>(.*?)</B>', value1, re.S | re.M)
        print value0, m_value[0]
    else:
        print value0, value1
The replace() calls replace the strings "<br/>" and "&nbsp;" with the empty string to filter them out, while a regular expression extracts the content inside the bold markup (<B></B>). The output result is as follows:
>>>
1001 Yang Xiuyu
1002 Yan Na
1003 Python
>>>
III. Crawling a Personal Blog Instance
Having described regular expressions, common network data crawling modules, and common regular expression methods for extracting data, this section presents a simple example of crawling a website with regular expressions: the author crawls his own personal blog website and obtains the required content.
The author's personal website "http://www.eastmountyxz.com/" is opened as shown in the following figure.
Suppose the content to be crawled is as follows:
1. The title of the blog website.
2. Crawl the hyperlinks of all images on the homepage.
3. Crawl the title, hyperlink, and abstract of each of the four articles on the blog homepage. For example, the first title is "Goodbye Beijing Institute of Technology: recalling the programming time of graduate students".
1. analysis process
Step 1: Locate the elements in the browser's source code
First, locate the source code of these elements in the browser and find the pattern among them. This is called DOM document node tree analysis: find the attributes and attribute values of the nodes to be crawled, as shown in Figure 3.6.
The title "Goodbye Beijing Institute of Technology: recalling the programming time of graduate students" is located under the <div class="essay"></div> node, which contains an <a> hyperlink holding the title and a <p> paragraph holding the abstract. The other three articles sit in the sibling nodes <div class="essay1"></div>, <div class="essay2"></div>, and <div class="essay3"></div>.
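A hedged sketch of the assumed node structure, reconstructed from the regular expressions used below rather than from the site's exact source:
<div class="essay">
  ... <a href="article URL">article title</a> ...
  <p style="...">article abstract ...</p>
</div>
<div class="essay1"> ...second article... </div>
<div class="essay2"> ...third article... </div>
<div class="essay3"> ...fourth article... </div>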
Step 2: Use the regular expression to crawl the title
The website title is usually located between the <title> and </title> tags and can be crawled with the same regular expression used in the Baidu example, as the sketch below shows.
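A minimal sketch, reusing the earlier title regular expression on the author's homepage:
import re
import urllib

url = "http://www.eastmountyxz.com/"
content = urllib.urlopen(url).read()
title = re.findall(r'<title>(.*?)</title>', content)
print title[0]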
Step 3: Use the regular expression to crawl all image addresses
Because the format of the HTML image tag is "<img src=URL />", all image addresses can be crawled with the following code:
import re
import urllib

url = "http://www.eastmountyxz.com/"
content = urllib.urlopen(url).read()
urls = re.findall(r'src="(.*?)"', content)
for url in urls:
    print url
A total of six images are displayed, but each src value omits the blog address "http://www.eastmountyxz.com/", so it must be prepended to form the complete image URL.
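A minimal sketch of completing them, continuing from the urls list above and assuming every src value is a path relative to the site root:
# prepend the blog address to each relative src path
base = "http://www.eastmountyxz.com/"
for u in urls:
    print base + u.lstrip('/')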
Step 4: use regular expressions to crawl Blog content
Step 1 described how to locate the titles of the four articles. The first article is located between the <div class="essay"> and </div> tags, the second between <div class="essay1"> and </div>, and so on. However, this HTML code has a flaw: the class attribute usually represents a class of tags, so its value should be the same for all of them. The class attribute of all four articles should therefore be "essay", with the name or id attribute used to identify each one uniquely.
Use the find() function to locate the start tags <div class="essay"> and <div class="essay1"> and obtain the content between them. For example, the following code obtains the title and hyperlink of the first article:
import re
import urllib

url = "http://www.eastmountyxz.com/"
content = urllib.urlopen(url).read()
start = content.find(r'<div class="essay">')
end = content.find(r'<div class="essay1">')
print content[start:end]
The code consists of three steps:
(1) Call the urlopen() function of the urllib library to open the blog address and read the content into the content variable.
(2) Call the find() function to locate specific content; for example, the div tag whose class attribute is "essay" is used to find the start and end positions in turn.
(3) Perform the next step of analysis and obtain the hyperlinks and titles in the located source code.
After locating this section, use regular expressions to obtain the specific content. The code is as follows:
import re
import urllib

url = "http://www.eastmountyxz.com/"
content = urllib.urlopen(url).read()
start = content.find(r'<div class="essay">')
end = content.find(r'<div class="essay1">')
page = content[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print t1[0]
t2 = re.findall(r'<a.*?>(.*?)</a>', page)  # title
print t2[0]
t3 = re.findall(r'<p style=.*?>(.*?)</p>', page, re.M | re.S)  # abstract
print t3[0]
Regular expressions are called to obtain each piece of content separately. Because the crawled paragraph (<p>) contains line breaks, re.M and re.S are added so matching can cross lines. The output result is as follows:
>>>
Two years ago, I finished my undergraduate studies and wrote an article titled "Recalling My Four University Years of Gains and Losses", expressing my feelings about what I gained and lost during four years at Beijing Institute of Technology; two years later, I have left the imperial capital and gone back to my hometown in Guizhou to start a new career as a teacher. I would like to write an article here to commemorate it! Or, as that saying goes: I wrote this article for myself. I hope that many years from now I will recall my six years in Beijing as a wonderful memory. The article may be a bit long, but I hope you can read it as patiently as you would a novel...
>>>
2. Code Implementation
For the complete code, see the test03_10.py file. The code is as follows.
# coding: utf-8
import re
import urllib

url = "http://www.eastmountyxz.com/"
content = urllib.urlopen(url).read()

# crawl the title
title = re.findall(r'<title>(.*?)</title>', content)
print title[0]

# crawl the image URLs
urls = re.findall(r'src="(.*?)"', content)
for url in urls:
    print url

# crawl the first article
start = content.find(r'<div class="essay">')
end = content.find(r'<div class="essay1">')
page = content[start:end]
res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
t1 = re.findall(res, page)  # hyperlink
print t1[0]
t2 = re.findall(r'<a.*?>(.*?)</a>', page)  # title
print t2[0]
t3 = re.findall(r'<p style=.*?>(.*?)</p>', page, re.M | re.S)  # abstract
print t3[0]
print ''

# crawl the second article in the same way
start = content.find(r'<div class="essay1">')
end = content.find(r'<div class="essay2">')
page = content[start:end]
t1 = re.findall(res, page)  # hyperlink
print t1[0]
t2 = re.findall(r'<a.*?>(.*?)</a>', page)  # title
print t2[0]
t3 = re.findall(r'<p style=.*?>(.*?)</p>', page, re.M | re.S)  # abstract
print t3[0]
Output result.
From the code above, you will find that crawling a website with regular expressions is cumbersome, especially when locating webpage nodes. Later articles will describe commonly used third-party Python packages and use their functions for more targeted crawling.
I hope this article helps you, especially readers who are new to crawlers or who have encountered similar problems. For real projects, libraries such as BeautifulSoup, Selenium, and Scrapy are recommended for crawling data.
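As a taste of those libraries, a minimal hedged sketch of the same title-and-links crawl with BeautifulSoup (this assumes the bs4 package is installed, and keeps the Python 2 style used throughout this article):
from bs4 import BeautifulSoup
import urllib

content = urllib.urlopen("http://www.eastmountyxz.com/").read()
soup = BeautifulSoup(content, "html.parser")
print soup.title.string               # page title
for a in soup.find_all('a'):          # every hyperlink and its text
    print a.get('href'), a.get_text()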
That is all for these Python crawler regular expression techniques and the example of crawling a personal blog. I hope it serves as a useful reference.