Python Regular Expressions (Python web crawler)

Source: Internet
Author: User
Tags python web crawler

Yesterday was January 31, 2018, the 15th day of the lunar month. Around 20:00, a once-in-152-years combination of a lunar eclipse, a blood moon, and a blue moon was due to appear in the sky. I did not see the blue moon or the blood moon, and only barely caught the eclipse, but I could still picture a bottle of Blue Moon laundry detergent hanging in the air while a voice in my ear said, "Hello everyone, I am Zha Zha Hui, here to recommend a fun game: Tanwan Lanyue ......" Around 22:00, with the moon out again, I sat down to write this blog post.

For programmers, what is the hottest direction at the current frontier? I think it is big data, cloud computing, and artificial intelligence, with AI currently in the lead. A product's popularity always lifts the popularity of what it depends on, and in recent years Python has gained so much momentum that it is chasing Java for first place among programming languages. Python is popular not only because of its support for AI and machine learning, but also for its own merits: the language is easy to learn and understand, to the point that even primary schools have started offering Python classes. Python is also known as a "glue language" and integrates well with other languages. This has led many programmers to turn to Python, all the more so with the AI era approaching.

I have also been learning Python recently, and learning Python very likely means running into web crawlers. A crawler, put simply, is a program that crawls data from the web. And crawlers are inseparable from the topic of this article: regular expressions.

Regular expressions (also called rule expressions): Baidu Encyclopedia explains that they are commonly used to retrieve and replace text that conforms to a pattern (rule). This is a bit like brute-force password software that cracks passwords by working through dictionary combinations. It is also a test of a computer's processing power.
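The "retrieve and replace" idea can be sketched in a few lines. This is a minimal illustration, assuming Python 3; the sample text and the names date_pattern, dates and masked are made up for the example and are not from the original post.

```python
import re

# Made-up sample text for illustration.
text = "Order 66 shipped on 2018-01-31, order 67 on 2018-02-01."

# Retrieve: find every substring that looks like a YYYY-MM-DD date.
date_pattern = r"\d{4}-\d{2}-\d{2}"
dates = re.findall(date_pattern, text)
print(dates)   # ['2018-01-31', '2018-02-01']

# Replace: substitute every date with a placeholder.
masked = re.sub(date_pattern, "<date>", text)
print(masked)  # Order 66 shipped on <date>, order 67 on <date>.
```

re.findall covers the retrieval half and re.sub the replacement half; both take the pattern first and the text to scan after it.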

I learned Python regular expressions from a video course. I do not know whether my notes are comprehensive, so comments are welcome.

Regular expressions have the following special characters:
1) ^ $ . * ? + {2} {2,} {2,5}
2) [] [^] [a-z] | ()
3) \s \S \w \W
4) [\u4e00-\u9fa5] \d

I'll explain each symbol and give an example.

Open any IDE (integrated development environment) that can write Python, such as Eclipse + PyDev or PyCharm. I use Eclipse + PyDev as the example.

(For PyDev configuration and usage, search Baidu; I will write a follow-up post on it later.) Chinese characters are hard to avoid when writing Python, and in its initial state Eclipse + PyDev prints them to the screen garbled, so the following settings are needed:

In Eclipse, under Window > Preferences > Workspace > Text file encoding, change Default (GBK) to Other: UTF-8.

With the editor active, press Alt + Enter; in the window that appears, under Resource > Text file encoding, change Default to Other: UTF-8.

In Eclipse, under Window > Preferences > General > Content Types, expand Text in the list and select Python File, enter UTF-8 in the Default encoding box below, and click Update to apply. You can also set UTF-8 for Java Source File if you need to.

For more details, please go to https://www.cnblogs.com/jackge/archive/2013/05/19/3086944.html

First, write the following in the editor:

    # coding: utf-8
    import re

The first line declares the source encoding as UTF-8; the second imports the regular expression module.

1) ^ means the string must start with the given character: ^e means the string to be matched must begin with the character e.

$ is the opposite and anchors the end: e$ means the string to be matched must end with the character e.

. matches any single character at the given position in the string to be matched (by default, any character except a newline).

* means the preceding character can occur any number of times (n >= 0): e* allows e to appear zero or more times at that position in the string to be matched.

Here is an example that uses ^ $ . * together. (re.match starts matching at the beginning of the string line. If regex_str matches there, it returns a Match object; if not, it returns None. To match the whole line exactly, end the pattern with $.)

    # coding: utf-8
    import re  # import the regular expression module

    line = "hello world"  # string to match
    regex_str = "(^h.*d$)"
    # re.match(pattern, string) starts matching at the beginning of the string.
    # It returns a Match object on success and None on failure;
    # end the pattern with $ to require a match through the end of line.
    if re.match(regex_str, line):
        print('True')
    else:
        print('False')

The result is True, so the match succeeded: ^h requires the string to start with h, .* allows any characters any number of times, and d$ requires it to end with d. The string meets all three conditions, so True is printed.
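To see what the anchors add, the following sketch compares three patterns on the same string. It assumes Python 3; the variable names prefix_only, whole_line and wrong_start are hypothetical and chosen only for this illustration.

```python
import re

line = "hello world"

# re.match only anchors at the start, so matching a prefix is enough.
prefix_only = re.match("h.*o", line)
# Adding $ forces the pattern to reach the end of the string.
whole_line = re.match("^h.*d$", line)
# A different first character fails immediately and returns None.
wrong_start = re.match("^w.*d$", line)

print(prefix_only.group(0))  # hello wo
print(whole_line.group(0))   # hello world
print(wrong_start)           # None
```

Because re.match always begins at the first character, the ^ in these patterns is redundant but harmless; the $ is what turns a prefix match into a whole-line match.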

? makes a match non-greedy. Regular expressions are greedy by default: the engine matches as many characters as possible, which is sometimes not what we want. To stop the greedy matching, ? comes in handy. (group() extracts the string captured by a group; () creates a group.)

    # coding: utf-8
    import re  # import the regular expression module

    line = "heedeeehhlhloooooooo World"
    regex_str = ".*(h.*h).*"
    match_obj = re.match(regex_str, line)
    if match_obj:
        # group() extracts the string captured by a group; () creates a group
        print(match_obj.group(1))
    # Result: 'hlh'
    # 'heedeeeh', 'hh', 'hhlh' and 'hlh' all satisfy the condition, but the
    # greedy match takes in as many of the characters as it can, so the
    # group ends up capturing the last candidate that satisfies it.

The result is hlh. That raises a question: heedeeeh, hh, hhlh and hlh all satisfy the condition, and the first three appear earlier, so why is hlh extracted? The cause is greedy matching: the leading .* tries to swallow as much of the string as it can and gives characters back only until (h.*h) can still match, a bit like picking up the sesame seeds and losing the watermelon. So the group captures the last qualifying candidate, hlh. One way to picture it is that a greedy match works in reverse (from right to left) and keeps the first result it finds.

To prevent greediness, add ? after the leading .* (that is, just before the group). Is that enough? Let's see.

    # coding: utf-8
    import re  # import the regular expression module

    line = "heedeeehhlhloooooooo World"
    regex_str = ".*?(h.*h).*"
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))

The result is heedeeehhlh, which is still not what I expected; I only wanted heedeeeh. The reason is that only the leading .* became non-greedy, while the .* inside the group is still greedy and stretches to the last h. So that one needs a ? as well (note that it goes before the closing h of the group). The result is then heedeeeh.

    # coding: utf-8
    import re  # import the regular expression module

    line = "heedeeehhlhloooooooo World"
    regex_str = ".*?(h.*?h).*"  # note the ? added before the closing h
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))
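The three variants are easier to compare side by side. The sketch below, assuming Python 3, runs each pattern against the same string; the dictionary names patterns and results and the labels are made up for this illustration.

```python
import re

line = "heedeeehhlhloooooooo World"

# Each label describes which parts of the pattern are greedy or lazy.
patterns = {
    "greedy prefix, greedy group": r".*(h.*h).*",
    "lazy prefix, greedy group": r".*?(h.*h).*",
    "lazy prefix, lazy group": r".*?(h.*?h).*",
}

results = {}
for label, pattern in patterns.items():
    match_obj = re.match(pattern, line)
    results[label] = match_obj.group(1) if match_obj else None
    print(label, "->", results[label])
```

The three lines printed end in hlh, heedeeehhlh and heedeeeh respectively: the prefix decides where the group starts, and the quantifier inside the group decides where it ends.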

+ means the preceding character appears at least once: h+ requires one or more h at that position in the string to be matched.

{2} means the preceding character appears exactly 2 times: h{2} requires exactly two h at that position in the string to be matched.

{2,} means the preceding character appears at least 2 times, with no upper limit (note: no space after the comma): h{3,} requires three or more h at that position in the string to be matched.

{2,6} means the preceding character appears between 2 and 6 times: h{3,5} requires between 3 and 5 h at that position in the string to be matched.

See the code below; the result is heeh. The other quantifiers work in much the same way, so I will not demonstrate them.

    # coding: utf-8
    import re

    line = "heehhlhloooooooo World"
    regex_str = ".*(he{2,}h).*"  # e must appear at least twice between the two h
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))
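For the quantifiers not demonstrated, here is a small sketch, assuming Python 3.4 or later (for re.fullmatch). The helper matches is hypothetical and exists only to keep the checks short.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# h+ : one or more h
print(matches(r"h+", "h"), matches(r"h+", "hhh"), matches(r"h+", ""))
# h{2} : exactly two h
print(matches(r"h{2}", "hh"), matches(r"h{2}", "hhh"))
# h{3,} : three or more h
print(matches(r"h{3,}", "hh"), matches(r"h{3,}", "hhhh"))
# h{3,5} : between three and five h
print(matches(r"h{3,5}", "hhhhh"), matches(r"h{3,5}", "hhhhhh"))
# A space inside the braces stops it from being a quantifier:
# the pattern then matches the literal text "h{3, 5}".
print(matches(r"h{3, 5}", "hhhh"), matches(r"h{3, 5}", "h{3, 5}"))
```

The last line shows why the space matters: with a space after the comma, Python's re module treats the braces as ordinary characters.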

2) [adbc] means the character at that position in the string to be matched must be one of a, d, b or c.

A class can also be written as a range, such as [a-z] or [0-9], and ranges can be combined, as in [a-zA-Z0-9].

[^h] means the character at that position in the string to be matched can be anything except h.

Note: inside [], as in [.] and [*], the characters . and * no longer mean any character or any number of occurrences; they stand for themselves.

Here is an example. The result is hloooo0.

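    # coding: utf-8
    import re

    line = "heehhlhloooo0 World"
    # h, then a character that is not e, then a lowercase letter,
    # then anything, ending with a digit
    regex_str = ".*(h[^e][a-z].*[0-9]).*"
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))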
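Each character-class rule above can also be checked in isolation. This is a minimal sketch assuming Python 3; the sample strings and variable names are made up for illustration.

```python
import re

# [abcd] accepts exactly one character from the set.
in_set = bool(re.match(r"[abcd]at", "bat"))
not_in_set = bool(re.match(r"[abcd]at", "hat"))
print(in_set, not_in_set)  # True False

# Ranges can be combined inside one class.
alnum_runs = re.findall(r"[a-zA-Z0-9]+", "py3 + Re_2018!")
print(alnum_runs)          # ['py3', 'Re', '2018']

# [^h] accepts any single character except h.
not_h = re.findall(r"[^h]", "hah!")
print(not_h)               # ['a', '!']

# Inside [], . and * are literal characters.
literals = re.findall(r"[.*]", "a.b*c")
print(literals)            # ['.', '*']
```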

| means or: (h|r) is satisfied when either h or r appears at that position in the string to be matched.

() creates a group, as mentioned earlier. In this example, group() extracts the string captured by a group: group(1) returns the string captured by the 1st group, which of course requires the pattern to contain a () group.

Example

    # coding: utf-8
    import re

    line = "hello365"
    regex_str = "((hello|hell0o)365)"
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))  # outer group: the whole match
        print(match_obj.group(2))  # inner group: whichever alternative matched

The two print calls output hello365 and hello respectively. group(2) returns the string captured by the 2nd group, which is whichever of hello or hell0o matched.
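Group numbering follows the order of the opening parentheses, which the following sketch makes visible. It assumes Python 3; the extra test strings hell0o365 and help365 are made up for illustration.

```python
import re

regex_str = r"((hello|hell0o)365)"

for line in ("hello365", "hell0o365", "help365"):
    match_obj = re.match(regex_str, line)
    if match_obj:
        # group(0) is the whole match; groups() lists group 1, 2, ... in order
        print(line, "->", match_obj.group(0), match_obj.groups())
    else:
        print(line, "-> no match")
```

The outer parentheses open first, so they become group 1, and the alternation inside them becomes group 2. For help365 neither alternative fits, so re.match returns None.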

3) \s means the character at that position in the string to be matched is a whitespace character, such as a space, while \S means a non-whitespace character. Each matches a single character only; to match several, add +.

\w means the character at that position in the string to be matched is one of a-z, A-Z, 0-9 or _, equivalent to [a-zA-Z0-9_] for ASCII text, while \W matches everything else, such as ~ ! @ # $ % ^ & *.

An example:

    # coding: utf-8
    import re

    line = "hello world~"
    # raw string so the backslashes reach the regex engine unchanged
    regex_str = r"(\wello\sworld\W)"
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))

The result is hello world~, as expected.
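The four shorthand classes can be compared on one string. This is a minimal sketch assuming Python 3; the sample string and the variable names are made up for illustration.

```python
import re

sample = "hi_42 ~ok!\tend"  # contains a space and a tab

spaces = re.findall(r"\s", sample)       # single whitespace characters
non_spaces = re.findall(r"\S+", sample)  # runs of non-whitespace characters
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
non_words = re.findall(r"\W", sample)    # everything \w rejects

print(spaces)      # [' ', '\t']
print(non_spaces)  # ['hi_42', '~ok!', 'end']
print(words)       # ['hi_42', 'ok', 'end']
print(non_words)   # [' ', '~', '!', '\t']
```

Note that whitespace counts as a non-word character, so the space and the tab show up under both \s and \W.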

4) \d means the character at that position in the string to be matched is a digit; [\u4e00-\u9fa5] matches Chinese characters.

    # coding: utf-8
    import re

    line = "hello world365你好"  # the string ends with Chinese characters
    regex_str = r"(hello\sworld\d+[\u4e00-\u9fa5]+)"
    match_obj = re.match(regex_str, line)
    if match_obj:
        print(match_obj.group(1))

The result is hello world365. \d clearly matched, and [\u4e00-\u9fa5] must have matched as well, otherwise nothing would have been printed. Yet the Chinese characters do not appear in the printed result. I am puzzled by this too; is it because I changed the encoding?
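One possible explanation for the missing Chinese text, offered as a guess rather than a diagnosis: if the example ran on Python 2, a plain string literal is a byte string and \u4e00 is not turned into a character, so the class would cover ordinary ASCII characters (digits among them). The final 5 of 365 could then satisfy it, and nothing after it would be captured. The sketch below assumes Python 3, where strings are Unicode; the sample text and the names captured and chinese_only are made up for illustration.

```python
import re

line = "hello world365你好"  # sample text; the Chinese part is an assumption
regex_str = r"(hello\sworld\d+[\u4e00-\u9fa5]+)"

match_obj = re.match(regex_str, line)
captured = match_obj.group(1) if match_obj else None
print(captured)

# The class on its own picks out only the Chinese characters.
chinese_only = re.findall(r"[\u4e00-\u9fa5]+", line)
print(chinese_only)
```

Under these assumptions the group captures the full string, Chinese characters included, and the class alone returns just the Chinese part.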

That covers regular expressions, an important part of what I have studied so far for Python web crawlers. It may not be comprehensive and may even contain a few mistakes; for more, see Runoob (the "rookie tutorial" site) and other well-known reference sites. What I have written is for reference only.

As for why the Chinese characters do not show up in the printed result, I hope someone can give me a pointer or two.
