Basics of regular expressions and using cookies
- Atoms
Atoms are the most basic building blocks of a regular expression, and every regular expression contains at least one atom. Common atoms fall into the following categories:
- Ordinary characters as atoms
- Non-printable characters as atoms
- Universal characters as atoms
- Atom tables
- Ordinary characters as atoms. Digits, uppercase and lowercase letters, underscores, and so on can all be used as atoms. For example, the following program uses "yue" as the pattern, which consists of three atoms: 'y', 'u', and 'e'.
import re
pattern="yue"
string="http://yum.iqianyue.com"
result1=re.search(pattern,string)
print(result1)
- Non-printable characters as atoms
Non-printable characters are the symbols in a string that control formatting, such as the line break.
import re
pattern="\n"
string='''http://yum.iqianyue.com
http://baidu.com'''
result1=re.search(pattern,string)
print(result1)
- Universal characters as atoms
A universal character is an atom that matches a whole class of characters. This type of atom is used often in real projects.
| symbol | meaning |
| \w | matches any letter, digit, or underscore |
| \W | matches any character except letters, digits, and underscores |
| \d | matches any decimal digit |
| \D | matches any character except a decimal digit |
| \s | matches any whitespace character |
| \S | matches any character except whitespace |
import re
pattern=r"\w\dpython\w"
string="abcdfphp345python_py"
result1=re.search(pattern,string)
print(result1)
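To see how each class in the table behaves on its own, the following sketch runs re.findall() over a short sample string. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample: letters, an underscore, digits, a space, a tab, and "!".
sample = "py_3 v2\t!"

word_chars = re.findall(r"\w", sample)      # letters, digits, underscore
non_word_chars = re.findall(r"\W", sample)  # everything \w rejects
digits = re.findall(r"\d", sample)          # decimal digits
non_digits = re.findall(r"\D", sample)      # everything \d rejects
spaces = re.findall(r"\s", sample)          # whitespace (space, tab, newline, ...)
non_spaces = re.findall(r"\S", sample)      # everything \s rejects

print(word_chars)      # ['p', 'y', '_', '3', 'v', '2']
print(non_word_chars)  # [' ', '\t', '!']
print(digits)          # ['3', '2']
print(spaces)          # [' ', '\t']
```

Each uppercase class is the complement of its lowercase counterpart, so the two lists together always cover every character of the sample.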
- Atom tables
An atom table defines a set of atoms of equal status and matches any one of them. In Python an atom table is written with [], so [xyz] is an atom table that defines three atoms of equal status. For example, given the regular expression "[xyz]py" and the source string "xpython", re.search() returns the match "xpy", because the match succeeds as long as the character just before "py" is one of x, y, or z.
Similarly, [^] matches any character other than the atoms in the brackets: "[^xyz]py" can match "apy" but not "xpy".
import re
pattern1=r"\w\dpython[xyz]\w"
pattern2=r"\w\dpython[^xyz]\w"
pattern3=r"\w\dpython[xyz]\W"
string="abcdfphp345pythony_py"
result1=re.search(pattern1,string)
result2=re.search(pattern2,string)
result3=re.search(pattern3,string)
print(result1)
print(result2)
print(result3)
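The "[xyz]py" and "[^xyz]py" cases described in the text can be checked directly. This is a minimal sketch; the input "apython" is an assumed string chosen so that the negated table has something to match.

```python
import re

# "[xyz]py" needs one of x, y, z immediately before "py".
in_table = re.search(r"[xyz]py", "xpython")

# "[^xyz]py" needs any character other than x, y, z before "py".
negated_hit = re.search(r"[^xyz]py", "apython")
negated_miss = re.search(r"[^xyz]py", "xpython")

print(in_table.group())     # xpy
print(negated_hit.group())  # apy
print(negated_miss)         # None
```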
- Metacharacters
A metacharacter is a character that has a special meaning in a regular expression, such as repeating the preceding character n times.
| symbol | meaning |
| . | matches any character except a line break |
| ^ | matches the start of the string |
| $ | matches the end of the string |
| * | matches the preceding atom 0, 1, or more times |
| ? | matches the preceding atom 0 or 1 times |
| + | matches the preceding atom 1 or more times |
| {n} | the preceding atom appears exactly n times |
| {n,} | the preceding atom appears at least n times |
| {n,m} | the preceding atom appears at least n and at most m times |
| \| | pattern selector |
| () | pattern unit |
Specifically, metacharacters can be divided into arbitrary-match metacharacters, boundary-restricting metacharacters, qualifiers, pattern selectors, and pattern units.
- Arbitrary-match metacharacter
The arbitrary-match metacharacter "." matches any single character other than a line break. For example, the regular expression ".python..." matches "python" preceded by one character and followed by three, where each of those four characters can be anything except a newline.
import re
pattern=".python..."
string="abcdfphp345pythony_py"
result1=re.search(pattern,string)
print(result1)
- Boundary-restricted meta-characters
You can use "^" to match the beginning of the string and "$" to match the end of the string.
import re
pattern1="^abd"
pattern2="^abc"
pattern3="py$"
pattern4="ay$"
string="abcdfphp345pythony_py"
result1=re.search(pattern1,string)
result2=re.search(pattern2,string)
result3=re.search(pattern3,string)
result4=re.search(pattern4,string)
print(result1)
print(result2)
print(result3)
print(result4)
- Qualifier
Common qualifiers include *, ?, +, {n}, {n,}, and {n,m}.
import re
pattern1="py.*n"
pattern2="cd{2}"
pattern3="cd{3}"
pattern4="cd{2,}"
string="abcdddfphp345pythony_py"
result1=re.search(pattern1,string)
result2=re.search(pattern2,string)
result3=re.search(pattern3,string)
result4=re.search(pattern4,string)
print(result1)
print(result2)
print(result3)
print(result4)
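The example above does not exercise "?" or "+", so here is a small sketch that compares the qualifiers side by side. The word list is invented for illustration, and re.fullmatch() (not used elsewhere in these notes) is used so that a word only counts when the whole word fits the pattern.

```python
import re

# Hypothetical test words with 0, 1, 2, and 3 copies of "a".
words = ["ct", "cat", "caat", "caaat"]

star = [w for w in words if re.fullmatch(r"ca*t", w)]        # 0 or more "a"
plus = [w for w in words if re.fullmatch(r"ca+t", w)]        # 1 or more "a"
optional = [w for w in words if re.fullmatch(r"ca?t", w)]    # 0 or 1 "a"
ranged = [w for w in words if re.fullmatch(r"ca{1,2}t", w)]  # 1 to 2 "a"

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']
print(ranged)    # ['cat', 'caat']
```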
- Pattern Selector
The pattern selector "|" lets you list several patterns, any one of which may match. For example, with the regular expression "python|php", both the string "python" and the string "php" satisfy the match.
import re
pattern="python|php"
string="abcdfphp345pythony_py"
result1=re.search(pattern,string)
print(result1)
- Pattern unit
The pattern unit "()" combines several atoms into one larger atom; the part enclosed in parentheses is treated as a whole.
import re
pattern1="(cd){1,}"
pattern2="cd{1,}"
string="abcdcdcdcdfphp345pythony_py"
result1=re.search(pattern1,string)
result2=re.search(pattern2,string)
print(result1)
print(result2)
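Pattern units also control how far the pattern selector reaches. The sketch below, which reuses the sample string from the earlier examples, assumes the goal is "a digit followed by either python or php" and shows how the result changes once the alternatives are grouped.

```python
import re

string = "abcdfphp345pythony_py"

# Without parentheses the two alternatives are "\dpython" and "php".
ungrouped = re.search(r"\dpython|php", string)

# With parentheses, "\d" must come before whichever alternative matches.
grouped = re.search(r"\d(python|php)", string)

print(ungrouped.group())  # php
print(grouped.group())    # 5python
print(grouped.group(1))   # python
```

group(1) returns only the text matched by the first pair of parentheses, which is often the part you actually want to extract.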
- Pattern modifiers
A pattern modifier changes how a regular expression is interpreted without changing the expression itself, which adjusts the matching result. For example, the pattern modifier I makes matching case-insensitive.
| symbol | meaning |
| I | ignore case when matching |
| M | multi-line matching |
| L | locale-aware matching |
| U | parse characters according to Unicode |
| S | let "." match a newline as well, so that "." matches any character |
import re
pattern1="python"
pattern2="python"
string="abcdfphp34Pythony_py"
result1=re.search(pattern1,string)
result2=re.search(pattern2,string,re.I)
print(result1)
print(result2)
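The M and S modifiers are easiest to understand on a string that contains a line break. The following is a minimal sketch with an invented two-line string.

```python
import re

text = "first line\nsecond line"

# Without re.M, "^" only matches at the very start of the string.
no_flag = re.findall(r"^\w+", text)
# With re.M, "^" also matches right after each newline.
multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, "." stops at the newline; with re.S it crosses it.
dot_default = re.search(r"first.*line", text).group()
dot_all = re.search(r"first.*line", text, re.S).group()

print(no_flag)      # ['first']
print(multiline)    # ['first', 'second']
print(dot_default)  # first line
print(repr(dot_all))
```

Several modifiers can be combined with the bitwise OR operator, for example re.I | re.M.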
- Greedy mode and lazy mode
The core point of greedy mode is to match as many as possible, while the core of lazy mode is to match as few as possible.
import re
pattern1="p.*y"  # greedy mode
pattern2="p.*?y"  # lazy mode
string="abcdfphp345pythony_py"
result1=re.search(pattern1,string)
result2=re.search(pattern2,string)
print(result1)
print(result2)
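The difference matters most when a delimiter occurs more than once, which is common in page source. As an illustration, this sketch extracts tags from an invented HTML fragment in both modes.

```python
import re

# Hypothetical HTML fragment.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs to the last ">" in the string.
greedy = re.findall(r"<.*>", html)
# Lazy: ".*?" stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```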
Common regular expression functions
- re.match() function
To match a pattern starting from the beginning of the string, use the re.match() function, whose format is:
re.match(pattern,string,flag)
The first parameter is the regular expression, the second is the source string, and the third is optional and represents the flags.
import re
string="apythonhellomypythonhispythonourpythonend"
pattern=".python."
result=re.match(pattern,string)
result2=re.match(pattern,string).span()
print(result)
print(result2)
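Note that re.match() returns None when the pattern does not match at the start of the string, and calling .span() on None raises an AttributeError. One way to guard against that is sketched below; span_at_start is a hypothetical helper name, not part of the re module.

```python
import re

pattern = ".python."

def span_at_start(string):
    """Return the span of a match at the start of string, or None."""
    result = re.match(pattern, string)
    # re.match() gives None when nothing matches at position 0,
    # so check before calling .span().
    if result is None:
        return None
    return result.span()

print(span_at_start("apythonhellomypython"))  # (0, 8)
print(span_at_start("hellomypython"))         # None
```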
- re.search() function
The re.search() function scans the whole string and returns the first match.
import re
string="hellomypythonhispythonourpythonend"
pattern=".python."
result=re.match(pattern,string)
result2=re.search(pattern,string)
print(result)
print(result2)
- Global matching
To get every match in a string rather than only the first:
- Use re.compile() to precompile the regular expression.
- After compiling, use findall() to find all the matches for the regular expression in the source string.
import re
string="hellomypythonhispythonourpythonend"
pattern=re.compile(".python.")  # precompile
result=pattern.findall(string)  # find all results that match the pattern
print(result)
import re
string="hellomypythonhistpythonourpythonend"
pattern=".python."
result=re.compile(pattern).findall(string)
print(result)
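findall() returns only the matched text. When the position of each match is needed too, finditer() on the compiled pattern is one option; the sketch below reuses the first sample string from above.

```python
import re

string = "hellomypythonhispythonourpythonend"
pattern = re.compile(".python.")

# finditer() yields match objects, so each result carries its span as well.
matches = [(m.group(), m.span()) for m in pattern.finditer(string)]

for text, span in matches:
    print(text, span)
# ypythonh (6, 14)
# spythono (15, 23)
# rpythone (24, 32)
```

Both findall() and finditer() return non-overlapping matches, scanning from left to right.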
- re.sub() function
To replace the parts of a string that match a regular expression, use the re.sub() function.
re.sub(pattern,rep,string,max)
The first parameter is the regular expression, the second is the replacement string, the third is the source string, and the fourth is optional and gives the maximum number of replacements (this parameter is named count in Python's re module); if it is omitted, every match is replaced.
import re
string="hellomypythonhispythonourpythonend"
pattern="python."
result1=re.sub(pattern,"php",string)  # replace all
result2=re.sub(pattern,"php",string,count=2)  # replace at most twice
print(result1)
print(result2)
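A smaller sketch makes the effect of the replacement limit easier to see. The input string is invented, the limit is passed by its keyword name count, and re.subn() is shown as an optional extra that also reports how many replacements were made.

```python
import re

# Hypothetical input with three runs of digits.
string = "one1two22three333"

all_replaced = re.sub(r"\d+", "-", string)          # no limit
first_only = re.sub(r"\d+", "-", string, count=1)   # at most one replacement
replaced, n = re.subn(r"\d+", "-", string)          # result plus a count

print(all_replaced)  # one-two-three-
print(first_only)    # one-two22three333
print(n)             # 3
```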
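Common examples
- Match URLs with a .com or .cn suffix
import re
# A group is used for the suffix; a character class such as [.com|.cn] would match only one of those characters.
pattern=r"[a-zA-Z]+://[^\s]*\.(?:com|cn)"
string="<a href='http://www.baidu.com'>Baidu home page</a>"
result=re.search(pattern,string)
print(result)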
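One caveat when writing the suffix: inside square brackets, "." and "|" are ordinary characters, so a class such as [.com|.cn] matches a single character rather than a whole suffix. The sketch below contrasts that with a grouped alternative. The sample text and the hosts example.cn and host.org are assumptions for illustration, and the grouped pattern is still only a rough filter, not a full URL validator.

```python
import re

# Hypothetical text containing three URL-like strings.
text = "http://www.baidu.com https://example.cn/index ftp://host.org"

# Grouped alternation: requires a literal ".com" or ".cn".
grouped = re.compile(r"[a-zA-Z]+://[^\s]*\.(?:com|cn)")
urls = grouped.findall(text)

# Character class: the last atom is any ONE of . c o m | n,
# so a .org address is partly accepted.
loose = re.search(r"[a-zA-Z]+://[^\s]*[.com|.cn]", "ftp://host.org")

print(urls)           # ['http://www.baidu.com', 'https://example.cn']
print(loose.group())  # ftp://host.o
```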
- Match phone number
import re
pattern=r"\d{4}-\d{7}|\d{3}-\d{8}"  # regular expression that matches a phone number (area code, hyphen, number)
string="021-6728263682382265236"
result=re.search(pattern,string)
print(result)
- Match e-mail address
import re
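pattern=r"\w+([.+-]\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*"  # regular expression that matches an e-mail address
# The address appears obscured as "[email protected]" in this copy of the text; substitute a real address to see a match.
string="<a href='http://www.baidu.com'>Baidu home page</a><br><a href='mailto:[email protected]'>e-mail address</a>"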
result=re.search(pattern,string)
print(result)
What is a cookie
HTTP is a stateless protocol, which means that state cannot be maintained between sessions. Cookies can save that state.
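As a rough illustration of how a cookie carries state, the sketch below parses a Set-Cookie value with the standard library's http.cookies.SimpleCookie and rebuilds the Cookie header a client would send back. The cookie name sessionid and its value are invented for the example.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value a server might send after login.
set_cookie_value = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(set_cookie_value)

session_id = cookie["sessionid"].value  # the stored state
path = cookie["sessionid"]["path"]      # an attribute of the cookie

# On later requests the client sends the stored pair back in a Cookie header.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)

print(session_id)     # abc123
print(path)           # /
print(cookie_header)  # sessionid=abc123
```

Because the server sees the same sessionid on each request, it can tie those requests to one logged-in session even though HTTP itself keeps no state.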
CookieJar in practice
import urllib.request
import urllib.parse
import http.cookiejar
url="POST request URL"
postdata=urllib.parse.urlencode({
    "username":"user name",
    "password":"password"
}).encode("utf-8")
req=urllib.request.Request(url,postdata)
req.add_header('User-Agent','User-Agent value')
cjar=http.cookiejar.CookieJar()  # create a CookieJar object with http.cookiejar.CookieJar()
opener=urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))  # create a cookie processor with HTTPCookieProcessor and build an opener object from it
urllib.request.install_opener(opener)  # install the opener globally
file=opener.open(req)
data=file.read()
file=open("file storage path","wb")
file.write(data)
file.close()
url2="web page URL"
data2=urllib.request.urlopen(url2).read()
fh=open("file path",'wb')
fh.write(data2)
fh.close()