1. Regular expressions
Common symbols
(1) * Matches the preceding character, subexpression, or bracketed character 0 or more times. Example: a*b* matches: aaa, aaabb, bb
(2) + Matches the preceding character, subexpression, or bracketed character 1 or more times. Example: a+b+ matches: aaab, aabb, abbb
(3) [] Matches any one of the characters inside the brackets (that is, "pick any one of these"). Example: [a-z]* matches: ap, cap, qwer
(4) () Groups a subexpression (groups are evaluated first). Example: (a*b)* matches: aabab, abaab, ababaaab
(5) {m,n} Matches the preceding character, subexpression, or bracketed character between m and n times, inclusive. Example: a{2,3}b{2,3} matches: aabb, aaabb, aaabbb
(6) [^] Matches any single character that is not inside the brackets. Example: [^A-Z]* matches: apple, low, qwe
(7) | Matches any one of the characters or subexpressions separated by the vertical bar. Example: b(a|i|e)d matches: bad, bid, bed
(8) . Matches any single character (including symbols, digits, spaces, and so on). Example: b.d matches: bad, b d, b$d
(9) ^ Indicates that the character or subexpression occurs at the beginning of the string. Example: ^a matches: ap, asdl, a
(10) \ Escape character (turns a character with a special meaning into a literal). Example: \.\|\\ matches: .|\
(11) $ Often used at the end of a regular expression; it means "match up to the end of the string". Example: [A-Z]*[a-z]*$ matches: Abcabc, zzyx, bob
(12) ?! "Does not contain". Usually placed before a character or subexpression to say that it must not appear at that position. Example: ^((?![A-Z]).)*$ matches: no-caps-here, $y 01s, a5d!ne
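The symbols above can be checked quickly with Python's re module. The sketch below is only an illustration: the patterns and sample strings follow the list above, with letter case normalized so that each string really matches, and the helper name all_match is made up for this example. re.fullmatch is used so that a sample counts only if the whole string matches.

```python
import re

# (pattern, sample strings that should match from start to end)
cases = [
    (r"a*b*", ["aaa", "aaabb", "bb"]),
    (r"a+b+", ["aaab", "aabb", "abbb"]),
    (r"[a-z]*", ["ap", "cap", "qwer"]),
    (r"(a*b)*", ["aabab", "abaab", "ababaaab"]),
    (r"a{2,3}b{2,3}", ["aabb", "aaabb", "aaabbb"]),
    (r"[^A-Z]*", ["apple", "low", "qwe"]),
    (r"b(a|i|e)d", ["bad", "bid", "bed"]),
    (r"b.d", ["bad", "b d", "b$d"]),
    (r"\.\|\\", [".|\\"]),
    (r"[A-Z]*[a-z]*$", ["Abcabc", "zzyx", "bob"]),
    (r"^((?![A-Z]).)*$", ["no-caps-here", "$y 01s", "a5d!ne"]),
]


def all_match(pattern, samples):
    """Return True if every sample matches the pattern in full."""
    return all(re.fullmatch(pattern, s) is not None for s in samples)


results = {pattern: all_match(pattern, samples) for pattern, samples in cases}
for pattern, ok in results.items():
    print(f"{pattern!r:22} {'ok' if ok else 'MISMATCH'}")
```

Changing a sample to something like "Apple" for [^A-Z]* makes the corresponding line report a mismatch, which is a handy way to experiment with each symbol.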
Example:
The letter "a" appears at least once, followed by the letter "b" exactly 5 times, followed by the letter "c" any even number of times, optionally followed by a final letter "d":
aa*bbbbb(cc)*(d|)
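As a rough check of this rule, the expression can be written in lower case and tried against a few strings. The sample strings and the helper name is_valid are invented for this sketch.

```python
import re

# One or more "a", exactly five "b", an even number of "c", then an optional "d".
pattern = re.compile(r"aa*bbbbb(cc)*(d|)")


def is_valid(s):
    """Return True if the whole string follows the rule."""
    return pattern.fullmatch(s) is not None


print(is_valid("aaaabbbbbccccd"))  # True
print(is_valid("abbbbb"))          # True: zero "c" is even, and "d" is optional
print(is_valid("abbbbbccc"))       # False: odd number of "c"
print(is_valid("bbbbbcc"))         # False: no leading "a"
```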
Matching an email address:
[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)
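A minimal sketch of how such an email pattern behaves, assuming it reads [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net). The addresses are hypothetical and only illustrate what the pattern accepts and where it is too strict.

```python
import re

# Local part: letters, digits, ".", "_", "+"; then "@", a letters-only domain
# name, and one of four top-level domains.
email_re = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)")

# Hypothetical addresses, made up for this example.
candidates = [
    "jane_doe+news@example.org",
    "user@mail.example.com",  # subdomain: rejected by this simple pattern
    "user@example.io",        # top-level domain not in the list
    "no-at-sign.com",
]

accepted = [c for c in candidates if email_re.fullmatch(c)]
print(accepted)
```

Only the first address is accepted, which shows that this pattern is a teaching example and not a complete email validator.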
2. Regular Expressions and BeautifulSoup
The page contains several product images, which appear in the source in the following form:
<img src="../img/gifts/img1.jpg">
Grabbing every picture with findAll("img") would also pick up many unwanted images, so the src attribute is filtered with a regular expression:
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "lxml")
# Keep only <img> tags whose src matches ../img/gifts/img*.jpg
images = bsObj.findAll("img", {"src": re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")})
for image in images:
    print(image["src"])
This code prints the relative paths of the images, all of which begin with ../img/gifts/img and end with .jpg.
Output:
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
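To see which src values the pattern lets through without fetching the page, the same regular expression can be applied to a plain list of strings. This is only a sketch: apart from the first path, the src values are hypothetical. Beautiful Soup tests attribute values with the pattern's search() method, which is what the list comprehension imitates.

```python
import re

pattern = re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")

# Hypothetical src values; only the first mirrors the output above.
srcs = [
    "../img/gifts/img1.jpg",
    "../img/gifts/logo.jpg",   # file name does not start with "img"
    "../img/gifts/img3.png",   # wrong extension
    "../img/banner/img2.jpg",  # wrong directory
]

# Imitate the filter: keep a value if the pattern is found in it.
kept = [s for s in srcs if pattern.search(s)]
print(kept)
```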
3. Getting attributes
>>> image.attrs
{'src': '../img/gifts/img6.jpg'}
>>> image.attrs["src"]
'../img/gifts/img6.jpg'
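The same attribute access can be tried offline on a one-tag document. The sketch assumes the bs4 package is installed and uses the built-in html.parser, so lxml is not needed; the HTML string is written by hand for this example.

```python
from bs4 import BeautifulSoup

# A hand-written one-tag document standing in for the real page.
soup = BeautifulSoup('<img src="../img/gifts/img6.jpg">', "html.parser")
image = soup.find("img")

print(image.attrs)       # dictionary of all attributes on the tag
print(image["src"])      # shorthand for image.attrs["src"]
print(image.get("alt"))  # .get() returns None for a missing attribute
```

Indexing with image["alt"] would raise a KeyError here, so .get() is the safer choice when an attribute may be absent.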
4. Lambda expression
A lambda expression is essentially a function that can be passed as an argument to another function.
BeautifulSoup allows certain functions to be passed as the argument to findAll. The only restriction is that the function must take a tag object as its parameter and return a Boolean. BeautifulSoup evaluates every tag object it encounters with this function, keeps the tags for which it returns True, and discards the rest.
soup.findAll(lambda tag: len(tag.attrs) == 2)
This line of code finds every tag that has exactly two attributes, for example:
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
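A small offline sketch of this lambda filter, assuming the bs4 package is installed. The two-attribute tags are the ones shown above; the p and img tags are added here only to show what gets rejected.

```python
from bs4 import BeautifulSoup

# The <p> and <img> tags are hypothetical extras with 1 and 3 attributes.
html = """
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
<p class="intro">one attribute</p>
<img src="a.jpg" alt="a" width="10">
"""

soup = BeautifulSoup(html, "html.parser")

# The lambda receives each tag and returns True only for exactly two attributes.
two_attr_tags = soup.findAll(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attr_tags]
print(names)  # ['div', 'span']
```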
"Python Network Data Acquisition" reading notes (III)