import urllib.request, os
import re

# Get the HTML content of a page
def gethtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

path = 'local storage location'  # Save path

# Build the local file name for picture number x, creating the folder if needed
def savefile(x):
    if not os.path.isdir(path):
        os.makedirs(path)
    t = os.path.join(path, '%s.jpg' % x)
    return t

html = gethtml('https://...')

# Get the pictures on a webpage
def getimg(html):
    # Regular expression
    reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
    # Compile the regular expression
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html.decode('utf-8'))
    x = 0
    for imgurl in imglist:
        # Download the picture
        urllib.request.urlretrieve(imgurl, savefile(x))
        print(imgurl)
        x += 1
        # Stop after 23 pictures
        if x == 23:
            break
    print(x)
    return imglist

getimg(html)
print('End')
^: matches the beginning of the string
$: matches the end of the string
.: matches any character except a line break
*: matches the preceding item any number of times, including zero
+: matches the preceding item one or more times
?: matches the preceding item zero or one time; home-?brew matches homebrew or home-brew
[]: specifies a character class. Characters can be listed individually, or a range can be given with -. [abc] matches any one of the characters a, b, c, and can also be written as the range [a-c]
[^]: with ^ as the first character of the class, the class is negated; [^5] matches any character except 5
\: escape character
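A backslash in front of a metacharacter cancels its special meaning. To match the literal text \section, for example, the regular expression must be \\section, because a bare backslash has a meaning of its own. The backslash is also the escape character in ordinary Python string literals, so each one has to be escaped again and the pattern ends up as '\\\\section'. That is a lot of backslashes. The fix is raw string notation: with r in front of the string, the backslash is not treated as special, so r'\n' is the two characters \ and n instead of a newline.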
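The metacharacters and the raw string rule above can be checked directly with Python's re module. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# '?' makes the preceding '-' optional, so both spellings match.
optional = [bool(re.fullmatch(r'home-?brew', s))
            for s in ('homebrew', 'home-brew', 'home--brew')]

# '*' allows zero or more repetitions, '+' requires at least one.
star = bool(re.fullmatch(r'ab*', 'a'))
plus = bool(re.fullmatch(r'ab+', 'a'))

# [a-c] is the same class as [abc]; [^5] matches anything except '5'.
in_class = re.findall(r'[a-c]', 'abcdef')
not_five = re.findall(r'[^5]', '1525')

# '.' does not match a line break by default.
dot = re.findall(r'a.c', 'abc a\nc')

# In a raw string, \n is two characters; in a normal string it is one newline.
raw_len = len(r'\n')
plain_len = len('\n')

# The regex for one literal backslash is \\ : r'\\' as a raw string,
# '\\\\' as an ordinary string literal. Both spell the same pattern.
same_pattern = (r'\\' == '\\\\')
found = re.findall(r'\\', 'a\\b')  # the sample text is: a, backslash, b

print(optional, star, plus, in_class, not_five, dot)
print(raw_len, plain_len, same_pattern, found)
```

Using re.fullmatch here forces the whole sample string to match, which makes the difference between * and + visible on a one-character input.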
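For example, https://imgsa[^>]+\.(?:jpeg|jpg) matches https://imgsa followed by one or more characters that are not >, then a literal dot and either jpeg or jpg. The (?:...) group only groups the two alternatives; it does not capture anything.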
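To see which URLs this pattern picks up without touching the network, it can be run against a small HTML string. The fragment below is hypothetical: the host names and file names are invented for the sketch and are not taken from any real page.

```python
import re

# Hypothetical HTML fragment; hosts and file names are invented.
sample_html = (
    '<img src="https://imgsa.example.com/pic/a1.jpg">'
    '<img src="https://imgsa.example.com/pic/b2.jpeg">'
    '<img src="https://other.example.com/pic/c3.jpg">'
    '<img src="https://imgsa.example.com/pic/d4.png">'
)

# Same pattern as in the script: the outer parentheses capture the URL,
# (?:jpeg|jpg) only groups the two allowed extensions.
pattern = re.compile(r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"')

# findall returns the captured group for every match.
urls = pattern.findall(sample_html)

for url in urls:
    print(url)
```

Because [^>]+ cannot cross a >, a match never runs from one tag into the next, so the .png image and the image on the other host are skipped.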
Method/Attribute | Purpose
match() | Determines whether the RE matches at the beginning of the string
search() | Scans the string, looking for a location where the RE matches
findall() | Finds all the substrings that the RE matches and returns them as a list
finditer() | Finds all the substrings that the RE matches and returns them as an iterator
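The difference between the four methods is easiest to see by running them on the same input. A minimal sketch, assuming a made-up pattern (one or more digits) and a made-up sample string:

```python
import re

p = re.compile(r'\d+')      # one or more digits
text = 'abc 12 de 345'      # sample input, invented for this sketch

# match() only tries at the beginning of the string, so it fails here.
at_start = p.match(text)

# search() scans forward and stops at the first match.
first = p.search(text)

# findall() returns every matching substring as a list of strings.
all_list = p.findall(text)

# finditer() yields match objects one at a time.
all_iter = [m.group() for m in p.finditer(text)]

print(at_start)
print(first.group() if first else None)
print(all_list)
print(all_iter)
```

match() and search() return None when nothing matches, so their result should be checked before calling any method on it.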
Method/Attribute | Purpose
group() | Returns the string that was matched by the RE
start() | Returns the position where the match started
end() | Returns the position where the match ended
span() | Returns a tuple containing the (start, end) positions of the match
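These methods belong to the match object that match() and search() return. A short sketch on a made-up string shows what each one reports:

```python
import re

text = 'abc 12 de 345'          # sample input, invented for this sketch
m = re.search(r'\d+', text)     # first run of digits

if m is not None:
    matched = m.group()         # the matched text
    begin = m.start()           # index of the first matched character
    finish = m.end()            # index just past the last matched character
    both = m.span()             # (start, end) as one tuple
    print(matched, begin, finish, both)
    # Slicing the input with the span gives back the matched text.
    print(text[begin:finish] == matched)
```

The end position is exclusive, which is why the span can be used directly as a slice.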
Python crawler: getting pictures from a web page