Downloading images from a web page with a Python crawler

Source: Internet
Author: User

import os
import re
import urllib.request

# Get the content of the HTML page (returned as bytes)
def gethtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

path = 'local storage location'  # save path

# Build the local file name for image number x, creating the folder if needed
def saveFile(x):
    if not os.path.isdir(path):
        os.makedirs(path)
    t = os.path.join(path, '%s.jpg' % x)
    return t

html = gethtml('https://...')

# Get the pictures of a web page
def getimg(html):
    # Regular expression: capture the image URL inside src="..."
    reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
    # Compile the regular expression
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html.decode('utf-8'))
    x = 0
    for imgurl in imglist:
        # Download the picture
        urllib.request.urlretrieve(imgurl, saveFile(x))
        print(imgurl)
        x += 1
        # Stop after 23 pictures
        if x == 23:
            break
    print(x)
    return imglist

getimg(html)
print('End')
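The same flow can also be sketched with the parsing step separated from the download step, which makes it possible to try the logic without touching the network. This is only a sketch: the names extract_image_urls and download_images, the limit parameter, and the fetch hook are illustrative assumptions that do not appear in the original script.

```python
import os
import re
import urllib.request

# Same pattern as in the script above: capture the URL inside src="...".
IMG_RE = re.compile(r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"')


def extract_image_urls(html_text):
    """Return every image URL matched by IMG_RE, in document order."""
    return IMG_RE.findall(html_text)


def download_images(html_text, dest_dir, limit=23, fetch=urllib.request.urlretrieve):
    """Save up to `limit` images into dest_dir as 0.jpg, 1.jpg, ...

    `fetch` is a hypothetical hook: any callable taking (url, filename).
    It defaults to urllib.request.urlretrieve, as in the original script.
    """
    os.makedirs(dest_dir, exist_ok=True)  # no error if the folder already exists
    saved = []
    for index, url in enumerate(extract_image_urls(html_text)[:limit]):
        target = os.path.join(dest_dir, '%s.jpg' % index)
        fetch(url, target)
        saved.append(target)
    return saved
```

Passing a stand-in for fetch (for example, a function that only records its arguments) lets the URL extraction and file naming be checked before any real download is attempted.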

  

^: Matches the beginning of the string

$: Matches the end of the string

.: Matches any character except a line break

*: Matches the preceding element zero or more times

+: Matches the preceding element one or more times

?: Matches the preceding element 0 or 1 times; home-?brew matches homebrew or home-brew

[]: Specifies a character class. Characters can be listed individually, or a range can be given with -. [abc] matches any one of the characters a, b, c; the same class can be written as [a-c]

[^]: With ^ as the first character of the class, the class is negated; [^5] matches any character except 5

\: Escape character
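The metacharacters listed above can be tried out directly with the re module. The strings below are made-up examples chosen only to show each rule.

```python
import re

# ^ and $ anchor the match to the start and end of the string.
print(bool(re.search(r'^home', 'homebrew')))   # True
print(bool(re.search(r'brew$', 'homebrew')))   # True

# . matches any single character except a line break.
print(bool(re.fullmatch(r'h.t', 'hat')))       # True
print(bool(re.fullmatch(r'h.t', 'h\nt')))      # False

# * allows zero or more repetitions, + requires at least one.
print(bool(re.fullmatch(r'ab*', 'a')))         # True
print(bool(re.fullmatch(r'ab+', 'a')))         # False

# ? makes the preceding element optional.
print(re.findall(r'home-?brew', 'homebrew home-brew'))  # ['homebrew', 'home-brew']

# [a-c] is the same class as [abc]; [^5] excludes the character 5.
print(re.findall(r'[a-c]', 'abcd'))            # ['a', 'b', 'c']
print(re.findall(r'[^5]', '1525'))             # ['1', '2']

# \ removes the special meaning of the next character.
print(re.findall(r'\.', 'a.b'))                # ['.']
```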

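Putting a backslash in front of a metacharacter cancels its special meaning. This gets awkward with the backslash itself: to match the text \section, the backslash in the regular expression has to be written as \\, but \\ already has a meaning inside an ordinary Python string literal, so each of those backslashes has to be escaped again. Lots of backslashes... The way out is raw string notation: with r in front of the string, the backslash is not treated as special, so r'\n' means the two characters \ and n instead of a newline.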
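A short check of how raw strings behave, as a sketch; the text \section is used here only as an example of something containing a literal backslash.

```python
import re

print(len('\n'))          # 1: a single newline character
print(len(r'\n'))         # 2: a backslash followed by n
print('\\\\' == r'\\')    # True: both spell two backslash characters

# Matching one literal backslash: the regex needs \\, written as a raw string.
text = r'\section'
print(re.findall(r'\\section', text))     # ['\\section'], i.e. the text \section
# The same pattern without a raw string needs four backslashes.
print(re.findall('\\\\section', text))    # ['\\section']
```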

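For example, https://imgsa[^>]+\.(?:jpeg|jpg) matches https://imgsa, followed by one or more characters that are not >, followed by a literal dot and then jpeg or jpg. (?:...) groups the two alternatives without capturing them.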
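The effect of the non-capturing group shows up in what findall returns. In the sketch below the img tag is a made-up sample, and the second pattern (with a capturing group) is included only for comparison; it is not used in the script.

```python
import re

# Made-up sample tag, used only to exercise the pattern.
sample = '<img src="https://imgsa.example.com/pic/cat.jpg" width="100">'

non_capturing = re.compile(r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"')
capturing = re.compile(r'src="(https://imgsa[^>]+\.(jpeg|jpg))"')

# One group in the pattern: findall returns a list of strings.
print(non_capturing.findall(sample))
# ['https://imgsa.example.com/pic/cat.jpg']

# Two groups in the pattern: findall returns a list of tuples.
print(capturing.findall(sample))
# [('https://imgsa.example.com/pic/cat.jpg', 'jpg')]
```

With (?:jpeg|jpg) each result is already a URL string that can be passed straight to the download call.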

Methods of a compiled pattern object:

Method/Property   Role
match()           Determines whether the RE matches at the beginning of the string
search()          Scans the string to find a location where the RE matches
findall()         Finds all the substrings that the RE matches and returns them as a list
finditer()        Finds all the substrings that the RE matches and returns them as an iterator
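The four methods can be compared on one input. This is a small sketch; the pattern and the text are arbitrary examples.

```python
import re

pattern = re.compile(r'\d+')
text = 'img12 and img345'

# match() only looks at the beginning of the string, which is not a digit here.
print(pattern.match(text))                           # None

# search() scans forward and stops at the first match.
print(pattern.search(text).group())                  # 12

# findall() returns every match as a list of strings.
print(pattern.findall(text))                         # ['12', '345']

# finditer() yields match objects one at a time.
print([m.group() for m in pattern.finditer(text)])   # ['12', '345']
```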
Methods of the match object returned by match() and search():

Method/Property   Role
group()           Returns the string that was matched by the RE
start()           Returns the position where the match starts
end()             Returns the position where the match ends
span()            Returns a tuple containing the (start, end) positions of the match
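A match object can be inspected like this; the input string is an arbitrary example.

```python
import re

m = re.search(r'\d+', 'img12 and img345')

print(m.group())  # 12
print(m.start())  # 3
print(m.end())    # 5
print(m.span())   # (3, 5)
```

The end position is exclusive, so the matched text is the slice from index 3 up to but not including index 5.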
