Python: use regular expressions (re) to extract specific URL strings from an HTML page, save them in a list, then use each URL to download the picture and save it to the hard drive
Reference: http://blog.csdn.net/xwbk12/article/details/72734930
1. Target address: https://xianzhi.aliyun.com/forum/topic/1805/
The page content is shown in the following figure.
The goal is to pull image URLs like this one out of the response body:
https://xianzhi.aliyun.com/forum/media/upload/picture/20171215230019-ab0e46aa-e1a8-1.png
2. Python Script
Running on Kali Linux
root@kali:~/python# cat downloadxianzhi-re.py
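# coding=utf-8
import urllib
import re
import sys

def gethtml(url):
    # Fetch the page and return its raw HTML (Python 2 urllib)
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getimg(html):
    # Capture the src value of every .png image whose tag is followed by </p>
    reg = r'src="(.+?\.png)"></p>'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        # Save each image as 0100.jpg, 1100.jpg, ... in the current directory
        # (the data is PNG, so a .png extension would be more accurate)
        urllib.urlretrieve(imgurl, '%s100.jpg' % x)
        x += 1
    return imglist

html = gethtml("https://xianzhi.aliyun.com/forum/topic/1805/")
print getimg(html)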
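The script above relies on Python 2 (urllib.urlopen, urllib.urlretrieve, and the print statement). As a rough sketch only, the same steps might look like this on Python 3. The names extract_png_urls, download_all and main are illustrative and not from the original post, and the sketch assumes the page decodes as UTF-8 and that the captured URLs are absolute.

```python
import re
import urllib.request

# Same pattern as the original script: capture the src value of a .png
# image whose tag is immediately followed by </p>.
IMG_RE = re.compile(r'src="(.+?\.png)"></p>')


def get_html(url):
    """Fetch a page and decode it as UTF-8 (assumed encoding)."""
    with urllib.request.urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def extract_png_urls(html):
    """Return every captured image URL, in page order."""
    return IMG_RE.findall(html)


def download_all(urls):
    """Save each URL as 0.png, 1.png, ... in the current directory."""
    for index, url in enumerate(urls):
        urllib.request.urlretrieve(url, "%d.png" % index)


def main():
    """Fetch the target page, download its images, and print the URL list."""
    html = get_html("https://xianzhi.aliyun.com/forum/topic/1805/")
    urls = extract_png_urls(html)
    download_all(urls)
    print(urls)
```

Calling main() performs the network requests. Keeping extraction in its own function means the regular expression can be checked against a saved copy of the page without downloading anything.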
3. Run results
src="(.+?\.png)"></p>
Explanation:
src="    # matches the literal text src="
(.+?\.png)
# Parentheses denote a group; whatever matches inside them is captured.
# .+ matches one or more arbitrary characters; the question mark ? makes the match lazy, so it consumes as few characters as possible.
# Together, .+?\.png matches as few characters as possible up to .png, which keeps the match from running past the src attribute.
# This group is what captures the picture's URL from the page.
"></p>    # matches the literal text "></p>