This post implements a simple crawler in Python that downloads the images we want to a local folder. (The Python version is 3.6.0.)
I. Getting the entire page data
def gethtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html
Description
Passing a URL to the gethtml() function downloads the entire page.
The urllib.request module provides an interface for reading web page data; we can read data over HTTP and FTP as if it were a local file.
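As a quick sanity check, the same call works for any URL scheme urlopen() understands. A minimal sketch, using a local file:// URL so it runs without network access (the file name and contents here are made up for the demonstration):

```python
import pathlib
import tempfile
import urllib.request

# Write a small HTML file locally so the example needs no network.
tmp = pathlib.Path(tempfile.mkdtemp()) / "page.html"
tmp.write_text("<html><body>hello</body></html>", encoding="utf-8")

def gethtml(url):
    # urlopen handles http://, https://, ftp:// and file:// URLs alike
    with urllib.request.urlopen(url) as page:
        return page.read()

html = gethtml(tmp.as_uri())
print(html.decode("utf-8"))
```

For a real page you would pass an http:// or https:// address instead of the file:// URI.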
II. Filtering the data you want from the page
I found a few nice pictures on Baidu Tieba and wanted to download them. In Firefox, right-click an image, choose "Inspect Element" to enter developer mode, and locate the HTML around the image.
Now we look at the pattern the image URLs share and write a regular expression to match it.
reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'  # reference regex
Write the code:
def getimg(html):
    reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html.decode('utf-8'))
    return imglist
Description
re.compile() compiles a regular expression string into a pattern object.
re.findall() returns every substring of the HTML that matches imgre (the compiled regular expression).
Running the script collects the URL of every matching image on the page.
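The regular expression can be checked on a small HTML snippet before running it against a live page. A sketch with made-up image URLs (only the imgsa host prefix and the jpg/jpeg extensions come from the post; the hostnames and filenames are invented):

```python
import re

# Sample HTML with the same src="..." pattern the crawler targets
html = ('<img src="https://imgsa.example.com/pic/a1.jpg" />'
        '<img src="https://imgsa.example.com/pic/b2.jpeg" />'
        '<img src="https://other.example.com/logo.png" />')

reg = r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
imgre = re.compile(reg)
imglist = imgre.findall(html)
print(imglist)
# Only the two imgsa jpg/jpeg URLs match; the png on another host is skipped.
```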
III. Saving the filtered data locally
Write a save function:
def saveFile(x):
    if not os.path.isdir(path):
        os.makedirs(path)
    t = os.path.join(path, '%s.img' % x)
    return t
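A quick check of the save function, substituting a temporary directory for the post's d:/workspace path so the sketch runs anywhere:

```python
import os
import tempfile

# Stand-in for the post's d:/workspace/.../img directory
path = os.path.join(tempfile.gettempdir(), "crawler_demo_img")

def saveFile(x):
    # Create the target directory on first use, then build the file name
    if not os.path.isdir(path):
        os.makedirs(path)
    t = os.path.join(path, '%s.img' % x)
    return t

print(saveFile(0))
```

The function only builds the path and ensures the directory exists; the actual download happens elsewhere.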
Full code:
" "Created on July 15, 2017 @author:administrator" "Import urllib.request,osimport redef gethtml (URL): page=urllib.request.urlopen (URL) HTML=Page.read ()returnHtmlpath='d:/workspace/python1/reptile/__pycache__/img'def saveFile (x):ifNot Os.path.isdir (path): os.makedirs (path) T= Os.path.join (Path,'%s.img'%x)returnthtml=gethtml ('https://tieba.baidu.com/p/5248432620') print (HTML) print ('\ n') def getimg (HTNL): Reg=r'src= "(https://imgsa[^>]+\. (?: jpeg|jpg))"'Imgre=Re.compile (reg) Imglist=re.findall (Imgre,html.decode ('Utf-8')) x=0 forImgurlinchImglist:urllib.request.urlretrieve (Imgurl,savefile (x)) print (imgurl) x+=1 ifx== at: Breakprint (x)returnimglistgetimg (HTML) print ('End')
The core is the urllib.request.urlretrieve() method, which downloads remote data directly to a local file.
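urlretrieve() can be tried without network access by pointing it at a file:// URL; a sketch in which all paths are temporary stand-ins for a real remote image and the local save path:

```python
import pathlib
import tempfile
import urllib.request

tmpdir = pathlib.Path(tempfile.mkdtemp())
src = tmpdir / "remote.jpg"        # stand-in for a remote image
src.write_bytes(b"fake jpeg bytes")

dest = tmpdir / "0.img"
# urlretrieve copies the resource at the URL straight into the given filename
# and returns (filename, headers)
filename, headers = urllib.request.urlretrieve(src.as_uri(), str(dest))
print(filename)
```

In the crawler, src.as_uri() is replaced by the image's https:// URL and str(dest) by saveFile(x).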
Finally, one problem remains unsolved, and I would like to ask for help with it.
After downloading more than 23 photos, this error appears:
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
I don't know what causes it; any suggestions are welcome.
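One common cause of server errors during bulk downloads is that the server throttles rapid requests or rejects Python's default User-Agent. A possible workaround to try (a sketch only, not verified against Baidu's servers) is to send a browser-like User-Agent and pause between downloads:

```python
import time
import urllib.request

def download(url, filename, delay=1.0):
    # Some servers reject the default Python User-Agent; send a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as f:
        f.write(resp.read())
    time.sleep(delay)  # pause between downloads so the server is not hammered
```

In the loop above, urllib.request.urlretrieve(imgurl, saveFile(x)) would be replaced by download(imgurl, saveFile(x)).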