A few days ago a friend shared a Tieba post full of beautiful pictures. I had learned to write a simple Python crawler a while back, so this was a good chance to practice.
Here is a simple approach I found online:
#coding=utf-8
import urllib
import re

def gethtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1

html = gethtml("http://tieba.baidu.com/p/2460150866")

print getimg(html)
The code I wrote was much the same, but it didn't work. I even copied the code above and ran it unchanged, and that failed too.
There was nothing for it but to debug step by step.
First I wrote the fetched HTML to a file, html.txt, so that I could compare it with what the browser shows. That revealed the first problem: the HTML obtained through urllib is not the same as the source displayed by Ctrl+U in the browser.
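The dump step can be sketched as follows. This is a minimal sketch assuming Python 3; the helper name save_html is hypothetical and not part of the original script. It writes the raw bytes returned by read() unchanged, so the file shows exactly what urllib received.

```python
from pathlib import Path


def save_html(html, path="html.txt"):
    """Write the fetched page bytes to a file and return its path.

    save_html is a hypothetical helper name; html is expected to be
    the bytes object returned by page.read().
    """
    target = Path(path)
    target.write_bytes(html)
    return target
```

The resulting file can then be opened next to the browser's Ctrl+U view to spot the differences.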
Then I used the regular expression 'src=(.*?imgsrc.*?\.jpg)' to match against the contents of html.txt, and the key issue appeared: the matches were addresses like http%3a%2f%2fxx.jpg. The problem is obvious: in the HTML fetched with urllib, ':' and '/' had been percent-encoded as %3a and %2f. Downloading an image from the encoded address does not work, so the address has to be decoded back to its plain form first.
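To make the symptom concrete, here is a small sketch, assuming Python 3. The sample string and the imgsrc.example.com host are made up for illustration, and find_image_urls is a hypothetical name. It decodes each match with the standard library's urllib.parse.unquote, which is a swapped-in alternative to replacing the characters by hand.

```python
import re
from urllib.parse import unquote

# Hypothetical fragment standing in for the contents of html.txt.
sample = '<img src=http%3a%2f%2fimgsrc.example.com%2fpic%2f1.jpg>'


def find_image_urls(html):
    """Match percent-encoded image addresses and return them decoded."""
    encoded = re.findall(r'src=(.*?imgsrc.*?\.jpg)', html)
    # unquote turns %3a back into ':' and %2f back into '/'.
    return [unquote(address) for address in encoded]


print(find_image_urls(sample))
```

Decoding only the matched addresses leaves the rest of the page untouched, which avoids accidentally altering other percent sequences in the HTML.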
Here are my changes to gethtml(url):
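def gethtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    # turn the percent-encoded ':' and '/' back into plain characters
    html = re.sub('%3[Aa]', ':', html)
    html = re.sub('%2[Ff]', '/', html)
    return html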
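For readers on Python 3, the same idea might look like the sketch below. It rests on a few assumptions that are not in the original post: urlopen lives in urllib.request, the page bytes are decoded as UTF-8 with undecodable bytes replaced, and restore_separators is a hypothetical helper name introduced here so the substitution can be tried without a network connection.

```python
import re
import urllib.request


def restore_separators(html):
    """Turn percent-encoded ':' and '/' back into literal characters."""
    html = re.sub(r'%3[Aa]', ':', html)
    html = re.sub(r'%2[Ff]', '/', html)
    return html


def gethtml(url):
    """Fetch a page and return its text with ':' and '/' restored."""
    with urllib.request.urlopen(url) as page:
        # Assumption: the page is UTF-8; bad bytes are replaced, not fatal.
        html = page.read().decode('utf-8', errors='replace')
    return restore_separators(html)
```

Calling restore_separators('http%3a%2f%2fxx.jpg') gives back 'http://xx.jpg', the form that can actually be downloaded.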
The approach may be a bit clumsy, and suggestions are welcome. Still, the program now runs successfully. I'm sharing the pictures I downloaded, along with the address: http://tieba.baidu.com/p/3604860421?lp=5027&mo_device=1&pn=0&