Trying Out a Simple Python Crawler Yourself

Source: Internet
Author: User

A few days ago, a friend shared a Tieba post full of beautiful pictures. I had recently been learning to write simple crawlers in Python, so it seemed like a good chance to practice.

Here is a simple approach I found online:

    #coding=utf-8
    import urllib
    import re

    def gethtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        return html

    def getimg(html):
        reg = r'src="(.+?\.jpg)" pic_ext'
        imgre = re.compile(reg)
        imglist = re.findall(imgre, html)
        x = 0
        for imgurl in imglist:
            urllib.urlretrieve(imgurl, '%s.jpg' % x)
            x += 1

    html = gethtml("http://tieba.baidu.com/p/2460150866")

    print getimg(html)

The code I wrote was much the same, but it didn't work. I even copied this code directly and ran it, and that didn't succeed either.

There was nothing for it but to debug step by step.

First I wrote the fetched HTML into a file, html.txt, so that I could compare it with what the browser shows. That revealed the first problem: the HTML obtained through urllib is not the same as the source displayed by Ctrl+U in the browser.
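The debugging step described here, saving the fetched HTML so it can be compared with the browser's Ctrl+U output, might look roughly like the sketch below. It assumes Python 3, where `urllib.urlopen` has become `urllib.request.urlopen`; the helper names `fetch_html` and `save_html` are made up for illustration and are not the code I originally ran.

```python
import urllib.request


def fetch_html(url):
    """Fetch a page and return its raw bytes (Python 3 counterpart of urllib.urlopen)."""
    with urllib.request.urlopen(url) as page:
        return page.read()


def save_html(html_bytes, path="html.txt"):
    """Write the fetched bytes to a file so they can be compared with the browser source."""
    with open(path, "wb") as f:
        f.write(html_bytes)
    return path
```

Calling `save_html(fetch_html("http://tieba.baidu.com/p/2460150866"))` would then leave an html.txt file that can be opened next to the browser's view-source window.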

Then I used the regular expression 'src=(.*?imgsrc.*?\.jpg)' to match against the code in html.txt, and that is where the key issue showed up: the match was an address like http%3a%2f%2fxx.jpg. The problem is obvious: in the HTML that urllib fetched, ':' and '/' had been percent-encoded (URL-encoded) as %3a and %2f. Downloading an image from the encoded address of course does not work, so the address has to be decoded back to its original form first.
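To make the symptom concrete, here is a small sketch that runs the regular expression from the text (with stray spaces removed) against a made-up line of HTML. The sample string and the host `imgsrc.example.com` are invented stand-ins; the real page markup may differ.

```python
import re

# Made-up stand-in for one line of the saved html.txt.
sample_html = '<img src=http%3A%2F%2Fimgsrc.example.com%2Fforum%2F1.jpg>'

# Same pattern as in the text: lazily capture everything after "src="
# that contains "imgsrc" and ends in ".jpg".
img_re = re.compile(r'src=(.*?imgsrc.*?\.jpg)')

matches = img_re.findall(sample_html)
print(matches)
```

The captured string still contains %3A and %2F, so it cannot be used as a download address until it is decoded.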

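Here are my changes to gethtml(url):

    def gethtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        # The substitution patterns were garbled in this copy of the post;
        # they are reconstructed here from the description above.
        html = re.sub('(?i)%3a', ':', html)
        html = re.sub('(?i)%2f', '/', html)
        return html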

This approach may be a bit clumsy, and I would welcome any advice. Still, the program now runs successfully. The pictures were downloaded from this address: http://tieba.baidu.com/p/3604860421?lp=5027&mo_device=1&pn=0&
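One possibly tidier alternative to replacing each escape by hand is the standard library's percent-decoder. The sketch below assumes Python 3, where it lives at `urllib.parse.unquote`; the helper name `decode_image_urls` and the sample address are hypothetical and only serve the illustration.

```python
from urllib.parse import unquote


def decode_image_urls(encoded_urls):
    """Turn percent-encoded addresses (http%3A%2F%2F...) back into ordinary URLs."""
    return [unquote(u) for u in encoded_urls]


# Made-up address in the same shape as the ones matched from html.txt.
decoded = decode_image_urls(['http%3A%2F%2Fimgsrc.example.com%2Fforum%2F1.jpg'])
print(decoded)
```

Because `unquote` decodes every percent-escape, not only ':' and '/', it would also cope with other encoded characters that might appear in an image address.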
