Python Learning: Implementing a Simple Crawler

Source: Internet
Author: User

To pick up Python 3.x faster, I have been working directly through small practical projects. After looking up a lot of material I wrote this script, which crawls Baidu Images for pictures of 'Oriental Fantasy township' (东方幻想乡). I found a few problems with it, though:

1. Each picture is downloaded twice.

2. Only 81 pictures are downloaded, and only pictures whose URL contains fm=27 are matched ...

The code is as follows:

    from urllib import request
    import re


    class crawljpg:  # define a class that crawls pictures
        def __init__(self):  # constructor
            print('Link start!')

        def __gethtml(self, html):
            post = request.urlopen(html)
            page = post.read()
            return page

        def __getimg(self, html):
            page = self.__gethtml(html)  # get the HTML page data
            # Decode to a UTF-8 str; without this, re raises
            # "TypeError: cannot use a string pattern on a bytes-like object"
            page = page.decode('utf-8')
            recomp = re.compile(r'https://\w{3}.\w{8}.\w{3}/\w{27}/\w{2}/u=[0-9]{9,10},[0-9]{9,10}&fm=\w{2}&gp=0.jpg')
            imgurllist = recomp.findall(page)  # match the regex against the HTML page
            return imgurllist  # list of the JPG URLs that matched

        def run(self, html):
            imgurllist = self.__getimg(html)
            imgname = 0
            fp = open('C:\\users\\adimin\\desktop\\crawlimg\\imgurl.txt', 'w')
            for imgurl in imgurllist:
                request.urlretrieve(imgurl, 'C:\\users\\adimin\\desktop\\crawlimg\\{}.jpg'.format(str(imgname)))
                print('Downloads: ' + imgurl)
                fp.write(str(imgurl) + '\n')  # one URL per line
                imgname += 1
            fp.close()

        def __del__(self):  # destructor
            print("Download finished!")


    def main():
        url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%b6%ab%b7%bd%bb%c3%cf%eb%cf%e7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
        getimg = crawljpg()
        getimg.run(url)


    if __name__ == '__main__':
        main()
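One possible way to handle the repeated pictures, assuming the repeats come from the same URL appearing more than once in the page source (I have not confirmed that this is the cause). The sketch below reuses the script's pattern with the literal dots escaped, since an unescaped `.` matches any character, and drops repeated URLs while keeping their original order. The sample URL and the key names `thumbURL` and `middleURL` are made up for illustration.

```python
import re

# Same pattern as the script above, with the literal dots escaped.
IMG_URL_RE = re.compile(
    r'https://\w{3}\.\w{8}\.\w{3}/\w{27}/\w{2}/'
    r'u=[0-9]{9,10},[0-9]{9,10}&fm=\w{2}&gp=0\.jpg'
)


def extract_unique_urls(page):
    """Return the matching image URLs in first-seen order, without repeats."""
    # dict keys keep insertion order, so this removes duplicates stably.
    return list(dict.fromkeys(IMG_URL_RE.findall(page)))


# Hypothetical page fragment in which one URL is listed under two keys.
sample_url = ('https://ss0.bdstatic.com/' + 'a' * 27 +
              '/it/u=123456789,987654321&fm=27&gp=0.jpg')
sample_page = '"thumbURL":"{0}","middleURL":"{0}"'.format(sample_url)

unique_urls = extract_unique_urls(sample_page)
print(len(IMG_URL_RE.findall(sample_page)), len(unique_urls))
```

This does not address the second problem: the pattern is still very strict (fixed-length host parts, a 27-character path segment, a two-character fm value), so any image URL with a different shape is still skipped.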

I consulted a number of blogs and other materials, mainly:

1. http://blog.csdn.net/clj198606061111/article/details/50816115

2. https://www.cnblogs.com/speeding/p/5097790.html

3. http://urllib3.readthedocs.io/en/latest/

4. https://pyopenssl.org/en/stable/

5. https://docs.python.org/3.6/library/urllib.html

6. https://segmentfault.com/q/1010000004442233/a-1020000004448440

7. http://urllib3.readthedocs.io/en/latest/user-guide.html

8. Runoob (菜鸟教程) Python 3 tutorial

There were others that I no longer remember ...

I learned a lot from this exercise: I am now basically familiar with the basic syntax of Python 3, I understand how to write regular expressions, and I got some practice with the object-oriented approach to programming.

You can see this in the code: a class that crawls pictures, a constructor, a destructor, and so on.
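A small illustration of the constructor and "destructor" hooks mentioned here. Python's `__del__` is a finalizer that runs when the object is reclaimed, not when some piece of work succeeds, so a message printed from it only says that the object went away. The class name `Demo` and the `events` list are made up for this sketch, and the immediate call after `del` relies on CPython's reference counting; other interpreters may run the finalizer later.

```python
events = []


class Demo:
    def __init__(self):
        # Runs when the object is created.
        events.append('init')

    def __del__(self):
        # Runs when the object is reclaimed by the interpreter.
        events.append('del')


d = Demo()
del d  # in CPython the last reference is gone, so __del__ runs here
print(events)
```

For cleanup that has to happen at a known point, such as closing the URL list file, a `with open(...)` block is usually the more predictable choice.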

In fact, there is still a lot about the urllib3 package that I don't understand ... For example, I also wrote another version of the URL request using urllib3.PoolManager(). It runs without errors, but it does not download any pictures:

    from urllib import request
    import urllib3
    import certifi
    import re


    class crawljpg:  # define a class that crawls pictures
        def __init__(self):  # constructor
            print('Link start!')

        def __gethtml(self, html):
            # Initialization. To solve a certificate problem I installed pyOpenSSL,
            # which brings in the certifi package; this gets rid of the
            # InsecureRequestWarning.
            post = urllib3.PoolManager(
                cert_reqs='CERT_REQUIRED',
                ca_certs=certifi.where())
            post = post.urlopen('GET', html)  # request the web page
            page = post.read()  # read the page data
            return page

        def __getimg(self, html):
            page = self.__gethtml(html)  # get the HTML page data
            # Decode to a UTF-8 str; without this, re raises
            # "TypeError: cannot use a string pattern on a bytes-like object"
            page = page.decode('utf-8')
            recomp = re.compile(r'https://\w{3}.\w{8}.\w{3}/\w{27}/\w{2}/u=[0-9]{9,10},[0-9]{9,10}&fm=\w{2}&gp=0.jpg')
            imgurllist = recomp.findall(page)  # match the regex against the HTML page
            return imgurllist  # list of the JPG URLs that matched

        def run(self, html):
            imgurllist = self.__getimg(html)
            imgname = 0
            fp = open('C:\\users\\adimin\\desktop\\crawlimg\\imgurl.txt', 'w')
            for imgurl in imgurllist:
                request.urlretrieve(imgurl, 'C:\\users\\adimin\\desktop\\crawlimg\\{}.jpg'.format(str(imgname)))
                print('Downloads: ' + imgurl)
                fp.write(str(imgurl))
                imgname += 1
            fp.close()

        def __del__(self):  # destructor
            print("Download finished!")


    def main():
        url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%b6%ab%b7%bd%bb%c3%cf%eb%cf%e7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
        getimg = crawljpg()
        getimg.run(url)


    if __name__ == '__main__':
        main()
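One plausible reason this version finds nothing to download, offered as a guess rather than a confirmed diagnosis: by default urllib3 preloads the response body into `response.data`, so a later `read()` call may return an empty bytes object, the regular expression then matches nothing, and the download loop never runs. The sketch below reads the body from `.data` instead. The helper names `fetch_page` and `make_pool` are made up for illustration, and the third-party imports sit inside `make_pool` so the rest can be tried without urllib3 installed.

```python
def fetch_page(pool, url):
    """Fetch url through a urllib3-style pool and return the body as text."""
    response = pool.request('GET', url)
    # With urllib3's default preload_content=True, the body has already
    # been buffered and is available as response.data.
    return response.data.decode('utf-8')


def make_pool():
    """Build a PoolManager that verifies certificates with certifi's CA bundle."""
    import urllib3  # third-party packages, imported lazily
    import certifi
    return urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',
        ca_certs=certifi.where(),
    )
```

A call would then look like `page = fetch_page(make_pool(), url)`, with `page` passed to the same regular expression as before.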

I will keep studying this for a while.

Last time I said I couldn't write code in PyCharm; that is solved now. But I am still not very familiar with Python's keywords, so Sublime Text still works better for me ...

That wraps up this article.
