To learn Python 3.x faster, I decided to jump straight into small practical projects. After looking up a lot of material I wrote this script, which crawls Baidu Images for pictures of Touhou Gensokyo (东方幻想乡). It works, but I found a few problems:
1. Every picture is downloaded twice.
2. Only 81 pictures are downloaded, and only URLs containing fm=27 are matched...
Here is the code:
    from urllib import request
    import re


    class CrawlJpg:  # a class that crawls images
        def __init__(self):  # constructor
            print('Link start!')

        def __gethtml(self, html):
            post = request.urlopen(html)
            page = post.read()
            return page

        def __getimg(self, html):
            page = self.__gethtml(html)  # get the HTML page data (bytes)
            # Decode to str as UTF-8; without this the regex raises
            # "TypeError: cannot use a string pattern on a bytes-like object"
            page = page.decode('utf-8')
            recomp = re.compile(r'https://\w{3}\.\w{8}\.\w{3}/\w{27}/\w{2}/u=[0-9]{9,10},[0-9]{9,10}&fm=\w{2}&gp=0\.jpg')
            imgurllist = recomp.findall(page)  # match the regex against the HTML page
            return imgurllist  # list of matched JPG URLs

        def run(self, html):
            imgurllist = self.__getimg(html)
            imgname = 0
            with open('C:\\users\\adimin\\desktop\\crawlimg\\imgurl.txt', 'w') as fp:
                for imgurl in imgurllist:
                    request.urlretrieve(imgurl, 'C:\\users\\adimin\\desktop\\crawlimg\\{}.jpg'.format(imgname))
                    print('Downloads: ' + imgurl)
                    fp.write(imgurl + '\n')  # one URL per line
                    imgname += 1

        def __del__(self):  # destructor
            print("Download finished!")


    def main():
        url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%b6%ab%b7%bd%bb%c3%cf%eb%cf%e7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
        getimg = CrawlJpg()
        getimg.run(url)


    if __name__ == '__main__':
        main()
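The two problems listed at the top can be explored offline. The sketch below rests on two untested assumptions: that each image URL appears more than once in the page source (which would explain the duplicate downloads), and that some thumbnails carry an fm value that is not exactly two characters long. The function name `unique_image_urls`, the host `img.examples.com`, and the page fragment are all made up for illustration.

```python
import re

# Loosened variant of the pattern used in the script:
# the dots are escaped and any fm value is accepted (fm=\w+ instead of fm=\w{2}).
IMG_PATTERN = re.compile(
    r'https://\w{3}\.\w{8}\.\w{3}/\w{27}/\w{2}/'
    r'u=[0-9]{9,10},[0-9]{9,10}&fm=\w+&gp=0\.jpg'
)


def unique_image_urls(page):
    """Return the matching image URLs in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order and drops repeated keys
    return list(dict.fromkeys(IMG_PATTERN.findall(page)))


# Hypothetical page fragment: the first URL appears twice, the second once.
host = 'https://img.examples.com/' + 'a' * 27 + '/it/'
first = host + 'u=1234567890,123456789&fm=27&gp=0.jpg'
second = host + 'u=987654321,1234567890&fm=200&gp=0.jpg'
sample_page = '"thumbURL":"{0}","middleURL":"{0}","thumbURL":"{1}"'.format(first, second)

urls = unique_image_urls(sample_page)
print(len(urls), 'unique URLs out of', len(IMG_PATTERN.findall(sample_page)), 'matches')
```

If the real page behaves like this fragment, passing the deduplicated list to the download loop would stop each picture from being saved twice. Whether loosening fm= yields more than 81 pictures depends on what the page actually contains.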
I consulted a number of blogs and other materials, mainly:
1. http://blog.csdn.net/clj198606061111/article/details/50816115
2. https://www.cnblogs.com/speeding/p/5097790.html
3. http://urllib3.readthedocs.io/en/latest/
4. https://pyopenssl.org/en/stable/
5. https://docs.python.org/3.6/library/urllib.html
6. https://segmentfault.com/q/1010000004442233/a-1020000004448440
7. http://urllib3.readthedocs.io/en/latest/user-guide.html
8. The Runoob (菜鸟教程) Python 3 tutorial
And a few others that I no longer remember...
I learned a lot from this exercise: I am now basically familiar with the basic syntax of Python 3, I understand how to write regular expressions, and I got to practice object-oriented programming.
You can see this in the code: a class that crawls pictures, a constructor, a destructor, and so on.
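One caveat about the destructor: Python calls `__del__` when the object is garbage-collected, and the language does not guarantee exactly when that happens, so the "Download finished!" message may not appear at the moment you expect. A minimal sketch of an alternative using a context manager follows. The class name `CrawlSession` and its `log` list are hypothetical and only stand in for the real crawler.

```python
class CrawlSession:
    """Hypothetical wrapper that marks the start and end of a crawl."""

    def __init__(self):
        self.log = []

    def __enter__(self):
        # Runs when the with-block is entered
        self.log.append('Link start!')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with-block ends, even if an exception was raised
        self.log.append('Download finished!')
        return False  # do not swallow exceptions


with CrawlSession() as session:
    session.log.append('downloading...')

print(session.log)
```

The end-of-crawl step is then tied to the end of the `with` block rather than to garbage collection.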
In fact, there is still a lot about the urllib3 package that I don't understand... For example, I also wrote another version of the URL request using urllib3.PoolManager(). It runs without errors, but it does not download any pictures:
    from urllib import request
    import urllib3
    import certifi
    import re


    class CrawlJpg:  # a class that crawls images
        def __init__(self):  # constructor
            print('Link start!')

        def __gethtml(self, html):
            # Initialize the pool. To fix a certificate problem I installed pyOpenSSL,
            # which brings in the certifi package; this gets rid of the
            # InsecureRequestWarning.
            post = urllib3.PoolManager(
                cert_reqs='CERT_REQUIRED',
                ca_certs=certifi.where())
            post = post.urlopen('GET', html)  # request the web page
            page = post.read()  # read the page data
            return page

        def __getimg(self, html):
            page = self.__gethtml(html)  # get the HTML page data (bytes)
            # Decode to str as UTF-8; without this the regex raises
            # "TypeError: cannot use a string pattern on a bytes-like object"
            page = page.decode('utf-8')
            recomp = re.compile(r'https://\w{3}\.\w{8}\.\w{3}/\w{27}/\w{2}/u=[0-9]{9,10},[0-9]{9,10}&fm=\w{2}&gp=0\.jpg')
            imgurllist = recomp.findall(page)  # match the regex against the HTML page
            return imgurllist  # list of matched JPG URLs

        def run(self, html):
            imgurllist = self.__getimg(html)
            imgname = 0
            with open('C:\\users\\adimin\\desktop\\crawlimg\\imgurl.txt', 'w') as fp:
                for imgurl in imgurllist:
                    request.urlretrieve(imgurl, 'C:\\users\\adimin\\desktop\\crawlimg\\{}.jpg'.format(imgname))
                    print('Downloads: ' + imgurl)
                    fp.write(imgurl + '\n')  # one URL per line
                    imgname += 1

        def __del__(self):  # destructor
            print("Download finished!")


    def main():
        url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%b6%ab%b7%bd%bb%c3%cf%eb%cf%e7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
        getimg = CrawlJpg()
        getimg.run(url)


    if __name__ == '__main__':
        main()
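One possible explanation, which I have not verified against the author's setup: by default urllib3 preloads the response body into the response's `data` attribute, so a later `read()` may return nothing. The regular expression would then find no URLs and the download loop would never run. The sketch below reads `data` instead. The helper names `fetch_html` and `find_jpg_urls`, the simplified pattern, and the sample fragment are all hypothetical. urllib3 and certifi are imported inside the function so that the parsing helper can be tried without them installed.

```python
import re


def fetch_html(url):
    """Fetch a page with urllib3 and return it as text (hypothetical helper)."""
    import certifi   # third-party, imported lazily
    import urllib3   # third-party, imported lazily

    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    response = http.request('GET', url)
    # response.data holds the preloaded body as bytes
    return response.data.decode('utf-8')


def find_jpg_urls(page):
    """Return every https URL ending in .jpg found in the page text (simplified pattern)."""
    return re.findall(r'https://[^"\s]+?\.jpg', page)


# Offline check with a made-up fragment; no network access is needed here.
sample = ('<img src="https://img.examples.com/a/1.jpg"> '
          '<img src="https://img.examples.com/a/2.jpg">')
print(find_jpg_urls(sample))
```

To try it against the real page, the result of `fetch_html(url)` would be passed to the script's own regular expression in place of the simplified one.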
I'll keep studying this for a while.
Last time I said I couldn't get PyCharm to work; that is solved now. But since I'm still not very familiar with Python's keywords, I find Sublime Text the better choice for the moment...
To sum up, this article was about:
Learning Python: implementing a simple crawler