A Python 3 crawler for the 1024 image section
I have been playing with Python for a while and have wanted to write a crawler, but with the end of the term approaching there is really no spare time, so this is only a demo. It occasionally throws errors but keeps running, and it has already pulled down several hundred images without serious trouble; the remaining problems will probably have to wait until the holiday. I am posting the code first for discussion, and any guidance is welcome.
On to the main topic
I wrote this crawler while referring to the "pure smile" blog; the idea is largely the same. His post is here: http://www.cnblogs.com/ityouknow/p/6013074.html
My code is as follows:
from bs4 import BeautifulSoup
import re
import os
import requests
import json
import time
import OpenSSL

mainsite = "http://xxx.com/"  # the 1024 site URL, not pasted here

path = "D:\\"  # base save directory; not shown in the original post, set it to your own

def getbs(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Referer": "http://t66y.com//thread0806.php?fid=16&search=&page=1",
        "Host": "t66y.com"
    }
    req = requests.get(url, headers=header)
    req.encoding = "gbk"  # the 1024 image posts use gbk; without setting the encoding the text comes back garbled
    bsobj = BeautifulSoup(req.text, "html5lib")
    return bsobj

def getallpage(start, end):
    urls = []
    for i in range(start, end + 1):
        # list page of the image board; domain omitted, fill it in yourself
        url = "http://xxx/thread0806.php?fid=16&search=&page={}".format(str(i))
        bsobj = getbs(url)
        urls += bsobj.find_all("a", {"href": re.compile("^htm_data.*")})
    return urls

def getpicofpage(url):
    bsobj = getbs(url)
    div = bsobj.find("div", {"class": "tpc_content do_not_catch"})
    if div is None:
        print("content cannot be obtained, skip")
        return -1
    inputs = div.find_all("input")
    title = bsobj.find("h4").text
    if inputs == []:
        print("no images on this page, skip")
        return -1
    num = 1
    if not os.path.exists(path + "new\\tupian\\" + title):
        os.mkdir(path + "new\\tupian\\" + title)
    else:
        print("this folder already exists, skip")
        return -1
    for i in inputs:
        try:  # the problem mainly lies here
            res = requests.get(i["src"], timeout=25)
            with open(path + "new\\tupian\\" + title + "\\" + str(time.time())[:10] + ".jpg", "wb") as f:
                f.write(res.content)
        except requests.exceptions.Timeout:
            # some images time out while downloading; without a timeout the request can hang forever
            print("time-out, skip this page")
            return -1
        except OpenSSL.SSL.WantReadError:
            # the other problem: sometimes this exception pops up, but it is never caught here
            # and I have not figured out what it actually is
            print("OpenSSL.SSL.WantReadError, skip")
            return -1
        print(num)
        num += 1

l = getallpage(5, 10)
page = 1
ed = []
for i in l:
    url = mainsite + i["href"]
    if url in ed:
        print(url + " this page has been collected, skip")
        continue
    print(url)
    getpicofpage(url)
    ed.append(url)
    print("page {} collected".format(page))
    page += 1
    time.sleep(3)
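One small thing about the save path: str(time.time())[:10] only has one-second resolution, so two images downloaded within the same second get the same name and overwrite each other. A rough sketch of an alternative naming scheme based on the image index; save_pic is a hypothetical helper, not part of the code above:

import os
import requests

def save_pic(src, folder, index):
    # hypothetical helper: name files by their index inside the post instead of
    # str(time.time())[:10], whose one-second resolution lets images downloaded
    # in the same second overwrite each other
    res = requests.get(src, timeout=25)
    with open(os.path.join(folder, "{}.jpg".format(index)), "wb") as f:
        f.write(res.content)

# inside getpicofpage the loop would then look roughly like:
# folder = os.path.join(path, "new", "tupian", title)
# for num, i in enumerate(inputs, 1):
#     save_pic(i["src"], folder, num)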
In addition, here is the full traceback of the SSL exception mentioned above:
Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 441, in wrap_socket
    cnx.do_handshake()
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1806, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1521, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 595, in urlopen
    self._prepare_proxy(conn)
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 816, in _prepare_proxy
    conn.connect()
  File "D:\python\Lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "D:\python\Lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 445, in wrap_socket
    raise timeout('select timed out')
socket.timeout: select timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\python\Lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PyCharm 2017.3.1\helpers\pydev\pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\PyCharm 2017.3.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "D:/learnPython/crawler.py", line 301, in <module>
    getpicofpage(url)
  File "D:/learnPython/crawler.py", line 281, in getpicofpage
    res = requests.get(i["src"], timeout=25)
  File "D:\python\Lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python\Lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "D:\python\Lib\site-packages\requests\adapters.py", line 502, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))
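Looking at the bottom of the traceback, the WantReadError is raised during the TLS handshake deep inside urllib3, but by the time it reaches getpicofpage requests has already wrapped it as requests.exceptions.ProxyError, which derives from requests.exceptions.ConnectionError and ultimately from requests.exceptions.RequestException. That would explain why the except OpenSSL.SSL.WantReadError branch never catches anything: at least in this traceback, the OpenSSL exception never propagates that far. A minimal sketch of the except clauses catching the requests base class instead, as they would sit inside the loop in getpicofpage:

    for i in inputs:
        try:
            res = requests.get(i["src"], timeout=25)
        except requests.exceptions.Timeout:
            print("time-out, skip this page")
            return -1
        except requests.exceptions.RequestException as e:
            # ProxyError / ConnectionError are subclasses of RequestException,
            # and that is how the SSL-level WantReadError actually surfaces here
            print("request failed, skip this page:", e)
            return -1
        ...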
One more thing: even with a VPN running I could not fetch the content by crawling directly; the host simply did not respond. I later found that crawling works when the traffic goes through Fiddler. It is probably an IP or proxy issue, but I have not looked into it carefully yet, so any pointers are welcome.
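On that proxy point: the "Cannot connect to proxy" in the traceback suggests requests is picking up a system-level proxy setting. Since things work when going through Fiddler, it may be cleaner to point requests at that local proxy explicitly instead of relying on whatever the system has configured. A minimal sketch, assuming Fiddler on its default port 8888 (an assumption; adjust to your own setup):

import requests

# 127.0.0.1:8888 is Fiddler's default listening address; change it to match
# your own proxy/VPN client if it listens elsewhere
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
req = requests.get(url, headers=header, proxies=proxies, timeout=25)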