Python 3 crawler for the 1024 image board

Source: Internet
Author: User


I have been playing with Python for a while and have wanted to write a crawler, but with the end of term approaching there really is no time. This is just a demo: it throws the occasional error but keeps running, and it has downloaded several hundred images without further problems. The remaining issues will probably have to wait until the holiday. I am posting the code first for discussion; comments and pointers are welcome.

On to the code

I wrote this crawler with reference to the pure smile blog; the idea is largely the same. His post is here: http://www.cnblogs.com/ityouknow/p/6013074.html

My code is as follows:

from bs4 import BeautifulSoup

import re
import os
import requests
import json
import time

import OpenSSL

# Base directory the images are saved under. The original post never shows how
# `path` is defined, so set it to wherever you want the images to go.
path = "D:\\"

mainsite = "http://xxxx.com/"  # the 1024 domain, deliberately not pasted here


def getbs(url):
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
              "Referer": "http://t66y.com//thread0806.php?fid=16&search=&page=1",
              "Host": "t66y.com"}
    req = requests.get(url, headers=header)
    req.encoding = "gbk"  # the 1024 image posts are encoded in gbk; without this you get mojibake
    bsobj = BeautifulSoup(req.text, "html5lib")
    return bsobj


def getallpage(start, end):
    # collect the post links from list pages `start` through `end`
    urls = []
    for i in range(start, end + 1):
        url = "http://xxxx.com/thread0806.php?fid=16&search=&page={}".format(str(i))  # same domain, left out on purpose
        bsobj = getbs(url)
        urls += bsobj.find_all("a", {"href": re.compile("^htm_data.*")})
    return urls


def getpicofpage(url):
    bsobj = getbs(url)
    div = bsobj.find("div", {"class": "tpc_content do_not_catch"})
    if div is None:
        print("content cannot be obtained, skip")
        return -1
    inputs = div.find_all("input")
    title = bsobj.find("h4").text
    if inputs == []:
        print("no images on this page, skip")
        return -1
    num = 1
    if not os.path.exists(path + "new\\tupian\\" + title):
        os.mkdir(path + "new\\tupian\\" + title)
    else:
        print("this folder already exists, skip")
        return -1
    for i in inputs:
        try:  # the problem mainly lies here
            res = requests.get(i["src"], timeout=25)
            with open(path + "new\\tupian\\" + title + "\\" + str(time.time())[:10] + ".jpg", 'wb') as f:
                f.write(res.content)
        except requests.exceptions.Timeout:
            # some image requests time out; without a timeout the script can hang there forever
            print("time-out, skip this page")
            return -1
        except OpenSSL.SSL.WantReadError:
            # this exception also jumps out sometimes, but it is never caught here;
            # I have not figured out what it actually is yet
            print("OpenSSL.SSL.WantReadError, skip")
            return -1
        print(num)
        num += 1


l = getallpage(5, 10)
page = 1
ed = []
for i in l:
    url = mainsite + i["href"]
    if url in ed:
        print(url + " this page has already been collected, skip")
        continue
    print(url)
    getpicofpage(url)
    ed.append(url)
    print("page {} collected".format(page))
    page += 1
    time.sleep(3)
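One thing I noticed while cleaning this up (not in the original post): str(time.time())[:10] only has one-second resolution, so two images saved within the same second get the same file name and the later one overwrites the earlier one. A minimal tweak, reusing the num counter that getpicofpage() already keeps, would be:

# sketch: make the file name unique within a post by using the running counter
filename = path + "new\\tupian\\" + title + "\\" + str(num) + ".jpg"
with open(filename, 'wb') as f:
    f.write(res.content)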

Here is the SSL exception mentioned above:

 

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 441, in wrap_socket
    cnx.do_handshake()
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1806, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1521, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 595, in urlopen
    self._prepare_proxy(conn)
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 816, in _prepare_proxy
    conn.connect()
  File "D:\python\Lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "D:\python\Lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 445, in wrap_socket
    raise timeout('select timed out')
socket.timeout: select timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\python\Lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PyCharm 2017.3.1\helpers\pydev\pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\PyCharm 2017.3.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "D:/learnPython/crawler.py", line 301, in <module>
    getpicofpage(url)
  File "D:/learnPython/crawler.py", line 281, in getpicofpage
    res = requests.get(i["src"], timeout=25)
  File "D:\python\Lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python\Lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "D:\python\Lib\site-packages\requests\adapters.py", line 502, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))
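Reading the traceback, what finally escapes requests.get() is not OpenSSL.SSL.WantReadError itself but a requests.exceptions.ProxyError that wraps it, which would explain why the except OpenSSL.SSL.WantReadError branch never fires. ProxyError, like Timeout, derives from requests.exceptions.RequestException, so one way to cover both cases is to catch the broader class. An untested sketch (save_path is just a hypothetical name for the file path string built in getpicofpage()):

try:
    res = requests.get(i["src"], timeout=25)
    with open(save_path, 'wb') as f:  # save_path: hypothetical name for the path built above
        f.write(res.content)
except requests.exceptions.Timeout:
    print("time-out, skip this page")
    return -1
except requests.exceptions.RequestException as e:
    # ProxyError / ConnectionError (which wrap the OpenSSL error here) both
    # inherit from RequestException, so they land in this branch
    print("request failed ({}), skip".format(e))
    return -1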

One more thing: even with a VPN running I could not fetch the content by crawling directly; the requests failed with the host not responding. Later I found that routing the traffic through Fiddler made the crawl work. It is probably an IP or proxy issue, but I have not looked into it carefully. Advice is welcome.
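A guess at why Fiddler helps, not something I have verified: Fiddler works as a local HTTP proxy (by default on 127.0.0.1:8888), so sending the traffic through it really means sending every request through that proxy. If that is the difference, the same thing can be done directly in requests by passing the proxy explicitly:

# sketch: point requests at an explicit proxy instead of relying on system settings;
# 127.0.0.1:8888 is Fiddler's default listening port, replace it with whatever works for you
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
req = requests.get(url, headers=header, proxies=proxies, timeout=25)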

 
