A Python 3 crawler for the 1024 image section
I have been playing with Python for a while and have wanted to write a crawler, but with the end of the term approaching there is really no spare time, so this is only a demo. It occasionally throws errors but keeps running, and it has already pulled down several hundred images without serious trouble; the remaining problems will probably have to wait until the holiday. I am posting the code first for discussion, and any guidance is welcome.
On to the main topic
I wrote this crawler while referring to the "pure smile" blog; the idea is largely the same. His post is here: http://www.cnblogs.com/ityouknow/p/6013074.html
My code is as follows:
from bs4 import BeautifulSoup
import re
import os
import requests
import json
import time
import OpenSSL

mainsite = "http://xxx.com/"  # the 1024 site URL, not pasted here

path = "D:\\"  # base save directory; not shown in the original post, set it to your own

def getbs(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Referer": "http://t66y.com//thread0806.php?fid=16&search=&page=1",
        "Host": "t66y.com"
    }
    req = requests.get(url, headers=header)
    req.encoding = "gbk"  # the 1024 image posts use gbk; without setting the encoding the text comes back garbled
    bsobj = BeautifulSoup(req.text, "html5lib")
    return bsobj

def getallpage(start, end):
    urls = []
    for i in range(start, end + 1):
        # list page of the image board; domain omitted, fill it in yourself
        url = "http://xxx/thread0806.php?fid=16&search=&page={}".format(str(i))
        bsobj = getbs(url)
        urls += bsobj.find_all("a", {"href": re.compile("^htm_data.*")})
    return urls

def getpicofpage(url):
    bsobj = getbs(url)
    div = bsobj.find("div", {"class": "tpc_content do_not_catch"})
    if div is None:
        print("content cannot be obtained, skip")
        return -1
    inputs = div.find_all("input")
    title = bsobj.find("h4").text
    if inputs == []:
        print("no images on this page, skip")
        return -1
    num = 1
    if not os.path.exists(path + "new\\tupian\\" + title):
        os.mkdir(path + "new\\tupian\\" + title)
    else:
        print("this folder already exists, skip")
        return -1
    for i in inputs:
        try:  # the problem mainly lies here
            res = requests.get(i["src"], timeout=25)
            with open(path + "new\\tupian\\" + title + "\\" + str(time.time())[:10] + ".jpg", "wb") as f:
                f.write(res.content)
        except requests.exceptions.Timeout:
            # some images time out while downloading; without a timeout the request can hang forever
            print("time-out, skip this page")
            return -1
        except OpenSSL.SSL.WantReadError:
            # the other problem: sometimes this exception pops up, but it is never caught here
            # and I have not figured out what it actually is
            print("OpenSSL.SSL.WantReadError, skip")
            return -1
        print(num)
        num += 1

l = getallpage(5, 10)
page = 1
ed = []
for i in l:
    url = mainsite + i["href"]
    if url in ed:
        print(url + " this page has been collected, skip")
        continue
    print(url)
    getpicofpage(url)
    ed.append(url)
    print("page {} collected".format(page))
    page += 1
    time.sleep(3)
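One small thing about the save path: str(time.time())[:10] only has one-second resolution, so two images downloaded within the same second get the same name and overwrite each other. A rough sketch of an alternative naming scheme based on the image index; save_pic is a hypothetical helper, not part of the code above:

import os
import requests

def save_pic(src, folder, index):
    # hypothetical helper: name files by their index inside the post instead of
    # str(time.time())[:10], whose one-second resolution lets images downloaded
    # in the same second overwrite each other
    res = requests.get(src, timeout=25)
    with open(os.path.join(folder, "{}.jpg".format(index)), "wb") as f:
        f.write(res.content)

# inside getpicofpage the loop would then look roughly like:
# folder = os.path.join(path, "new", "tupian", title)
# for num, i in enumerate(inputs, 1):
#     save_pic(i["src"], folder, num)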
In addition, here is the full traceback of the SSL exception mentioned above:
Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 441, in wrap_socket
    cnx.do_handshake()
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1806, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1521, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 595, in urlopen
    self._prepare_proxy(conn)
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 816, in _prepare_proxy
    conn.connect()
  File "D:\python\Lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "D:\python\Lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 445, in wrap_socket
    raise timeout('select timed out')
socket.timeout: select timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\python\Lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PyCharm 2017.3.1\helpers\pydev\pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "D:\PyCharm 2017.3.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "D:/learnPython/crawler.py", line 301, in <module>
    getpicofpage(url)
  File "D:/learnPython/crawler.py", line 281, in getpicofpage
    res = requests.get(i["src"], timeout=25)
  File "D:\python\Lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python\Lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "D:\python\Lib\site-packages\requests\adapters.py", line 502, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))
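Looking at the bottom of the traceback, the WantReadError is raised during the TLS handshake deep inside urllib3, but by the time it reaches getpicofpage requests has already wrapped it as requests.exceptions.ProxyError, which derives from requests.exceptions.ConnectionError and ultimately from requests.exceptions.RequestException. That would explain why the except OpenSSL.SSL.WantReadError branch never catches anything: at least in this traceback, the OpenSSL exception never propagates that far. A minimal sketch of the except clauses catching the requests base class instead, as they would sit inside the loop in getpicofpage:

    for i in inputs:
        try:
            res = requests.get(i["src"], timeout=25)
        except requests.exceptions.Timeout:
            print("time-out, skip this page")
            return -1
        except requests.exceptions.RequestException as e:
            # ProxyError / ConnectionError are subclasses of RequestException,
            # and that is how the SSL-level WantReadError actually surfaces here
            print("request failed, skip this page:", e)
            return -1
        ...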
One more thing: even with a VPN running I could not fetch the content by crawling directly; the host simply did not respond. I later found that crawling works when the traffic goes through Fiddler. It is probably an IP or proxy issue, but I have not looked into it carefully yet, so any pointers are welcome.
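On that proxy point: the "Cannot connect to proxy" in the traceback suggests requests is picking up a system-level proxy setting. Since things work when going through Fiddler, it may be cleaner to point requests at that local proxy explicitly instead of relying on whatever the system has configured. A minimal sketch, assuming Fiddler on its default port 8888 (an assumption; adjust to your own setup):

import requests

# 127.0.0.1:8888 is Fiddler's default listening address; change it to match
# your own proxy/VPN client if it listens elsewhere
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
req = requests.get(url, headers=header, proxies=proxies, timeout=25)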