This article mainly introduces Python web crawlers (3): exception handling and HTTP status code classification. Let's talk about HTTP exception handling.
When urlopen cannot handle a response, it raises a URLError (its subclass HTTPError covers responses with error status codes).
Python is a powerful, general-purpose, object-oriented programming language whose features greatly ease application development. Here, let's take a look at some Python web crawler methods.
Today, I came across a webpage, and it was very troublesome to crawl.
download and save the picture; open the file as follows. The next step is to recognize the verification code in the image, which requires the pytesseract and PIL libraries. First install Tesseract-OCR (download it online); the default installation path is C:\Program Files\Tesseract-OCR. Add that path to the system PATH environment variable. Then install pytesseract and PIL via pip. Let's see how they are used.
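The steps above can be sketched as follows. This is a minimal sketch, not the article's exact code: it assumes Pillow and pytesseract are installed and the Tesseract-OCR binary is on PATH; the file name `captcha.png` and the cleanup helper are illustrative.

```python
import re

def clean_captcha(text):
    """Strip whitespace and other non-alphanumeric noise from raw OCR output."""
    return re.sub(r'[^A-Za-z0-9]', '', text)

def recognize_captcha(path):
    """OCR a captcha image. Requires Pillow, pytesseract, and Tesseract-OCR."""
    from PIL import Image   # imported lazily so clean_captcha works without Pillow
    import pytesseract
    img = Image.open(path).convert('L')  # grayscale often improves recognition
    return clean_captcha(pytesseract.image_to_string(img))
```

Usage would be something like `code = recognize_captcha('captcha.png')`, after which the cleaned string can be submitted with the login form.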
, it's just a dict with a different name. The role of a field is (see the official documentation): a Field object holds the metadata for each field. For example, in the following example, last_updated indicates the serialization function for that field. You can specify any type of metadata for each field; the Field object places no restrictions on the accepted values. For this reason, the documentation cannot provide a reference list of keys for all available metadata.
    print imglist
    cnt = 1
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '%s.jpg' % cnt)
        cnt += 1

if __name__ == '__main__':
    html = gethtml('http://www.baidu.com')
    getimg(html)
According to the above method, we can crawl a certain page, and then extract the data we need.
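Under Python 3, the same image crawler can be sketched as follows. This is an illustrative sketch: the regular expression assumes images appear as `src="...jpg"` attributes, which real pages may not follow, and the target page is only an example.

```python
import re
import urllib.request

def get_img_urls(html):
    """Extract absolute .jpg URLs from src="..." attributes (assumed pattern)."""
    return re.findall(r'src="(http[^"]+\.jpg)"', html)

def download_imgs(urls):
    """Save each image as 1.jpg, 2.jpg, ... in the current directory."""
    for cnt, imgurl in enumerate(urls, start=1):
        urllib.request.urlretrieve(imgurl, '%s.jpg' % cnt)

if __name__ == '__main__':
    raw = urllib.request.urlopen('http://www.baidu.com').read()
    download_imgs(get_img_urls(raw.decode('utf-8', errors='replace')))
```

The extraction and the download are split into two functions so the regex can be tested without any network access.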
In fact, web crawlers built on the urllib module are extremely inefficient. Let us introduce Tornado.
The code and tools used: sample site source + framework + book PDF + chapter code.
Link: https://pan.baidu.com/s/1miHjIYk  Password: af35
Environment: Python 2.7, Win7 x64.
Sample site setup: wswp-places.zip is the book's sample-site source code, and web2py_src.zip is the framework the site uses.
1. Decompress web2py_src.zip.
2. Go to the web2py/applications directory.
3. Extract wswp-places.zip to
1. Python code: for example, we crawl data from the site http://gitbook.cn/.
2. Before running the code, download and install the chardet and requests packages (both installation packages can be downloaded from my blog for free). Unzip them and place them in the directory where Python is installed.
3. Open t
Project structure:
tutorial/: the project's Python module; you will add your code here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where spiders are stored
2. Define the target (Item)
In Scrapy, Items are the containers used to hold the scraped data.
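A minimal sketch of an Item definition. The field names are illustrative, and Scrapy must be installed for the real class; the except branch is only a dict stand-in so the snippet also runs without Scrapy, which works because Items expose a dict-like interface.

```python
try:
    import scrapy

    class TutorialItem(scrapy.Item):
        # Each Field is a placeholder for one piece of scraped data.
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
except ImportError:
    # Stand-in with the same dict-like interface for environments without Scrapy.
    TutorialItem = dict

item = TutorialItem()
item['title'] = 'Example page'
```

Spiders fill such items field by field and yield them to the pipelines for further processing.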
Simply put, a crawler consists of two steps: get the web page text, then filter out the data.
1. Get the HTML text. Python makes getting HTML very convenient; just a few lines of code do what we need.
The code is as follows:
    import urllib

    def gethtml(url):
        page = urllib.urlopen(url)
        html = page.read()
        page.close()
        return html
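The snippet above is Python 2. Under Python 3, urllib.urlopen moved to urllib.request.urlopen, and the same function can be sketched like this (the utf-8 decoding is an assumption; real pages may use other encodings):

```python
import urllib.request

def gethtml(url):
    """Fetch a URL and return its body as text (assuming utf-8)."""
    with urllib.request.urlopen(url) as page:  # context manager closes the connection
        return page.read().decode('utf-8', errors='replace')
```

The `with` block replaces the explicit `page.close()` of the Python 2 version.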
Awesome-crawler-cn: a summary of Internet crawlers, spiders, data collectors, and web parsers. Because new technologies keep evolving and new frameworks keep appearing, this article will be continuously updated.
Exchange and discussion: you are welcome to recommend open source web crawlers you know.
just a webpage introduction. Next, let's look at a novel interface: below is a novel from the Kuaidu (fast reading) network, with the novel text on the left and the relevant webpage code on the right. The text of the entire novel is contained in the elements whose tags are
If we had a tool that could automatically download the corresponding HTML elements, it could automatically download the novel. This is the job of a web crawler.
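Extracting the text from those elements can be sketched with the standard library's html.parser. The tag name `div` and the class name `content` are assumptions here, since the actual element names depend on the site being scraped:

```python
from html.parser import HTMLParser

class NovelTextParser(HTMLParser):
    """Collect the text inside <div class="content"> elements (tag/class assumed)."""
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and ('class', 'content') in attrs:
            self.in_content = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_content = False

    def handle_data(self, data):
        if self.in_content:
            self.chunks.append(data)

parser = NovelTextParser()
parser.feed('<div class="content">Chapter 1: text of the novel.</div><p>menu</p>')
text = ''.join(parser.chunks)
```

Libraries such as BeautifulSoup offer the same extraction with less code, but the stdlib version shows the mechanism.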
About Heritrix
Web crawler framework: Scrapy
Scrapy is an asynchronous processing framework based on Twisted, implemented purely in Python. You only need to develop a few modules to easily implement a crawler; it is very convenient.
Please indicate the author and source when reprinting: http://blog.csdn.net/c406495762
GitHub code: https://github.com/Jack-Cherish/python-spider
Python version: Python 3.x
Running platform: Windows
IDE: Sublime Text 3
PS: This is a GitChat online sharing article, published on September 19, 2017. Activity address: http://gitbook.cn/m/mazi/acti
Solving the garbled-text problem in Python web crawlers
Crawler garbled-text problems come in many different kinds, including not only garbled Chinese characters but encoding issues more generally.
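A common cause is decoding a response with the wrong codec. A minimal sketch of a fallback decoder using only the standard library (the codec list is an assumption; the chardet package mentioned earlier can guess the encoding instead of trying a fixed list):

```python
def decode_page(raw, encodings=('utf-8', 'gbk', 'big5')):
    """Try each codec in turn; fall back to utf-8 with replacement characters."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode('utf-8', errors='replace')

# A GBK-encoded page is decoded correctly instead of producing mojibake:
raw = '爬虫'.encode('gbk')
text = decode_page(raw)
```

Decoding GBK bytes as utf-8 directly would raise or produce garbage; the fallback loop recovers the original characters.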
The content of this page comes from the Internet and does not represent Alibaba Cloud's opinion;
products and services mentioned on this page have no relationship with Alibaba Cloud. If the
content of the page makes you feel confused, please write us an email, and we will handle the problem
within 5 days after receiving your email.
If you find any instances of plagiarism from the community, please send an email to:
info-contact@alibabacloud.com
and provide relevant evidence. A staff member will contact you within 5 working days.