Python Crawler Basics - The urllib Module (1)

Source: Internet
Author: User
Tags: exception handling, urlencode

One very popular application of Python is the web crawler. Crawlers can gather the information we need (and have even been abused as DDoS tools). The popular crawling tools today are scrapy and similar frameworks, but before learning those, you should first understand the urllib module and its basic working principle.


Basic idea of a crawler:

Obtain the target URLs, scan the contents of each page, and use regular-expression matching to extract and download the required content.
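That idea can be sketched in a few lines. Here the HTML is inlined as a stand-in for a page fetched with urllib.request.urlopen, and the div class is a made-up example:

```python
import re

# Stand-in for page content that would normally come from
# urllib.request.urlopen(url).read().decode('utf-8')
html = '<div class="name">Press A</div><div class="name">Press B</div>'

# Regex-match the required content out of the scanned page
names = re.findall(r'<div class="name">(.*?)</div>', html)
print(names)  # ['Press A', 'Press B']
```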


Official help documentation for urllib:

https://docs.python.org/3/library/urllib.request.html




Basic Python syntax and regular expressions are not discussed here. See the example below:


Example 1: Get the publisher information from Douban Read, then write the results to a text file and an Excel file

The key is to parse the HTML tags of the page so that a regular expression can match the required strings.


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: yuan li
import re, urllib.request

print("Reading page ...")
data = urllib.request.urlopen('https://read.douban.com/provider/all').read().decode('utf8')
pat = '<div class="name">(.+?)</div>'
result = re.compile(pat).findall(data)

print("Writing document ...")
fp = open('c:\\temp\\publisher.txt', 'a', encoding='utf8')
for item in result:
    fp.write(item + "\n")
fp.close()

print("Writing to Excel ...")
import xlsxwriter
workbook = xlsxwriter.Workbook('publisher.xlsx')
worksheet = workbook.add_worksheet('publisher')
row = 0
col = 0
worksheet.write(row, col, 'publisher name')
for item in range(len(result)):
    row += 1
    worksheet.write(row, col, result[item])
workbook.close()
Reading page ...
['', 'Beijing Normal University Press', 'Baihuazhou Literary Publishing House', 'Baihua Literary Press', 'Yangtze River Digital', 'Chongqing University Press', 'Oriental Extract', "Reader's Book", 'Electronic Industry Press', 'Contemporary China Press', 'First Finance Weekly', 'Douban Reading the Same Museum', 'Douban', 'Douban Read', 'Douban Reading and Publishing Program', 'Oriental Bar Tower Culture', 'Phoenix One Force', 'Phoenix Yue Shi Wang Wen', 'Phoenix Linkage', 'fiberead', 'Fudan University Press', 'Phoenix Snow Diffuse', 'Nutshell Read', 'Fruit Wheat Culture', 'Guangxi Normal University Press', 'Hangzhou Blue Lion Culture Creative Co., Ltd.', 'After Wave Publishing Company', 'East China Normal University Press', '', 'Han Tang Sunshine', 'Chinese Times', 'Hubei Publishing House', '', 'Chengxuan', 'Dolphin Publishing House', 'Iris Publishing', 'Chemical Industry Press', 'Huazhong University Press', 'Hubei Science and Technology Press', 'Heilongjiang Northern Literary Publishing House', 'Chinese Classics', 'HarperCollins', 'Jushi Culture', 'Jincheng Press', 'Jianshu', 'This Ancient Legend', 'Jiangsu Publishing House', 'Kyushu Fantasy', 'Sci-Fi World', 'Cool Culture', 'Ideal Country', 'Lijiang Press', 'Mill Iron Number Au', 'Ningbo Press', 'Southern Character Weekly', 'One A', 'Editorial Culture', 'Tsinghua University Press', 'Qingdao Press', 'People Magazine', "People's Literature Press", "People's Post and Telecommunications Press", 'Confucianism Xin Shin', "People's Oriental Publishing Media", "People's Literature Magazine", 'Shanghai Nine-Long Reading People', 'Century Wen Jing', 'Sichuan Digital Publishing Media Co., Ltd.', 'Shanghai Translation Press', 'Times Chinese', 'Shanghai Ya Culture', 'Century Wenrui', 'Times Chinese', 'Commercial Press', 'Living, Reading and Learning Joint Publishing', 'Shanghai Academy of Social Sciences Publishing House', 'Social Sciences Literature Publishing House', 'Shanxi Spring Electronic Audio-Visual Publishing House', 'Times Number', "Shaanxi People's Publishing Beijing Branch", '"Bookstore" Magazine', 'World Map Beijing', 'Sichuan Literary Press', 'Shanghai Literary Press', "Shanghai People's Publishing House", 'Shanghai Jiaotong University Press', '', "Shanghai People's Art Publishing House", 'Turing Community', 'Trajectory', 'Wuhan University Press Beijing Branch', 'Million Have Books', 'I and Douban', 'New Classic Culture ebook', 'Nova Press', 'Xinhua Pioneer Culture Media', 'Snowball', 'Suspense World', 'Modern Press', 'Southwest University of Finance and Economics Press', 'Xinhua Press', 'Xinhua Pioneer Publishing Technology', 'Translation Lin Publishing House', 'Translation Words • Things Library', 'Translations • Gutenberg Plan', 'Yue Ji', 'Sunshine Blog', 'Yue Read Famous', 'Yanshan Press', 'Read Text Group Chinese World', 'CITIC Press', 'Renmin University Press', 'For Chinese', 'China Light Industry Publishing House', 'Purple Book', 'Zhejiang Edition Digital Media', 'Central Compiling Press', 'Know', 'China National Geographic Book Department', 'Zhejiang Photography Press', 'China Economic Press', 'China Youth Press', 'China Democratic Legal System Press', 'Communication University Publishing House', 'Chinese Language Publishing House', 'Zhejiang University Press', 'Cham Lu Culture', 'Zhejiang Literary Press', 'Zhonghua Book Company']
Writing document ...
Writing to Excel ...

We know that in the HTTP protocol, the two most common methods are GET and POST. GET generally passes parameters through the URL, usually when refreshing a page to fetch information; POST generally submits data through a form or Ajax. Here are two simple examples.
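The difference can be sketched with urllib.parse; the parameter names below are just illustrative, and no request is actually sent:

```python
import urllib.parse

# GET: parameters are appended to the URL as a query string
params = urllib.parse.urlencode({'wd': 'python', 'pn': 10})
get_url = 'https://www.baidu.com/s?' + params
print(get_url)  # https://www.baidu.com/s?wd=python&pn=10

# POST: the same key=value pairs are sent in the request body as bytes
post_body = urllib.parse.urlencode({'name': 'aaa', 'pass': 'kkk'}).encode('utf-8')
print(post_body)  # b'name=aaa&pass=kkk'
```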


Example 2: GET request, search through Baidu and return the titles of the first 5 pages of results

Key point: analyzing Baidu's URL, the wd parameter is the keyword, and pn is (page number - 1) * 10.
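A small sketch of how wd and pn fit together; the keyword is just an example, and no request is sent here:

```python
import urllib.request

# quote() percent-encodes the keyword so it is safe inside a URL
word = urllib.request.quote('mapo tofu')
print(word)  # mapo%20tofu

# pn counts results, not pages: page n starts at result (n - 1) * 10
for page_no in range(1, 4):
    pn = (page_no - 1) * 10
    print('https://www.baidu.com/s?wd=%s&pn=%d' % (word, pn))
```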


import urllib.request, re

word = input('Please enter keywords: ')
# quote() transcodes Chinese into a format the URL can recognize
word = urllib.request.quote(word)
for i in range(1, 6):
    print("page %d" % i)
    page = (i - 1) * 10
    url = "https://www.baidu.com/s?wd=%s&pn=%s" % (word, str(page))
    pat = '"title":"(.*?)",'
    data = urllib.request.urlopen(url).read().decode("utf-8")
    print(len(data))
    result = re.compile(pat).findall(data)
    for item in result:
        print(item)
Please enter keywords: beanxyz
page 1
172165
Mapo Tofu - 51CTO Technology Blog - Leading IT technology blog
"To play the character biography" beanxyz: 8 years overseas experience, no regret for the original choice - 51CTO...
beanxyz Weibo _ Microblogging
PowerShell crawl Web Form - Mapo Tofu - 51CTO Technology Blog
PowerShell multi-threading usage - Mapo Tofu - 51CTO Technology Blog
XenApp/XenDesktop 7.6 experience: an installation, configure the site and serial number server ...


Example 3: POST request, output the result

Basic idea: transcode the dictionary-format content with urllib.parse.urlencode, generate a POST request object to submit with urllib.request.Request, and then send it with urlopen.


A real site would rarely accept a submission this simple, but to understand the process there is a very simple test form page you can try:



import urllib.request, urllib.parse

url = 'http://www.iqianyue.com/mypost'
dic = {'name': 'aaa', 'pass': 'kkk'}
postdata = urllib.parse.urlencode(dic).encode('utf-8')
r = urllib.request.Request(url, postdata)
data = urllib.request.urlopen(r).read().decode('utf-8')
print(data)


Example 4: Crawler exception handling. Crawler exceptions are mainly URLError or HTTPError; the latter is a subclass of the former, with an extra code attribute.
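That relationship can be checked directly. The URL and message below are made up and no request is sent; the error object is constructed by hand just to inspect it:

```python
import urllib.error

# Build an HTTPError manually; fp=None is accepted by the constructor
err = urllib.error.HTTPError('http://example.com/', 404, 'Not Found', None, None)

print(isinstance(err, urllib.error.URLError))  # True: HTTPError subclasses URLError
print(err.code)    # 404 -- the extra attribute HTTPError adds
print(err.reason)  # Not Found
```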


Tip: in PyCharm, Ctrl+click a class name to view its source code.

class HTTPError(URLError, urllib.response.addinfourl):
    """Raised when HTTP error occurs, but also acts like non-error return"""
    __super_init = urllib.response.addinfourl.__init__

    def __init__(self, url, code, msg, hdrs, fp):
        self.code = code
        self.msg = msg
        self.hdrs = hdrs
        self.fp = fp
        self.filename = url


We can catch URLError directly and then determine from the result whether it is an HTTPError or some other error.


For example, if I go directly to the Nagios page without providing a username and password, access is denied.


import urllib.request, urllib.error

try:
    urllib.request.urlopen('http://sydnagios/nagios')
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
401
Unauthorized


This article is from the "Mapo Tofu" blog; please be sure to keep this source: http://beanxyz.blog.51cto.com/5570417/1983377
