Python Crawler Basics - The urllib Module (1)

Source: Internet
Author: User
Tags: exception handling, urlencode

One very popular application of Python is the web crawler. Crawlers can gather the information we need (and have even been abused as DDoS tools). The popular crawling tools today are scrapy and similar frameworks, but before learning those, you should first understand the urllib module and its basic working principle.


Basic idea of a crawler:

Obtain the target URLs, scan the contents of each page, and use regular-expression matching to extract and download the required content.
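That idea can be sketched in a few lines. Here the HTML is inlined as a stand-in for a page fetched with urllib.request.urlopen, and the div class is a made-up example:

```python
import re

# Stand-in for page content that would normally come from
# urllib.request.urlopen(url).read().decode('utf-8')
html = '<div class="name">Press A</div><div class="name">Press B</div>'

# Regex-match the required content out of the scanned page
names = re.findall(r'<div class="name">(.*?)</div>', html)
print(names)  # ['Press A', 'Press B']
```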


Official help documentation for urllib:

https://docs.python.org/3/library/urllib.request.html




Basic Python syntax and regular expressions are not discussed here. See the example below:


Example 1: Get the publisher information from Douban Read, then write the results to a text file and an Excel file

The key is to parse the HTML tags of the page so that a regular expression can match the required strings.


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: yuan li
import re, urllib.request

print("Reading page ...")
data = urllib.request.urlopen('https://read.douban.com/provider/all').read().decode('utf8')
pat = '<div class="name">(.+?)</div>'
result = re.compile(pat).findall(data)

print("Writing document ...")
fp = open('c:\\temp\\publisher.txt', 'a', encoding='utf8')
for item in result:
    fp.write(item + "\n")
fp.close()

print("Writing to Excel ...")
import xlsxwriter
workbook = xlsxwriter.Workbook('publisher.xlsx')
worksheet = workbook.add_worksheet('publisher')
row = 0
col = 0
worksheet.write(row, col, 'publisher name')
for item in range(len(result)):
    row += 1
    worksheet.write(row, col, result[item])
workbook.close()
Reading page ...
['', 'Beijing Normal University Press', 'Baihuazhou Literary Publishing House', 'Baihua Literary Press', 'Yangtze River Digital', 'Chongqing University Press', 'Oriental Extract', "Reader's Book", 'Electronic Industry Press', 'Contemporary China Press', 'First Finance Weekly', 'Douban Reading the Same Museum', 'Douban', 'Douban Read', 'Douban Reading and Publishing Program', 'Oriental Bar Tower Culture', 'Phoenix One Force', 'Phoenix Yue Shi Wang Wen', 'Phoenix Linkage', 'fiberead', 'Fudan University Press', 'Phoenix Snow Diffuse', 'Nutshell Read', 'Fruit Wheat Culture', 'Guangxi Normal University Press', 'Hangzhou Blue Lion Culture Creative Co., Ltd.', 'After Wave Publishing Company', 'East China Normal University Press', '', 'Han Tang Sunshine', 'Chinese Times', 'Hubei Publishing House', '', 'Chengxuan', 'Dolphin Publishing House', 'Iris Publishing', 'Chemical Industry Press', 'Huazhong University Press', 'Hubei Science and Technology Press', 'Heilongjiang Northern Literary Publishing House', 'Chinese Classics', 'HarperCollins', 'Jushi Culture', 'Jincheng Press', 'Jianshu', 'This Ancient Legend', 'Jiangsu Publishing House', 'Kyushu Fantasy', 'Sci-Fi World', 'Cool Culture', 'Ideal Country', 'Lijiang Press', 'Mill Iron Number Au', 'Ningbo Press', 'Southern Character Weekly', 'One A', 'Editorial Culture', 'Tsinghua University Press', 'Qingdao Press', 'People Magazine', "People's Literature Press", "People's Post and Telecommunications Press", 'Confucianism Xin Shin', "People's Oriental Publishing Media", "People's Literature Magazine", 'Shanghai Nine-Long Reading People', 'Century Wen Jing', 'Sichuan Digital Publishing Media Co., Ltd.', 'Shanghai Translation Press', 'Times Chinese', 'Shanghai Ya Culture', 'Century Wenrui', 'Times Chinese', 'Commercial Press', 'Living, Reading and Learning Joint Publishing', 'Shanghai Academy of Social Sciences Publishing House', 'Social Sciences Literature Publishing House', 'Shanxi Spring Electronic Audio-Visual Publishing House', 'Times Number', "Shaanxi People's Publishing Beijing Branch", '"Bookstore" Magazine', 'World Map Beijing', 'Sichuan Literary Press', 'Shanghai Literary Press', "Shanghai People's Publishing House", 'Shanghai Jiaotong University Press', '', "Shanghai People's Art Publishing House", 'Turing Community', 'Trajectory', 'Wuhan University Press Beijing Branch', 'Million Have Books', 'I and Douban', 'New Classic Culture ebook', 'Nova Press', 'Xinhua Pioneer Culture Media', 'Snowball', 'Suspense World', 'Modern Press', 'Southwest University of Finance and Economics Press', 'Xinhua Press', 'Xinhua Pioneer Publishing Technology', 'Translation Lin Publishing House', 'Translation Words • Things Library', 'Translations • Gutenberg Plan', 'Yue Ji', 'Sunshine Blog', 'Yue Read Famous', 'Yanshan Press', 'Read Text Group Chinese World', 'CITIC Press', 'Renmin University Press', 'For Chinese', 'China Light Industry Publishing House', 'Purple Book', 'Zhejiang Edition Digital Media', 'Central Compiling Press', 'Know', 'China National Geographic Book Department', 'Zhejiang Photography Press', 'China Economic Press', 'China Youth Press', 'China Democratic Legal System Press', 'Communication University Publishing House', 'Chinese Language Publishing House', 'Zhejiang University Press', 'Cham Lu Culture', 'Zhejiang Literary Press', 'Zhonghua Book Company']
Writing document ...
Writing to Excel ...

We know that in the HTTP protocol, the two most common methods are GET and POST. GET generally passes parameters through the URL, usually when refreshing a page to fetch information; POST generally submits data through a form or Ajax. Here are two simple examples.
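The difference can be sketched with urllib.parse; the parameter names below are just illustrative, and no request is actually sent:

```python
import urllib.parse

# GET: parameters are appended to the URL as a query string
params = urllib.parse.urlencode({'wd': 'python', 'pn': 10})
get_url = 'https://www.baidu.com/s?' + params
print(get_url)  # https://www.baidu.com/s?wd=python&pn=10

# POST: the same key=value pairs are sent in the request body as bytes
post_body = urllib.parse.urlencode({'name': 'aaa', 'pass': 'kkk'}).encode('utf-8')
print(post_body)  # b'name=aaa&pass=kkk'
```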


Example 2: GET request, search through Baidu and return the titles of the first 5 pages of results

Key point: analyzing Baidu's URL, the wd parameter is the keyword, and pn is (page number - 1) * 10.
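A small sketch of how wd and pn fit together; the keyword is just an example, and no request is sent here:

```python
import urllib.request

# quote() percent-encodes the keyword so it is safe inside a URL
word = urllib.request.quote('mapo tofu')
print(word)  # mapo%20tofu

# pn counts results, not pages: page n starts at result (n - 1) * 10
for page_no in range(1, 4):
    pn = (page_no - 1) * 10
    print('https://www.baidu.com/s?wd=%s&pn=%d' % (word, pn))
```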


import urllib.request, re

word = input('Please enter keywords: ')
# quote() transcodes Chinese into a format the URL can recognize
word = urllib.request.quote(word)
for i in range(1, 6):
    print("page %d" % i)
    page = (i - 1) * 10
    url = "https://www.baidu.com/s?wd=%s&pn=%s" % (word, str(page))
    pat = '"title":"(.*?)",'
    data = urllib.request.urlopen(url).read().decode("utf-8")
    print(len(data))
    result = re.compile(pat).findall(data)
    for item in result:
        print(item)
Please enter keywords: beanxyz
page 1
172165
Mapo Tofu - 51CTO Technology Blog - Leading IT technology blog
"To play the character biography" beanxyz: 8 years overseas experience, no regret for the original choice - 51CTO...
beanxyz Weibo _ Microblogging
PowerShell crawl Web Form - Mapo Tofu - 51CTO Technology Blog
PowerShell multi-threading usage - Mapo Tofu - 51CTO Technology Blog
XenApp/XenDesktop 7.6 experience: an installation, configure the site and serial number server ...


Example 3: POST request, output the result

Basic idea: transcode the dictionary-format content with urllib.parse.urlencode, generate a POST request object to submit with urllib.request.Request, and then send it with urlopen.


A real site would rarely accept a submission this simple, but to understand the process there is a very simple test form page you can try:



import urllib.request, urllib.parse

url = 'http://www.iqianyue.com/mypost'
dic = {'name': 'aaa', 'pass': 'kkk'}
postdata = urllib.parse.urlencode(dic).encode('utf-8')
r = urllib.request.Request(url, postdata)
data = urllib.request.urlopen(r).read().decode('utf-8')
print(data)


Example 4: Crawler exception handling. Crawler exceptions are mainly URLError or HTTPError; the latter is a subclass of the former, with an extra code attribute.
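That relationship can be checked directly. The URL and message below are made up and no request is sent; the error object is constructed by hand just to inspect it:

```python
import urllib.error

# Build an HTTPError manually; fp=None is accepted by the constructor
err = urllib.error.HTTPError('http://example.com/', 404, 'Not Found', None, None)

print(isinstance(err, urllib.error.URLError))  # True: HTTPError subclasses URLError
print(err.code)    # 404 -- the extra attribute HTTPError adds
print(err.reason)  # Not Found
```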


Tip: in PyCharm, Ctrl+click a class name to view its source code.

class HTTPError(URLError, urllib.response.addinfourl):
    """Raised when HTTP error occurs, but also acts like non-error return"""
    __super_init = urllib.response.addinfourl.__init__

    def __init__(self, url, code, msg, hdrs, fp):
        self.code = code
        self.msg = msg
        self.hdrs = hdrs
        self.fp = fp
        self.filename = url


We can catch URLError directly and then determine from the result whether it is an HTTPError or some other error.


For example, if I go directly to the Nagios page without providing a username and password, access is denied.


import urllib.request, urllib.error

try:
    urllib.request.urlopen('http://sydnagios/nagios')
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
401
Unauthorized


This article is from the "Mapo Tofu" blog; please be sure to keep this source: http://beanxyz.blog.51cto.com/5570417/1983377
