Writing a web crawler easily with Python

Source: Internet
Author: User


Not long ago, DotNet Open Source Base Camp showed .NET programmers how to capture web page data with C# + HtmlAgilityPack + XPath. His article demonstrates the strengths of HtmlAgilityPack and some tips for using it; if you are not familiar with it, his blog is well worth a read. I am a .NET programmer too, and I am learning Python on my own purely out of interest. Before I started, I had heard that Python is very convenient for web crawling and natural language processing, so I tried it, and I was very satisfied with the result. This post summarizes what I have learned so far.
1. Preparations:
A craftsman who wants to do good work must first sharpen his tools, so before writing any code we need a development environment that suits us. The environment I set up is:
Operating System: Ubuntu 14.04 LTS
Python version: 2.7.6
Code Editor: Sublime Text 3.0


This crawler picks up the requirement from DotNet Open Source Base Camp's article, so I will not explain the background again here. We only capture the weather for all major cities in one province from www.tianqihoubao.com. This example uses Hubei province.
2. The crawler in practice:
2.1. Get the city list:

First, we fetch the page that lists all cities in Hubei province, and then parse it. Address: http://www.tianqihoubao.com/weather/province.aspx?id=420000
Viewing the page source shows that every city link sits inside a <td> cell, so a regular expression can extract them. The pattern is cut off after <td in this copy of the post; judging by the captured results, the original also captured each link's href and title:

    def ShowCity():
        html = requests.get("http://www.tianqihoubao.com/weather/province.aspx?id=420000")
        # The rest of this pattern is missing from the source text.
        citys = re.findall('<td>', html.text, re.S)
        for city in citys:
            print city

The captured results are as follows:
    top/anlu.html" title="Anlu historical weather query
    top/badong.html" title="Badong historical weather query
    top/baokang.html" title="Baokang historical weather query
    top/caidian.html" title="Caidian historical weather query
    top/changyang.html" title="Changyang historical weather query
    top/chibi.html" title="Chibi historical weather query
    top/chongyang.html" title="Chongyang historical weather query
    top/dawu.html" title="Dawu historical weather query
    top/daye.html" title="Daye historical weather query
    top/danjiangkou.html" title="Danjiangkou historical weather query
    top/dangyang.html" title="Dangyang historical weather query
    top/ezhou.html" title="Ezhou historical weather query
    top/enshi.html" title="Enshi historical weather query
    top/fangxian.html" title="Fangxian historical weather query
    top/gongan.html" title="Gong'an historical weather query
    top/gucheng.html" title="Gucheng historical weather query
    top/guangshui.html" title="Guangshui historical weather query
    top/hanchuan.html" title="Hanchuan historical weather query
    top/hanyang.html" title="Hanyang historical weather query
    top/hefeng.html" title="Hefeng historical weather query
    top/hongan.html" title="Hong'an historical weather query
    top/honghu.html" title="Honghu historical weather query
    top/huangpi.html" title="Huangpi historical weather query
    top/huanggang.html" title="Huanggang historical weather query
    top/huangmei.html" title="Huangmei historical weather query
    top/huangshi.html" title="Huangshi historical weather query
    top/jiayu.html" title="Jiayu historical weather query
    top/jianli.html" title="Jianli historical weather query
    top/jianshi.html" title="Jianshi historical weather query
    top/jiangxia.html" title="Jiangxia historical weather query
    top/jingshan.html" title="Jingshan historical weather query
    top/jingmen.html" title="Jingmen historical weather query
    top/jingzhou.html" title="Jingzhou historical weather query
    top/laifeng.html" title="Laifeng historical weather query
    top/laohekou.html" title="Laohekou historical weather query
    top/lichuan.html" title="Lichuan historical weather query
    top/lvtian.html" title="Luotian historical weather query
    top/macheng.html" title="Macheng historical weather query
    top/nanzhang.html" title="Nanzhang historical weather query
    top/qichun.html" title="Qichun historical weather query
    top/qianjiang.html" title="Qianjiang historical weather query
    top/sanxia.html" title="Three Gorges historical weather query
    top/shennongjia.html" title="Shennongjia historical weather query
    top/shiyan.html" title="Shiyan historical weather query
    top/shishou.html" title="Shishou historical weather query
    top/songzi.html" title="Songzi historical weather query
    top/suizhou.html" title="Suizhou historical weather query
    top/tianmen.html" title="Tianmen historical weather query
    top/hbtongcheng.html" title="Tongcheng historical weather query
    top/tongshan.html" title="Tongshan historical weather query
    top/wufeng.html" title="Wufeng historical weather query
    top/wuchang.html" title="Wuchang historical weather query
    top/wuhan.html" title="Wuhan historical weather query
    top/wuxue.html" title="Wuxue historical weather query
    top/hbxishui.html" title="Xishui historical weather query
    top/xiantao.html" title="Xiantao historical weather query
    top/xianfeng.html" title="Xianfeng historical weather query
    top/xianning.html" title="Xianning historical weather query
    top/xiangyang.html" title="Xiangyang historical weather query
    top/xiaogan.html" title="Xiaogan historical weather query
    top/hbxinzhou.html" title="Xinzhou historical weather query
    top/xingshan.html" title="Xingshan historical weather query
    top/xuanen.html" title="Xuan'en historical weather query
    top/hbyangxin.html" title="Yangxin historical weather query
    top/yiling.html" title="Yiling historical weather query
    top/yichang.html" title="Yichang historical weather query
    top/yicheng.html" title="Yicheng historical weather query
    top/yidu.html" title="Yidu historical weather query
    top/yingcheng.html" title="Yingcheng historical weather query
    top/hbyingshan.html" title="Yingshan historical weather query
    top/yuanan.html" title="Yuan'an historical weather query
    top/yunmeng.html" title="Yunmeng historical weather query
    top/yunxi.html" title="Yunxi historical weather query
    top/hbyunxian.html" title="Yunxian historical weather query
    top/zaoyang.html" title="Zaoyang historical weather query
    top/zhijiang.html" title="Zhijiang historical weather query
    top/zhongxiang.html" title="Zhongxiang historical weather query
    top/zhushan.html" title="Zhushan historical weather query
    top/zhuxi.html" title="Zhuxi historical weather query
    top/zigui.html" title="Zigui historical weather query
    [Finished in 15.4s]
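Since the regular expression in the listing above is incomplete, here is a minimal sketch of how such an extraction could work. The sample markup and the pattern CITY_PATTERN are assumptions made up for illustration (they are not copied from the site or from the original post), and the helper names extract_cities and split_city are hypothetical. The pattern is chosen so that each match has the same shape as the captured results: a relative link, then `" title="`, then the title.

```python
from __future__ import print_function
import re

# Hypothetical sample of the province page markup (an assumption, not the real page).
SAMPLE_HTML = '''
<table>
<tr>
<td style="height: 22px" align="center"><a href="top/anlu.html" title="Anlu historical weather query">Anlu</a></td>
<td style="height: 22px" align="center"><a href="top/badong.html" title="Badong historical weather query">Badong</a></td>
</tr>
</table>
'''

# Assumed pattern: capture everything from the href value up to the "> that
# closes the <a> tag, so each match still contains the '" title="' separator.
CITY_PATTERN = r'<td[^>]*>\s*<a href="(.*?)">'


def extract_cities(html):
    """Return the raw 'link" title="name' string for every city cell."""
    return re.findall(CITY_PATTERN, html, re.S)


def split_city(raw):
    """Split one raw match into (relative_link, title)."""
    link, title = raw.split('" title="')
    return link, title


for raw in extract_cities(SAMPLE_HTML):
    print(split_city(raw))
```

If the real page uses different attributes or attribute order, the pattern would need adjusting; an HTML parser is usually more robust than a regular expression for this kind of work.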
2.2. Get all the weather information for each city:
Next, we use each captured city link to fetch the weather for that city. Here we wrap the work in a function that prints all the weather records of the target city:

    def ShowWeather(city):
        # city looks like: top/anlu.html" title="Anlu historical weather query
        res = str(city).split('" title="')
        print res[1], '(daytime --> nighttime)'
        html = requests.get("http://www.tianqihoubao.com/weather/{0}".format(res[0]))
        # Grab the contents of the weather table, then split it into rows
        weather = re.search('<table width="100%" border="0" class="b" cellpadding="1" cellspacing="1">(.*?)</table>', html.text, re.S).group(1)
        res = re.findall('<tr>(.*?)</tr>', weather, re.S)
        for x in res[2:]:
            # Text between tags; blank fragments are skipped
            w = re.findall('>(.*?)<', x, re.S)
            for y in w[1:]:
                if len(y.strip()) <= 0:
                    pass
                else:
                    print y
            print '--' * 40

In this way, we get the weather records for each city.
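The table parsing can be tried without touching the network. The sketch below runs the same three steps (find the table, split it into rows, pull the text out of each cell) against a made-up page. SAMPLE_PAGE, TABLE_PATTERN and parse_weather_rows are hypothetical names, and the markup is an assumption; the real page may have more header rows, which is why the number of rows to skip is a parameter here.

```python
from __future__ import print_function
import re

# Hypothetical markup for one city page (an assumption for illustration only).
SAMPLE_PAGE = '''
<table width="100%" border="0" class="b" cellpadding="1" cellspacing="1">
<tr><td>Date</td><td>Weather</td><td>Temperature</td></tr>
<tr><td><a href="#">2015-01-01</a></td><td>Sunny / Cloudy</td><td>8C / 1C</td></tr>
<tr><td><a href="#">2015-01-02</a></td><td>Cloudy / Light rain</td><td>6C / 2C</td></tr>
</table>
'''

# Assumed pattern: any <table> tag carrying class="b", non-greedy up to </table>.
TABLE_PATTERN = r'<table[^>]*class="b"[^>]*>(.*?)</table>'


def parse_weather_rows(page_html, skip_rows=1):
    """Return one list of non-empty cell texts per data row of the weather table."""
    table = re.search(TABLE_PATTERN, page_html, re.S)
    if table is None:
        return []  # no weather table on this page
    rows = re.findall(r'<tr>(.*?)</tr>', table.group(1), re.S)
    parsed = []
    for row in rows[skip_rows:]:
        # '>(.*?)<' yields the text between tags, including empty fragments.
        cells = [text.strip() for text in re.findall(r'>(.*?)<', row, re.S)]
        parsed.append([cell for cell in cells if cell])
    return parsed


for record in parse_weather_rows(SAMPLE_PAGE):
    print(record)
```

Returning an empty list when the table is missing avoids the AttributeError that calling .group(1) on a failed re.search would raise.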

Complete code:

    # coding: UTF-8
    import re
    import requests
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    def ShowWeather(city):
        res = str(city).split('" title="')
        print res[1], '(daytime --> nighttime)'
        html = requests.get("http://www.tianqihoubao.com/weather/{0}".format(res[0]))
        weather = re.search('<table width="100%" border="0" class="b" cellpadding="1" cellspacing="1">(.*?)</table>', html.text, re.S).group(1)
        res = re.findall('<tr>(.*?)</tr>', weather, re.S)
        for x in res[2:]:
            w = re.findall('>(.*?)<', x, re.S)
            for y in w[1:]:
                if len(y.strip()) <= 0:
                    pass
                else:
                    print y
            print '--' * 40
        print '\n', ' ' * 40

    def ShowCity():
        html = requests.get("http://www.tianqihoubao.com/weather/province.aspx?id=420000")
        citys = re.findall('<td>', html.text, re.S)  # the rest of this pattern is missing from the source text
        for city in citys:
            ShowWeather(city)

    def main():
        ShowCity()

    if __name__ == '__main__':
        main()

Yes, you read that right. In just 34 lines of code you can crawl a month of weather data for every major city in Hubei province. Amazing, isn't it? But don't celebrate too early: everything has its pros and cons. Here is how long it took to run: [Finished in 371.8s]
3. Knowledge summary:
3.1. Encoding problems:
On Ubuntu, because of encoding issues, we need a comment line at the top of the source file that tells the Python interpreter which encoding the file uses. We also need to set the default encoding; otherwise Chinese characters cannot be handled when the script runs from Sublime Text, and an error is reported: "UnicodeEncodeError: 'ascii' codec can't encode characters in position ..."

    # -*- coding: utf8 -*-
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
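To see where that error comes from, here is a small illustration, not the approach used in this post: the ASCII codec simply has no bytes for Chinese characters, while UTF-8 does. The helper name to_bytes is hypothetical, and the city name is written with escape codes so the snippet does not depend on the file's own encoding.

```python
from __future__ import print_function

# u'\u6b66\u6c49' is the two-character Chinese name of Wuhan.
city = u'\u6b66\u6c49'


def to_bytes(text, encoding):
    """Encode text; return None instead of raising if the codec cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_bytes = to_bytes(city, 'ascii')  # None: ASCII cannot represent these characters
utf8_bytes = to_bytes(city, 'utf-8')   # six bytes, three per character

print(ascii_bytes)
print(len(utf8_bytes))
```

Encoding explicitly at the point of output, as here, is one alternative to changing the interpreter-wide default with sys.setdefaultencoding.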
3.2. Regular expressions:
Import the regular expression library: import re
Match any character: .
Match the preceding character zero or more times: *
Match the preceding character zero times or once: ?
Greedy matching: .*
Non-greedy matching: .*?
Match a number: (\d+)
Common functions:
re.findall(pattern, string)
re.search(pattern, string)
re.sub(pattern, repl, string)
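A short example makes the difference between greedy and non-greedy matching concrete and shows the three functions side by side. The sample string is made up for illustration.

```python
from __future__ import print_function
import re

html = '<b>Wuhan</b> high 8, low 1 <b>Yichang</b> high 10, low 3'

# Greedy: .* runs to the LAST </b>, so everything in between is one match.
greedy = re.findall(r'<b>(.*)</b>', html)

# Non-greedy: .*? stops at the FIRST </b>, giving one match per tag pair.
lazy = re.findall(r'<b>(.*?)</b>', html)

# (\d+) captures each run of digits.
numbers = re.findall(r'(\d+)', html)

# re.search returns only the first match (or None if there is none).
first = re.search(r'<b>(.*?)</b>', html).group(1)

# re.sub replaces every match; here it strips the <b> and </b> tags.
stripped = re.sub(r'</?b>', '', html)

print(greedy)
print(lazy)
print(numbers)
print(first)
print(stripped)
```

This is why the crawler code uses (.*?) everywhere: a greedy (.*) would swallow several table rows or cells in a single match.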

Finally, if you run the complete code I posted, you may hit the same bottleneck I did: it is not fast enough (especially on a low-spec machine like mine). The script took 371.8 s on my machine; I ran it many times, and every run took more than 350 s. So if running speed is not critical for your program, Python may suit you well. After all, it lets you do more with less code.
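One possible contributor to the long run time, and this is an assumption rather than something measured here, is that the script requests one city page at a time and waits for each response before starting the next. If that is the case, fetching several pages at once with a small thread pool might help. The sketch below uses a stand-in function, fake_fetch, instead of a real HTTP call, so it shows only the pattern; fetch_all and the worker count are hypothetical choices.

```python
from __future__ import print_function
import time
from multiprocessing.pool import ThreadPool


def fake_fetch(link):
    """Hypothetical stand-in for an HTTP request: wait briefly, then return a string."""
    time.sleep(0.05)
    return 'page for ' + link


def fetch_all(links, fetch=fake_fetch, workers=4):
    """Fetch every link with a small pool of threads; results keep the input order."""
    pool = ThreadPool(workers)
    try:
        return pool.map(fetch, links)
    finally:
        pool.close()
        pool.join()


links = ['top/anlu.html', 'top/badong.html', 'top/baokang.html', 'top/caidian.html']
start = time.time()
pages = fetch_all(links)
print(pages[0], 'fetched with %d others in %.2fs' % (len(pages) - 1, time.time() - start))
```

In a real crawler the fetch argument would wrap the actual request, and the pool should stay small so the site is not flooded with simultaneous requests.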
