Contents: Preface · A class browser for crawling data · Finding the data we need · Extracting data using the DOM · Using regular expressions to parse data · Will Baotu Spring stop gushing in 2018? · URL analysis · Web page download · Data parsing · Crawling all the data · Data storage and retrieval
Preface
In general, web data crawling refers to downloading data over the HTTP/HTTPS/FTP protocols. In plain language, it means getting the data we need from a particular web page. Notice that browsing the web can be roughly divided into two steps: enter the URL in the browser's address bar, then scan the opened page with your eyes to find the information you need.
In fact, the process of crawling data from the web is the same as the process of browsing it. It also consists of these two steps, only the tools differ: use a "class browser" component to download the source of the page at a given web address (URL), then use technical means to find the data we need in the downloaded page source.

A class browser for crawling data
Python has two built-in modules, urllib and urllib2, that can serve as a "browser" for crawling data; pycurl is also a good choice when requirements get more complex.
We know that the HTTP protocol defines 8 methods, and a real browser supports at least the two ways of requesting a web page: GET and POST. Compared with urllib2, the urllib module accepts only a string parameter, cannot specify the request method, and cannot set request headers. Therefore urllib2 is the preferred "class browser" for crawling data.
This is the simplest application of the urllib2 module:
import urllib2

response = urllib2.urlopen('http://xufive.sdysit.com/')
if response.code == 200:
    html = response.read()  # html is the source of the web page we requested
    print html
else:
    print 'response error.'
urllib2.urlopen can accept a urllib2.Request object in addition to a string argument. This means we can flexibly set the headers of the request.
The urllib2.urlopen prototype is as follows:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
The urllib2.Request constructor prototype is as follows:
urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
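For instance, a Request object can carry a custom header before being sent. Below is a minimal sketch; the try/except import is my own compatibility shim (not from the original) so the snippet also runs under Python 3, where urllib2 became urllib.request:

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (shim for illustration)

# build a request carrying a custom User-Agent header (nothing is sent yet)
req = urllib2.Request('http://xufive.sdysit.com/',
                      headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
print(req.get_full_url())  # → http://xufive.sdysit.com/
```

Passing the prepared req to urllib2.urlopen then issues the request with those headers.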
find the data we need
Target data hide in the source of the downloaded page in all kinds of forms, which requires us to first analyze the page source, figure out how the target data are distributed, and then choose an appropriate way to extract them.

Extracting data using the DOM
Beautiful Soup, a third-party Python library, can help us find the data we need in a page's source. It can extract data from HTML or XML documents, with simple facilities for traversing, searching, and modifying the parse tree. Installation is very simple (also install a parser if you do not have one):
pip install beautifulsoup4
pip install lxml
The following example shows how to find the information we need on http://xufive.sdysit.com.
import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://xufive.sdysit.com/')
if response.code == 200:
    html = response.read()
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div')              # find the first div node
    print div                           # view all of the div tag's information
    print div.text                      # print the div tag's text content
    print div.attrs                     # print all attributes of the div tag
    print div['style']                  # print the div tag's style
    divs = soup.find_all('div')         # find all div nodes
    for div in divs:                    # traverse all div nodes
        print div
    for tr in divs[1].find_all('tr'):   # traverse all tr nodes of the second div
        for td in tr.find_all('td'):
            print td.text
else:
    print 'Response error.'
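If the site is unreachable, the same Beautiful Soup calls can be tried on an inline HTML fragment. A self-contained sketch follows; the HTML string is made up for illustration, and the stdlib 'html.parser' is used so no extra parser is required:

```python
from bs4 import BeautifulSoup

html = '''
<div style="color:red"><p>hello</p></div>
<div><table><tr><td>a</td><td>b</td></tr></table></div>
'''
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div')           # the first div node
print(div.text.strip())          # → hello
print(div['style'])              # → color:red

divs = soup.find_all('div')      # all div nodes
for tr in divs[1].find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    print(cells)                 # → ['a', 'b']
```

The find/find_all calls behave the same way on a downloaded page source.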
using regular expressions to parse data
Sometimes the target data are buried in a large block of text and cannot be extracted directly via HTML tags, or many tags are identical and the target data account for only a small part of them. In such cases we usually use regular expressions. The following code extracts the dates directly (hint: when handling Chinese, the HTML source and the match pattern must both use UTF-8 encoding, or the code will raise an error):
# -*- coding: utf-8 -*-
import re

html = u"""
<tr><td>2011年10月15日</td></tr>
<tr><td>2012年5月9日</td></tr>
<tr><td>2015年11月23日</td></tr>
<tr><td>2015年7月8日</td></tr>
<tr><td>2016年12月5日</td></tr>
<tr><td>2017年3月22日</td></tr>
<tr><td>2017年6月17日</td></tr>
"""
pattern = re.compile(r'<td>(\d{4})年(\d{1,2})月(\d{1,2})日</td>')
data = pattern.findall(html.encode('utf-8'))
print data
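The tuples returned by findall can then be turned into real date objects for later plotting. A small follow-on sketch (the matches list here just mirrors the findall output format; variable names are illustrative):

```python
from datetime import date

# tuples as produced by pattern.findall, e.g. [('2011', '10', '15'), ...]
matches = [('2011', '10', '15'), ('2012', '5', '9')]
dates = [date(int(y), int(m), int(d)) for y, m, d in matches]
print(dates[0].isoformat())  # → 2011-10-15
```

datetime.date objects sort and subtract naturally, which helps when checking for gaps later.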
Will Baotu Spring stop gushing in 2018?
The Jinan Urban and Rural Water Bureau website (http://jnwater.gov.cn) publishes the groundwater levels of Baotu Spring and Black Tiger Spring every day, and earlier water level data can also be queried there. The goals of this tutorial are: download all the water level data from this website; plot the water level variation curve for a specified period; study the regularity of Jinan's groundwater level changes; and make a preliminary judgment on whether the springs will stop gushing in 2018.

URL Analysis
After a few simple operations we find that, since May 2, 2012, the Jinan Urban and Rural Water Bureau website has published the groundwater level data of Baotu Spring and Black Tiger Spring every day. All the data are paginated, 20 days per page, with page numbers starting from 1. The URL of page 1 is:
http://www.jnwater.gov.cn/list.php?catid=101&page=1
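Since later pages differ only in the page parameter, the full list of page URLs can be generated with simple string formatting. A small sketch (the page count of 3 here is arbitrary):

```python
# build the URLs of the first few listing pages (page numbers start at 1)
base = 'http://www.jnwater.gov.cn/list.php?catid=101&page=%d'
urls = [base % page for page in range(1, 4)]
print(urls[0])  # → http://www.jnwater.gov.cn/list.php?catid=101&page=1
```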
Web Downloads
If you use the previous code to download this page, you will get an error. This is somewhat discouraging: why? It turns out the Jinan Urban and Rural Water Bureau website plays a small trick and refuses non-browser access. What is the difference between accessing the site with our code and with a browser? The answer is the User-Agent field in the request header. When a browser accesses a website, it declares in the request header what type of browser it is and what operating system it runs on. We only need to add a User-Agent to the request header to fool the server. Here is a function that crawls the data of a specified page:
import urllib2

def get_data_by_page(page):
    url = 'http://www.jnwater.gov.cn/list.php?catid=101&page=%d' % page
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    response = urllib2.urlopen(req)
    if response.code == 200:
        return response.read()
    else:
        return False
Data Parsing
Looking at the downloaded source code, it is not hard to find that the groundwater level data of Baotu Spring and Black Tiger Spring look like this in the HTML:
<tr>
    <td width="25%" align="center" bgcolor="#DAE9F8" class="S14">日期</td>
    <td width="15%" align="center" bgcolor="#DAE9F8"><span class="S14">趵突泉水位</span></td>
    <td width="14%" align="center" bgcolor="#DAE9F8"><span class="S14">黑虎泉水位</span></td>
</tr>
<tr>
    <td align="center" bgcolor="#DAF8E8" class="S14">2017年10月2日</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.20米</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.15米</td>
</tr>
<tr>
    <td align="center" bgcolor="#DAF8E8" class="S14">2017年10月1日</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.16米</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.10米</td>
</tr>
Combined with the actual page display, it is easy to identify the extraction conditions for the water level data: the tr tag contains exactly 3 td tags, and the td background color (bgcolor) is green (#DAF8E8). Let's refine our download function so that it parses out the data we need instead of returning the raw page source.
# -*- coding: utf-8 -*-
import re
import urllib2
from bs4 import BeautifulSoup

def get_data_by_page(page):
    data = list()
    p_date = re.compile(r'(\d{4})\D+(\d{1,2})\D+(\d{1,2})')
    url = 'http://www.jnwater.gov.cn/list.php?catid=101&page=%d' % page
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    response = urllib2.urlopen(req)
    if response.code == 200:
        html = response.read()
        soup = BeautifulSoup(html, 'lxml')
        for tr in soup.find_all('tr'):
            tds = tr.find_all('td')
            if len(tds) == 3 and 'bgcolor' in tds[0].attrs and tds[0]['bgcolor'] == '#DAF8E8':
                year, month, day = p_date.findall(tds[0].text.encode('utf8'))[0]
                baotu = float(tds[1].text[:-1])    # strip the trailing unit character
                heihu = float(tds[2].text[:-1])
                data.append(['%s-%02d-%02d' % (year, int(month), int(day)), baotu, heihu])
    return data
Crawl All data
Based on the number of pages shown on the site, call the get_data_by_page function in a loop and merge the returned results to obtain all the water level data. This sounds simple, but is actually hard to do well, because:
- not every page request will succeed
- the data do not always match our expectations, so parsing may raise exceptions
- some dates are wrong and some data are missing
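The aggregation loop with simple retry handling might be sketched like this. It is only an outline: get_data_by_page is the function above (passed in as a parameter so the sketch stays self-contained), and page_count, retries, and delay are illustrative parameters not taken from the original:

```python
import time

def crawl_all(get_data_by_page, page_count, retries=3, delay=1.0):
    """Call get_data_by_page for each page, retrying failed requests."""
    all_data = []
    for page in range(1, page_count + 1):
        for attempt in range(retries):
            try:
                rows = get_data_by_page(page)
                if rows:                 # a non-empty result means success
                    all_data.extend(rows)
                    break
            except Exception:
                pass                     # network or parse error: retry below
            time.sleep(delay)            # wait a moment before retrying
    return all_data
```

A page that still fails after all retries is simply skipped, which is one more source of the gaps discussed next.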
The currently known errors are:
- September 9, 2015 is wrongly written as September 9, 2014
- June 3, 2014 is wrongly written as June 3, 2016
- January 31, 2013 is wrongly written as January 31, 2012
- the data for October 30, 2012 are missing
- the data for March 11, 2014 are missing
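One way to repair such gaps is to walk the expected date range day by day with datetime.timedelta and fill each missing day with the average of its two neighbors. A minimal sketch under the assumption that gaps are isolated single days (the function name and data layout are illustrative, not from the original):

```python
from datetime import date, timedelta

def fill_gaps(levels):
    """levels: dict mapping datetime.date -> water level (float),
    possibly with isolated single-day gaps. Returns a new dict where
    each missing day is the average of the previous and next day."""
    days = sorted(levels)
    filled = dict(levels)
    d = days[0]
    while d < days[-1]:
        if d not in filled:
            prev_day = d - timedelta(days=1)
            next_day = d + timedelta(days=1)
            # both neighbours exist when gaps are isolated
            filled[d] = (filled[prev_day] + levels[next_day]) / 2.0
        d += timedelta(days=1)
    return filled
```

The mistyped dates listed above would be corrected before this step, by rewriting them while stepping through the series.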
The best approach is to start from the date of the latest data and use the timedelta object of the datetime module to step through the dates one day at a time: correct a wrong date when one is encountered, and fill missing data with the average of the two adjacent days.

Data storage and retrieval
Drawing the water level variation curve
To be continued.

Data analysis
To be continued.