Contents: Preface · A class browser for crawling data · Finding the data we need · Extracting data using the DOM · Using regular expressions to parse data · Will Baotu Spring stop gushing in 2018? · URL analysis · Web page download · Data parsing · Crawling all the data · Data storage and retrieval
Preface
In general, web data crawling refers to downloading data over the HTTP/HTTPS/FTP protocols. In plain language, it means getting the data we need from a particular web page. Notice that browsing the web can be roughly divided into two steps: enter the URL in the browser's address bar, then scan the opened page with your eyes to find the information you need.
In fact, the process of crawling data from the web is the same as the process of browsing it. It also consists of these two steps, only the tools differ: use a "class browser" component to download the source of the page at a given web address (URL), then use technical means to find the data we need in the downloaded page source.

A class browser for crawling data
Python has two built-in modules, urllib and urllib2, that can serve as a "browser" for crawling data; pycurl is also a good choice when requirements get more complex.
We know that the HTTP protocol defines 8 methods, and a real browser supports at least the two ways of requesting a web page: GET and POST. Compared with urllib2, the urllib module accepts only a string parameter, cannot specify the request method, and cannot set request headers. Therefore urllib2 is the preferred "class browser" for crawling data.
This is the simplest application of the urllib2 module:
import urllib2

response = urllib2.urlopen('http://xufive.sdysit.com/')
if response.code == 200:
    html = response.read()  # html is the source of the web page we requested
    print html
else:
    print 'response error.'
urllib2.urlopen can accept a urllib2.Request object in addition to a string argument. This means we can flexibly set the headers of the request.
The urllib2.urlopen prototype is as follows:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
The urllib2.Request constructor prototype is as follows:
urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
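For instance, a Request object can carry a custom header before being sent. Below is a minimal sketch; the try/except import is my own compatibility shim (not from the original) so the snippet also runs under Python 3, where urllib2 became urllib.request:

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (shim for illustration)

# build a request carrying a custom User-Agent header (nothing is sent yet)
req = urllib2.Request('http://xufive.sdysit.com/',
                      headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
print(req.get_full_url())  # → http://xufive.sdysit.com/
```

Passing the prepared req to urllib2.urlopen then issues the request with those headers.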
find the data we need
Target data hide in the source of the downloaded page in all kinds of forms, which requires us to first analyze the page source, figure out how the target data are distributed, and then choose an appropriate way to extract them.

Extracting data using the DOM
Beautiful Soup, a third-party Python library, can help us find the data we need in a page's source. It can extract data from HTML or XML documents, with simple facilities for traversing, searching, and modifying the parse tree. Installation is very simple (also install a parser if you do not have one):
pip install beautifulsoup4
pip install lxml
The following example shows how to find the information we need on http://xufive.sdysit.com.
import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://xufive.sdysit.com/')
if response.code == 200:
    html = response.read()
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div')              # find the first div node
    print div                           # view all of the div tag's information
    print div.text                      # print the div tag's text content
    print div.attrs                     # print all attributes of the div tag
    print div['style']                  # print the div tag's style
    divs = soup.find_all('div')         # find all div nodes
    for div in divs:                    # traverse all div nodes
        print div
    for tr in divs[1].find_all('tr'):   # traverse all tr nodes of the second div
        for td in tr.find_all('td'):
            print td.text
else:
    print 'Response error.'
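If the site is unreachable, the same Beautiful Soup calls can be tried on an inline HTML fragment. A self-contained sketch follows; the HTML string is made up for illustration, and the stdlib 'html.parser' is used so no extra parser is required:

```python
from bs4 import BeautifulSoup

html = '''
<div style="color:red"><p>hello</p></div>
<div><table><tr><td>a</td><td>b</td></tr></table></div>
'''
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div')           # the first div node
print(div.text.strip())          # → hello
print(div['style'])              # → color:red

divs = soup.find_all('div')      # all div nodes
for tr in divs[1].find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    print(cells)                 # → ['a', 'b']
```

The find/find_all calls behave the same way on a downloaded page source.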
using regular expressions to parse data
Sometimes the target data are buried in a large block of text and cannot be extracted directly via HTML tags, or many tags are identical and the target data account for only a small part of them. In such cases we usually use regular expressions. The following code extracts the dates directly (hint: when handling Chinese, the HTML source and the match pattern must both use UTF-8 encoding, or the code will raise an error):
# -*- coding: utf-8 -*-
import re

html = u"""
<tr><td>2011年10月15日</td></tr>
<tr><td>2012年5月9日</td></tr>
<tr><td>2015年11月23日</td></tr>
<tr><td>2015年7月8日</td></tr>
<tr><td>2016年12月5日</td></tr>
<tr><td>2017年3月22日</td></tr>
<tr><td>2017年6月17日</td></tr>
"""
pattern = re.compile(r'<td>(\d{4})年(\d{1,2})月(\d{1,2})日</td>')
data = pattern.findall(html.encode('utf-8'))
print data
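The tuples returned by findall can then be turned into real date objects for later plotting. A small follow-on sketch (the matches list here just mirrors the findall output format; variable names are illustrative):

```python
from datetime import date

# tuples as produced by pattern.findall, e.g. [('2011', '10', '15'), ...]
matches = [('2011', '10', '15'), ('2012', '5', '9')]
dates = [date(int(y), int(m), int(d)) for y, m, d in matches]
print(dates[0].isoformat())  # → 2011-10-15
```

datetime.date objects sort and subtract naturally, which helps when checking for gaps later.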
Will Baotu Spring stop gushing in 2018?
The Jinan Urban and Rural Water Bureau website (http://jnwater.gov.cn) publishes the groundwater levels of Baotu Spring and Black Tiger Spring every day, and earlier water level data can also be queried there. The goals of this tutorial are: download all the water level data from this website; plot the water level variation curve for a specified period; study the regularity of Jinan's groundwater level changes; and make a preliminary judgment on whether the springs will stop gushing in 2018.

URL Analysis
After a few simple operations we find that, since May 2, 2012, the Jinan Urban and Rural Water Bureau website has published the groundwater level data of Baotu Spring and Black Tiger Spring every day. All the data are paginated, 20 days per page, with page numbers starting from 1. The URL of page 1 is:
http://www.jnwater.gov.cn/list.php?catid=101&page=1
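Since later pages differ only in the page parameter, the full list of page URLs can be generated with simple string formatting. A small sketch (the page count of 3 here is arbitrary):

```python
# build the URLs of the first few listing pages (page numbers start at 1)
base = 'http://www.jnwater.gov.cn/list.php?catid=101&page=%d'
urls = [base % page for page in range(1, 4)]
print(urls[0])  # → http://www.jnwater.gov.cn/list.php?catid=101&page=1
```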
Web Downloads
If you use the previous code to download this page, you will get an error. This is somewhat discouraging: why? It turns out the Jinan Urban and Rural Water Bureau website plays a small trick and refuses non-browser access. What is the difference between accessing the site with our code and with a browser? The answer is the User-Agent field in the request header. When a browser accesses a website, it declares in the request header what type of browser it is and what operating system it runs on. We only need to add a User-Agent to the request header to fool the server. Here is a function that crawls the data of a specified page:
import urllib2

def get_data_by_page(page):
    url = 'http://www.jnwater.gov.cn/list.php?catid=101&page=%d' % page
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    response = urllib2.urlopen(req)
    if response.code == 200:
        return response.read()
    else:
        return False
Data Parsing
Looking at the downloaded source code, it is not hard to find that the groundwater level data of Baotu Spring and Black Tiger Spring look like this in the HTML:
<tr>
    <td width="25%" align="center" bgcolor="#DAE9F8" class="S14">日期</td>
    <td width="15%" align="center" bgcolor="#DAE9F8"><span class="S14">趵突泉水位</span></td>
    <td width="14%" align="center" bgcolor="#DAE9F8"><span class="S14">黑虎泉水位</span></td>
</tr>
<tr>
    <td align="center" bgcolor="#DAF8E8" class="S14">2017年10月2日</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.20米</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.15米</td>
</tr>
<tr>
    <td align="center" bgcolor="#DAF8E8" class="S14">2017年10月1日</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.16米</td>
    <td align="center" bgcolor="#DAF8E8" class="S14">28.10米</td>
</tr>
Combined with the actual page display, it is easy to identify the extraction conditions for the water level data: the tr tag contains exactly 3 td tags, and the td background color (bgcolor) is green (#DAF8E8). Let's refine our download function so that it parses out the data we need instead of returning the raw page source.
# -*- coding: utf-8 -*-
import re
import urllib2
from bs4 import BeautifulSoup

def get_data_by_page(page):
    data = list()
    p_date = re.compile(r'(\d{4})\D+(\d{1,2})\D+(\d{1,2})')
    url = 'http://www.jnwater.gov.cn/list.php?catid=101&page=%d' % page
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    response = urllib2.urlopen(req)
    if response.code == 200:
        html = response.read()
        soup = BeautifulSoup(html, 'lxml')
        for tr in soup.find_all('tr'):
            tds = tr.find_all('td')
            if len(tds) == 3 and 'bgcolor' in tds[0].attrs and tds[0]['bgcolor'] == '#DAF8E8':
                year, month, day = p_date.findall(tds[0].text.encode('utf8'))[0]
                baotu = float(tds[1].text[:-1])    # strip the trailing unit character
                heihu = float(tds[2].text[:-1])
                data.append(['%s-%02d-%02d' % (year, int(month), int(day)), baotu, heihu])
    return data
Crawl All data
Based on the number of pages shown on the site, call the get_data_by_page function in a loop and merge the returned results to obtain all the water level data. This sounds simple, but is actually hard to do well, because:
- not every page request will succeed
- the data do not always match our expectations, so parsing may raise exceptions
- some dates are wrong and some data are missing
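The aggregation loop with simple retry handling might be sketched like this. It is only an outline: get_data_by_page is the function above (passed in as a parameter so the sketch stays self-contained), and page_count, retries, and delay are illustrative parameters not taken from the original:

```python
import time

def crawl_all(get_data_by_page, page_count, retries=3, delay=1.0):
    """Call get_data_by_page for each page, retrying failed requests."""
    all_data = []
    for page in range(1, page_count + 1):
        for attempt in range(retries):
            try:
                rows = get_data_by_page(page)
                if rows:                 # a non-empty result means success
                    all_data.extend(rows)
                    break
            except Exception:
                pass                     # network or parse error: retry below
            time.sleep(delay)            # wait a moment before retrying
    return all_data
```

A page that still fails after all retries is simply skipped, which is one more source of the gaps discussed next.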
The currently known errors are:
- September 9, 2015 is wrongly written as September 9, 2014
- June 3, 2014 is wrongly written as June 3, 2016
- January 31, 2013 is wrongly written as January 31, 2012
- the data for October 30, 2012 are missing
- the data for March 11, 2014 are missing
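One way to repair such gaps is to walk the expected date range day by day with datetime.timedelta and fill each missing day with the average of its two neighbors. A minimal sketch under the assumption that gaps are isolated single days (the function name and data layout are illustrative, not from the original):

```python
from datetime import date, timedelta

def fill_gaps(levels):
    """levels: dict mapping datetime.date -> water level (float),
    possibly with isolated single-day gaps. Returns a new dict where
    each missing day is the average of the previous and next day."""
    days = sorted(levels)
    filled = dict(levels)
    d = days[0]
    while d < days[-1]:
        if d not in filled:
            prev_day = d - timedelta(days=1)
            next_day = d + timedelta(days=1)
            # both neighbours exist when gaps are isolated
            filled[d] = (filled[prev_day] + levels[next_day]) / 2.0
        d += timedelta(days=1)
    return filled
```

The mistyped dates listed above would be corrected before this step, by rewriting them while stepping through the series.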
The best approach is to start from the date of the latest data and use the timedelta object of the datetime module to step through the dates one day at a time: correct a wrong date when one is encountered, and fill missing data with the average of the two adjacent days.

Data storage and retrieval
Drawing the water level variation curve
To be continued.

Data analysis
To be continued.