I needed today's weather data, so I picked up Python and wrote a crawler to fetch it from the China Weather Network. The data I need is simple: only the temperature (lowest and highest) and the weather in Beijing, so the code is simple as well. The crawl process is described below.
Step 1: Web page analysis
To design the crawler, we first need to analyze how the web page is requested. Open the China Weather Network homepage and search for Beijing in the search box to see the weather in Beijing, as shown in the following picture:
The "Today" section does not show the lowest and highest temperatures I want, so I chose the "7 Days" link instead. The screenshot is as follows:
This page has the data I want (the lowest and highest temperatures), so the next step is to analyze how the page is requested. Comparing the "Today" page with the "7 Days" page shows that the site is fetched with a simple GET request.
For example, the requested URL for the "7 Days" page is as follows:
url = "http://www.weather.com.cn/weather/101010100.shtml"
Here, "weather" indicates that the request is for the "7 Days" page (for the "Today" page it is "weather1d"), and the trailing "101010100" is the area code for Beijing.
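The URL pattern described above can be sketched as a small helper. This is only an illustration: the names `PAGE_SEGMENTS` and `build_url` are hypothetical, and the only values taken from this article are the two path segments and the Beijing area code.

```python
# Path segment for each page name used in this article.
PAGE_SEGMENTS = {"7 Days": "weather", "Today": "weather1d"}


def build_url(area_code, page="7 Days"):
    """Build the forecast URL for an area code and a page name."""
    segment = PAGE_SEGMENTS[page]
    return "http://www.weather.com.cn/{}/{}.shtml".format(segment, area_code)


print(build_url("101010100"))           # "7 Days" page for Beijing
print(build_url("101010100", "Today"))  # "Today" page for Beijing
```

Keeping the area code as a parameter means the same helper would work for another city, provided its code is known.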
With the URL figured out, the next step is to analyze the page source and find where the data sits in it. After some searching, I located the data: the weather and the temperature are in two p tags, with the highest temperature in a span tag and the lowest temperature in an i tag.
One thing to note is that the page changes at night: the highest temperature is no longer shown, and the web page looks like this:
In the page source, the span tag is missing and only the i tag with the lowest temperature remains. Since my application needs a highest temperature, I work around the missing value by using the next day's highest temperature instead (a rough approximation).
That completes the web page analysis; the next step is to get the data.
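To make the night-time case concrete, here is a small sketch that uses only Python's standard-library `html.parser`, which is not the library the crawler itself uses. The HTML string is a simplified, made-up stand-in for the real page, the class name `TemParser` is hypothetical, and the temperature paragraph is assumed to carry `class="tem"`, as in the code later in this article.

```python
from html.parser import HTMLParser

# Made-up night-time markup: today's entry has no <span> (no highest temperature).
NIGHT_HTML = """
<ul>
  <li><p class="tem"><i>12℃</i></p></li>
  <li><p class="tem"><span>25℃</span>/<i>13℃</i></p></li>
</ul>
"""


class TemParser(HTMLParser):
    """Collect the span (high) and i (low) text of every <p class="tem">."""

    def __init__(self):
        super().__init__()
        self.days = []        # one dict per <p class="tem">, in page order
        self._in_tem = False  # True while inside a <p class="tem">
        self._field = None    # "high" inside <span>, "low" inside <i>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "tem" in classes:
            self._in_tem = True
            self.days.append({"high": None, "low": None})
        elif self._in_tem and tag == "span":
            self._field = "high"
        elif self._in_tem and tag == "i":
            self._field = "low"

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_tem = False
        elif tag in ("span", "i"):
            self._field = None

    def handle_data(self, data):
        if self._in_tem and self._field:
            self.days[-1][self._field] = data.strip()


parser = TemParser()
parser.feed(NIGHT_HTML)
today, tomorrow = parser.days[0], parser.days[1]
# Fall back to the next day's high when today's is missing.
high = today["high"] or tomorrow["high"]
print(high, today["low"])
```

With this sample, today's entry has no high, so the value printed as the high comes from the second entry.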
Step 2: Data acquisition
Given the elegance of Python, this simple crawler uses Python and Beautiful Soup 4 to parse the page. Beautiful Soup is a Python library that extracts data from HTML or XML files, and its parsing features solve many problems quickly and easily. For more about Beautiful Soup, refer to the official documentation or other resources; my code is posted directly below.
from urllib.request import urlopen
from bs4 import BeautifulSoup

resp = urlopen('http://www.weather.com.cn/weather/101010100.shtml')
soup = BeautifulSoup(resp, 'html.parser')
# The first p tag with class="tem" holds today's temperature data
tagtoday = soup.find('p', class_="tem")
try:
    # The highest temperature is sometimes not shown (at night)
    temperaturehigh = tagtoday.span.string
except AttributeError as e:
    # Use the highest temperature of the next day instead
    temperaturehigh = tagtoday.find_next('p', class_="tem").span.string
temperaturelow = tagtoday.i.string  # get the lowest temperature
weather = soup.find('p', class_="wea").string  # get the weather
print('Minimum temperature: ' + temperaturelow)
print('Maximum temperature: ' + temperaturehigh)
print('weather: ' + weather)
The program's output is as follows:
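As a possible variation on the script above, offered as a sketch rather than as the original method, the download and the parsing can be separated. That makes the extraction logic testable offline and handles missing tags explicitly instead of through an exception. The names `fetch`, `parse_today`, `text_or_none` and `SAMPLE`, the 10-second timeout, and the returned dictionary keys are all assumptions for illustration. `SAMPLE` is a simplified, made-up stand-in for the real page, and the `beautifulsoup4` package is assumed to be installed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the night-time forecast markup.
SAMPLE = """
<ul>
  <li><p class="wea">Sunny</p><p class="tem"><i>12℃</i></p></li>
  <li><p class="wea">Cloudy</p><p class="tem"><span>25℃</span>/<i>13℃</i></p></li>
</ul>
"""


def fetch(url, timeout=10):
    """Download a page and return its raw bytes (timeout value is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_today(html):
    """Extract today's weather, low, and high from forecast HTML."""
    soup = BeautifulSoup(html, "html.parser")
    tag_today = soup.find("p", class_="tem")
    if tag_today is None:
        raise ValueError("no p tag with class 'tem' found; the layout may have changed")
    span = tag_today.span
    if span is None:
        # Night-time case: borrow the next day's high, if there is one.
        next_tem = tag_today.find_next("p", class_="tem")
        span = next_tem.span if next_tem is not None else None
    return {
        "weather": text_or_none(soup.find("p", class_="wea")),
        "low": text_or_none(tag_today.i),
        "high": text_or_none(span),
    }


result = parse_today(SAMPLE)
print(result)
```

Against the live site, the call would look like `parse_today(fetch("http://www.weather.com.cn/weather/101010100.shtml"))`, assuming the page still has the layout described in this article.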