Python crawler instance collects 2345 weather forecast, python2345

Source: Internet
Author: User

Python crawler instance collects 2345 weather forecast, python2345

During the winter vacation, I learned Python crawlers and used the simplest method to obtain the required weather data. Right, no error, the simplest method. There is no function encapsulation ..

Web: http://tianqi.2345.com/wea_history/53892.htm

Right-click Firefox to view the source code of the webpage, and no weather data is found. Therefore, the webpage adopts json format data.

Right-click and choose View element> network> JS. The location is found.

Download and store data in json format using Python crawlers. The Code is as follows:

#-*-Coding: UTF-8-*-import urllib2 import json months = [,] years =, 53892] city = [53892] # Handan code for y in years: for m in months: for c in city: url = "http://tianqi.2345.com/t/wea_history/js/" + str (c) + "_" + str (y) + str (m) + ". js? Qq-pf-to = pcqq. c2C "print url html = urllib2.urlopen (url) srcData = html. read () # JsonData = json. loads (srcData) file = open ("d:/json/" + str (c) + "handan/weather" + str (c) + "_" + str (y) + str (m) + ". json "," w ") file. write (srcData) file. close ()
Store the data to a local device:

Because I just learned something, I learned a little bit and started to practice it. I haven't learned how to convert json. I directly use regular expression matching to extract json data and print it directly.

Extract the Python code from the converted json file:

#-*-Coding: UTF-8-*-import json import re import time Year = [2014] Month = [1] for y in Year: for m in Month: "The change was successful in February 15, 2016. It is because of the encoding problem after regular expression matching that the output cannot be displayed. After each RegEx-matched tuples, Add. decode ('gbk '). encode ('utf-8'), "" content = fRead. read () pattern = re. compile ('{ymd :\'(. *?) \ ', BWendu :\'(.*?) \ ', YWendu :\'(.*?) \ ', Tianqi :\'(.*?) \ ', Fengxiang :\'(.*?) \ ', Fengli :\'(.*?) \ '},', Re. s) items = re. findall (pattern, content) for item in items: print item [0]. decode ('gbk '). encode ('utf-8'), "," + item [1]. decode ('gbk '). encode ('utf-8'), "," + item [2]. decode ('gbk '). encode ('utf-8'), "," + item [3]. decode ('gbk '). encode ('utf-8'), "," + item [4]. decode ('gbk '). encode ('utf-8'), "," + item [5]. decode ('gbk '). encode ('utf-8') time. sleep (0.1) fRead. close ()

Run with Sublime Text 3

A major problem with Regular Expression Processing is that the format is not neat and some data is always missing. The reason may be that the matching speed is too fast, leading to a lack of data. However, sleeping through time. sleep () still cannot solve the problem.

We can see the defects in regular expression matching. After using Python to process json data packets, try again.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.