Crawling and Parsing Web Pages with Python and Beautiful Soup


Beautiful Soup is a Python library designed for quick turnaround projects like screen scraping. In short, it is a handy library for parsing XML and HTML.

Website address: http://www.crummy.com/software/BeautifulSoup/

Below is an introduction to using Python and Beautiful Soup to crawl PM2.5 data from a web page.

Website with the PM2.5 data: http://www.pm25.com/city/wuhan.html

This site publishes PM2.5 data collected by monitors placed at several locations, updated roughly every hour (occasionally, instrument data is lost). The data we want to crawl are the air quality indicators reported by the various monitoring points.

import urllib2
import logging
from bs4 import BeautifulSoup as BSoup

def getPM25():
    url = "http://www.pm25.com/city/wuhan.html"

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
    }
    try:
        # Build the request with browser-like headers and fetch the page
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        content = response.read()
        response.close()
        # Parse the HTML and log the update time shown on the page
        pm = BSoup(content, from_encoding="utf-8")
        logging.info(pm.select(".citydata_updatetime")[0].get_text() + u" ")
        with open('pm2dot5.txt', 'a') as f:
            print >> f, pm.select(".citydata_updatetime")[0].get_text()
            # One <li> per monitoring point; pull out each indicator column
            for locate in pm.select(".pj_area_data ul:nth-of-type(1) li"):
                print >> f, locate.select(".pjadt_location")[0].get_text().rjust(15), "\t", \
                    locate.select(".pjadt_aqi")[0].get_text().rjust(15), "\t", \
                    locate.select(".pjadt_quality")[0].get_text().rjust(15), "\t", \
                    locate.select(".pjadt_wuranwu")[0].get_text().rjust(15), "\t", \
                    locate.select(".pjadt_pm25")[0].get_text().rjust(15), "\t", \
                    locate.select(".pjadt_pm10")[0].get_text().rjust(15)
            print >> f, "\n\n\n"
        return 0
    except Exception as e:
        logging.error(e)
        return 1

The fetching mainly uses the Python library urllib2 (a standalone sketch follows the list):

    • Construct the headers to send, disguising the request as the Firefox browser
    • Build a request with the above data and open the network connection
    • Call response.read() to get the HTML content
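
On its own, the fetch step looks roughly like this (a minimal sketch in Python 2.7; the URL and User-Agent string mirror the ones used in getPM25() above):

import urllib2

url = "http://www.pm25.com/city/wuhan.html"
# Minimal headers; the User-Agent string makes the request look like Firefox
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
}
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
content = response.read()    # the raw HTML, as a byte string
response.close()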

Extracting tag contents

Next, Beautiful Soup is used to parse the HTML content and extract the values inside the tags. For the specific functions, also refer to the official documentation.

The select method and the get_text method are mainly used here.

The select method selects elements by tag name (such as a, li, body) or by CSS class or id.

The get_text method retrieves the text inside the selected element.

To find the class of a specific element, use the browser's "Inspect Element" tool.
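
A minimal sketch of select and get_text, using an invented HTML snippet shaped roughly like the site's markup (the snippet and its values are for illustration only):

from bs4 import BeautifulSoup

html = '<ul class="pj_area_data"><li><span class="pjadt_aqi">58</span></li></ul>'
soup = BeautifulSoup(html)
tags = soup.select(".pjadt_aqi")    # select by CSS class; returns a list of matching tags
print tags[0].get_text()            # prints: 58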

Writing the text file:

Python's with statement is mainly used here; it guarantees that the open file is closed even if an exception occurs. A stream-redirection trick is also used:

In print >> f, "Hello", f is an open file stream, and the statement redirects what print would normally write to the screen into that file.
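
For example (Python 2 syntax; in Python 3 the equivalent is print("Hello", file=f)):

with open('pm2dot5.txt', 'a') as f:
    print >> f, "Hello"    # appends "Hello\n" to the file instead of printing to stdout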

Logging:

Because this program runs in the background for a long time, it is best to record error information to make debugging easier. Python's built-in logging module is used.

import logging
from logging.handlers import RotatingFileHandler

# Root logger: DEBUG and above to debug.log, overwritten on each start
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='debug.log',
                    filemode='w')

# Also echo INFO and above to the console
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)

# Rotate debug.log at 1 MB, keeping up to 5 backup files
Rthandler = RotatingFileHandler('debug.log', maxBytes=1*1024*1024, backupCount=5)
Rthandler.setLevel(logging.INFO)
formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
Rthandler.setFormatter(formatter)
logging.getLogger('').addHandler(Rthandler)

Among other things, this configuration sets the log format and the maximum size of the rotating log file.
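
With this configuration in place, the rest of the program just calls the module-level logging functions; for example (the messages here are illustrative):

import logging

logging.info("crawl started")                     # echoed to the console, recorded in debug.log
try:
    raise IOError("simulated network failure")    # stand-in for a real fetch error
except IOError as e:
    logging.error(e)                              # logged to debug.log with timestamp, file and line number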

Timed operation:

With timed runs, the PM2.5 data can be fetched at a specified time each day and combined with the satellite transit time for further analysis. Python's sched module is used for the scheduling.

import sched
import time

def run():
    while True:
        # Schedule the first fetch of the day at an absolute time
        s = sched.scheduler(time.time, time.sleep)
        s.enterabs(each_day_time(9, 50, 30), 1, getPM25, ())
        try:
            s.run()
        except:
            s.run()
        # Retry hourly; getPM25() returns 1 on failure, 0 on success
        time.sleep(60 * 60)
        logging.info("second run")
        while getPM25():
            pass
        time.sleep(60 * 60)
        logging.info("third run")
        while getPM25():
            pass
        time.sleep(60 * 60)
        logging.info("fourth run")
        while getPM25():
            pass
        logging.info(u"\nWait for the next run...")

Here each_day_time is a function that returns the timestamp for the specified time of day:

import datetime
import time

def each_day_time(hour, minute, sec):
    today = datetime.datetime.today()
    today = datetime.datetime(today.year, today.month, today.day, hour, minute, sec)
    tomorrow = today + datetime.timedelta(days=1)
    xtime = time.mktime(tomorrow.timetuple())
    # xtime = time.mktime(today.timetuple())
    return xtime

In addition, if the specified time has already passed, the program will simply continue to run.

Full code download (Python 2.7): http://files.cnblogs.com/files/pasion-forever/pm2-5.v1.rar

Note: double-clicking the .pyw file invokes pythonw.exe to execute it; since there is no GUI, it runs in the background by default.

The result of the crawl: [screenshot of the saved PM2.5 data, not reproduced here]
