Writing a web crawler script in Python and scheduling it with APScheduler


I started teaching myself Python a while ago and, as a beginner, wanted to write something for practice. Crawler scripts are very convenient to write in Python, and I had recently picked up some MongoDB as well, so everything was in place except a project to try it on.

The requirements are as follows. The page being crawled is an e-book page on the JD.com website, which lists some free e-books that change every day. As soon as new free titles appear, the crawler emails them to me so that I can download them.

First, the overall approach:

1. The crawler script fetches the free book information for the day.

2. The fetched book information is compared with what is already in the database. If a book already exists, nothing is done; if it does not exist, it is inserted. The data is stored in MongoDB.

3. Whenever a database insert is performed, the newly added data is sent out by email.

4. The APScheduler scheduling framework runs the Python script on a schedule.
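Steps 1 to 3 can be sketched as a single function that is called once per scheduled run. This is only an illustration written for Python 3, not the article's script: `run_once`, `InMemoryCollection` and the sample data are hypothetical names, and the in-memory class merely imitates the `find_one`/`insert_one` shape of a MongoDB collection so that the sketch runs without a database or a mail server.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first stored document matching every key in query.
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(dict(doc))


def run_once(fetch_books, collection, send_mail):
    """One scheduled run: fetch, store unseen books, mail their titles."""
    new_titles = []
    for book in fetch_books():                                   # step 1
        if collection.find_one({"href": book["href"]}) is None:  # step 2
            collection.insert_one(book)
            new_titles.append(book["bookname"])
    if new_titles:                                               # step 3
        send_mail("Today's free books: " + ", ".join(new_titles))
    return new_titles


todays_books = [
    {"bookname": "Book A", "href": "/a.html", "date": "2013-05-01"},
    {"bookname": "Book B", "href": "/b.html", "date": "2013-05-01"},
]
store = InMemoryCollection()
outbox = []  # collects mail bodies instead of sending them
first = run_once(lambda: todays_books, store, outbox.append)
second = run_once(lambda: todays_books, store, outbox.append)
print(first)   # ['Book A', 'Book B']
print(second)  # []
print(outbox)  # ["Today's free books: Book A, Book B"]
```

Passing the fetcher, the collection and the mail function in as arguments keeps the comparison logic testable on its own; the second call sends nothing because both hrefs are already stored.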

Second, the main techniques used in the script:

1. A simple crawler in Python

The script uses the urllib2 module to fetch the page and sgmllib to parse it. The imports are as follows:

import urllib2
from sgmllib import SGMLParser

The urlopen() method fetches the HTML source of the page, which is stored in content. The main job of the ListHref() class is to parse the HTML code, that is, to process a semi-structured HTML document.

content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
listhref = ListHref()
listhref.feed(content)

The code of the ListHref() class can be found in the complete script at the end of this article; only the key points are covered here:

The ListHref() class inherits from SGMLParser and overrides some of its methods. SGMLParser breaks HTML into useful pieces, such as start tags and end tags. Once a piece has been identified, the parser calls one of its own methods, chosen by the kind of piece it found. To use the parser, you subclass SGMLParser and override these methods.

SGMLParser splits HTML into different kinds of data and tags, and calls a separate method for each kind:
Start tag (start_tag)
An HTML tag that opens a block, such as <a>. For an <a> tag the parser calls start_a, which is the method ListHref overrides.
End tag (end_tag)
An HTML tag that closes a block, such as </a>. For </a> the parser calls end_a.
Text data (text)
A block of text. handle_data is called with the text when it does not match any other kind of markup.
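The same start tag / end tag / text pattern can be tried out on a small HTML string. The sketch below is a Python 3 illustration and swaps in the standard library's `html.parser.HTMLParser`, because sgmllib only exists in Python 2; the class name `LinkByTextParser`, the label "Free today" and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class LinkByTextParser(HTMLParser):
    """Collect the href of every <a> element whose text equals label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_a = False          # True while between <a> and </a>
        self.current_href = None   # href of the <a> currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            self.in_a = True
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False
            self.current_href = None

    def handle_data(self, data):
        # Text only counts when it sits inside a link that has an href.
        if self.in_a and self.current_href and data.strip() == self.label:
            self.hrefs.append(self.current_href)


sample = (
    '<ul>'
    '<li><a href="/book/1.html">Free today</a></li>'
    '<li><a href="/book/2.html">Buy now</a></li>'
    '<li><a href="/book/3.html"> Free today </a></li>'
    '</ul>'
)
parser = LinkByTextParser("Free today")
parser.feed(sample)
parser.close()
print(parser.hrefs)  # ['/book/1.html', '/book/3.html']
```

The state kept between callbacks (whether a link is open, and which href it carries) is what lets `handle_data` decide whether a piece of text belongs to a link worth keeping.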

The following kinds are not used in this article:
Character reference
An escaped character, written as the decimal or hexadecimal code of the character. When one is found, SGMLParser calls handle_charref with the text of the reference.
Entity reference
An HTML entity, such as &copy;. When one is found, SGMLParser calls handle_entityref with the name of the entity.
Comment
An HTML comment, enclosed in <!-- ... -->. When one is found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction
An HTML processing instruction, enclosed in <? ... >. When one is found, SGMLParser calls handle_pi with the body of the instruction.
Declaration
An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When one is found, SGMLParser calls handle_decl with the body of the declaration.
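To see which handler fires for each of these kinds, a small logging parser is enough. As before, this is a Python 3 sketch that uses `html.parser.HTMLParser` in place of the Python 2 only SGMLParser; the handler names happen to be the same, but `EventLogger` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler is called for each kind of markup."""

    def __init__(self):
        # convert_charrefs=False reports references as separate events
        # instead of folding them into the surrounding text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


logger = EventLogger()
logger.feed("<!DOCTYPE html><?target data?><!-- note --><p>&#169; &copy;</p>")
logger.close()
for kind, value in logger.events:
    print(kind, repr(value))
```

Handlers that are not overridden, such as the ones for the `<p>` tag here, fall back to the base class and do nothing, which is why a crawler only needs to override the few it cares about.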

For details, see the API reference: http://docs.python.org/2/library/sgmllib.html?highlight=sgmlparser#sgmllib.sgmlparser

2. Working with a MongoDB database from Python

First install PyMongo, the Python driver for MongoDB. Download address: https://pypi.python.org/pypi/pymongo/2.5

Import the module:

import pymongo

Connect to the database server at 127.0.0.1 and switch to the database being used, mydatabase:

mongocon = pymongo.Connection(host="127.0.0.1", port=27017)
db = mongocon.mydatabase

Look up a book in the database; book is the collection being queried:

bookinfo = db.book.find_one({"href": bookitem.href})

Insert the book information into the database. Python supports Chinese, but encoding and decoding Chinese text is still fairly involved; for background on decoding and encoding see http://blog.csdn.net/mayflowers/article/details/1568852

b = {
    "bookname": bookitem.bookname.decode('gbk').encode('utf8'),
    "href": bookitem.href,
    "date": bookitem.date
}
db.book.insert(b, safe=True)
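The decode('gbk').encode('utf8') step is easier to follow on a concrete value. The sketch below is written for Python 3, where text (`str`) and bytes are distinct types; `to_utf8` and the sample title are hypothetical and are not part of the article's script.

```python
def to_utf8(raw, source_encoding="gbk"):
    """Decode bytes in source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Bytes as a GBK-encoded page would deliver them (sample title only).
gbk_title = "免费电子书".encode("gbk")
utf8_title = to_utf8(gbk_title)

print(len(gbk_title), len(utf8_title))  # 10 15
print(utf8_title.decode("utf-8"))       # 免费电子书
```

The byte counts differ because each of these characters takes two bytes in GBK and three in UTF-8. In Python 3 it is usually enough to decode once to `str` and pass that along, since the final encode is only needed where bytes are explicitly required.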

For PyMongo, see the API documentation: http://api.mongodb.org/python/2.0.1/

3. Sending mail from Python

Import the mail modules:

# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText

"localhost" is the mail server address. The message itself is built like this:

msg = MIMEText(context)  # body of the plain-text message
msg['Subject'] = sub  # subject
msg['From'] = 'my@vmail.cn'  # sender
msg['To'] = COMMASPACE.join(mailto_list)  # list of recipients

def send_mail(mailto_list, sub, context):
    COMMASPACE = ', '
    mail_host = 'localhost'
    me = 'my@vmail.cn'
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = me
    msg['To'] = COMMASPACE.join(mailto_list)

    send_smtp = smtplib.SMTP(mail_host)

    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.close()
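For comparison, here is a rough Python 3 counterpart that separates building the message from sending it, so the headers can be inspected without a mail server. The helper names `build_message` and `send_message`, and the example.com addresses, are hypothetical; `send_message` assumes an SMTP server is listening on the given host and is deliberately not called here.

```python
import smtplib
from email.message import EmailMessage

COMMASPACE = ", "


def build_message(sender, mailto_list, subject, body):
    """Return a text/plain message with the given headers and body."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = COMMASPACE.join(mailto_list)
    msg.set_content(body)
    return msg


def send_message(msg, mail_host="localhost"):
    """Hand msg to the SMTP server at mail_host (assumed to be running)."""
    with smtplib.SMTP(mail_host) as smtp:
        smtp.send_message(msg)


msg = build_message(
    "my@vmail.cn",
    ["a@example.com", "b@example.com"],
    "Free books updated",
    "Today's free books: Book A, Book B",
)
print(msg["To"])               # a@example.com, b@example.com
print(msg.get_content_type())  # text/plain
```

Keeping the two steps apart also makes it easy to log or preview the message before it is sent.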

Reference documentation: http://docs.python.org/2/library/email.html?highlight=smtplib#

4. The Python scheduling framework APScheduler

Download address: https://pypi.python.org/pypi/APScheduler/2.1.0

Official documentation: http://pythonhosted.org/APScheduler/#faq

API: http://pythonhosted.org/apscheduler/genindex.html

Installation: unzip the download, run python setup.py install, then import the module:

from apscheduler.scheduler import Scheduler

Configuring APScheduler is fairly simple. This example uses only the add_interval_job method, which runs the job again each time the interval elapses; here the interval is 30 minutes. For a worked example, see the article at http://flykite.blog.51cto.com/4721239/832036

# Start the scheduler
sched = Scheduler()
sched.daemonic = False
sched.add_interval_job(job, minutes=30)
sched.start()

About the daemonic parameter:

By default APScheduler creates its thread with daemon=True, that is, as a daemon thread.

In the code above, if sched.daemonic = False is left out, the script will not run on schedule.

Without sched.daemonic = False, the scheduler thread is a daemon thread. The script creates a Scheduler instance, but because the rest of the script finishes almost immediately, the main thread (MainThread) ends right away. At that point the scheduler thread has not yet run the job, and it is terminated together with the main thread; this follows from how daemon threads relate to the main thread. For the script to work correctly, the scheduler thread must be a non-daemon thread, which is what sched.daemonic = False does.
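The daemon / non-daemon distinction can be reproduced with the standard library alone. The following Python 3 sketch does not use APScheduler and does not show how APScheduler is implemented; `start_interval_job` is a hypothetical helper that runs a job at a fixed interval in a background thread, with a very short interval so the example finishes quickly.

```python
import threading
import time


def start_interval_job(job, interval_seconds, stop_event, daemon=False):
    """Run job every interval_seconds in a thread until stop_event is set."""
    def loop():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval_seconds):
            job()

    worker = threading.Thread(target=loop, daemon=daemon)
    worker.start()
    return worker


runs = []
stop = threading.Event()
worker = start_interval_job(lambda: runs.append(time.time()), 0.01, stop)

# With daemon=False the interpreter waits for the worker before exiting,
# even if the main thread has nothing left to do. With daemon=True the
# worker would be abandoned as soon as the main thread finished.
while len(runs) < 3:
    time.sleep(0.01)

stop.set()      # ask the loop to finish
worker.join()   # wait for the worker thread to end
print(len(runs) >= 3, worker.daemon)  # True False
```

In a real script the main thread would simply end after starting the worker, and the non-daemon thread alone would keep the process alive between runs.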

Appendix: the complete script

# -*- coding: utf-8 -*-
import urllib2
from sgmllib import SGMLParser
import pymongo
import time
# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText
from apscheduler.scheduler import Scheduler


# Get the hrefs of the free books
class ListHref(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_a = ""
        self.name = []
        self.freehref = ""
        self.hrefs = []

    def start_a(self, attrs):
        self.is_a = 1
        href = [v for k, v in attrs if k == "href"]
        # Guard against <a> tags that have no href attribute
        self.freehref = href[0] if href else ""

    def end_a(self):
        self.is_a = ""

    def handle_data(self, text):
        # The link text is shown here in translation; the live page uses
        # the corresponding Chinese phrase.
        if self.is_a == 1 and text.decode('utf8').encode('gbk') == "free during limited hours":
            self.hrefs.append(self.freehref)


# Get the free book info
class FreeBook(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_title = ""
        self.name = ""

    def start_title(self, attrs):
        self.is_title = 1

    def end_title(self):
        self.is_title = ""

    def handle_data(self, text):
        if self.is_title == 1:
            self.name = text


# Mongo store module
class FreeBookMod:
    def __init__(self, date, bookname, href):
        self.date = date
        self.bookname = bookname
        self.href = href


def get_book(booklist):
    content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
    listhref = ListHref()
    listhref.feed(content)
    for href in listhref.hrefs:
        content = urllib2.urlopen(str(href)).read()
        listbook = FreeBook()
        listbook.feed(content)
        name = listbook.name
        # The delimiter was garbled in the source text; the closing title
        # mark (two bytes in GBK, hence n + 2) is assumed here.
        n = name.index(u'》'.encode('gbk'))
        # print(name[0:n+2])
        freebook = FreeBookMod(time.strftime('%Y-%m-%d', time.localtime(time.time())), name[0:n+2], href)
        booklist.append(freebook)
    return booklist


def record_book(booklist, context, issendmail):
    # Database operation
    mongocon = pymongo.Connection(host="127.0.0.1", port=27017)
    db = mongocon.mydatabase
    for bookitem in booklist:
        bookinfo = db.book.find_one({"href": bookitem.href})
        if not bookinfo:
            b = {
                "bookname": bookitem.bookname.decode('gbk').encode('utf8'),
                "href": bookitem.href,
                "date": bookitem.date
            }
            db.book.insert(b, safe=True)
            issendmail = True
            context = context + bookitem.bookname.decode('gbk').encode('utf8') + ','
    return context, issendmail


# Send message
def send_mail(mailto_list, sub, context):
    COMMASPACE = ','
    mail_host = "localhost"
    me = "my@vmail.cn"
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = me
    msg['To'] = COMMASPACE.join(mailto_list)
    send_smtp = smtplib.SMTP(mail_host)
    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.close()


# Main job for the scheduler
def job():
    booklist = []
    issendmail = False
    context = "Today free books are"
    mailto_list = ["mailto@mail.cn"]
    booklist = get_book(booklist)
    context, issendmail = record_book(booklist, context, issendmail)
    if issendmail:
        send_mail(mailto_list, 'free book are Update', context)


if __name__ == "__main__":
    # Start the scheduler
    sched = Scheduler()
    sched.daemonic = False
    sched.add_interval_job(job, minutes=30)
    sched.start()
