A while ago I started teaching myself Python. As a beginner I wanted to write something for practice, and I had read that Python is very convenient for writing crawler scripts. I had also recently learned some MongoDB, so everything was in place and all that was missing was a reason to start.
The requirement is as follows. The crawler fetches a page on the Jingdong (JD.com) e-book site, which offers some free e-books that change every day. As soon as a new free title appears, the crawler emails it to me so that I know to download it.
First, the approach:
1. The crawler script fetches the day's free book information.
2. Compare the fetched book information with what is already in the database. If the book exists, do nothing; if it does not exist, insert it, so that the data is stored in MongoDB.
3. Whenever a database insert is performed, send the newly added data by email.
4. Use the APScheduler scheduling framework to run the Python script on a schedule.
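The four steps above can be sketched without any network, database, or mail server. This is only an illustration of the control flow in Python 3: the names find_new_books and run_once, the set standing in for the database, and the example.com links are all made up here.

```python
def find_new_books(fetched_books, known_hrefs):
    """Return the fetched books whose href has not been stored yet."""
    new_books = []
    for book in fetched_books:
        if book["href"] not in known_hrefs:
            new_books.append(book)
            # Remember the link so the same book is reported only once.
            known_hrefs.add(book["href"])
    return new_books


def run_once(fetched_books, known_hrefs, notify):
    """One scheduled run: store what is new, notify only if something was stored."""
    new_books = find_new_books(fetched_books, known_hrefs)
    if new_books:
        notify(new_books)
    return new_books


# A set stands in for the MongoDB collection, a list for the mailbox.
known = {"http://example.com/book/1"}
fetched = [
    {"bookname": "Book One", "href": "http://example.com/book/1"},
    {"bookname": "Book Two", "href": "http://example.com/book/2"},
]
sent = []
run_once(fetched, known, sent.append)  # Book Two is new: one notification
run_once(fetched, known, sent.append)  # nothing new: no notification
```

In the real script the set is replaced by a MongoDB lookup, notify by the mail function, and the repeated calls by the scheduler.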
Second, the main techniques used in the script:
1. A simple Python crawler
The modules used here are urllib2, which fetches the page, and sgmllib, which parses it. The imports are as follows:
import urllib2
from sgmllib import SGMLParser
The urlopen() method fetches the HTML source of the page, which is stored in content. The main job of the ListHref() class is to parse the HTML, which is a semi-structured document type.
content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
listhref = ListHref()
listhref.feed(content)
The code of the ListHref() class can be found in the complete listing at the end; only a few key points are covered here.
The ListHref() class inherits from SGMLParser and overrides some of its methods. SGMLParser breaks HTML into useful pieces, such as start tags and end tags. Each time it has successfully split off a piece, it calls one of its own methods, chosen according to what it found. To use the parser, you subclass SGMLParser and override these methods.
SGMLParser splits HTML into different kinds of data and tags, and calls a separate method for each kind:
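Start tag (start_tag)
An HTML tag that opens a block, such as <a> or <title>. For a tag named a, SGMLParser looks for a method called start_a.
End tag (end_tag)
An HTML tag that closes a block, such as </a> or </title>. For a tag named a, the matching method is end_a.
Text data (text)
A block of text. When the parser meets text that matches no other kind of markup, it calls handle_data with that text.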
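The start tag, end tag, and text flow can also be tried without Python 2. Below is a minimal sketch using html.parser.HTMLParser from the Python 3 standard library. This is a different parser from sgmllib: it dispatches to handle_starttag and handle_endtag instead of start_a and end_a. The class name LinkTextCollector, the link text, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the href of every <a> element whose text equals wanted_text."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted_text = wanted_text
        self.in_a = False         # True while the parser is inside <a>...</a>
        self.current_href = None  # href of the <a> currently open, if any
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a = True
            # attrs is a list of (name, value) pairs; href may be missing.
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False
            self.current_href = None

    def handle_data(self, data):
        if self.in_a and self.current_href and data.strip() == self.wanted_text:
            self.hrefs.append(self.current_href)


sample_html = (
    '<p>Intro</p>'
    '<a href="/book/1">Free today</a>'
    '<a href="/book/2">Buy now</a>'
    '<a>Free today</a>'
)
collector = LinkTextCollector("Free today")
collector.feed(sample_html)
collector.close()
found_hrefs = collector.hrefs
```

Only the first link is collected: the second has different text, and the third has no href.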
The following kinds are not used in this article:
Character reference
An escaped character written as its decimal or hexadecimal code. When one is found, SGMLParser calls handle_charref with the text of the reference.
Entity reference
An HTML entity of the form &ref;. When one is found, SGMLParser calls handle_entityref with the name of the entity.
Comment
An HTML comment, enclosed in <!-- ... -->. When one is found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction
An HTML processing instruction, enclosed in <? ... >. When one is found, SGMLParser calls handle_pi with the body of the instruction.
Declaration
An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When one is found, SGMLParser calls handle_decl with the body of the declaration.
For details, see the API reference: http://docs.python.org/2/library/sgmllib.html?highlight=sgmlparser#sgmllib.sgmlparser
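The handlers that the crawler does not use can be observed with a small logging parser. This sketch again uses Python 3's html.parser.HTMLParser rather than sgmllib; it offers handlers with the same names. The convert_charrefs=False argument is needed so that references are reported as separate events instead of being folded into the text. The class name EventLogger and the sample markup are illustrative.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler is called for each kind of markup."""

    def __init__(self):
        # Keep character and entity references as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


logger = EventLogger()
logger.feed(
    '<!DOCTYPE html>'
    '<!-- a note -->'
    '<?xml-stylesheet href="a.css"?>'
    '<p>&copy; &#169;</p>'
)
logger.close()
kinds = [kind for kind, _ in logger.events]
```

Printing logger.events shows one entry per construct, in document order.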
2. Working with MongoDB from Python
First install PyMongo, the Python driver for MongoDB. Download address: https://pypi.python.org/pypi/pymongo/2.5
Import the module:
import pymongo
Connect to the database server at 127.0.0.1 and switch to the database in use, mydatabase:
mongocon = pymongo.Connection(host="127.0.0.1", port=27017)
db = mongocon.mydatabase
Look up the book in the database; book is the collection being queried:
bookinfo = db.book.find_one({"href": bookitem.href})
Insert the book information into the database. Python supports Chinese, but encoding and decoding Chinese text is still fairly involved; for details see http://blog.csdn.net/mayflowers/article/details/1568852
b = {
    "bookname": bookitem.bookname.decode('gbk').encode('utf8'),
    "href": bookitem.href,
    "date": bookitem.date
}
db.book.insert(b, safe=True)
For PyMongo, see the API documentation: http://api.mongodb.org/python/2.0.1/
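The look-up-then-insert logic can be exercised without a running MongoDB server. The sketch below is Python 3 and is written against the method names find_one and insert_one, which current PyMongo collections provide. InMemoryCollection is a hypothetical stand-in created only so the example runs on its own, and store_if_new, the sample title, and the dates are likewise made up. It also assumes the book name arrives as GBK-encoded bytes, as in the article.

```python
class InMemoryCollection:
    """Hypothetical stand-in with the two collection methods used below."""

    def __init__(self):
        self.documents = []

    def find_one(self, query):
        # Return the first document whose fields all match the query.
        for doc in self.documents:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, document):
        self.documents.append(dict(document))


def store_if_new(collection, raw_name, href, date, page_encoding="gbk"):
    """Insert the book unless its href is already stored. Return True if inserted."""
    if collection.find_one({"href": href}) is not None:
        return False
    collection.insert_one({
        # Decode the page bytes once; Python 3 strings are Unicode.
        "bookname": raw_name.decode(page_encoding),
        "href": href,
        "date": date,
    })
    return True


books = InMemoryCollection()
raw_title = "Python 入门".encode("gbk")  # bytes as they would come from a GBK page
first = store_if_new(books, raw_title, "/book/1", "2013-05-01")
second = store_if_new(books, raw_title, "/book/1", "2013-05-02")
```

With a real server, a collection obtained from pymongo.MongoClient (for example client.mydatabase.book) could be passed in place of the stand-in. That path is an assumption about a current PyMongo installation and is not exercised here.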
3. Sending mail from Python
Import the mail modules:
# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText
"localhost" is the mail server address.
msg = MIMEText(context)  # body of the plain-text mail
msg['Subject'] = sub  # subject
msg['From'] = 'my@vmail.cn'  # sender
msg['To'] = COMMASPACE.join(mailto_list)  # list of recipients
def send_mail(mailto_list, sub, context):
    COMMASPACE = ', '
    mail_host = 'localhost'
    me = 'my@vmail.cn'
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = "my@vmail.cn"
    msg['To'] = COMMASPACE.join(mailto_list)
    # Hand the message to the SMTP server on mail_host
    send_smtp = smtplib.SMTP(mail_host)
    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.close()
Reference documentation: http://docs.python.org/2/library/email.html?highlight=smtplib#
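Building the message can be checked separately from sending it. Here is a minimal Python 3 sketch in which build_message and send_message are illustrative names, the second recipient address is made up, and the UTF-8 charset is an assumption chosen so that Chinese book titles survive in the body. send_message is defined but deliberately not called, since it needs a mail server on mail_host.

```python
import smtplib
from email.mime.text import MIMEText

COMMASPACE = ", "


def build_message(sender, recipients, subject, body):
    """Create a text/plain message with the usual headers filled in."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = COMMASPACE.join(recipients)
    return msg


def send_message(msg, sender, recipients, mail_host="localhost"):
    """Hand the message to the SMTP server; not called in this sketch."""
    with smtplib.SMTP(mail_host) as smtp:
        smtp.sendmail(sender, recipients, msg.as_string())


message = build_message(
    "my@vmail.cn",
    ["mailto@mail.cn", "other@mail.cn"],  # second address is hypothetical
    "Free books updated",
    "Today's free books: Book Two",
)
```

Note that the To header is only what the recipient sees. The addresses actually used for delivery are the list passed to sendmail.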
4. The Python scheduling framework APScheduler
Download address: https://pypi.python.org/pypi/APScheduler/2.1.0
Official documentation: http://pythonhosted.org/APScheduler/#faq
API: http://pythonhosted.org/apscheduler/genindex.html
Installation: unpack the download and run python setup.py install. Then import the module:
from apscheduler.scheduler import Scheduler
APScheduler is fairly simple to configure. This example uses only the add_interval_job method, which runs the job each time the interval elapses; here the interval is 30 minutes. For more examples see http://flykite.blog.51cto.com/4721239/832036
# Start the scheduler
sched = Scheduler()
sched.daemonic = False
sched.add_interval_job(job, minutes=30)
sched.start()
About the daemonic parameter:
By default APScheduler creates its worker thread with daemon=True, that is, as a daemon thread.
In the code above, if sched.daemonic = False is left out, the scheduled job never runs.
Without sched.daemonic = False the scheduler thread is a daemon thread. The script creates a Scheduler instance, but because the rest of the script finishes very quickly, the main thread ends almost immediately. At that point the scheduler thread has not yet run the job, and it is terminated along with the main thread (this follows from how daemon threads relate to the main thread). For the script to work correctly, the scheduler thread must be a non-daemon thread: sched.daemonic = False
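The effect of the daemon flag can be seen with the standard library alone. The following Python 3 sketch is not APScheduler. It is a small stand-in, with the made-up name run_every, that runs a job on a worker thread each time an interval elapses. Because the worker is created with daemon=False, the interpreter waits for it even after the main thread has nothing left to do.

```python
import threading


def run_every(interval_seconds, job, max_runs, daemon=False):
    """Run job on a worker thread after each interval, max_runs times in total."""
    stop = threading.Event()

    def loop():
        for _ in range(max_runs):
            # wait() acts as an interruptible sleep; it returns True if stop was set.
            if stop.wait(interval_seconds):
                break
            job()

    worker = threading.Thread(target=loop, daemon=daemon)
    worker.start()
    return worker, stop


runs = []
worker, stop = run_every(0.01, lambda: runs.append(1), max_runs=3)
# join() is used here only so the result can be inspected; with daemon=False
# the process would stay alive until the worker finished even without it.
worker.join()
```

If daemon=True were passed and join() removed, a script ending right after run_every would normally exit before the first interval elapsed, which is the behaviour described above.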
Appendix: the complete script
# -*- coding: utf-8 -*-
import urllib2
from sgmllib import SGMLParser
import pymongo
import time
# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText
from apscheduler.scheduler import Scheduler


# Get free book hrefs
class ListHref(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_a = ""
        self.name = []
        self.freehref = ""
        self.hrefs = []

    def start_a(self, attrs):
        self.is_a = 1
        href = [v for k, v in attrs if k == "href"]
        # An <a> tag may have no href attribute
        self.freehref = href[0] if href else ""

    def end_a(self):
        self.is_a = ""

    def handle_data(self, text):
        # Matches the link text that marks a limited-time free book
        if self.is_a == 1 and text.decode('utf8').encode('gbk') == "free during limited hours":
            self.hrefs.append(self.freehref)


# Get free book info
class FreeBook(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_title = ""
        self.name = ""

    def start_title(self, attrs):
        self.is_title = 1

    def end_title(self):
        self.is_title = ""

    def handle_data(self, text):
        if self.is_title == 1:
            self.name = text


# Mongo store module
class FreeBookMod:
    def __init__(self, date, bookname, href):
        self.date = date
        self.bookname = bookname
        self.href = href


def get_book(booklist):
    content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
    listhref = ListHref()
    listhref.feed(content)
    for href in listhref.hrefs:
        content = urllib2.urlopen(str(href)).read()
        listbook = FreeBook()
        listbook.feed(content)
        name = listbook.name
        # Keep the page title up to the closing quotation mark
        n = name.index('"')
        # print(name[0:n+2])
        freebook = FreeBookMod(time.strftime('%Y-%m-%d', time.localtime(time.time())), name[0:n+2], href)
        booklist.append(freebook)
    return booklist


def record_book(booklist, context, issendmail):
    # Database operation
    mongocon = pymongo.Connection(host="127.0.0.1", port=27017)
    db = mongocon.mydatabase
    for bookitem in booklist:
        bookinfo = db.book.find_one({"href": bookitem.href})
        if not bookinfo:
            b = {
                "bookname": bookitem.bookname.decode('gbk').encode('utf8'),
                "href": bookitem.href,
                "date": bookitem.date
            }
            db.book.insert(b, safe=True)
            issendmail = True
            context = context + bookitem.bookname.decode('gbk').encode('utf8') + ','
    return context, issendmail


# Send message
def send_mail(mailto_list, sub, context):
    COMMASPACE = ','
    mail_host = "localhost"
    me = "my@vmail.cn"
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = "my@vmail.cn"
    msg['To'] = COMMASPACE.join(mailto_list)
    send_smtp = smtplib.SMTP(mail_host)
    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.close()


# Main job for the scheduler
def job():
    booklist = []
    issendmail = False
    context = "Today free books are"
    mailto_list = ["mailto@mail.cn"]
    booklist = get_book(booklist)
    context, issendmail = record_book(booklist, context, issendmail)
    if issendmail:
        send_mail(mailto_list, 'free book are Update', context)


if __name__ == "__main__":
    # Start the scheduler
    sched = Scheduler()
    sched.daemonic = False
    sched.add_interval_job(job, minutes=30)
    sched.start()