Python + SQLAlchemy + crawler

Source: Internet
Author: User


Having covered SQLAlchemy earlier, this time I am sharing a crawler written in Python that stores the data it crawls in a database through SQLAlchemy. This is only a test project; the tests, written in TDD style, follow at the end.

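    # -*- coding: utf-8 -*-
    """
    Author: Jar.Guo
    Email: [email protected]
    Date: 2016-08-27
    Language: Python 2.7.10
    """
    import urllib2
    import sys
    from lxml import html
    from musicorm import Music, MusicORMHelper

    # Python 2 only: make UTF-8 the default encoding for implicit str/unicode conversions
    reload(sys)
    sys.setdefaultencoding('utf8')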

Brief description of the class
This class crawls the track names from the Douban Music chart.

Attributes:
    cur_url: the URL of the page currently being crawled
    datas: the processed track names that have been crawled

    class MusicPicker(object):
        def __init__(self):
            self.cur_url = "https://music.douban.com/chart"
            self.datas = []
            # database used to store the crawled data
            self.db = MusicORMHelper("flaskr.db")
            self.db.create_db()
            print "Douban music crawler ready to crawl data ..."


Returns:
    The HTML of the whole page (decoded to Unicode)
Raises:
    URLError: the exception raised when the URL request fails

        def acquire_music_open(self):
            html_string = None  # stays None if the request fails
            try:
                html_string = urllib2.urlopen(self.cur_url).read().decode("utf-8")
            except urllib2.URLError as e:
                if hasattr(e, "code"):
                    print "The server couldn't fulfill the request."
                    print "Error code: %s" % e.code
                elif hasattr(e, "reason"):
                    print "We failed to reach a server. Please check your URL and read the reason."
                    print "Reason: %s" % e.reason
            return html_string
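For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the fetch step might look roughly like this. It is a minimal sketch rather than the author's code: the function name fetch_html and the timeout parameter are assumptions. It returns None on failure, so the caller has to check the result before parsing it.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page at `url` decoded as UTF-8, or None if the request fails.

    `fetch_html` and `timeout` are illustrative names, not part of the article's code.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print("The server couldn't fulfill the request.")
        print("Error code: %s" % e.code)
    except urllib.error.URLError as e:
        print("We failed to reach a server.")
        print("Reason: %s" % e.reason)
    return None
```

Catching HTTPError separately replaces the hasattr checks: an HTTP status error carries a code, while a connection-level failure only carries a reason.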


Given the HTML of the whole page, extract the track names with an XPath query (not a regular expression).
Args:
    html_string: the HTML text of the page to match against

        def select_music_content(self, html_string):
            tree = html.fromstring(html_string)
            # text of every <a href="javascript:;"> element
            content_items = tree.xpath('//a[@href="javascript:;"]/text()')
            return content_items
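To see what the attribute filter in that XPath expression selects, here is a small offline sketch. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra dependencies. ElementTree supports only a subset of XPath, so the text() step is replaced by reading each element's .text. The sample markup is invented for illustration and is not Douban's real page structure.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (an assumption, not Douban's actual HTML)
SAMPLE = """
<div>
  <a href="javascript:;">Song A</a>
  <a href="https://example.com/other">Not a track</a>
  <a href="javascript:;">Song B</a>
</div>
"""


def select_titles(markup):
    """Return the text of every <a> whose href is exactly "javascript:;"."""
    root = ET.fromstring(markup)
    return [a.text for a in root.findall(".//a[@href='javascript:;']")]


print(select_titles(SAMPLE))
```

Real pages are rarely well-formed XML, which is one reason the crawler itself relies on lxml's more forgiving HTML parser.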

Next comes some simple processing of the extracted content.

        def form_music_content(self, content_items):
            top_num = 1
            temp_data = []
            for index, item in enumerate(content_items):
                # the argument of item.find() was lost in the source text; "&nbsp" is the likely original
                if item.find("&nbsp") == -1 and top_num <= 10:
                    temp_data.append("+" + str(top_num) + "First Name" + item)
                    top_num += 1
            self.datas.extend(temp_data)
            return self.datas
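The filter-and-rank step can be tried in isolation. The sketch below is a stand-in under stated assumptions, not the author's method: the function name label_top, the skip marker, and the "No. N" label format are all hypothetical.

```python
def label_top(items, limit=10, skip_marker="&nbsp"):
    """Prefix the first `limit` usable items with their 1-based rank.

    `label_top`, `skip_marker`, and the "No. N" label are hypothetical choices.
    """
    labelled = []
    for item in items:
        if len(labelled) >= limit:
            break
        if skip_marker in item:
            continue  # drop entries that contain the marker
        labelled.append("No. %d %s" % (len(labelled) + 1, item))
    return labelled


print(label_top(["Song A", "&nbsp;", "Song B"], limit=2))
```

Deriving the rank from the length of the result list avoids keeping a separate counter in step with the appended items.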


Crawler entry point; it also controls the range of pages that are crawled.

        def start_music_spider(self):
            my_page = self.acquire_music_open()
            content_items = self.select_music_content(my_page)
            self.form_music_content(content_items)

Next, let's write a method that inserts a record into the database.

        def exportdata(self, music):
            return self.db.addmusic(music)
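The article never shows the musicorm module, so the following is only a guess at its shape, written for Python 3 and requiring SQLAlchemy to be installed. The class and method names mirror the ones the crawler calls; the table name, the column names (name, title), and the use of SQLite are assumptions.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Music(Base):
    """One crawled chart entry; the columns are assumed, not taken from the article."""
    __tablename__ = "music"

    id = Column(Integer, primary_key=True)
    name = Column(String)
    title = Column(String)

    def __init__(self, name, title):
        self.name = name
        self.title = title


class MusicORMHelper(object):
    def __init__(self, db_name):
        # ":memory:" gives a throwaway SQLite database; a file name persists to disk
        self.engine = create_engine("sqlite:///" + db_name)
        self.Session = sessionmaker(bind=self.engine)

    def create_db(self):
        # Create every table declared on Base that does not exist yet
        Base.metadata.create_all(self.engine)

    def addmusic(self, music):
        """Insert one row; return True on success, False if the commit fails."""
        session = self.Session()
        try:
            session.add(music)
            session.commit()
            return True
        except Exception:
            session.rollback()
            return False
        finally:
            session.close()
```

Returning a boolean from addmusic matches how the tests assert on the result of exportdata; opening a short-lived session per insert keeps the sketch simple at the cost of one commit per row.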

Finally, here is the entry point that ties everything together.

    def main():
        print """
        ###############################
        A simple Douban music top 250 crawler
        Author: Jar.Guo Email: [email protected]
        Date: 2016-08-27
        ###############################
        """
        my_spider = MusicPicker()
        my_spider.start_music_spider()
        # Iterate over the crawled items and insert each one into the database;
        # printing them makes it easy to check that the crawler fetched what we want
        for item in my_spider.datas:
            item_unicode = unicode(item)
            my_spider.exportdata(Music(item_unicode, item_unicode))
            print item
        print "spider is done ..."

    if __name__ == '__main__':
        main()

The related tests are below.

    from musicorm import Music
    from musicpicker import MusicPicker
    import unittest  # unit test module
    import sys

    reload(sys)
    sys.setdefaultencoding('utf8')


    class FilmReptileTests(unittest.TestCase):
        def setUp(self):  # set up the unit test environment
            self.spider = MusicPicker()

        def tearDown(self):  # clean up the unit test environment
            self.spider = None

        def testinit(self):
            self.assertIsNotNone(self.spider)
            self.assertIsNotNone(self.spider.cur_url)
            self.assertEqual(self.spider.cur_url, "https://music.douban.com/chart")
            self.assertEqual(self.spider.datas, [])

        def testget_page_string(self):
            self.assertIsNotNone(self.spider.acquire_music_open())

        def testfind_title(self):
            html_string = self.spider.acquire_music_open()
            titles = self.spider.select_music_content(html_string)
            self.assertIsNotNone(titles)
            titles_length = len(titles)
            model = self.spider.form_music_content(titles)
            model_length = len(model)
            self.assertGreater(titles_length, 0)
            self.assertEqual(titles_length, 20)
            self.assertIsNotNone(model)
            self.assertEqual(model_length, 10)

        def testexportdata(self):
            html_string = self.spider.acquire_music_open()
            titles = self.spider.select_music_content(html_string)
            self.assertIsNotNone(titles)
            titles_length = len(titles)
            model = self.spider.form_music_content(titles)
            model_length = len(model)
            self.assertGreater(titles_length, 0)
            self.assertEqual(titles_length, 20)
            self.assertIsNotNone(model)
            self.assertEqual(model_length, 10)
            for item in model:
                issuccess = self.spider.exportdata(Music(unicode(item), unicode(item)))
                self.assertTrue(issuccess)
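These tests call the live Douban site, so they can fail when the network is unavailable or when the chart page changes. One possible way to make such a test repeatable is to stub out the network call. The sketch below is for Python 3 and uses unittest.mock; the TinyPicker class is a stripped-down, hypothetical stand-in for the crawler, not the author's class.

```python
import unittest
from unittest import mock


class TinyPicker(object):
    """Hypothetical stand-in for the crawler, reduced to two steps."""

    def acquire(self):
        raise RuntimeError("network access is not available in this sketch")

    def titles(self):
        # One title per non-empty line of the fetched text
        return [line.strip() for line in self.acquire().splitlines() if line.strip()]


class TinyPickerTests(unittest.TestCase):
    def test_titles_with_stubbed_fetch(self):
        picker = TinyPicker()
        # Replace the network call with canned text for the duration of the test
        with mock.patch.object(picker, "acquire", return_value="Song A\n\nSong B\n"):
            self.assertEqual(picker.titles(), ["Song A", "Song B"])


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TinyPickerTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

With the fetch stubbed, assertions about parsing and formatting no longer depend on how many entries the live chart happens to contain.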
