Python + SQLAlchemy + crawler
In an earlier post I shared some SQLAlchemy basics. This time I share what I learned writing a crawler in Python and then storing the crawled data in a database through SQLAlchemy, building on that earlier material. This is only a test project; the tests I wrote for it with TDD follow at the end.
"""
Author: Jar.Guo
Email: [email protected]
Date: 2016-08-27
Language: Python 2.7.10
"""
import sys
import urllib2

from lxml import html

from musicorm import Music, musicormhelper

reload(sys)
sys.setdefaultencoding('utf8')
Brief description of the class
This class is mainly used to crawl the names listed on the Douban music chart.
Attributes:
    cur_url: the URL of the page currently being crawled
    datas: the processed names that have been crawled
class Musicpicker(object):

    def __init__(self):
        self.cur_url = "https://music.douban.com/chart"
        self.datas = []
        # Database used to store the crawled data
        self.db = musicormhelper("flaskr.db")
        self.db.create_db()
        print "Douban music crawler ready to crawl data ..."
Returns:
    The HTML of the entire crawled page, decoded to unicode, or None if the request fails
Raises:
    Nothing: urllib2.URLError is caught and its error code or reason is printed

    def acquire_music_open(self):
        html_string = None  # stays None when the request fails
        try:
            html_string = urllib2.urlopen(self.cur_url).read().decode("utf-8")
        except urllib2.URLError as e:
            if hasattr(e, "code"):
                print "The server couldn't fulfill the request."
                print "Error code: %s" % e.code
            elif hasattr(e, "reason"):
                print "We failed to reach a server. Please check your URL and read the reason"
                print "Reason: %s" % e.reason
        return html_string
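For readers on Python 3, the same error-handling idea can be sketched with `urllib.request` and `urllib.error`. This is only an illustration under assumptions: the helper names `describe_url_error` and `fetch_page` and the `timeout` parameter are hypothetical and not part of the original crawler.

```python
import urllib.error
import urllib.request


def describe_url_error(err):
    """Turn a URLError (or its subclass HTTPError) into a readable message."""
    # HTTPError carries an HTTP status code. Check it first, because
    # HTTPError is a subclass of URLError and also exposes `reason`.
    if hasattr(err, "code"):
        return "The server couldn't fulfill the request. Error code: %s" % err.code
    if hasattr(err, "reason"):
        return "We failed to reach a server. Reason: %s" % err.reason
    return "Unknown error: %r" % (err,)


def fetch_page(url, timeout=10):
    """Return the decoded page, or None when the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:
        print(describe_url_error(err))
        return None
```

Returning None on failure lets the caller decide whether to stop or retry, instead of hitting an unbound variable after the `except` branch.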
Takes the HTML of the whole page and extracts the names from it, using an XPath query rather than a regular expression.
Args:
    html_string: the HTML text of the page to match against
Returns:
    content_items: the list of matched text nodes

    def select_music_content(self, html_string):
        tree = html.fromstring(html_string)
        content_items = tree.xpath('//a[@href="javascript:;"]/text()')
        return content_items
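To see what that XPath query selects without installing lxml or touching the network, here is a rough standard-library approximation for Python 3. It is a swapped-in technique (`html.parser` instead of XPath), the class and function names are hypothetical, and the sample HTML is invented for illustration. Unlike lxml's `text()`, this sketch also drops whitespace-only text.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class JsLinkTextParser(HTMLParser):
    """Collect the direct text of every <a href="javascript:;"> element."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._inside = False  # currently inside a matching <a>
        self._depth = 0       # nesting depth of child elements inside it

    def handle_starttag(self, tag, attrs):
        if self._inside:
            if tag not in VOID_TAGS:
                self._depth += 1
        elif tag == "a" and dict(attrs).get("href") == "javascript:;":
            self._inside = True
            self._depth = 0

    def handle_endtag(self, tag):
        if not self._inside:
            return
        if self._depth > 0:
            if tag not in VOID_TAGS:
                self._depth -= 1
        elif tag == "a":
            self._inside = False

    def handle_data(self, data):
        # Keep only non-blank text that is a direct child of the matching <a>.
        if self._inside and self._depth == 0 and data.strip():
            self.texts.append(data)


def select_link_texts(html_string):
    parser = JsLinkTextParser()
    parser.feed(html_string)
    parser.close()
    return parser.texts


# Invented sample markup, not taken from the real chart page.
SAMPLE_HTML = """
<ul>
  <li><a href="javascript:;">Song A</a></li>
  <li><a href="/subject/1">Not matched</a></li>
  <li><a href="javascript:;">Song B <span>ignored</span></a></li>
</ul>
"""

matched = [text.strip() for text in select_link_texts(SAMPLE_HTML)]
```

Only anchors whose `href` is exactly `javascript:;` contribute text, and text inside nested child elements is skipped, which mirrors the `/text()` step of the query.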
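Here the matched content gets some simple processing: at most the first 10 names are kept, and each is prefixed with its rank.

    def form_music_content(self, content_items):
        top_num = 1
        temp_data = []
        for index, item in enumerate(content_items):
            # NOTE: the argument to find() was lost in the source text;
            # "&nbsp" is an assumed placeholder, not the confirmed original.
            if item.find("&nbsp") == -1 and top_num <= 10:
                temp_data.append("No. " + str(top_num) + " " + item)
                top_num += 1
        self.datas.extend(temp_data)
        return self.datas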
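The ranking step can be tried on its own, without the crawler class. The sketch below is a standalone Python 3 variant under assumptions: the `limit` and `skip_marker` parameters, the `"&nbsp"` default, and the `"No. N"` label format are illustrative choices, not confirmed details of the original code.

```python
def form_music_content(content_items, limit=10, skip_marker="&nbsp"):
    """Keep at most `limit` items, skipping any that contain `skip_marker`."""
    ranked = []
    for item in content_items:
        if len(ranked) >= limit:
            break  # enough items collected
        if skip_marker in item:
            continue  # skip entries that contain the (assumed) marker
        # The rank is the position among kept items, not among all inputs.
        ranked.append("No. %d %s" % (len(ranked) + 1, item))
    return ranked


# Twenty invented titles, one of which contains the marker.
titles = ["Track %d" % i for i in range(1, 21)]
titles[2] = "&nbsp;placeholder"
result = form_music_content(titles)
```

With 20 input titles and a limit of 10, the result holds 10 entries, which matches the lengths the tests at the end of this post assert.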
Crawler entry point; it also controls which pages the crawler fetches.

    def start_music_spider(self):
        my_page = self.acquire_music_open()
        if my_page is None:
            return  # nothing to parse when the request failed
        content_items = self.select_music_content(my_page)
        self.form_music_content(content_items)
Here we write a method that inserts a record into the database.

    def exportdata(self, music):
        return self.db.addmusic(music)
Finally, let's look at the main entry point.
def main():
    print """
###############################
A simple Douban music top 250 crawler
Jar.Guo Email: [email protected]
Date: 2016-08-27
###############################
"""
    my_spider = Musicpicker()
    my_spider.start_music_spider()
    # Iterate over the crawled content and insert it into the database.
    # Each item is also printed, which makes it easy to check that the
    # crawler fetched what we wanted.
    for item in my_spider.datas:
        item_unicode = unicode(item)
        my_spider.exportdata(Music(item_unicode, item_unicode))
        print item
    print "spider is done ..."


if __name__ == '__main__':
    main()
The related tests are below.
from musicorm import Music
from musicpicker import Musicpicker
import unittest  # unit test module
import sys

reload(sys)
sys.setdefaultencoding('utf8')


class filmreptiletests(unittest.TestCase):

    def setUp(self):  # set up the unit test environment
        self.spider = Musicpicker()

    def tearDown(self):  # clean up the unit test environment
        self.spider = None

    def testinit(self):
        self.assertIsNotNone(self.spider)
        self.assertIsNotNone(self.spider.cur_url)
        self.assertEqual(self.spider.cur_url, "https://music.douban.com/chart")
        self.assertEqual(self.spider.datas, [])

    def testget_page_string(self):
        self.assertIsNotNone(self.spider.acquire_music_open())

    def testfind_title(self):
        html_string = self.spider.acquire_music_open()
        titles = self.spider.select_music_content(html_string)
        self.assertIsNotNone(titles)
        titles_length = len(titles)
        model = self.spider.form_music_content(titles)
        model_length = len(model)
        self.assertGreater(titles_length, 0)
        self.assertEqual(titles_length, 20)
        self.assertIsNotNone(model)
        self.assertEqual(model_length, 10)

    def testexportdata(self):
        html_string = self.spider.acquire_music_open()
        titles = self.spider.select_music_content(html_string)
        self.assertIsNotNone(titles)
        titles_length = len(titles)
        model = self.spider.form_music_content(titles)
        model_length = len(model)
        self.assertGreater(titles_length, 0)
        self.assertEqual(titles_length, 20)
        self.assertIsNotNone(model)
        self.assertEqual(model_length, 10)
        for item in model:
            issuccess = self.spider.exportdata(Music(unicode(item), unicode(item)))
            self.assertTrue(issuccess)