I recently started learning about Python web crawlers, so I wrote a simple program to practice (hehe). My environment is Python 3.6 and MySQL 8.0, and the crawl target is the Baidu hot-topics page (http://top.baidu.com/). I only grab the real-time hot-topics column; the other columns should work similarly.

There are two knobs in the code, seconds_per_crawl and crawl_per_update_to_db: the former is the crawl frequency in seconds, and the latter is how many crawls to accumulate before writing to the database once. Both can be set freely. The data I crawl is the hot-topic title, its link, the number of followers, and the timestamp. The in-memory structure is dict{tuple(title, link): list[tuple(followers, timestamp), ...], ...}. In the database, the title and link are stored in the hotnews table, while the follower counts and their timestamps go into a per-topic table whose name is the topic title itself (because a topic may stay hot for a long time while its follower count changes over time).

(The original post showed a sample of the database storage here.)
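The in-memory structure described above can be sketched on its own; the topic names, link, and helper function below are illustrative placeholders, not data from Baidu:

```python
import time

# In-memory structure from the post:
# { (hotspot title, link): [(followers, timestamp), ...], ... }
newsdict = {}

def record(nd, title, link, followers):
    """Append one (followers, timestamp) sample under its (title, link) key."""
    ktup = (title, link)
    vtup = (followers, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    nd.setdefault(ktup, []).append(vtup)

# Two samples of the same topic: one key, a growing list of follower counts.
record(newsdict, "example hot topic", "http://example.com/1", "432100")
record(newsdict, "example hot topic", "http://example.com/1", "450000")
```

Because the key is the (title, link) pair, repeated crawls of the same topic extend one list rather than creating duplicate entries, which is exactly what the per-topic sub-tables rely on later.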
The code is fairly simple and I have not uploaded it to GitHub, so I am pasting it directly below for reference:
```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import threading
import requests
import pymysql
import time
import re

# Crawl frequency, units: seconds
seconds_per_crawl = 10

# Update-to-database frequency, units: crawl count
crawl_per_update_to_db = 1

# Crawl destination site URL
crawl_target_url = 'http://top.baidu.com/'


class DataProducer(threading.Thread):
    # temporarily store crawl results
    newsdict = {}

    def __init__(self):
        threading.Thread.__init__(self)
        # make sure the main table exists before crawling starts
        db = pymysql.connect(host="localhost", user="root", port=3306,
                             passwd=None, db="crawler", charset="utf8")
        cursor = db.cursor()
        sql = """CREATE TABLE IF NOT EXISTS hotnews (
                     information VARCHAR(100) NOT NULL,
                     hyperlink VARCHAR(200),
                     PRIMARY KEY (information));"""
        cursor.execute(sql)
        db.close()

    def run(self):
        print("DataProducer thread start!")
        crawl_data(self.newsdict)
        print("DataProducer thread exit!")


def crawl_data(nd):
    count = 0
    while 1:
        req = requests.get(url=crawl_target_url)
        req.encoding = req.apparent_encoding
        bf = BeautifulSoup(req.text, "html.parser")
        texts = bf.find_all('ul', id="hot-list", class_="list")
        bfs = BeautifulSoup(str(texts), "html.parser")
        # follower counts live in spans whose class marks rise/fall/fair
        spans = bfs.find_all('span', class_=re.compile("icon-fall|icon-rise|icon-fair"))
        lis = bfs.find_all('a', class_="list-title")
        for i in range(10):
            vtup = (spans[i].get_text(),
                    time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
            ktup = (lis[i].get('title'), lis[i].get('href'))
            if ktup in nd.keys():
                nd[ktup].append(vtup)
            else:
                nd[ktup] = [vtup]
        count = count + 1
        if count % crawl_per_update_to_db == 0:
            update_to_db(nd)
            nd.clear()
        time.sleep(seconds_per_crawl)


def update_to_db(nd):
    db = pymysql.connect(host="localhost", user="root", port=3306,
                         passwd=None, db="crawler", charset="utf8")
    cursor = db.cursor()
    for k in nd.keys():
        # insert the hotspot into the main table
        sql1 = "REPLACE INTO hotnews (information, hyperlink) VALUES ('%s', '%s');"
        # each hotspot gets its own sub-table, named after the topic title
        sql2 = ("CREATE TABLE IF NOT EXISTS `%s` (numberofpeople INT NOT NULL, "
                "occurtime DATETIME, PRIMARY KEY (occurtime));")
        try:
            cursor.execute(sql1 % (k[0], k[1]))
            cursor.execute(sql2 % (k[0],))
            db.commit()
        except:
            db.rollback()
        # insert each sample into the corresponding hotspot sub-table
        for e in nd[k]:
            insert_sql = "INSERT INTO `%s` (numberofpeople, occurtime) VALUES (%d, '%s');"
            try:
                cursor.execute(insert_sql % (k[0], int(e[0]), e[1]))
                db.commit()
            except:
                db.rollback()
    db.close()


if __name__ == '__main__':
    t = DataProducer()
    t.start()
    t.join()
```
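The crawl/flush cadence driven by crawl_per_update_to_db can be sketched independently of the crawler and the database; simulate, fake topic names, and the placeholder timestamp below are illustrative stand-ins, not part of the original code:

```python
# Minimal sketch of the cadence: every crawl adds to the buffer, and every
# crawl_per_update_to_db-th crawl flushes the buffer and clears it.
crawl_per_update_to_db = 3

def simulate(total_crawls):
    nd = {}
    flush_sizes = []  # buffer size at each flush (stand-in for update_to_db)
    count = 0
    for i in range(total_crawls):
        # pretend each crawl discovered one new topic
        nd[("topic-%d" % i, "link-%d" % i)] = [("100", "2024-01-01 00:00:00")]
        count = count + 1
        if count % crawl_per_update_to_db == 0:
            flush_sizes.append(len(nd))
            nd.clear()
    return flush_sizes, len(nd)

flushes, leftover = simulate(7)
# 7 crawls with a flush every 3 -> two flushes of 3 entries, 1 entry still buffered
```

Setting crawl_per_update_to_db to 1, as in the listing above, makes every crawl write straight through to MySQL; a larger value batches writes at the cost of losing buffered samples if the program dies between flushes.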