Share a simple Python + MySQL web data crawler

Source: Internet
Author: User
Tags: rollback, python, web crawler

I recently learned about Python web crawlers, so I wrote a simple program to practice (hehe). My environment is Python 3.6 and MySQL 8.0, and the crawl target is the Baidu hot search page (http://top.baidu.com/). I only grab the real-time hotspot column; the other columns should work similarly. There are two variables in the code, seconds_per_crawl and crawl_per_update_to_db: the former is the crawl frequency in seconds, and the latter is how many crawls to accumulate before each write to the database; both can be set freely. The data I crawl is the hotspot title, its link, the number of followers, and the crawl time. The in-memory structure is dict{tuple(hotspot title, link): list[tuple(number of followers, time), ...]}. In the database, the hotspot titles and links go into a hotnews table, while the follower counts and their timestamps go into a sub-table named after the hotspot title (because a hotspot may stay on the list for a long time while its follower count changes over time). Below is an illustrative sample of how the data is stored.
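As a minimal sketch (the title, link, and numbers here are made up for illustration), one hotspot crawled twice looks like this in memory, and maps onto the tables as shown in the comments:

news_dict = {
    # key: (hotspot title, link)
    ("Some hot topic", "http://top.baidu.com/detail?x=1"): [
        # value: list of (number of followers, crawl time) samples
        ("123456", "2018-01-01 12:00:00"),
        ("130000", "2018-01-01 12:00:10"),
    ],
}
# In MySQL this becomes:
#   hotnews(information, hyperlink)               -- one row per hotspot
#   `Some hot topic`(numberofpeople, occurtime)   -- one row per crawl of that hotspot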

The code is fairly simple and I have not uploaded it to GitHub; it is pasted directly below for your reference:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import threading
import requests
import pymysql
import time
import re

# Crawl frequency, unit: seconds
seconds_per_crawl = 10

# Update-to-database frequency, unit: crawl count
crawl_per_update_to_db = 1

# Crawl target site URL
crawl_target_url = 'http://top.baidu.com/'


class DataProducer(threading.Thread):
    # temporarily store crawl results
    news_dict = {}

    def __init__(self):
        threading.Thread.__init__(self)
        # create the main table on startup
        db = pymysql.connect(host="localhost", user="root", port=3306,
                             passwd=None, db="crawler", charset="utf8")
        cursor = db.cursor()
        sql = """CREATE TABLE IF NOT EXISTS hotnews (
                     information VARCHAR(255) NOT NULL,
                     hyperlink   VARCHAR(255),
                     PRIMARY KEY (information));"""  # column widths were garbled in the original; 255 is assumed
        cursor.execute(sql)
        db.close()

    def run(self):
        print("DataProducer thread start!")
        crawl_data(self.news_dict)
        print("DataProducer thread exit!")


def crawl_data(nd):
    count = 0
    while 1:
        req = requests.get(url=crawl_target_url)
        req.encoding = req.apparent_encoding
        bf = BeautifulSoup(req.text, "html.parser")
        texts = bf.find_all('ul', id="hot-list", class_="list")
        bfs = BeautifulSoup(str(texts), "html.parser")
        # follower counts carry an icon-rise/icon-fall/icon-fair class
        spans = bfs.find_all('span', class_=re.compile("icon-fall|icon-rise|icon-fair"))
        lis = bfs.find_all('a', class_="list-title")
        for i in range(10):
            vtup = (spans[i].get_text(), time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
            ktup = (lis[i].get('title'), lis[i].get('href'))
            if ktup in nd:
                nd[ktup].append(vtup)
            else:
                nd[ktup] = [vtup]
        count = count + 1
        if count % crawl_per_update_to_db == 0:
            update_to_db(nd)
            nd.clear()
        time.sleep(seconds_per_crawl)


def update_to_db(nd):
    db = pymysql.connect(host="localhost", user="root", port=3306,
                         passwd=None, db="crawler", charset="utf8")
    cursor = db.cursor()
    for k in nd.keys():
        # insert the hotspot into the main table
        sql1 = "REPLACE INTO hotnews (information, hyperlink) VALUES ('%s', '%s');"
        # each hotspot gets its own sub-table, named after the hotspot title
        sql2 = ("CREATE TABLE IF NOT EXISTS `%s` (numberofpeople INT NOT NULL, "
                "occurtime DATETIME, PRIMARY KEY (occurtime));")
        try:
            cursor.execute(sql1 % (k[0], k[1]))
            cursor.execute(sql2 % k[0])
            db.commit()
        except Exception:
            db.rollback()
        # insert each sample into the corresponding hotspot sub-table
        for e in nd[k]:
            insert_sql = "INSERT INTO `%s` (numberofpeople, occurtime) VALUES (%d, '%s');"
            try:
                cursor.execute(insert_sql % (k[0], int(e[0]), e[1]))
                db.commit()
            except Exception:
                db.rollback()
    db.close()


if __name__ == '__main__':
    t = DataProducer()
    t.start()
    t.join()
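One caveat about the code above: formatting the crawled strings straight into SQL breaks as soon as a hotspot title contains a quote character. A minimal sketch of a safer variant, using a hypothetical helper save_sample (not part of the original program), lets pymysql escape the values; placeholders only work for values, not identifiers, so the sub-table name still has to be formatted into the string:

import pymysql

def save_sample(db, hotspot, followers, occurtime):
    # hypothetical helper: the same writes as update_to_db above, but with
    # pymysql escaping the values instead of raw string formatting
    cursor = db.cursor()
    cursor.execute("REPLACE INTO hotnews (information, hyperlink) VALUES (%s, %s);",
                   (hotspot[0], hotspot[1]))
    # the table name is still formatted in; backticks keep titles with spaces valid
    cursor.execute("INSERT INTO `%s` (numberofpeople, occurtime) VALUES (%%s, %%s);" % hotspot[0],
                   (int(followers), occurtime))
    db.commit()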
