I recently started learning about Python web crawlers, so I wrote a simple program to practice (hehe). My environment is Python 3.6 and MySQL 8.0, and the crawl target is the Baidu hot-topics page (http://top.baidu.com/). I only grab the real-time hot-topics column; the other columns should work similarly.

There are two knobs in the code, seconds_per_crawl and crawl_per_update_to_db: the former is the crawl frequency in seconds, and the latter is how many crawls to accumulate before writing to the database once. Both can be set freely. The data I crawl is the hot-topic title, its link, the number of followers, and the timestamp. The in-memory structure is dict{tuple(title, link): list[tuple(followers, timestamp), ...], ...}. In the database, the title and link are stored in the hotnews table, while the follower counts and their timestamps go into a per-topic table whose name is the topic title itself (because a topic may stay hot for a long time while its follower count changes over time).

(The original post showed a sample of the database storage here.)
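The in-memory structure described above can be sketched on its own; the topic names, link, and helper function below are illustrative placeholders, not data from Baidu:

```python
import time

# In-memory structure from the post:
# { (hotspot title, link): [(followers, timestamp), ...], ... }
newsdict = {}

def record(nd, title, link, followers):
    """Append one (followers, timestamp) sample under its (title, link) key."""
    ktup = (title, link)
    vtup = (followers, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    nd.setdefault(ktup, []).append(vtup)

# Two samples of the same topic: one key, a growing list of follower counts.
record(newsdict, "example hot topic", "http://example.com/1", "432100")
record(newsdict, "example hot topic", "http://example.com/1", "450000")
```

Because the key is the (title, link) pair, repeated crawls of the same topic extend one list rather than creating duplicate entries, which is exactly what the per-topic sub-tables rely on later.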
The code is fairly simple and I have not uploaded it to GitHub, so I am pasting it directly below for reference:
```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import threading
import requests
import pymysql
import time
import re

# Crawl frequency, units: seconds
seconds_per_crawl = 10

# Update-to-database frequency, units: crawl count
crawl_per_update_to_db = 1

# Crawl destination site URL
crawl_target_url = 'http://top.baidu.com/'


class DataProducer(threading.Thread):
    # temporarily store crawl results
    newsdict = {}

    def __init__(self):
        threading.Thread.__init__(self)
        # make sure the main table exists before crawling starts
        db = pymysql.connect(host="localhost", user="root", port=3306,
                             passwd=None, db="crawler", charset="utf8")
        cursor = db.cursor()
        sql = """CREATE TABLE IF NOT EXISTS hotnews (
                     information VARCHAR(100) NOT NULL,
                     hyperlink VARCHAR(200),
                     PRIMARY KEY (information));"""
        cursor.execute(sql)
        db.close()

    def run(self):
        print("DataProducer thread start!")
        crawl_data(self.newsdict)
        print("DataProducer thread exit!")


def crawl_data(nd):
    count = 0
    while 1:
        req = requests.get(url=crawl_target_url)
        req.encoding = req.apparent_encoding
        bf = BeautifulSoup(req.text, "html.parser")
        texts = bf.find_all('ul', id="hot-list", class_="list")
        bfs = BeautifulSoup(str(texts), "html.parser")
        # follower counts live in spans whose class marks rise/fall/fair
        spans = bfs.find_all('span', class_=re.compile("icon-fall|icon-rise|icon-fair"))
        lis = bfs.find_all('a', class_="list-title")
        for i in range(10):
            vtup = (spans[i].get_text(),
                    time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
            ktup = (lis[i].get('title'), lis[i].get('href'))
            if ktup in nd.keys():
                nd[ktup].append(vtup)
            else:
                nd[ktup] = [vtup]
        count = count + 1
        if count % crawl_per_update_to_db == 0:
            update_to_db(nd)
            nd.clear()
        time.sleep(seconds_per_crawl)


def update_to_db(nd):
    db = pymysql.connect(host="localhost", user="root", port=3306,
                         passwd=None, db="crawler", charset="utf8")
    cursor = db.cursor()
    for k in nd.keys():
        # insert the hotspot into the main table
        sql1 = "REPLACE INTO hotnews (information, hyperlink) VALUES ('%s', '%s');"
        # each hotspot gets its own sub-table, named after the topic title
        sql2 = ("CREATE TABLE IF NOT EXISTS `%s` (numberofpeople INT NOT NULL, "
                "occurtime DATETIME, PRIMARY KEY (occurtime));")
        try:
            cursor.execute(sql1 % (k[0], k[1]))
            cursor.execute(sql2 % (k[0],))
            db.commit()
        except:
            db.rollback()
        # insert each sample into the corresponding hotspot sub-table
        for e in nd[k]:
            insert_sql = "INSERT INTO `%s` (numberofpeople, occurtime) VALUES (%d, '%s');"
            try:
                cursor.execute(insert_sql % (k[0], int(e[0]), e[1]))
                db.commit()
            except:
                db.rollback()
    db.close()


if __name__ == '__main__':
    t = DataProducer()
    t.start()
    t.join()
```
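The crawl/flush cadence driven by crawl_per_update_to_db can be sketched independently of the crawler and the database; simulate, fake topic names, and the placeholder timestamp below are illustrative stand-ins, not part of the original code:

```python
# Minimal sketch of the cadence: every crawl adds to the buffer, and every
# crawl_per_update_to_db-th crawl flushes the buffer and clears it.
crawl_per_update_to_db = 3

def simulate(total_crawls):
    nd = {}
    flush_sizes = []  # buffer size at each flush (stand-in for update_to_db)
    count = 0
    for i in range(total_crawls):
        # pretend each crawl discovered one new topic
        nd[("topic-%d" % i, "link-%d" % i)] = [("100", "2024-01-01 00:00:00")]
        count = count + 1
        if count % crawl_per_update_to_db == 0:
            flush_sizes.append(len(nd))
            nd.clear()
    return flush_sizes, len(nd)

flushes, leftover = simulate(7)
# 7 crawls with a flush every 3 -> two flushes of 3 entries, 1 entry still buffered
```

Setting crawl_per_update_to_db to 1, as in the listing above, makes every crawl write straight through to MySQL; a larger value batches writes at the cost of losing buffered samples if the program dies between flushes.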