I recently came across a crawler tutorial, and since I often watch live streams on Douyu anyway, I decided to use it for practice. So I wrote a crawler that grabs the information of every online host on Douyu: the category, host ID, room title, popularity value, and room address.
The tools needed are Python 3 with bs4, requests, and pymongo. The IDE I use is PyCharm, which really is powerful; I feel a bit lost without it. The database is MongoDB, and PyCharm's database tool can display the data directly in the right-hand panel.
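Before running the crawler it can help to confirm the environment is wired up. Below is a minimal sketch (not part of the original code) that assumes MongoDB is already running locally on the default port 27017 and that the lxml parser is installed:

# -*- coding: utf-8 -*-
# Minimal environment check (illustration only): verify the site is reachable,
# the lxml parser works, and a local MongoDB accepts connections.
import requests
import pymongo
from bs4 import BeautifulSoup

resp = requests.get('https://www.douyu.com/directory')
print(resp.status_code, BeautifulSoup(resp.text, 'lxml').title)

client = pymongo.MongoClient('localhost', 27017)
print(client.server_info()['version'])   # prints the MongoDB server version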
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import time
import datetime
import json
import pymongo


class douyu_host_info():
    def __init__(self):
        self.date_time = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M')
        self.host_url = 'https://www.douyu.com'
        self.list_data = []
        self.urls_list = []
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        }

    def get_url(self):
        # Get the page addresses of all categories from the directory page
        urls = 'https://www.douyu.com/directory'
        data = requests.get(urls)
        soup = BeautifulSoup(data.text, 'lxml')
        items = soup.select('.r-cont.column-cont dl dd ul li a')
        for i in items:
            self.urls_list.append(i.get('href'))
        print(self.urls_list)
        return self.urls_list

    def get_info(self, url):
        # Scrape the required information from one category page, then call the
        # functions that write it to the database and to local disk
        time.sleep(1)  # 1s interval per page to avoid putting too much pressure on the server
        url = self.host_url + url
        print('Now start to open {}'.format(url))
        get_data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(get_data.text, 'lxml')
        names = soup.select('.ellipsis.fl')
        nums = soup.select('.dy-num.fr')
        titles = soup.select('.mes h3')
        hrefs = soup.select('#live-list-contentbox li a')
        # A page whose URL contains 'directory' is just a category listing with no host information
        if 'directory' in url:
            pass
        # Exception handling: a few categories use different HTML elements; fall back to a default name
        try:
            category = soup.select('.listcustomize-topcon-msg h1')[0].get_text()
        except IndexError:
            category = 'Rank name category'
        for name, num, href, title in zip(names, nums, hrefs, titles):
            data = {
                'category': category,
                'host': name.get_text(),
                'title': title.get_text().split('\n')[-1].strip(),
                'links': 'https://www.douyu.com' + href.get('href'),
                # Convert the popularity value to a float in units of 10,000 ('万'),
                # which makes it easier to calculate with later
                'popularity_index': float(num.get_text()[:-1]) if '万' in num.get_text()
                                    else float(num.get_text()) / 10000,
            }
            if data['popularity_index'] > 2:
                print(data)
                self.w_to_local(data)
                self.w_to_db(data)

    def open_data(self, date_time):
        # When needed, pass in a specific point in time to read back the locally saved data
        with open(r'D:\douyu_host{}.csv'.format(date_time), 'r') as r_data:
            for line in r_data:
                print(json.loads(line))

    def w_to_local(self, data):
        # Save a copy of the data to local disk (one JSON object per line)
        # while it is being written to the database
        with open(r'D:\douyu_host{}.csv'.format(self.date_time), 'a') as w_data:
            json.dump(data, w_data)
            w_data.write('\n')

    def w_to_db(self, data):
        # Write the data (a dictionary) into a database named after the crawl time
        client = pymongo.MongoClient('localhost', 27017)
        walden = client['walden_{}'.format(self.date_time)]
        sheet_tab = walden['sheet_tab']
        if data is not None:
            sheet_tab.insert_one(data)

    def check_from_db(self, date_time):
        # Pass in a time to query the popular hosts of that crawl from the database
        client = pymongo.MongoClient('localhost', 27017)
        walden = client['walden_{}'.format(date_time)]
        sheet_tab = walden['sheet_tab']
        for data in sheet_tab.find({'popularity_index': {'$gte': 40}}):
            print(data)
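Once a crawl has finished, the saved data can be read back either from MongoDB or from the local file. A small usage sketch (the module name test0822 comes from the import used further below; the date string is only an example value and must match the format produced by strftime('%Y-%m-%d_%H-%M')):

# -*- coding: utf-8 -*-
# Usage sketch: read back one crawl's results; the date string is an example value.
from test0822 import douyu_host_info

douyu = douyu_host_info()
douyu.check_from_db('2017-08-22_20-00')   # query hosts with popularity_index >= 40 from MongoDB
douyu.open_data('2017-08-22_20-00')       # print the copy that was saved to local disk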
Originally I did not want to write this as a class, but I later found that passing so much data around got very messy, and the timestamp used when storing the important data was not consistent, so I wrapped everything in a class. The key point inside is picking element paths for BeautifulSoup's select() method.
The selector path copied out of the browser's developer tools is very long; look through it carefully and keep only the key fields that identify the element best.
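As an illustration (using a made-up HTML fragment shaped roughly like a Douyu room card, not the real page source), the shortened class-based selectors used above behave like this:

# -*- coding: utf-8 -*-
# Illustration only: a made-up HTML fragment, not the real Douyu markup.
from bs4 import BeautifulSoup

html = '''
<ul id="live-list-contentbox">
  <li>
    <a href="/room1">
      <div class="mes"><h3>Room title</h3></div>
      <span class="ellipsis fl">host_name</span>
      <span class="dy-num fr">3.5万</span>
    </a>
  </li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
# The path copied from dev tools could be something like
# 'body > div > ul#live-list-contentbox > li > a > span.ellipsis.fl',
# but the short class selectors are enough and more robust:
print(soup.select('.ellipsis.fl')[0].get_text())   # host_name
print(soup.select('.dy-num.fr')[0].get_text())     # 3.5万
print(soup.select('.mes h3')[0].get_text())        # Room title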
Then create a new .py file to invoke the class written above: instantiate it and run the crawl with a multiprocessing Pool, and the results come out.
# -*- coding: utf-8 -*-
from multiprocessing import Pool
from test0822 import douyu_host_info

douyu = douyu_host_info()

if __name__ == '__main__':
    # Crawl the category pages in parallel with a process pool
    urls_list = douyu.get_url()
    pool = Pool()
    pool.map(douyu.get_info, urls_list)
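One note on this runner: multiprocessing.Pool starts worker processes rather than threads, and by default it starts one per CPU core. If you want to limit how hard the site is hit, a variation (not in the original) is to cap the pool size explicitly:

if __name__ == '__main__':
    urls_list = douyu.get_url()
    pool = Pool(processes=4)   # 4 is an arbitrary example; the default is the CPU core count
    pool.map(douyu.get_info, urls_list)
    pool.close()
    pool.join()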
Here is an overall view of the data after one run.
Beautiful Soup crawler with a process pool: grabbing all online anchors' information