Multi-threaded Beautiful Soup crawl of all online anchors' information

Source: Internet
Author: User
Tags: mongoclient

Recently I saw a crawler tutorial, and since I often watch live streams on Douyu anyway, I decided to use it for practice. So I wrote a crawler that grabs the information of every online Douyu host: the category, host ID, room title, popularity value, and room address.

The tools needed are bs4, requests, and pymongo under Python 3. The IDE I use is PyCharm; this software really is powerful, and once you get used to it, it feels like nothing works without it. The database is MongoDB, and combined with PyCharm the data can be displayed directly in the right-hand panel.
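Before writing the crawler, a quick sanity check helps. This is a minimal sketch, assuming MongoDB is running locally on the default port 27017; it just confirms that all three libraries import and that the database is reachable:

# minimal environment check (assumes a local MongoDB on the default port)
import requests
import pymongo
from bs4 import BeautifulSoup

client = pymongo.MongoClient('localhost', 27017)
client.admin.command('ping')  # raises an exception if MongoDB is not reachable
print('bs4, requests and pymongo all import; MongoDB is up')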

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import time
import datetime
import json
import pymongo


class douyu_host_info():
    def __init__(self):
        self.date_time = datetime.datetime.now().strftime('%y-%m-%d_%H-%M')
        self.host_url = 'https://www.douyu.com'
        self.list_data = []
        self.urls_list = []
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        }

    def get_url(self):
        # get the page addresses of all categories
        urls = 'https://www.douyu.com/directory'
        data = requests.get(urls)
        soup = BeautifulSoup(data.text, 'lxml')
        links = soup.select('.r-cont.column-cont dl dd ul li a')
        for i in links:
            self.urls_list.append(i.get('href'))
        print(self.urls_list)
        return self.urls_list

    def get_info(self, url):
        # collect the required information, then call the functions that write
        # to the database and to local disk
        time.sleep(1)  # avoid overloading the server: crawl each page at a 1s interval
        url = self.host_url + url
        print('Now start open {}'.format(url))
        get_data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(get_data.text, 'lxml')
        names = soup.select('.ellipsis.fl')
        nums = soup.select('.dy-num.fr')
        titles = soup.select('.mes h3')
        hrefs = soup.select('#live-list-contentbox li a')
        # a URL containing 'directory' is just a category page with no host information
        if 'directory' in url:
            pass
        # exception handling: a few categories use different HTML elements; those are discarded
        try:
            category = soup.select('.listcustomize-topcon-msg h1')[0].get_text()
        except IndexError:
            category = 'Rank name category'
        for name, num, href, title in zip(names, nums, hrefs, titles):
            data = {
                'category': category,
                'host': name.get_text(),
                'title': title.get_text().split('\n')[-1].strip(),
                'links': 'https://www.douyu.com' + href.get('href'),
                # convert the popularity value to a float in units of ten
                # thousand (万), which makes it easy to calculate with later
                'popularity index': float(num.get_text()[:-1]) if '万' in num.get_text() else float(num.get_text()) / 10000,
            }
            if data['popularity index'] > 2:
                print(data)
                self.w_to_local(data)
                self.w_to_db(data)

    def open_data(self, date_time):
        # when needed, pass a specified point in time to read back the locally saved data
        with open('D:\douyu_host{}.csv'.format(date_time), 'r') as r_data:
            for line in r_data:
                print(json.loads(line))

    def w_to_local(self, data):
        # save a copy of the data to local disk (one JSON object per line)
        # while it is being written to the database
        with open('D:\douyu_host{}.csv'.format(self.date_time), 'a') as w_data:
            json.dump(data, w_data)
            w_data.write('\n')

    def w_to_db(self, data):
        # write data to a time-stamped database; data must be a dictionary
        client = pymongo.MongoClient('localhost', 27017)
        walden = client['walden_{}'.format(self.date_time)]
        sheet_tab = walden['sheet_tab']
        if data is not None:
            sheet_tab.insert_one(data)

    def check_from_db(self, date_time):
        # pass a time to query the related popularity information from the database
        client = pymongo.MongoClient('localhost', 27017)
        walden = client['walden_{}'.format(date_time)]
        sheet_tab = walden['sheet_tab']
        for data in sheet_tab.find({'popularity index': {'$gte': 40}}):
            print(data)
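Once a crawl has finished, the saved data can be read back later through check_from_db and open_data. A minimal sketch, assuming the class above is saved as test0822.py (the module name used by the runner script below) and that a crawl was run at the given timestamp, which is hypothetical here:

# read back a previous crawl; the date_time string must match the one
# generated when the data was saved ('%y-%m-%d_%H-%M' format)
from test0822 import douyu_host_info

douyu = douyu_host_info()
douyu.check_from_db('17-08-22_14-30')  # hypothetical timestamp; hosts with popularity index >= 40
douyu.open_data('17-08-22_14-30')      # print the copy saved on local disk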

Originally I did not want to write this with a class, but I later found that passing so much data around became very messy, and the timestamps used when storing the important data were inconsistent, so I wrapped everything in a class. A key point is picking element paths for select(): in the browser's developer tools you can copy an element's selector directly.

What gets copied out is a long path; observe it slowly, find the key fields that actually identify the element, and trim it down to the best selector.
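For example (a sketch with made-up HTML, not Douyu's actual markup), the browser tends to copy a long chained path, while a short class-based selector matches every host on the page:

# made-up HTML: a trimmed class selector beats the long copied path
from bs4 import BeautifulSoup

html = '''
<ul id="live-list-contentbox">
  <li><a href="/room1"><span class="dy-name ellipsis fl">host-A</span></a></li>
  <li><a href="/room2"><span class="dy-name ellipsis fl">host-B</span></a></li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
# the copied path matches only the first item:
print(soup.select('#live-list-contentbox > li:nth-child(1) > a > span'))
# the trimmed class selector matches all of them:
print([s.get_text() for s in soup.select('.ellipsis.fl')])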

Then create a new .py file to invoke the previously written class: instantiate it, run it with a Pool for parallelism, and the results come out.

# -*- coding: utf-8 -*-
from multiprocessing import Pool
from test0822 import douyu_host_info

douyu = douyu_host_info()

if __name__ == '__main__':
    # crawl the data in parallel
    urls_list = douyu.get_url()
    pool = Pool()
    pool.map(douyu.get_info, urls_list)
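A note on naming: multiprocessing.Pool actually spawns processes rather than threads. If you want real threads, which are usually sufficient for I/O-bound crawling like this, multiprocessing.dummy provides a thread-backed Pool with the same interface; a minimal sketch:

# -*- coding: utf-8 -*-
# thread-backed Pool: same map() interface, threads instead of processes
from multiprocessing.dummy import Pool
from test0822 import douyu_host_info

douyu = douyu_host_info()

if __name__ == '__main__':
    urls_list = douyu.get_url()
    pool = Pool(4)  # 4 worker threads; adjust as needed
    pool.map(douyu.get_info, urls_list)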

A panoramic view of the results after a run.
