First, installation
1. requests sends the HTTP request and handles the response content; the requests.get() method returns a Response object.
pip install requests
2. BeautifulSoup is flexible, efficient, and convenient for parsing web pages, and it supports several parsers.
pip install beautifulsoup4
3. pymongo is the Python toolkit for working with MongoDB.
pip install pymongo
4. Install MongoDB.
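Before moving on, it can help to confirm that the three packages import cleanly. The following is a minimal sketch that uses only the standard library; the helper name check_installed is made up for this example. Note that beautifulsoup4 is imported under the name bs4.

```python
from importlib.util import find_spec


def check_installed(modules):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # beautifulsoup4 installs the importable package "bs4"
    for name, ok in check_installed(["requests", "bs4", "pymongo"]).items():
        print(name, "OK" if ok else "missing, run pip install")
```

This only checks the Python packages; whether the MongoDB server itself is running has to be checked separately.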
Second, analyze the web page and its source code
1. Determine the goal: first decide which page to crawl and which section of it.
2. Analyze the goal: once the target is chosen, work out the URL format and the meaning of each parameter joined onto it, then read the page source to determine how the data is structured.
3. Write the crawler code and run it.
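For step 2, the listing URL used in this post is made of the domain, a fixed path, and a lang query parameter. Below is a minimal sketch of joining those pieces and taking them apart again with the standard library. The function names build_url and read_params are invented for this example, and nothing is fetched.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

DOMAIN = "https://gitee.com"
PATH = "/explore/starred"


def build_url(language):
    """Join the domain, the path, and the lang query parameter."""
    return DOMAIN + PATH + "?" + urlencode({"lang": language})


def read_params(url):
    """Split a URL back into its path and its query parameters."""
    parts = urlsplit(url)
    return parts.path, parse_qs(parts.query)


if __name__ == "__main__":
    url = build_url("python")
    print(url)
    print(read_params(url))
```

Changing the argument from "python" to "java" is all it takes to target another language's list, which is the point of analyzing the parameters first.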
Third, write the code
# -*- coding: utf-8 -*-
# __author__: "The Junior" (public account: Programmers Grow Together)
# __time__: 2018/8/22 18:51
# __file__: spider_mayun.py

# Import the related libraries
import requests
from bs4 import BeautifulSoup
import pymongo

"""
Analysis of the page URL shows that the lang parameter decides
which language's popular projects are listed.
"""
# language = 'java'
language = 'python'
domain = 'https://gitee.com'
uri = '/explore/starred?lang=%s' % language
url = domain + uri

# User agent
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/67.0.3396.99 Safari/537.36')
# Build the request header (the HTTP header name is User-Agent)
header = {'User-Agent': user_agent}
# Get the page source
html = requests.get(url, headers=header).text
# Build the BeautifulSoup object; name the parser explicitly to avoid a warning
soup = BeautifulSoup(html, 'html.parser')
# The "popular today" and "popular this week" panels are told apart by their data-tab attribute
hot_type = ['today-trending', 'week-trending']
# divs = soup.find_all('div', class_='ui tab active')
# Create the list of popular projects
hot_gitee = []
for i in hot_type:
    # Find the panel whose data-tab attribute matches this type
    divs = soup.find_all('div', attrs={'data-tab': i})
    divs = divs[0].select('div.row')
    for div in divs:
        gitee = {}
        a_content = div.select('div.sixteen > h3 > a')
        div_content = div.select('div.project-desc')
        # Project description (None when the tag has no single text child)
        script = div_content[0].string
        # The title attribute has the form "author/project"
        title = a_content[0]['title']
        arr = title.split('/')
        # Author's name
        author_name = arr[0]
        # Project name
        project_name = arr[1]
        # Project URL
        href = domain + a_content[0]['href']
        # Fetch the project's own page
        child_page = requests.get(href, headers=header).text
        child_soup = BeautifulSoup(child_page, 'html.parser')
        child_div = child_soup.find('div', class_='ui small secondary pointing menu')
        """
        <div class="ui small secondary pointing menu">
            <a class="item active" data-type="http" data-url="https://gitee.com/dlg_center/cms.git">HTTPS</a>
            <a class="item" data-type="ssh" data-url="[email protected]:dlg_center/cms.git">SSH</a>
        </div>
        """
        # find_all is the BeautifulSoup method name (findall does not exist)
        a_arr = child_div.find_all('a')
        # git HTTP download link
        http_url = a_arr[0]['data-url']
        # git SSH download link
        ssh_url = a_arr[1]['data-url']
        gitee['project_name'] = project_name
        gitee['author_name'] = author_name
        gitee['href'] = href
        gitee['script'] = script
        gitee['http_url'] = http_url
        gitee['ssh_url'] = ssh_url
        gitee['hot_type'] = i
        # Add the record to the list
        hot_gitee.append(gitee)
print(hot_gitee)

# MongoDB connection parameters
HOST, PORT, DB, TABLE = '127.0.0.1', 27017, 'Spider', 'Gitee'
# Create the connection
client = pymongo.MongoClient(host=HOST, port=PORT)
# Select the database and the collection
db = client[DB]
tables = db[TABLE]
# Insert the records into MongoDB
tables.insert_many(hot_gitee)
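The script above splits the title attribute on "/" and copies each value into a dict by hand. As a hedged sketch, the same record could be assembled by two small helpers that fail loudly on an unexpected title and tolerate a missing description. The names parse_title and build_record are invented for this example and are not part of the original script; the values are assumed to have been extracted from the page already.

```python
def parse_title(title):
    """Split an "author/project" title into its two parts."""
    author, sep, project = title.partition("/")
    if not sep:
        raise ValueError("expected 'author/project', got %r" % title)
    return author.strip(), project.strip()


def build_record(title, href, description, http_url, ssh_url, hot_type):
    """Assemble one record using the field names the script stores."""
    author_name, project_name = parse_title(title)
    return {
        "project_name": project_name,
        "author_name": author_name,
        "href": href,
        # description may be None when the page has no description text
        "script": description.strip() if description else None,
        "http_url": http_url,
        "ssh_url": ssh_url,
        "hot_type": hot_type,
    }
```

Keeping the record building separate from the page fetching makes it possible to test the field logic without touching the network.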
Fourth, results of running the script
[{'project_name': 'incetops', 'author_name': 'Staugur', 'href': 'https://gitee.com/staugur/IncetOps', 'script': 'Based on inception, an open source system for auditing, execution, rollback, and statistical SQL', 'http_url': 'https://gitee.com/staugur/IncetOps.git', 'ssh_url': '[email protected]:staugur/incetops.git', 'hot_type': 'today-trending'},
 {'project_name': 'cms', 'author_name': 'Dlg_center', 'href': 'https://gitee.com/dlg_center/cms', 'script': None, 'http_url': 'https://gitee.com/dlg_center/cms.git', 'ssh_url': '[email protected]:dlg_center/cms.git', 'hot_type': 'today-trending'},
 {'project_name': 'Websiteaccount', 'author_name': 'Zhang Cong', 'href': 'https://gitee.com/crazy_zhangcong/WebsiteAccount', 'script': 'Various question and answer platform account registration', 'http_url': 'https://gitee.com/crazy_zhangcong/WebsiteAccount.git', 'ssh_url': '[email protected]:crazy_zhangcong/websiteaccount.git', 'hot_type': 'today-trending'},
 {'project_name': 'Chain', 'author_name': 'All', 'href': 'https://gitee.com/hequan2020/chain', 'script': 'Linux Cloud Host management system, including CMDB, WEBSSH login, command execution, asynchronously executes shell/python/yml, and so on. Continued more ...', 'http_url': 'https://gitee.com/hequan2020/chain.git', 'ssh_url': '[email protected]:hequan2020/chain.git', 'hot_type': 'today-trending'},
 {'project_name': 'Lepus', 'author_name': 'Ru memory.', 'href': 'https://gitee.com/ruzuojun/Lepus', 'script': 'Simple, intuitive, powerful open source enterprise database monitoring System, mysql/oracle/mongodb/redis one-stop monitoring, Make database monitoring more simple ...', 'http_url': 'https://gitee.com/ruzuojun/Lepus.git', 'ssh_url': '[email protected]:ruzuojun/Lepus.git', 'hot_type': 'today-trending'},
 {'project_name': 'Autolink', 'author_name': 'Bitter leaves', 'href': 'https://gitee.com/lym51/autolink', 'script': 'Autolink is an open source Web IDE Automation Test Integration solution', 'http_url': 'https://gitee.com/lym51/autolink.git', 'ssh_url': '[email protected]:lym51/autolink.git', 'hot_type': 'week-trending'},
 {'project_name': 'Pornhubbot', 'author_name': 'XIYOUMC', 'href': 'https://gitee.com/xiyouMc/pornhubbot', 'script': "The world's largest adult website pornhub crawler (scrapy, MongoDB) 500w data per day", 'http_url': 'https://gitee.com/xiyouMc/pornhubbot.git', 'ssh_url': '[email protected]:xiyoumc/pornhubbot.git', 'hot_type': 'week-trending'},
 {'project_name': 'Wph_opc', 'author_name': 'Wan Shi', 'href': 'https://gitee.com/wph_it/wph_opc', 'script': None, 'http_url': 'https://gitee.com/wph_it/wph_opc.git', 'ssh_url': '[email protected]:wph_it/wph_opc.git', 'hot_type': 'week-trending'},
 {'project_name': 'Websiteaccount', 'author_name': 'Zhang Cong', 'href': 'https://gitee.com/crazy_zhangcong/WebsiteAccount', 'script': 'Various question and answer platform account registration', 'http_url': 'https://gitee.com/crazy_zhangcong/WebsiteAccount.git', 'ssh_url': '[email protected]:crazy_zhangcong/websiteaccount.git', 'hot_type': 'week-trending'},
 {'project_name': 'information27', 'author_name': 'Indian mother', 'href': 'https://gitee.com/itcastyinqiaoyin/information27', 'script': None, 'http_url': 'https://gitee.com/itcastyinqiaoyin/information27.git', 'ssh_url': '[email protected]:itcastyinqiaoyin/information27.git', 'hot_type': 'week-trending'}]
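The output shows that one project can appear under both today-trending and week-trending. The sketch below is one possible way to summarize such a result list before storing it, using only the standard library. The function names are invented for this example, and the sample records are trimmed to the two fields the functions read.

```python
from collections import defaultdict


def group_by_hot_type(records):
    """Map each hot_type to the list of project hrefs found under it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["hot_type"]].append(record["href"])
    return dict(groups)


def in_both_lists(records):
    """Return the hrefs that appear under more than one hot_type, sorted."""
    seen = defaultdict(set)
    for record in records:
        seen[record["href"]].add(record["hot_type"])
    return sorted(href for href, types in seen.items() if len(types) > 1)


if __name__ == "__main__":
    # Trimmed sample records, keeping only href and hot_type
    sample = [
        {"href": "https://gitee.com/staugur/IncetOps", "hot_type": "today-trending"},
        {"href": "https://gitee.com/crazy_zhangcong/WebsiteAccount", "hot_type": "today-trending"},
        {"href": "https://gitee.com/crazy_zhangcong/WebsiteAccount", "hot_type": "week-trending"},
    ]
    print(group_by_hot_type(sample))
    print(in_both_lists(sample))
```

Whether a project listed twice should be stored once or twice depends on how the collection will be queried; the script in this post stores both records.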
Crawl Gitee's popular open source projects with Python and BeautifulSoup