Crawl Gitee's popular open source projects with Python and BeautifulSoup

Source: Internet
Author: User
Tags: ssh, download

First, installation

1. requests sends the HTTP request and handles the response content; the requests.get() method returns a Response object.

pip install requests
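As a minimal sketch of how the Response object is typically used, the helper below wraps requests.get() and returns the text attribute. The function names build_headers and fetch_html and the 10-second timeout are assumptions made for illustration; they do not appear in the original script.

```python
import requests


def build_headers(user_agent):
    """Return request headers carrying the given User-Agent string."""
    return {"User-Agent": user_agent}


def fetch_html(url, user_agent, timeout=10):
    """Fetch `url` and return the decoded response body.

    requests.get() returns a Response object; raise_for_status() turns
    4xx/5xx status codes into exceptions instead of returning an error page.
    """
    response = requests.get(url, headers=build_headers(user_agent), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://gitee.com/explore/starred?lang=python", "Mozilla/5.0") should return the page source as a string, provided the request succeeds.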

2. BeautifulSoup makes parsing web pages flexible, efficient, and convenient, and it supports several parsers.

pip install beautifulsoup4
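To illustrate the parser choice, here is a small sketch that parses a hand-written HTML fragment with Python's built-in "html.parser". The markup and the parse_row helper are made up for this example and are not taken from a real Gitee page. Other parsers such as "lxml" or "html5lib" can be passed instead if they are installed separately.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; it is not copied from a real page.
SAMPLE_HTML = """
<div class="row">
  <h3><a href="/demo_owner/demo_project" title="demo_owner/demo_project">demo_project</a></h3>
  <div class="project-desc">A demo description</div>
</div>
"""


def parse_row(html, parser="html.parser"):
    """Return (title, href, description) from the first div.row in `html`."""
    soup = BeautifulSoup(html, parser)
    row = soup.select_one("div.row")
    link = row.select_one("h3 > a")
    desc = row.select_one("div.project-desc")
    return link["title"], link["href"], desc.get_text(strip=True)
```

Naming the parser explicitly keeps the result the same across machines, because BeautifulSoup otherwise picks whichever parser happens to be installed.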

3. pymongo is the Python toolkit for working with MongoDB.

pip install pymongo

4. Install MongoDB.
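The storage step can be sketched without a running database. The save_projects helper below works with any object that offers pymongo's insert_many() method. InMemoryCollection is a hypothetical stand-in written only so the sketch runs without a MongoDB server; it is not part of pymongo.

```python
class InMemoryCollection:
    """Hypothetical stand-in that mimics pymongo's Collection.insert_many()."""

    class _Result:
        def __init__(self, inserted_ids):
            self.inserted_ids = inserted_ids

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        start = len(self.documents)
        self.documents.extend(documents)
        return self._Result(list(range(start, start + len(documents))))


def save_projects(collection, projects):
    """Insert project dicts into `collection` and return how many were stored."""
    # Copy each dict because pymongo adds an _id key to the documents in place.
    documents = [dict(project) for project in projects]
    if not documents:
        return 0  # pymongo's insert_many() rejects an empty list
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With a local server running, the same function could be given a real collection such as pymongo.MongoClient("127.0.0.1", 27017)["spider"]["gitee"]. This assumes the host, port, and names used by the script in this article.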

Second, analyze the web page and its source code

1. Determine the goal: decide which page to crawl and which section of it.

2. Analyze the target: once the crawl target is chosen, work out the URL format and the meaning of its query parameters, then inspect the page source to determine how the data is structured.

3. Write the crawler code and run it.
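For step 2, the script in the next section builds its request URL from the domain plus '/explore/starred?lang=%s'. The sketch below does the same with urlencode so that language names containing special characters are escaped. The helper name build_explore_url is hypothetical, and whether Gitee still serves this path is not verified here.

```python
from urllib.parse import urlencode

DOMAIN = "https://gitee.com"
EXPLORE_PATH = "/explore/starred"


def build_explore_url(language):
    """Return the listing URL for `language`, escaping the lang parameter."""
    return f"{DOMAIN}{EXPLORE_PATH}?{urlencode({'lang': language})}"
```

For example, build_explore_url("python") yields https://gitee.com/explore/starred?lang=python.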

Third, write the code

# -*- coding: utf-8 -*-
# __author__: "The Junior", WeChat public account: "Programmers Grow Together"
# __time__: 2018/8/22 18:51
# __file__: spider_mayun.py

# Import the required libraries
import requests
from bs4 import BeautifulSoup
import pymongo

"""
Analysis of the page URL shows that the lang parameter selects
which language's popular projects are listed.
"""
# language = 'java'
language = 'python'
domain = 'https://gitee.com'
uri = '/explore/starred?lang=%s' % language
url = domain + uri

# User agent
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/67.0.3396.99 Safari/537.36')
# Build the request header (the HTTP header name is User-Agent)
header = {'User-Agent': user_agent}
# Get the page source
html = requests.get(url, headers=header).text
# Create the BeautifulSoup object, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')
# Popular categories: trending today and trending this week,
# distinguished by the data-tab attribute
hot_type = ['today-trending', 'week-trending']
# divs = soup.find_all('div', class_='ui tab active')
# List that collects the popular projects
hot_gitee = []
for i in hot_type:
    # Find the block of projects tagged with this category
    divs = soup.find_all('div', attrs={'data-tab': i})
    divs = divs[0].select('div.row')
    for div in divs:
        gitee = {}
        a_content = div.select('div.sixteen > h3 > a')
        div_content = div.select('div.project-desc')
        # Project description (None when the element has no single text node)
        script = div_content[0].string
        # The title attribute has the form "author/project"
        title = a_content[0]['title']
        arr = title.split('/')
        # Author name
        author_name = arr[0]
        # Project name
        project_name = arr[1]
        # Project URL
        href = domain + a_content[0]['href']
        # Open the project's own page
        child_page = requests.get(href, headers=header).text
        child_soup = BeautifulSoup(child_page, 'html.parser')
        child_div = child_soup.find('div', class_='ui small secondary pointing menu')
        """
        <div class="ui small secondary pointing menu">
            <a class="item active" data-type="http" data-url="https://gitee.com/dlg_center/cms.git">HTTPS</a>
            <a class="item" data-type="ssh" data-url="git@gitee.com:dlg_center/cms.git">SSH</a>
        </div>
        """
        # find_all is the BeautifulSoup method; findall does not exist
        a_arr = child_div.find_all('a')
        # Git HTTP download link
        http_url = a_arr[0]['data-url']
        # Git SSH download link
        ssh_url = a_arr[1]['data-url']
        gitee['project_name'] = project_name
        gitee['author_name'] = author_name
        gitee['href'] = href
        gitee['script'] = script
        gitee['http_url'] = http_url
        gitee['ssh_url'] = ssh_url
        gitee['hot_type'] = i
        # Add the record to the list
        hot_gitee.append(gitee)
print(hot_gitee)

# MongoDB connection parameters
HOST, PORT, DB, TABLE = '127.0.0.1', 27017, 'spider', 'gitee'
# Create the connection
client = pymongo.MongoClient(host=HOST, port=PORT)
# Select the database and the collection
db = client[DB]
tables = db[TABLE]
# Insert the records into MongoDB
tables.insert_many(hot_gitee)
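The sub-page step can be tried offline against the sample markup quoted in the script's comment. The sketch below assumes that the obfuscated SSH address in that sample stands for git@gitee.com. It also uses a hypothetical helper, extract_clone_urls, which returns an empty dict instead of failing when the menu is missing.

```python
from bs4 import BeautifulSoup

# Sample menu from the script's comment; the SSH host is an assumed restoration.
MENU_HTML = """
<div class="ui small secondary pointing menu">
  <a class="item active" data-type="http" data-url="https://gitee.com/dlg_center/cms.git">HTTPS</a>
  <a class="item" data-type="ssh" data-url="git@gitee.com:dlg_center/cms.git">SSH</a>
</div>
"""


def extract_clone_urls(html):
    """Map each data-type ('http', 'ssh') to its data-url; {} if the menu is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches the classes regardless of their order in the attribute.
    menu = soup.select_one("div.ui.small.secondary.pointing.menu")
    if menu is None:
        return {}
    links = menu.find_all("a", attrs={"data-type": True, "data-url": True})
    return {a["data-type"]: a["data-url"] for a in links}
```

Looking the links up by data-type rather than by position means the result does not depend on the order of the links in the page.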

Fourth, execution results

[
{'project_name': 'incetops', 'author_name': 'Staugur', 'href': 'https://gitee.com/staugur/IncetOps', 'script': 'Based on inception, an open source system for auditing, execution, rollback, and statistical SQL', 'http_url': 'https://gitee.com/staugur/IncetOps.git', 'ssh_url': 'git@gitee.com:staugur/IncetOps.git', 'hot_type': 'today-trending'},
{'project_name': 'cms', 'author_name': 'Dlg_center', 'href': 'https://gitee.com/dlg_center/cms', 'script': None, 'http_url': 'https://gitee.com/dlg_center/cms.git', 'ssh_url': 'git@gitee.com:dlg_center/cms.git', 'hot_type': 'today-trending'},
{'project_name': 'Websiteaccount', 'author_name': 'Zhang Cong', 'href': 'https://gitee.com/crazy_zhangcong/WebsiteAccount', 'script': 'Various question and answer platform account registration', 'http_url': 'https://gitee.com/crazy_zhangcong/WebsiteAccount.git', 'ssh_url': 'git@gitee.com:crazy_zhangcong/WebsiteAccount.git', 'hot_type': 'today-trending'},
{'project_name': 'Chain', 'author_name': 'All', 'href': 'https://gitee.com/hequan2020/chain', 'script': 'Linux cloud host management system, including CMDB, webssh login, command execution, asynchronous execution of shell/python/yml, and so on. Continued more ...', 'http_url': 'https://gitee.com/hequan2020/chain.git', 'ssh_url': 'git@gitee.com:hequan2020/chain.git', 'hot_type': 'today-trending'},
{'project_name': 'Lepus', 'author_name': 'Ru memory.', 'href': 'https://gitee.com/ruzuojun/Lepus', 'script': 'Simple, intuitive, powerful open source enterprise database monitoring system, mysql/oracle/mongodb/redis one-stop monitoring, make database monitoring more simple ...', 'http_url': 'https://gitee.com/ruzuojun/Lepus.git', 'ssh_url': 'git@gitee.com:ruzuojun/Lepus.git', 'hot_type': 'today-trending'},
{'project_name': 'Autolink', 'author_name': 'Bitter leaves', 'href': 'https://gitee.com/lym51/autolink', 'script': 'Autolink is an open source Web IDE Automation Test Integration solution', 'http_url': 'https://gitee.com/lym51/autolink.git', 'ssh_url': 'git@gitee.com:lym51/autolink.git', 'hot_type': 'week-trending'},
{'project_name': 'Pornhubbot', 'author_name': 'XIYOUMC', 'href': 'https://gitee.com/xiyouMc/pornhubbot', 'script': "The world's largest adult website pornhub crawler (scrapy, MongoDB) 500w data per day", 'http_url': 'https://gitee.com/xiyouMc/pornhubbot.git', 'ssh_url': 'git@gitee.com:xiyouMc/pornhubbot.git', 'hot_type': 'week-trending'},
{'project_name': 'Wph_opc', 'author_name': 'Wan Shi', 'href': 'https://gitee.com/wph_it/wph_opc', 'script': None, 'http_url': 'https://gitee.com/wph_it/wph_opc.git', 'ssh_url': 'git@gitee.com:wph_it/wph_opc.git', 'hot_type': 'week-trending'},
{'project_name': 'Websiteaccount', 'author_name': 'Zhang Cong', 'href': 'https://gitee.com/crazy_zhangcong/WebsiteAccount', 'script': 'Various question and answer platform account registration', 'http_url': 'https://gitee.com/crazy_zhangcong/WebsiteAccount.git', 'ssh_url': 'git@gitee.com:crazy_zhangcong/WebsiteAccount.git', 'hot_type': 'week-trending'},
{'project_name': 'information27', 'author_name': 'Indian mother', 'href': 'https://gitee.com/itcastyinqiaoyin/information27', 'script': None, 'http_url': 'https://gitee.com/itcastyinqiaoyin/information27.git', 'ssh_url': 'git@gitee.com:itcastyinqiaoyin/information27.git', 'hot_type': 'week-trending'}
]

 
