Python crawler---->Python projects on GitHub

This post crawls some of the high-starred Python projects on GitHub as a way to learn the use of BeautifulSoup and pymysql. I always thought the mountain was the water's story and the cloud was the wind's story; you are my story, yet I do not know whether I am yours.

A Python crawler for GitHub

Crawler requirement: crawl high-quality Python-related projects on GitHub. What follows is a test case, so it does not crawl much data.

First version: implementing the basic function

This example covers batch inserts with pymysql, parsing HTML with BeautifulSoup, and sending GET requests with the requests library. For more on pymysql, see the blog post: Python framework---->the use of pymysql

import requests
import pymysql.cursors
from bs4 import BeautifulSoup


def get_effect_data(data):
    """Parse the search-result HTML and collect one tuple per repository."""
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()
        # The href looks like "/owner/repo", so splitting on "/" yields owner and name
        result = (writer_project.split('/')[1], writer_project.split('/')[2],
                  project_language, project_starts, update_desc)
        results.append(result)
    return results


def get_response_data(page):
    """Fetch one page of GitHub search results, sorted by stars descending."""
    request_url = 'https://github.com/search'
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'repositories', 'p': page}
    resp = requests.get(request_url, params=params)
    return resp.text


def insert_datas(data):
    """Bulk-insert the collected tuples with a single executemany call."""
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='root',
                                 db='test',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = ('INSERT INTO project_info (project_writer, project_name, '
                   'project_language, project_starts, update_desc) '
                   'VALUES (%s, %s, %s, %s, %s)')
            cursor.executemany(sql, data)
        connection.commit()
    finally:
        connection.close()  # close the connection whether or not the insert succeeded


if __name__ == '__main__':
    total_page = 2  # total number of result pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
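
The INSERT above assumes a project_info table already exists in the test database; the script never creates it. Below is a minimal sketch of a matching schema. The column names follow the INSERT statement, but the types and lengths are my assumptions, not something stated in the original post:

import pymysql

# Minimal sketch of the table the crawler writes to. Column names match
# the INSERT statement above; the types and lengths are assumptions.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS project_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_writer VARCHAR(100),
    project_name VARCHAR(100),
    project_language VARCHAR(50),
    project_starts VARCHAR(20),
    update_desc VARCHAR(100)
) CHARACTER SET utf8mb4
"""

connection = pymysql.connect(host='localhost', user='root',
                             password='root', db='test', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute(CREATE_SQL)
    connection.commit()
finally:
    connection.close()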

After the run finishes, you can see the following data in the database:

id  project_writer  project_name                  project_language  project_starts  update_desc
11  tensorflow      tensorflow                    C++               78.7k           Updated Nov 22, 2017
12  robbyrussell    oh-my-zsh                     Shell             62.2k           Updated Nov 21, 2017
13  vinta           awesome-python                Python            41.4k           Updated Nov 20, 2017
14  jakubroztocil   httpie                        Python            32.7k           Updated Nov 18, 2017
15  nvbn            thefuck                       Python            32.2k           Updated Nov 17, 2017
16  pallets         flask                         Python            31.1k           Updated Nov 15, 2017
17  django          django                        Python            29.8k           Updated Nov 22, 2017
18  requests        requests                      Python            28.7k           Updated Nov 21, 2017
19  blueimp         jQuery-File-Upload            JavaScript        27.9k           Updated Nov 20, 2017
20  ansible         ansible                       Python            26.8k           Updated Nov 22, 2017
21  justjavac       free-programming-books-zh_CN  JavaScript        24.7k           Updated Nov 16, 2017
22  scrapy          scrapy                        Python            14H             Updated Nov 22, 2017
23  scikit-learn    scikit-learn                  Python            23.1k           Updated Nov 22, 2017
24  fchollet        keras                         Python            12H             Updated Nov 21, 2017
25  donnemartin     system-design-primer          Python            11H             Updated Nov 20, 2017
26  certbot         certbot                       Python            20.1k           Updated Nov 20, 2017
27  aymericdamien   TensorFlow-Examples           Jupyter Notebook  18.1k           Updated Nov 8, 2017
28  tornadoweb      tornado                       Python            14.6k           Updated Nov 17, 2017
29  python          cpython                       Python            14.4k           Updated Nov 22, 2017
30  reddit          reddit                        Python            14.2k           Updated Oct 17, 2017
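
One thing to note about the data above: the star counts are stored as display strings such as "78.7k", so they sort alphabetically rather than numerically in SQL. If you want to order projects by stars, a small converter helps. This helper (parse_star_count) is my own illustration, not part of the original crawler, and it assumes the only abbreviation GitHub uses is the "k" suffix seen above:

def parse_star_count(text):
    # Convert GitHub's abbreviated star display ("78.7k") to an int.
    # Assumes only the "k" suffix appears, as in the rows above.
    text = text.strip().lower()
    if text.endswith('k'):
        return int(float(text[:-1]) * 1000)
    return int(float(text))

print(parse_star_count('78.7k'))  # 78700
print(parse_star_count('892'))    # 892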
