This post crawls some of the high-starred Python projects on GitHub in order to learn how to use BeautifulSoup and PyMySQL. I always thought the mountain was the water's story and the cloud was the wind's story; you are my story, yet I do not know whether I am yours.
A Python crawler for GitHub
Crawler requirement: crawl high-quality Python-related projects on GitHub. What follows is only a test case, so it does not crawl a large amount of data.
First, a crawler version that implements the basic functionality
This example covers bulk inserts with PyMySQL, parsing HTML data with BeautifulSoup, and fetching pages with GET requests from the requests library. For more on PyMySQL, you can refer to the blog post: Python framework----the use of pymysql.
import requests
import pymysql.cursors
from bs4 import BeautifulSoup


def get_effect_data(data):
    """Parse the search-result HTML and extract one tuple per repository."""
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        # The repository link looks like "/<writer>/<project>", so split it on "/"
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()
        result = (writer_project.split('/')[1], writer_project.split('/')[2],
                  project_language, project_starts, update_desc)
        results.append(result)
    return results


def get_response_data(page):
    """Request one page of GitHub search results for Python projects sorted by stars."""
    request_url = 'https://github.com/search'
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'repositories', 'p': page}
    resp = requests.get(request_url, params=params)
    return resp.text


def insert_datas(data):
    """Bulk-insert the crawled tuples into the project_info table."""
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='root',
                                 db='test',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = ('INSERT INTO project_info '
                   '(project_writer, project_name, project_language, project_starts, update_desc) '
                   'VALUES (%s, %s, %s, %s, %s)')
            cursor.executemany(sql, data)
        connection.commit()
    finally:
        connection.close()


if __name__ == '__main__':
    total_page = 2  # total number of pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
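The script assumes that a project_info table already exists in the test database. The post does not show the table definition, so the column types below are an assumption based only on the INSERT statement above; a minimal setup sketch with pymysql might look like this:

import pymysql

# Assumption: column names come from the INSERT statement; the types are a guess,
# not the author's actual schema.
create_sql = """
CREATE TABLE IF NOT EXISTS project_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_writer   VARCHAR(100),
    project_name     VARCHAR(100),
    project_language VARCHAR(50),
    project_starts   VARCHAR(20),
    update_desc      VARCHAR(100)
) CHARACTER SET utf8mb4
"""

connection = pymysql.connect(host='localhost', user='root', password='root',
                             db='test', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute(create_sql)
    connection.commit()
finally:
    connection.close()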
After you have finished running, you can see the following data in the database:
id | project_writer | project_name | project_language | project_starts | update_desc
---|----------------|--------------|------------------|----------------|------------
11 | tensorflow | tensorflow | C++ | 78.7k | Updated Nov 22, 2017
12 | robbyrussell | oh-my-zsh | Shell | 62.2k | Updated Nov 21, 2017
13 | vinta | awesome-python | Python | 41.4k | Updated Nov 20, 2017
14 | jakubroztocil | httpie | Python | 32.7k | Updated Nov 18, 2017
15 | nvbn | thefuck | Python | 32.2k | Updated Nov 17, 2017
16 | pallets | flask | Python | 31.1k | Updated Nov 15, 2017
17 | django | django | Python | 29.8k | Updated Nov 22, 2017
18 | requests | requests | Python | 28.7k | Updated Nov 21, 2017
19 | blueimp | jquery-file-upload | JavaScript | 27.9k | Updated Nov 20, 2017
20 | ansible | ansible | Python | 26.8k | Updated Nov 22, 2017
21 | justjavac | free-programming-books-zh_CN | JavaScript | 24.7k | Updated Nov 16, 2017
22 | scrapy | scrapy | Python | 14H | Updated Nov 22, 2017
23 | scikit-learn | scikit-learn | Python | 23.1k | Updated Nov 22, 2017
24 | fchollet | keras | Python | 12H | Updated Nov 21, 2017
25 | donnemartin | system-design-primer | Python | 11H | Updated Nov 20, 2017
26 | certbot | certbot | Python | 20.1k | Updated Nov 20, 2017
27 | aymericdamien | TensorFlow-Examples | Jupyter Notebook | 18.1k | Updated Nov 8, 2017
28 | tornadoweb | tornado | Python | 14.6k | Updated Nov 17, 2017
29 | python | cpython | Python | 14.4k | Updated Nov 22, 2017
30 | reddit | reddit | Python | 14.2k | Updated Oct 17, 2017
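To double-check what was written, you can read the rows back with the same DictCursor configuration the crawler uses. This query sketch is only an illustration and is not part of the original script:

import pymysql.cursors

connection = pymysql.connect(host='localhost', user='root', password='root',
                             db='test', charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        cursor.execute('SELECT project_writer, project_name, project_starts FROM project_info')
        for row in cursor.fetchall():
            # Each row is a dict because of DictCursor, e.g. {'project_name': 'tensorflow', ...}
            print(row['project_writer'], row['project_name'], row['project_starts'])
finally:
    connection.close()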