In the first two articles we crawled Qiushibaike and the Meizitu site and learned the basics of requests and Beautiful Soup. However, both of those articles extracted the information we needed from static HTML pages. In this article we'll learn how to get the results returned by an AJAX request.
Welcome to follow the "Smart Manufacturing Column" for more original content on intelligent manufacturing and programming.
Getting Started with Python Crawlers (2): Crawling the Meizitu site
Getting Started with Python Crawlers (1): Crawling Qiushibaike
This article takes Lagou (lagou.com) as an example to show how to fetch the content returned by an AJAX request.
Objectives of this article
- Capture the AJAX request and parse the required fields from the returned JSON
- Save the data to Excel
- Save the data to MySQL for easy analysis
Simple analysis
- Average salary level of Python jobs in five cities
- Education distribution of Python job requirements
- Industry distribution of Python positions
- Company size distribution of Python positions
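These statistics are not computed in this article's code; as a rough sketch of how they could be produced once the data has been collected (assuming the Excel file written by the crawler further down and the column order it uses), one might use pandas:

```python
# Sketch only: analyse the Excel file written by the crawler below.
# The column order is the one the crawler uses when appending rows.
import pandas as pd

columns = ['shortname', 'fullname', 'industry', 'companysize',
           'salary', 'city', 'education']
df = pd.read_excel('python职位信息.xlsx', header=None, names=columns)

print(df['education'].value_counts())    # education distribution
print(df['companysize'].value_counts())  # company size distribution
print(df['industry'].value_counts())     # industry distribution
```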
View the page structure
We enter "Python" as the query keyword, leave the other conditions at their defaults, and click search to see all the Python positions. Then we open the browser console and click the Network tab, where we can see the following request:
Judging by the response, this request is exactly what we need; later we simply request this address directly. As you can see, the result contains the information for each job.
Now we know where to request the data and where the results come from. But the result list only contains the 15 records of the first page; how do we get the data for the other pages?
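Before writing the crawler, it helps to confirm where the records sit inside the returned JSON. The sketch below assumes the endpoint and form fields visible in the console (they also appear in the code later in this article); note that without the full browser headers added later, Lagou may reject the request:

```python
import requests

# Probe the AJAX endpoint seen in the Network tab (keyword 'python', page 1).
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data = {'first': 'true', 'pn': 1, 'kd': 'python'}

resp = requests.post(url, data=data)
result = resp.json()

# The job records live under content -> positionResult -> result.
position_result = result['content']['positionResult']
print(position_result.keys())
print(len(position_result['result']))   # 15 -- a single request returns only one page
```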
Parsing the request parameters
We click on the Parameters tab, as follows:
We find that three form fields are submitted: kd is clearly our search keyword and pn is the current page number, while first can be left at its default and ignored. What remains is to construct requests that download all 30 pages of data, as sketched below.
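Concretely, paging is controlled by pn alone, so the form payloads we will submit look roughly like this (field names as shown in the Parameters tab):

```python
# Sketch of the form data submitted for each page; only `pn` changes.
lang_name = 'python'
for page in range(1, 31):
    data = {
        'first': 'true' if page == 1 else 'false',  # the default is fine
        'pn': page,        # current page number
        'kd': lang_name,   # search keyword
    }
    # POST `data` to the positionAjax.json endpoint -- see the next section
```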
Construct the request and parse the data
Constructing the request is simple; we again use the requests library. First we build the form data: data = {'first': 'true', 'pn': page, 'kd': lang_name},
then POST it to the URL with requests and parse the returned JSON, and we are done. Because Lagou's anti-crawler restrictions are fairly strict, we need to include all of the browser's header fields and use a longer interval between requests; I set it to 10-20 seconds, after which the data can be fetched normally.
```python
import requests


def get_json(url, page, lang_name):
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', 'none'))
        info.append(i.get('companyFullName', 'none'))
        info.append(i.get('industryField', 'none'))
        info.append(i.get('companySize', 'none'))
        info.append(i.get('salary', 'none'))
        info.append(i.get('city', 'none'))
        info.append(i.get('education', 'none'))
        info_list.append(info)
    return info_list
```
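If you want to sanity-check the function on its own, a quick call might look like the following sketch; the URL format and the column order are the ones used later in this article, and success depends on Lagou accepting the request:

```python
# Quick test of get_json(); the URL follows the format used later in the article.
url = 'https://www.lagou.com/jobs/positionAjax.json?city=北京&needAddtionalResult=false'
jobs = get_json(url, 1, 'python')   # page 1 of Python jobs in Beijing

for job in jobs:
    # [company short name, full name, industry, company size, salary, city, education]
    print(job)
```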
Get all data
Now that we know how to parse the data, all that's left is to request every page in turn, so we write a main function that requests all 30 pages of data for each city.
```python
import random
import time


def main():
    lang_name = 'python'
    wb = Workbook()
    conn = get_conn()
    for i in ['北京', '上海', '广州', '深圳', '杭州']:
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))
                ws1.append(row)
    conn.close()
    wb.save('{}职位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
```
Full code
```python
import random
import time

import requests
from openpyxl import Workbook
import pymysql.cursors


def get_conn():
    '''Establish the database connection'''
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='root',
                           db='python',
                           charset='utf8mb4',
                           cursorclass=pymysql.cursors.DictCursor)
    return conn


def insert(conn, info):
    '''Write one row of data to the database'''
    with conn.cursor() as cursor:
        sql = ("INSERT INTO `python` (`shortname`, `fullname`, `industryfield`, `companysize`, "
               "`salary`, `city`, `education`) VALUES (%s, %s, %s, %s, %s, %s, %s)")
        cursor.execute(sql, info)
    conn.commit()


def get_json(url, page, lang_name):
    '''Return the list of job information on the current page'''
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', 'none'))  # company name
        info.append(i.get('companyFullName', 'none'))
        info.append(i.get('industryField', 'none'))     # industry field
        info.append(i.get('companySize', 'none'))       # company size
        info.append(i.get('salary', 'none'))            # salary
        info.append(i.get('city', 'none'))
        info.append(i.get('education', 'none'))         # education
        info_list.append(info)
    return info_list  # return the list


def main():
    lang_name = 'python'
    wb = Workbook()    # open an Excel workbook
    conn = get_conn()  # establish the database connection; comment this line out if not saving to the database
    for i in ['北京', '上海', '广州', '深圳', '杭州']:  # five cities
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:  # 30 pages per city
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))  # insert into the database; comment this line out if you don't want that
                ws1.append(row)
    conn.close()  # close the database connection; comment this line out if not saving to the database
    wb.save('{}职位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
```
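One thing the article does not show is the schema of the `python` table that insert() writes into. A possible layout is sketched below; the column names come from the INSERT statement above, while the types and lengths are assumptions:

```python
# Hypothetical schema for the `python` table used by insert(); only the column
# names come from the article's SQL, the types and lengths are assumptions.
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='python', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS `python` (
            `shortname`     VARCHAR(64),
            `fullname`      VARCHAR(128),
            `industryfield` VARCHAR(64),
            `companysize`   VARCHAR(32),
            `salary`        VARCHAR(32),
            `city`          VARCHAR(32),
            `education`     VARCHAR(32)
        ) DEFAULT CHARSET=utf8mb4
    """)
conn.commit()
conn.close()
```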
GitHub Address: https://github.com/injetlee/Python/tree/master/%E7%88%AC%E8%99%AB%E9%9B%86%E5%90%88
If you want the job data collected by the crawler, follow the "Smart Manufacturing Column" and send the message "Python post" to the account.
Python Crawler: Python Job Analysis Report