Python Crawler: Python Job Analysis Report

Source: Internet
Author: User

In the first two articles we crawled Qiushibaike and the Meizitu site and learned the basics of requests and Beautiful Soup. However, those two articles only filtered the information we needed out of static HTML pages. In this article we'll learn how to get the results returned by an AJAX request.

Welcome to follow the "Smart Manufacturing Column" for more original content on intelligent manufacturing and programming.

Getting Started with Python Crawlers (II): Crawling Meizitu
Getting Started with Python Crawlers (I): Crawling Qiushibaike

This article uses Lagou.com as an example to show how to fetch the content returned by an Ajax request.

Objectives of this article
    1. Fetch the Ajax request and parse the required fields from the returned JSON
    2. Save the data to Excel
    3. Save the data to MySQL for easier analysis
Simple analysis

Average salary level of Python jobs in five cities

Education requirement distribution of Python jobs

Industry field distribution of Python jobs

Company size distribution of Python employers
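
The charts above can be reproduced from the saved data. As a minimal sketch only (assuming the data has already been written to the MySQL table built later in this article, and using pandas and matplotlib, which the original post does not show), the city salary chart could be drawn roughly like this:

import pandas as pd
import pymysql
import matplotlib.pyplot as plt

# Assumption: the `python` table was already filled by the crawler below.
conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='python', charset='utf8mb4')
df = pd.read_sql('SELECT city, salary FROM python', conn)
conn.close()

# Salary is stored as a range string such as '10k-20k'; take the lower bound as a rough estimate.
df['salary_low'] = df['salary'].str.extract(r'(\d+)', expand=False).astype(float)
df.groupby('city')['salary_low'].mean().plot(kind='bar',
                                             title='Average salary (lower bound, k RMB) by city')
plt.tight_layout()
plt.show()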

View the page structure

We enter Python as the search keyword, leave the other conditions at their defaults, and click search to list all the Python positions. Then we open the browser console and switch to the Network tab, where we can see the following request:

Judging from the response, this request is exactly what we need, so we can simply request this URL directly. As you can see from the result below, it contains the information for each position.

Now we know where to request the data and where the results come from. But the result list only contains the 15 records of the first page. How do we get the data for the other pages?
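
For reference, the JSON returned by this endpoint has roughly the shape below. Only the fields used by the parsing code later in the article are shown, and the sample values are purely illustrative.

# Abbreviated shape of the Ajax response (illustrative values, unused fields omitted)
{
    "content": {
        "positionResult": {
            "result": [
                {
                    "companyShortName": "...",
                    "companyFullName": "...",
                    "industryField": "...",
                    "companySize": "...",
                    "salary": "10k-20k",
                    "city": "北京",
                    "education": "本科"
                }
                # ... 15 positions per page
            ]
        }
    }
}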

Parsing the request parameters

We click on the Parameters tab, as follows:

We find that three pieces of form data are submitted: kd is clearly our search keyword, and pn is the current page number. The first field can be left at its default value, so we don't need to worry about it. All that remains is to construct requests that download the data for all 30 pages.
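
As a small illustration (a sketch only; the actual request is built in the next section), the form data for a given page looks like this:

# The three form fields; only pn changes from page to page
data = {
    'first': 'false',  # can be left at its default, as noted above
    'pn': page,        # current page number, 1 through 30
    'kd': 'python',    # search keyword
}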

Construct the request and parse the data

Constructing the request is simple; we again use the requests library. First we build the form data data = {'first': 'true', 'pn': page, 'kd': lang_name}, then POST it to the URL with requests and parse the returned JSON, and we are done. Because Lagou's anti-crawler restrictions are fairly strict, we need to include every header field the browser sends and keep a fairly long interval between requests. I set it to 10-20 seconds, after which the data can be fetched normally.

import requests


def get_json(url, page, lang_name):
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', 'none'))
        info.append(i.get('companyFullName', 'none'))
        info.append(i.get('industryField', 'none'))
        info.append(i.get('companySize', 'none'))
        info.append(i.get('salary', 'none'))
        info.append(i.get('city', 'none'))
        info.append(i.get('education', 'none'))
        info_list.append(info)
    return info_list
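
A quick way to sanity-check get_json (a hypothetical run; it assumes the endpoint still responds as described above and that the request is not being throttled):

url = 'https://www.lagou.com/jobs/positionAjax.json?city=北京&needAddtionalResult=false'
rows = get_json(url, 1, 'python')  # first page of Python positions in Beijing
print(len(rows))                   # 15 records per page
print(rows[0])                     # [shortName, fullName, industryField, companySize, salary, city, education]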
Get all data

Now that we know how to parse the data, all that remains is to request every page in turn, so we write a main function that requests all 30 pages for each city.

def main():
    lang_name = 'python'
    wb = Workbook()
    conn = get_conn()
    for i in ['北京', '上海', '广州', '深圳', '杭州']:
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))  # wait 10-20 seconds between requests
            for row in info:
                insert(conn, tuple(row))
                ws1.append(row)
    conn.close()
    wb.save('{}职位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
Full code
import random
import time

import requests
from openpyxl import Workbook
import pymysql.cursors


def get_conn():
    '''Establish a database connection'''
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='root',
                           db='python',
                           charset='utf8mb4',
                           cursorclass=pymysql.cursors.DictCursor)
    return conn


def insert(conn, info):
    '''Write one record to the database'''
    with conn.cursor() as cursor:
        sql = ("INSERT INTO `python` (`shortname`, `fullname`, `industryfield`, "
               "`companysize`, `salary`, `city`, `education`) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        cursor.execute(sql, info)
    conn.commit()


def get_json(url, page, lang_name):
    '''Return the list of position info on the current page'''
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', 'none'))  # company short name
        info.append(i.get('companyFullName', 'none'))   # company full name
        info.append(i.get('industryField', 'none'))     # industry field
        info.append(i.get('companySize', 'none'))       # company size
        info.append(i.get('salary', 'none'))            # salary
        info.append(i.get('city', 'none'))              # city
        info.append(i.get('education', 'none'))         # education requirement
        info_list.append(info)
    return info_list  # return the list for this page


def main():
    lang_name = 'python'
    wb = Workbook()    # open an Excel workbook
    conn = get_conn()  # establish the database connection; comment this line out if not saving to the database
    for i in ['北京', '上海', '广州', '深圳', '杭州']:  # five cities
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:  # 30 pages per city
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))  # insert into the database; comment this line out if not saving
                ws1.append(row)
    conn.close()  # close the database connection; comment this line out if not saving to the database
    wb.save('{}职位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
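
The insert() function assumes a table named python already exists in the database. The article does not show its schema, but a minimal setup sketch can be inferred from the column names in the INSERT statement (the column types and lengths below are assumptions):

# One-off setup sketch: create the table that insert() writes to.
# Column names come from the INSERT statement above; types/lengths are assumptions.
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='python', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS `python` (
            `shortname`     VARCHAR(64),
            `fullname`      VARCHAR(255),
            `industryfield` VARCHAR(255),
            `companysize`   VARCHAR(64),
            `salary`        VARCHAR(64),
            `city`          VARCHAR(64),
            `education`     VARCHAR(64)
        ) DEFAULT CHARSET=utf8mb4
    """)
conn.commit()
conn.close()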

GitHub Address: https://github.com/injetlee/Python/tree/master/%E7%88%AC%E8%99%AB%E9%9B%86%E5%90%88

If you would like the position data collected by the crawler, follow the "Smart Manufacturing Column" and send the message "Python post" in the background.
