Python Crawler: Python Job Analysis Report

Source: Internet
Author: User

In the first two articles we crawled Qiushibaike and the Meizitu site and learned the basics of requests and Beautiful Soup. However, those two articles only filtered the information we needed out of static HTML pages. In this article we'll learn how to get the results returned by an AJAX request.

Welcome to follow the "Smart Manufacturing Column" for more original content on intelligent manufacturing and programming.

Getting Started with Python Crawlers (II): Crawling Meizitu
Getting Started with Python Crawlers (I): Crawling Qiushibaike

This article uses Lagou.com as an example to show how to fetch the content returned by an Ajax request.

Objectives of this article
    1. Fetch the Ajax request and parse the required fields from the returned JSON
    2. Save the data to Excel
    3. Save the data to MySQL for easier analysis
Simple analysis

Average salary level of Python jobs in five cities

Education requirement distribution of Python jobs

Industry field distribution of Python jobs

Company size distribution of Python employers
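
The charts above can be reproduced from the saved data. As a minimal sketch only (assuming the data has already been written to the MySQL table built later in this article, and using pandas and matplotlib, which the original post does not show), the city salary chart could be drawn roughly like this:

import pandas as pd
import pymysql
import matplotlib.pyplot as plt

# Assumption: the `python` table was already filled by the crawler below.
conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='python', charset='utf8mb4')
df = pd.read_sql('SELECT city, salary FROM python', conn)
conn.close()

# Salary is stored as a range string such as '10k-20k'; take the lower bound as a rough estimate.
df['salary_low'] = df['salary'].str.extract(r'(\d+)', expand=False).astype(float)
df.groupby('city')['salary_low'].mean().plot(kind='bar',
                                             title='Average salary (lower bound, k RMB) by city')
plt.tight_layout()
plt.show()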

View the page structure

We enter Python as the search keyword, leave the other conditions at their defaults, and click search to list all the Python positions. Then we open the browser console and switch to the Network tab, where we can see the following request:

Judging from the response, this request is exactly what we need, so we can simply request this URL directly. As you can see from the result below, it contains the information for each position.

Now we know where to request the data and where the results come from. But the result list only contains the 15 records of the first page. How do we get the data for the other pages?
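
For reference, the JSON returned by this endpoint has roughly the shape below. Only the fields used by the parsing code later in the article are shown, and the sample values are purely illustrative.

# Abbreviated shape of the Ajax response (illustrative values, unused fields omitted)
{
    "content": {
        "positionResult": {
            "result": [
                {
                    "companyShortName": "...",
                    "companyFullName": "...",
                    "industryField": "...",
                    "companySize": "...",
                    "salary": "10k-20k",
                    "city": "北京",
                    "education": "本科"
                }
                # ... 15 positions per page
            ]
        }
    }
}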

Parsing the request parameters

We click on the Parameters tab, as follows:

We find that three pieces of form data are submitted: kd is clearly our search keyword, and pn is the current page number. The first field can be left at its default value, so we don't need to worry about it. All that remains is to construct requests that download the data for all 30 pages.
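
As a small illustration (a sketch only; the actual request is built in the next section), the form data for a given page looks like this:

# The three form fields; only pn changes from page to page
data = {
    'first': 'false',  # can be left at its default, as noted above
    'pn': page,        # current page number, 1 through 30
    'kd': 'python',    # search keyword
}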

Construct the request and parse the data

Constructing the request is simple; we again use the requests library. First we build the form data data = {'first': 'true', 'pn': page, 'kd': lang_name}, then POST it to the URL with requests and parse the returned JSON, and we are done. Because Lagou's anti-crawler restrictions are fairly strict, we need to include every header field the browser sends and keep a fairly long interval between requests. I set it to 10-20 seconds, after which the data can be fetched normally.

import requests


def get_json(url, page, lang_name):
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', 'none'))
        info.append(i.get('companyFullName', 'none'))
        info.append(i.get('industryField', 'none'))
        info.append(i.get('companySize', 'none'))
        info.append(i.get('salary', 'none'))
        info.append(i.get('city', 'none'))
        info.append(i.get('education', 'none'))
        info_list.append(info)
    return info_list
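
A quick way to sanity-check get_json (a hypothetical run; it assumes the endpoint still responds as described above and that the request is not being throttled):

url = 'https://www.lagou.com/jobs/positionAjax.json?city=北京&needAddtionalResult=false'
rows = get_json(url, 1, 'python')  # first page of Python positions in Beijing
print(len(rows))                   # 15 records per page
print(rows[0])                     # [shortName, fullName, industryField, companySize, salary, city, education]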
Get all data

Now that we know how to parse the data, all that remains is to request every page in turn, so we write a main function that requests all 30 pages for each city.

def main():
    lang_name = 'python'
    wb = Workbook()
    conn = get_conn()
    for i in ['北京', '上海', '广州', '深圳', '杭州']:
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))  # wait 10-20 seconds between requests
            for row in info:
                insert(conn, tuple(row))
                ws1.append(row)
    conn.close()
    wb.save('{}职位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
Full code
import random
import time

import requests
from openpyxl import Workbook
import pymysql.cursors


def get_conn():
    '''Establish a database connection'''
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='root',
                           db='python',
                           charset='utf8mb4',
                           cursorclass=pymysql.cursors.DictCursor)
    return conn


def insert(conn, info):
    '''Write one record to the database'''
    with conn.cursor() as cursor:
        sql = ("INSERT INTO `python` (`shortname`, `fullname`, `industryfield`, "
               "`companysize`, `salary`, `city`, `education`) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        cursor.execute(sql, info)
    conn.commit()


def get_json(url, page, lang_name):
    '''Return the list of position info on the current page'''
    headers = {
        'Host': 'www.lagou.com',
        'Connection': 'keep-alive',
        'Origin': 'https://www.lagou.com',
        'X-Anit-Forge-Code': '0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': 'None',
        'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    data = {'first': 'false', 'pn': page, 'kd': lang_name}
    json = requests.post(url, data, headers=headers).json()
    list_con = json['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i.get('companyShortName', 'none'))  # company short name
        info.append(i.get('companyFullName', 'none'))   # company full name
        info.append(i.get('industryField', 'none'))     # industry field
        info.append(i.get('companySize', 'none'))       # company size
        info.append(i.get('salary', 'none'))            # salary
        info.append(i.get('city', 'none'))              # city
        info.append(i.get('education', 'none'))         # education requirement
        info_list.append(info)
    return info_list  # return the list for this page


def main():
    lang_name = 'python'
    wb = Workbook()    # open an Excel workbook
    conn = get_conn()  # establish the database connection; comment this line out if not saving to the database
    for i in ['北京', '上海', '广州', '深圳', '杭州']:  # five cities
        page = 1
        ws1 = wb.active
        ws1.title = lang_name
        url = 'https://www.lagou.com/jobs/positionAjax.json?city={}&needAddtionalResult=false'.format(i)
        while page < 31:  # 30 pages per city
            info = get_json(url, page, lang_name)
            page += 1
            time.sleep(random.randint(10, 20))
            for row in info:
                insert(conn, tuple(row))  # insert into the database; comment this line out if not saving
                ws1.append(row)
    conn.close()  # close the database connection; comment this line out if not saving to the database
    wb.save('{}职位信息.xlsx'.format(lang_name))


if __name__ == '__main__':
    main()
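
The insert() function assumes a table named python already exists in the database. The article does not show its schema, but a minimal setup sketch can be inferred from the column names in the INSERT statement (the column types and lengths below are assumptions):

# One-off setup sketch: create the table that insert() writes to.
# Column names come from the INSERT statement above; types/lengths are assumptions.
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='root',
                       db='python', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS `python` (
            `shortname`     VARCHAR(64),
            `fullname`      VARCHAR(255),
            `industryfield` VARCHAR(255),
            `companysize`   VARCHAR(64),
            `salary`        VARCHAR(64),
            `city`          VARCHAR(64),
            `education`     VARCHAR(64)
        ) DEFAULT CHARSET=utf8mb4
    """)
conn.commit()
conn.close()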

GitHub Address: https://github.com/injetlee/Python/tree/master/%E7%88%AC%E8%99%AB%E9%9B%86%E5%90%88

If you would like the position data collected by the crawler, follow the "Smart Manufacturing Column" and send the message "Python post" in the background.
