The rapid development of artificial intelligence and the arrival of the big data era have let Python shine not only in AI but also in data processing, and the language plays an increasingly important role in web development, network programming, automation, game development, finance, and other fields.
The Baidu search index shows that Python has surpassed Java since July 2017, which speaks to the language's popularity.
In this article, the author crawls Python job postings from Lagou (a recruiting site for Internet practitioners) and visually analyzes the job data: salary, education requirements, location, work experience, and so on.
01
Preparation
1. Web page analysis
Open the site and search for Python: each page shows 15 job postings and up to 30 pages can be viewed, for a total of 450 postings. The information we want to obtain includes: position name, company name, salary range, location, education requirement, work experience, company financing stage, company size, and job description.
2. Request analysis
Open Lagou in Chrome with the developer console open: when you turn a page, the data is fetched with an XHR request. Observing the request, the city parameter in the URL is the city, the POST parameter kd is the search keyword (the position), and pn is the page number.
3. Parsing the position list JSON response
The response is parsed with the json library to obtain the relevant fields. Note that we need to keep positionId for the next step, where it is used to fetch the job description.
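For orientation, the part of the response that the parser relies on is nested roughly as sketched below (a hedged illustration only: hjson stands for the parsed response, as in the function that follows, and the real payload carries many more fields).

import json

# rough sketch of the nesting used below; hjson is the parsed JSON response
jobs = hjson['content']['positionResult']['result']          # list of up to 15 positions
print(jobs[0]['positionName'], jobs[0]['salary'], jobs[0]['positionId'])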
import json
import requests

def get_lagou(page, city, kd):
    url = "https://www.lagou.com/jobs/positionAjax.json"
    querystring = {"px": "new", "city": city, "needAddtionalResult": "false", "isSchoolJob": "0"}
    payload = "first=false&pn=" + str(page) + "&kd=" + str(kd)
    # forge a fresh cookie for every request
    cookie = ("JSESSIONID=" + get_uuid() + "; "
              "user_trace_token=" + get_uuid() + "; LGUID=" + get_uuid() + "; "
              "index_location_city=%E6%88%90%E9%83%BD; "
              "SEARCH_ID=" + get_uuid() + "; _gid=GA1.2.717841549.1514043316; "
              "_ga=GA1.2.952298646.1514043316; "
              "LGSID=" + get_uuid() + "; "
              "LGRID=" + get_uuid() + "; ")
    headers = {'cookie': cookie, 'origin': "https://www.lagou.com",
               'x-anit-forge-code': "0", 'x-anit-forge-token': "None",
               'accept-encoding': "gzip, deflate, br", 'accept-language': "zh-CN,zh;q=0.8,en;q=0.6",
               'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
               'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
               'accept': "application/json, text/javascript, */*; q=0.01",
               'referer': "https://www.lagou.com/jobs/list_Java?px=new&city=%E6%88%90%E9%83%BD",
               'x-requested-with': "XMLHttpRequest", 'connection': "keep-alive",
               'cache-control': "no-cache", 'postman-token': "91beb456-8dd9-0390-a3a5-64ff3936fa63"}
    response = requests.request("POST", url, data=payload.encode('utf-8'),
                                headers=headers, params=querystring)
    # print(response.text)
    hjson = json.loads(response.text)
    for i in range(15):  # 15 positions per page
        job = hjson['content']['positionResult']['result'][i]
        job_desc = get_job_desc(job['positionId'])
        positionname_list.append(job['positionName'])
        salary_list.append(job['salary'])
        city_list.append(job['city'])
        district_list.append(job['district'])
        companyshortname_list.append(job['companyShortName'])
        education_list.append(job['education'])
        workyear_list.append(job['workYear'])
        industryfield_list.append(job['industryField'])
        financestage_list.append(job['financeStage'])
        companysize_list.append(job['companySize'])
        # job_desc_list.append(job_desc)
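The function above relies on a get_uuid() helper and on module-level result lists that the article never shows. Below is a minimal sketch of what they might look like; the list names are taken from the code above, while the uuid-based implementation of get_uuid() is an assumption.

import uuid

def get_uuid():
    # assumption: a random UUID string is enough to forge the cookie values above
    return str(uuid.uuid4())

# module-level containers the crawler appends to
positionname_list, salary_list, city_list, district_list, companyshortname_list = [], [], [], [], []
education_list, workyear_list, industryfield_list, financestage_list, companysize_list = [], [], [], [], []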
4. Getting the job description
When you open the detail page of a specific position, the number in the URL (for example, 4789029) is the positionId of that position, which is exactly the value obtained from the position list JSON in the previous step.
The page is requested with requests, and the job description is then extracted with XPath.
from lxml import etree

def get_job_desc(position_id):
    url = "https://www.lagou.com/jobs/" + str(position_id) + ".html"
    cookie = ("JSESSIONID=" + get_uuid() + "; "
              "user_trace_token=" + get_uuid() + "; LGUID=" + get_uuid() + "; "
              "index_location_city=%E6%88%90%E9%83%BD; "
              "SEARCH_ID=" + get_uuid() + "; _gid=GA1.2.717841549.1514043316; "
              "_ga=GA1.2.952298646.1514043316; "
              "LGSID=" + get_uuid() + "; "
              "LGRID=" + get_uuid() + "; ")
    headers = {'cookie': cookie, 'origin': "https://www.lagou.com",
               'x-anit-forge-code': "0", 'x-anit-forge-token': "None",
               'accept-encoding': "gzip, deflate, br", 'accept-language': "zh-CN,zh;q=0.8,en;q=0.6",
               'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
               'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
               'accept': "application/json, text/javascript, */*; q=0.01",
               'referer': "https://www.lagou.com/jobs/list_Java?px=new&city=%E6%88%90%E9%83%BD",
               'x-requested-with': "XMLHttpRequest", 'connection': "keep-alive",
               'cache-control': "no-cache", 'postman-token': "91beb456-8dd9-0390-a3a5-64ff3936fa63"}
    response = requests.request("GET", url, headers=headers)
    x = etree.HTML(response.text)
    # the description text sits under the element with id "job_detail"
    data = x.xpath('//*[@id="job_detail"]/dd[2]/div/*/text()')
    return ''.join(data)
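As a quick sanity check, the function can be called with the positionId from the example URL above (a hypothetical usage snippet, not from the article):

desc = get_job_desc(4789029)   # positionId taken from the example detail-page URL
print(desc[:200])              # first 200 characters of the job description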
02
Data acquisition: the crawler
1. Setting up cookies and headers
Without this information, the site refuses the crawl and returns the message "Your operations are too frequent, please visit later." Therefore, we need to set the headers and cookie information.
def get_lagou(page, city, kd):
    url = "https://www.lagou.com/jobs/positionAjax.json"
    querystring = {"px": "new", "city": city, "needAddtionalResult": "false", "isSchoolJob": "0"}
    payload = "first=false&pn=" + str(page) + "&kd=" + str(kd)
    cookie = ("JSESSIONID=" + get_uuid() + "; "
              "user_trace_token=" + get_uuid() + "; LGUID=" + get_uuid() + "; "
              "index_location_city=%E6%88%90%E9%83%BD; "
              "SEARCH_ID=" + get_uuid() + "; _gid=GA1.2.717841549.1514043316; "
              "_ga=GA1.2.952298646.1514043316; "
              "LGSID=" + get_uuid() + "; "
              "LGRID=" + get_uuid() + "; ")
    headers = {'cookie': cookie, 'origin': "https://www.lagou.com",
               'x-anit-forge-code': "0", 'x-anit-forge-token': "None",
               'accept-encoding': "gzip, deflate, br", 'accept-language': "zh-CN,zh;q=0.8,en;q=0.6",
               'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
               'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
               'accept': "application/json, text/javascript, */*; q=0.01",
               'referer': "https://www.lagou.com/jobs/list_Java?px=new&city=%E6%88%90%E9%83%BD",
               'x-requested-with': "XMLHttpRequest", 'connection': "keep-alive",
               'cache-control': "no-cache", 'postman-token': "91beb456-8dd9-0390-a3a5-64ff3936fa63"}
2. Delay setting and page crawling
To avoid being blocked for crawling too fast, a random delay of 3-5 seconds is set between requests. The pages are crawled with a for loop.
import random
import time

def main(pages, city, job):
    for n in range(1, pages + 1):
        get_lagou(n, city, job)
        time.sleep(round(random.uniform(3, 5), 2))  # random 3-5 second pause between pages
    write_to_csv(city, job)   # write the collected lists to a CSV once all pages are done
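A possible entry point is sketched below; the concrete page count, city string, and keyword are assumptions, not values given by the article:

if __name__ == '__main__':
    # assumed parameters: 30 pages, nationwide search, keyword "Python"
    main(30, '全国', 'Python')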
03
Data storage and processing
1. CSV Data storage
Because the amount of data is small (at most 450 records), CSV is used for storage.
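The write_to_csv() function called in main() is not shown in the article. Below is a minimal pandas-based sketch; the column names and file naming scheme are assumptions, chosen to match the code that reads the CSV later on.

import pandas as pd

def write_to_csv(city, job):
    # assemble the module-level lists into one table and dump it to disk
    df = pd.DataFrame({'positionName': positionname_list, 'salary': salary_list,
                       'city': city_list, 'district': district_list,
                       'companyShortName': companyshortname_list, 'education': education_list,
                       'workYear': workyear_list, 'industryField': industryfield_list,
                       'financeStage': financestage_list, 'companySize': companysize_list})
    # hypothetical file name scheme, e.g. lagou_全国_Python.csv
    df.to_csv('lagou_{}_{}.csv'.format(city, job), index=False, encoding='utf-8')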
2. Data processing
Salary data processing
We later want to compute the distribution of monthly salaries, but the posted salary ranges are free-form and follow no unified standard: 10k-20k, 5k-8k, 11k-18k, 10k-16k, and so on, which makes them hard to visualize directly. The salaries are therefore grouped into these buckets: 2k or less, 2k-5k, 5k-10k, 10k-15k, 15k-25k, 25k-50k, and 50k or more.
A range such as 10k-20k is counted in both the 10k-15k and 15k-25k buckets. Regular expressions are used for the tallying:
import re

def salary_categorize(salarys):
    # a posted range such as "10k-20k" is counted in every bucket it overlaps
    counts = {'2k or less': 0, '2k-5k': 0, '5k-10k': 0, '10k-15k': 0,
              '15k-25k': 0, '25k-50k': 0, '50k or more': 0}
    for salary in salarys:
        if re.match(r'^[0-1]k-.*|.*-[0-1]k$', salary):
            counts['2k or less'] += 1
        if re.match(r'^[2-4]k-.*|.*-[2-4]k$', salary):
            counts['2k-5k'] += 1
        if re.match(r'^[5-9]k-.*|.*-[5-9]k$', salary):
            counts['5k-10k'] += 1
        if re.match(r'^1[0-4]k-.*|.*-1[0-4]k$', salary):
            counts['10k-15k'] += 1
        if re.match(r'^1[5-9]k-.*|^2[0-4]k-.*|.*-1[5-9]k$|.*-2[0-4]k$', salary):
            counts['15k-25k'] += 1
        if re.match(r'^2[5-9]k-.*|^[3-4][0-9]k-.*|.*-2[5-9]k$|.*-[3-4][0-9]k$', salary):
            counts['25k-50k'] += 1
        if re.match(r'^[5-9][0-9]k-.*|.*-[5-9][0-9]k$|^\d{3,}k-.*|.*-\d{3,}k$', salary):
            counts['50k or more'] += 1
    return counts
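A possible way to run the bucketing over the stored data (the CSV file name and the 'salary' column name follow the assumptions made earlier):

import pandas as pd

df = pd.read_csv('lagou_全国_Python.csv', engine='python', encoding='utf-8')
print(salary_categorize(df['salary']))   # prints the bucket counts as a dict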
Industry information processing
A company can belong to several industries, usually separated by commas, but some entries are separated by a comma and a space, and some have no industry at all. The Python re library handles the multiple delimiters, and rows with an empty industry field are skipped.
import re
from collections import Counter
import pandas as pd

def industryfield_counts(csv_file):
    industryfields = []
    d = pd.read_csv(csv_file, engine='python', encoding='utf-8')
    info = d['industryField']
    for i in range(len(info)):
        try:
            # split on either an English or a Chinese comma
            data = re.split('[,，]', info[i])
        except TypeError:
            continue   # empty industry field (NaN), skip the row
        for j in range(len(data)):
            industryfields.append(data[j].strip())
    counts = Counter(industryfields)
    return counts
04
Visualization and interpretation of data
1. Company situation analysis
Looking at industry and company size, mobile Internet accounts for 40% of the demand, while data services, big data, and artificial intelligence together account for about 10%. Python is versatile, covering web development, network programming, crawlers, cloud computing, artificial intelligence, automated operations, and more, so companies of every size and financing stage generally need Python talent.
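The article shows the charts but not the plotting code. Here is a minimal matplotlib sketch for the industry pie chart, reusing industryfield_counts() from the processing step above; the file name, font choice, and top-10 cutoff are assumptions.

import matplotlib.pyplot as plt

counts = industryfield_counts('lagou_全国_Python.csv')   # assumed file name
top = counts.most_common(10)                             # ten most frequent industries
labels = [name for name, _ in top]
sizes = [num for _, num in top]

plt.rcParams['font.sans-serif'] = ['SimHei']   # so Chinese industry names render
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.title('Industry distribution of Python positions')
plt.show()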
2. City demand analysis
The analysis shows that demand is concentrated in China's three major economic circles: Beijing-Tianjin-Hebei, the Yangtze River Delta, and the Pearl River Delta, and mainly in six cities: Beijing (40%), Shanghai (16%), Shenzhen (15%), Guangzhou (6%), Chengdu (6%), and Hangzhou (6%). Beijing has the strongest Internet start-up atmosphere in China, with far more registered Internet companies than any other city, so its demand is also the largest.
3. Salary and work experience analysis
In terms of work experience, most postings ask for 3-5 years or 1-3 years. Looking at the correlation between experience and salary: positions requiring 1-3 years of experience mostly pay 15k-25k, with a roughly normal distribution; positions requiring 3-5 years mostly fall in the 15k-25k and 25k-50k ranges, with 15k-25k the most common; and for 5-10 years of experience, 25k-50k is the most common range.
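One way to reproduce this cross-tabulation is to group the CSV by work experience and reuse salary_categorize() on each group (a hedged sketch; the file and column names are the same assumptions as before):

import pandas as pd

df = pd.read_csv('lagou_全国_Python.csv', engine='python', encoding='utf-8')
for years, group in df.groupby('workYear'):
    # salary bucket counts for each work-experience bracket
    print(years, salary_categorize(group['salary']))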
4. Education requirements and work experience analysis
In terms of education, most postings require at least a bachelor's degree, about 80% of them. So do not believe the claim that studying is useless: a degree is at least a stepping stone into a job.
For work experience, the usual requirement is 1-5 years, which accounts for 84% of postings. Less than 1 year of experience or no experience requirement accounts for about 9%, and 5-10 years for about 7%.
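Proportions like these can be read off directly with pandas value_counts (again assuming the column names from the earlier sketches):

import pandas as pd

df = pd.read_csv('lagou_全国_Python.csv', engine='python', encoding='utf-8')
print(df['education'].value_counts(normalize=True))   # share of each education requirement
print(df['workYear'].value_counts(normalize=True))    # share of each experience bracket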
05
Summary
The August TIOBE programming language index has been published, and the top three are still Java, C, and C++, but Python is very close behind them. Python's upward trend is visible in the TIOBE rankings, and the Internet industry has begun to adopt it widely. Python originally positioned itself as a successor to Perl, used for build scripts and all kinds of glue code, but it has gradually moved into other areas, and today it is common to find Python running even in large embedded systems. Python may well enter the top three, and could even overtake Java in the future to become number one.
Regarding Python's current job prospects, the summary is as follows:
Python's employment outlook is promising: judging from the August TIOBE rankings and the Baidu search index, its popularity keeps rising.
In China, demand for Python-related jobs is still concentrated in the three major economic circles, especially Beijing, Shanghai, and Shenzhen. By industry, demand comes mainly from mobile Internet, data services, big data analysis, and related fields.
From the Lagou data, most Python jobs require a bachelor's degree or above and 1-5 years of work experience. With Python's explosive growth in big data and artificial intelligence, salaries for Python positions are rising; according to this analysis, monthly salaries mostly range from 10k to 50k.
People say Python is catching up with Java; after crawling Lagou, it turns out salaries really do reach 50k a month.