Use Python to crawl resume data from 58.com (58同城)

I recently took on a job that required collecting resume data from 58.com (http://gz.58.com/qzyewu/). My first thought was to build the crawler with Python's Scrapy framework. However, during development I found the page content could not be stored in the response object; when I loaded the page through the Scrapy shell, the content did appear in the response, but the XPath queries for the data I needed returned empty results. Since the data is present in the page source, I switched to downloading the source with urllib2 and extracting the data with BeautifulSoup, then inserting it into a database.
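For reference, the failed check in the Scrapy shell looked roughly like this (the URL is the one used later in this article; the XPath selector is a hypothetical example, as the real class names depend on the page):

$ scrapy shell 'http://jianli.58.com/resume/91655325401100'
>>> response.xpath('//span[@class="name"]/text()').extract()
[]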

Required Python packages: urllib2, BeautifulSoup, MySQLdb, re

First, get the entire page

# coding: utf-8
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://jianli.58.com/resume/91655325401100'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup
url is the address of the page to download. urllib2.urlopen() opens the page, and read() reads the data it returns.
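If the site rejects bare requests or the network is flaky, a slightly more defensive fetch helps. A minimal sketch, assuming a browser-like User-Agent is enough to get through:

import urllib2

def fetch(url):
    # Some sites reject urllib2's default User-Agent, so send a browser-like one
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        return urllib2.urlopen(request, timeout=10).read()
    except urllib2.URLError as e:
        print 'failed to fetch %s: %s' % (url, e)
        return None

content = fetch('http://jianli.58.com/resume/91655325401100')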

Second, filter the data you want
This requires regular expressions. Python's re module provides powerful regular expression support; if the syntax is unfamiliar, see the reference at http://www.runoob.com/regexp/regexp-syntax.html.
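The expressions used below rely on the lookbehind (?<=...) and lookahead (?=...) constructs, which match text between two delimiters without including the delimiters themselves. A tiny self-contained example on a simplified HTML snippet:

import re

html = '<span class="name">Zhang San</span>'
# Match everything between class="name"> and the next '<'
print re.findall(r'(?<=class="name">).*?(?=<)', html)
# ['Zhang San']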

For example, suppose we need to extract the name.

From the browser console you can see where the name is located:

[Screenshot: the name element as shown in the browser console]

A regular expression can be used to match it. The code is as follows:

name = re.findall(r'(?<=class="name">).*?(?=<)', str(soup))
Running the program, we find that the result returned is empty.

The regular expression is correct, so we examine the soup printed earlier and discover that the source it returns is not the same as the source shown in the browser. Any regular expression written against the source viewed in the browser will therefore fail to match the source that is actually returned, so we have to write our expressions against the returned source instead.

[Screenshot: the page source as returned by soup]
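A quick way to see the source your program actually received, for comparison with the browser's view-source output:

# Dump the HTML as the crawler received it, then compare it by eye
# against what "view source" shows in the browser
print soup.prettify()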

In the source returned by soup, all of this person's basic information is easy to find, contained in <li class="item"> tags. The findAll() method below retrieves it easily:

data = soup.findAll('li', attrs={'class': 'item'})

The code above returns a list, as shown in the following result:

[Screenshot: the list returned by findAll()]

In this way, we obtain the person's name, gender, age, work experience and education.

With the same approach, we can extract all the data we need from the entire page.
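For illustration, here is a sketch of unpacking that list into named fields, assuming the page lists name, sex, age, experience and education in that order:

import re
import urllib2
from BeautifulSoup import BeautifulSoup

content = urllib2.urlopen('http://jianli.58.com/resume/91655325401100').read()
soup = BeautifulSoup(content)
data = soup.findAll('li', attrs={'class': 'item'})
# Strip the surrounding markup with the same lookaround trick as above
values = re.findall(r'(?<=class="item">).*?(?=<)', str(data))
# Assumed field order on the page: name, sex, age, experience, education
name, sex, age, experience, education = values[:5]
print name, sex, age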

Third, save the data to the database
I'm using a MySQL database, so the example below uses MySQL.

Connecting to the database:

conn = MySQLdb.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    passwd='XXXXX',
    db='XXXXX',
    charset='utf8')
cursor = conn.cursor()
Because I want to store Chinese text, I set the connection encoding here to utf8.

Creating the INSERT statement:

sql_insert = """INSERT INTO resume (
    id, name, sex, age, experience, education, pay, ad,
    job, job_experience, education_experience)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
Inserting the data:

cursor.execute(sql_insert, (id, name, sex, age, experience, education,
                            pay, ad, job, job_experience, education_experience))
conn.commit()
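If many resumes are inserted in one run, the same parameterized statement also works with executemany(). A sketch, where rows is assumed to be a list of 11-element tuples:

# rows: one 11-element tuple per resume, in the column order of sql_insert
rows = [(id, name, sex, age, experience, education,
         pay, ad, job, job_experience, education_experience)]
cursor.executemany(sql_insert, rows)
conn.commit()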
Closing the database:

cursor.close()
conn.close()
Executing the program raises an error:

(1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '...' at line 1")
When this error occurs even though the SQL syntax is correct, the cause is usually encoding. Our database uses UTF-8, so the encoding of the data being inserted must be the problem. We re-encode the returned data with the decode() and encode() methods:

name = data[0].decode('utf-8').encode('utf-8')
This simple step resolves the error caused by the mismatch between the database encoding and the data encoding.

Why are the encodings different?
When we crawl the page with the BeautifulSoup package, the data comes back ASCII-encoded, while our database expects UTF-8, so inserting it raises an error. Re-encoding the crawled data before insertion fixes this.
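A quick way to check what you are actually holding before the insert is to look at the value's type and re-encode explicitly. A minimal sketch in Python 2, where data is the findAll() list from earlier:

text = data[0]
print type(text)  # str (bytes) or unicode, depending on how the page was parsed
if isinstance(text, unicode):
    # MySQL with charset='utf8' expects UTF-8 encoded bytes
    text = text.encode('utf-8')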

Results
[Screenshot: the crawled resume data stored in the database]

This is the result of my crawl. It works well, at a speed of roughly one page per second. Although that is much slower than Scrapy, BeautifulSoup and urllib2 are simple to use and are a good fit for beginners to practice with.
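Since the job needs many resumes rather than one, the single-page code can be wrapped in a simple loop. A sketch, where the list of resume URLs is hypothetical (in practice it would be collected from the listing pages):

import time
import urllib2
from BeautifulSoup import BeautifulSoup

resume_urls = [
    'http://jianli.58.com/resume/91655325401100',
    # ... more resume URLs gathered from the listing pages
]
for url in resume_urls:
    content = urllib2.urlopen(url).read()
    soup = BeautifulSoup(content)
    # ... extract the fields and insert into MySQL as shown above
    time.sleep(1)  # stay close to the observed one page per second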

Appendix: Code

# coding: utf-8
import urllib2
from BeautifulSoup import BeautifulSoup
import re
import MySQLdb

url = 'http://jianli.58.com/resume/91655325401100'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

# Basic information (name, sex, age, experience, education) from <li class="item">
basedata = str(soup.findAll('li', attrs={'class': 'item'}))
basedata = re.findall(r'(?<=class="item">).*?(?=<)', basedata)
# Resume id from the inline JavaScript
id = str(soup.findAll('script', attrs={'type': 'text/javascript'}))
id = re.findall(r'(?<=global.ids = ").*?(?=";)', id)
id = id[0].decode('utf-8').encode('utf-8')
name = basedata[0].decode('utf-8').encode('utf-8')
sex = basedata[1].decode('utf-8').encode('utf-8')
age = basedata[2].decode('utf-8').encode('utf-8')
experience = basedata[3].decode('utf-8').encode('utf-8')
education = basedata[4].decode('utf-8').encode('utf-8')
# Expected pay from the <dd> blocks
pay = str(soup.findAll('dd', attrs={None: None}))
pay = re.findall(r'(?<=<dd>)\d+.*?(?=</dd>)', pay)
pay = pay[0].decode('utf-8').encode('utf-8')
# Expected location and job from the <dd> blocks
expectdata = str(soup.findAll('dd', attrs={None: None}))
expectdata = re.findall(r'(?<=["\']>)[^<].*?(?=<)', expectdata)
ad = expectdata[0].decode('utf-8').encode('utf-8')
job = expectdata[1].decode('utf-8').encode('utf-8')
# Work experience from <div class="employed">
job_experience = str(soup.findAll('div', attrs={'class': 'employed'}))
job_experience = re.findall(r'(?<=>)[^<].*?(?=<)', job_experience)
job_experience = ''.join(job_experience).decode('utf-8').encode('utf-8')
# Education experience (spans several lines inside the <dd> blocks)
education_experience = str(soup.findAll('dd', attrs={None: None}))
education_experience = re.findall(r'(?<=<dd>).*\n.*?(?=\n</dd>)', education_experience)
education_experience = ''.join(education_experience).decode('utf-8').encode('utf-8')

conn = MySQLdb.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    passwd='XXXXX',
    db='XXXX',
    charset='utf8')
cursor = conn.cursor()
sql_insert = ("INSERT INTO resume (id, name, sex, age, experience, education, "
              "pay, ad, job, job_experience, education_experience) "
              "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
try:
    cursor.execute(sql_insert, (id, name, sex, age, experience, education,
                                pay, ad, job, job_experience, education_experience))
    conn.commit()
except Exception as e:
    print e
    conn.rollback()
finally:
    cursor.close()
    conn.close()
