Batch crawling dynamic web pages with Python

Source: Internet
Author: User
This article shows how to crawl dynamic web pages in batches with Python, using bulk lookup of CET-4 and CET-6 (College English Test Band 4 and Band 6) scores as the example. I know of two sites for querying these scores: Xuexin (http://www.chsi.com.cn/cet/) and 99sushe (http://cet.99sushe.com/). Both use dynamic web pages. I use Xuexin here. The site looks like this:

The page source is as follows:

 

From the figure, we can see that the form submits to /cet/query, that is, http://www.chsi.com.cn/cet/query. Filling in the form and submitting it gives the following result:

However, if you view the page source after submitting, the score is not there; the source is still the same as the one above. Pressing F12 and inspecting the rendered page shows:

Name: XXXX
School: XXXXXX
Exam type: English Level 4
Admission ticket No.: 120135151100101
Exam time: June 2015
Total score: 403
Listening: 132
Reading: 147
Writing and Translation: 124

The rendered page contains the result, which tells us the site builds the page dynamically. Whether it does so with JavaScript, Ajax, or something else, I don't know 0.0. That is the task: fetch these results in bulk.

Preface: I had used BeautifulSoup for crawling before, but BeautifulSoup alone cannot crawl dynamic web pages. I searched various forums and tried any number of tools, Scrapy, PyQt, and so on, and took a lot of detours (or rather, I did not know how to use them properly). In the end I settled on selenium and phantomjs, which are probably also among the most popular tools for this kind of crawler.

I. Import selenium and phantomjs

    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs-2.1.1-windows\phantomjs.exe')
    driver.get(url)
    driver.find_element_by_id('zkzh').send_keys(i)
    driver.find_element_by_id('xm').send_keys(xm)
    driver.find_elements_by_tag_name('form')[1].submit()

Code Description:

Line 3 (webdriver.PhantomJS): selenium can drive many browsers, such as Chrome and Firefox, but each of them requires both the browser and its driver to be installed. After trying them for a while, I found PhantomJS more convenient.

Lines 5, 6, and 7 fill in the admission ticket number, fill in the name, and submit the form, respectively.
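The snippet above targets the Selenium releases of its time. As far as I know, Selenium 4 dropped both `webdriver.PhantomJS` and the `find_element_by_*` helpers in favor of `find_element(by, value)`. A minimal Python 3 sketch of the same three steps in that style follows. The function name `submit_query` is hypothetical, and the driver is passed in as an argument so that any browser driver can be used. The locator strategies are written as plain strings, which are the values behind selenium's `By.ID` and `By.TAG_NAME`.

```python
# Sketch only: assumes a Selenium 4 style driver object that exposes
# get(), find_element(by, value) and find_elements(by, value).

def submit_query(driver, url, ticket_no, name):
    """Open the query page, fill in the two fields and submit the form."""
    driver.get(url)
    # "id" is the value of selenium's By.ID; 'zkzh' and 'xm' are the
    # field ids used in the article's snippet.
    driver.find_element("id", "zkzh").send_keys(ticket_no)
    driver.find_element("id", "xm").send_keys(name)
    # "tag name" is the value of By.TAG_NAME; as in the article, the
    # query form is taken to be the second <form> on the page.
    driver.find_elements("tag name", "form")[1].submit()
```

With Selenium 4 and a browser driver installed, a call might look like `submit_query(webdriver.Chrome(), 'http://www.chsi.com.cn/cet/query', ticket_no, name)`.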

II. Character processing

After the form is submitted, the result cells can be located directly with XPath:

    print driver.find_element_by_xpath("//tr[3]/td[1]").text
    print driver.find_element_by_xpath("//tr[6]/td[1]").text

Code Description:

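Line 1: reads the third row of the result table, which the full source below treats as the exam type.

Line 2: reads the total score together with the score for each part.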

After printing:

    English Level 4
    403
    Listening: 132
    Reading: 147
    Writing and Translation: 124

Then we need to process the score string and pick out the number for each part. Here we use the re module:

    import re
    m = re.findall(r'(\w*[0-9]+)\w*', chuli2)

Here chuli2 is the score text read from the page, and m is a list. The output is ['403', '132', '147', '124'].
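To make the extraction step concrete, here is a small self-contained sketch in Python 3. The sample string is an assumption modeled on the result shown earlier, the simpler pattern `\d+` is swapped in for the article's `(\w*[0-9]+)\w*`, and the helper name `parse_scores` is hypothetical.

```python
import re


def parse_scores(text):
    """Return (total, listening, reading, writing) as integers."""
    numbers = re.findall(r"\d+", text)
    if len(numbers) != 4:
        # Fail loudly if the page layout is not what we expect.
        raise ValueError("expected 4 numbers, got %d" % len(numbers))
    total, listening, reading, writing = (int(n) for n in numbers)
    return total, listening, reading, writing


# Assumed sample text, modeled on the result shown earlier.
sample = "403\nListening: 132\nReading: 147\nWriting and Translation: 124"
scores = parse_scores(sample)
print(scores)  # (403, 132, 147, 124)
```

Converting to integers early makes it easy to add a sanity check; in this sample the three part scores add up to the total.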

III. Database

I can't tell whether my school is being careless or considerate, but it published the CET-4 and CET-6 admission ticket numbers for the whole school, as an Excel file of course. We need to import that file into a MySQL database. After opening the Excel file, I found that Microsoft and Oracle get along rather well: Excel 365 actually has a section for connecting to MySQL Workbench.

The database code is as follows:

    import MySQLdb

    conn = MySQLdb.Connect(host='localhost', user='root', passwd='root', db='cet', port=3306, charset='utf8')
    cur = conn.cursor()
    curr = conn.cursor()
    cur.execute("select name from cet.cet where zkzh=(%s)" % i)
    xm = cur.fetchone()[0]
    print "Name is " + xm
    sqltxt = "update cet.cet set leibie=(%s),zongfen=(%s),tingli=(%s),yuedu=(%s),xiezuo=(%s) WHERE zkzh=(%s)" % (
        ss, m[0], m[1], m[2], m[3], i)
    cur.execute(sqltxt)
    conn.commit()
    cur.close()
    conn.close()

Code Description:

Line 3 (MySQLdb.Connect): the Python code that connects to the database.

Line 6 (the select statement): queries the database for the name that belongs to the admission ticket number.

Line 9 (sqltxt): this line gave me a lot of trouble. Building the statement by concatenation, as in "..." + ss + "...", kept raising errors, and it took me a long time to find material on the % formatting used here. I don't like this approach very much, but it works.

Line 12 (conn.commit): remember to commit the transaction with commit()! Otherwise the update does not take effect.
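The quoting trouble described for line 9 usually goes away when the values are passed to `execute()` separately instead of being formatted into the SQL string, because the database driver then takes care of quoting. The sketch below shows the idea in Python 3 with the standard library's `sqlite3` and an in-memory table, so that it runs anywhere. That is an assumption made purely for illustration: the article uses MySQLdb, where the placeholder is `%s` rather than `?`. The table layout, the sample row, and the helper name `save_scores` are likewise invented.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the article's MySQL database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE cet (zkzh TEXT PRIMARY KEY, name TEXT, leibie INTEGER,"
    " zongfen INTEGER, tingli INTEGER, yuedu INTEGER, xiezuo INTEGER)"
)
# Invented sample row: ticket number and name only, scores still empty.
cur.execute("INSERT INTO cet (zkzh, name) VALUES (?, ?)",
            ("120135151100101", "XXXX"))


def save_scores(cur, zkzh, leibie, scores):
    """Store the exam level and the four scores for one ticket number."""
    total, listening, reading, writing = (int(s) for s in scores)
    cur.execute(
        "UPDATE cet SET leibie = ?, zongfen = ?, tingli = ?, yuedu = ?,"
        " xiezuo = ? WHERE zkzh = ?",
        (leibie, total, listening, reading, writing, zkzh),
    )
    return cur.rowcount  # number of rows changed by the UPDATE


changed = save_scores(cur, "120135151100101", 4, ["403", "132", "147", "124"])
conn.commit()  # without commit() the update is not made permanent

row = cur.execute(
    "SELECT name, leibie, zongfen, tingli, yuedu, xiezuo FROM cet"
    " WHERE zkzh = ?",
    ("120135151100101",),
).fetchone()
print(changed, row)  # 1 ('XXXX', 4, 403, 132, 147, 124)
```

Passing parameters this way also protects against names that contain quote characters, which would break a statement built with string formatting.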

IV. Using a proxy server (to be written later)

After the crawler had run for a while and fetched the results of a few hundred people, the site started asking for a verification code. The way around this is either to handle the verification code or to go through a proxy server. I still need to study this part and will write it up later (^ ω ^)

V. Source code and results

    # encoding=utf8
    # Python 2 script, written for the selenium release of that time
    import MySQLdb
    import re
    import time
    from selenium import webdriver

    # connect to MySQL to look up zkzh (ticket number) and xm (name)
    conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='cet', port=3306, charset='utf8')
    cur = conn.cursor()
    curr = conn.cursor()
    url = 'http://www.chsi.com.cn/cet/query'


    def kaishi(i):
        print i,
        print "start"
        driver = None  # so the except branch is safe if the lookup fails first
        try:
            cur.execute("select name from cet.cet where zkzh=(%s)" % i)
            xm = cur.fetchone()[0]
            print "Name is " + xm
            driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs-2.1.1-windows\phantomjs.exe')
            driver.get(url)
            driver.find_element_by_id('zkzh').send_keys(i)
            driver.find_element_by_id('xm').send_keys(xm)
            driver.find_elements_by_tag_name('form')[1].submit()
            driver.set_page_load_timeout(10)
            # third table row: exam type
            leibie = driver.find_element_by_xpath("//tr[3]/td[1]").text
            leibie2 = str(leibie.encode("utf-8"))
            ss = ""
            # the label is compared with the text the site displays
            # (shown here in translation)
            if leibie2.decode("utf-8") == 'English Level 4'.decode("utf-8"):
                ss = 4
            else:
                ss = 6
            # zongfen = driver.find_element_by_xpath("//tr[6]/th[1]").text
            # print zongfen
            # print "==="
            # sixth table row: total score and the three part scores
            chuli = driver.find_element_by_xpath("//tr[6]/td[1]").text
            print chuli
            chuli2 = str(chuli.encode("utf-8"))
            m = re.findall(r'(\w*[0-9]+)\w*', chuli2)
            sqltxt = "update cet.cet set leibie=(%s),zongfen=(%s),tingli=(%s),yuedu=(%s),xiezuo=(%s) WHERE zkzh=(%s)" % (
                ss, m[0], m[1], m[2], m[3], i)
            cur.execute(sqltxt)
            conn.commit()
            print str(i) + "finish"
            driver.quit()  # end the PhantomJS process for this query
        except Exception, e:
            print e
            if driver is not None:
                driver.quit()
            # wait, then retry the same ticket number
            time.sleep(10)
            kaishi(i)


    # for j1 in range(1201351511001, 1201351512154):
    for j1 in range(1201351511007, 1201351512154):
        for j2 in range(0, 3):
            for j3 in range(0, 10):
                j = str(j1) + str(j2) + str(j3)
                if str(j2) + str(j3) == "00":
                    # there is no seat 00
                    print "0.0"
                elif str(j2) + str(j3) == "29":
                    # after seat 29, also query seat 30
                    kaishi(str(j1) + str(j2) + str(j3))
                    j4 = str(j1) + "30"
                    kaishi(j4)
                else:
                    kaishi(j)

    print "END !!!"
    cur.close()
    conn.close()
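The nested loops in the listing enumerate seats 01 to 30 for every 13-digit prefix, and the except branch retries by calling `kaishi` again without any limit. One possible way to express both more compactly in Python 3 is sketched below. The names `ticket_numbers` and `with_retries`, the limit of three attempts, and the reading of the prefix as a room number with 30 seats (taken from the loop bounds) are illustrative assumptions, not part of the original script.

```python
import time


def ticket_numbers(first_room, stop_room, seats=30):
    """Yield 15-digit ticket numbers: 13-digit prefix plus 2-digit seat.

    Like range(), stop_room itself is not included.
    """
    for room in range(first_room, stop_room):
        for seat in range(1, seats + 1):
            yield "%d%02d" % (room, seat)


def with_retries(task, arg, attempts=3, delay=10, sleep=time.sleep):
    """Call task(arg); on failure wait and retry, at most `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return task(arg)
        except Exception as exc:
            print("%s failed (attempt %d): %s" % (arg, attempt, exc))
            if attempt == attempts:
                raise  # give up instead of retrying forever
            sleep(delay)


first_two = list(ticket_numbers(1201351511001, 1201351511002))[:2]
print(first_two)  # ['120135151100101', '120135151100102']
```

A driver loop could then read `for number in ticket_numbers(...): with_retries(kaishi, number)`, which skips a ticket number that keeps failing instead of recursing indefinitely.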

Summary: the details of string handling in Python really matter. Errors turn up in all sorts of places, and different IDEs use different encodings. Remember that the system encoding, the character encoding of the source, the environment encoding, and the database encoding must all be consistent.

That is all for this article. I hope it helps with your learning.
