Bulk Crawling of Dynamic Web Pages with Python

Source: Internet
Author: User
I know of two sites for looking up CET-4 and CET-6 scores: Xuexin (CHSI, http://www.chsi.com.cn/cet/) and 99 Dorm (http://cet.99sushe.com/). Both are dynamic web pages. I am using the Xuexin site, which looks like this:

The page source of the site is as follows:

The source shows that the form submits to /cet/query, that is, http://www.chsi.com.cn/cet/query. After filling in and submitting the form, the result is as follows:

However, viewing the page source after submitting shows no scores; the source is still the same as above. Pressing F12 to inspect the live page gives:

Name: XXXX
School: XXXXXX
Exam category: English level four
Admission ticket number: 120135151100101
Exam time: June 2015
Total:
Listening:
Reading:
Writing and translating:

The inspected page does contain the scores, so the site is a dynamic web page that fills in the results with JavaScript, Ajax, or something else I am not familiar with. That is the requirement.

Preface: I first tried crawling with BeautifulSoup, but BeautifulSoup only parses the HTML it is given and does not run JavaScript, so it cannot see dynamically loaded content. I searched various forums and tried all kinds of tools, Scrapy, PyQt and so on, and took a lot of detours. It is not that they cannot do it; more likely I just do not know how to use them. In the end I used Selenium and PhantomJS, which are probably the most popular crawler tools at the moment.

First, import Selenium and PhantomJS

    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs-2.1.1-windows\phantomjs.exe')
    driver.get(url)  # url is the query page address
    driver.find_element_by_id('zkzh').send_keys(i)  # i is the admission ticket number
    driver.find_element_by_id('xm').send_keys(xm)  # xm is the candidate's name
    driver.find_elements_by_tag_name('form')[1].submit()

Code Description:

Line 3 (webdriver.PhantomJS): Selenium can drive many browsers, such as Chrome and Firefox, but each one needs the browser itself plus its driver, which takes some fiddling. From what I read online, PhantomJS works better.

Lines 5, 6, and 7: enter the admission ticket number, enter the name, and submit the form.

Second, string processing

Once the form is submitted, the result cells can be located directly:

    print driver.find_element_by_xpath("//tr[3]/td[1]").text
    print driver.find_element_by_xpath("//tr[6]/td[1]").text

Code Description:

Line 1: read the exam category name (row 3 of the result table).

Line 2: read the total score and the scores for each part.

After printing:

Name Listening reading writing

Next, process the score string and pick out the number for each part. Here we use the re module:

    import re

    m = re.findall(r'(\w*[0-9]+)\w*', chuli2)

Here m is a list; the output is ["403", "132", "147", "142"]
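The pattern above depends on `\w` matching only ASCII characters, which holds for a UTF-8 byte string in Python 2 but not for a Python 3 `str`, where `\w` also matches Chinese characters. A minimal sketch of a more portable alternative follows. The text in `sample` is hypothetical (its labels and layout are assumptions for illustration; only the four numbers come from the output shown above), and `extract_scores` is a made-up helper name.

```python
import re

# Hypothetical cell text: labels and layout are assumptions for illustration;
# the four numbers are the ones shown in the article's output.
sample = "403\nListening: 132\nReading: 147\nWriting and translating: 142"


def extract_scores(text):
    """Return every run of digits in text, in order of appearance."""
    return re.findall(r"\d+", text)


scores = extract_scores(sample)
total, listening, reading, writing = scores
print(scores)
```

Matching only digits avoids any dependence on how `\w` treats the surrounding labels.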

Third, the database

I cannot tell whether my school is being careless or considerate, but it published the CET-4/6 admission ticket numbers of the whole class. They come as an Excel file, of course, and need to be imported into a MySQL database. When I opened Excel, I found that Microsoft and Oracle are really impressive: Excel 365 actually has a MySQL Workbench connection section.

The database code is as follows:

    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='cet', port=3306, charset='utf8')
    cur = conn.cursor()
    curr = conn.cursor()
    cur.execute("select name from cet.cet where zkzh=(%s)" % i)
    xm = cur.fetchone()[0]
    print "Name is " + xm
    sqltxt = "update cet.cet set leibie=(%s),zongfen=(%s),tingli=(%s),yuedu=(%s),xiezuo=(%s) where zkzh=(%s)" % (
        ss, m[0], m[1], m[2], m[3], i)
    cur.execute(sqltxt)
    conn.commit()
    cur.close()
    conn.close()

Code Description:

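Line 3 (MySQLdb.connect): the Python code that connects to the database.

Line 6 (the select statement): queries the database for the name that belongs to the ticket number.

Line 9 (sqltxt): this line left me speechless. Concatenating with '"+ss+"' kept raising errors, and I only found this % formatting after half a day of searching. I do not like this way of writing it, but it will do.

Line 12 (conn.commit()): be sure to commit the transaction! Without commit(), the update does not take effect.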
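The quoting trouble described for the update statement usually goes away when values are passed as query parameters instead of being formatted into the SQL string. A minimal sketch of the idea follows, using the standard-library sqlite3 module with an in-memory table so that it runs without a MySQL server. The table layout, the sample row, and the helper name `save_scores` are assumptions for illustration. sqlite3 uses `?` placeholders, whereas MySQLdb uses `%s` placeholders with the values passed as the second argument to `execute`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table that mirrors the column names used in the article.
cur.execute(
    "create table cet (zkzh text primary key, name text, leibie integer, "
    "zongfen integer, tingli integer, yuedu integer, xiezuo integer)"
)
cur.execute("insert into cet (zkzh, name) values (?, ?)", ("120135151100101", "XXXX"))


def save_scores(cursor, zkzh, leibie, scores):
    """Update one row; the driver handles quoting of every value."""
    cursor.execute(
        "update cet set leibie=?, zongfen=?, tingli=?, yuedu=?, xiezuo=? where zkzh=?",
        (leibie, int(scores[0]), int(scores[1]), int(scores[2]), int(scores[3]), zkzh),
    )


save_scores(cur, "120135151100101", 4, ["403", "132", "147", "142"])
conn.commit()  # without commit() the update would not be persisted

row = cur.execute(
    "select leibie, zongfen, tingli, yuedu, xiezuo from cet where zkzh=?",
    ("120135151100101",),
).fetchone()
print(row)
```

Besides avoiding quoting errors, passing parameters separately keeps text taken from a web page from being interpreted as SQL.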

Fourth, using a proxy server (to be continued)

After the script had run for a while and fetched the scores of a few hundred people, the site started asking for a verification code. The only options are to handle the verification code or to use a proxy server. I will keep studying this part and write it up later (^ω^)

Fifth, source code and results

    # encoding=utf8
    import MySQLdb
    import re
    import time
    from selenium import webdriver

    # connect mysql, get zkzh and xm
    conn = MySQLdb.connect(host='localhost', user='root', passwd='root', db='cet', port=3306, charset='utf8')
    cur = conn.cursor()
    curr = conn.cursor()
    url = 'http://www.chsi.com.cn/cet/query'


    def kaishi(i):
        print i,
        print "Start"
        try:
            # look up the name that belongs to this ticket number
            cur.execute("select name from cet.cet where zkzh=(%s)" % i)
            xm = cur.fetchone()[0]
            print "Name is " + xm
            driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs-2.1.1-windows\phantomjs.exe')
            driver.get(url)
            driver.find_element_by_id('zkzh').send_keys(i)
            driver.find_element_by_id('xm').send_keys(xm)
            driver.find_elements_by_tag_name('form')[1].submit()
            # the timeout value is missing in the source; 10 seconds is an assumption
            driver.set_page_load_timeout(10)
            leibie = driver.find_element_by_xpath("//tr[3]/td[1]").text
            leibie2 = str(leibie.encode("utf-8"))
            ss = ""
            # '英语四级' (CET-4) is assumed to be the original label behind 'English level four'
            if leibie2.decode("utf-8") == '英语四级'.decode("utf-8"):
                ss = 4
            else:
                ss = 6
            # zongfen = driver.find_element_by_xpath("//tr[6]/th[1]").text
            # print zongfen
            # print "===="
            chuli = driver.find_element_by_xpath("//tr[6]/td[1]").text
            print chuli
            chuli2 = str(chuli.encode("utf-8"))
            m = re.findall(r'(\w*[0-9]+)\w*', chuli2)
            sqltxt = "update cet.cet set leibie=(%s),zongfen=(%s),tingli=(%s),yuedu=(%s),xiezuo=(%s) where zkzh=(%s)" % (
                ss, m[0], m[1], m[2], m[3], i)
            cur.execute(sqltxt)
            conn.commit()
            print str(i) + " finish"
        except Exception, e:
            print e
            driver.close()
            time.sleep(10)
            kaishi(i)  # retry the same ticket number


    # for j1 in range(1201351511001, 1201351512154):
    for j1 in range(1201351511007, 1201351512154):
        for j2 in range(0, 3):
            for j3 in range(0, 10):
                j = str(j1) + str(j2) + str(j3)
                # The three string literals below are garbled in the source;
                # "00", "29" and "30" are assumptions based on two-digit seat numbers.
                if str(j2) + str(j3) == "00":
                    print "0.0"
                elif str(j2) + str(j3) == "29":
                    kaishi(str(j1) + str(j2) + str(j3))
                    j4 = str(j1) + "30"
                    kaishi(j4)
                else:
                    kaishi(j)

    print "END!!!"
    cur.close()
    conn.close()
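In the script above, a failed lookup is retried by calling kaishi(i) again from the except block, with no limit on the number of attempts. A hedged sketch of a bounded alternative follows. The names `retry`, `max_attempts`, and `delay` are hypothetical, and `flaky` merely stands in for the real lookup so that the example runs without a browser or database.

```python
import time


def retry(func, arg, max_attempts=3, delay=10, sleep=time.sleep):
    """Call func(arg) until it succeeds or max_attempts is reached.

    Returns func's result; re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(arg)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay)  # wait before the next attempt


# Stand-in for the real lookup: fails twice, then succeeds.
calls = []


def flaky(ticket):
    calls.append(ticket)
    if len(calls) < 3:
        raise RuntimeError("temporary failure")
    return ticket + " finish"


result = retry(flaky, "120135151100101", delay=0)
print(result)
```

A loop with a fixed attempt limit cannot recurse indefinitely when the site keeps refusing requests, for example once it starts asking for a verification code.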

Summary: the details of string handling in Python really matter. When the output is garbled or differs from what the IDE shows, remember that the system encoding, the source file encoding, the environment encoding, the database encoding and so on all need to be consistent.
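One common way to keep encodings consistent is to decode bytes to text once, at the boundary, and then compare text with text. A small sketch follows, assuming the page and the database both use UTF-8. `to_text` and `level_of` are hypothetical helpers, and 英语四级 is assumed to be the original Chinese label behind "English level four".

```python
# -*- coding: utf-8 -*-

# Assumed original label for the CET-4 exam category.
CET4_LABEL = u"英语四级"


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def level_of(category):
    """Map the exam-category cell to 4 or 6, mirroring the script's if/else."""
    return 4 if to_text(category) == CET4_LABEL else 6


raw = CET4_LABEL.encode("utf-8")  # bytes, as they might arrive from a file or socket
print(level_of(raw), level_of(CET4_LABEL), level_of(u"英语六级"))
```

Normalizing at one point means the comparison no longer depends on whether a value arrived as bytes or as text.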

That is all for this article. I hope it helps with your studies.
