This article shows how to use cookielib and urllib2 together with PyQuery in Python to scrape information from a web page; PyQuery does the HTML parsing. Refer to it if you need something similar. I had only just started learning Python when I got the idea of building a course schedule, so I started searching on Baidu.
My initial thinking went like this: I had already used urllib2 on an earlier project, the "wall" (two lines of code to fetch a web page), so the only thing left was parsing the HTML. I searched Baidu for "python parse html" and found a good article about PyQuery.
PyQuery is a jQuery-like library for Python: it lets you query HTML documents with jQuery syntax. You need to install it before use. On a Mac, install it as follows:
sudo easy_install pyquery
OK! Installed!
Let's give it a try:
from pyquery import PyQuery as pq

html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')  # now you have the HTML
classes = html('.haveclass')  # select elements by class name
# If you are familiar with jQuery, you will appreciate how convenient PyQuery is.
# For more information, see the PyQuery API.
At this point it looks as if PyQuery is all you need to grab the course schedule. However, running the code above as-is will certainly fail, because you have not logged in yet!
So before that line can fetch the right page, we need to simulate a login to the academic affairs system. I recalled that urllib can send a POST request, so I searched Baidu for "urllib post".
This is the simplest example of simulating a POST request:
import urllib
import urllib2
import cookielib

# The cookie jar stores the session cookie returned by the login page.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
# Make urllib2.urlopen use this cookie-aware opener from now on.
urllib2.install_opener(opener)

# Passing form data as the second argument turns the request into a POST.
req = urllib2.Request("http://seam.ustb.edu.cn:8080/jwgl/Login",
                      urllib.urlencode({"username": "41255029", "password": "123456", "usertype": "student"}))
req.add_header("Referer", "http://xxoo.com")
resp = urllib2.urlopen(req)
# cookielib is used here. I don't know much about it yet and will look into it later.
# urllib and urllib2 are both used; urllib2 is probably an extension of urllib [233, it made me think of Three Kingdoms]
In this minimal example, I submit my campus account credentials as form data to the login page to simulate a login.
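On Python 3 the same modules are available under different names: cookielib became http.cookiejar, and urllib2 and urllib.urlencode moved into urllib.request and urllib.parse. A minimal sketch of the same login request under those names follows. Here build_login_request is a hypothetical helper, the URL and credentials are placeholders, and the request is only constructed, not sent.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password):
    """Build (but do not send) a POST request carrying the login form."""
    form = urllib.parse.urlencode({
        "username": username,
        "password": password,
        "usertype": "student",
    }).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    req = urllib.request.Request(login_url, data=form)
    req.add_header("Referer", "http://xxoo.com")
    return req


# Cookie-aware opener: cookies set by the server are kept in cookie_jar.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)
opener.addheaders = [
    ("User-agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)")
]

# Placeholder URL and credentials, for illustration only.
req = build_login_request("http://example.invalid/Login", "student-id", "secret")
print(req.get_method(), req.data)
```

Calling opener.open(req) would send the form; any cookies the server sets are stored in cookie_jar and sent automatically with later requests made through the same opener.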
Now that we are logged in to the academic affairs system, we can parse the HTML with PyQuery to get the course list from the page.
html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')
self.render("index.html", data=html('.haveclass'))
Finally:
I found that PyQuery is not only very convenient for parsing HTML; because the page is fetched on the server side, it can also be used to grab data from other domains. NICE!!!
I hope this helps.