I was bored and suddenly remembered an old idea of mine: building a timetable app. So I opened Baidu.
At first I thought: back when I was writing the "wall", I used urllib2 to fetch a page in "two lines of code", so the only thing left was parsing the HTML. So I searched Baidu for "Python parsing html" and found a good article that introduced pyquery.
pyquery is a jQuery-like library for Python that lets you query HTML documents with jQuery syntax. It has to be installed before use. On a Mac, install it like this:
sudo easy_install pyquery
Ok! It's all set!
Let's give it a try:
from pyquery import PyQuery as pq

# Fetch the HTML of the Undergraduate Teaching Network home page
html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')
# Select elements by class name
classes = html('.haveclass')

If you are familiar with jQuery, you will already see how convenient pyquery is. For more usage, see the pyquery API.
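To get a feel for what a class selector like the one above does, here is a rough sketch that uses only Python 3's standard library html.parser instead of pyquery. It is a swapped-in technique, not how pyquery works internally. The ClassTextCollector class, the texts_with_class helper and the sample markup are all made up for illustration, and the real timetable page may look quite different.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0       # > 0 while we are inside a matching element
        self.results = []    # one text string per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # a nested tag inside a match
        elif self.wanted in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        # Simplification: void tags such as <br> inside a match are not handled.
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def texts_with_class(html, wanted):
    """Return the stripped text of each element carrying the class `wanted`."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical markup standing in for the timetable page.
SAMPLE = """
<table>
  <tr><td class="haveclass">Calculus <b>Room 101</b></td><td></td></tr>
  <tr><td></td><td class="haveclass odd">Physics</td></tr>
</table>
"""

courses = texts_with_class(SAMPLE, "haveclass")
print(courses)
```

With pyquery the same selection is a single call, which is exactly why it is so pleasant to use.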
So it seems that once you know pyquery you can grab the timetable. But if you run my code as it is, it will certainly fail, because I haven't logged in yet!
So before this code can fetch the right page, we need to simulate logging in to the Undergraduate Teaching Network. At this point I remembered that urllib can send simulated POST requests, so I searched Baidu for "urllib post".
Here is one of the simplest examples of a simulated POST request:
import urllib
import urllib2
import cookielib

# Keep cookies between requests so the login session survives
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)

# Passing form data makes this a POST request
req = urllib2.Request("http://seam.ustb.edu.cn:8080/jwgl/Login", urllib.urlencode({"username": "41255029", "password": "123456", "usertype": "student"}))
req.add_header("Referer", "http://xxoo.com")
resp = urllib2.urlopen(req)

# cookielib is used here. I don't really understand it yet and will figure it out slowly.
# urllib and urllib2 are both used as well. urllib2 is probably an "expansion pack" for urllib (lol, that reminds me of Sanguosha).
This minimal example uses my campus network account to submit the form data to the login page and so simulates a login.
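For anyone on Python 3, where urllib2 and cookielib have become urllib.request and http.cookiejar, the same request might be put together roughly like this. It is only a sketch: the helper names build_opener_with_cookies and build_login_request are mine, the form field names are an assumption about what the login form expects, and the request is built but deliberately not sent, so it runs without touching the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://seam.ustb.edu.cn:8080/jwgl/Login"
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"


def build_opener_with_cookies():
    """Return (opener, jar); the jar stores whatever cookies the server sets."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", USER_AGENT)]
    return opener, jar


def build_login_request(username, password):
    """Build (but do not send) the form POST for the login page."""
    # Field names are assumed; check the real login form before relying on them.
    form = urllib.parse.urlencode(
        {"username": username, "password": password, "usertype": "student"})
    # Supplying `data` turns the request into a POST; in Python 3 it must be bytes.
    req = urllib.request.Request(LOGIN_URL, data=form.encode("ascii"))
    req.add_header("Referer", "http://xxoo.com")
    return req


opener, jar = build_opener_with_cookies()
req = build_login_request("41255029", "123456")
print(req.get_method(), req.full_url)
print(req.data)
# opener.open(req) would actually send the request; it is left out on purpose.
```

Keeping the opener and the jar together matters because later requests have to go through the same opener for the session cookie to be sent back.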
Now that we are logged in to the Undergraduate Teaching Network, we can combine this with the pyquery HTML parsing from earlier to get the timetable from the page.
html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')
self.render("index.html", data=html('.haveclass'))
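Why does the second request count as logged in? As far as I understand it, the cookie jar from the login example remembers the session cookie and attaches it to later requests to the same host. Below is a small offline sketch of that mechanism with Python 3's http.cookiejar. The cookie name and value are hypothetical, and make_session_cookie is a helper I made up that stands in for a cookie the server would normally set in its login response.

```python
import http.cookiejar
import urllib.request


def make_session_cookie(name, value, domain):
    """Build a cookie by hand, standing in for one set by a login response."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = http.cookiejar.CookieJar()
# Hypothetical session cookie; the real site decides the actual name and value.
jar.set_cookie(make_session_cookie("JSESSIONID", "abc123", "seam.ustb.edu.cn"))

# A later request to the same host gets the cookie attached.
page = urllib.request.Request("http://seam.ustb.edu.cn:8080/jwgl/index.jsp")
jar.add_cookie_header(page)   # HTTPCookieProcessor does this before sending
cookie_header = page.get_header("Cookie")

# A request to a different host gets nothing.
other = urllib.request.Request("http://example.com/")
jar.add_cookie_header(other)

print(cookie_header)
print(other.get_header("Cookie"))
```

One thing worth verifying: whether pyquery's own page fetch reuses the installed opener may depend on the pyquery version and on which HTTP library it uses under the hood, so if the timetable comes back empty, that is the first place I would look.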
Results show
Finally:
I found that pyquery is useful not only for parsing HTML but also as a tool for fetching data across domains. Nice!
I hope this helps everyone.