Crawling Web Pages in Python with cookielib, urllib2, and pyquery

Source: Internet
Author: User
I was bored and suddenly remembered my earlier idea of building a timetable app, so I started searching Baidu.

At first I thought: when I wrote the "wall" project, I already used urllib2's "two lines of code to fetch a page", so only the HTML parsing was left. So I searched Baidu for "Python parse HTML" and found a good article that introduced pyquery.

pyquery is a Python implementation of jQuery that can parse HTML documents using jQuery syntax. It must be installed before use; on a Mac, install it as follows:

sudo easy_install pyquery

Ok! It's all set!

Let's give it a try:

from pyquery import PyQuery as pq
html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')
# We now have the HTML of the undergraduate teaching network homepage
classes = html('.haveclass')
# Select elements by class name

If you are familiar with jQuery, you will already see how convenient pyquery is. For more usage, see the pyquery API.

It seems that with pyquery I can already grab the timetable. But if you run my code as is, it will certainly fail, because I haven't logged in yet!

So before that code can fetch the right page, we need to simulate a login to the undergraduate teaching network. I remembered that urllib can simulate POST requests, so I searched Baidu for "urllib post".

This is one of the simplest examples of a simulated POST request:

import urllib
import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)
req = urllib2.Request("http://seam.ustb.edu.cn:8080/jwgl/Login", urllib.urlencode({"username": "41255029", "password": "123456", "usertype": "student"}))
req.add_header("Referer", "http://xxoo.com")
resp = urllib2.urlopen(req)
# cookielib is used here; I don't understand it yet and will learn it gradually
# urllib and urllib2 are both used; urllib2 is probably an "expansion pack" for urllib (233, that reminds me of Sanguosha)

This minimal example submits form data to the login page with my campus network account to simulate a login.
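The same login steps can be split into small functions, which makes it easier to see what each part does. This is only a sketch: `make_session` and `build_login_request` are hypothetical helper names, and the `try`/`except` import is there because Python 3 moved `cookielib`, `urllib2`, and `urllib.urlencode` into `http.cookiejar`, `urllib.request`, and `urllib.parse`.

```python
try:  # Python 3 module names
    from http.cookiejar import CookieJar
    from urllib.parse import urlencode
    from urllib.request import HTTPCookieProcessor, Request, build_opener
except ImportError:  # Python 2 names, as used in this article
    from cookielib import CookieJar
    from urllib import urlencode
    from urllib2 import HTTPCookieProcessor, Request, build_opener


def make_session(user_agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"):
    """Return (opener, jar); cookies set by any response are stored in jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", user_agent)]
    return opener, jar


def build_login_request(login_url, fields):
    """Build a POST request; supplying a body is what turns the request into a POST."""
    body = urlencode(fields).encode("ascii")
    return Request(login_url, body)


# Usage (performs real network I/O, so it is left commented out):
# opener, jar = make_session()
# opener.open(build_login_request(login_url, form_fields))
# Any cookies the server sets in its response are then stored in jar.
```

Returning the opener instead of installing it globally keeps the cookies tied to one object, so it is clear which requests are sent as the logged-in user.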

Now that we are logged in to the undergraduate teaching network, we can combine this with the earlier pyquery HTML parsing to get the timetable from the page.

html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')
self.render("index.html", data=html('.haveclass'))
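Fetching the page through pyquery's `url=` argument depends on pyquery using the same cookie-aware opener for its own request. A more explicit variant, sketched here under assumptions, downloads the page with the logged-in opener and hands the HTML string to pyquery. The helper names and the default `utf-8` encoding are illustrative guesses, not details taken from the site.

```python
def fetch_html(opener, url, encoding="utf-8"):
    """Download url through opener (so its cookies are sent) and decode the body."""
    response = opener.open(url)
    # encoding is an assumption; use whatever charset the page actually declares
    return response.read().decode(encoding)


def select_from_page(opener, url, selector, encoding="utf-8"):
    """Fetch a page with the logged-in opener and apply a jQuery-style selector."""
    from pyquery import PyQuery as pq  # third-party; imported lazily
    return pq(fetch_html(opener, url, encoding))(selector)
```

With the opener from the login step, `select_from_page(opener, url, '.haveclass')` would play the role of `pq(url=...)('.haveclass')`.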

Finally:

I found that pyquery is useful not only for parsing HTML but also as a tool for fetching data across domains. Nice!

I hope this helps.
