Use cookielib and urllib2 in Python with PyQuery to capture web page information

Last Update:2018-05-05 Source: Internet

Author: User


This article introduces how to use cookielib and urllib2 in Python together with PyQuery to capture web page information; PyQuery does most of the work by parsing the HTML. Refer to it if you need it. While studying recently, I suddenly had the idea of building a course schedule, so I started searching on Baidu.

At first, my thinking went like this: when I wrote my earlier "wall" project, I used urllib2 (two lines of code are enough to fetch a web page), so the only thing left was parsing the HTML. So I searched Baidu for "python parse html" and found a good article about PyQuery.

PyQuery is a Python implementation of jQuery: it lets you use jQuery syntax to parse HTML documents. You need to install it before use. On a Mac, install it as follows:

sudo easy_install pyquery

OK! Installed!

Let's give it a try:

from pyquery import PyQuery as pq

html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')  # now you have the HTML
classes = html('.haveclass')  # get elements by class name
# If you are familiar with jQuery, you will appreciate how convenient PyQuery is.
# For more information, see the PyQuery API.

At this point it looks as if we already know how to use PyQuery to grab the course table. However, running this code as is will certainly fail, because we have not logged in yet!

Therefore, before this line can capture the right page, we need to simulate logging in to the academic affairs site. I remembered that urllib can simulate a POST request, so I searched Baidu for "urllib post".

This is the simplest example of simulating a post request:

import urllib
import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)
req = urllib2.Request(
    "http://seam.ustb.edu.cn:8080/jwgl/Login",
    urllib.urlencode({"username": "41255029", "password": "123456", "usertype": "student"}))
req.add_header("Referer", "http://xxoo.com")
resp = urllib2.urlopen(req)
# cookielib is used here; I don't know much about it yet and will study it later.
# urllib and urllib2 are both used; urllib2 is probably an extension package of urllib [233, it made me think of Three Kingdoms]

In this minimal example, I submit form data with my campus network account to the login page to simulate logging in.
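The login code in this article is Python 2. In Python 3 the same pieces live in `http.cookiejar`, `urllib.request`, and `urllib.parse`. Here is a rough sketch of the equivalent setup, assuming Python 3: the helper names, the `example.edu` URL, and the credentials are placeholders of mine, while the form field names are taken from the article. Nothing is sent over the network until `login()` is called.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Placeholder URL; substitute the real login endpoint.
LOGIN_URL = "http://example.edu/login"


def build_session():
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [
        ("User-agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)")
    ]
    return opener, jar


def build_login_request(username, password, usertype="student", url=LOGIN_URL):
    """Build the form request; supplying data= is what makes it a POST."""
    form = urllib.parse.urlencode(
        {"username": username, "password": password, "usertype": usertype}
    ).encode("ascii")  # Python 3 expects the request body as bytes
    return urllib.request.Request(url, data=form)


def login(opener, username, password):
    """Send the login form; any session cookies end up in the opener's jar."""
    return opener.open(build_login_request(username, password), timeout=10)


opener, jar = build_session()
request = build_login_request("my-student-id", "my-password")
print(request.get_method(), request.data)
```

Keeping the opener as a local object, instead of installing it globally, makes it explicit which later requests carry the login cookies.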

Now that we are logged in to the academic affairs site, we can parse the HTML with PyQuery to get the course list from the page.

html = pq(url=u'http://seam.ustb.edu.cn:8080/jwgl/index.jsp')
self.render("index.html", data=html('.haveclass'))

Result Display

Finally:

I found that PyQuery is not only very convenient for parsing HTML, but can also serve as a tool for cross-origin data capture. Nice!

I hope this helps.

This article is an English version of an article originally written in Chinese on aliyun.com.
