Use Python to capture web pages

Source: Internet
Author: User
Tags: control characters, http, cookie

 

Use Python to capture web pages (for example, new things on Renren and group-buying site information)

From http://www.pinkyway.info/2010/12/19/fetch-webpage-by-python?replytocom=448

By yingfengster

BeautifulSoup, Python, urllib, Renren, group buying · 12 comments

I wrote these little things a while ago but never found the time to write them up properly. Today my eyes are up to it, so here goes ~~ (the blog also only gets updated when the blogger isn't nursing some ache or other ~)

Python web page capture basics

Python has many modules for network applications, such as ftplib for FTP operations and smtplib and poplib for sending and receiving email. With these modules alone you could write your own FTP client or email client; I have tried simple Python scripts that send and receive mail and drive my FTP server. But that is not today's topic. The modules we will use today are urllib, urllib2, cookielib and BeautifulSoup, so here is a brief introduction. urllib and urllib2 both handle URL-related operations: urllib can download a file from a given URL, and can encode or decode strings so they become legal URL strings. urllib2 does a bit more than urllib: it has various handlers and openers that can deal with more complex situations such as HTTP authentication, proxy servers and cookies. The third module, cookielib, as its name implies, is dedicated to cookie handling, and I have never gone any deeper into it than that. The last package, BeautifulSoup ("beautiful soup"), is a third-party package dedicated to parsing HTML and XML files. It is very beginner-friendly, and we will rely on it to parse the source code of web pages. It can be downloaded from here, or installed with easy_install; there is also Chinese user documentation with plenty of examples, which is quite good.
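As a quick illustration of what urllib alone can do, here is a minimal sketch (the image URL below is only an example and may no longer exist):

import urllib

# Download a file from a URL straight to the local disk.
urllib.urlretrieve('http://www.baidu.com/img/baidu_sylogo1.gif', 'baidu_logo.gif')

# Percent-encode strings so they are safe to embed in a URL.
print urllib.quote('hello world')          # -> hello%20world
print urllib.urlencode({'wd': 'python'})   # -> wd=python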

Let's look at the simplest web page capture first. Capturing a web page really just means downloading the source code of the page you want and then analyzing it to extract the information that is useful to you. The simplest capture takes only a line or two:

import urllib
html_src = urllib.urlopen('http://www.baidu.com').read()

This fetches the HTML source of the Baidu homepage (print html_src to see it), and it really is that easy: urllib's urlopen function returns a "file-like" object, so you can call read() on it directly. However, the printed content is a mess with no layout, and the encoding problem is not handled, so Chinese characters do not display properly. This is where the BeautifulSoup package comes in. We can initialize a BeautifulSoup object from the source string html_src obtained above:

from BeautifulSoup import BeautifulSoup
parser = BeautifulSoup(html_src)

From here on, the rest of the work on the HTML source is handled through the parser variable. For instance, calling parser.prettify() displays the source in a relatively pretty, indented form, and the Chinese characters now show up, because BeautifulSoup handles character encodings automatically and converts its results to Unicode. In addition, BeautifulSoup can quickly locate tags that match given conditions, which we will use later ~
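As a small taste of what the parser gives us, here is a sketch (nothing below is specific to Baidu's page structure):

# Pretty-print the whole document with indentation; Chinese text comes back
# readable because BeautifulSoup decodes everything to unicode.
print parser.prettify()

# A few of the tag-locating helpers we will lean on later.
print parser.title.string        # text inside the <title> tag, if the page has one
all_links = parser.findAll('a')  # every <a> tag in the page
first_img = parser.find('img')   # the first <img> tag, or None if there is none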

 

Capture new things from Renren

That covers the simplest kind of capture, but we usually have to deal with more complicated situations. Take Renren: new things are only shown after you log in to your account, so we have to turn to the urllib2 module. Think of it this way: we need an opener to open your Renren homepage, but to reach the homepage we must authenticate first, and that login in turn requires cookie support, so we have to attach a cookie-processing handler to this opener:

import urllib, urllib2, cookielib
from BeautifulSoup import BeautifulSoup

myCookie = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(myCookie)

First we import all the modules we need, then use HTTPCookieProcessor from the urllib2 module to build a cookie-processing handler, passing a CookieJar object from the cookielib module as the parameter. The CookieJar keeps track of HTTP cookies: in short, it extracts cookies from HTTP responses and sends them back on subsequent HTTP requests. Then we use urllib2's build_opener to build the opener we need, an opener that satisfies our cookie-handling requirements.
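From now on this opener can be used wherever we would have used urlopen, and cookies set by the server are carried over to the next request automatically. A tiny sketch (without logging in first, Renren will of course just hand back its login page):

# opener.open() behaves like urllib.urlopen(), except that every request
# passes through our cookie handler, so session cookies are remembered.
html = opener.open('http://www.renren.com').read()
print len(html)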

So how do we know what the login actually sends? As mentioned above, we need to capture the HTTP requests, that is, to see what packets are sent when logging in to Renren. Experts can use command-line packet capture tools such as tcpdump; I will leave that to them. Here we use Firefox's HttpFox plugin instead, which you can install from here. After installing the plugin, use Firefox to log in to Renren: enter your account and password, but do not press the login button yet. Open the HttpFox plugin from the status bar, click its Start button to begin capturing packets, then click Renren's login button. Once the login finishes, click HttpFox's Stop to end the capture. If everything works properly, you should see something like the following information.


We can see that when logging in to Renren, the browser POSTs four form fields to Renren's server, two of which are your account and password. Now we can use code to simulate the same request.

post_data = {
    'email': 'xxxxx',
    'password': 'xxxxx',
    'origURL': 'http://www.renren.com/Home.do',
    'domain': 'renren.com'
}
req = urllib2.Request('http://www.renren.com/PLogin.do', urllib.urlencode(post_data))
html_src = opener.open(req).read()
parser = BeautifulSoup(html_src)

First we build a dictionary holding the POST data we just captured, then construct a Request object with urllib2's Request class. The URL we submit to is the URL part of the POST request captured above, and after it comes our POST data; note that the data must be re-encoded with urllib's urlencode method so that it becomes a legal URL-encoded string. The rest of the process is the same as the simple capture described earlier, except that what we open is not a bare URL but a Request that wraps the URL together with the POST data. As before, the returned source code is handed to BeautifulSoup for later processing.
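To make the encoding step concrete: urlencode flattens the dictionary into an application/x-www-form-urlencoded string, and giving that string to urllib2.Request as the second argument is what turns the request into a POST (the output below is illustrative; the key order may differ):

print urllib.urlencode(post_data)
# e.g. email=xxxxx&password=xxxxx&origURL=http%3A%2F%2Fwww.renren.com%2FHome.do&domain=renren.com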

Now our parser holds the source code of a page containing our friends' new things. How do we extract the useful information from it? To analyze the page we can hand it to Firefox's Firebug (you can download it here). Log in to your Renren account, right-click someone's status among the new things, and select "view element"; a window like the following appears, showing the source code of the part you clicked:

We can see that each new thing corresponds to an article tag. Let's take a closer look at the details of the article tag:

The h3 tag contains the friend's name and status. Of course, there are also a few non-mainstream emoticons mixed in. You should be familiar enough with HTML, so what we need to capture is this h3!

Capturing tag content with BeautifulSoup

Now it is time for the parser to show what it can do. BeautifulSoup provides many ways to locate tags; in short, the main ones are the find function and the findAll function. Their common parameters are name and attrs: name is the tag name, given as a string, and attrs are the tag's attributes, given as a dictionary. For example, find('img', {'src': 'abc.jpg'}) returns a tag like <img src="abc.jpg">. The difference between find and findAll is that find returns only the first tag that matches, while findAll returns the list of all matching tags. Once we have a tag, there are many convenient ways to reach its children: with the dot operator, tag.a gets the first child tag a under the tag, and tag.contents gets the list of all its children; more usage can be found in the documentation. Now let's capture the content we want.
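A few toy calls to make the API concrete (a sketch; the tag names and attribute values are made up for illustration):

tag = parser.find('img', {'src': 'abc.jpg'})   # first matching <img>, or None
imgs = parser.findAll('img')                   # list of every <img> tag
feed = parser.find('div', 'feed-list')         # a bare string as 2nd argument matches the CSS class
if feed is not None:
    first_link = feed.a        # dot operator: first <a> child tag under the div
    children = feed.contents   # list of all direct children (tags and text nodes)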

article_list = parser.find('div', 'feed-list').findAll('article')
for my_article in article_list:
    state = []
    for my_tag in my_article.h3.contents:
        factor = my_tag.string
        if factor != None:
            factor = factor.replace(u'\xa0', '')
            factor = factor.strip(u'\r\n')
            factor = factor.strip(u'\n')
            state.append(factor)
    print ' '.join(state)

Here, find('div', 'feed-list').findAll('article') first gets the div tag whose class attribute is 'feed-list' (when the second parameter is a plain string it stands for the CSS class attribute), and then gets the list of all article tags under it; comparing with the figure above, this line really fetches the list of all new things. We then walk through the list, take the h3 tag of each article tag, and pull out its content. Where a tag contains text directly, we can get it through the string attribute. Finally we strip out some control characters such as line breaks and non-breaking spaces, and print the result. Of course this only fetches a small part of the feed; the "more new things" function is not done yet, but with HttpFox I don't think it would be hard to implement.

A group-buying information aggregation gadget

With the same knowledge we can build some fun applications, for example aggregating group-buying information, which is all the rage right now. The idea is simple: analyze each site's source code and extract the deal title, image and price, and that's about it. The source code file is released here; study it if you are interested! The interface is built with PyQt; the button functions are not implemented yet, and it can only scrape three sites: Meituan, Nuomi and QQ group buying. Pick one from the drop-down list and the deal is displayed; the image is saved to the local directory ~
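Independent of the PyQt interface, the scraping core looks roughly like the sketch below. The URL, tag names and class names are hypothetical placeholders; each real site needs its own selectors, found with Firebug just as we did for Renren:

import urllib
from BeautifulSoup import BeautifulSoup

def fetch_deal(url):
    # Download the group-buying page and parse it.
    soup = BeautifulSoup(urllib.urlopen(url).read())
    # The selectors below are placeholders -- adjust them per site.
    title = soup.find('h1')
    price = soup.find('span', 'price')
    img = soup.find('img', {'class': 'deal-pic'})
    if img is not None and img.get('src'):
        urllib.urlretrieve(img['src'], 'deal.jpg')   # save the deal picture locally
    return (title.string if title else None,
            price.string if price else None)

# Example call (the selectors above would have to match the real page first):
# print fetch_deal('http://www.meituan.com/')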

The close ("X") button function is not implemented either ~ Here is the file download: one .pyw main-window file, and the other .py files hold the UI and the resources.

 
