A brief introduction to RSS processing in Python

Last Update:2017-01-19 Source: Internet

Author: User


RSS is an abbreviation with several expansions: "RDF Site Summary", "Really Simple Syndication", "Rich Site Summary", and perhaps others. Behind this confusion of names is a story far more surprising than you would expect from such an ordinary technical field. RSS is a simple XML format for distributing summaries of content on a Web site. It can be used to share a wide range of information, including (but not limited to) news flashes, Web site updates, event calendars, software updates, featured content collections, and items on Web-based auctions.

RSS was created by Netscape in 1999 to allow content from many sources to be gathered into the Netcenter portal (which no longer exists). Web enthusiasts in the UserLand community became early supporters of RSS, and it quickly became a very popular format. That popularity led to disagreement over how to improve RSS so that it could be used in more places, and the disagreement split RSS development. One group chose an RDF-based approach, designed to take advantage of the large number of RDF tools and modules, while the other group chose a more stripped-down approach. The former is called RSS 1.0, and the latter RSS 0.91. Just last month, the competition between the two intensified with the appearance of a new version of the non-RDF variant, which its creators call "RSS 2.0".

RSS 0.91 and 1.0 are very popular, and they are used by numerous portal sites and Web logs. In fact, the blogging community is the primary user of RSS, and RSS underlies some of the most impressive existing networks for XML exchange. These networks have grown organically and are arguably the most successful networks of XML services in existence. RSS qualifies as an XML service because it exchanges XML information over an Internet protocol (the vast majority of RSS exchanges are simple HTTP GET requests for RSS documents). In this article, we cover just a handful of the Python tools that you can use to work with RSS. We do not provide a technical introduction to RSS, because you can find that in many other articles (see Resources). We recommend that you first gain a basic familiarity with RSS and an understanding of XML. You do not need to understand RDF.

Note: Because RSS uses XML descriptions but does not use WSDL, we treat RSS as an "XML service" rather than a "Web service".

rss.py

rss.py, written by Mark Nottingham, is a Python library for RSS processing. It is very complete and well written. It requires Python 2.2 and PyXML 0.7.1. Installation is simple: download the Python file from Mark's home page and copy it to a directory on your PYTHONPATH.

Most users of rss.py need to care about only two of the classes it offers: CollectionChannel and TrackingChannel. The latter seems to be the more useful of the two. TrackingChannel is a data structure that contains all of the RSS data, indexed by the key of each item. CollectionChannel is a similar data structure, but it is organized more like the RSS document itself: the top-level channel information points to the item details by using hash values for the item URLs. You will probably also use the namespace declarations in the RSS.ns structure. Listing 1 is a simple script that downloads and parses the RSS feed for Python news and prints all the information from each item in a simple list.

Listing 1: A simple exercise using rss.py

from RSS import ns, CollectionChannel, TrackingChannel

#Create a tracking channel, which is a data structure that
#indexes RSS data by item URL
tc = TrackingChannel()

#Returns the RSSParser instance used, which can usually be ignored
tc.parse("http://www.python.org/channews.rdf")

RSS10_TITLE = (ns.rss10, 'title')
RSS10_DESC = (ns.rss10, 'description')

#You can also use tc.keys()
items = tc.listItems()
for item in items:
    #Each item is a (url, order_index) tuple
    url = item[0]
    print "RSS Item:", url
    #Get all of the data for the item as a Python dictionary
    item_data = tc.getItem(item)
    print "Title:", item_data.get(RSS10_TITLE, "(none)")
    print "Description:", item_data.get(RSS10_DESC, "(none)")

We start by creating a TrackingChannel instance and populating it with data parsed from the RSS feed at http://www.python.org/channews.rdf. rss.py uses tuples as the property names for RSS data. This may seem unusual if you are not accustomed to XML processing, but it is an effective way of being precise about what was in the original RSS file. In effect, an RSS 0.91 title element is considered different from an RSS 1.0 element of the same name. The application has enough data to ignore the distinction if it likes, by ignoring the namespace part of each tuple, but the basic API is tied to the syntax of the original RSS file so that this information is not lost. In the code, we use this property data to gather all the items from the news feed for display. Note that we are careful not to assume that any particular item has any particular property. We retrieve properties using the safe form shown in the following code.

    print "Title:", item_data.get(RSS10_TITLE, "(none)")

This form supplies a default value if the property is not found, unlike the following form, which raises a KeyError when the property is missing.

    print "Title:", item_data[RSS10_TITLE]

This caution is necessary because you cannot know in advance which elements an RSS feed uses. Listing 2 shows the output from Listing 1.
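To make the tuple-keyed lookups described above more concrete, here is a minimal sketch of the same idea. It does not use rss.py: it is written for Python 3 with only the standard library's xml.etree.ElementTree. The sample feed, the example.org URL, the helper names split_tag and index_items, and the use of None as the namespace slot for un-namespaced elements (such as those in RSS 0.91) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RSS10_NS = "http://purl.org/rss/1.0/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RSS 1.0 fragment with one item and no description.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/a">
    <title>First item</title>
    <link>http://example.org/a</link>
  </item>
</rdf:RDF>"""


def split_tag(tag):
    """Turn ElementTree's '{uri}local' tag into a (uri, local) tuple."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return (uri, local)
    return (None, tag)  # assumption: None marks "no namespace"


def index_items(xml_text):
    """Index item properties by a (url, order_index) key."""
    root = ET.fromstring(xml_text)
    index = {}
    for order, item in enumerate(root.iter("{%s}item" % RSS10_NS)):
        url = item.get("{%s}about" % RDF_NS, "")
        index[(url, order)] = {
            split_tag(child.tag): (child.text or "") for child in item
        }
    return index


RSS10_TITLE = (RSS10_NS, "title")
RSS10_DESC = (RSS10_NS, "description")
PLAIN_TITLE = (None, "title")  # a title element with no namespace

index = index_items(SAMPLE)
for key, props in index.items():
    print("RSS Item:", key[0])
    print("Title:", props.get(RSS10_TITLE, "(none)"))
    print("Description:", props.get(RSS10_DESC, "(none)"))
```

Because the namespace is part of each key, looking up PLAIN_TITLE in this feed finds nothing even though a title element is present, which is the distinction between same-named elements that the text describes. The get() calls keep the loop from failing on the missing description.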

Listing 2: Output of Listing 1

$ python listing1.py
RSS Item: http://www.python.org/2.2.2/
Title: Python 2.2.2b1
Description: (none)
RSS Item: http://sf.net/projects/spambayes/
Title: spambayes project
Description: (none)
RSS Item: http://www.mems-exchange.org/software/scgi/
Title: scgi 0.5
Description: (none)
RSS Item: http://roundup.sourceforge.net/
Title: roundup 0.4.4
Description: (none)
RSS Item: http://www.pygame.org/
Title: pygame 1.5.3
Description: (none)
RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/pyrex/
Title: pyrex 0.4.4.1
Description: (none)
RSS Item: http://www.tundraware.com/software/hb/
Title: hb 1.
Description: (none)
RSS Item: http://www.tundraware.com/software/abck/
Title: abck 2.2
Description: (none)
RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
Title: lfm 0.9
Description: (none)
RSS Item: http://www.tundraware.com/software/waccess/
Title: waccess 2.0
Description: (none)
RSS Item: http://www.krause-software.de/jinsitu/
Title: jinsitu 0.3
Description: (none)
RSS Item: http://www.alobbs.com/pykyra/
Title: pykyra 0.1.0
Description: (none)
RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
Title: treewidgets 1.0a1
Description: (none)
RSS Item: http://civil.sf.net/
Title: civil 0.80
Description: (none)
RSS Item: http://www.stackless.com/
Title: stackless Python Beta
Description: (none)

Of course, your output may differ slightly, because the news items may have changed by the time you try it. The rss.py channel objects also provide methods for adding and modifying RSS information. You can use the output() method to write the result back out in RSS 1.0 format. Test this by writing back the information parsed in Listing 1. Start the script in interactive mode by running python -i listing1.py. At the resulting Python prompt, run the following example.

>>> result = tc.output(items)
>>> print result

The result is a printout of an RSS 1.0 document. For this to work you need rss.py version 0.42 or later; earlier versions had a bug in the output() method.
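For readers who have not seen the RSS 1.0 layout, the following sketch builds a small document of that shape by hand. It is not the output() method of rss.py and makes no claim about what that method emits: it is Python 3 code using only the standard library's xml.etree.ElementTree, and the function name build_rss10 and the example.org data are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS10_NS = "http://purl.org/rss/1.0/"

# Serialize with the conventional prefixes instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("", RSS10_NS)


def build_rss10(channel_url, channel_title, items):
    """Return an RSS 1.0 document as a string.

    items is a list of (url, title) pairs.
    """
    root = ET.Element("{%s}RDF" % RDF_NS)
    channel = ET.SubElement(
        root, "{%s}channel" % RSS10_NS, {"{%s}about" % RDF_NS: channel_url}
    )
    ET.SubElement(channel, "{%s}title" % RSS10_NS).text = channel_title
    ET.SubElement(channel, "{%s}link" % RSS10_NS).text = channel_url
    ET.SubElement(channel, "{%s}description" % RSS10_NS).text = channel_title
    # The channel lists its items in an rdf:Seq; the item bodies follow it.
    toc = ET.SubElement(channel, "{%s}items" % RSS10_NS)
    seq = ET.SubElement(toc, "{%s}Seq" % RDF_NS)
    for url, title in items:
        ET.SubElement(seq, "{%s}li" % RDF_NS, {"{%s}resource" % RDF_NS: url})
        item = ET.SubElement(
            root, "{%s}item" % RSS10_NS, {"{%s}about" % RDF_NS: url}
        )
        ET.SubElement(item, "{%s}title" % RSS10_NS).text = title
        ET.SubElement(item, "{%s}link" % RSS10_NS).text = url
    return ET.tostring(root, encoding="unicode")


document = build_rss10(
    "http://example.org/",
    "Example news",
    [("http://example.org/a", "First item"),
     ("http://example.org/b", "Second item")],
)
print(document)
```

The point to notice is the two-part structure: the channel carries a table of contents of item URLs, and each item is a sibling of the channel keyed by the same URL.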

rssparser.py

Mark Pilgrim provides another module for parsing RSS files. It does not offer all the features and options that rss.py does, but it provides a very liberal parser that copes with all the confusing variation in the RSS world. The following is an excerpt from the rssparser.py page:

As you can see, most RSS feeds are bad. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (most notably The Register's feed). Or just a general mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).

Then there are feeds that are too cutting-edge, like Aaron's feed. He puts an excerpt in the description element and puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), almost no news aggregators support it, and many parsers reject it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there is Jon Udell's feed, with the fullitem element that he made up himself.

Given that XML and Web services are supposed to increase interoperability, this is almost laughable. In any case, rssparser.py is designed to deal with all of these absurd situations.

Installing rssparser.py is also very simple. Download the Python file (see Resources), rename "rssparser.py.txt" to "rssparser.py", and copy it to a directory on your PYTHONPATH. I also recommend that you get the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python and helps keep an application thread from stalling if something goes wrong while fetching an RSS feed.
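The timeout concern above can be illustrated without the timeoutsocket module. The sketch below swaps in a different technique: the timeout argument that Python 3's standard urllib.request.urlopen accepts. The function names fetch_feed and fetch_feed_or_none and the default of 10 seconds are assumptions chosen for illustration, not part of either library discussed here.

```python
import urllib.request


def fetch_feed(url, timeout=10.0):
    """Fetch a feed and return its raw bytes.

    The timeout (in seconds) bounds blocking network operations, so a
    stalled server raises an exception instead of hanging the caller.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_feed_or_none(url, timeout=10.0):
    """Like fetch_feed, but return None on network or timeout errors."""
    try:
        return fetch_feed(url, timeout)
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None


if __name__ == "__main__":
    data = fetch_feed_or_none("http://www.python.org/channews.rdf", timeout=5.0)
    print("fetch failed" if data is None else "%d bytes" % len(data))
```

Returning None on failure lets a program that polls many feeds skip one unreachable server and move on to the next.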

Listing 3 is a script equivalent to Listing 1, but it uses rssparser.py instead of rss.py.

Listing 3: A simple exercise using rssparser.py

import rssparser
#Parse the data, returns a tuple: (data for channels, data for items)
channel, items = rssparser.parse("http://www.python.org/channews.rdf")
for item in items:
    #Each item is a dictionary mapping properties to values
    print "RSS Item:", item.get('link', "(none)")
    print "Title:", item.get('title', "(none)")
    print "Description:", item.get('description', "(none)")

As you can see, this code is much simpler. The trade-off between rss.py and rssparser.py is largely that the former has more features and preserves more syntactic information from the RSS feed, while the latter is simpler and more fault-tolerant (the rss.py parser accepts only well-formed XML).

The output of Listing 3 should be the same as the output in Listing 2.
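The difference between a parser that accepts only well-formed XML and a more forgiving one can be shown with a small sketch. This is not how rssparser.py works internally. It is a hypothetical illustration in Python 3 using the standard library, and it repairs just one of the problems mentioned earlier, the unescaped ampersand. The names BARE_AMP, parse_strict, and parse_forgiving and the sample feed are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Matches an ampersand that does not begin an entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

# Hypothetical feed with an unescaped ampersand, so it is not well-formed XML.
BROKEN = "<rss><channel><title>News & Notes</title></channel></rss>"


def parse_strict(xml_text):
    """Parse with a conforming XML parser; raises ET.ParseError on bad input."""
    return ET.fromstring(xml_text)


def parse_forgiving(xml_text):
    """Try a strict parse first, then retry after escaping bare ampersands."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return ET.fromstring(BARE_AMP.sub("&amp;", xml_text))


try:
    parse_strict(BROKEN)
except ET.ParseError as error:
    print("strict parser rejected the feed:", error)

root = parse_forgiving(BROKEN)
print("forgiving parser title:", root.findtext("channel/title"))
```

A real liberal parser has to handle many more kinds of damage, such as undefined entities or invalid HTML, which this sketch would still reject.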

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
