This article introduces RSS processing in Python. It comes from IBM's official developer technical documentation.
RSS is an abbreviation with several expansions: "RDF Site Summary", "Really Simple Syndication", "Rich Site Summary", and perhaps others. Behind this confusion of names lies a surprisingly eventful history for such a mundane area of technology. RSS is a simple XML format for distributing summaries of content on websites. It can be used to share a wide range of information, including (but not limited to) news items, Web site updates, event calendars, software updates, featured content collections, and items in Web-based auctions.
RSS was created by Netscape in 1999 to allow content from many sources to be gathered into the Netcenter portal (which no longer exists). Web enthusiasts in the UserLand community became early supporters of RSS, and it soon became a very popular format. This popularity led to disagreement over how to improve RSS so that it could be used in more places, and that disagreement split RSS development in two. One group chose an RDF-based approach in order to take advantage of the large number of RDF tools and modules, while the other chose a more stripped-down approach. The former is called RSS 1.0 and the latter RSS 0.91. Just last month, the competition between the two intensified with the appearance of a new version of the non-RDF variant of RSS, which its creators call "RSS 2.0".
RSS 0.91 and 1.0 are both very popular and are used by many portals and Web logs. In fact, the blogging community is the main user of RSS, and RSS is the reason behind some of the most impressive existing networks of XML exchange. These networks have grown organically and have truly become the most successful networks of XML services. RSS counts as an XML service because it is XML information exchanged over an Internet protocol (the vast majority of RSS exchanges are simple HTTP GET requests for RSS documents). In this article, we introduce a few Python tools for working with RSS. We do not provide a technical introduction to RSS, because that is available in many other articles (see references). We recommend that you be familiar with RSS and XML. You do not need to understand RDF.
[Because RSS is described using XML rather than WSDL, we treat RSS as an "XML service" rather than a "Web service". -Editor's note]
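To make the "simple HTTP GET of an XML document" point concrete, here is a minimal sketch in present-day Python 3 that uses only the standard library (the listings in this article use the Python 2 syntax required by the libraries they demonstrate). The sample feed text and the helper name item_titles are made up for illustration. In real use, the text would be the body of an HTTP response, for example from urllib.request.urlopen(url).read().

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 0.91 document standing in for the body of an HTTP response.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example channel</title>
    <link>http://example.com/</link>
    <description>A hypothetical feed</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def item_titles(feed_text):
    """Return (link, title) pairs for every item in an RSS 0.91 document."""
    root = ET.fromstring(feed_text)
    pairs = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is absent.
        link = item.findtext("link", "(none)")
        title = item.findtext("title", "(none)")
        pairs.append((link, title))
    return pairs


if __name__ == "__main__":
    for link, title in item_titles(SAMPLE_FEED):
        print("RSS Item:", link)
        print("Title:", title)
```

This sketch only handles the un-namespaced RSS 0.91 vocabulary. The libraries discussed in this article exist precisely because real feeds are more varied than this.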
RSS.py
RSS.py, written by Mark Nottingham, is a Python library for RSS processing. It is quite complete and well written. It requires Python 2.2 and PyXML 0.7.1. Installation is easy: download the Python file from Mark's home page and copy it somewhere on your PYTHONPATH.
Most users of RSS.py need only care about two of the classes it provides: CollectionChannel and TrackingChannel. The latter appears to be the more useful of the two. TrackingChannel is a data structure that contains all the RSS data, indexed by the key of each item. CollectionChannel is a similar data structure, but it is organized more like the RSS document itself: the top-level channel information points to the item details through hashes keyed by URL. You will probably also use the RSS.ns structure, which holds useful namespace declarations. Listing 1 is a simple script that downloads and parses the RSS feed for Python news and prints all the information from each item in a simple list.
Listing 1: a simple exercise using RSS.py
from RSS import ns, CollectionChannel, TrackingChannel

#Create a tracking channel, which is a data structure that
#Indexes RSS data by item URL
tc = TrackingChannel()

#Returns the RSSParser instance used, which can usually be ignored
tc.parse("http://www.python.org/channews.rdf")

RSS10_TITLE = (ns.rss10, 'title')
RSS10_DESC = (ns.rss10, 'description')

#You can also use tc.keys()
items = tc.listItems()
for item in items:
    #Each item is a (url, order_index) tuple
    url = item[0]
    print "RSS Item:", url
    #Get all the data for the item as a Python dictionary
    item_data = tc.getItem(item)
    print "Title:", item_data.get(RSS10_TITLE, "(none)")
    print "Description:", item_data.get(RSS10_DESC, "(none)")
We start by creating a TrackingChannel instance and populating it with data parsed from the RSS feed at http://www.python.org/channews.rdf. RSS.py uses tuples as the property names for RSS data. This may seem unusual to those not familiar with XML processing techniques, but it is a very effective way of being precise about what was in the original RSS file. In effect, an RSS 0.91 title element is not considered the same as an RSS 1.0 element of the same name. The application has enough data to ignore this distinction if it wishes, by ignoring the namespace part of each tuple; however, the basic API stays tied to the syntax of the original RSS file, so this information is not lost. In the code, we use this property data to gather all the items in the news feed for display. Notice that we are careful not to assume which properties any particular item has. We retrieve properties in a safe way, using the following form.
print "Title:", item_data.get(RSS10_TITLE, "(none)")
If the property is not found, this form supplies a default value, unlike the following form, which raises a KeyError when the property is missing.
print "Title:", item_data[RSS10_TITLE]
This caution is necessary because you can never know which elements a given RSS feed uses. Listing 2 shows the output of listing 1.
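The tuple-keyed lookups described above can be illustrated without RSS.py at all. The following is a rough sketch in Python 3 with a hand-built dictionary. The item data is hypothetical, the helper by_local_name is not part of RSS.py, and using None to stand for "no namespace" is an assumption made for this example rather than a statement about how RSS.py represents RSS 0.91 names.

```python
# RSS 1.0 elements live in this namespace; RSS 0.91 elements have no namespace.
RSS10_NS = "http://purl.org/rss/1.0/"

RSS10_TITLE = (RSS10_NS, "title")
RSS091_TITLE = (None, "title")  # assumption: None stands for "no namespace"

# Hypothetical item data, keyed by (namespace, local name) tuples.
item_data = {
    RSS10_TITLE: "Python 2.2.2b1",
    (RSS10_NS, "link"): "http://www.python.org/2.2.2/",
}


def by_local_name(data, name, default="(none)"):
    """Look up a property while ignoring the namespace part of each key."""
    for (namespace, local), value in data.items():
        if local == name:
            return value
    return default


if __name__ == "__main__":
    # Exact lookup: namespace and local name must both match.
    print("Title:", item_data.get(RSS10_TITLE, "(none)"))
    # Same local name, different namespace: a different key, so the default.
    print("Title:", item_data.get(RSS091_TITLE, "(none)"))
    # Namespace deliberately ignored.
    print("Title:", by_local_name(item_data, "title"))
    # Missing property: get() with a default never raises.
    print("Description:", by_local_name(item_data, "description"))
```

Keeping the namespace in the key costs a little convenience, but an application that does not care about the distinction can discard it, while one that does care could never recover it if the API had thrown it away.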
Listing 2: Output of listing 1
$ python listing1.py
RSS Item: http://www.python.org/2.2.2/
Title: Python 2.2.2b1
Description: (none)
RSS Item: http://sf.net/projects/spambayes/
Title: spambayes project
Description: (none)
RSS Item: http://www.mems-exchange.org/software/scgi/
Title: scgi 0.5
Description: (none)
RSS Item: http://roundup.sourceforge.net/
Title: Roundup 0.4.4
Description: (none)
RSS Item: http://www.pygame.org/
Title: Pygame 1.5.3
Description: (none)
RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
Title: Pyrex 0.4.4.1
Description: (none)
RSS Item: http://www.tundraware.com/Software/hb/
Title: hb 1.88
Description: (none)
RSS Item: http://www.tundraware.com/Software/abck/
Title: abck 2.2
Description: (none)
RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
Title: lfm 0.9
Description: (none)
RSS Item: http://www.tundraware.com/Software/waccess/
Title: waccess 2.0
Description: (none)
RSS Item: http://www.krause-software.de/jinsitu/
Title: JinSitu 0.3
Description: (none)
RSS Item: http://www.alobbs.com/pykyra/
Title: PyKyra 0.1.0
Description: (none)
RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
Title: TreeWidgets 1.0a1
Description: (none)
RSS Item: http://civil.sf.net/
Title: Civil 0.80
Description: (none)
RSS Item: http://www.stackless.com/
Title: Stackless Python Beta
Description: (none)
Of course, your output may differ slightly, because the news items may have changed by the time you try it. The RSS.py channel objects also provide methods for adding and modifying RSS information. You can write the result back out in RSS 1.0 format with the output() method. To try this, write back out the information parsed in listing 1. Start the script in interactive mode by running python -i listing1.py, and then run the following at the resulting Python prompt.
>>> result = tc.output(items)
>>> print result
The result is an RSS 1.0 document, printed to the screen. For this to work, you must have RSS.py version 0.42 or later; the output() method in earlier versions has a bug.
rssparser.py
Mark Pilgrim offers another module for parsing RSS files. It does not provide all the features and options that RSS.py does, but it does offer a very liberal parser that copes well with all the messy diversity in the RSS world. The following is an excerpt from the rssparser.py page:
As you can see, most RSS feeds are terrible. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (typically The Register's feed). Or just a muddled mix of RSS 0.9x and RSS 1.0 elements (Movable Type feeds).
Then there are feeds that are too cutting-edge, like Aaron's feed. He puts an excerpt in the description element and the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), almost no news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there is Jon Udell's feed, with the fullitem element that he made up himself.
It is almost ironic to consider that XML and Web services are supposed to increase interoperability. In any case, rssparser.py is designed to handle all of these absurd situations.
Installing rssparser.py is also very simple. Download the Python file (see references), rename it from "rssparser.py.txt" to "rssparser.py", and copy it to your PYTHONPATH. We also recommend getting the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python and so helps you fetch RSS feeds without stalling the application thread when something goes wrong.
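The timeoutsocket module was written for Python versions that had no built-in socket timeouts. As a hedged aside, later Python releases offer a comparable process-wide setting in the standard library, socket.setdefaulttimeout. The Python 3 sketch below uses that standard-library call rather than timeoutsocket itself. The 10-second value is an arbitrary assumption, and the helper name with_feed_timeout is hypothetical.

```python
import socket

FEED_TIMEOUT_SECONDS = 10.0  # assumption: an arbitrary illustrative value


def with_feed_timeout(fetch, *args):
    """Call fetch(*args) under a default socket timeout, then restore the old one."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(FEED_TIMEOUT_SECONDS)
    try:
        return fetch(*args)
    finally:
        # Restore whatever timeout was in effect before the call.
        socket.setdefaulttimeout(previous)


if __name__ == "__main__":
    # Stand-in for a real fetch function: it only reports the timeout that
    # newly created sockets would receive while it runs.
    print(with_feed_timeout(socket.getdefaulttimeout))
```

The default timeout applies only to sockets created while it is in effect, so a slow feed server causes an exception after the timeout instead of blocking the calling thread indefinitely.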
Listing 3 is a script equivalent to listing 1, but it uses rssparser.py instead of RSS.py.
Listing 3: a simple exercise using rssparser.py
import rssparser

#Parse the data, returns a tuple: (data for channels, data for items)
channel, items = rssparser.parse("http://www.python.org/channews.rdf")
for item in items:
    #Each item is a dictionary mapping properties to values
    print "RSS Item:", item.get('link', "(none)")
    print "Title:", item.get('title', "(none)")
    print "Description:", item.get('description', "(none)")
As you can see, this code is much simpler. The trade-off between RSS.py and rssparser.py is that the former has more features and preserves more of the syntactic information in the RSS feed, while the latter is simpler and is a more fault-tolerant parser (the RSS.py parser accepts only well-formed XML).
The output of listing 3 should be the same as the output shown in listing 2.
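The difference between a strict parser and a tolerant one can be seen with the standard library alone. The Python 3 sketch below feeds a made-up fragment containing an unescaped ampersand, one of the faults quoted earlier, to a conforming XML parser, and then to a toy repair step that escapes bare ampersands first. This is only an illustration of the idea. It is not how rssparser.py works internally, and the names strict_titles and lenient_titles are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A made-up feed fragment with an unescaped ampersand in the title.
BROKEN = "<rss><channel><item><title>Tips & tricks</title></item></channel></rss>"

# Matches an "&" that does not already begin an entity or character reference.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def strict_titles(text):
    """Parse with a conforming XML parser; raises ET.ParseError on bad input."""
    return [title.text for title in ET.fromstring(text).iter("title")]


def lenient_titles(text):
    """Escape bare ampersands, then parse. A toy repair for one kind of fault."""
    return strict_titles(BARE_AMPERSAND.sub("&amp;", text))


if __name__ == "__main__":
    try:
        strict_titles(BROKEN)
    except ET.ParseError as error:
        print("strict parser rejected the feed:", error)
    print("lenient parser found:", lenient_titles(BROKEN))
```

A pre-pass like this repairs only one class of error. Handling the full range of faults listed in the excerpt takes a parser designed to be forgiving from the start.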