I recently completed some data-mining tasks on Wikipedia. The work consists of these parts:
Parse the Wikipedia dump enwiki-pages-articles.xml;
Store the categories and pages in MongoDB;
Store the category names in Redis.
I tested the actual task performance of CPython 2.7.3 and PyPy 2.0 beta. The libraries I used are:
redis 2.7.2
pymongo 2.4.2
Additionally, CPython is backed by the following accelerators:
hiredis
pymongo C extensions
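As a side note, it is easy to confirm that these accelerators are actually active. A minimal check, assuming pymongo 2.x and redis-py 2.x (pymongo.has_c() is pymongo's own API; DefaultParser is how redis-py picks the hiredis parser, to the best of my knowledge):

import pymongo
import redis.connection

print(pymongo.has_c())                 # True when pymongo's C extensions are compiled in
print(redis.connection.DefaultParser)  # HiredisParser when hiredis is installed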
The tasks mostly involve database operations, so I didn't anticipate much benefit from PyPy (not to mention that the CPython database drivers are written in C).
I'll describe some of the interesting results below.
Extracting wiki page names
I needed a title -> page.id mapping for all Wikipedia categories, stored in Redis. The simplest solution would be to import enwiki-page.sql (which defines an RDB table) into MySQL and then transfer the data over to Redis. But I didn't want to add a MySQL dependency just for that (overkill! XD), so I wrote a simple parser for SQL INSERT statements in pure Python and imported the data from enwiki-page.sql directly into Redis.
This task is much more CPU-bound, so I was optimistic about PyPy again.
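The parser itself isn't reproduced here; as a rough illustration of the approach, here is a minimal sketch. The regex and the Redis layout are my assumptions, not the original code:

import re
import redis

# Each "INSERT INTO `page` VALUES ..." line packs many tuples like
# (page_id, namespace, 'title', ...); this naive pattern grabs the
# first three fields of every tuple.
ROW_RE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")

def import_pages(sql_path, r):
    for line in open(sql_path):
        if not line.startswith('INSERT INTO'):
            continue
        for page_id, ns, title in ROW_RE.findall(line):
            if int(ns) == 14:          # namespace 14 = category pages
                r.set(title, page_id)  # title -> page.id mapping

import_pages('enwiki-page.sql', redis.StrictRedis())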
time
PyPy:     169.00s user   8.52s system  90% CPU
CPython: 1287.13s user   8.10s system  96% CPU
I also built the reverse page.id -> category mapping the same way (my laptop's memory is too small to keep all of it around for my tests).
Filtering categories from enwiki-pages-articles.xml
To make the rest of the work easier, I needed to filter the category pages out of enwiki-pages-articles.xml and store them in the same XML format. I chose a SAX parser for this, since it is available on both PyPy and CPython, and in both cases is an externally compiled native package.
The code is simple:
from xml.sax import handler

class Stack(object):
    # Simple list-backed stack helper (assumed; the original helper is not shown).
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()
    def top(self):
        return self._items[-1]

class WikiCategoryHandler(handler.ContentHandler):
    """Class which detects category pages and stores them separately."""
    ignored = set(('contributor', 'comment', 'meta'))

    def __init__(self, f_out):
        handler.ContentHandler.__init__(self)
        self.f_out = f_out
        self.curr_page = None
        self.curr_tag = ''
        self.curr_elem = Element('root', {})
        self.root = self.curr_elem
        self.stack = Stack()
        self.stack.push(self.curr_elem)
        self.skip = 0

    def startElement(self, name, attrs):
        if self.skip > 0 or name in self.ignored:
            self.skip += 1
            return
        self.curr_tag = name
        elem = Element(name, attrs)
        if name == 'page':
            elem.ns = -1
            self.curr_page = elem
        else:  # we don't want to keep old pages in memory
            self.curr_elem.append(elem)
        self.stack.push(elem)
        self.curr_elem = elem

    def endElement(self, name):
        if self.skip > 0:
            self.skip -= 1
            return
        if name == 'page':
            self.task()
            self.curr_page = None
        self.stack.pop()
        self.curr_elem = self.stack.top()
        self.curr_tag = self.curr_elem.tag

    def characters(self, content):
        if content.isspace():
            return
        if self.skip == 0:
            self.curr_elem.append(TextElement(content))
            if self.curr_tag == 'ns':
                self.curr_page.ns = int(content)

    def startDocument(self):
        self.f_out.write("<root>\n")

    def endDocument(self):
        self.f_out.write("</root>\n")
        print("FINISH PROCESSING WIKIPEDIA")

    def task(self):
        # namespace 14 marks category pages
        if self.curr_page.ns == 14:
            self.f_out.write(self.curr_page.render())

class Element(object):
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.childrens = []
        self.append = self.childrens.append

    def __repr__(self):
        return "Element {}".format(self.tag)

    def render(self, margin=0):
        attrs = "".join([u' {}="{}"'.format(k, v)
                         for k, v in self.attrs.items()])
        if not self.childrens:
            return u"{0}<{1}{2}/>".format(" "*margin, self.tag, attrs)
        if len(self.childrens) == 1 and isinstance(self.childrens[0], TextElement):
            return u"{0}<{1}{2}>{3}</{1}>".format(
                " "*margin, self.tag, attrs, self.childrens[0].render())
        return u"{0}<{1}{2}>\n{3}\n{0}</{1}>".format(
            " "*margin, self.tag, attrs,
            "\n".join(c.render(margin+2) for c in self.childrens))

class TextElement(object):
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        return "TextElement"

    def render(self, margin=0):
        return self.content
Element and TextElement hold the tag and body information respectively, and provide a way to render it.
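For completeness, here is how such a handler can be driven; the file names are my assumptions, not from the original setup:

import xml.sax

with open('enwiki-categories.xml', 'w') as f_out:
    xml.sax.parse('enwiki-pages-articles.xml', WikiCategoryHandler(f_out))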
Here are the PyPy and CPython results I wanted to compare:
time
PyPy:    2169.90s
CPython: 4494.69s
I was very surprised at the result of PyPy.
Computing an interesting set of categories
I once wanted to compute an interesting set of categories: in the context of one of my applications, the set of categories derived from the Computing category. For this I needed to build a category graph providing the category -> subcategory relations.
Building the category -> subcategory graph
This task uses MongoDB as the data source and Redis as the destination. The algorithm is:
for each category.id in redis_categories (it holds the *category.id -> category title* mapping) do:
    title = redis_categories.get(category.id)
    parent_categories = mongodb get categories for title
    for all parent_cat in parent_categories do:
        redis_tree.sadd(parent_cat, title)  # add title to parent_cat's set
Sorry for writing such pseudocode, but I think it looks more compact; a rough Python translation follows below.
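In that translation, the Redis database numbers, the collection name, and the document schema are my illustrative assumptions:

import redis
import pymongo

redis_categories = redis.StrictRedis(db=0)  # category.id -> title
redis_tree = redis.StrictRedis(db=1)        # parent title -> set of child titles
pages = pymongo.MongoClient().wiki.pages    # assumed MongoDB collection

for category_id in redis_categories.keys('*'):
    title = redis_categories.get(category_id)
    doc = pages.find_one({'title': title})
    if doc is None:
        continue
    for parent_cat in doc.get('categories', []):
        redis_tree.sadd(parent_cat, title)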
So this task only copies data from one database to another. The results below were taken after MongoDB had warmed up (without the warm-up the data would be biased, since this Python task only consumes about 10% of the CPU). Timings are as follows:
time
PyPy:    175.11s user  66.11s system  64% CPU
CPython: 457.92s user  72.86s system  81% CPU
Traversing redis_tree (the Redis tree)
Once we have the redis_tree database, the only remaining problem is to traverse all the reachable nodes under the Computing category. To avoid cycles, we need to record which nodes have already been visited. Since I wanted to test the performance of Python database work, I used a Redis set to solve this.
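A minimal sketch of that traversal, assuming the redis_tree layout above (the names and database numbers are illustrative):

import redis

redis_tree = redis.StrictRedis(db=1)  # parent title -> set of child titles
visited_db = redis.StrictRedis(db=2)  # scratch set of already-seen nodes

def traverse(root='Computing'):
    queue = [root]
    while queue:
        node = queue.pop()
        # sadd returns 0 when the member was already in the set
        if not visited_db.sadd('visited', node):
            continue
        yield node
        queue.extend(redis_tree.smembers(node))

for category in traverse():
    print(category)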
time
PyPy:    14.79s user   6.22s system  69% CPU  30.322 total
CPython: 44.20s user  13.86s system  71% CPU  1:20.91 total
To be honest, this task also needs a tabu list (a list of forbidden categories) to avoid wandering into unwanted categories, but that's not the point of this article.
Conclusion
These tests are just a brief glimpse of my final work, which requires a knowledge base, one that I extract from the appropriate content in Wikipedia.
Compared to CPython, PyPy improved performance 2-3x on my simple database operations (not counting the SQL parser, which was about 8x).
Thanks to PyPy, my work was more enjoyable: I didn't have to rewrite algorithms just to make Python fast, and PyPy didn't peg my CPU the way CPython did, which at times made my laptop unusable (look at the CPU-time percentages).
The tasks are mostly database operations, and CPython gets help from those messy, accelerated C modules. PyPy doesn't use them, yet the results are still faster!
All my work needs many iterations, so I'm really glad to be using PyPy.