Performance comparison test for PyPy and CPython

Source: Internet
Author: User
I recently completed some data mining tasks on Wikipedia. They consisted of these parts:

Parse the enwiki-pages-articles.xml Wikipedia dump;

Store categories and pages in MongoDB;

Store the category names in Redis.

I tested the real-task performance of CPython 2.7.3 and PyPy 2b. The libraries I used were:

redis 2.7.2

pymongo 2.4.2

Additionally, CPython was backed by the following C libraries:

hiredis

pymongo C extensions

The tests mostly involved database access and parsing, so I did not expect much benefit from PyPy (all the more so since the CPython database drivers are written in C).

I'll describe some interesting results below.


Extract a wiki page name


I needed to build a mapping from page names to page.id for all Wikipedia categories and store it in Redis. The simplest solution would be to import enwiki-page.sql (which defines an RDB table) into MySQL and then transfer the data to Redis. But I didn't want to add a MySQL requirement (have backbone! XD), so I wrote a simple SQL INSERT statement parser in pure Python and imported the data directly from enwiki-page.sql into Redis.

This task is much more CPU-bound, so here I was more optimistic about PyPy.
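To give an idea of what such a parser involves, here is a minimal sketch. It is not the parser used for the measurements in this article. It assumes that each INSERT statement sits on a single line in the form INSERT INTO ... VALUES (...),(...); and that strings are single-quoted with backslash escapes. It is written for Python 3, and the names parse_insert_values, convert and ESCAPES are made up for this illustration.

```python
# Hypothetical helper names; a sketch for single-line INSERT statements only.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def convert(text, quoted):
    """Turn one raw field into a Python value."""
    if quoted:
        return text
    if text == "NULL":
        return None
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_insert_values(line):
    """Yield one tuple per row of an 'INSERT INTO ... VALUES (...),(...);' line."""
    marker = " VALUES "
    start = line.find(marker)
    if not line.startswith("INSERT INTO") or start == -1:
        return
    i, n = start + len(marker), len(line)
    row, field = [], []
    in_row = in_str = quoted = False
    while i < n:
        ch = line[i]
        if in_str:
            if ch == "\\" and i + 1 < n:
                # backslash escape: translate known codes, keep others literally
                nxt = line[i + 1]
                field.append(ESCAPES.get(nxt, nxt))
                i += 1
            elif ch == "'":
                in_str = False
            else:
                field.append(ch)
        elif ch == "'":
            in_str = quoted = True
        elif ch == "(" and not in_row:
            in_row, row, field, quoted = True, [], [], False
        elif ch in ",)" and in_row:
            # a comma or closing parenthesis ends the current field
            row.append(convert("".join(field), quoted))
            field, quoted = [], False
            if ch == ")":
                in_row = False
                yield tuple(row)
        elif in_row and not ch.isspace():
            field.append(ch)
        i += 1


line = ("INSERT INTO `page` VALUES (10,0,'AccessibleComputing','',0.33,NULL),"
        "(12,14,'It\\'s (a) test, really','x',1.5,NULL);")
rows = list(parse_insert_values(line))
```

Each yielded tuple corresponds to one row of the table, so a name-to-id mapping can be built by picking the id and title columns from it.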

            User time    System time    CPU
PyPy        169.00s      8.52s          90%
CPython     1287.13s     8.10s          96%

I also built a similar mapping for page.id -> category (my laptop has too little memory to hold that data for my tests).


Filter Categories from Enwiki.xml


To make my work easier, I needed to filter the categories out of enwiki-pages-articles.xml and store them in the same XML format. So I chose the SAX parser, which is available in both PyPy and CPython. In both interpreters it is a wrapper around an external, natively compiled package.

The code is simple:

    from xml.sax import handler


    class Stack(list):
        # Not shown in the original listing; a minimal assumed implementation
        # providing the push/pop/top calls the handler relies on.
        push = list.append

        def top(self):
            return self[-1]


    class WikiCategoryHandler(handler.ContentHandler):
        """Class which detects category pages and stores them separately"""
        ignored = set(('contributor', 'comment', 'meta'))

        def __init__(self, f_out):
            handler.ContentHandler.__init__(self)
            self.f_out = f_out
            self.curr_page = None
            self.curr_tag = ''
            self.curr_elem = Element('root', {})
            self.root = self.curr_elem
            self.stack = Stack()
            self.stack.push(self.curr_elem)
            self.skip = 0

        def startElement(self, name, attrs):
            # skip ignored tags together with everything nested inside them
            if self.skip > 0 or name in self.ignored:
                self.skip += 1
                return
            self.curr_tag = name
            elem = Element(name, attrs)
            if name == 'page':
                elem.ns = -1
                self.curr_page = elem
            else:   # we don't want to keep old pages in memory
                self.curr_elem.append(elem)
            self.stack.push(elem)
            self.curr_elem = elem

        def endElement(self, name):
            if self.skip > 0:
                self.skip -= 1
                return
            if name == 'page':
                self.task()
                self.curr_page = None
            self.stack.pop()
            self.curr_elem = self.stack.top()
            self.curr_tag = self.curr_elem.tag

        def characters(self, content):
            if content.isspace():
                return
            if self.skip == 0:
                self.curr_elem.append(TextElement(content))
                if self.curr_tag == 'ns':
                    self.curr_page.ns = int(content)

        def startDocument(self):
            self.f_out.write("<root>\n")

        def endDocument(self):
            self.f_out.write("</root>\n")   # the original wrote "<\root>"
            print("FINISH PROCESSING WIKIPEDIA")

        def task(self):
            # namespace 14 marks category pages
            if self.curr_page.ns == 14:
                self.f_out.write(self.curr_page.render())


    class Element(object):
        def __init__(self, tag, attrs):
            self.tag = tag
            self.attrs = attrs
            self.childrens = []
            self.append = self.childrens.append

        def __repr__(self):
            return "Element {}".format(self.tag)

        def render(self, margin=0):
            # NOTE: as in the original, the attribute part iterates over an
            # empty dict, so attributes are never rendered.
            if not self.childrens:
                return u"{0}<{1}{2} />".format(
                    " " * margin,
                    self.tag,
                    "".join([' {}="{}"'.format(k, v) for k, v in {}.iteritems()]))
            if isinstance(self.childrens[0], TextElement) and len(self.childrens) == 1:
                return u"{0}<{1}{2}>{3}</{1}>".format(
                    " " * margin,
                    self.tag,
                    "".join([u' {}="{}"'.format(k, v) for k, v in {}.iteritems()]),
                    self.childrens[0].render())
            return u"{0}<{1}{2}>\n{3}\n{0}</{1}>".format(
                " " * margin,
                self.tag,
                "".join([u' {}="{}"'.format(k, v) for k, v in {}.iteritems()]),
                "\n".join((c.render(margin + 2) for c in self.childrens)))


    class TextElement(object):
        def __init__(self, content):
            self.content = content

        def __repr__(self):
            return "TextElement"

        def render(self, margin=0):
            return self.content


The Element and TextElement objects hold the tag and body information and provide a method to render it.

Here are the PyPy and CPython results I wanted to compare.
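To see the same filtering idea run end to end on a tiny input, here is a reduced sketch. It assumes Python 3 and a made-up three-page sample; the names SAMPLE and CategoryTitleHandler are hypothetical. It only collects the titles of pages in namespace 14 instead of re-rendering the XML, so it is not a substitute for the handler above.

```python
import xml.sax
from xml.sax import handler

# Made-up sample standing in for the real dump
SAMPLE = b"""<mediawiki>
  <page><title>Category:Computing</title><ns>14</ns><id>1</id></page>
  <page><title>Anarchism</title><ns>0</ns><id>2</id></page>
  <page><title>Category:Software</title><ns>14</ns><id>3</id></page>
</mediawiki>"""


class CategoryTitleHandler(handler.ContentHandler):
    """Collect the titles of pages whose <ns> is 14 (category pages)."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.titles = []
        self.tag = None    # name of the element currently open, if any
        self.page = None   # text collected for the current <page>

    def startElement(self, name, attrs):
        self.tag = name
        if name == "page":
            self.page = {"title": "", "ns": ""}

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate it
        if self.page is not None and self.tag in self.page:
            self.page[self.tag] += content

    def endElement(self, name):
        self.tag = None
        if name == "page":
            if self.page["ns"].strip() == "14":
                self.titles.append(self.page["title"].strip())
            self.page = None   # drop the page so memory use stays flat


h = CategoryTitleHandler()
xml.sax.parseString(SAMPLE, h)
print(h.titles)
```

Because the handler forgets each page as soon as its closing tag arrives, memory use does not grow with the size of the dump, which is the main reason to prefer SAX over building a full tree here.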

            Time
PyPy        2169.90s
CPython     4494.69s

I was very surprised by PyPy's result.

Computing an interesting set of categories


I wanted to compute an interesting set of categories. In my use case these are the categories derived from the Computing category. For this I needed to build a category graph that provides the category-subcategory relation.

Building the category-subcategory graph


This task uses MongoDB as the data source and stores the resulting structure in Redis. The algorithm is:

    for each category.id in redis_categories (it holds the category.id -> category title mapping) do:
        title = redis_categories.get(category.id)
        parent_categories = mongodb get categories for title
        for all parent_cat in parent_categories do:
            redis_tree.sadd(parent_cat, title)  # add title to the parent_cat set

Sorry for writing pseudocode like this, but I think it looks more compact.

So this task only copies data from one database to another. The results here were taken after MongoDB had warmed up (without the warm-up the data would be biased: the Python task then uses only about 10% of the CPU). The timings are as follows:
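The pseudocode above can be turned into a runnable sketch by replacing the databases with in-memory stand-ins. Everything here is an assumption made for illustration: the sample data, the mongo_parents dict that plays the role of the MongoDB lookup, and the FakeRedisSets class, which mimics only the sadd and smembers set commands of a Redis client. It is written for Python 3.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the real stores
redis_categories = {1: "Computing", 2: "Software", 3: "Programming languages"}
mongo_parents = {
    "Computing": [],
    "Software": ["Computing"],
    "Programming languages": ["Software", "Computing"],
}


class FakeRedisSets(object):
    """In-memory stand-in that mimics only the SADD and SMEMBERS commands."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, *values):
        before = len(self._sets[key])
        self._sets[key].update(values)
        return len(self._sets[key]) - before   # number of newly added members

    def smembers(self, key):
        return set(self._sets.get(key, ()))


def build_tree(categories, parents_of, tree):
    """For every category, register its title under each parent category."""
    for category_id in categories:
        title = categories.get(category_id)
        for parent_cat in parents_of(title):
            tree.sadd(parent_cat, title)   # add title to the parent_cat set
    return tree


redis_tree = build_tree(
    redis_categories,
    lambda title: mongo_parents.get(title, []),
    FakeRedisSets(),
)
print(sorted(redis_tree.smembers("Computing")))
```

Each key of the resulting tree is a parent category and its set holds the direct subcategories, which is the shape the traversal step needs.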

            User time    System time    CPU
PyPy        175.11s      66.11s         64%
CPython     457.92s      72.86s         81%


Traversing redis_tree (the tree stored in Redis)


Once we have the redis_tree database, the only remaining problem is to traverse all nodes reachable from the Computing category. To avoid cycles, we need to record the nodes that have already been visited. Since I wanted to test Python's database performance, I used a Redis set to solve this.
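The traversal idea can be sketched as a breadth-first search that records every visited node. This is only an illustration under assumptions: the sample graph is made up, and both the tree and the visited set live in plain Python structures here, whereas the measured task kept them in the database. It is written for Python 3.

```python
from collections import deque

# Made-up category graph with cycles, standing in for redis_tree
tree = {
    "Computing": {"Software", "Hardware"},
    "Software": {"Programming languages", "Computing"},   # cycle back to the root
    "Hardware": set(),
    "Programming languages": {"Software"},                # another cycle
}


def reachable(tree, root):
    """Return every node reachable from root, visiting each node only once."""
    visited = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree.get(node, ()):
            if child not in visited:   # the visited set is what breaks cycles
                visited.add(child)
                queue.append(child)
    return visited


print(sorted(reachable(tree, "Computing")))
```

Marking a node as visited when it is queued, rather than when it is dequeued, keeps any node from entering the queue twice.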

            User time    System time    CPU    Total
PyPy        14.79s       6.22s          69%    30.322
CPython     44.20s       13.86s         71%    1:20.91

To be honest, this task also requires building a tabu list (a list of forbidden categories) to avoid descending into unwanted categories. But that is not the point of this article.


Conclusion

These tests are only a brief introduction to my final work, which requires a knowledge base that I extract from the relevant content in Wikipedia.

Compared with CPython, PyPy was 2-3 times faster in my simple database operations. (I am not counting the SQL parser here, where it was about 8 times faster.)

Thanks to PyPy, my work was more enjoyable: I did not have to rewrite the algorithms to make Python efficient, and PyPy did not hog my CPU the way CPython did, which left me unable to use my laptop normally for a while (look at the CPU percentages).

The tasks were mostly database operations, and CPython had some C modules to speed them up. PyPy did not use them, yet its results were faster!
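These ratios can be checked directly from the timings reported above. The short script below divides the CPython time by the PyPy time for each task, using the user times where the tables give them and the single reported time for the category filter. The task labels are my own shorthand.

```python
# (PyPy seconds, CPython seconds) as reported in the tables of this article
timings = {
    "SQL INSERT parser": (169.00, 1287.13),
    "category filter (SAX)": (2169.90, 4494.69),
    "graph construction": (175.11, 457.92),
    "tree traversal": (14.79, 44.20),
}

# speedup = CPython time / PyPy time
speedups = {task: cpython / pypy for task, (pypy, cpython) in timings.items()}

for task, ratio in speedups.items():
    print("{:<24}{:.2f}x".format(task, ratio))
```

The parser comes out at roughly 7.6x, while the other three tasks fall between about 2.1x and 3.0x.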

All my work needs a lot of cycles, so I'm really happy to use PyPy.
