High Performance Python: Contents
- 1 Understanding Performant Python
- 2 Profiling
- 3 Lists and tuples
- 4 Dictionaries and sets
- 5 iterators and Generators
- 6 Matrix and Vector computation
- 7 Compiling to C
- 8 Concurrency
- 9 Multiprocessing
- 10 Clusters and Job Queues
- 11 Using less RAM
- 12 Lessons from the Field
Understanding Performant Python
Profiling
Lists and tuples
- Is the internal implementation an array?
Dictionaries and sets
- Dictionary elements: __hash__ + __eq__/__cmp__
- Entropy
- locals(), globals(), __builtin__ (namespace lookup order)
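- A minimal sketch (in the spirit of the book's Point example; the details here are mine) of a class that defines __hash__ and __eq__ so its instances behave as dictionary/set keys:
-
class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        # equal objects must hash equally; hashing the tuple of fields is the usual idiom
        return hash((self.x, self.y))
    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

lookup = {Point(1, 2): "found"}
print(Point(1, 2) in lookup)  # True only because __hash__ and __eq__ agree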
- List comprehension / generator expression (one uses [], the other ()):
-
[<value> for <item> in <sequence> if <condition>] vs (<value> for <item> in <sequence> if <condition>)
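- A quick illustration (my own, with arbitrary numbers): the list comprehension materializes every element, the generator expression produces them lazily:
-
divisible = [n for n in range(100000) if n % 3 == 0]        # full list held in memory
divisible_lazy = (n for n in range(100000) if n % 3 == 0)   # generator, values produced on demand
print(sum(divisible), sum(divisible_lazy))                   # same answer, very different peak RAM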
- itertools:
- imap, ireduce, ifilter, izip, islice, chain, takewhile, cycle
- p. 95: Knuth's online mean algorithm?
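- A sketch of an online (streaming) mean in the spirit of the Knuth reference above; the incremental update is the standard one, the function name is mine:
-
def online_mean(stream):
    """Yield the running mean without storing the stream (Knuth-style incremental update)."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n   # new_mean = old_mean + (x - old_mean) / n
        yield mean

for running in online_mean([2, 4, 6, 8]):
    pass
print(running)  # 5.0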
Iterators and Generators
Matrix and Vector computation
- The 'loop invariant' example keeps coming up; wouldn't the compiler optimize that anyway?
- $ perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,\
cache-references,cache-misses,branches,branch-misses,task-clock,faults,\
minor-faults,cs,migrations -r 3 python diffusion_python_memory.py
- numpy
- np.roll([[1,2,3],[4,5,6]], 1, axis=1)
- ? Can Cython optimize data structures, or does it only work on the code?
- in-place operations, such as +=, *=
- numexpr (see the sketch after this list)
- from numexpr import evaluate
- evaluate("next_grid*D*dt+grid", out=next_grid)
- ? Creating our own roll function
- scipy
- from scipy.ndimage.filters import laplace
- laplace(grid, out, mode='wrap')
- page-faults suggests scipy allocates a lot of memory? instructions suggests the scipy function is too general-purpose?
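- A minimal sketch of the numexpr call in context (grid size and constants are illustrative, not the book's exact diffusion setup):
-
import numpy as np
from numexpr import evaluate

grid = np.random.random((640, 640))
next_grid = np.copy(grid)    # pretend this already holds the Laplacian term
D, dt = 1.0, 0.1

# numexpr evaluates the expression in cache-sized chunks and writes the result
# in place, avoiding the temporaries that the equivalent numpy expression creates
evaluate("next_grid*D*dt+grid", out=next_grid)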
Compiling to C
- Compile to C:
- Cython
- ZMQ also uses it?
- setup.py
-
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(cmdclass={'build_ext': build_ext},
      ext_modules=[Extension("calculate", ["cythonfn.pyx"])])
- $ python setup.py build_ext --inplace
- Cython annotations: the more yellow a line of code, the more calls into the Python virtual machine
- Add type annotations
- cdef unsigned int i, n
- disable bounds checking: #cython: boundscheck=False (directive, or as a decorator)
- buffer protocol?
- def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs): ...
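- A sketch of cythonfn.pyx pulling the items above together (reconstructed from memory of the book's Julia-set style kernel, so details may differ):
-
# cython: boundscheck=False
def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output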
- OpenMP
- prange
- -fopenmp (for gcc?)
- schedule="guided"
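- A fragment showing how the serial loop in the sketch above could switch to prange (assumes the extension is compiled and linked with -fopenmp; the parallel body must not touch Python objects, so output would need to become a typed buffer):
-
from cython.parallel import prange
# inside calculate_z, replacing the serial range() loop (length = len(zs), declared as a C int):
for i in prange(length, nogil=True, schedule="guided"):
    ...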
- Shed Skin: for non-numpy code
- shedskin --extmod test.py
- An extra ~0.05 s per call: used to copy data across from the Python environment
- Pythran
- Numba: specialized for numpy, based on LLVM
- Use Continuum's Anaconda distribution
- from numba import jit
- @jit()
- Experimental GPU support is also available?
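- A minimal @jit sketch (a standard numba-style example, not the book's diffusion code):
-
import numpy as np
from numba import jit

@jit()   # numba infers the argument types and compiles on first call
def sum2d(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total

print(sum2d(np.ones((1000, 1000))))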
- #pythran export evolve(float64[][], float) (a Pythran annotation)
- VMs & JITs: PyPy
- GC behavior: whereas CPython uses reference counting, PyPy uses a modified mark-and-sweep (so memory may not be reclaimed promptly)
- Note that PyPy 2.3 runs as Python 2.7.3.
- STM: an attempt to remove the GIL
- Other tools: Theano, Parakeet, PyViennaCL, Nuitka, Pyston (Dropbox), PyCUDA (low-level code is not portable?)
- ctypes, cffi (from PyPy), f2py, a native CPython module
- $ f2py -c -m diffusion --fcompiler=gfortran --opt='-O3' diffusion.f90
- JIT Versus AOT
Concurrency
- Concurrency: avoid wasting time on I/O waits
- In Python, coroutines are implemented as generators.
- For Python 2.7 implementations of future-based concurrency, ...?
- gevent (suitable for mostly CPU-based problems that sometimes involve heavy I/O)
- gevent monkey-patches the standard I/O functions so they become asynchronous
- Greenlet
- Wait
- The futures are created with gevent.spawn
- Control the number of simultaneously open resources: from gevent.coros import Semaphore
- requests = [gevent.spawn(download, u, semaphore) for u in urls]
- import grequests?
- 69x speedup? Does that mean a corresponding amount of unnecessary I/O waiting was removed?
- The event loop may be either underutilized or overutilized
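- A sketch of the gevent pattern in the notes above, close to the book's example as I remember it (URLs and chunk size are placeholders; newer gevent moves Semaphore to gevent.lock):
-
import gevent
from gevent import monkey
monkey.patch_socket()                 # make standard socket I/O cooperative
from gevent.coros import Semaphore
import urllib2                        # Python 2, matching the book's era

def download(url, semaphore):
    with semaphore:                   # cap the number of simultaneously open connections
        return urllib2.urlopen(url).read()

def chunked_requests(urls, chunk_size=100):
    semaphore = Semaphore(chunk_size)
    requests = [gevent.spawn(download, u, semaphore) for u in urls]
    for response in gevent.iwait(requests):
        yield response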
- tornado (by Facebook, suitable for mostly I/O-bound asynchronous applications)
- from tornado import ioloop, gen
- from functools import partial
- AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient", max_clients=100)
- @gen.coroutine
- ... responses = yield [http_client.fetch(url) for url in urls]  # generates Future objects?
- response_sum = sum(len(r.body) for r in responses)
- raise gen.Return(value=response_sum)
- _ioloop = ioloop.IOLoop.instance()
- run_func = partial(run_experiment, base_url, num_iter)
- result = _ioloop.run_sync(run_func)
- Disadvantage: tracebacks can no longer hold valuable information
- In Python 3.4, new machinery was introduced to easily create coroutines and have them still return values
- asyncio
- yield from: raising an exception is no longer required in order to return results from a coroutine
- very low-level => import aiohttp
-
@asyncio.coroutine
def http_get(url):
    nonlocal semaphore
    with (yield from semaphore):
        response = yield from aiohttp.request('GET', url)
        body = yield from response.content.read()
        yield from response.wait_for_close()
    return body
return http_get

tasks = [http_client(url) for url in urls]
for future in asyncio.as_completed(tasks):
    data = yield from future

loop = asyncio.get_event_loop()
result = loop.run_until_complete(run_experiment(base_url, num_iter))
- Allows us to unify modules like tornado and gevent by having them run in the same event loop
Multiprocessing
- Process, Pool, Queue, Pipe, Manager, ctypes (for IPC?)
- In Python 3.2, the concurrent.futures module was introduced (via PEP 3148)
- PyPy fully supports multiprocessing and runs faster
- from multiprocessing.dummy import Pool (the thread-based version? see the Pool sketch below)
- hyperthreading can give up to a 30% performance gain if there are enough spare compute resources
- It is worth noting that the negative effect of threads on CPU-bound problems is reasonably solved in Python 3.2+
- Using external queue implementations: Gearman, 0MQ, Celery (using RabbitMQ as the message broker), PyRes, SQS, or HotQueue
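- A minimal Pool sketch (my own example); swapping in the multiprocessing.dummy import gives the thread-backed version mentioned above:
-
from multiprocessing import Pool
# from multiprocessing.dummy import Pool   # same API, backed by threads instead of processes

def work(x):
    return x * x

if __name__ == "__main__":
    pool = Pool(processes=4)
    print(pool.map(work, range(10)))
    pool.close()
    pool.join()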
- manager = multiprocessing.Manager()
value = manager.Value(b'c', flag_clear)
- rds = redis.StrictRedis()
rds[flag_name] = flag_set
- value = multiprocessing.RawValue(b'c', flag_clear)  # no synchronization mechanism?
- sh_mem = mmap.mmap(-1, 1)  # memory-map 1 byte as a flag
sh_mem.seek(0)
flag = sh_mem.read_byte()
- Using mmap as a Flag Redux (? a bit hard to follow, skipping)
- $ ps -A -o pid,size,vsize,cmd | grep np_shared
- lock = lockfile.FileLock(filename)
lock.acquire() / lock.release()
- lock = multiprocessing.Lock()
value = multiprocessing.Value('i', 0)
lock.acquire()
value.value += 1
lock.release()
Clusters and Job Queues
- $462 Million Wall Street Loss Through Poor Cluster Upgrade Strategy
- Did the version upgrade cause inconsistencies? But the API should have been versioned ...
- Skype's 24-hour global outage
- Some versions of the Windows client didn't properly handle the delayed responses and crashed.
- To reliably start the cluster's components when the machine boots, we tend to use either a cron job, Circus or supervisord, or sometimes upstart (which is being replaced by systemd)
- Might want to introduce a random-killer tool like Netflix's ChaosMonkey
- Make sure it is cheap in time and money to deploy updates to the system
- Make sure you use a deployment system like Fabric, Salt, Chef, or Puppet
- Early warning: Pingdom and Server Density
- Status monitoring: Ganglia
- 3 Clustering Solutions
- Parallel Python
- ppservers = ("*",)  # set IP list to be autodiscovered
- job_server = pp.Server(ppservers=ppservers, ncpus=nbr_local_cpus)
- ... job = job_server.submit(calculate_pi, (input_args,), (), ("random",))
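- A self-contained sketch of the Parallel Python flow above (the Monte Carlo pi function and worker count here are illustrative; calling job() blocks until the result is ready):
-
import pp

def calculate_pi(n):
    from random import random
    inside = sum(1 for _ in range(n) if random() ** 2 + random() ** 2 <= 1)
    return 4.0 * inside / n

ppservers = ("*",)                                   # autodiscover remote ppserver instances
job_server = pp.Server(ppservers=ppservers, ncpus=2)
job = job_server.submit(calculate_pi, (100000,), (), ("random",))
print(job())                                         # blocks until the job finishes
job_server.print_stats()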
- IPython Parallel
- Via ipcluster
- ? schedulers hide the synchronous nature of the engines and provide an asynchronous interface
- NSQ (distributed messaging system, written in Go)
- Pub/sub: topics, channels, consumers
- writer = nsq.Writer(['127.0.0.1:4150', ])
- handler = partial(calculate_prime, writer=writer)
- reader = nsq.Reader(message_handler=handler, nsqd_tcp_addresses=['127.0.0.1:4150', ], topic='numbers', channel='worker_group_a')
- nsq.run()
- Other cluster tools
Using less RAM
- IPython %memit
- array module
- DAWG/DAFSA
- Marisa trie (static trie)
- datrie (needs an alphabet that contains all the keys?)
- HAT trie
- HTTP microservices (using flask): https://github.com/j4mie/postcodeserver/
- Probabilistic Data Structures
- HyperLogLog++ structure?
- Very approximate counting with a 1-byte Morris counter
- Estimate is 2^exponent; update with a probabilistic rule: increment exponent if random(0,1) <= 2^-exponent
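- A sketch of the 1-byte Morris counter rule above (class name and method layout are mine):
-
import random

class MorrisCounter(object):
    """Approximate counter that stores only a small exponent."""
    def __init__(self):
        self.exponent = 0
    def add(self):
        # increment the exponent with probability 2**-exponent
        if random.random() <= 2.0 ** -self.exponent:
            self.exponent += 1
    def estimate(self):
        # the count is estimated as 2**exponent
        return 2 ** self.exponent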
- K-Minimum Values / KMV (keep the K smallest hash values, assuming hash values are evenly distributed)
- Bloom Filters
- This method gives us no false negatives and a controllable rate of false positives (an item may be wrongly reported as present)
- ? Use 2 real hashes to simulate any number of hash functions (see the sketch below)
- very sensitive to initial capacity
- Scalable Bloom filters: by chaining together multiple Bloom filters ...
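- One common way to implement the "2 hashes simulate many" idea above is the Kirsch-Mitzenmacher construction, h_i(x) = h1(x) + i*h2(x); the details below are mine, not the book's exact code:
-
import hashlib

def bloom_indices(item, num_hashes, num_bits):
    """Derive num_hashes bit positions from two real hashes: h_i(x) = h1(x) + i*h2(x)."""
    digest = hashlib.md5(item.encode('utf8')).hexdigest()
    h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

bits = set()                     # stand-in for a real bit array
for idx in bloom_indices("example", num_hashes=5, num_bits=2 ** 20):
    bits.add(idx)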
- LogLog counter
-
bit_index = trailing_zeros(item_hash)
if bit_index > self.counter:
    self.counter = bit_index
- variants: SuperLogLog, HyperLogLog
Lessons from the Field
- Sentry is used to log and diagnose Python stack traces
- Aho-Corasick trie?
- We use Graphite with collectd and statsd to allow us to draw pretty graphs of what's going on
- Gunicorn is used as a WSGI server and its I/O loop is executed by Tornado
High Performance Python notes (Python is a good language, and full-stack programmers use it!)