About Algorithms: Things You Don't Know
1. Algorithms: Not Just for Grinding Problems
When it comes to algorithms, any halfway decent programmer, trained or not, can say a few words; who hasn't studied algorithms at some point? For those of us on the industrial rather than the academic track, the biggest payoff of studying algorithms may simply be landing a job... After all, once you are working, you mostly rely on mature libraries and rarely implement advanced data structures and algorithms yourself. But the algorithms course I just finished this semester made me feel as if I had never really learned algorithms at all. It was eye-opening: even though I didn't fully understand every lecture, I looked forward to whatever new thing the teacher would bring each time. Bit by bit I discovered that there are so many useful algorithms and data structures I had never learned, or even heard of!
These results count as relatively recent research, yet they are not pure theory far removed from daily work, so I personally feel they are worth learning, even if only to appreciate the pioneering ideas behind them. As for resources, apart from the specific papers mentioned in each section below, books that systematically introduce these advanced algorithms and data structures are genuinely rare; I did find one: Peter Brass's Advanced Data Structures. Beyond that, being able to honestly read an original paper, or a good analysis of one, is itself a skill I picked up this semester. Of course, some papers are very difficult; we can read them in excerpts, and most of the time it is still quite enjoyable!
2. Randomized Algorithms
Randomization is the focus of this course. I had neglected it before, but now I find it a very important algorithm-design strategy. Introduction to Algorithms (the textbook for this course) has some systematic explanations and examples, but that classic is not dedicated to randomized algorithms. The classic book specializing in this area is Motwani and Raghavan's Randomized Algorithms, and a slightly more modern one is Probability and Computing: Randomized Algorithms and Probabilistic Analysis, but for me these two are really too hard! If, like me, you are just somewhat interested and want a reasonably systematic introduction, I recommend carefully studying Chapter 13, "Randomized Algorithms", of Algorithm Design. That chapter goes from the basic definitions and w.h.p. to the analysis of common randomized algorithms and data structures, and is detailed enough.
2.1 Randomized Quicksort
Just as the key to merge sort lies in the merge function, the key to quicksort's performance lies in the partition function. Because we have no control over what the input data looks like, we generally turn to randomization. There are two strategies for randomized quicksort: 1) pick the pivot uniformly at random; 2) sample three elements at random and take their median. Both achieve an expected O(n log n). The proof for the first method is explained in detail in CLRS, and the second appears as an exercise in the book. In class the teacher casually reduced this seemingly difficult analysis to a coin-flipping model, resolving it with almost no effort; it was quite a shock!
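As a minimal sketch of the two pivot strategies (not the in-place partition from CLRS; the function names are my own):

```python
import random

def randomized_quicksort(a):
    """Sort a list by recursively partitioning around a randomly chosen pivot."""
    if len(a) <= 1:
        return a
    pivot = random.choice(a)                 # strategy 1: pick the pivot uniformly at random
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)

def median_of_three_pivot(a):
    """Strategy 2: sample three elements at random and use their median as the pivot."""
    return sorted(random.sample(a, 3))[1] if len(a) >= 3 else a[0]

print(randomized_quicksort([3, 7, 1, 9, 4, 4, 2]))   # [1, 2, 3, 4, 4, 7, 9]
```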
Recommended reading: both approaches are analyzed in detail in Introduction to Algorithms; Algorithm Design makes a good supplement, where the proofs are somewhat easier to follow.
2.2 Skip List
Since my previous work revolved around Redis, I already knew something about skip lists: Redis uses a skip list rather than a red-black tree to implement its sorted sets. I knew the skip list was simple to implement, had good cache locality, and so on, but I hadn't realized it is a randomized data structure: each newly inserted key is promoted upward level by level by repeatedly flipping a fair coin. How interesting! Oddly, many algorithms books do not mention the skip list at all, nor the Bloom filter (discussed later), which is so important in large distributed systems. A toy sketch of the coin-flipping insert is given below.
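A toy skip list in Python, assuming a fair coin and a fixed maximum level (this only illustrates the coin-flip promotion, it is not Redis's implementation):

```python
import random

class SkipListNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level          # forward[i] points to the next node at level i

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipListNode(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        # Flip a fair coin: each "heads" promotes the new key one level higher.
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL  # last node visited at each level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = SkipListNode(key, lvl)
        for i in range(lvl):                   # splice the new node into each of its levels
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```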
Recommended reading:
- The author's paper: "Skip Lists: A Probabilistic Alternative to Balanced Trees"
- Implementation: Section 13.5 of Sedgewick's Algorithms in C, 3rd edition.
- The MIT courseware based on CLRS supplements the skip list; the explanation there is also great (highly recommended!)
2.3 Universal Hashing
Usually a well-chosen hash function works fine, but in some extreme cases, for example when someone knows your hashing scheme and deliberately constructs a large number of collisions, it falls apart. This is no fantasy: it actually happened in 2012, affecting almost every major platform including PHP, Java, and Python; see the "Hash Collision DoS" problem. One defensive approach is to use not a single hash function but a family of them, and pick one from the family at random each time a table is created.
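For illustration, a minimal sketch of one classic universal family, h_{a,b}(x) = ((a·x + b) mod p) mod m; the particular prime and the helper name are my own choices:

```python
import random

# A classic universal family: h_{a,b}(x) = ((a*x + b) mod p) mod m,
# with p a prime larger than any key and a, b drawn at random per table.
P = (1 << 61) - 1            # a Mersenne prime comfortably above typical integer keys

def random_hash(m):
    """Pick one function at random from the family; m is the number of buckets."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = random_hash(1024)
print(h(42), h(43))          # bucket indices; an attacker cannot predict them without a and b
```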
Recommended reading:
- The universal hashing and perfect hashing sections of Introduction to Algorithms.
- The MIT courseware based on CLRS also explains this very well!
2.4 Bloom Filter & Cuckoo Filter
I had heard of the Bloom filter in the book Big Data: Internet Large-Scale Data Mining and Distributed Processing, but always assumed it was a very advanced, hard-to-understand algorithm. After the teacher's detailed walkthrough I found it is actually that simple, and really useful! Like the ubiquitous hash tables we use every day, it can be combined with many other data structures. At its core it answers whether an item exists in a set: if negative answers are the overwhelming majority, it achieves your goal using very little space. When it occasionally answers yes, the BF is only saying "not sure" (a false positive), and you then check the real data behind it.
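A toy Bloom filter sketch in Python; the parameters m and k and the use of SHA-256 to derive the k hash functions are illustrative choices, not anything prescribed by the course:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a bit array of m bits."""
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not present"; True only means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:1001")
print(bf.might_contain("user:1001"), bf.might_contain("user:9999"))  # True, (almost surely) False
```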
The Cuckoo filter was implemented by a Chinese student at CMU. It cleverly combines fingerprints with multi-choice hashing, mainly to fix the BF's pain points: it cannot be deleted from or resized, only torn down and rebuilt, and its data locality is poor, generating lots of random I/O.
Recommended reading:
- Paper (highly recommended!): "Theory and Practice of Bloom Filters for Distributed Systems"
- Paper: "Cuckoo Filter: Practically Better Than Bloom"
2.5 Rabin-Karp String Matching
For string matching, the teacher skipped the more famous KMP algorithm and instead presented the randomization idea in the Rabin-Karp algorithm: make full use of the characters already seen by maintaining a rolling hash, wasting nothing (a common strategy in many efficient algorithms, the trie being another example). The matching test is somewhat like the Bloom filter described earlier in that it can also produce false positives, so each hash hit must be verified. Detailed explanations and proofs are in CLRS.
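A small Rabin-Karp sketch, assuming the usual base/modulus rolling hash (the constants are arbitrary) and verifying every hash hit to rule out false positives:

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the indices where pattern occurs in text, using a rolling hash."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the leading character in the window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:   # verify to avoid false positives
            matches.append(i)
        if i < n - m:
            # Roll the hash: drop text[i], shift, and append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))     # [0, 7]
```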
Recommended reading: the corresponding chapter of Introduction to Algorithms.
2.6 Treaps
Just as the two randomized versions above protect quicksort's performance, the treap "protects" the binary search tree. It assigns each key a random priority: when new data is inserted, the tree maintains the BST property on the keys while keeping itself balanced by maintaining the heap property on the randomly assigned priorities. As the name suggests, Treap = Tree + Heap.
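A minimal treap insert in Python, assuming random real-valued priorities and the standard rotations; deletion is omitted:

```python
import random

class TreapNode:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()      # random priority maintains the heap property
        self.left = self.right = None

def rotate_right(node):
    left = node.left
    node.left, left.right = left.right, node
    return left

def rotate_left(node):
    right = node.right
    node.right, right.left = right.left, node
    return right

def insert(node, key):
    """BST insert by key, then rotate up while the child's priority beats its parent's."""
    if node is None:
        return TreapNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
        if node.left.priority > node.priority:
            node = rotate_right(node)
    else:
        node.right = insert(node.right, key)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node

root = None
for k in [5, 2, 8, 1, 9]:
    root = insert(root, k)
```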
Recommended reading: appears as an exercise in Introduction to Algorithms.
2.7 Matrix Product Equality Testing
This was a problem the teacher left as homework, and a rather magical one; I recommend everyone take a look, because it really shows the charm of randomization! The proof technique, similar to the one for universal hashing, uses the so-called principle of deferred decisions. I don't understand it completely; the main idea is to fix the values of all the other random variables and keep only one "free" variable, analyze that special case, and then count how many such special cases there are (that is, how many combinations of the fixed variables exist), which simplifies the proof. It still feels a bit difficult; when I have time I'll study this style of probabilistic proof properly.
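This is the classic Freivalds-style check: instead of recomputing A·B, test A(Bx) = Cx for random 0/1 vectors x. A short sketch, assuming square integer matrices given as lists of lists:

```python
import random

def freivalds(A, B, C, trials=20):
    """Check whether A @ B == C by testing A(Bx) == Cx for random 0/1 vectors x.
    Each trial wrongly answers "equal" with probability at most 1/2."""
    n = len(A)
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        Bx  = [sum(B[i][j] * x[j]  for j in range(n)) for i in range(n)]
        ABx = [sum(A[i][j] * Bx[j] for j in range(n)) for i in range(n)]
        Cx  = [sum(C[i][j] * x[j]  for j in range(n)) for i in range(n)]
        if ABx != Cx:
            return False          # definitely not equal
    return True                   # equal with probability >= 1 - 2**(-trials)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]          # the true product of A and B
print(freivalds(A, B, C))         # True
```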
Recommended reading:
- "Matrix-Product Verification"
- "Verifying Matrix Multiplication"
- Principle of deferred decisions: "Events and Probability (Verifying Matrix Multiplication, Randomized Min-Cut)"
3. Parallel Algorithms
Parallel computing is also a current hot topic, and for beginners the best material is Introduction to Algorithms. It first introduces three basic concepts, work, span, and parallelism, and the related theorems. Roughly speaking, work is the running time of the program on a single core, span is its running time on a machine with infinitely many cores, and their ratio is the parallelism, i.e. the maximum degree of parallelism the program can achieve. Three primitives realize the parallelization: parallel-loop, spawn, and sync. The book focuses on the parallel versions of matrix multiplication and merge sort, the two most common examples.
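To make the three concepts concrete, these are the standard bounds from CLRS in its notation, where T_1 is the work, T_∞ the span, and P the number of processors:

```latex
% Work/span bounds for a computation run on P processors (CLRS notation):
T_P \ge \frac{T_1}{P} \quad (\text{work law}) \qquad
T_P \ge T_\infty \quad (\text{span law}) \qquad
T_P \le \frac{T_1}{P} + T_\infty \quad (\text{greedy scheduler}) \qquad
\text{parallelism} = \frac{T_1}{T_\infty}
```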
4. External Memory Algorithms
The RAM model is the one we know best, but it is not all-powerful, for example when analyzing algorithms over external storage. Compared with the I/O latency of external storage, the cost of the in-memory operations that a RAM analysis counts is negligible. For this reason, when external storage is involved we use a different model, the DAM (Disk Access Machine) model, and focus on counting the number of blocks exchanged with the disk, i.e. the number of I/Os.
We can drop the most common algorithms and data structures into the DAM model and analyze them: linear search, binary search, and most importantly the k-way merge (widely used in database joins, external sorting, and so on). Once you get a feel for it, you can move on to analyzing more complex things: matrix multiplication, B-trees, LSM-trees, B^epsilon-trees, and so on. A heap-based k-way merge is sketched below.
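A quick in-memory illustration of the k-way merge using a heap; the DAM-model point is that each sorted run is streamed sequentially, so the whole merge costs roughly N/B block transfers as long as one block per run fits in memory:

```python
import heapq

def k_way_merge(runs):
    """Merge k sorted runs with a min-heap of (value, run index, position)."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, j = heapq.heappop(heap)
        out.append(value)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(k_way_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```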
Because previously I only knew that B-trees had something to do with external memory, when I systematically learned so many external-memory algorithms I assumed this was a relatively new topic. In fact, things like the linear-probing hash table were already studied in detail by Knuth in The Art of Computer Programming (TAOCP) back in the early 1970s. How ignorant of me!
Recommended reading:
- Pagh's survey (highly recommended!): "Basic External Memory Data Structures"
- Stony Brook courseware: "Analyzing I/O and Cache Performance"
5. Write-Optimized Data Structures (Write-Optimized Dictionaries, WODs)
Write optimization can be seen as an important design strategy built on the idea of buffering. The problem it targets is common to the B-tree family: every insert puts the new key directly into its final position, which makes inserts the performance bottleneck. The WODs family instead carves out part of each internal node as a buffer; inserted data is staged in these buffers and flushed, a little at a time, down toward the position where the key ultimately belongs. Members of the WODs family include the buffer tree (probably the earliest and most primitive such tree), the log-structured merge tree (LSM-tree), which combines the buffering and cascading ideas and is widely used in HBase, Cassandra, LevelDB and other popular software, and the variants studied by the Stony Brook professors such as the B^epsilon-tree and the cache-oblivious streaming B-tree (COLA).
The B^epsilon-tree deserves special attention. It abstracts each key into an asynchronous message, so the whole tree behaves a bit like a message queue. The choice of the epsilon parameter decides how much space in each tree node is set aside as buffer. It not only greatly improves the B-tree's write performance but also makes much better use of disk bandwidth. Worth mentioning: several of the professors applied the B^epsilon-tree to databases, a Linux file system, and other storage software, and commercialized it into a company. A startup propped up by a single data structure, purely on the strength of technology, is really admirable! A toy sketch of the buffering idea follows.
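A toy, two-level sketch of the buffering idea only (not a real B^epsilon-tree: the routing pivot, buffer size, and class names are all made up for illustration):

```python
class BufferedNode:
    """Toy illustration of the buffering idea behind WODs: inserts land in the
    node's buffer and are only pushed down to children in batches."""
    def __init__(self, buffer_size=4, children=None):
        self.buffer = []
        self.buffer_size = buffer_size
        self.children = children or []       # a leaf if empty

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_size and self.children:
            self.flush()

    def flush(self):
        # One batched flush amortizes a single "I/O" over many buffered keys.
        for key in self.buffer:
            child = self.children[0] if key < 50 else self.children[1]  # toy 2-way routing
            child.insert(key)
        self.buffer = []

root = BufferedNode(children=[BufferedNode(), BufferedNode()])
for k in [3, 72, 15, 99, 41]:
    root.insert(k)
```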
Recommended reading:
- Paper: "The Buffer Tree: A Technique for Designing Batched External Data Structures"
- Paper: "The Log-Structured Merge-Tree (LSM-Tree)"
- Paper (highly recommended!): "An Introduction to Bε-trees and Write-Optimization"
- Paper: "Cache-Oblivious Streaming B-trees"
- Slides from Professor Bender (highly recommended!), covering almost every aspect of big data and WODs, with some introduction to his commercial company; the slide content is great!: "Data Structures and Algorithms for Big Databases"
6. Cache-Oblivious (CO) Algorithms
This is a kind of magical algorithm that runs with optimal performance on a machine of any block size! That is unthinkable in the DAM model described above, where the analysis assumes a block size B and proceeds from there. Working for arbitrary B, or ignoring B altogether, is hard to imagine! Yet such algorithms really do exist, and not just one or two; quite a few have been found.
The model behind CO algorithms is called the ideal-cache model. It builds on the DAM model and additionally assumes that the memory size M is much larger than B and that page replacement is handled by a perfect pager. Although this model, like DAM, has only a two-level cache hierarchy, algorithms optimal in it can be proven optimal on modern multi-level cache hierarchies (CPU registers, L1, L2, L3, memory, disk). Known CO algorithms include matrix transposition, matrix multiplication, and more; see the reading material recommended below.
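For a concrete taste, a sketch of the classic cache-oblivious matrix transpose: recursively split the larger dimension until the sub-problem is tiny, without ever mentioning the block size; the base-case threshold of 16 is an arbitrary choice of mine:

```python
def co_transpose(A, out, r0=0, r1=None, c0=None, c1=None):
    """Cache-obliviously write A's transpose into out by recursively splitting
    the larger dimension; at some depth the sub-blocks fit in cache, whatever
    the (unknown) block size happens to be."""
    n, m = len(A), len(A[0])
    if r1 is None:
        r1, c0, c1 = n, 0, m
    rows, cols = r1 - r0, c1 - c0
    if rows <= 16 and cols <= 16:                    # small base case: transpose directly
        for i in range(r0, r1):
            for j in range(c0, c1):
                out[j][i] = A[i][j]
    elif rows >= cols:                               # split the larger dimension in half
        mid = (r0 + r1) // 2
        co_transpose(A, out, r0, mid, c0, c1)
        co_transpose(A, out, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        co_transpose(A, out, r0, r1, c0, mid)
        co_transpose(A, out, r0, r1, mid, c1)

A = [[1, 2, 3], [4, 5, 6]]
out = [[0] * 2 for _ in range(3)]
co_transpose(A, out)
print(out)    # [[1, 4], [2, 5], [3, 6]]
```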
Designing or converting an algorithm or data structure to be cache-oblivious is not that easy, but there are three common strategies: the van Emde Boas data layout (the vEB layout, not the vEB tree), weight-balanced trees, and the Packed Memory Array (PMA). The vEB layout is an interesting recursive nesting that generates no redundant I/O for any block size B. The PMA is even more magical: an ordinary dynamic array doubles or shrinks when it runs out of space or becomes too empty, which makes the occasional operation suddenly very slow; the PMA instead divides the array into segments and keeps some gap slots in each segment, so that no segment is ever completely full and an insert only moves elements within a small area.
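A very rough sketch of the gap idea only (a real PMA organizes the gaps into segments and rebalances them; this toy merely shows that an insert stops shifting at the nearest gap, and the class name is made up):

```python
class GappedArray:
    """Toy packed-memory-array flavour: a sorted array with empty slots (None),
    so an insert only shifts elements until it reaches the nearest gap."""
    def __init__(self, capacity=16):
        self.slots = [None] * capacity

    def insert(self, key):
        # Find the insertion point among occupied slots (linear scan for simplicity).
        i = 0
        while i < len(self.slots) and self.slots[i] is not None and self.slots[i] < key:
            i += 1
        # Shift right only until the first gap; a real PMA bounds this by rebalancing.
        j = i
        while j < len(self.slots) and self.slots[j] is not None:
            j += 1
        if j == len(self.slots):
            raise MemoryError("full: a real PMA would rebalance or grow here")
        while j > i:
            self.slots[j] = self.slots[j - 1]
            j -= 1
        self.slots[i] = key

ga = GappedArray()
for k in [30, 10, 20, 25]:
    ga.insert(k)
print([x for x in ga.slots if x is not None])   # [10, 20, 25, 30]
```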
Recommended reading:
- The paper by Frigo et al. that started CO: "Cache-Oblivious Algorithms"
- Erik D. Demaine's survey (highly recommended!): "Cache-Oblivious Algorithms and Data Structures"
- Paper: "Cache-Efficient Matrix Transposition"
- Paper (vEB, PMA) (highly recommended!): "Cache-Oblivious B-Trees"
- Paper (implicit vEB): "Cache Oblivious Search Trees via Binary Trees of Small Height"
- Paper (PMA): "An Adaptive Packed-Memory Array"
7. Algorithm Optimization for SSD Storage
SSDs are no longer a rarity; many new laptops ship with hundreds of gigabytes of SSD. Why optimize specifically for them? Because an SSD's physical structure differs greatly from a traditional disk's. The whole SSD is divided into regions that can only be erased as a whole, so to extend its lifespan, wear is balanced both by the OS (the I/O scheduler and driver) and by the SSD's internal FTL (Flash Translation Layer).
On top of this physical structure, SSDs offer two kinds of parallelism: channel-level and package-level. Put simply, a single I/O can carry more data, and more operations can proceed in parallel. So we can optimize data structures in two directions: either increase the amount of data per I/O, or increase the number of I/O operations performed in parallel. The SSD model extends the DAM model with an extra parameter P, the number of I/Os that can be performed in parallel in each step.
Take B-tree search as an example. Within a single search we cannot know the traversal path in advance, so the operation itself cannot be parallelized; instead we exploit the SSD by increasing the data per I/O from a single block of size B to P·B. For multiple independent operations, we can simply issue the B-tree requests in parallel to achieve the same goal, as in the toy sketch below.
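A purely illustrative sketch of the "parallelize independent operations" direction, with a placeholder lookup function; nothing here reflects a real SSD driver or B-tree implementation:

```python
from concurrent.futures import ThreadPoolExecutor

P = 8  # hypothetical: the SSD can serve P I/Os concurrently

def btree_lookup(key):
    """Placeholder for a single B-tree point query that would hit the SSD."""
    return key  # in a real system this would perform the actual I/O

def batched_lookups(keys):
    # Issue independent lookups together instead of one after another.
    with ThreadPoolExecutor(max_workers=P) as pool:
        return list(pool.map(btree_lookup, keys))

print(batched_lookups([3, 17, 42, 99]))
```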
Recommended reading: "B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-Based Solid State Drives"
8. Algorithm Security
8.1 Complexity Attacks
Complexity attacks were already mentioned in the hashing section: for ordinary data structures we can use randomization to avoid the worst case. But the same problem reappears for randomized algorithms if the "adversary" knows our internal information, such as how the random number generator behaves. For example, the skip list degenerates into a ladder shape, and the treap degenerates into a linked list.
8.2 History-Independence
So-called history-independence means that even if the "adversary" sees the current layout of your data structure, they cannot infer the sequence of insert, delete, and update operations (and their arguments) that produced it. A typical example is the treap; counterexamples include chaining hash tables, AVL trees, and many other data structures. The teacher only mentioned this in passing in class, so my understanding is not great; it feels very novel, but I haven't found material that explains it specifically. I'll study it once I do!