Introduction to the nark Database

Source: Internet
Author: User

This is my original article, first published on my company website: http://nark.cc/p?p=1560

Unlike common databases built on hash or tree structures, the nark database is based on automata, which makes nark both powerful and concise. Most importantly, nark provides a complete set of solutions.

Because automata can only be built offline, a nark database is a read-only database. In return, it offers the greatest memory savings and high-speed search. The vast majority of nark components are therefore divided into offline database creation and online search.

Currently, offline database creation is open to all users in the form of executable programs, while online search is open to paying users in the form of a C++ API.

To let all users experience the performance of nark before paying, the download package also contains sample programs, most of which are also benchmark programs. All users can run these sample programs on their own machines. The source code of the sample programs is open to all users as well, but only paying users can compile it themselves, because the C++ API is required.


How powerful nark is
As a NoSQL database
The intelligent error correction demo can be viewed on the homepage. In essence, nark is used there as a database that supports regular expression search. The "regular expression" is not handwritten, however: a DFA is built from the search term, and that DFA naturally corresponds to a regular expression.

Rule engine
Each rule is an advanced regular expression. With 100,000 rules configured and an input string (such as a network message), the nark rule engine needs only a few microseconds to determine which rules the string matches. Application cases: query term classification at an Internet company, SMS analysis in a natural language processing application, intrusion detection at a network equipment vendor, and so on.
In a simplified scenario, the rules contain only binary strings that require exact matching. In that case you can use the nark AC (Aho-Corasick) automaton. In a benchmark with 2,000 pattern strings, the matching throughput reaches 720 MB/s.
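
To make the simplified scenario concrete, here is a minimal textbook Aho-Corasick sketch in C++. It is not the nark implementation and not the nark C++ API: the class name AhoCorasick and the method scan are hypothetical names chosen for illustration, and the dense 256-entry transition table per state trades memory for simplicity.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Textbook Aho-Corasick automaton. Hypothetical illustration only:
// this is NOT the nark API and says nothing about nark's internals.
class AhoCorasick {
public:
    explicit AhoCorasick(const std::vector<std::string>& patterns) {
        nodes_.emplace_back();  // state 0 is the root
        // 1. Insert every pattern into a trie.
        for (std::size_t id = 0; id < patterns.size(); ++id) {
            if (patterns[id].empty()) continue;  // empty patterns are skipped
            int s = 0;
            for (unsigned char c : patterns[id]) {
                if (nodes_[s].next[c] < 0) {
                    const int fresh = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                    nodes_[s].next[c] = fresh;
                }
                s = nodes_[s].next[c];
            }
            nodes_[s].out.push_back(id);
        }
        // 2. Breadth-first pass: set failure links and fill in missing
        //    transitions, so that scanning never has to backtrack.
        std::queue<int> pending;
        for (int c = 0; c < 256; ++c) {
            const int t = nodes_[0].next[c];
            if (t < 0) nodes_[0].next[c] = 0;
            else pending.push(t);  // depth-1 states fail to the root
        }
        while (!pending.empty()) {
            const int s = pending.front();
            pending.pop();
            const int f = nodes_[s].fail;
            // Patterns that end at the failure state also end here.
            const std::vector<std::size_t> inherited = nodes_[f].out;
            nodes_[s].out.insert(nodes_[s].out.end(),
                                 inherited.begin(), inherited.end());
            for (int c = 0; c < 256; ++c) {
                const int t = nodes_[s].next[c];
                if (t < 0) {
                    nodes_[s].next[c] = nodes_[f].next[c];
                } else {
                    nodes_[t].fail = nodes_[f].next[c];
                    pending.push(t);
                }
            }
        }
    }

    // Returns one (end offset, pattern id) pair per occurrence in text.
    std::vector<std::pair<std::size_t, std::size_t>>
    scan(const std::string& text) const {
        std::vector<std::pair<std::size_t, std::size_t>> hits;
        int s = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            s = nodes_[s].next[static_cast<unsigned char>(text[i])];
            for (std::size_t id : nodes_[s].out) hits.emplace_back(i + 1, id);
        }
        return hits;
    }

private:
    struct Node {
        std::array<int, 256> next;     // -1 means "no transition yet"
        int fail = 0;                  // failure link
        std::vector<std::size_t> out;  // ids of patterns ending in this state
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes_;
};
```

The input is read exactly once, one table lookup per byte, regardless of how many patterns are configured. That property is what makes this family of automata attractive for exact multi-pattern matching.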
For NLP (Natural Language Processing)
  • Compress a large corpus (for example, N-grams) into a DFA that offers powerful matching capabilities while greatly reducing memory usage.
  • Reduce many complex computing problems to (matching + simple computing), and so on.
For compressing massive numbers of small files
  • When small files are stored with mainstream compression algorithms and any single file must remain readable in real time, only the redundancy inside each individual file can be compressed away.
  • With rltzip, you can efficiently compress massive numbers of small files and still read any single small file at high speed (equivalent to extracting only that one file).
  • The forward table (forward index) of a search engine is a typical case.

Core nark APIs
Abstract interfaces and their functions:
  • DFA_Interface: prefix search, Key-Value search, and Key existence check.
  • DAWG_Interface: prefix search, plus conversion in both directions between a string Key and an integer Index. The Index is the sequence number of the Key in the entire database, from 0 to n-1. To implement a Map<Key, Value>, you can store the Values in an external array and access them by Index, which provides another capability: the Values can be modified.
  • SuffixCountableDAWG: an interface derived from DAWG_Interface. New function: get, at high speed, the number of distinct suffixes that follow a specified prefix.
  • AC_Scan_Interface: multi-pattern matching with an AC automaton. It finds every position at which any of the patterns occurs in the input data.
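
The idea of keeping Values in an external array addressed by Index can be sketched as follows. This is only an illustration under stated assumptions: a sorted std::vector stands in for the read-only key database, and the names IndexedKeySet, index_of, key_at, and set_value are hypothetical, not part of the nark C++ API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a read-only key set with Key <-> Index mapping.
// A sorted vector plays the role of the database; this is NOT the nark API.
class IndexedKeySet {
public:
    explicit IndexedKeySet(std::vector<std::string> keys)
        : keys_(std::move(keys)) {
        std::sort(keys_.begin(), keys_.end());
        keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
    }

    std::size_t size() const { return keys_.size(); }

    // Key -> Index. Returns false if the key is not in the set.
    bool index_of(const std::string& key, std::size_t& index) const {
        auto it = std::lower_bound(keys_.begin(), keys_.end(), key);
        if (it == keys_.end() || *it != key) return false;
        index = static_cast<std::size_t>(it - keys_.begin());
        return true;
    }

    // Index -> Key (throws std::out_of_range for an invalid index).
    const std::string& key_at(std::size_t index) const {
        return keys_.at(index);
    }

private:
    std::vector<std::string> keys_;
};

// Updates the value of an existing key. The key set itself never changes;
// `values` must hold exactly keys.size() elements.
bool set_value(const IndexedKeySet& keys, std::vector<int>& values,
               const std::string& key, int value) {
    std::size_t index = 0;
    if (!keys.index_of(key, index)) return false;
    values[index] = value;
    return true;
}
```

Because the Values live outside the key structure, they can be overwritten at any time without rebuilding it. New Keys, however, cannot be added this way.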

nark offline database creation programs
adfa_build

Creates a (Key, Value) database from a text file, where both Key and Value are strings. Each line of the text file is one (Key, Value) record, with Key and Value generally separated by a tab (\t). The Key is everything before the first \t; the Value may itself contain \t.
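
A minimal C++ sketch of this record format is shown here. The function name split_record is hypothetical, and treating a line without any tab as a Key with an empty Value is an assumption; the article does not say how adfa_build handles such lines.

```cpp
#include <string>
#include <utility>

// Splits one input line into (Key, Value) at the FIRST tab only, so any
// later tabs stay inside the Value. Hypothetical helper, not a nark function.
std::pair<std::string, std::string> split_record(const std::string& line) {
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos) {
        // Assumption: no tab means the whole line is the Key, Value is empty.
        return std::make_pair(line, std::string());
    }
    return std::make_pair(line.substr(0, tab), line.substr(tab + 1));
}
```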

The dfa database generated by this program only supports the DFA_Interface interface.

This program is suited to (Key, Value) sets with strong combinatorial features. When you are not sure whether your data has them, you can also try rptrie_build and check whether the generated dfa database file is smaller.

dawg_build

Creates a (Key, Index) database from a text file. The entire content of each line of the text file is treated as a Key. The generated dfa database supports both DFA_Interface and SuffixCountableDAWG.

The full name of DAWG is Directed Acyclic Word Graph.

  • In a DAWG, the Index corresponding to a Key is the lexicographic rank of that Key within the entire Key set (0 to N-1).
  • This database achieves a higher compression ratio only when the Key set has strong combinatorial features.
  • In past experience, DAWG has turned out to be suitable less often than adfa and rptrie.
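
The lexicographic Index and the prefix-count query of SuffixCountableDAWG can be pictured with a plain sorted array. The sketch below is only a model of what the queries return, not of how nark computes them; the function names rank_of and count_with_prefix are hypothetical, and the input is assumed to be sorted in ascending byte order with no duplicates.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Lexicographic rank of `key`, i.e. how many keys in sortedKeys are smaller.
// For a key that is present, this is its Index in the range 0 to N-1.
std::size_t rank_of(const std::vector<std::string>& sortedKeys,
                    const std::string& key) {
    return static_cast<std::size_t>(
        std::lower_bound(sortedKeys.begin(), sortedKeys.end(), key) -
        sortedKeys.begin());
}

// Number of keys that start with `prefix`. Such keys form one contiguous
// run in sorted order, so two binary searches are enough.
std::size_t count_with_prefix(const std::vector<std::string>& sortedKeys,
                              const std::string& prefix) {
    auto first = std::lower_bound(sortedKeys.begin(), sortedKeys.end(), prefix);
    auto last = std::partition_point(
        first, sortedKeys.end(), [&prefix](const std::string& k) {
            return k.compare(0, prefix.size(), prefix) == 0;
        });
    return static_cast<std::size_t>(last - first);
}
```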
rptrie_build

This program also creates a (Key, Value) database from a text file. In many cases the generated database file is smaller than the one produced by adfa_build, and it offers more functions: it supports both DFA_Interface and DAWG_Interface.

This type of database is not a Directed Acyclic Word Graph, but it can still map between Keys and integer Indexes. The Index does not follow lexicographic order, however, but (Key length, lexicographic order): Keys are compared by length first, and Keys of equal length are then compared lexicographically.
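
A minimal sketch of this ordering in C++, assuming Keys are compared byte by byte. The names len_then_lex_less and sort_len_then_lex are hypothetical and not part of nark.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Shorter keys come first; keys of equal length are ordered lexicographically.
bool len_then_lex_less(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;
}

// Returns the keys in (length, lexicographic) order. After duplicates are
// removed, a key's position in this sequence models the Index described above.
std::vector<std::string> sort_len_then_lex(std::vector<std::string> keys) {
    std::sort(keys.begin(), keys.end(), len_then_lex_less);
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    return keys;
}
```

For example, the keys "b", "aa", "a", "ab", "c" come out as "a", "b", "c", "aa", "ab".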

As the name suggests, such a database is essentially a trie. Compared with a double-array trie (DoubleArray Trie), however, this trie is about 30 times smaller, sometimes even 300 times smaller. The price is speed: depending on the data and the application scenario, it is roughly 3 to 10 times slower than a DoubleArray Trie.

Although this type of database has an extra capability compared with adfa (conversion between Key and Index), it is often both smaller and faster. This may seem counterintuitive, but it is what happens in practice.

rptrie is larger than adfa only when the data has a large number of combinatorial features. In theory, the compression achieved by rptrie is linear while that of adfa is exponential, but the redundancy in real data is more often linear than exponential.

rptrie is typically used in the following ways:

  1. Each line of the text file is a (Key, Value) record. In this case the DFA_Interface interface is usually used.
  2. Each line of the text file contains only a Key and no Value. In this case DAWG_Interface is usually used. If the Value needs to be modified, this is the only method that can be used.
  3. This type of database is not a SuffixCountableDAWG. Fortunately, most applications do not need that function.
rltzip

Uses the same algorithm as rptrie_build to compress a large number of small files (currently the maximum size of a single file is 16 MB). The larger the number of files, the higher the compression ratio, especially for text files.

The format of the generated dfa database file is exactly the same as that of rptrie_build. I used rltzip to compress 3 million small JSON files totaling 58 GB into a single rptrie of 7 GB.

rltunzip

Extracts (reads) a file by file name from a database generated by rltzip.

regex_build

This is a rule engine database creation tool.

 

More documentation is being written.
