Implement a big data search engine with Python (with source code)



In daily life we rely on search engines such as Baidu, Sogou, and Google; search is one of the most common needs in the big data field. Splunk and ELK are the leaders in the closed-source and open-source camps, respectively. This article uses a very small amount of Python code to implement basic search functionality, with the goal of helping everyone understand the underlying principles of big data search.





Bloom filter (Bloomfilter)



The first step is to implement a Bloom filter.



The Bloom filter is a common algorithm in the big data field, used to filter out elements that cannot possibly be matches. That is, if a search term does not exist in the data, the filter can report its absence very quickly.



Let's take a look at the code for the Bloom filter:



class Bloomfilter(object):
    """
    A Bloomfilter is a probabilistic data structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added.
    """
    def __init__(self, size):
        """Set up the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)



The basic data structure is an array (effectively a bitmap, using True/False, i.e. 1/0, to record whether a value is present). At initialization nothing has been indexed, so every slot is False. In practice the array is made very large to keep collisions, and thus false positives, rare.



A hash function determines where a value should live, that is, its index in the array.



When a value is added to the Bloom filter, its hash is computed and the corresponding slot is set to True.



To check whether a value has already been indexed, simply look at the True/False flag in the slot that its hash points to.



From this you can see that if the Bloom filter returns False, the value has definitely never been indexed; but if it returns True, you cannot be certain the value was indexed. Using the filter during search lets many misses return early, improving efficiency.



Let's see how this code works:



bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding 'bird' - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))



Results (output from the original run; note that Python 3 randomizes string hashes per process, so the exact indices will differ on your machine):



[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False



First, a Bloom filter of size 10 is created.



Then three values, 'dog', 'fish', and 'cat', are added; the contents of the filter appear in the first line of output above.



Next 'bird' is added, and the contents of the Bloom filter do not change, because 'bird' hashes to exactly the same slot as 'cat' (both map to index 9).



Finally we check a batch of values ('dog', 'fish', 'cat', 'bird', 'duck', 'emu'), the last two of which were never indexed. It turns out that 'duck' returns True while 'emu' returns False: the hash of 'duck' happens to be exactly the same as that of 'dog', a false positive.
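As an aside, the filter above uses a single hash function, so collisions like 'duck'/'dog' are common. Classic Bloom filters reduce the false-positive rate by setting k bits per value with k independent hash functions. Here is a minimal sketch of that variant; the md5-with-seed scheme is just one illustrative way to derive multiple hashes, not how Splunk does it:

```python
import hashlib

class MultiHashBloomFilter(object):
    """Bloom filter variant that sets k bits per value instead of one."""

    def __init__(self, size, num_hashes=3):
        self.values = [False] * size
        self.size = size
        self.num_hashes = num_hashes

    def _hashes(self, value):
        # Derive k array indices by hashing the value with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.md5('{}:{}'.format(seed, value).encode('utf-8'))
            yield int(digest.hexdigest(), 16) % self.size

    def add_value(self, value):
        for h in self._hashes(value):
            self.values[h] = True

    def might_contain(self, value):
        # A value was definitely never added unless all k of its bits are set.
        return all(self.values[h] for h in self._hashes(value))

bf = MultiHashBloomFilter(64, num_hashes=3)
for animal in ('dog', 'fish', 'cat'):
    bf.add_value(animal)
print(bf.might_contain('dog'))  # True - 'dog' was added
```

With k hash functions, a false positive now requires k simultaneous collisions, which is why real deployments size the bit array and k together against the expected number of elements.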



Word segmentation



In the next step we implement word segmentation. Its purpose is to divide our text data into the smallest searchable units: words. Here we focus on English; Chinese segmentation is a natural language processing problem and considerably more complex, while English can be split reasonably well on punctuation and whitespace.
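To make the "English splits well on punctuation" point concrete, here is a standalone sketch (not part of the article's segmenter) that tokenizes English text with a single regular expression:

```python
import re

def tokenize(text):
    """Lowercase word tokens: split on any run of non-alphanumeric characters."""
    return [tok for tok in re.split(r'[^a-zA-Z0-9]+', text.lower()) if tok]

print(tokenize('Splunk and ELK are leaders, respectively.'))
# ['splunk', 'and', 'elk', 'are', 'leaders', 'respectively']
```

Chinese has no such separators between words, which is why it needs a dictionary- or statistics-based segmenter instead.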



Let's look at the segmentation code:



def major_segments(s):
    """
    Perform major segmenting on a string. Split the string by all of the major
    breaks, and return the set of everything found. The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last + 1:idx]
            results.add(segment)
            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being "", but yolo)
    segment = s[last + 1:]
    results.add(segment)

    return results



Major split



The major split uses the space character as its separator. Real tokenizer logic uses other separators as well; for example, Splunk's default delimiters include the following, and users can also define their own:



]<>{}|!;, ' "*\n\r\s\t&?+%21%26%2526%3b%7c%20%2b%3d-%2520%5d%5b%3a%0a%2c%28%29
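A sketch of how major_segments could honor a larger break set like the one above; the break characters and function name here are illustrative choices, not the article's code:

```python
import re

def major_segments_custom(s, breaks=' ,;|'):
    """Split s on any single character from the given break set."""
    segments = set(re.split('[' + re.escape(breaks) + ']', s))
    segments.discard('')  # adjacent breaks yield empty segments; drop them
    return segments

print(sorted(major_segments_custom('error;src=1.2.3.4 dst=5.6.7.8|GET')))
# ['GET', 'dst=5.6.7.8', 'error', 'src=1.2.3.4']
```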



def minor_segments(s):
    """
    Perform minor segmenting on a string. This is like major
    segmenting, except it also captures from the start of the
    input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last + 1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last + 1:]
    results.add(segment)
    results.add(s)

    return results



Minor split



The minor split follows the same logic as the major split, but it additionally captures every prefix from the start of the input to each break. For example, the minor split of "1.2.3.4" yields 1, 2, 3, 4, 1.2, 1.2.3, and the full string 1.2.3.4.
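To see the prefix behavior in isolation, here is a compact standalone version of the same minor-split idea (break characters '.' and '_', as the example output implies):

```python
def minor_split(s, breaks='_.'):
    """Segments between breaks, every prefix ending at a break, and s itself."""
    results, last = set(), -1
    for idx, ch in enumerate(s):
        if ch in breaks:
            results.add(s[last + 1:idx])  # segment since the previous break
            results.add(s[:idx])          # prefix of s up to this break
            last = idx
    results.add(s[last + 1:])             # trailing segment
    results.add(s)
    return results

print(sorted(minor_split('1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4']
```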



def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results



The segmentation logic first performs the major split on the text, then performs a minor split on each major segment, and finally returns all of the resulting terms.



Let's see how this code works:



for term in segments('src_ip = 1.2.3.4'):
    print(term)



src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4



Search



Well, with the two sharp tools of a segmenter and a Bloom filter in hand, we can implement the search function. On to the code:



class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return

        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]



The Splunk class represents a collection of indexed events with a search function.



The collection contains a Bloom filter, an inverted term index (a dictionary), and a list storing all events.



When an event is added to the index, the following logic runs:



A unique ID is generated for the event; here it is simply its sequence number.



The event is segmented, and each term is added to the inverted index, that is, the mapping from each term to the IDs of the events containing it. Note that one term may correspond to several events, so the values of the inverted index are sets. The inverted index is the core structure of most search engines.



When a term is searched, the following logic runs:



Check the Bloom filter; if it returns False, return immediately.



Check the term dictionary; if the term is not present, return immediately.



Look up all the matching event IDs in the inverted index and yield the corresponding events.
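The inverted index at the heart of these steps is just a dictionary from term to the set of event IDs containing it. A minimal standalone illustration, using plain whitespace splitting instead of the article's full segmentation:

```python
from collections import defaultdict

events = ['src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4']

# Build the inverted index: term -> set of IDs of events containing that term.
index = defaultdict(set)
for event_id, event in enumerate(events):
    for term in event.split():
        index[term].add(event_id)

print(sorted(index['src_ip']))   # [0, 1]
print(sorted(index['1.2.3.4']))  # [0, 2]
```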



Let's run it and take a look:



s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)



src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4



Isn't that great!



More complex searches



Going further, we want to use AND and OR during search to implement more complex search logic.



On to the code:



class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]



Python's built-in set intersection and union operations make it very convenient to support AND (intersection) and OR (union).
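A two-line illustration of the set machinery the AND/OR search relies on (the event-ID sets here are hypothetical):

```python
ids_a = {0, 1}  # events containing the first term
ids_b = {0, 2}  # events containing the second term

print(sorted(ids_a & ids_b))  # AND -> intersection: [0]
print(sorted(ids_a | ids_b))  # OR  -> union: [0, 1, 2]
```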



The code runs as follows:



s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)



src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4


