Implement a big data search engine with Python (with source code)



In daily life we rely on search engines such as Baidu, Sogou, and Google; search is one of the most common needs in the big data field. Splunk and ELK are the leaders in the closed-source and open-source camps, respectively. This article uses a very small amount of Python code to implement basic search functionality, with the goal of helping everyone understand the underlying principles of big data search.





Bloom filter (Bloomfilter)



The first step is to implement a Bloom filter.



The Bloom filter is a common algorithm in the big data field, used to filter out elements that cannot possibly be matches. That is, if a search term does not exist in the data, the filter can report its absence very quickly.



Let's take a look at the code for the Bloom filter:



class Bloomfilter(object):
    """
    A Bloomfilter is a probabilistic data structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added.
    """
    def __init__(self, size):
        """Set up the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)



The basic data structure is an array (effectively a bitmap, using True/False, i.e. 1/0, to record whether a value is present). At initialization nothing has been indexed, so every slot is False. In practice the array is made very large to keep collisions, and thus false positives, rare.



A hash function determines where a value should live, that is, its index in the array.



When a value is added to the Bloom filter, its hash is computed and the corresponding slot is set to True.



To check whether a value has already been indexed, simply look at the True/False flag in the slot that its hash points to.



From this you can see that if the Bloom filter returns False, the value has definitely never been indexed; but if it returns True, you cannot be certain the value was indexed. Using the filter during search lets many misses return early, improving efficiency.



Let's see how this code works:



bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding 'bird' - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))



Results (output from the original run; note that Python 3 randomizes string hashes per process, so the exact indices will differ on your machine):



[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False



First, a Bloom filter of size 10 is created.



Then three values, 'dog', 'fish', and 'cat', are added; the contents of the filter appear in the first line of output above.



Next 'bird' is added, and the contents of the Bloom filter do not change, because 'bird' hashes to exactly the same slot as 'cat' (both map to index 9).



Finally we check a batch of values ('dog', 'fish', 'cat', 'bird', 'duck', 'emu'), the last two of which were never indexed. It turns out that 'duck' returns True while 'emu' returns False: the hash of 'duck' happens to be exactly the same as that of 'dog', a false positive.
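As an aside, the filter above uses a single hash function, so collisions like 'duck'/'dog' are common. Classic Bloom filters reduce the false-positive rate by setting k bits per value with k independent hash functions. Here is a minimal sketch of that variant; the md5-with-seed scheme is just one illustrative way to derive multiple hashes, not how Splunk does it:

```python
import hashlib

class MultiHashBloomFilter(object):
    """Bloom filter variant that sets k bits per value instead of one."""

    def __init__(self, size, num_hashes=3):
        self.values = [False] * size
        self.size = size
        self.num_hashes = num_hashes

    def _hashes(self, value):
        # Derive k array indices by hashing the value with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.md5('{}:{}'.format(seed, value).encode('utf-8'))
            yield int(digest.hexdigest(), 16) % self.size

    def add_value(self, value):
        for h in self._hashes(value):
            self.values[h] = True

    def might_contain(self, value):
        # A value was definitely never added unless all k of its bits are set.
        return all(self.values[h] for h in self._hashes(value))

bf = MultiHashBloomFilter(64, num_hashes=3)
for animal in ('dog', 'fish', 'cat'):
    bf.add_value(animal)
print(bf.might_contain('dog'))  # True - 'dog' was added
```

With k hash functions, a false positive now requires k simultaneous collisions, which is why real deployments size the bit array and k together against the expected number of elements.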



Word segmentation



In the next step we implement word segmentation. Its purpose is to divide our text data into the smallest searchable units: words. Here we focus on English; Chinese segmentation is a natural language processing problem and considerably more complex, while English can be split reasonably well on punctuation and whitespace.
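To make the "English splits well on punctuation" point concrete, here is a standalone sketch (not part of the article's segmenter) that tokenizes English text with a single regular expression:

```python
import re

def tokenize(text):
    """Lowercase word tokens: split on any run of non-alphanumeric characters."""
    return [tok for tok in re.split(r'[^a-zA-Z0-9]+', text.lower()) if tok]

print(tokenize('Splunk and ELK are leaders, respectively.'))
# ['splunk', 'and', 'elk', 'are', 'leaders', 'respectively']
```

Chinese has no such separators between words, which is why it needs a dictionary- or statistics-based segmenter instead.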



Let's look at the segmentation code:



def major_segments(s):
    """
    Perform major segmenting on a string. Split the string by all of the major
    breaks, and return the set of everything found. The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last + 1:idx]
            results.add(segment)
            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being "", but yolo)
    segment = s[last + 1:]
    results.add(segment)

    return results



Major split



The major split uses the space character as its separator. Real tokenizer logic uses other separators as well; for example, Splunk's default delimiters include the following, and users can also define their own:



]<>{}|!;, ' "*\n\r\s\t&?+%21%26%2526%3b%7c%20%2b%3d-%2520%5d%5b%3a%0a%2c%28%29
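A sketch of how major_segments could honor a larger break set like the one above; the break characters and function name here are illustrative choices, not the article's code:

```python
import re

def major_segments_custom(s, breaks=' ,;|'):
    """Split s on any single character from the given break set."""
    segments = set(re.split('[' + re.escape(breaks) + ']', s))
    segments.discard('')  # adjacent breaks yield empty segments; drop them
    return segments

print(sorted(major_segments_custom('error;src=1.2.3.4 dst=5.6.7.8|GET')))
# ['GET', 'dst=5.6.7.8', 'error', 'src=1.2.3.4']
```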



def minor_segments(s):
    """
    Perform minor segmenting on a string. This is like major
    segmenting, except it also captures from the start of the
    input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last + 1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last + 1:]
    results.add(segment)
    results.add(s)

    return results



Minor split



The minor split follows the same logic as the major split, but it additionally captures every prefix from the start of the input to each break. For example, the minor split of "1.2.3.4" yields 1, 2, 3, 4, 1.2, 1.2.3, and the full string 1.2.3.4.
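To see the prefix behavior in isolation, here is a compact standalone version of the same minor-split idea (break characters '.' and '_', as the example output implies):

```python
def minor_split(s, breaks='_.'):
    """Segments between breaks, every prefix ending at a break, and s itself."""
    results, last = set(), -1
    for idx, ch in enumerate(s):
        if ch in breaks:
            results.add(s[last + 1:idx])  # segment since the previous break
            results.add(s[:idx])          # prefix of s up to this break
            last = idx
    results.add(s[last + 1:])             # trailing segment
    results.add(s)
    return results

print(sorted(minor_split('1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4']
```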



def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results



The segmentation logic first performs the major split on the text, then performs a minor split on each major segment, and finally returns all of the resulting terms.



Let's see how this code works:



for term in segments('src_ip = 1.2.3.4'):
    print(term)



src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4



Search



Well, with the two sharp tools of a segmenter and a Bloom filter in hand, we can implement the search function. On to the code:



class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return

        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]



The Splunk class represents a collection of indexed events with a search function.



The collection contains a Bloom filter, an inverted term index (a dictionary), and a list storing all events.



When an event is added to the index, the following logic runs:



A unique ID is generated for the event; here it is simply its sequence number.



The event is segmented, and each term is added to the inverted index, that is, the mapping from each term to the IDs of the events containing it. Note that one term may correspond to several events, so the values of the inverted index are sets. The inverted index is the core structure of most search engines.



When a term is searched, the following logic runs:



Check the Bloom filter; if it returns False, return immediately.



Check the term dictionary; if the term is not present, return immediately.



Look up all the matching event IDs in the inverted index and yield the corresponding events.
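The inverted index at the heart of these steps is just a dictionary from term to the set of event IDs containing it. A minimal standalone illustration, using plain whitespace splitting instead of the article's full segmentation:

```python
from collections import defaultdict

events = ['src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4']

# Build the inverted index: term -> set of IDs of events containing that term.
index = defaultdict(set)
for event_id, event in enumerate(events):
    for term in event.split():
        index[term].add(event_id)

print(sorted(index['src_ip']))   # [0, 1]
print(sorted(index['1.2.3.4']))  # [0, 2]
```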



Let's run it and take a look:



s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)



src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4



Isn't that great!



More complex searches



Going further, we want to use AND and OR during search to implement more complex search logic.



On to the code:



class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]



Python's built-in set intersection and union operations make it very convenient to support AND (intersection) and OR (union).
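A two-line illustration of the set machinery the AND/OR search relies on (the event-ID sets here are hypothetical):

```python
ids_a = {0, 1}  # events containing the first term
ids_b = {0, 2}  # events containing the second term

print(sorted(ids_a & ids_b))  # AND -> intersection: [0]
print(sorted(ids_a | ids_b))  # OR  -> union: [0, 1, 2]
```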



The code runs as follows:



s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)



src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4


