Let's talk about how to use Python to implement a big data search engine.


Search is a common requirement in the big data field. Splunk and ELK are the leaders in the commercial and open source camps respectively. This article uses a small amount of Python code to implement basic data search, with the goal of explaining the fundamental principles behind big data search.

Bloom Filter

The first step is to implement a bloom filter.

The Bloom filter is a common algorithm in the big data field. Its purpose is to filter out elements that cannot possibly be matches. That is, if a search term does not exist in the data, the filter can report that miss as quickly as possible.

Let's take a look at the following bloom filter code:

class Bloomfilter(object):
    """
    A Bloom filter is a probabilistic data structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added.
    """
    def __init__(self, size):
        """Set up the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
  1. The basic data structure is an array (conceptually a bitmap, which uses 1/0 to record whether data exists). It is initialized to all False since it contains nothing yet. In actual use, the array is very long to keep the collision rate low.
  2. A hash algorithm determines where each value should live, that is, its index in the array.
  3. When a value is added to the Bloom filter, its hash is computed and the corresponding position is set to True.
  4. When checking whether a value already exists (i.e. has been indexed), you only need to check the True/False of the position given by its hash value.

We can see that if the Bloom filter returns False, the data definitely has not been indexed. However, if it returns True, the data only may have been indexed, because different values can hash to the same position. Used during search, the Bloom filter lets many misses return early, which improves efficiency.
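Real deployments reduce the false-positive rate by checking several independent hash positions per value instead of one. Here is a minimal sketch of that idea (not part of the original article; the class name and the trick of salting Python's built-in hash() are assumptions for illustration):

```python
class MultiHashBloom(object):
    """Bloom filter variant that sets k bit positions per value."""
    def __init__(self, size, k=3):
        self.size = size
        self.k = k
        self.bits = [False] * size

    def _positions(self, value):
        # Salt the built-in hash with an index to simulate k independent
        # hash functions; production code would use proper hashes (e.g. murmur).
        return [hash((i, value)) % self.size for i in range(self.k)]

    def add_value(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # Only report a possible hit if every one of the k positions is set.
        return all(self.bits[pos] for pos in self._positions(value))
```

A value is reported as possibly present only when all k of its bits are set, so an accidental collision on a single position no longer causes a false positive.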

Let's take a look at how this code runs:

bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding bird - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))

Result (the exact positions depend on the hash function; under Python 3's hash randomization they will vary between runs):

[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False

First, a Bloom filter with a capacity of 10 is created.

Then we add three objects: 'dog', 'fish', and 'cat'. The resulting contents of the Bloom filter are shown in the first line of the output above.

Then we add the 'bird' object. The contents of the Bloom filter do not change, because 'bird' and 'cat' hash to the same position.

Finally, we check whether a set of objects ('dog', 'fish', 'cat', 'bird', 'duck', 'emu') have been indexed. The result shows that 'duck' returns True even though it was never added, because its hash collides with that of 'dog'; 'emu' returns False because its position was never set.

Word Segmentation

Next, we implement word segmentation. Its purpose is to divide our text data into the smallest searchable units: words. Here we focus on English, because Chinese word segmentation involves natural language processing and is considerably more complicated, whereas English can largely be split on whitespace and punctuation.

Let's take a look at the word segmentation code:

def major_segments(s):
    """
    Perform major segmenting on a string. Split the string by all of the major
    breaks, and return the set of everything found. The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being "", but yolo)
    segment = s[last+1:]
    results.add(segment)

    return results

Major Segmentation

Only the space character is used as a break here. In real segmentation logic, there are other separators; for example, Splunk uses the following default major breakers, and you can also define your own.

] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
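To get a feel for richer breaker sets, here is a hedged sketch of a major segmenter that accepts a configurable set of break characters (the function name and default breakers are made up for illustration; unlike the article's version, it drops empty segments):

```python
import re

def major_segments_custom(s, breaks=' ,;|[](){}'):
    """Split s on any of the given single-character breakers."""
    # Build a character class matching any breaker, escaping regex metacharacters.
    pattern = '[' + re.escape(breaks) + ']'
    # Drop empty strings produced by consecutive or trailing breakers.
    return {seg for seg in re.split(pattern, s) if seg}
```

For example, `major_segments_custom('src=1|dst=2')` yields the set `{'src=1', 'dst=2'}`.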

def minor_segments(s):
    """
    Perform minor segmenting on a string. This is like major
    segmenting, except it also captures from the start of the
    input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last+1:]
    results.add(segment)
    results.add(s)

    return results

Minor Segmentation

Minor segmentation follows the same logic as major segmentation, but additionally captures the segment from the start of the input to each break. For example, the minor segmentation of "1.2.3.4" also includes "1.2" and "1.2.3".

def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results

The overall segmentation logic first performs major segmentation on the text, then applies minor segmentation to each major segment, and finally returns the union of all resulting terms.

Let's take a look at how this code runs:

for term in segments('src_ip = 1.2.3.4'):
    print(term)

src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4

Search

Well, with the support of Word Segmentation and bloom filter, we can implement the search function.

Code:

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return

        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

The Splunk class represents an indexed collection with search capability.

Each collection contains a Bloom filter, an inverted index (a dictionary mapping terms to events), and an array storing all events.

When an event is indexed, the following logic is performed:

  1. Generate a unique id for each event.
  2. Segment the event and add each term to the inverted index, that is, the mapping from each term to the ids of the events containing it. Note that one term may correspond to multiple events, so the values of the inverted index are sets. The inverted index is the core data structure of most search engines.

When a word is searched, the following logic is performed:

  1. Check the Bloom filter. If it returns False, return immediately.
  2. Check the inverted index. If the search term is not in it, return immediately.
  3. Look up all corresponding event ids in the inverted index, then yield the event contents.

Let's run the following code:

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)
src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Pretty neat, isn't it?

More complex search

Going further, we want to support And and Or during search to express more complex search logic.

Code:

class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

Using the intersection and union operations of Python sets, And (intersection) and Or (union) semantics are easy to support.
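The core of the And/Or logic reduces to set algebra on the posting sets. For example, with hypothetical event-id sets (the term-to-id assignments here are made up for illustration):

```python
# Hypothetical posting sets for three terms.
src_ip_ids = {0, 1}    # events containing 'src_ip'
dst_ip_ids = {2}       # events containing 'dst_ip'
five_six_ids = {1}     # events containing '5.6'

and_result = src_ip_ids & five_six_ids   # And: intersection -> {1}
or_result = src_ip_ids | dst_ip_ids      # Or: union -> {0, 1, 2}
```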

The running result is as follows:

s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Summary

The code above only illustrates the basic principles of big data search: the Bloom filter, word segmentation, and the inverted index. It is still a long way from a real, usable search implementation. All content comes from Splunk Conf 2017. I hope it is helpful for your learning.
