Horspool algorithm C++11 implementation (supports mixed Chinese-English search)

Source: Internet
Author: User


Summary:


This article presents an implementation of the Horspool algorithm with a usage example, introduces a very useful UTF-8 transcoding project, and gives a brief test report.


Algorithm implementation:

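#include <iostream>
#include <unordered_map>
// #include <codecvt>
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>
#include "utf8.h"
using namespace std;

// Maps each character of the pattern (except the last one) to the distance
// from its rightmost occurrence to the end of the pattern.
template <typename Key, typename Value>
class ShiftTable {
  public:
    ShiftTable(const std::u32string& pattern) {
        index_ = pattern.size();
        auto end = pattern.rbegin();   // last character of the pattern
        auto head = pattern.rend();
        auto cur = end + 1;            // start at the second-to-last character
        while (cur != head) {
            // emplace keeps an existing entry, so the rightmost occurrence wins
            shiftTable_.emplace(make_pair(*cur, static_cast<Value>(cur - end)));
            ++cur;
        }
    }
    // Characters that do not occur in the table shift by the full pattern length.
    Value operator[](const Key& key) const {
        auto cur = shiftTable_.find(key);
        if (cur != shiftTable_.end())
            return cur->second;
        else
            return index_;
    }
  private:
    unordered_map<Key, Value> shiftTable_;
    size_t index_;
};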


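// Returns the index (counted in code points) of the first occurrence of
// pattern in text, or -1 if there is none.
int HorspoolMatching(const std::u32string& pattern, const std::u32string& text) {
    if (pattern.empty() || text.empty()) return -1;
    ShiftTable<char32_t, size_t> table(pattern);
    const size_t m = pattern.size();
    const size_t n = text.size();
    size_t i = m - 1;   // position in text aligned with the pattern's last character
    while (i <= n - 1) {
        size_t k = 0;   // number of characters matched so far, right to left
        while (k <= m - 1 && pattern[m - 1 - k] == text[i - k]) k++;
        if (k == m)
            return static_cast<int>(i - m + 1);
        else
            i += table[text[i]];
    }
    return -1;
}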



The Horspool algorithm itself is not explained in detail here; only one implementation is shared.
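
For readers who want to see what the shift table contains, here is a minimal sketch that builds the same kind of table with a plain std::unordered_map. The function name buildShiftTable is hypothetical and not part of the implementation above; it only illustrates the rule that each character among the first m-1 characters of the pattern maps to the distance from its rightmost occurrence to the end of the pattern.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper for illustration only.
// For every character in pattern[0 .. m-2], store m-1-j, where j is the
// index of its rightmost occurrence in that range.
std::unordered_map<char32_t, std::size_t> buildShiftTable(const std::u32string& pattern) {
    std::unordered_map<char32_t, std::size_t> table;
    const std::size_t m = pattern.size();
    for (std::size_t j = 0; j + 1 < m; ++j) {
        // Scanning left to right, later occurrences overwrite earlier ones,
        // so the rightmost occurrence decides the final value.
        table[pattern[j]] = m - 1 - j;
    }
    return table;
}
```

For the pattern U"BARBER" this gives B = 2, A = 4, R = 3 and E = 1; any character that is not in the table shifts by the full pattern length, 6.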

The implementation uses std::u32string: the text must be converted to UTF-32 so that every character, in any language, is a single fixed-width element that can be searched.

I strongly recommend a lightweight open-source UTF-8 transcoding implementation; its project homepage is: utf8
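
To show roughly what the UTF-8 to UTF-32 step involves, here is a minimal hand-written decoder. It is only a sketch and not the library's code: the name decodeUtf8 is hypothetical, and the function assumes well-formed UTF-8 input and does no validation, which a real transcoding library would do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; assumes well-formed UTF-8.
std::u32string decodeUtf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)             { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead >> 5) == 0x6) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead >> 4) == 0xE) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                         { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

A Chinese character such as U+4F60 occupies three bytes in UTF-8 but exactly one element of the resulting std::u32string, which is why the search can treat Chinese and English characters uniformly.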

An example of use, find and replace:
int main()
{
    // A more efficient, pure C++ way to read a whole file into a string
    ifstream filestream("/home/ron/input.in"); // the file must be saved as UTF-8 (regardless of operating system)
    stringstream ss;
    ss << filestream.rdbuf();
    filestream.close(); // done reading; the same file is rewritten below

    string text(ss.str());
    string pattern = "you are"; // stored as UTF-8 because the source file is saved as UTF-8 under Ubuntu

    std::u32string text32;
    std::u32string pattern32;
    utf8::utf8to32(text.begin(), text.end(), back_inserter(text32));
    utf8::utf8to32(pattern.begin(), pattern.end(), back_inserter(pattern32));

    string repWord = "me"; // stored as UTF-8 for the same reason
    std::u32string repWord32;
    utf8::utf8to32(repWord.begin(), repWord.end(), back_inserter(repWord32));

    // Find "you are" in the file
    auto index = HorspoolMatching(pattern32, text32);
    if (index != -1)
    {
        cout << "found it, at index " << index << endl;
        // Replace the whole match (pattern32.size() code points), not just one character
        text32.replace(index, pattern32.size(), repWord32);
        // Write the file back with the first "you are" replaced by "me"
        ofstream ofilestream("/home/ron/input.in");
        ostream_iterator<char> out(ofilestream);
        utf8::utf32to8(text32.begin(), text32.end(), out);
    }
    else
    {
        cout << "not found" << endl;
    }

    return 0;
}
The code above is a usage example. It is cross-platform: it does not matter whether the operating system itself uses UTF-8, because the program transcodes between UTF-8, UTF-16 and UTF-32 on its own, so only the input file and the pattern string need to be encoded as UTF-8. UTF-8 is the standard encoding for network transmission, and most systems support it.
Supporting lookups in arbitrary encodings is possible, but it complicates the problem considerably; few people other than specialists in the field want to sink endlessly into character encoding issues.


codecvt:
#include <codecvt>
What is this header? C++11 introduced standard facilities for character transcoding, but unfortunately GCC had not implemented them at the time of writing. Versions of Visual C++ from VC2010 onward should support them; interested readers can explore this themselves. Because my build environment is GCC on Ubuntu, I could not use codecvt. Other character encoding libraries such as ICU are available, but they are large and cumbersome to use. In the end I found the lightweight utf8 project: copy the source files and it is ready to use. It works very well under Ubuntu and equally well under Windows. Its only real shortcoming is that it supports only the UTF encodings.
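
For readers whose toolchain does ship this header, the following sketch shows how the same UTF-8 to UTF-32 conversion might look with the standard facilities. The wrapper name utf8ToUtf32 is hypothetical. std::wstring_convert and std::codecvt_utf8 are C++11 library components, but they were deprecated in C++17, so a newer compiler may emit deprecation warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper for illustration only.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string utf8ToUtf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(bytes);
}
```

The result can be passed to HorspoolMatching in the same way as the strings produced by utf8::utf8to32.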

test:
I compared this implementation with the standard library's find on some text, and the running times were almost the same (the standard library was slightly faster). There is a GCC issue requesting that find use the Boyer-Moore algorithm, so my guess is that GCC's find currently uses something like the Horspool algorithm, which is fast and simple but offers no good worst-case guarantee.
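
To make the worst-case remark concrete, here is a small sketch that counts character comparisons instead of measuring time. It is a separate, simplified Horspool loop written for this illustration, and the name countComparisons is hypothetical. In the worst case Horspool needs on the order of m*n comparisons for a pattern of length m and a text of length n.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper for illustration only.
// Counts the character comparisons a Horspool search performs until the
// first match is found or the text is exhausted.
std::size_t countComparisons(const std::u32string& pattern, const std::u32string& text) {
    const std::size_t m = pattern.size();
    const std::size_t n = text.size();
    if (m == 0 || m > n) return 0;

    std::unordered_map<char32_t, std::size_t> shift;
    for (std::size_t j = 0; j + 1 < m; ++j) shift[pattern[j]] = m - 1 - j;

    std::size_t comparisons = 0;
    std::size_t i = m - 1;  // text index aligned with the pattern's last character
    while (i < n) {
        std::size_t k = 0;
        while (k < m) {
            ++comparisons;
            if (pattern[m - 1 - k] != text[i - k]) break;
            ++k;
        }
        if (k == m) return comparisons;  // full match: stop counting
        const auto it = shift.find(text[i]);
        i += (it == shift.end()) ? m : it->second;
    }
    return comparisons;
}
```

Searching for U"baaa" in a text of twenty 'a' characters shifts by one position each time and costs 4 * 17 = 68 comparisons, whereas searching for U"bbbb" in the same text skips four positions at a time and costs only 5.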

KMP:
Why no KMP? KMP is too complicated. Unless you are obsessed with algorithms, no programmer will choose an algorithm that offers the same practical efficiency but is more complex to implement. Still, the ideas behind KMP did influence later algorithms.
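
For comparison, here is a minimal sketch of KMP over std::u32string, so readers can judge the difference in complexity for themselves. The names prefixFunction and kmpSearch are hypothetical; this is a textbook-style formulation and not code from the original post.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only.
// pi[i] is the length of the longest proper prefix of p[0..i] that is also
// a suffix of p[0..i].
std::vector<std::size_t> prefixFunction(const std::u32string& p) {
    std::vector<std::size_t> pi(p.size(), 0);
    for (std::size_t i = 1; i < p.size(); ++i) {
        std::size_t k = pi[i - 1];
        while (k > 0 && p[i] != p[k]) k = pi[k - 1];
        if (p[i] == p[k]) ++k;
        pi[i] = k;
    }
    return pi;
}

// Returns the index of the first occurrence of pattern in text, or -1.
long kmpSearch(const std::u32string& pattern, const std::u32string& text) {
    if (pattern.empty() || text.size() < pattern.size()) return -1;
    const std::vector<std::size_t> pi = prefixFunction(pattern);
    std::size_t k = 0;  // number of pattern characters currently matched
    for (std::size_t i = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = pi[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) return static_cast<long>(i + 1 - k);
    }
    return -1;
}
```

Unlike Horspool, KMP never moves backwards in the text and runs in time linear in the combined length of pattern and text, at the cost of the less intuitive prefix table.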


My knowledge is limited, so criticism and corrections are welcome. If you repost this article, please credit the source. Thank you.
