python實現simhash演算法執行個體_python

來源:互聯網
上載者:User

Simhash的演算法簡單的來說就是,從海量文本中快速搜尋和已知simhash相差小於k位的simhash集合,這裡每個文本都可以用一個simhash值來代表,一個simhash有64bit,相似的文本,64bit也相似,論文中k的經驗值為3。該方法的缺點如優點一樣明顯,主要有兩點,對於短文本,k值很敏感;另一個是由於演算法是以空間換時間,系統記憶體吃不消。

複製代碼 代碼如下:

#!/usr/bin/python
# coding=utf-8
class simhash:

    #建構函式
    def __init__(self, tokens='', hashbits=128):       
        self.hashbits = hashbits
        self.hash = self.simhash(tokens);

    #toString函數   
    def __str__(self):
        return str(self.hash)

    #產生simhash值   
    def simhash(self, tokens):
        v = [0] * self.hashbits
        for t in [self._string_hash(x) for x in tokens]: #t為token的普通hash值          
            for i in range(self.hashbits):
                bitmask = 1 << i
                if t & bitmask :
                    v[i] += 1 #查看當前bit位是否為1,是的話將該位+1
                else:
                    v[i] -= 1 #否則的話,該位-1
        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i
        return fingerprint #整個文檔的fingerprint為最終各個位>=0的和

    #求海明距離
    def hamming_distance(self, other):
        x = (self.hash ^ other.hash) & ((1 << self.hashbits) - 1)
        tot = 0;
        while x :
            tot += 1
            x &= x - 1
        return tot

    #求相似性
    def similarity (self, other):
        a = float(self.hash)
        b = float(other.hash)
        if a > b : return b / a
        else: return a / b

    #針對source產生hash值   (一個可變長度版本的Python的內建散列)
    def _string_hash(self, source):       
        if source == "":
            return 0
        else:
            x = ord(source[0]) << 7
            m = 1000003
            mask = 2 ** self.hashbits - 1
            for c in source:
                x = ((x * m) ^ ord(c)) & mask
            x ^= len(source)
            if x == -1:
                x = -2
            return x
            

if __name__ == '__main__':
    s = 'This is a test string for testing'
    hash1 = simhash(s.split())

    s = 'This is a test string for testing also'
    hash2 = simhash(s.split())

    s = 'nai nai ge xiong cao'
    hash3 = simhash(s.split())

    print(hash1.hamming_distance(hash2) , "   " , hash1.similarity(hash2))
    print(hash1.hamming_distance(hash3) , "   " , hash1.similarity(hash3))


 

相關文章

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.