A Summary of Simple Algorithms for Compressing Strings in C

Last updated: 2017-01-18  Source: Internet

Uploaded by: User

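Applications often need to compress a string into an integer, which is known as string hashing. The following problems are typical:
(1) A search engine logs every query string that users submit, and each query string is 1 to 255 bytes long. Find the 10 most popular query strings.
(2) A 1 GB file contains one word per line, each word is at most 16 bytes long, and the memory limit is 1 MB. Return the 100 most frequent words.
(3) There are 10 files of 1 GB each. Every line of every file holds a user query, and queries may be repeated in any file. Sort the queries by frequency.
(4) Two files, a and b, each hold 5 billion URLs of 64 bytes each, and the memory limit is 4 GB. Find the URLs that appear in both a and b.
(5) A text file has about ten thousand lines with one word per line. Find the 10 most frequent words.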

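All of these problems require compressing a string into an integer, in other words hashing it to some integer M. Taking a remainder, for example M % 16, then assigns the string to the file numbered M % 16, and identical strings are guaranteed to end up in the same file. In this way one large file is partitioned into several small files, and each small file can be handled with conventional methods such as in-memory sorting or a hash_map. Combining the results from the small files finally gives the answer to the original problem.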
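The partitioning step can be sketched in C as follows. This is a minimal illustration rather than a definitive implementation: the names `hash_string`, `bucket_index` and `partition_lines` are hypothetical, the shift-and-add hash is only one possible choice, and the sketch assumes every input line is shorter than the 512-byte buffer and that the caller has already opened one output stream per bucket.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: any deterministic string hash would do here. */
unsigned long hash_string(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = (h << 5) + (unsigned char)*s++;
    return h;
}

/* Index of the small file (bucket) that a string belongs to,
   in the range 0 .. bucket_count - 1. */
unsigned long bucket_index(const char *s, unsigned long bucket_count)
{
    assert(bucket_count > 0);
    return hash_string(s) % bucket_count;
}

/* Copies every line of `in` to the output stream chosen by its hash.
   `out` must point to bucket_count streams opened for writing.
   Assumption: no input line is longer than 510 characters. */
void partition_lines(FILE *in, FILE **out, unsigned long bucket_count)
{
    char line[512];
    while (fgets(line, sizeof line, in)) {
        line[strcspn(line, "\n")] = '\0';  /* hash the text, not the newline */
        fprintf(out[bucket_index(line, bucket_count)], "%s\n", line);
    }
}
```

Because the bucket depends only on the content of the string, every occurrence of the same query lands in the same file, so each file can be counted independently.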
Some algorithms for compressing strings are described below.

Method 1: the simplest approach is to add up all the characters. The code is as follows:

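unsigned long HashString(const char *pString, unsigned long tableSize)
{
    unsigned long hashValue = 0;

    /* Add up the value of every character; the cast keeps bytes
       above 127 from being treated as negative numbers. */
    while (*pString)
        hashValue += (unsigned char)*pString++;

    /* Reduce the sum to a valid table index. */
    return hashValue % tableSize;
}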

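Analysis: if the strings are short and the hash table is comparatively large, much of the table is wasted. For example, if a string is at most 16 bytes long and every character value is at most 127, the sum can never exceed 16*127 = 2032. If the hash table has 2729 entries, the entries after index 2032 are never used.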
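The bound in this analysis can be checked with a small C sketch. The helper names `additive_sum` and `max_additive_sum` are hypothetical and exist only for illustration. The sketch also makes a second weakness visible, one that follows from addition being order-independent: strings that contain the same characters in a different order, such as "abc" and "cba", always receive the same value.

```c
#include <assert.h>
#include <stddef.h>

/* Additive hash as in method 1, without the final modulo. */
unsigned long additive_sum(const char *s)
{
    unsigned long sum = 0;
    assert(s != NULL);
    while (*s)
        sum += (unsigned char)*s++;
    return sum;
}

/* Largest sum that a string of at most max_len characters can reach
   when no character value exceeds max_char. */
unsigned long max_additive_sum(unsigned long max_len, unsigned long max_char)
{
    return max_len * max_char;
}
```

With `max_additive_sum(16, 127)` the result is 2032, matching the figure above, and `additive_sum("abc")` and `additive_sum("cba")` both evaluate to 294.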

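Method 2: shift the previously computed hash value left by 5 bits (that is, multiply it by 32) and then add the current character. This gives a reasonably uniform distribution.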

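unsigned long HashString(const char *pString, unsigned long tableSize)
{
    unsigned long hashValue = 0;

    /* Multiply the running hash by 32, then add the next character. */
    while (*pString)
        hashValue = (hashValue << 5) + (unsigned char)*pString++;

    /* Reduce the result to a valid table index. */
    return hashValue % tableSize;
}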

Analysis: this method has to traverse the entire string, so it is inefficient when the string is long.
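One way to bound the cost for long strings, which is not part of the original method, is to hash only a limited number of leading characters. The sketch below assumes a hypothetical function `HashStringPrefix` with an extra tuning parameter `max_chars`. The trade-off is that all strings sharing the same first `max_chars` characters collide, so this is a poor fit for data with long common prefixes such as URLs.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-and-add hash over at most max_chars leading characters.
   max_chars is an assumed tuning parameter, not part of the
   original HashString function. */
unsigned long HashStringPrefix(const char *pString, size_t max_chars,
                               unsigned long tableSize)
{
    unsigned long hashValue = 0;
    size_t i;

    assert(pString != NULL);
    assert(tableSize > 0);

    /* Stop at the end of the string or after max_chars characters,
       whichever comes first. */
    for (i = 0; i < max_chars && pString[i] != '\0'; ++i)
        hashValue = (hashValue << 5) + (unsigned char)pString[i];

    return hashValue % tableSize;
}
```

For strings shorter than `max_chars` the result equals that of the full shift-and-add hash, so short keys are unaffected by the limit.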

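Method 3: use the Huffman algorithm. Assume the strings consist only of the ten characters 0-9 and apply the Huffman algorithm to them. Let us go straight to an example, which is written in C++: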

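#include <cstddef>
#include <queue>
#include <string>
#include <vector>
using namespace std;

#define Size 10

int freq[Size];     // number of occurrences of each digit '0'..'9'
string code[Size];  // Huffman code generated for each digit
string word;        // input string, assumed to contain only digits

struct Node
{
    int id;         // digit for a leaf, -1 for an internal node
    int freq;
    Node *left;
    Node *right;
    Node(int freq_in) : id(-1), freq(freq_in)
    {
        left = right = NULL;
    }
};

// Comparator that turns priority_queue into a min-heap on freq, so that
// top() is the least frequent node, as the Huffman algorithm requires.
struct NodeGreater
{
    bool operator()(const Node *a, const Node *b) const
    {
        return a->freq > b->freq;
    }
};

// Count how often each digit occurs in word.
void init()
{
    for (int i = 0; i < Size; ++i)
        freq[i] = 0;
    for (size_t i = 0; i < word.size(); ++i)
        ++freq[word[i] - '0'];  // map '0'..'9' to index 0..9
}

// Walk the tree, appending "0" for a left branch and "1" for a right
// branch; the path to each leaf becomes the code of that digit.
void dfs(Node *root, string res)
{
    if (root->id >= 0)
        code[root->id] = res;
    else
    {
        if (NULL != root->left)
            dfs(root->left, res + "0");
        if (NULL != root->right)
            dfs(root->right, res + "1");
    }
}

// Free the whole tree in post-order.
void deleteNodes(Node *root)
{
    if (NULL == root)
        return;
    deleteNodes(root->left);
    deleteNodes(root->right);
    delete root;
}

void BuildTree()
{
    priority_queue<Node*, vector<Node*>, NodeGreater> nodes;
    for (int i = 0; i < Size; ++i)
    {
        // the case 0 == freq[i] is not handled
        Node *newNode = new Node(freq[i]);
        newNode->id = i;
        nodes.push(newNode);
    }
    // Repeatedly merge the two least frequent nodes until one root remains.
    while (nodes.size() > 1)
    {
        Node *left = nodes.top();
        nodes.pop();
        Node *right = nodes.top();
        nodes.pop();
        Node *newNode = new Node(left->freq + right->freq);
        newNode->left = left;
        newNode->right = right;
        nodes.push(newNode);
    }
    Node *root = nodes.top();
    dfs(root, string(""));
    deleteNodes(root);
}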
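The Huffman construction starts from a table of digit frequencies. As a minimal C sketch of that first step only, the hypothetical function `count_digits` below maps each digit character to an index from 0 to 9 by subtracting '0'. Rejecting non-digit input is an added assumption and is not something the listing above does.

```c
#include <assert.h>
#include <stddef.h>

#define DIGIT_COUNT 10

/* Counts how often each digit '0'..'9' occurs in word.
   Returns 0 on success, or -1 if word contains a non-digit character
   (in which case freq holds the counts gathered up to that point). */
int count_digits(const char *word, int freq[DIGIT_COUNT])
{
    size_t i;

    assert(word != NULL && freq != NULL);

    for (i = 0; i < DIGIT_COUNT; ++i)
        freq[i] = 0;

    for (i = 0; word[i] != '\0'; ++i) {
        if (word[i] < '0' || word[i] > '9')
            return -1;
        ++freq[word[i] - '0'];  /* '0' maps to 0, '9' maps to 9 */
    }
    return 0;
}
```

For the input "112" the sketch sets `freq[1]` to 2 and `freq[2]` to 1, and leaves every other entry at 0.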

This article was originally written in Chinese and published on aliyun.com.
