Text Categorization (Power 8 Algorithm Challenge, Stage 5)

Source: Internet
Author: User

This competition was right up my alley: it finally let me get hands-on with machine learning. My approach is a Bayesian method, and it reached 85.5% accuracy. I am sharing it here; friends who used other training methods are welcome to exchange notes.


First, a small digression:

The "small sample theory" I wrote about earlier has recently been filled out (the problem nagged at me for a couple of months), but since I want to check it against other people's approaches first, the second half still has not been written. In this text-classification task I used that "theory" as the basis for re-optimizing the statistical probabilities. When time allows I will also share a singular value decomposition method.

Whether word segmentation matters:

For text classification, many people assume that word segmentation is crucial, but when I first looked at the task I suspected that, in theory, this hardly matters. Accordingly, my segmentation method makes no claim that the resulting pieces are correct words; yet given that my accuracy turned out acceptable, please believe that segmentation is not that important. (Of course, I would very much like someone to run a state-of-the-art segmenter first and then test classification with my training results; if that gains two or three points, don't forget to share. I have never actually verified this.)


Training process:

The training process is divided into two main steps:

Step One: Count every fragment that appears in any article, and record how many times each fragment occurs in each category (a sketch of this step follows right after these two steps).

Step Two: For each fragment, estimate the probability it assigns to each category, i.e. the fragment's "opinion" (a single outlying opinion is discarded). The probabilities are stored as logarithms, so that the later multiplications become additions and the precision loss caused by underflow is avoided.
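A minimal sketch of step one (illustrative only, not the actual training code: count_fragments is a made-up helper, and the Train/<category>/<file> layout and GBK encoding are borrowed from the classifier further down; a fragment here is any run of 2 to 5 consecutive Chinese characters):

# Illustrative sketch of step one; count_fragments is a made-up helper.
# Assumes the Train/<category>/<file> layout and GBK files used by the
# classifier below; counts every fragment of 2..5 consecutive Chinese chars.
import os
import re
from collections import defaultdict

re_ch = re.compile(u"[\u4e00-\u9fa5]+", re.U)


def count_fragments(root_dir, max_len=5):
    counts = defaultdict(lambda: defaultdict(int))  # fragment -> {category: n}
    train_dir = os.path.join(root_dir, "Train")
    for category in os.listdir(train_dir):
        cat_dir = os.path.join(train_dir, category)
        for name in os.listdir(cat_dir):
            text = open(os.path.join(cat_dir, name)).read().decode("GBK", "replace")
            for blk in re_ch.findall(text):
                for i in xrange(len(blk)):
                    for j in xrange(2, max_len + 1):
                        if i + j <= len(blk):
                            counts[blk[i:i + j]][category] += 1
    return counts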

Step one really is that simple; step two I would rather not explain, so I won't. However, I will share the output files of both steps.
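Even with the estimator itself left out, the logarithm trick deserves one concrete line. Below is a plain Laplace-smoothed stand-in (log_class_probs is illustrative; it is not the estimator actually used here), with a comment showing why sums of logs beat products of probabilities:

# Illustrative stand-in for step two, using ordinary Laplace smoothing;
# this is NOT the estimator used in the competition entry.
import math


def log_class_probs(cat_counts, alpha=1.0):
    """cat_counts: {category: count} for one fragment -> {category: log P}."""
    total = sum(cat_counts.values()) + alpha * len(cat_counts)
    return dict((c, math.log((n + alpha) / total)) for c, n in cat_counts.items())

# Why logs: a product of 1000 probabilities around 1e-5 would be 1e-5000,
# far below what a float can represent; the equivalent sum of logs is about
# -11513, which is perfectly representable, so scores are added instead.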

Classification process:

Treat all the fragments above as dictionary words and segment the input with reverse (backward) maximum matching. (Note: although this is word segmentation, the "words" in the dictionary are not guaranteed to be real words, so judged purely by linguistic validity the segmentation output is often wrong.) Sum the opinions (log probabilities) of all the segmented words and take the category with the maximum score. That is the whole procedure. To back this up, I include the source code of the classification process below.
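For clarity, here is a standalone sketch of reverse maximum matching over a plain Python set (reverse_max_match is an illustrative name; the real code below does the same right-to-left walk but looks fragments up in the Mcard hash):

# Illustrative reverse (backward) maximum matching against a set of
# known fragments; single characters pass through as a fallback.
def reverse_max_match(sentence, dictionary, max_len=5):
    words = []
    i = len(sentence)
    while i > 0:
        # try the longest candidate ending at position i first
        for j in xrange(min(max_len, i), 0, -1):
            if j == 1 or sentence[i - j:i] in dictionary:
                words.append(sentence[i - j:i])
                i -= j
                break
    words.reverse()  # the matching ran right to left
    return words

Because the "dictionary" here is just the fragment table, the pieces produced are often not real words, which is exactly the point made above.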


# -*- coding: utf-8 -*-
# created by Axuanwu 2015.1.25
# key word: hash count
import math

import numpy as np


def getseed(str1):
    """
    :param str1: the UTF-8 form of the entry
    :return: the 256-bit random number used as the hash fingerprint of the entry
    """
    h = 0
    for x in str1:
        if ord(x) > 256:
            h <<= 12  # shift width garbled in the source; 12 assumed for wide chars
            h += ord(x)
        else:
            h <<= 6
            h += ord(x)
    while (h >> 256) > 0:
        h = (h & (2 ** 256 - 1)) ^ (h >> 256)  # keep the number from growing too large
    return h


class Mcard():
    def __init__(self):
        self.M_num = 8            # number of hash positions per entry
        self.N_max = 16777216     # number of slots in the card array
        self.nummax2 = 24         # log2(N_max); the literal was lost in the source
        self.Mcard = [0]
        self.Opath = ""
        self.index = [0] * 8
        self.__keys = ['first_null']
        self.i_key = 1            # a new element is added at position i_key
        self.index2 = [0] * 8

    def get_keys(self, iii=-1):
        if iii == -1:
            return self.__keys[1:]
        else:
            return self.__keys[iii]

    def flush_key(self, iii):
        self.__keys[iii] = ""  # clear the stored key

    def getindex(self, str1, for_up=False):
        # compute the M_num pseudo-random positions of the entry
        seed = getseed(str1)
        for n in range(0, self.M_num):
            a = 0
            k = n + 1
            seed1 = seed
            if (seed >> self.nummax2) < 0:  # shift width garbled in the source
                seed1 = seed * (n + 15048796327)
            while seed1 > 0:
                a ^= (seed1 & (self.N_max - 1)) + k
                # left circular shift within nummax2 bits
                a = ((a << k) & (self.N_max - 1)) | (a >> (self.nummax2 - k))
                seed1 >>= self.nummax2
            if for_up:
                self.index2[n] = a
            else:
                self.index[n] = a

    def update_card(self, str1):
        """
        :param str1: the UTF-8 encoded form of the word to add
        """
        if self.read_card(str1, True) == 0:  # a new word
            for iii in self.index:
                if self.Mcard[iii] == 0:
                    self.Mcard[iii] = self.i_key
            if self.i_key % 10000 == 0:
                print self.i_key
            self.i_key += 1
            self.__keys.append(str1)

    def read_card(self, str1, for_up=False):
        """
        :param str1: the UTF-8 encoded form of the word
        :return: the value stored for the entry
        """
        if for_up:
            for i in xrange(0, 10):  # try at most 10 times
                i_str1 = str1 + str(i)
                if i > 5:
                    print i
                self.getindex(i_str1)
                aaa = min(self.Mcard[self.index])
                if aaa == 0:
                    return 0
            return -1
        else:
            for i in xrange(0, 10):  # at most 10 consecutive collisions
                i_str1 = str1 + str(i)
                self.getindex(i_str1)
                aaa = max(self.Mcard[self.index])
                if aaa == 0:  # does not exist
                    return 0
                elif aaa < self.N_max:
                    if str1 == self.__keys[aaa]:
                        return aaa
            # print("warning: bad case happened, card array maybe too short when update " + str1)  # too few hash buckets
            return 0

    def setbase(self, num1=16777216, num2=8):
        """
        :param num1: array length parameter
        :param num2: number of hash positions per entry
        """
        self.nummax2 = int(math.ceil(math.log(num1, 2)))
        self.N_max = 2 ** self.nummax2
        self.M_num = num2
        self.index = [0] * num2
        self.index2 = [0] * num2

    def set_card(self, kk=-1, dd=8):
        """
        :param kk: array length parameter; -1 means keep the previously defined value
        """
        if -1 == kk:
            if len(self.Mcard) == 1:  # guard garbled in the source; "not yet allocated" assumed
                self.Mcard = np.repeat(0, self.N_max)
                return 0
            s1 = raw_input('Do you want to reset Mcard to zeros? All memory would be lost [y/n]: ')
            if s1 == 'y':
                self.Mcard = np.repeat(0, self.N_max)
            else:
                print "no reset"
        else:
            self.setbase(kk, dd)
            self.Mcard = np.repeat(0, 2 ** self.nummax2)

    def record_num(self):
        """
        :return: the number of dictionary entries
        """
        return self.i_key - 1

    def card_test(self):
        """compute a hash-collision index"""
        aaa = self.record_num()  # the source shows "self._record"; record_num() assumed
        bbb = self.N_max
        ccc = 0
        for i in self.Mcard:
            ccc += int(i > 0)
        ddd = self.M_num
        print math.log(1.0 * ccc / bbb, 10) * ddd, \
            math.log((1.0 * aaa * ddd - ccc) / ccc, 10) * ddd

The above is the content of my_class.py. It is a hash structure for fast lookup; it may well be slower than Python's built-in dict, but it is my own Bloom-filter-style implementation, it feels handy, and I have kept using it. The classifier below depends on it.
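A quick usage sketch of Mcard (assuming the reconstructed my_class.py above is importable; the fragment strings are arbitrary examples):

# -*- coding: utf-8 -*-
# Minimal Mcard usage sketch; assumes my_class.py is on the import path.
from my_class import Mcard

card = Mcard()
card.set_card(2 ** 24, 8)           # allocate 2**24 slots, 8 hash positions per entry
card.update_card(u"测试片段")        # register a fragment; it receives the next id
print card.read_card(u"测试片段")    # the positive id just assigned
print card.read_card(u"没见过")      # 0 while the table is nearly empty: unknown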


__author__ = 'Axuanwu'
# coding=utf8
import re
import sys
import os
import time
import math

import numpy as np

from my_class import *


class ReadClassify():
    def __init__(self):
        self.m_card = Mcard()
        self.dict_class = {}
        # row 0: files predicted per class, row 1: correct per class, row 2: ratio
        self.classify_tongji = np.zeros((3, 9))
        self.class_str = []
        self.m_card.set_card(2 ** 24, 6)  # exponent garbled in the source; 2 ** 24 assumed
        self.mat_row = 3000000
        self.i_file = 0
        self.class_tail = np.array([0.0] * self.mat_row)
        # records the most common 3 million fragments
        self.word_count = np.zeros((self.mat_row, 9), float)
        self.class_score = np.array([0.0] * 9)
        self.root_dir = ""
        self.max_word_length = 5
        self.re_ch = re.compile(u"[\u4e00-\u9fa5]+", re.U)
        # the tail of this character class was mangled by the page's e-mail
        # obfuscation; [a-zA-Z0-9+.@] is an assumption
        self.re_eng = re.compile(u"[a-zA-Z0-9+.@]+", re.U)
        self.fazhi = 3  # frequency threshold ("fazhi")

    def set_dict_class(self):
        file_list = os.listdir(os.path.join(self.root_dir, "Train"))
        i = 0
        for i_dir in file_list:
            self.dict_class[i_dir] = i
            self.class_str.append(i_dir)
            i += 1

    def set_fazhi(self):
        o_file = open(os.path.join(os.getcwd(), "Canshu.txt"), "r")
        count_my = [0] * 100  # list length garbled in the source; 100 assumed
        i = 0
        for line in o_file:
            count_my[i] = int(line.rstrip())
            i += 1
        o_file.close()
        i = len(count_my) - 1
        a = self.mat_row
        while count_my[i] < a:
            a -= count_my[i]
            i -= 1
        self.fazhi = max([2, i])

    def set_root(self, path="c:\\users\\01053185\\desktop\\yuliao\\yuliao"):
        self.root_dir = path

    def load_dict(self):
        print "Loading knowledge, this takes a minute"
        line_dict = max(self.word_count.shape)
        dict_path = open(os.path.join(os.getcwd(), "Tong_ji2new.txt"), "r")
        temp_array = np.zeros((1, 9), float)
        for line in dict_path:
            line_s = line.strip().split("\t")
            for j in xrange(1, len(line_s)):
                temp_array[0, j - 1] = float(line_s[j])
            # if sum(temp_array) < self.fazhi:
            #     continue  # fragments seen too few times stay out of the feature dictionary
            self.m_card.update_card(line_s[0].decode("utf-8", "ignore"))  # every line is a new word
            aaa = self.m_card.read_card(line_s[0].decode("utf-8", "ignore"))
            self.word_count[aaa, ] = temp_array
            if aaa == line_dict - 1:
                break
            # if aaa == 10000:
            #     break
        dict_path.close()
        print "Loading knowledge done"

    def cut_classify2(self, sentence):
        # count every dictionary fragment ending at each position (no maximum matching)
        blocks = re.findall(self.re_ch, sentence)
        for blk in blocks:
            len_blk = len(blk)
            i = len_blk
            while i >= 2:
                j = self.max_word_length  # maximum word length
                while j >= 2:
                    if (i - j) < 0:
                        j -= 1
                        continue
                    index_word = self.m_card.read_card(blk[(i - j): i])
                    if index_word == 0:
                        j -= 1
                        continue
                    else:
                        if self.i_file == self.class_tail[index_word]:
                            pass  # this word was already counted for the current file
                        else:
                            self.class_score += self.word_count[index_word, ]
                            self.class_tail[index_word] = self.i_file
                        j -= 1
                i -= 1
        blocks = re.findall(self.re_eng, sentence)
        for blk in blocks:
            index_word = self.m_card.read_card(blk)
            if self.i_file == self.class_tail[index_word]:
                pass  # already counted
            else:
                self.class_score += self.word_count[index_word, ]
                self.class_tail[index_word] = self.i_file

    def cut_classify3(self, sentence):
        # forward maximum matching
        blocks = re.findall(self.re_ch, sentence)
        for blk in blocks:
            len_blk = len(blk)
            i = 0
            while i < (len_blk - 2):
                j = self.max_word_length  # maximum word length
                while j >= 2:
                    if (i + j) > len_blk:
                        j -= 1
                        continue
                    index_word = self.m_card.read_card(blk[i: (i + j)])
                    if index_word == 0:
                        j -= 1
                        continue
                    else:
                        if self.i_file == self.class_tail[index_word]:
                            pass  # already counted for this file
                        else:
                            self.class_score += self.word_count[index_word, ]
                            self.class_tail[index_word] = self.i_file
                        break
                if j < 2:
                    i += 1
                else:
                    i += j
        blocks = re.findall(self.re_eng, sentence)
        for blk in blocks:
            index_word = self.m_card.read_card(blk)
            if self.i_file == self.class_tail[index_word]:
                pass  # already counted
            else:
                self.class_score += self.word_count[index_word, ]
                self.class_tail[index_word] = self.i_file

    def cut_classify(self, sentence):
        # reverse (backward) maximum matching
        blocks = re.findall(self.re_ch, sentence)
        for blk in blocks:
            len_blk = len(blk)
            i = len_blk
            while i >= 2:
                j = self.max_word_length  # maximum word length
                while j >= 2:
                    if (i - j) < 0:
                        j -= 1
                        continue
                    index_word = self.m_card.read_card(blk[(i - j): i])
                    if index_word == 0:
                        j -= 1
                        continue
                    else:
                        if self.i_file == self.class_tail[index_word]:
                            pass  # already counted for this file
                        else:
                            self.class_score += self.word_count[index_word, ]
                            self.class_tail[index_word] = self.i_file
                        break
                if j < 2:
                    i -= 1
                else:
                    i -= j
        blocks = re.findall(self.re_eng, sentence)
        for blk in blocks:
            index_word = self.m_card.read_card(blk)
            if self.i_file == self.class_tail[index_word]:
                pass  # already counted
            else:
                self.class_score += self.word_count[index_word, ]
                self.class_tail[index_word] = self.i_file

    def classify_read(self):
        class_result = os.path.join(os.getcwd(), "Class_result.txt")
        o_file = open(class_result, "w")
        class_numbers = self.word_count.shape
        dir_path = os.path.join(self.root_dir, "Train")
        dir_list = os.listdir(dir_path)
        for sdir in dir_list:
            dir_path = os.path.join(os.path.join(self.root_dir, "Train"), sdir)
            # dir_path = "c:/users/01053185/desktop/yuliao/yuliao/test/c000024"
            file_list = os.listdir(dir_path)
            for files in file_list:
                self.i_file += 1
                file_path = os.path.join(dir_path, files)
                self.class_score = np.array([0.0] * 9)
                i_file = open(file_path, "r")
                for line in i_file:
                    self.cut_classify3(line.decode("GBK", "replace").strip())
                i_file.close()
                max_pro = max(self.class_score)
                for i in xrange(0, 9):
                    if self.class_score[i] == max_pro:
                        self.classify_tongji[0, self.dict_class[self.class_str[i]]] += 1
                        if sdir == self.class_str[i]:
                            o_file.writelines(file_path + "\t" + self.class_str[i] + "\t" + "1\n")
                            self.classify_tongji[1, self.dict_class[self.class_str[i]]] += 1
                        else:
                            o_file.writelines(file_path + "\t" + self.class_str[i] + "\t" + "0\n")
                        break
        o_file.close()
        try:
            self.classify_tongji[2, ] = self.classify_tongji[1, ] / self.classify_tongji[0, ]
        except:
            print "Hello word!"


if __name__ == "__main__":
    my_classify = ReadClassify()
    my_classify.set_root()
    a = time.time()
    my_classify.set_dict_class()
    # my_classify.set_fazhi()
    my_classify.load_dict()
    # my_classify.m_card.read_card(u"Internship")
    print "time is:", time.time() - a, "s"
    my_classify.classify_read()
    print "time is:", time.time() - a, "s"
    print my_classify.classify_tongji

You may need to change the root directory before running it. The output is written to Class_result.txt, one line per document, so correct and incorrect classifications can be seen at a glance.

Finally, I have put the statistics and training outputs of the two steps above on Baidu Pan; download them yourself if interested: http://pan.baidu.com/s/1pJHpMJ5
