This competition was right up my alley: machine learning finally takes center stage. My method is Bayesian, and it reached 85.5% accuracy. I am sharing it here; friends who used other training methods are welcome to compare notes.
Let's start with a little digression:
The "small-sample theory" I wrote about earlier has recently been polished up (one sticking point kept me stuck for a couple of months), but I want to check it against other people's approaches before writing the second half, which is why it has been delayed so long. In this text-classification task I used that "theory" as the basis for re-optimizing the statistical probabilities. When I find the time I will also share a singular-value-decomposition method.
Whether word segmentation is important:
For text classification, many people probably assume word segmentation is very important, but when I first looked at the task my guess was that, in theory, this is nonsense. Accordingly, my segmentation step makes no attempt at linguistic correctness, yet my accuracy is still decent, so please believe that segmentation is not that important. (Of course, I would very much like a friend to run a state-of-the-art segmenter first and then test my trained model on its output; if that gains two or three points, do not forget to share. I have never actually verified this.)
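One intuition for why exact segmentation may matter little: if the features are all overlapping character n-grams, every correct word automatically appears among the fragments, segmented or not. A hypothetical sketch (the function name and parameters are mine, not from the post):

```python
def char_ngrams(text, n_min=2, n_max=5):
    """Extract all overlapping character n-grams of length n_min..n_max.
    Every genuine word of length k in that range appears among the
    k-grams, so no segmenter is needed to guarantee that the 'right'
    fragments are present (alongside some noisy ones)."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

print(char_ngrams("abcd", 2, 3))  # ['ab', 'bc', 'cd', 'abc', 'bcd']
```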
Training process:
The training process is divided into two main steps:
Step one: Count all the fragments that appear across all the articles, and record how many times each fragment occurs in each category.
Step two: For every fragment, estimate the probability of each category given that fragment (the fragment's classification "opinion"); one kind of unreliable opinion is discarded. The probabilities are stored as logarithms, so later multiplications become additions and the precision problems they would otherwise cause are avoided.
The first step is simple enough that I will skip it; the second I would rather not spell out, so I won't. I will, however, share the output of both steps.
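For readers who want something concrete anyway, the two steps can be sketched in miniature (a toy illustration with made-up data, not the competition code; the add-one smoothing is my own assumption):

```python
import math
from collections import defaultdict

def train(docs):
    """docs: list of (category, fragments) pairs.
    Step 1: count each fragment's occurrences per category.
    Step 2: turn counts into smoothed log-probabilities so that
    later scoring can add instead of multiply, avoiding underflow."""
    counts = defaultdict(lambda: defaultdict(int))
    cats = set()
    for cat, frags in docs:
        cats.add(cat)
        for f in frags:
            counts[f][cat] += 1
    log_probs = {}
    for f, per_cat in counts.items():
        total = sum(per_cat.values()) + len(cats)  # add-one smoothing (my assumption)
        log_probs[f] = {c: math.log((per_cat.get(c, 0) + 1.0) / total)
                        for c in cats}
    return log_probs

lp = train([("sports", ["ball", "game"]), ("tech", ["code", "game"])])
```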
Classification process:
Treat all the fragments collected above as dictionary words and segment the input with reverse maximum matching (note: although this is nominally word segmentation, the dictionary entries are not guaranteed to be real words, so judging the output purely by segmentation correctness will often make it look wrong). Sum the opinions of all matched words, and take the category with the highest probability. That is the whole process. For completeness, I put the source code of the classification step below.
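In outline, this scoring procedure looks like the following sketch (simplified, with hypothetical names; the real code below uses a custom hash table rather than a dict):

```python
def reverse_max_match(text, dictionary, max_len=5):
    """Segment text right to left, greedily taking the longest
    dictionary fragment that ends at the current position;
    unmatched single characters fall through as length-1 pieces."""
    words, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if j == 1 or text[i - j:i] in dictionary:
                words.append(text[i - j:i])
                i -= j
                break
    return list(reversed(words))

def classify(text, log_probs, categories, max_len=5):
    """Sum each matched word's per-category log-probability 'opinion'
    and return the highest-scoring category."""
    scores = {c: 0.0 for c in categories}
    for w in reverse_max_match(text, log_probs, max_len):
        if w in log_probs:
            for c in categories:
                scores[c] += log_probs[w][c]
    return max(scores, key=scores.get)
```

Because the opinions are log-probabilities, adding them is equivalent to multiplying raw probabilities, which is what the training output above makes cheap.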
# -*- coding: utf-8 -*-
# created by Axuanwu 2015.1.25
# key word: hash count
import math

import numpy as np


def getseed(str1):
    """
    :param str1: the entry as a unicode string
    :return: a hash fingerprint of the entry, folded to at most 256 bits
    """
    h = 0
    for x in str1:
        if ord(x) > 256:
            h <<= 12  # the exact shift width was garbled in the post; 12 is a guess
            h += ord(x)
        else:
            h <<= 6
            h += ord(x)
    while (h >> 256) > 0:
        h = (h & (2 ** 256 - 1)) ^ (h >> 256)  # the number cannot be too large
    return h


class MCard(object):
    def __init__(self):
        self.m_num = 8
        self.n_max = 16777216
        self.nummax2 = 24  # log2(n_max)
        self.mcard = [0]
        self.opath = ""
        self.index = [0] * 8
        self.__keys = ['first_null']
        self.i_key = 1  # a new element is added at position i_key
        self.index2 = [0] * 8

    def get_keys(self, iii=-1):
        if iii == -1:
            return self.__keys[1:]
        else:
            return self.__keys[iii]

    def flush_key(self, iii):
        self.__keys[iii] = ""  # clear the stored key

    def getindex(self, str1, for_up=False):
        # compute the m_num random positions of the entry
        seed = getseed(str1)
        for n in range(0, self.m_num):
            a = 0
            k = n + 1
            seed1 = seed
            if (seed >> self.nummax2) <= 0:  # short seed: lengthen it (condition garbled in the post)
                seed1 = seed * (n + 15048796327)
            while seed1 > 0:
                a ^= (seed1 & (self.n_max - 1)) + k
                a = ((a << k) & (self.n_max - 1)) | (a >> (self.nummax2 - k))  # circular left shift
                seed1 >>= self.nummax2
            if for_up:
                self.index2[n] = a
            else:
                self.index[n] = a

    def update_card(self, str1):
        """
        :param str1: the word as a unicode string
        """
        if self.read_card(str1, True) == 0:  # a new word
            for iii in self.index:
                if self.mcard[iii] == 0:
                    self.mcard[iii] = self.i_key
            if self.i_key % 10000 == 0:
                print self.i_key
            self.i_key += 1
            self.__keys.append(str1)

    def read_card(self, str1, for_up=False):
        """
        :param str1: the word as a unicode string
        :return: the index of the word, or 0 if it is absent
        """
        if for_up:
            for i in xrange(0, 10):  # try at most 10 times
                i_str1 = str1 + str(i)
                if i > 5:
                    print i
                self.getindex(i_str1)
                aaa = min(self.mcard[self.index])
                if aaa == 0:
                    return 0
            return -1
        else:
            for i in xrange(0, 10):  # at most 10 consecutive collisions
                i_str1 = str1 + str(i)
                self.getindex(i_str1)
                aaa = max(self.mcard[self.index])
                if aaa == 0:  # the word does not exist
                    return 0
                elif aaa < self.n_max:
                    if str1 == self.__keys[aaa]:
                        return aaa
            # print("warning: bad case happened, card array maybe too short when update " + str1)  # too few hash buckets
            return 0

    def setbase(self, num1=16777216, num2=8):
        """
        :param num1: array length parameter
        :param num2: the number of hash positions per entry
        """
        self.nummax2 = int(math.ceil(math.log(num1, 2)))
        self.n_max = 2 ** self.nummax2  # n_max is 2 to the power nummax2
        self.m_num = num2
        self.index = [0] * num2
        self.index2 = [0] * num2

    def set_card(self, kk=-1, dd=8):
        """
        :param kk: array length parameter; -1 means keep the previously defined value
        """
        if -1 == kk:
            s1 = raw_input('Do you want to reset mcard to zeros? All memory will be lost [y/n]: ')
            if s1 == 'y':
                self.mcard = np.repeat(0, self.n_max)
            else:
                print("no reset")
        else:
            self.setbase(kk, dd)
            self.mcard = np.repeat(0, self.n_max)

    def record_num(self):
        """
        :return: the number of dictionary entries
        """
        return self.i_key - 1

    def card_test(self):
        """compute a hash-collision index"""
        aaa = self.record_num()
        bbb = self.n_max
        ccc = 0
        for i in self.mcard:
            ccc += int(i > 0)
        ddd = self.m_num
        print math.log(1.0 * ccc / bbb, 10) * ddd, math.log((1.0 * aaa * ddd - ccc) / ccc, 10) * ddd
The above is the content of my_class.py: a hash structure for fast lookup. It may well be slower than Python's built-in dict, but it is my own Bloom-filter-inspired implementation, I am comfortable with it, and so I kept using it. The classifier below depends on it.
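The core idea of my_class.py, hashing each key to several positions in one flat array in the style of a Bloom filter, can be shown with a much smaller sketch (a generic toy counter of my own, not the code above):

```python
import hashlib

class MultiHashCounter:
    """Toy Bloom-filter-style counter: each key hashes to m_num slots
    in a fixed array, and its count is the minimum slot value, so hash
    collisions can only over-count, never under-count."""
    def __init__(self, size=2 ** 16, m_num=4):
        self.size = size
        self.m_num = m_num
        self.slots = [0] * size

    def _positions(self, key):
        # derive m_num deterministic positions from salted md5 digests
        for n in range(self.m_num):
            h = hashlib.md5((str(n) + key).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.slots[p] += 1

    def count(self, key):
        return min(self.slots[p] for p in self._positions(key))

c = MultiHashCounter()
c.add("hello")
c.add("hello")
print(c.count("hello"))  # at least 2; collisions can only inflate it
```

Memory stays bounded by the array size regardless of how many keys are inserted, which is the property the post relies on for millions of fragments.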
# coding=utf8
__author__ = 'Axuanwu'

import re
import sys
import os
import time
import math

import numpy as np

from my_class import *


class ReadClassify(object):
    def __init__(self):
        self.m_card = MCard()
        self.dict_class = {}
        self.classify_tongji = np.zeros((3, 9))
        self.class_str = []
        self.m_card.set_card(2 ** 25, 6)  # the exact array-size constant was garbled in the post
        self.mat_row = 3000000
        self.i_file = 0
        self.class_tail = np.array([0.0] * self.mat_row)
        self.word_count = np.zeros((3000000, 9), float)  # records the most frequent 3 million fragments
        self.class_score = np.array([0.0] * 9)
        self.root_dir = ""
        self.max_word_length = 5
        self.re_ch = re.compile(u"[\u4e00-\u9fa5]+", re.U)
        self.re_eng = re.compile(u"[a-zA-Z0-9+\[email protected]]+", re.U)  # this character class was mangled by the page's email-obfuscation script
        self.fazhi = 3

    def set_dict_class(self):
        file_list = os.listdir(os.path.join(self.root_dir, "Train"))
        i = 0
        for i_dir in file_list:
            self.dict_class[i_dir] = i
            self.class_str.append(i_dir)
            i += 1

    def set_fazhi(self):
        o_file = open(os.path.join(os.getcwd(), "canshu.txt"), "r")
        count_my = []
        for line in o_file:
            count_my.append(int(line.rstrip()))
        o_file.close()
        i = len(count_my) - 1
        a = self.mat_row
        while count_my[i] < a:
            a -= count_my[i]
            i -= 1
        self.fazhi = max([2, i])

    def set_root(self, path="c:\\users\\01053185\\desktop\\yuliao\\yuliao"):
        self.root_dir = path

    def load_dict(self):
        print "loading knowledge, this takes a few minutes"
        line_dict = max(self.word_count.shape)
        dict_path = open(os.path.join(os.getcwd(), "tong_ji2new.txt"), "r")
        temp_array = np.zeros((1, 9), float)
        for line in dict_path:
            line_s = line.strip().split("\t")
            for j in xrange(1, len(line_s)):
                temp_array[0, j - 1] = float(line_s[j])
            # if sum(temp_array) < self.fazhi:
            #     continue  # too few occurrences: do not enter the feature dictionary
            self.m_card.update_card(line_s[0].decode("utf-8", "ignore"))  # every line is a new word
            aaa = self.m_card.read_card(line_s[0].decode("utf-8", "ignore"))
            self.word_count[aaa, ] = temp_array
            if aaa == line_dict - 1:
                break
            # if aaa == 10000:
            #     break
        dict_path.close()
        print "loading knowledge done"

    def cut_classify2(self, sentence):
        # reverse matching; scores every dictionary fragment ending at each position
        blocks = re.findall(self.re_ch, sentence)
        for blk in blocks:
            len_blk = len(blk)
            i = len_blk
            while i >= 2:
                j = self.max_word_length  # maximum word length
                while j >= 2:
                    if (i - j) < 0:
                        j -= 1
                        continue
                    index_word = self.m_card.read_card(blk[(i - j):i])
                    if index_word == 0:
                        j -= 1
                        continue
                    else:
                        if self.i_file == self.class_tail[index_word]:  # word already counted for this file
                            pass
                        else:
                            self.class_score += self.word_count[index_word, ]
                            self.class_tail[index_word] = self.i_file
                        j -= 1
                i -= 1
        blocks = re.findall(self.re_eng, sentence)
        for blk in blocks:
            index_word = self.m_card.read_card(blk)
            if self.i_file == self.class_tail[index_word]:  # word already counted for this file
                pass
            else:
                self.class_score += self.word_count[index_word, ]
                self.class_tail[index_word] = self.i_file

    def cut_classify3(self, sentence):
        # forward maximum matching
        blocks = re.findall(self.re_ch, sentence)
        for blk in blocks:
            len_blk = len(blk)
            i = 0
            while i < (len_blk - 2):
                j = self.max_word_length  # maximum word length
                while j >= 2:
                    if (i + j) > len_blk:
                        j -= 1
                        continue
                    index_word = self.m_card.read_card(blk[i:(i + j)])
                    if index_word == 0:
                        j -= 1
                        continue
                    else:
                        if self.i_file == self.class_tail[index_word]:  # word already counted for this file
                            pass
                        else:
                            self.class_score += self.word_count[index_word, ]
                            self.class_tail[index_word] = self.i_file
                        break
                if j < 2:
                    i += 1
                else:
                    i += j
        blocks = re.findall(self.re_eng, sentence)
        for blk in blocks:
            index_word = self.m_card.read_card(blk)
            if self.i_file == self.class_tail[index_word]:  # word already counted for this file
                pass
            else:
                self.class_score += self.word_count[index_word, ]
                self.class_tail[index_word] = self.i_file

    def cut_classify(self, sentence):
        # reverse maximum matching
        blocks = re.findall(self.re_ch, sentence)
        for blk in blocks:
            len_blk = len(blk)
            i = len_blk
            while i >= 2:
                j = self.max_word_length  # maximum word length
                while j >= 2:
                    if (i - j) < 0:
                        j -= 1
                        continue
                    index_word = self.m_card.read_card(blk[(i - j):i])
                    if index_word == 0:
                        j -= 1
                        continue
                    else:
                        if self.i_file == self.class_tail[index_word]:  # word already counted for this file
                            pass
                        else:
                            self.class_score += self.word_count[index_word, ]
                            self.class_tail[index_word] = self.i_file
                        break
                if j < 2:
                    i -= 1
                else:
                    i -= j
        blocks = re.findall(self.re_eng, sentence)
        for blk in blocks:
            index_word = self.m_card.read_card(blk)
            if self.i_file == self.class_tail[index_word]:  # word already counted for this file
                pass
            else:
                self.class_score += self.word_count[index_word, ]
                self.class_tail[index_word] = self.i_file

    def classify_read(self):
        class_result = os.path.join(os.getcwd(), "class_result.txt")
        o_file = open(class_result, "w")
        dir_path = os.path.join(self.root_dir, "Train")
        dir_list = os.listdir(dir_path)
        for sdir in dir_list:
            dir_path = os.path.join(os.path.join(self.root_dir, "Train"), sdir)
            # dir_path = "C:/users/01053185/desktop/yuliao/yuliao/test/c000024"
            file_list = os.listdir(dir_path)
            for files in file_list:
                self.i_file += 1
                file_path = os.path.join(dir_path, files)
                self.class_score = np.array([0.0] * 9)
                i_file = open(file_path, "r")
                for line in i_file:
                    self.cut_classify3(line.decode("GBK", "replace").strip())
                max_pro = max(self.class_score)
                for i in xrange(0, 9):
                    if self.class_score[i] == max_pro:
                        self.classify_tongji[0, self.dict_class[self.class_str[i]]] += 1
                        if sdir == self.class_str[i]:
                            o_file.writelines(file_path + "\t" + self.class_str[i] + "\t" + "1\n")
                            self.classify_tongji[1, self.dict_class[self.class_str[i]]] += 1
                        else:
                            o_file.writelines(file_path + "\t" + self.class_str[i] + "\t" + "0\n")
                        break
        o_file.close()
        try:
            self.classify_tongji[2, ] = self.classify_tongji[1, ] / self.classify_tongji[0, ]
        except:
            print "hello word!"


if __name__ == "__main__":
    my_classify = ReadClassify()
    my_classify.set_root()
    a = time.time()
    my_classify.set_dict_class()
    # my_classify.set_fazhi()
    my_classify.load_dict()
    # my_classify.m_card.read_card(u"Internship")
    print "time is:", time.time() - a, "s"
    my_classify.classify_read()
    print "time is:", time.time() - a, "s"
    print my_classify.classify_tongji
You may need to change the root directory before running it. The results are written to class_result.txt, where each file's predicted category and correctness are visible at a glance.
Finally, I have uploaded the two statistics files and the training output to Baidu Pan; feel free to download them: http://pan.baidu.com/s/1pJHpMJ5
Text classification (Power 8 Algorithm Challenge, stage five)