Note 1: Reference: Su Jianlin's blog at Scientific Spaces (科学空间).
Note 2: This series records the details of reproducing the experiments and corrects the code for library version updates.
Note 3: Python 3.5; Keras 2.0.9
Reproducing the "QLBD" LSTM sentiment analysis experiments (I): one-hot encoding
Reproducing the "QLBD" LSTM sentiment analysis experiments (II): word segmentation & one-hot
Reproducing the "QLBD" LSTM sentiment analysis experiments (III): character embedding
Reproducing the "QLBD" LSTM sentiment analysis experiments (IV): word embedding
In Chinese the smallest unit is the character, and one level up is the word, so there are two choices when representing text: take the character as the basic unit, or take the word as the basic unit. Using the word as the basic unit requires word segmentation, which is covered later in the series. When the character is the basic unit, one option is to represent each character with a one-hot encoding.
The model tested in this post uses the one-hot representation: the text is handled character by character, without word segmentation; each sentence is truncated to 200 characters (shorter sentences are padded), and each sentence is then fed into the LSTM as a one-hot matrix to train the classifier.
The pros and cons of one-hot encoding will not be elaborated here; the overall idea of the experiment can be read directly from the code.
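As a quick illustration before the real code (a minimal sketch with a made-up three-character vocabulary; none of these names belong to the experiment itself), a character-level one-hot representation simply turns each character's integer index into one row of a maxlen x vocabulary-size matrix:

import numpy as np

toy_vocab = {'好': 0, '很': 1, '差': 2}   # toy character dictionary, assumed purely for illustration
toy_maxlen = 4                            # every sentence is truncated or padded to 4 characters

def toy_one_hot(sentence):
    # one one-hot row per character; unused rows stay all-zero as padding
    m = np.zeros((toy_maxlen, len(toy_vocab)))
    for row, ch in enumerate(sentence[:toy_maxlen]):
        if ch in toy_vocab:               # characters outside the dictionary are left as zero rows
            m[row, toy_vocab[ch]] = 1
    return m

print(toy_one_hot('很好'))                # rows 0-1 are one-hot, rows 2-3 are zero padding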
I. Importing the modules
import numpy as np
import pandas as pd
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
import sys
sys.setrecursionlimit(10000)  # increase the maximum recursion depth (said to default to 1000, which raises an error here)
II. Corpus
pos = pd.read_excel('E:/coding/yuliao/pos.xls', header=None, index=None)  # 10677 positive reviews
pos['label'] = 1
neg = pd.read_excel('E:/coding/yuliao/neg.xls', header=None, index=None)  # 10428 negative reviews
neg['label'] = 0
all_ = pos.append(neg, ignore_index=True)  # 21105 reviews in total
III. Numeric settings
maxlen = 200     # truncate each text to this many characters
min_count = 20   # characters appearing fewer than min_count times are dropped; the simplest dimensionality reduction (20 is an assumed value here)
IV. Corpus preprocessing, part one
content = ''.join(all_[0])                     # concatenate all texts into one string
abc = pd.Series(list(content)).value_counts()  # count the frequency of every character
abc = abc[abc >= min_count]                    # drop low-frequency characters: simple dimensionality reduction
abc[:] = range(len(abc))                       # re-assign the integers 0-2416 to the characters in order, one integer per character
word_set = set(abc.index)                      # build the character dictionary
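To make the abc mapping concrete, here is a toy run on a made-up six-character string with min_count taken as 2 (both the string and the threshold are chosen purely for illustration):

import pandas as pd

toy_text = '好好好差差很'                        # made-up "corpus"
freq = pd.Series(list(toy_text)).value_counts()  # 好: 3, 差: 2, 很: 1
freq = freq[freq >= 2]                           # min_count = 2 drops the rare character '很'
freq[:] = range(len(freq))                       # re-assign integers: 好 -> 0, 差 -> 1
print(dict(freq))                                # {'好': 0, '差': 1}
print(set(freq.index))                           # the word_set equivalent: {'好', '差'}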
V. Corpus preprocessing, part two
def doc2num(s, maxlen):   # convert a text into a vector of integers; maxlen = 200
    s = [i for i in s if i in word_set]
    s = s[:maxlen]        # keep at most 200 characters
    return list(abc[s])

all_['doc2num'] = all_[0].apply(lambda s: doc2num(s, maxlen))   # convert every text into an integer vector
# the vector form of a text in all_ looks like: [8, 65, 795, 90, 152, 152, 289, 37, 22, 49, 12 ...
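A quick check of what doc2num returns, using a toy three-character dictionary instead of the real 2417-character one (the sentence and the mapping are made up, and the names are prefixed with toy_ so the real variables are left untouched):

import pandas as pd

toy_abc = pd.Series({'好': 0, '差': 1, '很': 2})   # toy stand-in for the real character -> integer mapping
toy_word_set = set(toy_abc.index)

def toy_doc2num(s, maxlen):
    s = [i for i in s if i in toy_word_set]   # drop characters that are not in the dictionary
    s = s[:maxlen]                            # then truncate to maxlen characters
    return list(toy_abc[s])

print(toy_doc2num('东西很差，非常差', 4))      # -> [2, 1, 1]: unknown characters vanish before truncation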
VI. Shuffling the data
idx = list(range(len(all_)))   # the list of row indices
np.random.shuffle(idx)         # shuffle the order of the texts via the indices
all_ = all_.loc[idx]           # rebuild the table in the shuffled order
VII. Formatting the data as Keras requires
# generate the data in the format required by Keras
x = np.array(list(all_['doc2num']))   # inputs: one integer list per text
y = np.array(list(all_['label']))     # labels
y = y.reshape((-1, 1))                # reshape the labels into a column vector
# x: array([list([8, 65, 795, 90, 152, 152, 289, 37, 22, 49, 125, ...]),
#           list([667, 426, 581, 635, 478, 196, 294, 140, 99, 1071, ...]),
#           ...])  # one integer list per review
# y: array([[0],
#           [0],
#           [1],
#           ...,
#           [0],
#           [0],
#           [1]], dtype=int64)
VIII. Building the model
# build the model; maxlen = 200
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(abc))))  # LSTM layer over the 200 x 2417 one-hot input (128 units assumed; the figure is missing above)
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
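Because each sample is a whole maxlen x len(abc) one-hot matrix, the LSTM's input is three-dimensional. A standalone sketch with tiny made-up dimensions (4 timesteps, a 3-character vocabulary, 8 units; not the experiment's real configuration) shows the shapes Keras expects:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

toy_maxlen, toy_vocab_size = 4, 3
toy_model = Sequential()
toy_model.add(LSTM(8, input_shape=(toy_maxlen, toy_vocab_size)))  # per-sample input: (timesteps, features)
toy_model.add(Dropout(0.5))
toy_model.add(Dense(1))
toy_model.add(Activation('sigmoid'))
toy_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

dummy = np.zeros((2, toy_maxlen, toy_vocab_size))   # a dummy batch of 2 one-hot "sentences"
print(toy_model.predict(dummy).shape)               # (2, 1): one sigmoid probability per sentence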
IX. Generator
# the one-hot matrix of a single sentence has size maxlen * len(abc), which consumes a lot of memory
# to allow testing on a low-memory PC, a generator is used to produce the one-hot matrices
# a one-hot matrix is only built when the generator is actually called
batch_size = 128    # batch size
train_num = 15000   # size of the training set
# sentences shorter than 200 characters are padded with all-zero rows; with maxlen = 200 each sentence becomes a 200 x 2417 matrix
gen_matrix = lambda z: np.vstack((np_utils.to_categorical(z, len(abc)), np.zeros((maxlen-len(z), len(abc)))))
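What gen_matrix produces for a sentence shorter than maxlen, shown with a toy setup (maxlen of 4 and a 3-character vocabulary, chosen only for illustration and kept under toy_ names):

import numpy as np
from keras.utils import np_utils

toy_maxlen, toy_vocab_size = 4, 3
toy_gen_matrix = lambda z: np.vstack((np_utils.to_categorical(z, toy_vocab_size),
                                      np.zeros((toy_maxlen - len(z), toy_vocab_size))))

m = toy_gen_matrix([2, 1, 1])   # the doc2num output of a 3-character sentence
print(m.shape)                  # (4, 3): three one-hot rows plus one all-zero padding row
print(m)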
# the data-generator function
def data_generator(data, labels, batch_size):
    batches = [range(batch_size*i, min(len(data), batch_size*(i+1))) for i in range(int(len(data)/batch_size)+1)]
    while True:
        for i in batches:
            xx = np.zeros((maxlen, len(abc)))
            xx, yy = np.array(list(map(gen_matrix, data[i]))), labels[i]
            yield (xx, yy)
# batches: [range(0, 128), range(128, 256), range(256, 384), ...]
# np_utils.to_categorical does the one-hot encoding; np.vstack((a, b)) stacks the matrices a and b vertically
# in Python 3 map() no longer returns a list, so the result has to be wrapped as list(map(...))
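A quick way to sanity-check the generator is to pull a single batch from it and look at the shapes (a usage sketch re-using the names defined above; the printed shapes assume batch_size = 128, maxlen = 200 and the 2417-character dictionary):

gen = data_generator(x[:train_num], y[:train_num], batch_size)
first_x, first_y = next(gen)          # the one-hot matrices are only built at this point
print(first_x.shape, first_y.shape)   # expected: (128, 200, 2417) and (128, 1)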
X. Training
model.fit_generator(data_generator(x[:train_num], y[:train_num], batch_size), steps_per_epoch=118, epochs=50)
# note: Keras 2 no longer uses the samples_per_epoch argument; it is replaced by steps_per_epoch, an integer giving the number of batches per epoch, computed as steps_per_epoch = train_num / batch_size; nb_epoch has likewise been renamed to epochs.
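If train_num or batch_size is changed, steps_per_epoch has to be recomputed; a small helper (hypothetical, not part of the original code) makes the rounding explicit:

import math

def steps_for(num_samples, batch_size):
    # number of generator batches needed to cover num_samples once
    return int(math.ceil(num_samples / float(batch_size)))

print(steps_for(15000, 128))   # -> 118, matching steps_per_epoch above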
XI. Model testing
model.evaluate_generator(data_generator(x[train_num:], y[train_num:], batch_size), steps=50)
# note: Keras 2 no longer uses the val_samples argument; the steps argument gives the number of batches to draw from the generator during evaluation.
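The same steps_for helper can be used to pick steps so that evaluation covers the whole held-out set rather than a fixed 50 batches (a sketch; with 21105 - 15000 = 6105 held-out samples and a batch size of 128 this gives 48 steps):

test_num = len(all_) - train_num                # 6105 held-out samples
eval_steps = steps_for(test_num, batch_size)    # 48 with batch_size = 128
model.evaluate_generator(data_generator(x[train_num:], y[train_num:], batch_size), steps=eval_steps)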
XII. Results
The experiment ran on a GeForce 940MX GPU (a poor student's own laptop, pressed into service); each epoch took around 155 s, and the model was trained for 50 epochs.
The final training-set accuracy is 0.9511 and the test-set accuracy is 0.8841.
An excerpt of the training log:
Epoch 1/50  118/118 [==============================] - 159s 1s/step - loss: 0.6897 - acc: 0.5360
Epoch 2/50  118/118 [==============================] - 155s 1s/step - loss: 0.6693 - acc: 0.6070
Epoch 3/50  118/118 [==============================] - 154s 1s/step - loss: 0.6687 - acc: 0.5774
Epoch 4/50  118/118 [==============================] - 154s 1s/step - loss: 0.6890 - acc: 0.5328
Epoch 5/50  118/118 [==============================] - 154s 1s/step - loss: 0.6694 - acc: 0.5874
...
Epoch 46/50 118/118 [==============================] - 156s 1s/step - loss: 0.1721 - acc: 0.9419
Epoch 47/50 118/118 [==============================] - 156s 1s/step - loss: 0.1632 - acc: 0.9490
Epoch 48/50 118/118 [==============================] - 156s 1s/step - loss: 0.1535 - acc: 0.9486
Epoch 49/50 118/118 [==============================] - 157s 1s/step - loss: 0.1489 - acc: 0.9511
Epoch 50/50 118/118 [==============================] - 156s 1s/step - loss: 0.1497 - acc: 0.9511
# evaluation: [0.31111064370070929, 0.88413771399910435]  # [loss, acc]
XIII. Predicting individual sentences
def predict_one(s):   # prediction function for a single sentence
    s = gen_matrix(doc2num(s, maxlen))
    s = s.reshape((1, s.shape[0], s.shape[1]))
    return model.predict_classes(s, verbose=0)[0][0]
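Trying it on a couple of made-up sentences (arbitrary examples; the returned class is 1 for positive and 0 for negative):

print(predict_one('东西很好，下次还会再买'))   # a positive-sounding comment, expected class 1
print(predict_one('质量太差，非常失望'))       # a negative-sounding comment, expected class 0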
comment = pd.read_excel('E:/coding/comment/sum.xls')   # load the texts to be predicted
comment = comment[comment['rateContent'].notnull()]    # keep only non-empty comments
comment['text'] = comment['rateContent']               # extract the comments, 11182 in total
# predict on only one hundred comments; running the model over all of them takes too long
comment['text'][500:600].apply(lambda s: predict_one(s))
# the prediction results are shown in the figure below:
(End of the experiment.)