Using association rules to explore the relationship between TCM syndromes and malignant tumors

Source: Internet
Author: User

Goal:
Use pathological information to explore the relationship between TCM syndromes and the TNM staging of breast cancer.

Approach and workflow:
The objective is to explore the relationship between TCM syndromes and the TNM staging of breast cancer, so an association rule model is adopted.
Once the model is chosen, the data on the patients' TCM syndromes and breast cancer TNM staging must be collected and preprocessed (data cleaning, attribute reduction, data transformation, and so on) to meet the needs of mining.

Getting the data - data preprocessing - building the model

First, getting the data

TCM syndromes: 'Liver-qi stagnation syndrome type coefficient', 'Heat-toxin accumulation syndrome type coefficient', 'Chong-Ren imbalance syndrome type coefficient', 'Qi-blood deficiency syndrome type coefficient', 'Spleen-stomach weakness syndrome type coefficient', 'Liver-kidney yin deficiency syndrome type coefficient'
TNM staging of breast cancer: H1: I, H2: II, H3: III, H4: IV. Stage I is the mildest and stage IV the most severe.

The dataset has 930 rows and 7 columns and contains no null values.

Second, data preprocessing

The previous step showed that the data is already clean, so the main preprocessing task is to discretize each attribute by clustering it into 4 classes. This is needed because the association rule algorithm cannot handle continuous data.

The key to clustering each attribute into 4 classes is finding the right dividing points. A clustering algorithm finds the cluster centers of each attribute, and the mean of each pair of adjacent cluster centers is taken as a dividing point.
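The dividing-point rule can be illustrated in isolation. The sketch below assumes the four cluster centers are already known; the helper name cut_points and the center values are hypothetical and chosen only for illustration. It sorts the centers, takes the midpoint of each adjacent pair, and adds a lower and an upper edge.

```python
def cut_points(centers, upper, lower=0.0):
    """Return bin edges: lower, midpoints of adjacent sorted centers, upper."""
    c = sorted(centers)
    # Midpoint between each pair of neighbouring cluster centers
    mids = [(a + b) / 2 for a, b in zip(c, c[1:])]
    return [lower] + mids + [upper]

# Hypothetical cluster centers for one attribute (made-up values)
centers = [0.5, 0.125, 0.75, 0.25]
edges = cut_points(centers, upper=1.0)
print(edges)  # 5 edges define 4 intervals
```

Four centers give three midpoints, so together with the two outer edges there are five edges and therefore four intervals, one per class.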

The ranges and corresponding labels for the liver-qi stagnation syndrome type coefficient:

The code is as follows:

import pandas as pd
from sklearn.cluster import KMeans

# d is the DataFrame loaded in the first step (930 rows, 7 columns)

def f(x):
    # Cluster one attribute into 4 groups
    # (n_jobs is omitted: it was removed from KMeans in scikit-learn 1.0)
    model = KMeans(n_clusters=4)
    model.fit(d[[x]].values)  # .values replaces as_matrix(), removed in pandas 1.0

    # Sort the cluster centers; the means of adjacent centers are the dividing points
    centers_d = pd.DataFrame(model.cluster_centers_).sort_values(by=0)
    group = [0] + list(centers_d.rolling(2).mean().iloc[1:][0]) + [d[x].max()]
    s = pd.cut(d[x], group, labels=[x + str(i) for i in range(4)])
    return s

discretization_d = pd.concat([f('Liver-qi stagnation syndrome type coefficient'), f('Heat-toxin accumulation syndrome type coefficient'),
                              f('Chong-Ren imbalance syndrome type coefficient'), f('Qi-blood deficiency syndrome type coefficient'),
                              f('Spleen-stomach weakness syndrome type coefficient'), f('Liver-kidney yin deficiency syndrome type coefficient'),
                              d['TNM staging']], axis=1)

The resulting dataset is:

Third, building the model and applying it

Given the purpose of the mining, an association rule model is used here. Based on the mined associations, information about one attribute can be inferred from another.
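The rule-mining code in this section computes support by summing and multiplying columns, so it expects a 0/1 matrix with one column per discretized label rather than the label table itself. The article does not show that conversion. The sketch below is one possible way to do it with pandas, using a small made-up table in which short labels such as "A1" and "H4" stand in for the real label strings.

```python
import pandas as pd

# Hypothetical discretized table: the column names, labels and values
# are made up for illustration and are not the article's data.
labels = pd.DataFrame({
    "A": ["A1", "A2", "A2"],
    "TNM": ["H1", "H4", "H4"],
})

# One indicator column per distinct label. The cast to int is there
# because recent pandas versions return booleans from get_dummies.
onehot = pd.concat(
    [pd.get_dummies(labels[c]) for c in labels.columns], axis=1
).astype(int)
print(onehot)
```

Each row of the resulting matrix has exactly one 1 per original attribute, which is the shape the support calculation relies on.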

The key to an association rule algorithm is choosing a suitable minimum support and minimum confidence, for which there is no uniform standard. After several rounds of adjustment, and taking the business context into account, the minimum support was set to 5.9% and the minimum confidence to 75%. The association rule code is as follows:

import pandas as pd

# Custom join function: builds the candidates C_k from L_{k-1}
def connect_string(x, ms):
    x = list(map(lambda i: sorted(i.split(ms)), x))
    l = len(x[0])
    r = []
    for i in range(len(x)):
        for j in range(i, len(x)):
            if x[i][:l-1] == x[j][:l-1] and x[i][l-1] != x[j][l-1]:
                r.append(x[i][:l-1] + sorted([x[j][l-1], x[i][l-1]]))
    return r

# Function that searches for association rules; d is a 0/1 matrix with one column per item
def find_rule(d, support, confidence, ms=u'--'):
    result = pd.DataFrame(index=['support', 'confidence'])  # output table

    support_series = 1.0 * d.sum() / len(d)  # support of each single item
    column = list(support_series[support_series > support].index)  # initial filter by support
    k = 0

    while len(column) > 1:
        k = k + 1
        print(u'\nRunning search pass %s ...' % k)
        column = connect_string(column, ms)
        print(u'Number of candidates: %s ...' % len(column))
        sf = lambda i: d[i].prod(axis=1, numeric_only=True)  # support function for the new candidates

        # Build the joined data. This is the most time- and memory-consuming step.
        # When the dataset is large, parallelizing it can be considered.
        d_2 = pd.DataFrame(list(map(sf, column)), index=[ms.join(i) for i in column]).T

        support_series_2 = 1.0 * d_2[[ms.join(i) for i in column]].sum() / len(d)  # support after joining
        column = list(support_series_2[support_series_2 > support].index)  # filter by support again
        # Series.append was removed in pandas 2.0, so pd.concat is used here
        support_series = pd.concat([support_series, support_series_2])
        column2 = []

        for i in column:  # enumerate possible rules, e.g. is {A,B,C} A+B-->C, B+C-->A or C+A-->B?
            i = i.split(ms)
            for j in range(len(i)):
                column2.append(i[:j] + i[j+1:] + i[j:j+1])

        cofidence_series = pd.Series(index=[ms.join(i) for i in column2], dtype=float)  # confidence series

        for i in column2:  # compute the confidence of each candidate rule
            cofidence_series[ms.join(i)] = support_series[ms.join(sorted(i))] / support_series[ms.join(i[:len(i)-1])]

        for i in cofidence_series[cofidence_series > confidence].index:  # filter by confidence
            result[i] = 0.0
            result.loc['confidence', i] = cofidence_series[i]
            result.loc['support', i] = support_series[ms.join(sorted(i.split(ms)))]

    result = result.T.sort_values(by=['confidence', 'support'], ascending=False)
    print(u'\nResults:')
    print(result)
    return result
 

Package the code above into a module, feed in the dataset, and run it. The result is:

It can be seen that the rule from liver-qi stagnation syndrome type coefficient 2 and liver-kidney yin deficiency syndrome type coefficient 3 to H4 has the highest support, 7.95%, and the highest confidence, 88%. This means that when the liver-qi stagnation coefficient lies in (0.258, 0.352) and the liver-kidney yin deficiency coefficient lies in (0.355, 0.607), the probability that the TNM stage is H4 is 88%, and this combination occurs in 7.95% of the records.
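To make the two measures concrete: the support of a rule is the share of records that contain every item in the rule, and the confidence is that support divided by the support of the antecedent alone. The sketch below computes both on eight made-up records. The short labels "A2", "F3" and "H4" and the helper names are hypothetical and are not the article's data or code.

```python
def support(records, items):
    """Share of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= r for r in records) / len(records)

def confidence(records, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)

# Eight made-up records, each a set of discretized labels
records = [
    {"A2", "F3", "H4"}, {"A2", "F3", "H4"}, {"A2", "F3", "H4"},
    {"A2", "F3", "H1"}, {"A1", "F3", "H4"}, {"A1", "F1", "H1"},
    {"A2", "F1", "H2"}, {"A1", "F2", "H3"},
]

rule_support = support(records, {"A2", "F3", "H4"})
rule_confidence = confidence(records, {"A2", "F3"}, {"H4"})
print(rule_support, rule_confidence)  # 0.375 0.75
```

Here three of the eight records contain all three labels, and four contain the antecedent, so the rule holds in three out of the four cases where the antecedent appears.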
