Problem
Suppose a set of 12 sales prices has been sorted as follows: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Use each of the following methods to divide them into four bins. With equal-frequency (equal-depth) binning, which bin does 15 fall in? And with equal-width binning?
Binning methods are divided into supervised binning and unsupervised binning.
Unsupervised binning
Equal-width binning
The value range of the variable is divided into k intervals of equal width, each of which is treated as a bin.
In this problem the variable ranges from 5 to 215 and k = 4, so each interval has width (215 - 5)/4 = 52.5. The cut points are therefore 57.5, 110, and 162.5, and the data fall into 4 bins.
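The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `equal_width_bins` is made up for illustration, and sending a value that sits exactly on a cut point to the lower bin is an assumption, since the problem does not say how ties are handled.

```python
def equal_width_bins(values, k):
    """Split values into k equal-width bins over [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Interior cut points: lo + width, lo + 2 * width, ...
    cuts = [lo + i * width for i in range(1, k)]
    bins = [[] for _ in range(k)]
    for v in values:
        # Bin index = number of cut points strictly below v.
        # Assumption: a value equal to a cut point goes to the lower bin.
        idx = sum(v > c for c in cuts)
        bins[idx].append(v)
    return cuts, bins


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
cuts, bins = equal_width_bins(prices, 4)
print(cuts)
for name, members in zip("ABCD", bins):
    print(f"Bin {name}: {members}")
```

Note that equal-width binning is sensitive to outliers: the two large prices stretch the range so far that most of the data lands in the first bin.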
Bin A: 5, 10, 11, 13, 15, 35, 50, 55
Bin B: 72, 92
Bin C: empty
Bin D: 204, 215
So under equal-width binning, 15 falls in the first bin.
Equal-frequency (equal-depth) binning
The observations are sorted in ascending order and divided into k parts containing the same number of observations, each part forming a bin. For example, the smallest 1/k of the observations form the first bin, and so on.
This problem has 12 observations and k = 4, so each bin holds 3 values.
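Equal-frequency binning can be sketched just as briefly. The helper name `equal_frequency_bins` is hypothetical, and the way it rounds bin boundaries when the number of observations is not a multiple of k is one possible choice, not something the problem prescribes.

```python
def equal_frequency_bins(values, k):
    """Split values into k bins holding (as nearly as possible) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers positions [i * n // k, (i + 1) * n // k) of the sorted data.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
freq_bins = equal_frequency_bins(prices, 4)
# 1-based position of the bin that contains the value 15.
bin_of_15 = next(i for i, b in enumerate(freq_bins) if 15 in b) + 1
for name, members in zip("ABCD", freq_bins):
    print(f"Bin {name}: {members}")
print("15 is in bin number", bin_of_15)
```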
Bin A: 5, 10, 11
Bin B: 13, 15, 35
Bin C: 50, 55, 72
Bin D: 92, 204, 215
So under equal-frequency binning, 15 falls in the second bin.
K-means clustering binning
K-means clustering is used to group the observations into k classes, but the clustering must preserve the order of the bins: every observation in the first bin is smaller than every observation in the second bin, every observation in the second bin is smaller than every observation in the third bin, and so on. Working this out by hand is too time-consuming, so it should not appear in a written exam.
Supervised binning
Supervised binning takes the values of the dependent variable into account and chooses the bins so as to achieve minimum entropy or minimum description length.
(1) Assume the dependent variable is categorical, with possible values 1, ..., J. Let $p_l(j)$ denote the proportion of observations in bin $l$ whose dependent variable equals $j$ ($l = 1, \dots, k$; $j = 1, \dots, J$). The entropy of bin $l$ is then $\sum_{j=1}^{J} \left[-p_l(j) \log p_l(j)\right]$, with the convention $0 \log 0 = 0$. If all values of the dependent variable occur in equal proportions in bin $l$, that is, $p_l(1) = \dots = p_l(J) = 1/J$, the entropy of bin $l$ is at its maximum, $\log J$. If the dependent variable takes only one value in bin $l$, that is, one $p_l(j)$ equals 1 and all the others equal 0, the entropy of bin $l$ reaches its minimum, 0.
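The two extreme cases are easy to confirm numerically. The sketch below assumes natural logarithms (the text does not fix the base), and `bin_entropy` is an illustrative name rather than a library function.

```python
import math
from collections import Counter


def bin_entropy(labels):
    """Entropy of one bin: sum over the classes present of -p * log(p)."""
    n = len(labels)
    counts = Counter(labels)
    # Classes absent from the bin contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


uniform = bin_entropy([1, 2, 3, 4])  # each of J = 4 classes has proportion 1/4
pure = bin_entropy([2, 2, 2, 2])     # only one class appears in the bin
print(uniform, math.log(4))
print(pure)
```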
(2) Let $r_l$ denote the number of observations in bin $l$ as a proportion of all observations. The total entropy is then $\sum_{l=1}^{k} r_l \sum_{j=1}^{J} \left[-p_l(j) \log p_l(j)\right]$. The total entropy should be minimized, which means that the bins separate the categories of the dependent variable as fully as possible.
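To see how the total entropy ranks candidate binnings, here is a small sketch. The six class labels are invented purely for illustration, natural logarithms are assumed, and `total_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def bin_entropy(labels):
    """Entropy of one bin, using natural logarithms."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


def total_entropy(bins):
    """Weighted sum of bin entropies; the weight r_l is bin l's share of all observations."""
    n = sum(len(b) for b in bins)
    # Empty bins have weight 0 and are skipped.
    return sum(len(b) / n * bin_entropy(b) for b in bins if b)


# Hypothetical class labels of six observations, already sorted by the binned variable.
labels = ["a", "a", "a", "b", "b", "b"]
clean_split = [labels[:3], labels[3:]]  # each bin contains a single class
mixed_split = [labels[:2], labels[2:]]  # the second bin mixes "a" and "b"

print(total_entropy(clean_split))
print(total_entropy(mixed_split))
```

The split that keeps each class in its own bin has the lower total entropy, so a minimum-entropy criterion would prefer it.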