When estimating CTR, the available features generally fall into three categories:
User features: the query the user entered, plus attributes such as age, consumption level, and historical behavior.
Ad features: the item's attributes, item popularity, advertiser rating, and so on.
Historical feedback features: features extracted from past PV/click logs, i.e. using observed historical CTR as an estimate, such as each ad's real-time CTR, or ad-by-gender cross CTR.
With a huge volume of data, the head of the distribution has plenty of training samples, so the main consideration when pruning features is keeping the training samples balanced across features. Because a large fraction of the ads (items) on the platform sit in the long tail, it is questionable whether features covered by only a few samples can be trained reliably at all.
1 Feature Selection
When deciding whether a factor can serve as a feature, first make sure the feature is actually discriminative in the data. For example, whether the query is "dress" or "basketball" correlates strongly with the user's gender, so gender is discriminative there. By contrast, consider the user's age: you cannot say that because 30 is a bigger number than 20, a 30-year-old will have a higher CTR. Similarly, a high PV does influence CTR, but not monotonically; high PV does not necessarily mean high CTR. These are non-linear features, and such features need follow-up processing (such as discretization) before they are useful.
A feature being plausible does not mean more features are always better. Once there are too many features, the training samples for some of them become insufficient, so the learned weights for those features will be inaccurate; too many features can also cause overfitting and poor generalization. For example, the feature querylength=100 is meaningful in itself, but the queries it fires on are not top queries; they belong to the long-tail portion of traffic, so the training samples are too few and the learned weight ends up unreliable.
In practice, the ad's feedback CTR also turns out to be a very effective feature. It is the click-through rate observed on the impressions the ad has already received while running, which can be viewed as a measure of the ad's quality; it is very effective for estimating the CTR on new traffic.
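As a minimal sketch of such a feedback-CTR feature, the following computes a smoothed historical CTR from an ad's impression and click counts. The Laplace-style prior is my own assumption for illustration, not the author's method (the post defers smoothing to a later section):

```python
def feedback_ctr(clicks, impressions, prior_ctr=0.01, prior_strength=100):
    """Smoothed historical CTR for an ad; with few impressions the
    estimate stays near the prior instead of a noisy raw ratio."""
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)

# A well-shown ad: the estimate is dominated by its own history.
print(round(feedback_ctr(400, 10000), 4))  # 0.0397
# A barely-shown ad: the estimate stays near the prior of 0.01.
print(round(feedback_ctr(1, 20), 4))       # 0.0167
```

The `prior_strength` parameter behaves like a count of pseudo-impressions at the prior CTR, so the feature degrades gracefully for new ads.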
2 Feature Discretization
First, my understanding of why discretization is needed. As noted earlier, the user's age is a useful feature, but within the interval 20–30 CTR shows no obvious difference, and comparing 20 and 30 as magnitudes is meaningless, as is adding or subtracting them. Yet both the optimization and the actual CTR computation treat them as magnitudes: in w·x, once w is fixed, whether the feature value x is 20 or 30 changes w·x substantially, and even after the logistic transform the difference remains large, while in reality a 20-year-old and a 30-year-old rarely differ that much in their interest in the same ad. Hence the need for discretization.
There are several ways to discretize:
1) Discretize directly by the feature's own values: for item_id, map the concrete values to 0, 1, 2, 3, ... (note that when the data volume is large, an additional staging step may be needed).
2) Discretize using other information about the feature, e.g. equal-frequency discretization into several dimensions, designed according to the specific circumstances ...
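Option 1 above can be sketched as a simple id-to-index mapping. The hashing fallback for very large id spaces is my own assumption about what the "staging step" might mean, not something the source confirms:

```python
import hashlib

def build_id_index(ids):
    """Map raw id values to dense indices 0, 1, 2, ... in first-seen order."""
    index = {}
    for v in ids:
        if v not in index:
            index[v] = len(index)
    return index

def hashed_index(v, num_buckets=1_000_000):
    """For huge id spaces, hash into a fixed number of buckets instead of
    keeping an exact dictionary (accepting some collisions)."""
    return int(hashlib.md5(str(v).encode()).hexdigest(), 16) % num_buckets

idx = build_id_index(["item_42", "item_7", "item_42", "item_9"])
print(idx)  # {'item_42': 0, 'item_7': 1, 'item_9': 2}
```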
The guiding principle of discretization is discriminability across intervals: a continuous range is split into intervals so that each interval can carry its own weight, reflecting that the same scale means different things in different ranges, and also helping keep the training samples balanced.
Suppose feature number 1 is the ad's own CTR, and assume internet ad CTR follows a long-tail, log-normal distribution, whose probability density is shown in the figure below (note this is hypothetical and does not represent real data, although observations of real data do show roughly this shape; Yahoo's smoothing paper, it seems, says it fits a beta distribution).
Most ads sit in a small low-CTR range; the higher the CTR, the fewer the ads, and those ads cover less traffic. In other words, around a CTR of 0.2%, if ad A's CTR is 0.2% and ad B's is 0.25%, that 0.05% gap is actually enough to say B is substantially better than A. But around 1%, if A's CTR is 1% and B's is 1.05%, you cannot conclude B is much better than A, because few ads fall within that 0.05% interval and the two can basically be considered similar. The same figure can also be read as a traffic distribution: the horizontal axis is the proportion of ads, the vertical axis the PV, and PV is concentrated on a small number of ads.
That is, CTR values in different intervals should be given different weights, because the ad-CTR feature and the user's probability of clicking are not perfectly positively correlated: for some features a larger value matters more, while for others the effect saturates once the value grows beyond a certain point.
For this kind of problem, scientists at Baidu proposed discretizing continuous features. Their view is that a continuous feature has different importance in different intervals, so each interval should get its own weight; the implementation is to split the feature's range into intervals and treat each interval as a new feature.
The implementation uses equal-frequency discretization:
1) For feature number 1 above, first collect its value from every historical display record. Suppose there are 10,000 display records, each with a distinct floating-point value for this feature. Sort all records by this value from low to high; the 1,000 records with the lowest values form one interval, records ranked 1,001 to 2,000 form the next, and so on, giving 10 intervals in total.
2) Renumber the features. For the 1,000 records ranked 1 to 1,000, the original feature number 1 becomes new feature number 1 with value 1; for records ranked 1,001 to 2,000 it becomes new feature number 2 with value 1; and so on, up to new feature number 10. For any display record, only the new feature corresponding to its interval has value 1 and the other nine are 0. The ad's own CTR thus occupies 10 feature numbers, i.e. it is discretized into 10 features.
Equal-frequency discretization is applied to each original feature, so the original feature numbers 1 to 13 are discretized into many numbers. If each feature is discretized into 10 intervals, there will ultimately be 130 features, and the trained w will be a 130-dimensional vector holding the 130 feature weights.
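The two steps above can be sketched as follows, using quantile boundaries rather than an explicit re-numbering of records; the bucket count and the synthetic data are illustrative assumptions:

```python
import bisect
import random

def equal_freq_boundaries(values, num_buckets=10):
    """Interior cut points splitting the sorted values into equal-sized buckets."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // num_buckets] for i in range(1, num_buckets)]

def discretize(value, boundaries):
    """Bucket index in [0, num_buckets); this index is the new feature number."""
    return bisect.bisect_right(boundaries, value)

def one_hot(bucket, num_buckets=10):
    """The discretized record: value 1 at the bucket's feature, 0 elsewhere."""
    return [1 if i == bucket else 0 for i in range(num_buckets)]

# Synthetic long-tail CTR values standing in for the 10,000 display records.
random.seed(0)
ctrs = [random.lognormvariate(-5, 1) for _ in range(10000)]
bounds = equal_freq_boundaries(ctrs)
vec = one_hot(discretize(ctrs[0], bounds))
```

Because the cut points are taken at every 1,000th sorted value, each of the 10 buckets receives exactly 1,000 of the 10,000 records, matching the equal-frequency scheme described above.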
Practice shows that discretized features can fit the non-linear relationships in the data and give better results than the original continuous features. Online, no multiplications are needed (the feature values are all 0 or 1), which also speeds up CTR computation.
One more example: take the PV and click counts associated with a query_id/item_id pair and discretize them two-dimensionally. Since 1000:400 and 10:4 imply the same CTR and thus the same effect relative to the benchmark, their two-dimensional discretization should place them in similar buckets.
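A minimal sketch of this idea buckets the (pv, click) pair by its implied CTR, so that 1000:400 and 10:4 land in the same bucket; the bucket edges here are illustrative assumptions, not values from the source:

```python
def ctr_bucket(pv, click, edges=(0.01, 0.05, 0.1, 0.2, 0.4)):
    """Map a (pv, click) pair to a bucket index by its implied CTR,
    so pairs with the same ratio discretize identically."""
    ctr = click / pv if pv else 0.0
    for i, edge in enumerate(edges):
        if ctr < edge:
            return i
    return len(edges)

print(ctr_bucket(1000, 400))  # 5
print(ctr_bucket(10, 4))      # 5 -- same ratio, same bucket
```

A fuller two-dimensional version would also bucket PV itself, so that the confidence behind the ratio (1000 impressions vs. 10) is preserved as a second coordinate.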
4 Feature Crosses / Multidimensional Features (enriching the representation)
A one-dimensional feature alone is often meaningless. For example, if a person is 20 years old, then feature number 2 above has value 1 whether the ad is about basketball or about cosmetics, so the trained weight for feature 2 means "the probability that a 20-year-old clicks any ad at all", which is unreasonable. Feature selection therefore includes many combined features, such as gender/ad-category crosses: male/cosmetics and female/cosmetics are combination features with specific meanings.
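A crossed feature can be encoded by giving each (gender, ad-category) pair its own feature id; the vocabularies below are illustrative assumptions:

```python
GENDERS = ["male", "female"]
AD_CATEGORIES = ["basketball", "cosmetics", "dress"]

def cross_feature_id(gender, category):
    """Each (gender, category) pair gets its own feature number, so the
    model can learn female/cosmetics separately from male/cosmetics."""
    g = GENDERS.index(gender)
    c = AD_CATEGORIES.index(category)
    return g * len(AD_CATEGORIES) + c

print(cross_feature_id("male", "cosmetics"))    # 1
print(cross_feature_id("female", "cosmetics"))  # 4
```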
Combined features also need discretization where required, but they are rarely discretized directly by their raw values; more commonly they are represented through their PV, click, and CTR statistics.
5 Feature Filtering and Correction
Data smoothing and regularization; a detailed introduction follows later.
6 Feature Verification
Direct observation of CTR, chi-square tests, and single-feature AUC.
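The single-feature AUC check can be sketched from scratch: rank the click labels by one candidate feature's values and compute the probability that a random positive outranks a random negative. The O(n²) pairwise form below is for clarity, not efficiency:

```python
def single_feature_auc(feature_values, labels):
    """AUC of one feature against 0/1 click labels: the probability that
    a random clicked example has a higher feature value than a random
    unclicked one, counting ties as half."""
    pos = [v for v, y in zip(feature_values, labels) if y == 1]
    neg = [v for v, y in zip(feature_values, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A feature that separates clicks perfectly scores 1.0;
# an uninformative feature hovers near 0.5.
print(single_feature_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
```

A feature whose single-feature AUC is close to 0.5 adds little discriminative power on its own, which ties back to the discriminability requirement in feature selection above.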