Full-Text Search, Data Mining, and Recommendation Engine Series 7: Item Similarity Algorithms

Source: Internet
Author: User

In real projects, similarity calculation is needed in many situations. In e-commerce systems, for example, the familiar "users who like this product also like these products" feature is commonly implemented with similarity calculation, which can be considered an application of a content-based recommendation system. The same algorithm is not limited to recommending products: it can also compute the similarity between users and recommend other users that a user may be interested in. Unlike text analysis, similarity calculation is generally based on users' interaction data, such as voting, scoring, browsing, and purchasing. After suitable preprocessing, this interaction data is converted to numbers: browsing, purchasing, and voting are recorded as 0 or 1, and scores are used at their actual value.
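The digitizing step described above could look something like the following minimal sketch. The class and method names (InteractionEncoder, encodeFlag, encodeRating) are hypothetical and do not come from the original article; the sketch only assumes that yes/no interactions map to 0/1 and that scores are kept as given.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original article's code.
class InteractionEncoder {
    // Yes/no interactions (browsed, purchased, voted) become 1.0 or 0.0.
    static double encodeFlag(boolean happened) {
        return happened ? 1.0 : 0.0;
    }

    // Explicit scores are used at their actual value.
    static double encodeRating(int score) {
        return (double) score;
    }

    public static void main(String[] args) {
        // One user's interactions with one product, as numeric features.
        Map<String, Double> row = new LinkedHashMap<>();
        row.put("browsed", encodeFlag(true));
        row.put("purchased", encodeFlag(false));
        row.put("rating", encodeRating(4));
        System.out.println(row);
    }
}
```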

These algorithms have two clear advantages over text analysis algorithms. First, text analysis has to deal with language differences, and each language needs its own processing; Chinese word segmentation, for example, is much more complex than English word segmentation. Similarity calculation does not depend on the language at all. Second, similarity calculation is based on the interaction data between users and the system, so currently popular items are reflected more accurately and the recommendation results are more timely.

Of course, the recommendation relies on data mining techniques, so it is inevitably affected by noise in the raw data. In the current network environment, practices such as shuijun (paid posters), paid placement on portal sites, and SEO can turn poor-quality items into hot spots and thereby lower the recommendation quality. When using similarity-based recommendation, you therefore need to weigh a variety of factors to find the most suitable recommendation algorithm.

The following describes a similarity calculation method based on users' ratings of products:

        Product1  Product2  Product3
User1   3         4         2
User2   2         2         4
User3   1         3         5

Table 1

For simplicity, we assume that three users have rated three products, as shown in Table 1. We first want to use this data to calculate product similarity, so we transpose the data so that each row holds one product's ratings:

          User1  User2  User3
Product1  3      2      1
Product2  4      2      3
Product3  2      4      5

Table 2
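The reorganization from Table 1 to Table 2 is a matrix transpose. A minimal sketch, assuming the ratings are held in a non-empty rectangular array (the class and method names are hypothetical and not from the original article, which uses nested Hashtables instead):

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original article's code.
class RatingTranspose {
    // Turns a user-by-product matrix into a product-by-user matrix.
    // Assumes the input is non-empty and every row has the same length.
    static double[][] transpose(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[][] t = new double[cols][rows];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                t[c][r] = m[r][c];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        double[][] table1 = {
            {3, 4, 2},  // User1
            {2, 2, 4},  // User2
            {1, 3, 5}   // User3
        };
        // Each printed row is one product's ratings by User1..User3.
        for (double[] productRow : transpose(table1)) {
            System.out.println(Arrays.toString(productRow));
        }
    }
}
```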

With the raw data prepared, the similarity calculation algorithm is shown below.

Results are generally sorted by similarity. We define the ItemInfo class to hold the information needed for each entry in the sorted similarity list.

public class ItemInfo implements Comparable<ItemInfo> {
    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }

    public double getVal() {
        return val;
    }

    public void setVal(double val) {
        this.val = val;
    }

    private String key = null;  // item ID
    private double val = 0.0;   // similarity value

    // Sorts in descending order of similarity. Double.compare returns 0 for
    // equal values, which the Comparable contract requires.
    @Override
    public int compareTo(ItemInfo o) {
        return Double.compare(o.getVal(), val);
    }
}

The class implements the Comparable interface so that a List of ItemInfo objects can be sorted.
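As an illustration of how such a comparable entry behaves when sorted, here is a minimal self-contained sketch. ScoredItem is a hypothetical stand-in for the article's ItemInfo, reduced to two fields so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for ItemInfo, not part of the original article's code.
class ScoredItem implements Comparable<ScoredItem> {
    final String key;
    final double val;

    ScoredItem(String key, double val) {
        this.key = key;
        this.val = val;
    }

    // Larger similarity first; equal values compare as 0.
    @Override
    public int compareTo(ScoredItem other) {
        return Double.compare(other.val, this.val);
    }

    // Returns the keys ordered from most to least similar, leaving the input untouched.
    static List<String> sortedKeys(List<ScoredItem> items) {
        List<ScoredItem> copy = new ArrayList<>(items);
        Collections.sort(copy);
        List<String> keys = new ArrayList<>();
        for (ScoredItem item : copy) {
            keys.add(item.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<ScoredItem> items = new ArrayList<>();
        items.add(new ScoredItem("3", 0.756978119245116));
        items.add(new ScoredItem("2", 0.9429541672723838));
        System.out.println(sortedKeys(items)); // most similar key comes first
    }
}
```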

We also need a class that stores the list of similarities between one product and all other products:

import java.util.Hashtable;
import java.util.List;
import java.util.Vector;

public class SmlrList {
    public SmlrList() {
        rates = new Hashtable<String, Double>();
        sortedKeys = new Vector<String>();
        sortedVals = new Vector<ItemInfo>();
    }

    public List<ItemInfo> getSortedVals() {
        return sortedVals;
    }

    public void setSortedVals(List<ItemInfo> sortedVals) {
        this.sortedVals = sortedVals;
    }

    public List<String> getSortedKeys() {
        return sortedKeys;
    }

    public void setSortedKeys(List<String> sortedKeys) {
        this.sortedKeys = sortedKeys;
    }

    public Hashtable<String, Double> getRates() {
        return rates;
    }

    public void setRates(Hashtable<String, Double> rates) {
        this.rates = rates;
    }

    private List<ItemInfo> sortedVals = null;        // entries sorted by similarity, largest first
    private List<String> sortedKeys = null;          // item IDs in sorted order (for debugging)
    private Hashtable<String, Double> rates = null;  // similarity to each other item, keyed by item ID
}
As the code above shows, the similarities between the product and the other products are stored in rates. For debugging convenience we added sortedKeys, which lets us display the similarities in product-ID order, while sortedVals is the list sorted by similarity from largest to smallest. Note: in a real system only sortedVals is needed; the other attributes exist for debugging purposes.

The calling code is as follows:

// Imports needed by the enclosing class:
// java.util.Collections, java.util.Hashtable, java.util.List, java.util.Vector

/**
 * Calculates the similarity list of items, such as product similarity,
 * location similarity, or user similarity.
 * The raw data is the product-by-user matrix in Table 2.
 */
public void test() {
    Hashtable<String, SmlrList> itemsSmlrList = null;
    Vector<String> sortedItemId = null; // sorted item IDs
    Hashtable<String, Hashtable<String, Double>> rateData = null; // ratings of each item, keyed by user
    Hashtable<String, Double> itemRateData = null;

    // Initialize the raw data
    rateData = new Hashtable<String, Hashtable<String, Double>>();
    int rowId = 1;
    int colId = 1;
    // Add the third row
    rowId = 3;
    itemRateData = new Hashtable<String, Double>();
    colId = 1;
    itemRateData.put("" + colId, 2.0);
    colId = 2;
    itemRateData.put("" + colId, 4.0);
    colId = 3;
    itemRateData.put("" + colId, 5.0);
    rateData.put("" + rowId, itemRateData);
    // Add the first row
    rowId = 1;
    itemRateData = new Hashtable<String, Double>();
    colId = 1;
    itemRateData.put("" + colId, 3.0);
    colId = 2;
    itemRateData.put("" + colId, 2.0);
    colId = 3;
    itemRateData.put("" + colId, 1.0);
    rateData.put("" + rowId, itemRateData);
    // Add the second row
    rowId = 2;
    itemRateData = new Hashtable<String, Double>();
    colId = 1;
    itemRateData.put("" + colId, 4.0);
    colId = 2;
    itemRateData.put("" + colId, 2.0);
    colId = 3;
    itemRateData.put("" + colId, 3.0);
    rateData.put("" + rowId, itemRateData);

    sortedItemId = new Vector<String>(rateData.keySet());
    Collections.sort(sortedItemId);
    // Normalize each row to unit length and display it
    Hashtable<String, Double> normUserRateData = null;
    Vector<String> sortedUk = null;
    for (String rowKey : sortedItemId) {
        double sum = 0.0;
        for (Double dbl : rateData.get(rowKey).values()) {
            sum += dbl.doubleValue() * dbl.doubleValue();
        }
        sum = Math.sqrt(sum);
        normUserRateData = new Hashtable<String, Double>();
        itemRateData = rateData.get(rowKey);
        for (String colKey : itemRateData.keySet()) {
            normUserRateData.put(colKey, itemRateData.get(colKey).doubleValue() / sum);
        }
        // put() replaces the raw row with the normalized one
        rateData.put(rowKey, normUserRateData);
        // Print
        sortedUk = new Vector<String>(rateData.get(rowKey).keySet());
        Collections.sort(sortedUk);
        for (String suk : sortedUk) {
            System.out.print(" " + suk + ":" + rateData.get(rowKey).get(suk).doubleValue());
        }
        System.out.println();
    }

    // Calculate the similarity between items
    itemsSmlrList = new Hashtable<String, SmlrList>();
    SmlrList smlrList = null;
    ItemInfo itemInfo = null;
    int i = 0;
    int j = 0;
    double smlrVal = 0.0;
    for (i = 0; i < sortedItemId.size(); i++) {
        smlrList = new SmlrList();
        for (j = 0; j < sortedItemId.size(); j++) {
            // Relies on every row holding the same user keys so that values()
            // returns them in the same order; Hashtable does not guarantee this in general.
            smlrVal = calDotProd(
                    new Vector<Double>(rateData.get(sortedItemId.get(i)).values()),
                    new Vector<Double>(rateData.get(sortedItemId.get(j)).values()));
            smlrList.getRates().put(sortedItemId.get(j), smlrVal);
            smlrList.getSortedKeys().add(sortedItemId.get(j));
            itemInfo = new ItemInfo();
            itemInfo.setKey(sortedItemId.get(j));
            itemInfo.setVal(smlrVal);
            smlrList.getSortedVals().add(itemInfo);
        }
        Collections.sort(smlrList.getSortedKeys());
        Collections.sort(smlrList.getSortedVals());
        itemsSmlrList.put(sortedItemId.get(i), smlrList);
    }

    // Display the similarity results in item-ID order
    SmlrList sl02 = null;
    for (String uk2 : sortedItemId) {
        sl02 = itemsSmlrList.get(uk2);
        System.out.print(uk2 + ":");
        for (String uk3 : sl02.getSortedKeys()) {
            System.out.print(" " + sl02.getRates().get(uk3) + "[" + uk3 + "]");
        }
        System.out.println();
    }
    System.out.println("**************************************");
    // Display each item's neighbors from most to least similar, skipping the item itself
    for (String rowKey : sortedItemId) {
        smlrList = itemsSmlrList.get(rowKey);
        System.out.print(rowKey + ":");
        for (ItemInfo itemInfo1 : smlrList.getSortedVals()) {
            if (!itemInfo1.getKey().equals(rowKey)) {
                System.out.print(" " + itemInfo1.getVal() + "[" + itemInfo1.getKey() + "]");
            }
        }
        System.out.println();
    }
}

The program output is:

 1:0.8017837257372732 2:0.5345224838248488 3:0.2672612419124244
 1:0.7427813527082074 2:0.3713906763541037 3:0.5570860145311556
 1:0.29814239699997197 2:0.5962847939999439 3:0.7453559924999299
1: 1.0[1] 0.9429541672723838[2] 0.756978119245116[3]
2: 0.9429541672723838[1] 1.0[2] 0.8581366251553131[3]
3: 0.756978119245116[1] 0.8581366251553131[2] 1.0[3]
**************************************
1: 0.9429541672723838[2] 0.756978119245116[3]
2: 0.9429541672723838[1] 0.8581366251553131[3]
3: 0.8581366251553131[2] 0.756978119245116[1]

 

Similarly, to calculate user similarity, we only need to feed the program the raw data in the format of Table 1, with one row per user.
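To make the user-similarity case concrete, the sketch below applies a cosine similarity directly to the rows of Table 1. It is an illustration under stated assumptions, not the article's code: the class name UserSimilarity and the array-based layout are hypothetical, and the vectors are assumed to have equal length and not be all zeros.

```java
// Hypothetical sketch, not part of the original article's code.
class UserSimilarity {
    // Cosine similarity: dot product divided by the product of the vector lengths.
    // Assumes both vectors have the same length and neither is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Rows of Table 1: each user's ratings of Product1..Product3.
        double[][] table1 = {
            {3, 4, 2},
            {2, 2, 4},
            {1, 3, 5}
        };
        for (int i = 0; i < table1.length; i++) {
            for (int j = i + 1; j < table1.length; j++) {
                System.out.printf("User%d vs User%d: %.4f%n",
                        i + 1, j + 1, cosine(table1[i], table1[j]));
            }
        }
    }
}
```

On the Table 1 data this gives roughly 0.83 for User1 and User2, 0.78 for User1 and User3, and 0.97 for User2 and User3, so User2 and User3 have the most similar tastes.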

To summarize, with the classes and methods above we can easily obtain the similarity between any two items. We can also list, for a given item, its similarity to all other items, sorted from largest to smallest.

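Similarity is measured with the dot product of the two vectors: the larger the dot product, the more similar the corresponding items. Because the calling code first normalizes every row to unit length, this dot product equals the cosine of the angle between the two rating vectors. The calculation is as follows: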

// Dot product of two vectors of equal length (requires java.util.List).
public double calDotProd(List<Double> vec1, List<Double> vec2) {
    double dotProd = 0.0;
    for (int i = 0; i < vec1.size(); i++) {
        dotProd += vec1.get(i).doubleValue() * vec2.get(i).doubleValue();
    }
    return dotProd;
}
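The two steps, normalizing each row and then taking the dot product, can be checked in isolation. The following minimal sketch uses plain arrays instead of the article's Hashtables; the class NormalizedDot and its method names are hypothetical. For Product1 (3, 2, 1) and Product2 (4, 2, 3) the raw dot product is 19 and the vector lengths are sqrt(14) and sqrt(29), so the result should be 19 / sqrt(406), about 0.943, in line with the program output above.

```java
// Hypothetical sketch, not part of the original article's code.
class NormalizedDot {
    // Scales a vector to unit Euclidean length. Assumes the vector is not all zeros.
    static double[] normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double len = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / len;
        }
        return out;
    }

    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] product1 = {3, 2, 1};  // Table 2, row Product1
        double[] product2 = {4, 2, 3};  // Table 2, row Product2
        System.out.println(dot(normalize(product1), normalize(product2)));
    }
}
```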

 

 
