Mahout (1): Data Representation

Source: Internet
Author: User

Recommendation engines process data at large scale. In a cluster environment, the data to be processed may reach several gigabytes, so Mahout optimizes how recommendation data is represented in memory.

Preference

In Mahout, a user preference is abstracted as a Preference, which holds a user ID, an item ID, and a preference value (how strongly the user prefers the item). Preference is an interface; its general-purpose implementation is GenericPreference.

 

Because preference data is large, the natural choice is to put it in a collection or an array. However, because of how Java objects consume memory, Collection<Preference> and Preference[] are very inefficient for large data volumes. Why?

In Java, the number of bytes occupied by an object = 8 bytes (object header) + bytes occupied by primitive fields + bytes occupied by object references. The figures used here correspond to a 32-bit JVM.

(1) The basic 8 bytes

In the JVM, every object (except arrays) has a header made up of two words. The first word stores flag information about the object, such as the lock flag bits and how many garbage collections it has survived. The second word is a reference to the object's class information. The JVM reserves 8 bytes for these two words.

As a result, new Object() occupies 8 bytes even though it is an empty object.

(2) Bytes occupied by primitive types

byte/boolean: 1 byte

char/short: 2 bytes

int/float: 4 bytes

double/long: 8 bytes

(3) Bytes occupied by object references

Reference: 4 bytes

Note: when an object has fields, count the primitive fields and the object references separately. Sum the primitive fields using the sizes in (2) and round the total up to a multiple of 8. Sum the object references at 4 bytes each and round that total up to a multiple of 8 as well.


    class Test {
        Integer i;  // object reference
        long l;     // primitive, 8 bytes
        byte b;     // primitive, 1 byte
    }

 

8 (header) + 16 (primitive fields: 8 + 1 = 9, rounded up to 16) + 8 (one object reference, the Integer: 4, rounded up to 8) = 32 bytes
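The accounting rules above can be written down as a small calculation. This is only a sketch of the article's simplified model (8-byte header, 4-byte references, each group rounded up to a multiple of 8); the class and method names are made up for illustration, and real JVMs may lay objects out differently.

```java
// Estimates object size using the simplified accounting rules from this article.
// Assumption: 8-byte header and 4-byte references; actual JVM layouts vary.
final class ObjectSizeEstimate {
    static final int HEADER_BYTES = 8;
    static final int REFERENCE_BYTES = 4;

    // Rounds n up to the next multiple of 8.
    static int alignTo8(int n) {
        return (n + 7) / 8 * 8;
    }

    // primitiveBytes: summed sizes of all primitive fields.
    // referenceCount: number of object-reference fields.
    static int estimate(int primitiveBytes, int referenceCount) {
        return HEADER_BYTES
                + alignTo8(primitiveBytes)
                + alignTo8(referenceCount * REFERENCE_BYTES);
    }

    public static void main(String[] args) {
        // One long (8) plus one byte (1), and one Integer reference.
        System.out.println(estimate(8 + 1, 1)); // prints 32
    }
}
```

With no fields at all, the same calculation gives 8 bytes, matching the empty new Object() case.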

 

By this accounting, a GenericPreference object occupies 28 bytes: userID (8 bytes) + itemID (8 bytes) + preference value (4 bytes) + the 8-byte header. With Collection<Preference> or Preference[], that 8-byte header is paid again for every object, which is pure overhead. When the data volume reaches gigabytes or terabytes, this overhead becomes hard to bear.

 

Therefore, Mahout provides PreferenceArray to represent a collection of preferences.

 

 

GenericUserPreferenceArray contains a single user ID, an array of item IDs (long[]), and an array of preference values (float[]), rather than a collection of Preference objects. Next, we compare the cost of creating a PreferenceArray with that of a Preference array.
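The parallel-array layout can be sketched in a few lines. The class below is a simplified illustration with hypothetical names, not Mahout's GenericUserPreferenceArray; it only shows how one shared user ID plus two primitive arrays avoids allocating an object per preference.

```java
// Simplified illustration of the parallel-array layout; NOT Mahout's actual class.
final class CompactUserPrefs {
    private final long userID;      // stored once for all entries
    private final long[] itemIDs;   // 8 bytes per preference
    private final float[] values;   // 4 bytes per preference

    CompactUserPrefs(long userID, int size) {
        this.userID = userID;
        this.itemIDs = new long[size];
        this.values = new float[size];
    }

    // Writes the i-th preference into the two arrays; no object is allocated.
    void set(int i, long itemID, float value) {
        itemIDs[i] = itemID;
        values[i] = value;
    }

    long getUserID() { return userID; }
    long getItemID(int i) { return itemIDs[i]; }
    float getValue(int i) { return values[i]; }
    int length() { return itemIDs.length; }

    public static void main(String[] args) {
        CompactUserPrefs prefs = new CompactUserPrefs(1L, 2);
        prefs.set(0, 101L, 2.0f);
        prefs.set(1, 102L, 3.5f);
        System.out.println(prefs.getUserID() + " -> "
                + prefs.getItemID(1) + " : " + prefs.getValue(1));
    }
}
```

Each additional preference costs one long and one float in this layout, with no per-entry header or reference.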

 

Suppose the array size is 5 but only one preference is stored. The PreferenceArray needs 20 bytes (userID 8 bytes + preference value 4 bytes + itemID 8 bytes), while a Preference[] needs 48 bytes (8-byte header + one 28-byte Preference object + 12 bytes of null references, counted as 4 bytes × 3). With more preferences, the PreferenceArray still stores the user ID only once, so that 8-byte cost is not repeated. As a result, PreferenceArray needs roughly a quarter of the memory.

As "Mahout in Action" puts it, Mahout has in effect reinvented the "array of Java objects". The memory saved by PreferenceArray and its implementations is worth far more than the added complexity: it cuts memory overhead by nearly 75% compared with Java collections and object arrays.

In addition to PreferenceArray, Mahout relies heavily on typical data structures such as maps and sets. However, it does not use the standard Java collections such as HashMap and TreeSet directly. Instead, it provides two implementations written specifically for its recommender code: FastByIDMap and FastIDSet. Both are designed to reduce memory overhead and improve performance. The main differences from the standard collections are:

  • Like HashMap, FastByIDMap is hash-based. However, it resolves hash collisions with linear probing rather than separate chaining;
  • Keys of FastByIDMap and members of FastIDSet are primitive long values rather than objects. This saves memory and improves performance;
  • FastByIDMap can act like a cache. It has a notion of "maximum size": when a new element is added beyond that size, infrequently used entries are removed.
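To make the first two points concrete, here is a minimal sketch of a hash map that uses primitive long keys and linear probing. It is a hypothetical illustration, not Mahout's FastByIDMap: it has no removal, no maximum size, and no eviction, and the class name and hash mixing are assumptions chosen for brevity.

```java
// Hypothetical sketch of open addressing with linear probing and primitive long keys.
// It is not Mahout's FastByIDMap: no removal, no maximum size, no eviction.
final class LongKeyMap<V> {
    private long[] keys = new long[16];
    private Object[] values = new Object[16];   // a null value marks an empty slot
    private int size;

    private static int indexFor(long key, int capacity) {
        int h = Long.hashCode(key);
        return (h ^ (h >>> 16)) & (capacity - 1);   // capacity is a power of two
    }

    // Walks forward from the home slot until it finds the key or an empty slot.
    private static int findSlot(long key, long[] ks, Object[] vs) {
        int i = indexFor(key, ks.length);
        while (vs[i] != null && ks[i] != key) {
            i = (i + 1) & (ks.length - 1);
        }
        return i;
    }

    void put(long key, V value) {
        if (value == null) {
            throw new IllegalArgumentException("null values are not supported");
        }
        if ((size + 1) * 2 > keys.length) {
            grow();   // keep the table at most half full so probing always ends
        }
        int i = findSlot(key, keys, values);
        if (values[i] == null) {
            size++;
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        return (V) values[findSlot(key, keys, values)];   // null if absent
    }

    int size() { return size; }

    // Doubles the table and reinserts every occupied slot.
    private void grow() {
        long[] newKeys = new long[keys.length * 2];
        Object[] newValues = new Object[values.length * 2];
        for (int i = 0; i < keys.length; i++) {
            if (values[i] != null) {
                int j = findSlot(keys[i], newKeys, newValues);
                newKeys[j] = keys[i];
                newValues[j] = values[i];
            }
        }
        keys = newKeys;
        values = newValues;
    }
}
```

Because keys live in a long[] and collisions are resolved inside the table, there is no boxed Long and no per-entry node object, which is where the memory savings come from.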

FastByIDMap and FastIDSet are significantly more compact. Each member of a FastIDSet takes about 14 bytes on average, versus 84 bytes for HashSet. Each entry of a FastByIDMap takes about 28 bytes, versus 84 bytes for HashMap.

DataModel

The input that the Mahout recommendation engine actually accepts is a DataModel, a compact representation of user preference data. Concrete DataModel implementations can draw preferences from any kind of data source. A DataModel can return the list and count of user IDs associated with an item, as well as the total number of users and items in the input data. Implementations include the in-memory GenericDataModel, FileDataModel for reading from files, and JDBCDataModel for reading from databases.

 

GenericDataModel is the in-memory implementation of DataModel. It is suited to building recommendation data in memory. It simply accepts user preference data as input to the recommendation engine and stores PreferenceArrays hashed by user ID and by item ID; each PreferenceArray holds all the preferences for that user ID or item ID.

 

FileDataModel reads preferences from a file. Mahout places few requirements on the file format; the file only needs to meet the following:

  • Each line contains a user ID, an item ID, and a preference value.
  • Fields are separated by commas or tabs.
  • *.zip and *.gz files are decompressed automatically. (Compressed storage is recommended when the data volume is large.)

FileDataModel reads the data from the file and loads it into memory as a GenericDataModel. For details, see the buildModel method in FileDataModel.
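The line format described above is simple enough to parse by hand. The sketch below is an illustration of the format only, with hypothetical class names; it is not FileDataModel's actual parsing code and ignores details such as compressed input.

```java
import java.util.regex.Pattern;

// Hypothetical parser for one line of preference data:
// "userID,itemID,value" or the same fields separated by tabs.
final class PrefLineParser {
    private static final Pattern DELIMITER = Pattern.compile("[,\t]");

    // Plain holder for the three parsed fields.
    static final class ParsedPref {
        final long userID;
        final long itemID;
        final float value;

        ParsedPref(long userID, long itemID, float value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    static ParsedPref parse(String line) {
        String[] fields = DELIMITER.split(line.trim());
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                    "expected userID, itemID, value but got: " + line);
        }
        return new ParsedPref(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                Float.parseFloat(fields[2].trim()));
    }

    public static void main(String[] args) {
        ParsedPref p = parse("1,101,5.0");
        System.out.println(p.userID + " " + p.itemID + " " + p.value);
    }
}
```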

 

JDBCDataModel reads preferences from a database. Mahout provides default support for MySQL through MySQLJDBCDataModel, which places the following requirements on how preference data is stored:

  • The user ID column must be BIGINT and non-null.
  • The item ID column must be BIGINT and non-null.
  • The preference value column must be FLOAT.
  • An index on the user ID and item ID columns is recommended.

Sometimes we ignore preference values and care only about whether an association between a user and an item exists. Mahout calls this a "boolean preference", because the association between a user and an item either exists or does not exist. Note that a missing association does not mean dislike. If the data recorded likes and dislikes, every user-item pair would have three possible states: like, dislike, and no association.

This kind of preference is useful when the preference data contains a lot of noise. For boolean preferences, Mahout provides an in-memory DataModel, GenericBooleanPrefDataModel.

 

GenericBooleanPrefDataModel does not store preference values; it stores only the associated user IDs and item IDs. Note the difference from GenericDataModel: GenericBooleanPrefDataModel uses FastIDSet and keeps only the associated IDs, with no preference values. As a result, some of its methods (inherited from DataModel), such as getItemIDsForUser(), run faster, while getPreferencesFromUser() runs slower. Because GenericBooleanPrefDataModel stores no preference values, a user's preference value for an associated item defaults to 1.0.


 

    @Override
    public Float getPreferenceValue(long userID, long itemID) throws NoSuchUserException {
      // Look up the set of item IDs associated with this user.
      FastIDSet itemIDs = preferenceFromUsers.get(userID);
      if (itemIDs == null) {
        throw new NoSuchUserException(userID);
      }
      // An existing association always has the value 1.0.
      if (itemIDs.contains(itemID)) {
        return 1.0f;
      }
      // No association between this user and item.
      return null;
    }
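The same lookup can be tried out without Mahout on the classpath. The following is a simplified stand-in that uses java.util collections in place of FastByIDMap and FastIDSet; the class name, the addAssociation method, and the use of IllegalArgumentException instead of NoSuchUserException are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for a boolean-preference store; not Mahout's implementation.
final class BooleanPrefStore {
    // Only associations are kept: user ID -> set of item IDs. No values are stored.
    private final Map<Long, Set<Long>> itemsByUser = new HashMap<>();

    void addAssociation(long userID, long itemID) {
        itemsByUser.computeIfAbsent(userID, k -> new HashSet<>()).add(itemID);
    }

    // Returns 1.0f when the association exists and null when it does not.
    Float getPreferenceValue(long userID, long itemID) {
        Set<Long> items = itemsByUser.get(userID);
        if (items == null) {
            throw new IllegalArgumentException("no such user: " + userID);
        }
        return items.contains(itemID) ? Float.valueOf(1.0f) : null;
    }

    public static void main(String[] args) {
        BooleanPrefStore store = new BooleanPrefStore();
        store.addAssociation(1L, 101L);
        System.out.println(store.getPreferenceValue(1L, 101L)); // prints 1.0
        System.out.println(store.getPreferenceValue(1L, 102L)); // prints null
    }
}
```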
