Mahout Data Representation

Source: Internet
Author: User

Recommendation data is processed at large scale: in a cluster environment, the data handled in a single run may be several gigabytes. For this reason, Mahout optimizes the way recommendation data is represented.

Preference

In Mahout, a user's preference is abstracted as a Preference, which contains a user ID, an item ID, and a preference value (the user's preference for the item). Preference is an interface, and its common implementation is GenericPreference.

Because there is so much preference data, we would usually put it into a collection or an array. However, because of the way Java objects consume memory, Collection<Preference> and Preference[] are very inefficient for large amounts of data. Why?

In Java, the number of bytes occupied by an object = the basic 8 bytes + the bytes of its primitive-type fields + the bytes of its object references

(1) The basic 8 bytes

In the JVM, every object (arrays aside) has a header made up of two words. The first word stores flag bits for the object, such as the lock flag and the number of GCs it has survived; the second word is a reference to the object's class information. The JVM reserves 8 bytes for these two words. (The figures in this section correspond to a 32-bit JVM; a 64-bit JVM uses a larger header.)

As a result, new Object() occupies 8 bytes even though it is an empty object.

(2) The number of bytes occupied by primitive types

byte/boolean: 1 byte

char/short: 2 bytes

int/float: 4 bytes

double/long: 8 bytes

(3) The number of bytes occupied by an object reference

reference: 4 bytes

Note: In practice, data members are counted separately as primitive fields and object references. The primitive fields are summed using the sizes in (2) and the total is rounded up to a multiple of 8; the object references are summed at 4 bytes each and that total is also rounded up to a multiple of 8.


For example, an object whose primitive fields total 9 bytes (8 + 1) and which holds one object reference (an Integer) occupies 8 (basic) + 16 (primitive fields: 8 + 1 = 9, aligned to 16) + 8 (object reference: 4, aligned to 8) = 32 bytes.
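The counting rule above can be written out as a small calculation. This is only a sketch of the rule as this article states it (8-byte header, 4-byte references, each group aligned to 8); the class and method names are made up for illustration, and a real JVM may lay objects out differently.

```java
// Sketch of the sizing rule described in the text; not a measurement of any real JVM.
final class ObjectSizeSketch {
    static final int HEADER_BYTES = 8;     // the "basic 8 bytes"
    static final int REFERENCE_BYTES = 4;  // size of one object reference in this model

    // Round n up to the next multiple of 8.
    static int alignTo8(int n) {
        return (n + 7) / 8 * 8;
    }

    // primitiveBytes: sum of the primitive field sizes
    // referenceCount: number of object-reference fields
    static int estimate(int primitiveBytes, int referenceCount) {
        return HEADER_BYTES
                + alignTo8(primitiveBytes)
                + alignTo8(referenceCount * REFERENCE_BYTES);
    }

    public static void main(String[] args) {
        // The worked example: 8 + 1 primitive bytes plus one Integer reference.
        System.out.println(estimate(8 + 1, 1)); // prints 32
        // An empty object is just the header.
        System.out.println(estimate(0, 0));     // prints 8
    }
}
```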

By this count, a GenericPreference object takes up at least 28 bytes: userID (8 bytes) + itemID (8 bytes) + preference value (4 bytes) + the basic 8 bytes = 28, before any alignment padding. If we use Collection<Preference> or Preference[], we pay those basic 8 bytes again for every single preference. With gigabytes or terabytes of data, this overhead is hard to bear.

For this reason, Mahout provides PreferenceArray, which represents a collection of preference data.


GenericUserPreferenceArray contains one user ID, an array of item IDs (long[]), and an array of the user's preference values (float[]), rather than a collection of Preference objects. Let's compare the two approaches by creating a PreferenceArray and a Preference[] respectively.


Take a size of 5 with only one preference stored. The PreferenceArray needs 20 bytes (userID 8 bytes + preference value 4 bytes + itemID 8 bytes), while the Preference[] needs about 48 bytes (the basic 8 bytes + one Preference object at 28 bytes + the null references, counted here as 4 × 3 = 12 bytes). With more preferences, the PreferenceArray still stores the user ID only once, so each additional preference costs 8 bytes less. Thanks to this special implementation, PreferenceArray needs roughly a quarter of the memory.
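The layout described above, one shared user ID plus parallel primitive arrays, can be sketched as follows. This is a simplified stand-in written for this article, not Mahout's GenericUserPreferenceArray; the class name UserPrefsSketch and its methods are hypothetical, and payloadBytes() counts only the data fields, as the comparison above does.

```java
// Hypothetical stand-in for a per-user preference array; not Mahout's class.
final class UserPrefsSketch {
    private final long userID;     // stored once for all preferences
    private final long[] itemIDs;  // 8 bytes per preference
    private final float[] values;  // 4 bytes per preference

    UserPrefsSketch(long userID, int size) {
        this.userID = userID;
        this.itemIDs = new long[size];
        this.values = new float[size];
    }

    void set(int i, long itemID, float value) {
        itemIDs[i] = itemID;
        values[i] = value;
    }

    // Every index reports the same user ID, because it is shared.
    long getUserID(int i) { return userID; }
    long getItemID(int i) { return itemIDs[i]; }
    float getValue(int i) { return values[i]; }
    int length() { return itemIDs.length; }

    // Data bytes only: one user ID plus 12 bytes per stored preference slot.
    long payloadBytes() {
        return 8L + 12L * itemIDs.length;
    }

    public static void main(String[] args) {
        UserPrefsSketch prefs = new UserPrefsSketch(7L, 1);
        prefs.set(0, 101L, 4.5f);
        System.out.println(prefs.getItemID(0) + " -> " + prefs.getValue(0));
        System.out.println(prefs.payloadBytes() + " bytes of data");
    }
}
```

No per-preference object is ever allocated here, so the per-object header is paid for the three fields' containers only, not once per preference.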

The book "Mahout in Action" remarks that "Mahout has already reinvented an array of Java objects". PreferenceArray and its implementations cut memory overhead by nearly 75% (relative to Java collections and object arrays), a saving that is well worth the added complexity.

Besides PreferenceArray, Mahout makes heavy use of typical data structures such as maps and sets, but it does not use the common Java collection implementations such as HashMap and TreeSet directly. Instead, Mahout provides two specialized implementations, FastByIDMap and FastIDSet, whose main purpose is to reduce memory overhead and improve performance. They differ from the standard collections mainly as follows:

· Like HashMap, FastByIDMap is hash-based. However, FastByIDMap resolves hash collisions with linear probing rather than separate chaining;

· The keys of FastByIDMap and the members of FastIDSet are long primitives rather than Objects, which reduces memory overhead and improves performance;

· FastByIDMap can behave like a cache: it has a notion of "maximum size", and when adding a new element would exceed this size, infrequently used elements are removed.

The storage savings of FastByIDMap and FastIDSet are significant. Each member of a FastIDSet occupies about 14 bytes on average, while HashSet requires 84 bytes; each entry of a FastByIDMap occupies 28 bytes, while HashMap requires 84 bytes.
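To make the first two points concrete, here is a minimal sketch of a hash map with primitive long keys and linear probing. It is not FastByIDMap's code: the class name LongKeyMapSketch, the use of Long.MIN_VALUE as an "empty slot" marker, and the growth policy are all assumptions chosen for brevity, and removal and the maximum-size behaviour are left out.

```java
import java.util.Arrays;

// Illustrative open-addressing map with long keys; not Mahout's FastByIDMap.
final class LongKeyMapSketch<V> {
    private static final long EMPTY = Long.MIN_VALUE; // assumed sentinel; cannot be used as a key
    private long[] keys;      // primitive keys: no boxed Long objects, no entry objects
    private Object[] values;
    private int size;

    LongKeyMapSketch(int capacity) {
        int n = Math.max(2, capacity);
        keys = new long[n];
        values = new Object[n];
        Arrays.fill(keys, EMPTY);
    }

    // Start at the hashed slot and step forward until the key or an empty slot is found.
    private int findSlot(long key) {
        int i = (int) ((key & Long.MAX_VALUE) % keys.length);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length; // linear probe to the next slot
        }
        return i;
    }

    void put(long key, V value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full so probing always terminates
        }
        int i = findSlot(key);
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int i = findSlot(key);
        return keys[i] == key ? (V) values[i] : null;
    }

    int size() { return size; }

    // Double the table and re-insert every existing entry.
    private void grow() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new Object[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) {
                int i = findSlot(oldKeys[j]);
                keys[i] = oldKeys[j];
                values[i] = oldValues[j];
            }
        }
    }

    public static void main(String[] args) {
        LongKeyMapSketch<String> map = new LongKeyMapSketch<>(4);
        map.put(1L, "a");
        map.put(5L, "b"); // collides with key 1 in a table of 4 and probes to the next slot
        System.out.println(map.get(5L));
    }
}
```

With chaining, each entry would be a separate node object carrying its own header and references; here a collision only moves the entry to a neighbouring array slot.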

DataModel

The input that the Mahout recommendation engine actually accepts is a DataModel, a compact representation of user preference data. Concrete DataModel implementations can extract user preferences from any type of data source, and they make it easy to return the list and count of user IDs associated with an item, as well as the total number of users and items in the input data. The implementations include the in-memory GenericDataModel, the file-backed FileDataModel, and the database-backed JDBCDataModel.

GenericDataModel is the in-memory implementation of DataModel. It is suited to building recommendation data in memory: it simply accepts user preference data as the recommender's input and stores it as PreferenceArrays keyed by user ID and by item ID. Each PreferenceArray holds all of the preferences for that user ID or item ID.

FileDataModel reads from files. Mahout places few strict requirements on the file format; the file only needs to satisfy the following:

· Each line contains a user ID, an item ID, and a preference value

· Fields are comma-separated or tab-separated

· *.zip and *.gz files are decompressed automatically (Mahout recommends storing the data compressed when the volume is large)

FileDataModel reads the data from the file and then loads it into memory as a GenericDataModel; see the buildModel method in FileDataModel.
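As an illustration of the line format listed above, the following sketch parses one line into its three fields. It is not FileDataModel's parser: the class PrefLineSketch is hypothetical, and it handles only the plain three-field form with a comma or tab delimiter.

```java
// Hypothetical parser for one "userID,itemID,value" line; not Mahout's implementation.
final class PrefLineSketch {
    final long userID;
    final long itemID;
    final float value;

    private PrefLineSketch(long userID, long itemID, float value) {
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
    }

    static PrefLineSketch parse(String line) {
        // Split on either a comma or a tab.
        String[] parts = line.trim().split("[,\t]");
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected userID, itemID, value: " + line);
        }
        return new PrefLineSketch(
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim()),
                Float.parseFloat(parts[2].trim()));
    }

    public static void main(String[] args) {
        PrefLineSketch p = parse("1,101,5.0");
        System.out.println(p.userID + " " + p.itemID + " " + p.value);
    }
}
```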

JDBCDataModel supports reading from a database. Mahout provides default MySQL support through MySQLJDBCDataModel, which has the following requirements for how user preference data is stored:

· The user ID column must be BIGINT and non-null

· The item ID column must be BIGINT and non-null

· The preference column must be FLOAT

· Indexing the user ID and item ID columns is recommended

Sometimes we ignore the preference values and care only about whether a relationship exists between a user and an item. Mahout calls this kind of association a "Boolean preference". The name reflects the fact that the association between user and item either exists or does not exist. Keep in mind that a missing association only means that no relationship is recorded; it does not express like or dislike. A yes/no preference would be something different, because it has three states: like, dislike, and no relationship.

When the preference values contain a lot of noise, this kind of preference is useful. Mahout provides an in-memory DataModel for Boolean preferences, GenericBooleanPrefDataModel.

GenericBooleanPrefDataModel does not store preference values; it stores only the associated user IDs and item IDs. Note the difference from GenericDataModel: GenericBooleanPrefDataModel uses FastIDSet and keeps only the associated IDs, with no preference values. As a result, some of its methods (inherited from DataModel), such as getItemIDsFromUser(), run faster, while getPreferencesFromUser() runs slower. Because GenericBooleanPrefDataModel stores no preference values, it reports a default value of 1.0 for every item.
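The idea of storing only associations and answering 1.0 for any existing one can be sketched as below. This is a simplified illustration, not GenericBooleanPrefDataModel itself: the class BooleanPrefsSketch and its method names are hypothetical, and it uses ordinary java.util collections instead of FastIDSet so that it stays self-contained.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative Boolean-preference store; not Mahout's GenericBooleanPrefDataModel.
final class BooleanPrefsSketch {
    // Only associations are kept: user ID -> set of item IDs. No values are stored.
    private final Map<Long, Set<Long>> itemsByUser = new HashMap<>();

    void addAssociation(long userID, long itemID) {
        itemsByUser.computeIfAbsent(userID, k -> new HashSet<>()).add(itemID);
    }

    // Cheap: the stored set is returned as it is.
    Set<Long> getItemIDs(long userID) {
        return itemsByUser.getOrDefault(userID, Collections.emptySet());
    }

    // An existing association is reported as 1.0; a missing one has no value at all.
    Float getValue(long userID, long itemID) {
        return getItemIDs(userID).contains(itemID) ? Float.valueOf(1.0f) : null;
    }

    public static void main(String[] args) {
        BooleanPrefsSketch model = new BooleanPrefsSketch();
        model.addAssociation(1L, 101L);
        System.out.println(model.getValue(1L, 101L));
        System.out.println(model.getValue(1L, 102L));
    }
}
```

Returning the item IDs is a direct lookup, whereas producing full preference objects from such a store would mean constructing them on demand, which is consistent with the speed difference described above.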

