Software versions:
Hadoop 2.6.0; Mahout 1.0 (self-compiled; only two of its jar files are used); Spring 4.0.2; Struts 2.3; Hibernate 4.3; jQuery EasyUI 1.3.6; MySQL 5.6; browser: Chrome; MyEclipse 10.0.
Hadoop platform configuration:
Node1: namenode / resourcemanager / datanode / nodemanager, memory: 2 GB
Node2: nodemanager / datanode / secondarynamenode / jobhistoryserver, memory: 1.5 GB
Node3: nodemanager / datanode, memory: 1.5 GB
Code download: (to be uploaded tomorrow; the upload is too slow tonight)
System operation:
(The Ratings.dat and gender.dat files need to be downloaded from http://www.occamslab.com/petricek/data/.)
1. Database section:
1) Edit the configuration/db.properties file and set your own username, password, and database name;
2) Start Tomcat and let Hibernate automatically generate the related tables. After startup, the following tables can be seen in the database:
Copy the statement on line 49 of Sql/treenode.sql to the MySQL command line, then point the browser at the Tomcat-deployed project, such as http://localhost:80/rec/ (or http://localhost/rec/basic.jsp), to reach the project home page, which looks like this:
3) Download the relevant data files; after decompression you get the following files:
Modify the paths of the corresponding files in the configuration/Util.properties file, and adjust Rating.proportion, the ratio by which ratings.dat is split. Running the collaborative filtering algorithm on the cloud platform is time-consuming and memory-hungry, and using the full data set (245 MB) makes the CF job very slow. One workaround is to upload only part of the data: split Ratings.dat first, then upload the smaller part to the cloud platform for computation. If your cloud platform has sufficient computing resources, you can set the proportion to 1 and upload the larger file.
4) Initialize the user basic information table (i.e., import the Gender.dat data into the database).
Select "User table initialization" in the browser navigation list, or click "Initialize User table" in the system settings at the top right, to import the user table data. When the insert completes, the browser shows a prompt:
Tomcat logs the insertion time and record count; querying the T_user table in the database returns the same number of records.
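The import step above boils down to parsing Gender.dat into user records. A minimal sketch, assuming each line has the form "userID,gender" (the User class here is a simplified stand-in, not the project's actual Hibernate entity):

```java
import java.util.ArrayList;
import java.util.List;

public class GenderImport {
    /** Simple holder standing in for the project's T_user Hibernate entity. */
    static class User {
        final long id;
        final String gender;
        User(long id, String gender) { this.id = id; this.gender = gender; }
    }

    /** Parse gender.dat-style lines ("userID,gender") into User records, skipping malformed lines. */
    static List<User> parse(List<String> lines) {
        List<User> users = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            if (parts.length != 2) continue; // skip malformed input
            users.add(new User(Long.parseLong(parts[0].trim()), parts[1].trim()));
        }
        return users;
    }

    public static void main(String[] args) {
        List<User> users = parse(List.of("1,F", "2,M", "bad line"));
        System.out.println(users.size()); // only well-formed lines survive
    }
}
```

In the real project the resulting entities would be saved through Hibernate in batches; the parsing logic is the part shown here.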
5) Insert the user rating data (i.e., import the Ratings.dat data into the database).
Because this data set is large, Hibernate is not used for insertion; the data is inserted directly from the command line. Copy line 4 of Sql/import_ratings.sql (the Ratings.dat file location needs to be modified) into the MySQL editor (or run it in a DOS command window); this generally takes about 300 s. After the insert completes, the t_user_item_pref table should contain 17,359,346 records.
Since later queries filter by userID, the command on line 16, which builds an index, should also be run; it takes about 4 minutes.
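The two SQL operations described above, bulk load plus index, can be sketched as follows. The table and column names below are hypothetical illustrations; the authoritative statements are lines 4 and 16 of Sql/import_ratings.sql:

```java
public class RatingsImportSql {
    /** Build a LOAD DATA statement for bulk-loading ratings.dat
     *  (hypothetical column names; the real statement is line 4 of Sql/import_ratings.sql). */
    static String loadDataSql(String path) {
        return "LOAD DATA INFILE '" + path + "' "
             + "INTO TABLE t_user_item_pref "
             + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
             + "(user_id, item_id, pref)";
    }

    /** Index on the user id column, since later queries filter by userID
     *  (the real statement is line 16 of the script). */
    static String createIndexSql() {
        return "CREATE INDEX idx_user_id ON t_user_item_pref (user_id)";
    }

    public static void main(String[] args) {
        System.out.println(loadDataSql("/data/ratings.dat"));
        System.out.println(createIndexSql());
    }
}
```

Both strings could be executed through a plain JDBC `Statement`; LOAD DATA bypasses the ORM entirely, which is why it is so much faster than row-by-row Hibernate inserts.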
2. Cloud Platform Data section
1) Split the Ratings.dat file
Click "Split rating Data" in the top right corner of the browser to split the data; the following prompt is shown when the split completes:
The backend log also prints "data segmentation complete!"
The split data is placed in the same folder as the original Ratings.dat, in two parts: "big" for the larger part and "small" for the smaller part.
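The split step above can be sketched as follows, a simplified stand-in for the project's splitter. The proportion parameter corresponds to Rating.proportion, and the ".small"/".big" suffixes are assumed names mirroring the description:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RatingsSplitter {
    /**
     * Split the lines of a ratings file: the first (proportion) fraction goes to
     * the "small" part that is uploaded to the cluster, the rest to the "big" part.
     * Both outputs are written next to the input file.
     */
    static void split(Path input, double proportion) throws IOException {
        List<String> lines = Files.readAllLines(input);
        int cut = (int) (lines.size() * proportion);
        Files.write(input.resolveSibling(input.getFileName() + ".small"),
                lines.subList(0, cut));
        Files.write(input.resolveSibling(input.getFileName() + ".big"),
                lines.subList(cut, lines.size()));
    }
}
```

With proportion = 1 the "small" file simply contains everything, matching the advice above for clusters with enough resources.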
2) Cloud platform verification:
Click "Cloud Platform Verification" in the navigation. In the tab page, enter the namenode and ResourceManager IPs and ports in the corresponding fields (since this machine has IP-to-hostname mappings configured, hostnames also work). On success, a prompt is shown:
3) Upload the Ratings.dat file to the cloud platform
Click "Data Upload" in the navigation. In the tab that appears, select the smaller ("small") part that was just split and click Upload to send the data to the cloud platform.
3. Cloud Platform Algorithms section
0) Copy the three jar packages under the WebRoot/WEB-INF/lib directory to all nodes in the cluster:
1) Top algorithm: computes the items with the highest average scores across all ratings; set the parameters on the page, for example:
Since the cluster has three nodes, the number of reducers is set to 3 here. The minimum-score setting is the minimum number of raters: if an item has only a single rating and it is 10 (the highest score), its average would be 10, which is clearly not representative, so this threshold excludes such unreasonable values.
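The thresholding idea behind the Top algorithm can be sketched in plain Java, outside MapReduce (item IDs and the 1-10 scale here follow the description above; names are illustrative, not the project's actual classes):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopAverages {
    /** One (itemId, score) rating pair. */
    record Rating(long itemId, int score) {}

    /**
     * Average score per item, keeping only items with at least minRaters ratings,
     * sorted descending and truncated to n entries. The minRaters filter is what
     * stops a single 10-point rating from producing a "perfect" average.
     */
    static List<Map.Entry<Long, Double>> top(List<Rating> ratings, int minRaters, int n) {
        Map<Long, List<Rating>> byItem = ratings.stream()
                .collect(Collectors.groupingBy(Rating::itemId));
        return byItem.entrySet().stream()
                .filter(e -> e.getValue().size() >= minRaters)
                .map(e -> Map.entry(e.getKey(),
                        e.getValue().stream().mapToInt(Rating::score).average().orElse(0)))
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

In the MapReduce version the grouping happens in the shuffle (item ID as key) and the averaging plus threshold check in the reducers, which is why the reducer count is a tunable parameter.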
After setting the parameters, click Submit; a prompt indicates that the task has been submitted to the cloud platform. Once submission completes, you can open the algorithm monitoring interface, as follows:
When the task finishes running, the top data is parsed and stored in the database:
After storage succeeds, 300 records can be queried in the t_rec_top table.
2) Invoke the collaborative filtering algorithm
Click "Collaborative filtering Algorithm" in the navigation, set the parameters in the pop-up tab page, and click Submit. This works like the Top algorithm, but its monitoring interface differs, generally as follows (this algorithm runs considerably longer):
Run Procedure 1:
Run Procedure 2:
Run Procedure 3:
The CF result analysis mainly updates the users' recommendation data.
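Updating the recommendation data amounts to parsing Mahout RecommenderJob text output, whose lines have the form `userID<TAB>[itemID1:score1,itemID2:score2,...]`, into per-user records. A sketch (the Rec holder is a hypothetical stand-in for the project's entity):

```java
import java.util.ArrayList;
import java.util.List;

public class CfResultParser {
    /** Hypothetical holder for one recommendation row destined for the database. */
    record Rec(long userId, long itemId, double score) {}

    /** Parse one RecommenderJob output line: "userID\t[item:score,item:score,...]". */
    static List<Rec> parseLine(String line) {
        List<Rec> recs = new ArrayList<>();
        String[] parts = line.split("\t");
        long userId = Long.parseLong(parts[0].trim());
        String body = parts[1].trim();
        body = body.substring(1, body.length() - 1); // strip "[" and "]"
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            recs.add(new Rec(userId, Long.parseLong(kv[0]), Double.parseDouble(kv[1])));
        }
        return recs;
    }
}
```

The parsed rows can then be batch-inserted into the recommendation table that the single-user recommendation page reads from.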
3) Invoke single-user recommendation
The single-user recommendation displays the recommendation results computed on the cloud platform, such as the following data:
In the single-user recommendation interface you can see:
4) New-user recommendation interface:
Note that before invoking the new-user recommendation algorithm, the CF algorithm must be run first, because the input of the new-user recommendation algorithm is data generated by an intermediate step of the CF algorithm.
The new-user recommendation first lets you select the gender of the top users to display:
Once confirmed, the top users are displayed and can be rated (this is only a blind rating, since no additional information is available).
The default rating is 5; it can be edited. Save after editing, then click Recommend. Clicking Recommend, as with the Top algorithm, submits the new-user recommendation task to the cloud platform; the monitoring interface then shows:
After the recommendation finishes, a query returns the recommendation list (temporarily limited to 5 entries):
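The reason the CF run must come first is that the new-user path reuses the item-item similarities produced mid-pipeline: given the new user's blind ratings, a candidate's score can be predicted as a similarity-weighted average. A simplified illustration, not the project's exact code (the "a:b" pair-key encoding is a hypothetical choice):

```java
import java.util.Map;

public class NewUserRec {
    /**
     * Predict a score for candidateItem from a new user's ratings and an
     * item-item similarity table keyed "smallerId:largerId" (hypothetical
     * encoding): similarity-weighted average of the rated items' scores.
     */
    static double predict(Map<Long, Double> userRatings,
                          Map<String, Double> similarities, long candidateItem) {
        double num = 0, den = 0;
        for (Map.Entry<Long, Double> e : userRatings.entrySet()) {
            long a = Math.min(e.getKey(), candidateItem);
            long b = Math.max(e.getKey(), candidateItem);
            Double sim = similarities.get(a + ":" + b);
            if (sim == null) continue; // no similarity known for this pair
            num += sim * e.getValue();
            den += Math.abs(sim);
        }
        return den == 0 ? 0 : num / den;
    }
}
```

Ranking all candidate items by this predicted score and keeping the first 5 yields a list like the one the interface displays.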
4. Some code ideas and implementation
1) Cloud platform algorithm submission and monitoring
Tasks are submitted in a multi-threaded manner. First the jobId is initialized to null; a new thread then launches the cloud platform task, which sets the jobId. The main thread polls for the jobId for a certain amount of time; once it is obtained, the task is known to have been submitted successfully and a response is returned to the front end. After receiving the success response, the front end opens the monitoring interface and starts a timer that periodically requests the running status of the jobId's task and displays the returned result. When the task finishes, the timer is stopped and a prompt is shown.
The code is as follows:
public void cf() throws IOException {
    try {
        // initialize the jobId
        Utils.initialJobId();
        // submit the task
        new Thread(new CFThread(numRecommendations, similarityClassname,
                maxPrefPerUser, minPrefPerUser, maxSimilaritiesPerItem,
                maxPrefsPerUserInItemSimilarity,
                Integer.parseInt(reducerSize))).start();
        while (Utils.getJobId() == null) {
            Thread.sleep(500); // pause 500 ms until the jobId is initialized
        }
    } catch (Exception e) {
        Utils.stringToWriter(Utils.fialCheck);
        return;
    }
    Utils.stringToWriter(Utils.passCheck);
    // task submitted successfully; initialize the jobId list
    Utils.initialJobIdList(Utils.cfJobNum);
    return;
}
It is important to note that the method for obtaining the jobId was added to the JobSubmitter class; since jobId is a private variable there, a static getter method allows it to be read directly from outside.
At the same time, because the CF algorithm consists of multiple MapReduce jobs, the jobIds of the jobs to be run are pre-computed from the current jobId and the number of MR jobs, so each one does not need to be fetched again while its job runs.
2) Page monitoring, front end and back end:
Front end: a timed Ajax refresh reloads the DataGrid data.
function monitor_cf_refresh() {
    // console.info("monitor_cf_refresh()"); // do not enable, it fires repeatedly
    $.ajax({
        url : 'cloud/cloud_cfMonitor.action',
        // data : "id=" + row,
        dataType : "json",
        success : function(data) {
            if (data.finished == 'error') {
                // error while fetching task info: stop the timer and show a message
                // $.messager.alert('Prompt', 'Error getting task information, please check the background log!');
                clearInterval(monitor_cf_interval);
                $('#returnMsg_monitorcf').html('Get task info error!');
                // console.info(data);
            } else if (data.finished == 'true') {
                // all tasks finished: load the final data, then stop the timer
                $('#cfMonitorId').datagrid('loadData', data.rows);
                clearInterval(monitor_cf_interval);
                $('#returnMsg_monitorcf').html('Task run complete!');
                // ask the back end to parse the CF results and store them:
                // just call the cfResult2DB() function
                cfResult2DB();
            } else {
                // still running: update the DataGrid with the current job information
                $('#cfMonitorId').datagrid('loadData', data.rows);
            }
        }
    });
}
Back end: after obtaining the task information, a property named finished is appended to the returned task status data to indicate whether all tasks are complete. The code is as follows:
/**
 * Test the monitoring page
 * @throws IOException
 */
public void cfTmp() throws IOException {
    Utils.stringToWriter(Utils.passCheck);
    return;
}

/**
 * CF monitoring
 * @date 2015/2/27 23:56 algorithm monitoring complete, page not yet perfect
 * @throws IOException
 */
public void cfMonitor() throws IOException {
    Map<String, Object> jsonMap = new HashMap<String, Object>();
    List<CurrentJobInfo> currJobList = null;
    try {
        // log.info("cfMonitor...");
        currJobList = Utils.getJobList(Utils.cfJobNum);
        jsonMap.put("rows", currJobList); // put the data
        if (currJobList == null || currJobList.size() <= 0) {
            jsonMap.put("finished", "error");
            log.info("CF algorithm failed to run!");
        } else if (currJobList.size() == Utils.cfJobNum) {
            // when all jobs are done, a completion message must be sent; on
            // completion the data is parsed and stored via a front-end request to UserAction
            CurrentJobInfo lastJob = currJobList.get(Utils.cfJobNum - 1);
            if (Utils.isJobFinished(lastJob)) {
                jsonMap.put("finished", "true");
                log.info("CF algorithm run complete!");
            }
        } else {
            jsonMap.put("finished", "false");
        }
        // currJobList = Utils.getTmpJobList(Utils.cfJobNum); // test code
        // log.info(JSON.toJSONString(currJobList)); // print for inspection
        Utils.stringToWriter(JSON.toJSONString(jsonMap)); // transfer as JSON
    } catch (Exception e) {
        e.printStackTrace();
        Utils.stringToWriter(Utils.fialCheck);
        return;
    }
    return;
}
If you have any questions about the above, feel free to send an email to fansy1990@foxmail.com; exchanges are welcome...
Share, grow, be happy
Down-to-earth, focus
When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990
Mahout Case Study: A Dating Recommender System