In matrix decomposition, there are some common problems, that is, the matrix element is only 0 and 1, corresponding to the actual application scenario: the user clicks on news, purchase of certain items. The matrix decomposition Result Based on graphchi is not ideal. After researching the relevant literature, the Code mainly implements the plsa-based decomposition method. For details, see references later.
#! /Usr/local/bin/Python #-*-coding: UTF-8-*-import sysimport mathimport numpy as npimport stringimport random "reimplement algorithm of random which may be usefull in one class Matrix Factorization" Get matrix that all elements are 1 "def get_unit_matrix (row, COL): Return NP. ones (row, col), dtype = int) def norm_zero (array): Col, row = array. shape for C in xrange (COL): For R in xrange (ROW): If array [C] [r] <0.000001: array [C] [r] = 0.000001 return array "" X: binary data matrix, each column is an observation k: number of aspects (components) to estimate ITER: max number of iterations to do ufile: Save matrix result U vfile: Save martix result V "def run_bernoulli (x, K, ITER, ufile, vfile): x = NP. array (x) t, n = x. shape S = NP. array (NP. random. rand (k, n) s_sum = NP. add. reduce (s) s_sum_ext = NP. tile (s_sum, (k, 1) S = s/s_sum_ext A = NP. array (NP. random. rand (T, k) um_a = get_unit_matrix (T, k) a_temp = um_a-A um_x = get_unit_matrix (t, n) x_temp = um_x-x L = [] For I in xrange (ITER): # Step1 as = NP. DOT (A, S) # A * S as_norm = norm_zero (AS) # Max (EPS, A * S) x_as_norm = x/as_norm # (X. /MAX (EPS, A * S) First = NP. DOT (a.t, x_as_norm) # a' * (X. /MAX (EPS, A * S) S_1 = S * First # S. * (a' * (X. /MAX (EPS, A * S) as_temp = um_x-as # 1-A * s as_temp_norm = norm_zero (as_temp) # Max (EPS, 1-A * s) x_as_temp_norm = x_temp/as_temp_norm # (32a ). /MAX (EPS, 1-A * s) Second = NP. DOT (a_temp.t, x_as_temp_norm) # (1-A) '* (_x ). /MAX (EPS, 1-A * s) S_2 = S * Second # S. * (1-). /MAX (EPS, 1-A * s) S = S_1 + S_2 # S. * (a' * (X. /MAX (EPS, A * S) + S. * (1-). /MAX (EPS, 1-A * s) s_sum = NP. add. reduce (s) s_sum_ext = NP. tile (s_sum, (k, 1) S = s/s_sum_ext # step2 as = NP. DOT (A, S) # A * S as_norm = norm_zero (AS) # Max (EPS, A * S) x_as_norm = x/as_norm # (X. /MAX (EPS, A * S) A = A * NP. DOT (x_as_norm, s.t) # Step3 a_temp_s = NP. DOT (a_temp, S) # A1 * s a_temp_s_norm = norm_zero (a_temp_s) # Max (EPS, A1 * s) a_temp_s_norm_x = x_temp/a_temp_s_norm # (32a ). /MAX (EPS, A1 * s) a_temp = a_temp * (NP. DOT (a_temp_s_norm_x, s.t) # A1. * (416x ). /MAX (EPS, A1 * s) * s') A = A/(a + a_temp) #. /(a + A1) a_temp = um_a-A # 1-A # step4 compute loss function as = NP. DOT (A, S) # A * S as_norm = norm_zero (AS) # Max (EPS, A * S) as_norm = NP. log (as_norm) # log (max (EPS, A * S) loss_first = x * as_norm # X. * log (max (EPS, A * S) as_temp = um_x-as # 1-A * s as_temp_norm = norm_zero (as_temp) # Max (EPS, 1-A * s) as_temp_norm = NP. log (as_temp_norm) # log (max (EPS, 1-A * s) loss_second = x_temp * as_temp_norm # (3o ). * log (max (EPS, 1-A * s) loss = loss_first + loss_second # X. * log (max (EPS, A * S) + (32a ). * log (max (EPS, 1-A * s) loss_sum = NP. add. reduce (loss) # sum (X. * log (max (EPS, A * S) + (32a ). * log (max (EPS, 1-A * s) loss = NP. sum (loss_sum)/(n * t) # sum (X. * log (max (EPS, A * S) + (32a ). * log (max (EPS, 1-A * s)/(n * t) L. append (loss) RMSE = estimate_rmse (X, as) if I> 1: If math. FABS (L [I]-l [I-1]) <0.00001: Break print "iter = % d, loss = % F, RMSE = % F" % (I, loss, RMSE) NP. savetxt (ufile, A) NP. savetxt (vfile, S)
References:
1. The aspect Bernoulli model: multiple causes of presences and absences (this article describes algorithms)
2. One-class Matrix completion with low-density factorizations