1. Background
I recently took part in a competition on personalized news recommendation. The task: given a user's browsing history, predict the next news item that user will browse. I spent a week writing an integrated system that can recommend news with one click, but the accuracy is not ideal, so I am posting it here in the hope of getting some advice. Word segmentation in the code is handled by the Jieba segmentation library. The data set and code are given below.
2. Data set
Each record has five fields, separated by tabs: the user ID, the news ID, the browse time, the headline, and the day of the current month (for example, 3 means the 3rd).
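As a minimal sketch of reading one such record, the snippet below splits a tab-separated line into the five fields. The `BrowseRecord` type, its field names, and the sample line are made up for illustration and are not taken from the project's code.

```python
from typing import NamedTuple


class BrowseRecord(NamedTuple):
    # Field names are illustrative, not the project's own.
    user_id: str
    news_id: str
    browse_time: str  # kept as text; the post does not specify the format
    headline: str
    day: int          # day of the current month, e.g. 3 for the 3rd


def parse_record(line: str) -> BrowseRecord:
    """Split one tab-separated line into the five fields described above."""
    user_id, news_id, browse_time, headline, day = line.rstrip("\n").split("\t")
    return BrowseRecord(user_id, news_id, browse_time, headline, int(day))


# Hypothetical sample line, not real data from the data set.
sample = "u1\tn42\t1000\tExample headline\t3\n"
record = parse_record(sample)
print(record.headline, record.day)
```

Keeping the IDs as strings avoids losing leading zeros, should any appear in the data.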
3. Code section
First, take a look at the demo diagram.
(1) Algorithm description
Let me explain the algorithm with a simple example. It is actually fairly simple; if anything is wrong, corrections are welcome. Suppose we have the following record:
5738936	100649879	1394550848	MH370 flight fake passport passenger identification (update)	11
That is, on the 11th, user 5738936 read the news item "MH370 flight fake passport passenger ...". Using Jieba, we extracted the hot words for the 11th, as follows.
The 3,113 anniversary of the loss of the United passports of the passport holders of the invisible passport of Kuala Lumpur
We find that two of the keywords, "flight" and "passport", appear in this news item. So we recommend to user 5738936 other news from the 11th in which "flight" or "passport" appears. We also post-process the recommendation set: for example, news that 5738936 has already read will not appear, and news with very low popularity will not appear.
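The matching step described above could look roughly like the sketch below. It takes the day's hot keywords as given (in the project they come from Jieba) and uses plain substring matching. The function name, the `day_news` dictionary layout, and the sample headlines are assumptions made for illustration, and treating a single shared keyword as enough for a match is also an assumption. The low-popularity filter is left out here.

```python
def recommend_for_user(last_headline, read_news_ids, day_news, hot_keywords):
    """Recommend same-day news sharing a hot keyword with the last-read headline.

    day_news: dict mapping news_id -> headline for one day (hypothetical layout).
    hot_keywords: that day's hot words, e.g. as extracted by Jieba.
    """
    # Hot keywords that occur in the headline the user last read.
    matched = [kw for kw in hot_keywords if kw in last_headline]
    recommendations = []
    for news_id, headline in day_news.items():
        if news_id in read_news_ids:
            continue  # never recommend news the user has already seen
        if any(kw in headline for kw in matched):
            recommendations.append(news_id)
    return recommendations


# Made-up sample data for one day.
day_news = {
    "n1": "flight search continues",
    "n2": "passport rules updated",
    "n3": "local weather report",
    "n4": "flight passengers with fake passport",
}
hot_keywords = ["flight", "passport"]
result = recommend_for_user(day_news["n4"], {"n4"}, day_news, hot_keywords)
print(result)  # ['n1', 'n2']
```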
(2) How to use
The whole system starts with one click, which makes it very convenient to use. First create a test folder, then create three new folders inside test; note that the names must match those in the diagram. Because news is time-sensitive, each day is computed separately, so each day's content is stored in its own file. The files under test, as shown in the diagram, are generated automatically. (The GitHub link below provides the complete test folder structure.)
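To illustrate the per-day separation, here is a small sketch that groups tab-separated records by their last field, the day of the month. It works in memory only; writing each group to its own file is omitted. The function name and sample records are hypothetical and not taken from the project.

```python
from collections import defaultdict


def split_by_day(lines):
    """Group tab-separated records by their last field, the day of the month."""
    by_day = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        day = int(line.rsplit("\t", 1)[1])
        by_day[day].append(line)
    return dict(by_day)


# Made-up records in the five-field layout.
records = [
    "u1\tn1\t1000\tHeadline A\t1",
    "u2\tn2\t1001\tHeadline B\t2",
    "u1\tn3\t1002\tHeadline C\t2",
]
groups = split_by_day(records)
print(sorted(groups))  # [1, 2]
```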
To use it, first set the path parameter of the test folder in global_param.py. Once everything is set, just find the main() function in the Wordsplite_test package and run the program.
Parameters in global_param:
number_jieba: controls the number of keywords extracted
number_day: the number of days to predict, counting from the first day
hot_rate: sets the popularity of the recommended news; the larger the value, the higher the popularity
(3) Code flow
First, let's look at main().
    import get_day_data
    import get_keywords
    import get_keynews
    import delete_repeat
    import get_hot_result
    import global_param


    def main():
        # Process each day in turn.
        for i in range(1, global_param.number_day):
            get_day_data.Transfordata(i)
            get_day_data.Transfordataset(i)
            get_keywords.get_keywords(i)
            get_keynews.get_keynews(i)
        # Post-process the accumulated results.
        delete_repeat.delete_repeat()
        get_hot_result.get_hot_result(global_param.hot_rate)


    main()
1. get_day_data.Transfordata(i): finds the news each user browsed last and stores it in the test/train_lastday_set directory.
2. get_day_data.Transfordataset(i): separates the news by day and stores it in the test/train_date_set1 directory.
3. get_keywords.get_keywords(i): calls the Jieba library to pick out each day's hottest keywords and stores them under test/key_words.
4. get_keynews.get_keynews(i): checks whether the news each user last browsed contains any of that day's hot keywords. If it does, other news from the same day containing those keywords is recommended. Looping over global_param.number_day days generates the test/result.txt file.
5. delete_repeat.delete_repeat(): removes duplicates from result.txt and generates test/result_no_repeat.txt.
6. get_hot_result.get_hot_result(global_param.hot_rate): the result_no_repeat.txt file generated above may recommend too many items per user, which hurts accuracy. This function limits the number, so that each user is only recommended candidates with relatively high popularity. The final result set is test/result_no_repeat_hot.txt.
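Steps 5 and 6 above can be sketched as follows. This is only an illustration under assumptions: recommendations are modeled as (user_id, news_id) pairs, popularity is modeled as a view count, and the `min_views` threshold is a hypothetical stand-in for hot_rate, whose exact meaning is not spelled out here. The function names do not come from the project.

```python
from collections import Counter


def remove_duplicates(pairs):
    """Drop repeated (user_id, news_id) pairs while keeping first-seen order."""
    return list(dict.fromkeys(pairs))


def keep_hot(pairs, view_counts, min_views):
    """Keep only recommendations whose news item has at least min_views views."""
    return [(user, news) for user, news in pairs if view_counts[news] >= min_views]


# Hypothetical browsing log: one news_id per view.
views = ["n1", "n1", "n1", "n2", "n3", "n3"]
view_counts = Counter(views)

# Made-up raw recommendations containing one duplicate.
raw = [("u1", "n1"), ("u1", "n2"), ("u1", "n1"), ("u2", "n3")]
unique = remove_duplicates(raw)
hot = keep_hot(unique, view_counts, min_views=2)
print(hot)  # [('u1', 'n1'), ('u2', 'n3')]
```

Raising the threshold trades coverage for precision: fewer items are recommended per user, but each one is more widely read.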
Note: the result.txt file under test must be emptied manually before each run; the other files are generated automatically and need no handling.
Project address: https://github.com/X-Brain/News-Recommend-System (the src folder holds the code; test holds the data and the folder structure)
If you have any suggestions, please leave a comment on the blog or open an issue on GitHub. I hope more people will take part and contribute.
/********************************
* This article comes from the blog "Bo Li Garvin"
* When reprinting, please credit the source: http://blog.csdn.net/buptgshengod
******************************************/
News Personalization recommendation System (Python)-(with source data set)