I recently worked on predicting whether investors would reinvest in a project. The customer data had to be cleaned before it could be used for modeling. The model I chose is xgboost, which needs all of its input data to be numeric.
The data structure is as follows:
What we need to do here is replace the 'yes' in the first column with 1, and replace the text labels in the fourth, seventh, and eighth columns with numbers.
The specific requirements are as follows:
Platform label replacement: 0 = NA; 1 = PC; 2 = WAP; 3 = iOS; 4 = Android.
Product label replacement: 0 = NA; 1 = Novice exclusive; 2 = Direct scatter; 3 = Regular treasure; 4 = Double collection plan; 5 = Novice label.
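The two mappings above can also be written down as lookup tables. The following is only a minimal sketch, not the code used in this post: the names `platform_codes`, `product_codes`, and `encode_labels` are made up for illustration, and the label spellings follow the replacement code further down.

```r
# Hypothetical lookup tables mirroring the mappings listed above.
platform_codes <- c("PC" = 1, "WAP" = 2, "iOS" = 3, "Android" = 4)
product_codes <- c("Novice exclusive" = 1, "Direct scatter" = 2,
                   "Regular treasure" = 3, "Double collection plan" = 4,
                   "Novice label" = 5)

# Map labels to codes; missing or unknown labels become 0.
encode_labels <- function(x, codes) {
  out <- unname(codes[as.character(x)])
  out[is.na(out)] <- 0
  out
}

# Toy input, not the real data.
platform <- c("PC", "iOS", NA, "Android", "WAP")
encoded_platform <- encode_labels(platform, platform_codes)
print(encoded_platform)
```

Note that in this sketch a misspelled label is silently coded as 0, the same as NA, so it is worth checking the distinct values of each column first.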
First we read the data:
hnjb <- read.csv('f:/rdata/hnjb/Investment user base information table 3.csv', na.strings = 'na', header = TRUE)
Then we convert every column to character type, which makes the values easy to replace:
hnjb[] <- lapply(hnjb, as.character)
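A likely reason this conversion helps: before R 4.0.0, `read.csv` turned text columns into factors by default, and a factor rejects any value that is not already one of its levels. The sketch below uses a toy vector (not the real data) to show the difference.

```r
# Toy column, not the real data.
platform_factor <- factor(c("PC", "WAP", "PC"))

# Assigning a value that is not an existing level produces NA
# (R also emits an "invalid factor level" warning, suppressed here).
as_factor <- platform_factor
suppressWarnings(as_factor[as_factor == "PC"] <- 1)
print(as_factor)

# After converting to character, the same assignment works as intended.
as_char <- as.character(platform_factor)
as_char[as_char == "PC"] <- 1
print(as_char)
```

On R 4.0.0 and later, text columns are read as character by default, so the conversion mainly affects the non-text columns.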
Now we can start replacing:
# Missing values become 0
hnjb[is.na(hnjb)] <- 0
# 'yes' in the first column becomes 1
hnjb[hnjb == 'yes'] <- 1
# Platform labels
hnjb[hnjb == 'PC'] <- 1
hnjb[hnjb == 'WAP'] <- 2
hnjb[hnjb == 'iOS'] <- 3
hnjb[hnjb == 'Android'] <- 4
# Product labels
hnjb[hnjb == 'Novice exclusive'] <- 1
hnjb[hnjb == 'Direct scatter'] <- 2
hnjb[hnjb == 'Regular treasure'] <- 3
hnjb[hnjb == 'Double collection plan'] <- 4
hnjb[hnjb == 'Novice label'] <- 5
The results are as follows:
Well, the character substitution is done.
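Because the comparisons are exact and case-sensitive, a label with different casing or stray spaces is left untouched. One way to check for leftovers is sketched below on a toy data frame; the column names are hypothetical.

```r
# Toy stand-in for the cleaned table; column names are hypothetical.
cleaned <- data.frame(
  reinvested = c("1", "0", "1"),
  platform   = c("1", "3", "Wap"),   # "Wap" was missed because of its casing
  stringsAsFactors = FALSE
)

# For each column, collect the distinct values that are not purely digits.
leftovers <- lapply(cleaned, function(col) unique(col[!grepl("^[0-9]+$", col)]))

# Keep only the columns that still contain such values.
leftovers <- leftovers[lengths(leftovers) > 0]
print(leftovers)
```

Columns that are not meant to hold integer codes, such as a time column or decimal amounts, will also show up in this list and can be ignored.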
Once these character values are converted to numeric, the data can be passed to xgboost for modeling. There is one problem, however: the time variable, after being converted to a string and then to numeric, becomes NA. I will cover a concrete solution in the next article.
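The behaviour described above can be reproduced on a toy data frame. This is only a sketch: the column names and the date format are assumptions, and the workaround at the end is one possible approach, not necessarily the solution from the follow-up article.

```r
# Toy data; the column names and the date format are assumptions.
toy <- data.frame(
  platform    = c("1", "3", "0"),
  invest_time = c("2016-05-01", "2016-05-03", "2016-06-10"),
  stringsAsFactors = FALSE
)

# Strings made only of digits convert cleanly.
toy$platform <- as.numeric(toy$platform)

# A date string is not a number, so as.numeric() returns NA
# (R also emits a coercion warning, suppressed here).
time_as_number <- suppressWarnings(as.numeric(toy$invest_time))
print(time_as_number)

# One possible workaround: parse as Date first, then take the number of
# days since 1970-01-01.
toy$invest_days <- as.numeric(as.Date(toy$invest_time, format = "%Y-%m-%d"))
print(toy)
```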