Most of the code I have written recently is R scripts, and R feels more and more powerful. I now use it for data analysis and some simulation.
Here I collect a few functions I use regularly.
1. Batch-replace values in a data frame
i. Replace all values equal to 0 with 100
res2$valuex[res2$valuex %in% 0] <- 100
ii. Replace NA with 0
res2$valuex[is.na(res2$valuex)] <- 0
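As a small illustration of the two replacements above, here is a sketch on a made-up data frame (the values are invented for the example; only the column name valuex comes from the snippet). Note that %in% returns FALSE rather than NA for missing values, so the NA entries are left alone by the first step.

```r
# Hypothetical sample data; only the column name follows the snippet above
res2 <- data.frame(valuex = c(0, 5, NA, 0))

# Step i: every 0 becomes 100 (NA is untouched because NA %in% 0 is FALSE)
res2$valuex[res2$valuex %in% 0] <- 100

# Step ii: every NA becomes 0
res2$valuex[is.na(res2$valuex)] <- 0

print(res2$valuex)
```

The order matters here: running step ii first would turn NA into 0, and step i would then turn those into 100 as well.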
2. CDF plot
The CDF (cumulative distribution function) is a good tool for clearly understanding the distribution of data.
showcdf <- function(data, field) {
  res_cdf <- ecdf(data)
  plot(res_cdf, main = paste('CDF of', field))
  # Mark the median, upper quartile, maximum (upper whisker), and twice the maximum (drop any of these as needed)
  summarydata <- boxplot.stats(data)$stats
  summarydata[6] <- summarydata[5] * 2
  for (index in 3:length(summarydata)) {
    tempv <- as.numeric(summarydata[index])
    r_value <- floor(res_cdf(tempv) * 10000) / 100
    lines(c(tempv, tempv), c(r_value / 100, 0), col = 'red', lwd = 2, lty = 3)
    label <- paste('<-', floor(tempv * 100) / 100, ': ', r_value, '%', sep = '')
    text(tempv, index * 0.15, label, cex = 0.8, adj = c(0, 1))
  }
}
Effect:
* The following statement returns the values at specific probabilities, for example:
y <- quantile(data, c(0.5, 0.99))
This extracts the 50% and 99% quantiles of the data.
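To make the relationship between ecdf and quantile concrete, here is a minimal sketch on made-up data (the vector 1 to 100 is an assumption chosen so the numbers are easy to check by hand). ecdf maps a value to a cumulative probability, and quantile goes the other way.

```r
# Hypothetical data: the integers 1..100
x <- 1:100

# ecdf() returns a function: value -> fraction of data <= value
res_cdf <- ecdf(x)
p50 <- res_cdf(50)

# quantile() goes from probability to value (default interpolation, type 7)
y <- quantile(x, c(0.5, 0.99))

print(p50)
print(y)
```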
3. Read data from MySQL
library('RMySQL')
readdatafrommysql <- function(tablename, targetdate) {
  drv <- dbDriver('MySQL')
  con <- dbConnect(drv, host = 'xxx.xxx.xxx.xxx', port = 3006, username = 'xx', password = 'xxxx', dbname = 'xxxx')
  sqlstatement <- paste("select * from", tablename)
  # Add a date filter only when a target date is given
  if (nchar(targetdate) > 0) {
    sqlstatement <- paste(sqlstatement, " where date='", targetdate, "'", sep = '')
  }
  print(sqlstatement)
  data <- dbGetQuery(con, sqlstatement)
  dbDisconnect(con)
  return(data)
}
* For SQLite or other databases, adapt the code accordingly.
4. Solving optimization problems
The optimization problem in chapter 3 of Head First Data Analysis can be solved with lp_solve. This requires installing the lp_solve package on the system and the R packages lpSolve and lpSolveAPI.
library(lpSolve)
f2.obj <- c(5, 4)
f2.con <- matrix(c(1, 0, 0, 1, 100, 125), nrow = 3, byrow = TRUE)
f2.dir <- c('<=', '<=', '<=')
f2.rhs <- c(400, 300, 50000)
lp('max', f2.obj, f2.con, f2.dir, f2.rhs)$solution
Reference: http://lpsolve.sourceforge.net/5.5/R.htm
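For a problem this small, the answer can be cross-checked without lpSolve. The sketch below is not how lp_solve works internally; it simply enumerates the corner points of the feasible region in base R, assuming (as lp does) that both variables are non-negative. The names A, b, obj, and best are made up for the example.

```r
# Constraints written as A %*% c(x, y) <= b; the last two rows encode x >= 0 and y >= 0
A <- rbind(c(1, 0), c(0, 1), c(100, 125), c(-1, 0), c(0, -1))
b <- c(400, 300, 50000, 0, 0)
obj <- c(5, 4)

best <- NULL
best_value <- -Inf
pairs <- combn(nrow(A), 2)
for (k in seq_len(ncol(pairs))) {
  idx <- pairs[, k]
  M <- A[idx, , drop = FALSE]
  if (abs(det(M)) < 1e-9) next            # parallel constraints never meet
  v <- solve(M, b[idx])                   # intersection of the two constraint lines
  if (all(A %*% v <= b + 1e-9)) {         # keep only feasible corners
    value <- sum(obj * v)
    if (value > best_value) {
      best_value <- value
      best <- v
    }
  }
}
print(best)        # best corner found
print(best_value)  # objective value at that corner
```

Since a linear program with a bounded feasible region attains its maximum at a corner, the best corner found this way should agree with the $solution returned by lp.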
5. Get parameters when executing from the command line
# main entry
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) {
  print("Wrong parameters, please specify the target date!", quote = F)
} else {
  callprocessfunction(args[1])
}
The script can then be run like this:
Rscript xxx.R 2014-01-13
6. Remove outliers using box plot statistics
removeoutdata <- function(data) {
  # boxplot.stats()$out holds the points beyond the whiskers
  result <- data[!data %in% boxplot.stats(data)$out]
  return(result)
}
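A quick usage sketch on made-up numbers shows what the function drops. The function is repeated so the block runs on its own; the sample vector is an assumption for illustration.

```r
removeoutdata <- function(data) {
  data[!data %in% boxplot.stats(data)$out]
}

# Hypothetical sample: five ordinary values and one extreme value
x <- c(1, 2, 3, 4, 5, 100)

outliers <- boxplot.stats(x)$out   # points beyond 1.5 * IQR from the hinges
cleaned <- removeoutdata(x)

print(outliers)
print(cleaned)
```

With the default coefficient of 1.5, a value counts as an outlier when it lies more than 1.5 times the interquartile range beyond the hinges, so only the extreme value is removed here.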
7. Filter data by string
filterdata <- function(data, url) {
  # url is treated as a regular expression and matched against the url column
  rows <- grep(url, data$url)
  return(data[c(rows), ])
}
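Here is a small usage sketch. The data frame, its URLs, and the ms column are invented for the example; the only assumption taken from the function is that the data has a url column.

```r
filterdata <- function(data, url) {
  rows <- grep(url, data$url)
  data[rows, ]
}

# Hypothetical request log
pages <- data.frame(
  url = c("http://a.example/index", "http://b.example/img/x.png", "http://a.example/img/y.png"),
  ms = c(120, 45, 60),
  stringsAsFactors = FALSE
)

# Keep only the rows whose url contains "/img/"
images <- filterdata(pages, "/img/")
print(images)
```

Because grep interprets the pattern as a regular expression, characters such as "." or "?" in a URL need escaping when they should be matched literally.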
8. Drawing with ggplot2
ggplot2 provides very powerful features. When a series of plots needs to be drawn many times, ggplot2 can often do it in a single basic statement, so it is well worth learning and using.
Here is a picture for reference:
9. Bar charts
drawbars <- function(data, xlab) {
  labels <- c("A", "B", "C", "D")
  maxvalue <- max(max(data$a), max(data$b), max(data$c), max(data$d))
  ylim <- c(0, maxvalue * 1.1)
  datax <- rbind(data$a, data$b, data$c, data$d)
  barplot(t(datax), beside = TRUE, col = terrain.colors(length(data$t0)), offset = 0, names.arg = labels, ylim = ylim, xlab = xlab)
  box()
}
Effect:
10. Clustering
datacluster <- function(data, col, clusternum) {
  require("fpc")
  require(cluster)
  z2 <- na.omit(data[, col])
  km <- kmeans(z2, clusternum)
  clusplot(data, km$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
}
Effect:
* Data visualization can help you analyze problems, such as analyzing the loading process:
11. Converting factor columns to numeric
When a data frame is loaded from a file, some columns may be factors, which cannot be converted to numeric directly. The following functions handle this:
asnumeric <- function(x) as.numeric(as.character(x))
factorsnumeric <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)], asnumeric))
These functions are simple to use:
data.x = asnumeric(data.x)
The key is to convert to a string first; only then does the conversion give the correct number.
* Before converting, if there are any abnormal values, such as NULL, remember to convert them or filter them out first.
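The reason for going through a string is that as.numeric applied to a factor returns the internal level codes, not the displayed values. A minimal sketch with a made-up factor:

```r
# Hypothetical factor column holding numbers as labels
f <- factor(c("10", "20", "10"))

codes <- as.numeric(f)                  # internal level codes
values <- as.numeric(as.character(f))   # the numbers actually shown

print(codes)
print(values)
```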
If the data contains commas, you can try this:
asnumeric2 <- function(x) as.numeric(gsub('![[:alnum:]]*[[:space:]]|[[:punct:]]', '', as.character(x)))
12. Operating on columns by name
Accessing values by field name makes code more flexible, as follows:
as.matrix(res[c('data')]) is equivalent to res$data
This usage avoids the problem that code written with column numbers does not adapt when the data layout changes. For example:
keys <- c('data_sum', 'data1', 'data2')
for (key in keys) {
  data[c(key)] <- asnumeric(as.matrix(data[c(key)])) # convert to numeric
  data[c(key)][is.na(data[c(key)]), 1] <- 0 # set all NA to 0
}
13. Observe the shape of the data distribution
datadistribution <- function(x, na.omit = F) {
  if (na.omit) {
    x <- x[!is.na(x)]
  }
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
Usage:
sapply(base_data[c('A', 'B')], datadistribution)
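As a rough guide to reading the output, here is a sketch on two invented vectors: a symmetric one, where the skew should be zero, and one with a long right tail, where it should be positive. The function is repeated so the block runs on its own.

```r
datadistribution <- function(x, na.omit = FALSE) {
  if (na.omit) {
    x <- x[!is.na(x)]
  }
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

# Hypothetical samples
sym <- datadistribution(c(1, 2, 3, 4, 5))          # symmetric around 3
right <- datadistribution(c(1, 1, 1, 2, 10))       # long right tail
with_na <- datadistribution(c(1, 2, NA, 3), na.omit = TRUE)

print(sym)
print(right)
print(with_na)
```

Without na.omit = TRUE, a single NA in the input makes every statistic except n come out as NA.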
14. Group count
The aggregate function handles grouped statistics nicely, but length cannot be used directly when you want a count of distinct values. Instead, define a custom function that counts only the unique values.
fun <- function(x) { return(length(unique(x))) }
res <- aggregate(values ~ groupby, data = data, FUN = fun)
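A small sketch on an invented data frame shows the difference between counting all values and counting unique values per group. The column names values and groupby follow the snippet; the data itself is made up.

```r
# Hypothetical data: group "a" has a repeated value
data <- data.frame(
  groupby = c("a", "a", "a", "b", "b"),
  values = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

fun <- function(x) length(unique(x))

res_all <- aggregate(values ~ groupby, data = data, FUN = length)  # counts every row
res_unique <- aggregate(values ~ groupby, data = data, FUN = fun)  # counts distinct values

print(res_all)
print(res_unique)
```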
Another handy option is the summarise function from the plyr package:
library(plyr)
sdata <- ddply(data, c('field2'), summarise, N = length(rt), mean = mean(rt), sd = sd(rt), se = sd / sqrt(N))
print (' Result of ddply function: ')
print (sdata)
15. String operations
String operations in R are often done using regular expressions. To remove the leading and trailing whitespace of a string:
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
Here is an example of using grep to find a string and strsplit to split it:
strval <- trim(temp[j])
if (length(grep('^max-age', strval)) > 0) {
  values <- strsplit(strval, '=')
  data$cache_max_age[i] <- as.numeric(values[[1]][2])
}
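The fragment above depends on surrounding variables (temp, j, i, data). As a self-contained sketch of the same idea, the following assumes the input is a Cache-Control style header string, which is a guess at the context; the header value and the variable names header and max_age are made up.

```r
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical header value; directives are separated by commas
header <- "public, max-age=3600, must-revalidate"
temp <- strsplit(header, ",")[[1]]

max_age <- NA
for (j in seq_along(temp)) {
  strval <- trim(temp[j])                        # drop the space left after each comma
  if (length(grep('^max-age', strval)) > 0) {
    values <- strsplit(strval, '=')
    max_age <- as.numeric(values[[1]][2])        # the part after "="
  }
}
print(max_age)
```

Trimming first matters because the pattern is anchored with ^, so a directive with a leading space would not match otherwise.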
16. Date conversion
The following converts a GMT date string to a POSIX time value in seconds:
datetonum <- function(x) as.numeric(as.POSIXct(strptime(trim(x), "%a, %d %b %Y %H:%M:%S GMT")))
The format string must match the string that is passed in.
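Two details can affect the result: strptime interprets the time in the local time zone unless tz is given, and %a and %b depend on the locale. The sketch below is a variant under those assumptions, not the original function: it passes tz = "GMT" explicitly and switches the time locale to "C" so English day and month names parse. The name datetonum_gmt and the sample string are made up.

```r
# Use the C locale so English names such as "Mon" and "Jan" are recognized
invisible(Sys.setlocale("LC_TIME", "C"))

trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Variant of the conversion with an explicit time zone (an assumption, see above)
datetonum_gmt <- function(x) {
  as.numeric(as.POSIXct(strptime(trim(x), "%a, %d %b %Y %H:%M:%S GMT", tz = "GMT")))
}

secs <- datetonum_gmt(" Mon, 13 Jan 2014 00:00:00 GMT ")
bad <- datetonum_gmt("2014-01-13")   # does not match the format, so the result is NA

print(secs)
print(bad)
```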