Several useful small R functions for productivity

Source: Internet
Author: User

Most of the code I have written recently has been R scripts, and R seems more and more powerful the more I use it. I now use it for data analysis and for some simulation.


Here I collect a few functions that I use regularly.

1. Batch-replace values in a data frame

i. Replace all values equal to 0 with 100

res2$valuex[res2$valuex %in% 0] <- 100

ii. Replace NA with 0

res2$valuex[is.na(res2$valuex)] <- 0
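
As a small illustration of the two replacements above, here is a sketch on a made-up data frame (the values are invented for the example). Note that %in% never returns NA, so the first statement leaves missing values untouched for the second one to handle.

```r
# Hypothetical sample data: two zeros and one missing value
res2 <- data.frame(valuex = c(0, 5, NA, 0, 12))

# Replace every 0 with 100; NA %in% 0 is FALSE, so NA is not touched here
res2$valuex[res2$valuex %in% 0] <- 100

# Replace the remaining NA with 0
res2$valuex[is.na(res2$valuex)] <- 0

print(res2$valuex)
```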


2. CDF curve

The CDF (cumulative distribution function) is a good tool for clearly understanding the distribution of data.


showcdf <- function(data, field) {

  # Empirical cumulative distribution function of the data
  res_cdf <- ecdf(data)

  plot(res_cdf, main = paste('CDF of', field))

  # Mark the median, the upper quartile, the upper whisker end (the maximum with
  # outliers excluded), and twice that value (drop any of these as needed)
  summarydata <- boxplot.stats(data)$stats

  summarydata[6] <- summarydata[5] * 2

  for (index in 3:length(summarydata)) {

    tempv <- as.numeric(summarydata[index])

    # Cumulative probability at tempv, as a percentage with two decimals
    r_value <- floor(res_cdf(tempv) * 10000) / 100

    lines(c(tempv, tempv), c(r_value / 100, 0), col = 'red', lwd = 2, lty = 3)

    label <- paste('<-', floor(tempv * 100) / 100, ': ', r_value, '%', sep = '')

    text(tempv, index * 0.15, label, cex = 0.8, adj = c(0, 1))

  }

}

Result: (figure omitted)


* The following statement returns the values at specific probabilities, for example:

  y <- quantile(data, c(0.5, 0.99))

This extracts the 50% and 99% quantiles of the data.
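
To make the relationship between ecdf and quantile concrete, here is a minimal sketch on an invented vector of the numbers 1 to 10. The function returned by ecdf maps a value to a cumulative probability, while quantile goes the other way (using R's default interpolation).

```r
# Hypothetical sample data
data <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# ecdf returns a function: value -> fraction of observations <= value
res_cdf <- ecdf(data)
p5 <- res_cdf(5)

# quantile goes the other way: probability -> value
y <- quantile(data, c(0.5, 0.99))

print(p5)
print(y)
```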


3. Read data from MySQL

library('RMySQL')

readdatafrommysql <- function(tablename, targetdate) {

  drv <- dbDriver('MySQL')

  con <- dbConnect(drv, host = 'xxx.xxx.xxx.xxx', port = 3006, username = 'xx', password = 'xxxx', dbname = 'xxxx')

  sqlstatement <- paste("select * from", tablename)

  # Add a date filter only when a target date is given
  if (nchar(targetdate) > 0) {

    sqlstatement <- paste(sqlstatement, " where date='", targetdate, "'", sep = '')

  }

  print(sqlstatement)

  data <- dbGetQuery(con, sqlstatement)

  dbDisconnect(con)

  return(data)

}

* For SQLite or other databases, adapt the code accordingly.
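
The query-building step can be tried without a database. The sketch below isolates it in a hypothetical helper, build_query (not part of the original function), using an invented table name. It assumes the inputs are trusted, since pasting strings into SQL is unsafe for user-supplied values.

```r
# Hypothetical helper: builds the same SQL text as the function above
build_query <- function(tablename, targetdate = "") {
  sqlstatement <- paste("select * from", tablename)
  if (nchar(targetdate) > 0) {
    # The leading space keeps "where" from running into the table name
    sqlstatement <- paste0(sqlstatement, " where date='", targetdate, "'")
  }
  sqlstatement
}

q1 <- build_query("visits")                # "visits" is an invented table name
q2 <- build_query("visits", "2014-01-13")
print(q1)
print(q2)
```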


4. Solving optimization problems

This is the optimization problem from Chapter 3 of Head First Data Analysis. Solving it requires installing the lp_solve package on the system and the R packages lpSolve and lpSolveAPI.

library(lpSolve)

# Objective: maximize 5*x + 4*y
f2.obj <- c(5, 4)

# Constraints: x <= 400, y <= 300, 100*x + 125*y <= 50000
f2.con <- matrix(c(1, 0, 0, 1, 100, 125), nrow = 3, byrow = TRUE)

f2.dir <- c('<=', '<=', '<=')

f2.rhs <- c(400, 300, 50000)

lp('max', f2.obj, f2.con, f2.dir, f2.rhs)$solution


Reference: http://lpsolve.sourceforge.net/5.5/R.htm
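
As a cross-check that does not need lpSolve, the same small problem can be solved by brute force in base R. This is only a sanity check for a problem this tiny, not how a linear programming solver works; it assumes integer quantities and uses the variable names x and y for illustration.

```r
# Enumerate every integer combination within the two simple bounds
grid <- expand.grid(x = 0:400, y = 0:300)

# Keep only the combinations that satisfy 100*x + 125*y <= 50000
feasible <- grid[100 * grid$x + 125 * grid$y <= 50000, ]

# Objective: 5*x + 4*y
feasible$profit <- 5 * feasible$x + 4 * feasible$y

best <- feasible[which.max(feasible$profit), ]
print(best)
```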


5. Get parameters when executing from the command line

# main entry

args <- commandArgs(trailingOnly = TRUE)

if (length(args) < 1) {

  print("Wrong parameters, please specify the target date!", quote = FALSE)

} else {

  callprocessfunction(args[1])

}


The script can then be run like this:

Rscript xxx.R 2014-01-13
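
Since the argument is expected to be a date, it may be worth validating it before processing. The following is a sketch with a hypothetical helper, parse_target_date, which is not part of the original script; it returns NA when the argument is missing or is not in YYYY-MM-DD form.

```r
# Hypothetical helper: turn the first argument into a Date, or NA if unusable
parse_target_date <- function(args) {
  if (length(args) < 1) {
    return(as.Date(NA))
  }
  # as.Date returns NA when the string does not match the format
  as.Date(args[1], format = "%Y-%m-%d")
}

ok <- parse_target_date(c("2014-01-13"))
missing_arg <- parse_target_date(character(0))
bad <- parse_target_date(c("oops"))
print(ok)
```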



6. Remove outliers using the box plot rule (boxplot)

removeoutdata <- function(data) {

  # boxplot.stats(data)$out holds the points beyond the whiskers
  result <- data[!data %in% boxplot.stats(data)$out]

  return(result)

}
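
A quick usage sketch on an invented vector in which one value lies far outside the rest:

```r
removeoutdata <- function(data) {
  data[!data %in% boxplot.stats(data)$out]
}

# Hypothetical sample: 100 lies far beyond the upper whisker
x <- c(1, 2, 3, 4, 5, 100)
cleaned <- removeoutdata(x)
print(cleaned)
```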


7. Filter data by string

filterdata <- function(data, url) {

  # Indices of the rows whose url column matches the pattern
  rows <- grep(url, data$url)

  return(data[c(rows), ])

}
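
A usage sketch with invented URLs. Keep in mind that grep treats the pattern as a regular expression, so characters such as "." have special meaning unless fixed = TRUE is passed to grep.

```r
filterdata <- function(data, url) {
  rows <- grep(url, data$url)
  data[rows, ]
}

# Hypothetical sample data
logs <- data.frame(
  url = c("http://a.example/x", "http://cdn.example/y", "http://cdn.example/z"),
  ms = c(120, 35, 40),
  stringsAsFactors = FALSE
)

cdn_rows <- filterdata(logs, "cdn")
print(cdn_rows)
```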


8. Plotting with ggplot2

ggplot2 is very powerful: a series of plots that would otherwise have to be drawn one call at a time can often be produced with a single basic statement. It is well worth learning and using.

Here is a picture for reference: (figure omitted)



9. Bar chart

drawbars <- function(data, xlab) {
  labels <- c("A", "B", "C", "D")

  # Leave 10% headroom above the tallest bar
  maxvalue <- max(max(data$a), max(data$b), max(data$c), max(data$d))
  ylim <- c(0, maxvalue * 1.1)

  datax <- rbind(data$a, data$b, data$c, data$d)
  barplot(t(datax), beside = TRUE, col = terrain.colors(length(data$t0)), offset = 0, names.arg = labels, ylim = ylim, xlab = xlab)
  box()
}

Result: (figure omitted)



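10. Clustering

datacluster <- function(data, col, clusternum) {
  require("fpc")
  require(cluster)

  # Drop rows with missing values in the selected columns
  z2 <- na.omit(data[, col])

  km <- kmeans(z2, clusternum)

  # Plot the same rows that were clustered, so the labels line up
  clusplot(z2, km$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
}

Result: (figure omitted)


* Data visualization can help you analyze problems, for example when analyzing a loading process: (figure omitted)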



11. Converting factor columns to numeric

In some data frames loaded from files, columns may be factors, which cannot be converted to numeric directly. The following functions are needed:

asnumeric <- function(x) as.numeric(as.character(x))

factorsnumeric <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor), drop = FALSE], asnumeric))

The first function is simple to use:

data.x <- asnumeric(data.x)

The key is to convert to a string first; calling as.numeric directly on a factor returns its internal level codes rather than the numbers shown.

* Before converting, if the column contains abnormal values such as NULL, convert them first or filter them out.

If the data contains commas, you can try this:

# Note: [[:punct:]] also matches "." and "-", so this suits integers with thousands separators
asnumeric2 <- function(x) as.numeric(gsub('![[:alnum:]]*[[:space:]]|[[:punct:]]', '', as.character(x)))
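
The difference between converting a factor directly and going through a string first is easy to see on a tiny example. The values below are invented, and the comma handling shown is a simpler variant that removes only commas.

```r
asnumeric <- function(x) as.numeric(as.character(x))

# Hypothetical factor column holding numbers as labels
f <- factor(c("10", "20", "10"))

codes <- as.numeric(f)   # internal level codes, not the labels
vals <- asnumeric(f)     # the numbers actually shown in the data

# Removing only commas before converting
with_commas <- as.numeric(gsub(",", "", c("1,234", "56,789")))

print(codes)
print(vals)
print(with_commas)
```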


12. Operating on columns by name

Accessing values by column name makes the code more flexible, as follows:

as.matrix(res[c('data')]) yields the same values as res$data (as a one-column matrix)

This avoids the problem that code addressing columns by number breaks when the layout of the data changes. For example:

keys <- c('data_sum', 'data1', 'data2')

for (key in keys) {
  data[c(key)] <- asnumeric(as.matrix(data[c(key)])) # convert to numeric
  data[c(key)][is.na(data[c(key)]), 1] <- 0 # set all NA values to 0
}
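
The same loop can be tried on a small invented data frame. This sketch uses the [[ ]] operator, an alternative spelling that returns the column itself instead of a one-column data frame; the column contents are made up, and warnings from unparseable values are suppressed on purpose.

```r
asnumeric <- function(x) as.numeric(as.character(x))

# Hypothetical data: one factor column and one character column
data <- data.frame(
  id = 1:3,
  data1 = factor(c("10", "x", "30")),
  data2 = c("1.5", "2.5", NA),
  stringsAsFactors = FALSE
)

keys <- c("data1", "data2")

for (key in keys) {
  # Values that cannot be parsed become NA (with a warning, suppressed here)
  data[[key]] <- suppressWarnings(asnumeric(data[[key]]))
  # Set all NA values to 0
  data[[key]][is.na(data[[key]])] <- 0
}

print(data)
```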


13. Examine the shape of a data distribution

datadistribution <- function(x, na.omit = FALSE) {
  if (na.omit) {
    x <- x[!is.na(x)]
  }

  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}

Usage:

sapply(base_data[c('A', 'B')], datadistribution)
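
A usage sketch on invented data: column A is symmetric, so its skew should be zero, while column B has one large value and should show a positive skew.

```r
datadistribution <- function(x, na.omit = FALSE) {
  if (na.omit) {
    x <- x[!is.na(x)]
  }
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

# Hypothetical data frame
base_data <- data.frame(A = c(1, 2, 3, 4, 5), B = c(1, 1, 1, 1, 10))

# One column of statistics per input column
stats <- sapply(base_data[c("A", "B")], datadistribution)
print(stats)
```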

14. Group count

The aggregate function handles many grouped statistics nicely, but length on its own counts every row in a group. To count only the distinct values, pass a custom function:

fun <- function(x) { return(length(unique(x))) }

res <- aggregate(values ~ groupby, data = data, FUN = fun)
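
Here is a minimal sketch of the distinct count on an invented data frame, with column names matching the formula above.

```r
# Count distinct values only
fun <- function(x) length(unique(x))

# Hypothetical data: group "a" has the values 1, 1, 2; group "b" has 3, 3
data <- data.frame(
  groupby = c("a", "a", "a", "b", "b"),
  values = c(1, 1, 2, 3, 3)
)

res <- aggregate(values ~ groupby, data = data, FUN = fun)
print(res)
```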


Another handy option is the summarise function of the plyr package:

library(plyr)
sdata <- ddply(data, c('Field2'), summarise, N = length(RT), mean = mean(RT), sd = sd(RT), se = sd / sqrt(N))
print('Result of ddply function:')
print(sdata)


15. String operations

String operations in R are often done with regular expressions. To remove leading and trailing whitespace from a string:

trim <- function(x) gsub("^\\s+|\\s+$", "", x)

Here is an example that uses grep to find a string and strsplit to split it:

strval <- trim(temp[j])
if (length(grep('^max-age', strval)) > 0) {
  values <- strsplit(strval, '=')
  data$cache_max_age[i] <- as.numeric(values[[1]][2])
}
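
The fragment above depends on variables from a larger script. A self-contained sketch of the same idea, assuming an invented Cache-Control style header string, might look like this:

```r
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical header value with stray whitespace
header <- "public,  max-age=3600 "
parts <- strsplit(header, ",")[[1]]

max_age <- NA
for (part in parts) {
  strval <- trim(part)
  # Only the directive that starts with "max-age" is of interest
  if (length(grep("^max-age", strval)) > 0) {
    values <- strsplit(strval, "=")
    max_age <- as.numeric(values[[1]][2])
  }
}
print(max_age)
```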


16. Date conversion

The following converts a GMT date string to a POSIX time value in seconds:

datetonum <- function(x) as.numeric(as.POSIXct(strptime(trim(x), "%a, %d %b %Y %H:%M:%S GMT", tz = "GMT")))

The format string must match the string that is passed in.
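
A usage sketch with an invented HTTP-style date string. It assumes an English locale, because %a and %b match weekday and month names in the current locale, and it passes tz = "GMT" so the result does not depend on the local time zone.

```r
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

datetonum <- function(x) {
  as.numeric(as.POSIXct(strptime(trim(x), "%a, %d %b %Y %H:%M:%S GMT", tz = "GMT")))
}

# Hypothetical input: midnight GMT on 13 January 2014
secs <- datetonum(" Mon, 13 Jan 2014 00:00:00 GMT ")
print(secs)

# Converting back shows the same instant
print(as.POSIXct(secs, origin = "1970-01-01", tz = "GMT"))
```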



