I. Dates, times, and string processing
Dates and times
Date: date class, holds year, month, and day
POSIXct: date-time class, accurate to the second, stored as a number
POSIXlt: date-time class, accurate to the second, stored as a list
Sys.Date(), date(), difftime(), ISOdate(), ISOdatetime()
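The difference between the number-based POSIXct and the list-based POSIXlt is easiest to see by stripping the class off each one. The following is a minimal sketch, assuming a fixed instant in UTC so the results do not depend on the local time zone; the variable names ct, lt, and d are only illustrative.

```r
# Fixed instant so the results are reproducible (UTC avoids local time zone effects)
ct <- as.POSIXct("2011-10-02 12:30:45", tz = "UTC")
lt <- as.POSIXlt(ct)

as.numeric(ct)      # POSIXct: one number, seconds since 1970-01-01 00:00:00 UTC
names(unclass(lt))  # POSIXlt: a list with components such as sec, min, hour, mday, mon, year

lt$year + 1900      # 2011 (years are stored as offsets from 1900)
lt$mon + 1          # 10   (months are stored as 0-11)
lt$mday             # 2

d <- as.Date(ct)    # drop the time of day, keep the calendar date
format(d)           # "2011-10-02"
```

POSIXct is usually the more convenient form for storage and arithmetic, while POSIXlt is handy when individual fields such as the month or hour are needed.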
    # Get the current date and time
    (d1 = Sys.Date())   # date
    (d3 = Sys.time())   # date and time (POSIXct)
    (d2 = date())       # date and time as a character string, e.g. "Fri Aug 20 11:11:00 1999"

    mydate = as.Date('2007-08-09')
    class(mydate)       # "Date"
    mode(mydate)        # "numeric"

    # Date to string
    as.character(mydate)

    birday = c('01/05/1986', '08/11/1976')
    dates = as.Date(birday, '%m/%d/%Y')   # vectorized: converts the whole vector (%Y because the years have four digits)
    dates

    # %d  day of the month (01~31)
    # %a  abbreviated weekday (Mon)
    # %A  full weekday (Monday)
    # %m  month (01~12)
    # %b  abbreviated month (Jan)
    # %B  full month (January)
    # %y  two-digit year
    # %Y  four-digit year
    # %H  hour
    # %M  minute
    # %S  second

    td = Sys.Date()
    format(td, format = '%B %d %Y')
    format(td, format = '%A,%a')
    format(Sys.time(), '%H:%M:%S')

    # Date to number
    as.integer(Sys.Date())             # days since 1970-01-01
    as.integer(as.Date('1970-1-1'))    # 0
    as.integer(as.Date('1970-1-2'))    # 1

    sdate = as.Date('2004-10-01')
    edate = as.Date('2010-10-22')
    days = edate - sdate
    days                               # subtracting two dates gives the difference in days

    ws = difftime(Sys.Date(), as.Date('1956-10-12'), units = 'weeks')   # the unit can be specified

    # ISOdate
    (d = ISOdate(2011, 10, 2)); class(d)   # the result of ISOdate is POSIXct
    as.Date(ISOdate(2011, 10, 2))          # convert the result to Date
    ISOdate(2011, 2, 30)                   # a date that does not exist gives NA

    # Batch conversion to dates
    years = c(2010, 2011, 2012, 2013, 2014, 2015)
    months = 1
    days = c(15, 20, 21, 19, 30, 3)
    as.Date(ISOdate(years, months, days))

    # Extract parts of a date-time
    p = as.POSIXlt(Sys.Date())
    p = as.POSIXlt(Sys.time())
    Sys.Date()
    Sys.time()
    p$year + 1900   # add 1900 to get the year
    p$mon + 1       # add 1 to get the month
    p$mday
    p$hour
    p$min
    p$sec
String processing
nchar(), length()
paste(), outer()
substr(), strsplit()
sub(), gsub(), grep(), regexpr(), gregexpr()
    # Strings
    x = 'hello\rwold\n'

    cat(x)      # woldo: in a terminal, \r moves the cursor back to the start of the line, so "wold" overwrites "hell"
    print(x)    # [1] "hello\rwold\n" (escapes are shown, not interpreted)

    # String length
    nchar(x)    # 11, the number of characters
    length(x)   # 1, the number of elements in the vector

    # Concatenation
    board = paste('b', 1:4, sep = '-')   # "b-1" "b-2" "b-3" "b-4"
    board

    mm = paste('mm', 1:3, sep = '-')     # "mm-1" "mm-2" "mm-3"
    mm

    outer(board, mm, paste, sep = ':')   # outer product of the two vectors
    #      [,1]       [,2]       [,3]
    # [1,] "b-1:mm-1" "b-1:mm-2" "b-1:mm-3"
    # [2,] "b-2:mm-1" "b-2:mm-2" "b-2:mm-3"
    # [3,] "b-3:mm-1" "b-3:mm-2" "b-3:mm-3"
    # [4,] "b-4:mm-1" "b-4:mm-2" "b-4:mm-3"

    # Splitting and extracting
    board
    substr(board, 3, 3)                  # substring: "1" "2" "3" "4"
    strsplit(board, '-', fixed = TRUE)   # split on '-'

    # Replacing
    sub('-', '.', board, fixed = TRUE)   # replace the given character (board itself is unchanged)
    board
    mm                                   # "mm-1" "mm-2" "mm-3"
    sub('m', 'p', mm)                    # replace the first match: "pm-1" "pm-2" "pm-3"
    gsub('m', 'p', mm)                   # replace every match:    "pp-1" "pp-2" "pp-3"

    # Searching
    mm = c(mm, 'mm4')                    # "mm-1" "mm-2" "mm-3" "mm4"
    mm
    grep('-', mm)                        # 1 2 3: elements 1, 2, and 3 contain '-'
    regexpr('-', mm)                     # position of the match in each element, -1 if not found
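gregexpr() is listed above but not demonstrated. The sketch below contrasts it with regexpr() on the same mm vector and shows regmatches() for pulling out the matched text; grepl() is an addition not mentioned in the original notes, and the variable names first and all_m are only illustrative.

```r
mm <- c("mm-1", "mm-2", "mm-3", "mm4")

# regexpr() reports only the first match in each element (-1 when there is none)
first <- regexpr("m", mm)
as.integer(first)                        # 1 1 1 1

# gregexpr() reports every match; the result is a list with one entry per element
all_m <- gregexpr("m", mm)
as.integer(all_m[[1]])                   # 1 2: "m" occurs at positions 1 and 2 of "mm-1"

# regmatches() extracts the matched text itself
regmatches(mm, regexpr("[0-9]+", mm))    # "1" "2" "3" "4"

# grepl() returns a logical vector, convenient for filtering
mm[grepl("-", mm, fixed = TRUE)]         # "mm-1" "mm-2" "mm-3"
```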
II. Data preprocessing
Ensuring data quality
Accuracy
Completeness
Consistency
Redundancy
Timeliness
...
1. Extracting valid data requires cooperation from business staff (a subjective judgment), backed by appropriate technical means.
2. Understand the data definitions and agree on a single interpretation of them.
...
Data integration: Consolidating multiple data sources
Data conversion:
Data cleansing: handling abnormal data and missing data
Data reduction: trimming the data down by rows and by columns
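As a rough illustration of the cleansing and reduction steps, the sketch below works on a small made-up data frame. The sales table, its values, and the 1.5 * IQR rule for flagging abnormal values are assumptions chosen for the example, not something prescribed by these notes.

```r
# Hypothetical data: one missing value and one suspiciously large value
sales <- data.frame(id = 1:6, amount = c(10, 12, NA, 11, 13, 500))

# Missing data: count it, then either drop the rows or fill in a value
colSums(is.na(sales))                        # number of missing values per column
complete <- na.omit(sales)                   # option 1: drop incomplete rows
filled <- sales
filled$amount[is.na(filled$amount)] <- median(sales$amount, na.rm = TRUE)  # option 2: impute the median

# Abnormal data: flag values more than 1.5 * IQR beyond the quartiles (the boxplot rule)
q <- unname(quantile(sales$amount, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(sales$amount) &
  (sales$amount < q[1] - fence | sales$amount > q[2] + fence)
sales[is_outlier, ]                          # the row with amount 500

# Data reduction: keep only the rows and columns that are needed
reduced <- sales[!is_outlier & !is.na(sales$amount), "amount", drop = FALSE]
```

Whether to drop, impute, or keep a flagged value is a business decision as much as a technical one, which is why the notes above stress agreeing on data definitions first.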
III. Data integration
Integrate data sets with merge()
    # Data integration
    # merge(), or plyr::join (package::function)
    (customer = data.frame(Id = c(1:6), state = c(rep("Beijing", 3), rep("Shanghai", 3))))
    (ol = data.frame(Id = c(1, 4, 6, 7), Product = c('IPhone', 'Vixo', 'mi', 'Note2')))

    merge(customer, ol, by = 'Id')                 # inner join
    merge(customer, ol, by = 'Id', all = TRUE)     # full join
    merge(customer, ol, by = 'Id', all.x = TRUE)   # left outer join: every row of the left table is kept
    merge(customer, ol, by = 'Id', all.y = TRUE)   # right outer join: every row of the right table is kept

    # Union with duplicates removed, when df1 and df2 have the same column names
    (df1 = data.frame(ID = seq(0, by = 3, length.out = 5), name = paste('Zhang', seq(0, by = 3, length.out = 5))))
    (df2 = data.frame(ID = seq(0, by = 4, length.out = 4), name = paste('Zhang', seq(0, by = 4, length.out = 4))))

    rbind(df1, df2)                    # plain union, duplicates kept

    merge(df1, df2, all = TRUE)        # union with duplicates removed; do not pass by

    merge(df1, df2, by = 'ID')         # columns with the same name are renamed (name.x, name.y)
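In practice the key columns often have different names in the two tables. The sketch below is one way to handle that with merge()'s by.x and by.y arguments, and to check beforehand which keys will fail to match. The orders table and its CustomerId column are hypothetical, renamed from the example above for illustration.

```r
customer <- data.frame(Id = 1:6, state = c(rep("Beijing", 3), rep("Shanghai", 3)))
orders <- data.frame(CustomerId = c(1, 4, 6, 7),
                     Product = c("IPhone", "Vixo", "mi", "Note2"))

# by.x / by.y name the key column in the left and right table respectively
inner <- merge(customer, orders, by.x = "Id", by.y = "CustomerId")
inner$Id                                   # 1 4 6

# Keys present on only one side explain the NA rows of an outer join
setdiff(orders$CustomerId, customer$Id)    # 7: an order with no matching customer
setdiff(customer$Id, orders$CustomerId)    # 2 3 5: customers with no order

left <- merge(customer, orders, by.x = "Id", by.y = "CustomerId", all.x = TRUE)
sum(is.na(left$Product))                   # 3
```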
IV. Data conversion
Attribute construction
Normalization (range scaling, standardization)
Discretization
Improving the distribution
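The four conversion techniques listed above can each be sketched in a line or two of base R. The data, the break points for discretization, and the choice of a log transform for improving the distribution are assumptions made for this example.

```r
x <- c(2, 4, 6, 8, 100)   # hypothetical values; the large one skews the distribution

# Attribute construction: derive a new column from existing ones
df <- data.frame(price = c(10, 20, 30), qty = c(3, 2, 1))
df$total <- df$price * df$qty

# Normalization by range (min-max): rescales to [0, 1]
minmax <- (x - min(x)) / (max(x) - min(x))

# Normalization by standardization (z-score): mean 0, standard deviation 1
z <- (x - mean(x)) / sd(x)

# Discretization: cut a numeric variable into labelled intervals
level <- cut(x, breaks = c(0, 5, 10, Inf), labels = c("low", "mid", "high"))
as.character(level)       # "low" "low" "mid" "mid" "high"

# Improving the distribution: a log transform compresses the long right tail
logx <- log(x)
```

Min-max scaling is sensitive to extreme values such as the 100 here, so standardization or a prior transform may be the better choice when outliers are present.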
R Language-Data preprocessing