R Language: Data Preprocessing

Source: Internet
Author: User

I. Date-time and string processing

Date

Date: date class, holding year, month and day

POSIXct: date-time class, accurate to the second, stored as a number

POSIXlt: date-time class, accurate to the second, stored as a list
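
As a minimal sketch of how the three classes differ internally, the following converts each one to its underlying representation. The example timestamp and the UTC time zone are illustrative choices, not part of the original notes.

```r
# Illustrative values; the timestamp and tz = "UTC" are assumptions for this sketch.
d  <- as.Date("1970-01-10")
ct <- as.POSIXct("1970-01-02 00:00:30", tz = "UTC")
lt <- as.POSIXlt(ct)

date_days  <- as.numeric(d)   # Date: days since 1970-01-01
ct_seconds <- as.numeric(ct)  # POSIXct: seconds since 1970-01-01 00:00:00 UTC

# POSIXlt: a list of fields; year is offset by 1900 and mon starts at 0.
lt_fields <- c(year = lt$year + 1900, mon = lt$mon + 1,
               mday = lt$mday, sec = lt$sec)

print(date_days)
print(ct_seconds)
print(lt_fields)
```

Because POSIXct is a single number per timestamp, it is usually the more convenient choice for data frame columns, while POSIXlt is handy when individual fields need to be read.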

Sys.Date(), date(), difftime(), ISOdate(), ISOdatetime()

    # Get the current date and time
    (d1 = Sys.Date())   # date
    (d3 = Sys.time())   # date and time
    (d2 = date())       # date and time as a string of the form "Fri Aug 20 11:11:00 1999"

    mydate = as.Date('2007-08-09')
    class(mydate)       # "Date"
    mode(mydate)        # "numeric"

    # Date to string
    as.character(mydate)

    birday = c('01/05/1986', '08/11/1976')
    dates = as.Date(birday, '%m/%d/%Y')   # vectorized: converts the whole vector
    dates

    # %d  day of the month (01~31)
    # %a  abbreviated weekday (Mon)
    # %A  full weekday (Monday)
    # %m  month (01~12)
    # %b  abbreviated month (Jan)
    # %B  full month (January)
    # %y  two-digit year
    # %Y  four-digit year
    # %H  hour
    # %M  minute
    # %S  second

    td = Sys.Date()
    format(td, format = '%B %d %Y')
    format(td, format = '%A, %a')
    format(Sys.time(), '%H:%M:%S')

    # Date to number
    as.integer(Sys.Date())               # days since 1970-01-01
    as.integer(as.Date('1970-1-1'))      # 0
    as.integer(as.Date('1970-1-2'))      # 1

    sdate = as.Date('2004-10-01')
    edate = as.Date('2010-10-22')
    days = edate - sdate
    days                                 # subtracting two dates gives the difference in days

    ws = difftime(Sys.Date(), as.Date('1956-10-12'), units = 'weeks')   # the unit can be specified

    # ISOdate
    (d = ISOdate(2011, 10, 2)); class(d) # the result of ISOdate is POSIXct
    as.Date(ISOdate(2011, 10, 2))        # convert the result to Date
    ISOdate(2011, 2, 30)                 # a nonexistent date gives NA

    # Batch conversion to dates
    years = c(2010, 2011, 2012, 2013, 2014, 2015)
    months = 1
    days = c(15, 20, 21, 19, 30, 3)
    as.Date(ISOdate(years, months, days))

    # Extract parts of a date-time
    p = as.POSIXlt(Sys.Date())
    p = as.POSIXlt(Sys.time())
    Sys.Date()
    Sys.time()
    p$year + 1900   # the year needs 1900 added
    p$mon + 1       # the month needs 1 added
    p$mday
    p$hour
    p$min
    p$sec

 

String processing

nchar(), length()

paste(), outer()
substr(), strsplit()
sub(), gsub(), grep(), regexpr(), gregexpr()

    # Strings
    x = 'hello\rwold\n'

    cat(x)      # shows "woldo": \r moves the cursor back to the start of the line,
                # so "wold" overwrites "hell"
    print(x)    # "hello\rwold\n"

    # String length
    nchar(x)    # number of characters in the string
    length(x)   # 1, the number of elements in the vector

    # Concatenation
    board = paste('b', 1:4, sep = '-')   # "b-1" "b-2" "b-3" "b-4"
    board

    mm = paste('mm', 1:3, sep = '-')     # "mm-1" "mm-2" "mm-3"
    mm

    outer(board, mm, paste, sep = ':')   # outer product of the two vectors
    #      [,1]       [,2]       [,3]
    # [1,] "b-1:mm-1" "b-1:mm-2" "b-1:mm-3"
    # [2,] "b-2:mm-1" "b-2:mm-2" "b-2:mm-3"
    # [3,] "b-3:mm-1" "b-3:mm-2" "b-3:mm-3"
    # [4,] "b-4:mm-1" "b-4:mm-2" "b-4:mm-3"

    # Split and extract
    board
    substr(board, 3, 3)                  # substring
    strsplit(board, '-', fixed = TRUE)   # split

    # Modify
    sub('-', '.', board, fixed = TRUE)   # replace the given character
    board
    mm                   # "mm-1" "mm-2" "mm-3"
    sub('m', 'p', mm)    # replace the first match: "pm-1" "pm-2" "pm-3"
    gsub('m', 'p', mm)   # replace all matches:     "pp-1" "pp-2" "pp-3"

    # Search
    mm = c(mm, 'mm4')    # "mm-1" "mm-2" "mm-3" "mm4"
    mm
    grep('-', mm)        # 1 2 3: elements 1, 2 and 3 of the vector contain '-'

    regexpr('-', mm)     # returns the match position, or -1 if there is no match
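
The function list for this part names gregexpr(), which the listing does not exercise. A small sketch, using a made-up input vector, of how it differs from regexpr(): regexpr() reports only the first match in each element, while gregexpr() reports every match.

```r
# Hypothetical input for illustration.
s <- c("a-b-c", "abc", "x-y")

# First match per element; -1 when there is no match.
first_pos <- as.integer(regexpr("-", s, fixed = TRUE))

# Every match per element, returned as a list with one entry per element.
all_pos <- lapply(gregexpr("-", s, fixed = TRUE), as.integer)

# regmatches() extracts the matched text; here, the pieces between dashes.
pieces <- regmatches(s, gregexpr("[^-]+", s))

print(first_pos)
print(all_pos)
print(pieces)
```
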
II. Data preprocessing

Ensure data quality

Accuracy
Integrity
Consistency
Redundancy
Timeliness
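
Some of the quality dimensions above can be checked mechanically. The sketch below is one possible set of checks on a made-up data frame; the column names and the valid age range of 0 to 120 are assumptions chosen for illustration.

```r
# Hypothetical data set; column names and the valid age range are assumptions.
orders <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  age    = c(25, 37, 37, 130, NA),
  amount = c(10.5, 20.0, 20.0, NA, 8.0)
)

# Integrity: count missing values in each column.
missing_per_column <- colSums(is.na(orders))

# Redundancy: count rows that exactly repeat an earlier row.
duplicate_rows <- sum(duplicated(orders))

# Accuracy: count ages outside the assumed plausible range.
out_of_range_age <- sum(orders$age < 0 | orders$age > 120, na.rm = TRUE)

print(missing_per_column)
print(duplicate_rows)
print(out_of_range_age)
```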

...

1. Extracting valid data requires cooperation from business staff (a subjective judgment), backed by technical checks.

2. Understand the data definitions, and agree on a single interpretation of them.

...

Data integration: consolidating multiple data sources
Data conversion:
Data cleansing: abnormal data, missing data
Data reduction: trimming rows and columns
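
As a rough sketch of the cleansing and reduction steps listed above, the following fills a missing value, flags an abnormal value, and then trims rows and columns. The data is made up, and both median imputation and the 1.5 * IQR rule are illustrative choices rather than methods prescribed by these notes.

```r
# Hypothetical data; median imputation and the 1.5 * IQR rule are assumptions.
df <- data.frame(
  id    = 1:8,
  value = c(10, 12, NA, 11, 13, 100, 12, 11),
  note  = c("a", "b", "c", "d", "e", "f", "g", "h")
)

# Cleansing: fill missing values with the median of the observed values.
med <- median(df$value, na.rm = TRUE)
df$value[is.na(df$value)] <- med

# Cleansing: flag values more than 1.5 * IQR beyond the quartiles as abnormal.
q   <- quantile(df$value, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
df$abnormal <- df$value < q[[1]] - 1.5 * iqr | df$value > q[[2]] + 1.5 * iqr

# Reduction: drop abnormal rows, and keep only the columns that are needed.
reduced <- df[!df$abnormal, c("id", "value")]
print(reduced)
```

Whether an abnormal value should be dropped, capped or kept depends on the business meaning of the column, so the final step here is only one option.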

III. Data integration

Data sets are integrated with merge().

    # Data integration
    # merge(); see also plyr::join (package::function)
    (customer = data.frame(Id = c(1:6), state = c(rep("Beijing", 3), rep("Shanghai", 3))))
    (ol = data.frame(Id = c(1, 4, 6, 7), Product = c('IPhone', 'Vixo', 'mi', 'Note2')))

    merge(customer, ol, by = 'Id')                 # inner join
    merge(customer, ol, by = 'Id', all = TRUE)     # full join
    merge(customer, ol, by = 'Id', all.x = TRUE)   # left outer join: every row of the left table is kept
    merge(customer, ol, by = 'Id', all.y = TRUE)   # right outer join: every row of the right table is kept

    # Union with duplicates removed, for df1 and df2 that share the same column names
    (df1 = data.frame(ID = seq(0, by = 3, length.out = 5), name = paste('Zhang', seq(0, by = 3, length.out = 5))))
    (df2 = data.frame(ID = seq(0, by = 4, length.out = 4), name = paste('Zhang', seq(0, by = 4, length.out = 4))))

    rbind(df1, df2)                 # stacks the rows, duplicates included

    merge(df1, df2, all = TRUE)     # removes duplicates; no `by` is given

    merge(df1, df2, by = 'ID')      # columns with the same name are renamed (name.x, name.y)
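
Two situations come up often when integrating real tables: the key columns have different names, and some rows fail to match. The sketch below shows one way to handle both with base R; the table and column names are hypothetical.

```r
# Hypothetical tables; the column names are assumptions for this sketch.
customer <- data.frame(cust_id = 1:6,
                       state = c(rep("Beijing", 3), rep("Shanghai", 3)))
orders   <- data.frame(customer = c(1, 4, 6, 7),
                       product = c("A", "B", "C", "D"))

# Keys with different names: by.x names the left key, by.y the right key.
joined <- merge(customer, orders, by.x = "cust_id", by.y = "customer")

# Rows that fail to match: customers without orders, orders without a customer.
no_orders   <- customer[!customer$cust_id %in% orders$customer, ]
no_customer <- orders[!orders$customer %in% customer$cust_id, ]

print(joined)
print(no_orders$cust_id)
print(no_customer$customer)
```

Listing the unmatched rows before choosing between an inner, left or full join makes it easier to see what each join type would keep or drop.
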
IV. Data conversion

Attribute construction
Normalization (range scaling, standardization)
Discretization
Improving the distribution
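
The notes stop at naming these conversions. As a minimal sketch, each of them could look as follows in base R; the income figures, the bin boundaries and labels, and the choice of a log transform are all assumptions made for illustration.

```r
# Hypothetical income values; bin boundaries and labels are assumptions.
income <- c(1000, 2000, 4000, 8000, 16000)

# Range (min-max) normalization to [0, 1].
minmax <- (income - min(income)) / (max(income) - min(income))

# Standardization (z-score): mean 0, standard deviation 1.
zscore <- as.numeric(scale(income))

# Discretization into labelled intervals (0, 2000], (2000, 8000], (8000, Inf).
level <- cut(income, breaks = c(0, 2000, 8000, Inf),
             labels = c("low", "mid", "high"))

# Improving the distribution: a log transform reduces right skew.
log_income <- log2(income / 1000)

# Attribute construction: derive a new column from existing ones.
df <- data.frame(income = income, members = c(1, 2, 2, 4, 4))
df$income_per_member <- df$income / df$members

print(data.frame(income, minmax, zscore, level, log_income))
print(df)
```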
