Coursera - Getting and Cleaning Data - Week 4
Thursday, January
Here are the notes for week four, plus a summary of the whole course.
Week four of the course focuses on text processing. It covers:
1. Handling variable names
2. Regular expressions
3. Date processing (see the swirl exercises for the lubridate package; a quick taste follows this list)
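Since dates are otherwise deferred to swirl, here is a minimal lubridate taste (my own illustrative example, not from the course):

```r
library(lubridate)
d <- ymd("2015-02-10")  # parse "year-month-day" text into a date
month(d)                # 2
wday(d, label = TRUE)   # the weekday (2015-02-10 was a Tuesday)
```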
First, the handling of variable names. It follows a few principles: 1) use a uniform case (tolower/toupper); 2) remove the special characters that importing or merging data introduces into variable names; 3) avoid duplicate names; 4) use abbreviations sparingly.
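A minimal sketch of these principles, assuming a hypothetical data frame `df` with messy imported column names:

```r
# Hypothetical example: clean up the column names of a data frame `df`
names(df) <- tolower(names(df))             # 1) uniform case
names(df) <- gsub("[._ -]", "", names(df))  # 2) strip special characters
names(df) <- make.unique(names(df))         # 3) ensure no duplicates
```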
The functions used include (a runnable demo of all of them follows the two tables below):

- Find and replace:

| Function | What it does | Example |
|----------|--------------|---------|
| `gsub` | replaces every occurrence of "A" with "B" in `data` | `gsub("A","B",data)` |
| `sub` | replaces only the first occurrence | `sub("A","B",data)` |
| `grep` | returns the positions where "A" appears in `data`; adding `value=TRUE` returns the matching values themselves | `grep("A",data)` |
| `grepl` | like `grep`, but returns logical TRUE/FALSE, much like `a == b` | `grepl("A",data)` |
- String calculations:

| Function | What it does | Example |
|----------|--------------|---------|
| `nchar` | counts how many characters are in a string | `nchar("text05")` returns 6 |
| `substr` | extracts the characters from position M to position N | `substr("data2014",2,4)` returns "ata" |
| `paste` | joins strings with a separator | `paste("ABC","BCD","DEF",sep="-")` returns "ABC-BCD-DEF" |
| `paste0` | joins strings with no separator in between | `paste0("ABC","BCD")` returns "ABCBCD" |
| `str_trim` | removes the whitespace at the ends of a string (stringr package) | `str_trim("  text  ")` returns "text" |
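Putting all of these together in one runnable snippet (outputs shown as `##` comments; note that `str_trim` comes from the stringr package):

```r
x <- c("data", "beta", "count")
grep("a", x)                 # positions of the elements containing "a"
## [1] 1 2
grep("a", x, value = TRUE)   # the matching values themselves
## [1] "data" "beta"
grepl("a", x)                # logical version
## [1]  TRUE  TRUE FALSE
sub("a", "b", "banana")      # replace the first occurrence only
## [1] "bbnana"
gsub("a", "b", "banana")     # replace every occurrence
## [1] "bbnbnb"
nchar("text05")
## [1] 6
substr("data2014", 2, 4)
## [1] "ata"
paste("ABC", "BCD", "DEF", sep = "-")
## [1] "ABC-BCD-DEF"
paste0("ABC", "BCD")         # paste0 joins with no separator
## [1] "ABCBCD"
library(stringr)
str_trim("  some text  ")    # trims the surrounding whitespace
## [1] "some text"
```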
For more on the above, search online for similar write-ups on text (string) processing and regular expressions.
Next, regular expressions. Beyond matching literal content, R gives certain special characters (metacharacters) fixed meanings; using them makes it much easier to find text.
Note that regular expressions are not unique to R; they exist in many languages. In the lecture videos, the instructor mainly introduces the expressions he uses most often, chiefly inside functions such as grep/gsub for finding and replacing. In R, these must be written with a double backslash (e.g. **\\d**), since the backslash itself needs escaping in R strings.
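As a short sketch of the kind of metacharacters the lectures cover (my own illustrative examples, not from the video):

```r
x <- c("data2014", "Data2015", "other")
grep("^data", x)               # ^ anchors the match to the start
## [1] 1
grep("^[Dd]ata", x)            # [] matches any one of the listed characters
## [1] 1 2
grep("4$", x, value = TRUE)    # $ anchors the match to the end
## [1] "data2014"
gsub("[0-9]+", "", x)          # + means "one or more of the preceding"
## [1] "data"  "Data"  "other"
```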
Searching online for Chinese-language material turned up:
1. An example of using regular expressions in R to scrape Douban movie listings
2. A detailed reference on R regular-expression syntax, with examples; recommended reading
3. A roughly 55-minute regular-expression practice tutorial; it comes with exercises you can work through for extra practice
```r
# To match a special metacharacter, escape it with \\ .
# Escape an ordinary character this way, though, and you match nothing ╮(╯▽╰)╭
gsub('\\.', "", "$Peace.Love")
## [1] "$PeaceLove"
gsub('\\$', "", "$Peace.Love")
## [1] "Peace.Love"
# \d matches digit characters
gsub('\\d', "", "$Peace.Love0102")
## [1] "$Peace.Love"
# \D matches non-digit characters
gsub('\\D', "", "$Peace.Love2012012")
## [1] "2012012"
```
Since I don't have many chances to use this kind of replacement and text search in my work and study at this stage, I'll skip the details here and fill them in when I need them. Quiz 4 is simple; I breezed through it.
In short, the notes for this Getting and Cleaning Data course are finally done! In February I plan to take Exploratory Data Analysis and Reproducible Research. But the Chinese subtitles for Exploratory Data Analysis are gone, so I expect each course to take about two months from here on, with more time spent watching the videos. For the first four courses, I must thank the subtitle group for their selfless contribution!
Finally, here is a practical application of getting and cleaning data: using the XML package to read the PDF handouts listed on the Coursera course page and download them in bulk.
First, open the Coursera course page and save all of its HTML source to a local HTML file.
A keyword search of that source shows the PDF links are stored as plain text, so the XML package can read and split them out directly.
```r
# Load the XML package and read the data
library(XML)
url1 <- "lecture.html"
html <- htmlTreeParse(url1, useInternalNodes = TRUE)
# In the English version, each PDF sits in a link whose title is "Lecture Notes".
# The following code is based on the getNodeSet demo: grab the list of lecture
# nodes (this should also be doable with regular expressions)
notes <- getNodeSet(html, "/html//a[@title='Lecture Notes']")
head(notes)
# Use xmlGetAttr to pull out the corresponding attribute
pdf <- sapply(notes, xmlGetAttr, "href")
pdf
# This is only a demonstration. I am on Windows, hence mode = "wb";
# I don't know about other systems.
# From here a simple loop downloads all the PDFs. You could also extract the
# PDF file names in bulk using the regular expressions discussed above.
download.file(pdf[[1]], "test1.pdf", mode = "wb")
```

```r
> pdf
[1] "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_01_obtainingDataMotivation.pdf"
[2] "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_02_rawAndProcessedData.pdf"
```
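The loop mentioned in the comments might look like this minimal sketch (the local file name is derived from each URL with `sub`, tying back to the regular-expression section; `basename()` would work equally well):

```r
# Download every lecture PDF, naming each file after the last path segment
for (u in pdf) {
  fname <- sub(".*/", "", u)             # strip everything up to the final "/"
  download.file(u, fname, mode = "wb")   # mode = "wb" for Windows, as above
}
```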
Notes for each week:
Week one: Getting data from different sources (HTML, CSV, xlsx) http://www.cnblogs.com/weibaar/p/4217495.html
Week two: Getting data from APIs and web pages http://www.cnblogs.com/weibaar/p/4230868.html
Week three: Tidying data (dplyr, tidyr, lubridate) http://www.cnblogs.com/weibaar/p/4273636.html
Week four: Text search and regular expressions http://www.cnblogs.com/weibaar/p/4285082.html