Coursera - Getting and Cleaning Data - Week 4: Regular Expressions and Text Processing in R

Source: Internet
Author: User

Coursera - Getting and Cleaning Data - Week 4 Thursday, January,

These are my catch-up notes for the fourth week, together with a summary of the course.

Week 4 of the course focuses on text processing. It covers:

1. Handling variable names
2. Regular expressions
3. Date processing (see the swirl lubridate exercise)

First, the handling of variable names, which follows four principles: 1) use a uniform case (tolower/toupper); 2) remove the special characters introduced when importing data or merging variables; 3) do not repeat names; 4) use fewer codes and abbreviations.
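As a minimal sketch of the first three principles, the snippet below cleans a set of column names. The names in `raw_names` are hypothetical, made up for illustration, and using `make.unique` to resolve duplicates is just one possible choice. Principle 4 (abbreviations) is a judgment call for each data set and is not shown.

```r
# Hypothetical column names, as they might look after importing and merging files
raw_names <- c("Subject.ID", "Activity_Name", "tBodyAcc.mean...X", "Subject.ID")

# 1) Uniform case
clean_names <- tolower(raw_names)

# 2) Remove special characters (here: dots and underscores)
clean_names <- gsub("[._]", "", clean_names)

# 3) No repeated names: make.unique() appends a number to each later duplicate
clean_names <- make.unique(clean_names, sep = "")

print(clean_names)
```

With real data the result would be assigned back, for example `names(df) <- clean_names`, where `df` stands for your own data frame.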

The functions used include:

    1. Find and replace:

gsub | Replaces every match: gsub("A","B",data) changes every A in data to B

sub | Replaces only the first match in each element: sub("A","B",data)

grep | Shows where a pattern appears: grep("A",data) returns the positions of the elements of data that contain A; with value=TRUE it returns the matching elements themselves

grepl | Similar to grep, but returns a logical TRUE/FALSE for each element, much like a comparison such as a==b
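The difference between these four functions is easiest to see on a small vector. This is a sketch with a hypothetical vector `data`; the results in the comments are what base R returns for these inputs.

```r
# Hypothetical character vector
data <- c("apple pie", "banana", "cherry tart")

first_only <- sub("a", "A", data)    # "Apple pie" "bAnana" "cherry tArt"
all_hits   <- gsub("a", "A", data)   # "Apple pie" "bAnAnA" "cherry tArt"

positions  <- grep("an", data)                 # 2
values     <- grep("an", data, value = TRUE)   # "banana"
flags      <- grepl("an", data)                # FALSE TRUE FALSE

# grepl() is handy for subsetting, since it returns one logical per element
data[flags]
```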

    2. String operations:

nchar | Counts the characters in a string, for example nchar("text05") returns 6

substr | Extracts the characters from position M to position N, for example substr("data2014",2,4) returns "ata"

paste | Joins strings, for example paste("ABC","BCD","DEF",sep="-") returns "ABC-BCD-DEF"; paste0 joins them with no separator in between

str_trim | Removes leading and trailing spaces
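The string functions above can be tried directly at the console. One assumption to flag: `str_trim` comes from the stringr package, so this sketch uses base R's `trimws` instead to stay free of dependencies. The padded input string is made up.

```r
n_chars <- nchar("text05")                         # 6
middle  <- substr("data2014", 2, 4)                # "ata"
joined  <- paste("ABC", "BCD", "DEF", sep = "-")   # "ABC-BCD-DEF"
joined0 <- paste0("ABC", "BCD", "DEF")             # "ABCBCDDEF"

# Base R counterpart of stringr::str_trim(): strips leading and trailing whitespace
trimmed <- trimws("  some text  ")                 # "some text"
```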

For more on the functions above, search online for similar documents on text (string) processing and regular expressions.

Regular expressions go beyond searching for literal content: some special characters have a fixed meaning. Using these characters makes it easier to find text.

Note that regular expressions are not unique to R; many languages support them. In his video, the instructor mainly introduces a few expressions that he often uses. They are mainly used for finding and replacing with grep, gsub and similar functions. In R, every escape must be written with a double backslash **\\**, because a single backslash is itself an escape character inside an R string.
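A short sketch of why the double backslash is needed: the R string parser consumes one backslash, and the regular expression engine receives what is left. The input strings here are made up for illustration.

```r
# The R string "\\d+" holds three characters: a backslash, "d" and "+"
pattern <- "\\d+"
pattern_length <- nchar(pattern)                   # 3

# The regex engine therefore sees \d+, meaning "one or more digits"
masked <- gsub(pattern, "#", "week4 of 2015")      # "week# of #"

# An unescaped dot is a metacharacter that matches any character ...
all_gone <- gsub(".", "", "a.b.c")                 # ""
# ... while an escaped dot matches only a literal dot
dots_gone <- gsub("\\.", "", "a.b.c")              # "abc"
# fixed = TRUE turns off regex interpretation, so no escaping is needed
dots_gone_fixed <- gsub(".", "", "a.b.c", fixed = TRUE)   # "abc"
```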

Chinese-language resources found online:

    1. Using regular expressions in R, with an example of how to read Douban movie data

    2. R regular expression syntax in detail: this one is quite thorough and has examples; recommended reading

    3. Regular expressions in 55 minutes: this one has exercises, so you can use it for extra practice

    # To replace a special metacharacter, you need to escape it with \\.
    # For an ordinary character, adding \\ does not help ╮(╯▽╰)╭
    gsub('\\.', "", "$Peace.Love")
    ## [1] "$PeaceLove"
    gsub('\\$', "", "$Peace.Love")
    ## [1] "Peace.Love"

    # \d matches digit characters
    gsub('\\d', "", "$Peace.Love0102")
    ## [1] "$Peace.Love"

    # \D matches non-digit characters
    gsub('\\D', "", "$Peace.Love2012012")
    ## [1] "2012012"

At this stage I do not have many occasions to use this kind of replacement and text search in my work and studies, so I will skip the details here and fill them in when I need them. Quiz 4 is simple, and I coasted through it.

In short, these notes on Getting and Cleaning Data are finally finished! In February I intend to take Exploratory Data Analysis and Reproducible Research. However, Exploratory Data Analysis has no Chinese subtitles, so I estimate that from now on each course will take me about two months, since watching the videos will take longer. For the first four courses, I must thank the subtitle group for their selfless contribution!

Finally, here is a practical application of Getting and Cleaning Data: using the XML package to read the PDF handout links on the Coursera course page and download them in bulk.

First, go to the Coursera course page and copy all of its HTML source into a local HTML file.

A keyword search of the source shows that the PDF links are stored in plain text, so the XML package is enough to read them and pick them out.

    ## Load the XML package and read the data
    library(XML)
    url1 <- "lecture.html"
    html <- htmlTreeParse(url1, useInternalNodes = TRUE)

    # In the English version, the PDF links are in anchors with the title "Lecture Notes".
    # The following code is based on the getNodeSet demonstration.
    # Get the list of nodes related to the lecture notes (this should also be related to regular expressions).
    notes <- getNodeSet(html, "/html//a[@title = 'Lecture Notes']")
    head(notes)

    # Here we use xmlGetAttr to get the corresponding attribute
    pdf <- sapply(notes, xmlGetAttr, "href")
    pdf

    # This is only a demonstration. I am on Windows, so mode = "wb"; I do not know about other systems.
    # A loop would download all the PDFs. The PDF file names can also be extracted in bulk
    # with the regular expression functions mentioned above.
    download.file(pdf[[1]], "test1.pdf", mode = "wb")

    > pdf
    [1] "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_01_obtainingDataMotivation.pdf"
    [2] "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_02_rawAndProcessedData.pdf"
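The loop and the file name extraction mentioned in the comments could look roughly like the sketch below. It starts from the two URLs printed in the output above. The helper `download_all` and its arguments are hypothetical names introduced for illustration, and the call that actually downloads is left commented out.

```r
# The two URLs shown in the output above
pdf <- c(
  "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_01_obtainingDataMotivation.pdf",
  "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_02_rawAndProcessedData.pdf"
)

# Regular expression: drop everything up to and including the last slash
file_names <- sub("^.*/", "", pdf)

# Hypothetical helper: download each URL to dest_dir under its own file name
download_all <- function(urls, files, dest_dir = ".") {
  for (i in seq_along(urls)) {
    download.file(urls[i], file.path(dest_dir, files[i]), mode = "wb")
  }
  invisible(file.path(dest_dir, files))
}

# Uncomment to download the files into the working directory:
# download_all(pdf, file_names)
```

Base R's `basename(pdf)` returns the same file names without a regular expression; `sub` is used here only because it ties back to the functions covered this week.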

Notes for the first three weeks

First week: Get data from different data sources, HTML & CSV & xlsx http://www.cnblogs.com/weibaar/p/4217495.html

Second week: Get data from APIs and Web pages http://www.cnblogs.com/weibaar/p/4230868.html

Third week: Tidying data (dplyr,tidyr,lubridate) http://www.cnblogs.com/weibaar/p/4273636.html

Fourth week: Text lookup and regular expressions http://www.cnblogs.com/weibaar/p/4285082.html
