Coursera - Getting and Cleaning Data - Week 4
Thursday, January
Here are the notes for week four, plus a summary of the whole course.
Week four of the course focuses on text processing. It covers:
1. Handling variable names
2. Regular expressions
3. Date processing (see the swirl exercises for the lubridate package; a quick taste follows this list)
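Since dates are otherwise deferred to swirl, here is a minimal lubridate taste (my own illustrative example, not from the course):

```r
library(lubridate)
d <- ymd("2015-02-10")  # parse "year-month-day" text into a date
month(d)                # 2
wday(d, label = TRUE)   # the weekday (2015-02-10 was a Tuesday)
```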
First, the handling of variable names. It follows a few principles: 1) use a uniform case (tolower/toupper); 2) remove the special characters that importing or merging data introduces into variable names; 3) avoid duplicate names; 4) use abbreviations sparingly.
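A minimal sketch of these principles, assuming a hypothetical data frame `df` with messy imported column names:

```r
# Hypothetical example: clean up the column names of a data frame `df`
names(df) <- tolower(names(df))             # 1) uniform case
names(df) <- gsub("[._ -]", "", names(df))  # 2) strip special characters
names(df) <- make.unique(names(df))         # 3) ensure no duplicates
```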
The functions used include (a runnable demo of all of them follows the two tables below):

- Find and replace:

| Function | What it does | Example |
|----------|--------------|---------|
| `gsub` | replaces every occurrence of "A" with "B" in `data` | `gsub("A","B",data)` |
| `sub` | replaces only the first occurrence | `sub("A","B",data)` |
| `grep` | returns the positions where "A" appears in `data`; adding `value=TRUE` returns the matching values themselves | `grep("A",data)` |
| `grepl` | like `grep`, but returns logical TRUE/FALSE, much like `a == b` | `grepl("A",data)` |
- String calculations:

| Function | What it does | Example |
|----------|--------------|---------|
| `nchar` | counts how many characters are in a string | `nchar("text05")` returns 6 |
| `substr` | extracts the characters from position M to position N | `substr("data2014",2,4)` returns "ata" |
| `paste` | joins strings with a separator | `paste("ABC","BCD","DEF",sep="-")` returns "ABC-BCD-DEF" |
| `paste0` | joins strings with no separator in between | `paste0("ABC","BCD")` returns "ABCBCD" |
| `str_trim` | removes the whitespace at the ends of a string (stringr package) | `str_trim("  text  ")` returns "text" |
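Putting all of these together in one runnable snippet (outputs shown as `##` comments; note that `str_trim` comes from the stringr package):

```r
x <- c("data", "beta", "count")
grep("a", x)                 # positions of the elements containing "a"
## [1] 1 2
grep("a", x, value = TRUE)   # the matching values themselves
## [1] "data" "beta"
grepl("a", x)                # logical version
## [1]  TRUE  TRUE FALSE
sub("a", "b", "banana")      # replace the first occurrence only
## [1] "bbnana"
gsub("a", "b", "banana")     # replace every occurrence
## [1] "bbnbnb"
nchar("text05")
## [1] 6
substr("data2014", 2, 4)
## [1] "ata"
paste("ABC", "BCD", "DEF", sep = "-")
## [1] "ABC-BCD-DEF"
paste0("ABC", "BCD")         # paste0 joins with no separator
## [1] "ABCBCD"
library(stringr)
str_trim("  some text  ")    # trims the surrounding whitespace
## [1] "some text"
```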
For more on the above, search online for similar write-ups on text (string) processing and regular expressions.
Next, regular expressions. Beyond matching literal content, R gives certain special characters (metacharacters) fixed meanings; using them makes it much easier to find text.
Note that regular expressions are not unique to R; they exist in many languages. In the lecture videos, the instructor mainly introduces the expressions he uses most often, chiefly inside functions such as grep/gsub for finding and replacing. In R, these must be written with a double backslash (e.g. **\\d**), since the backslash itself needs escaping in R strings.
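As a short sketch of the kind of metacharacters the lectures cover (my own illustrative examples, not from the video):

```r
x <- c("data2014", "Data2015", "other")
grep("^data", x)               # ^ anchors the match to the start
## [1] 1
grep("^[Dd]ata", x)            # [] matches any one of the listed characters
## [1] 1 2
grep("4$", x, value = TRUE)    # $ anchors the match to the end
## [1] "data2014"
gsub("[0-9]+", "", x)          # + means "one or more of the preceding"
## [1] "data"  "Data"  "other"
```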
Searching online for Chinese-language material turned up:
1. An example of using regular expressions in R to scrape Douban movie listings
2. A detailed reference on R regular-expression syntax, with examples; recommended reading
3. A roughly 55-minute regular-expression practice tutorial; it comes with exercises you can work through for extra practice
```r
# To match a special metacharacter, escape it with \\ .
# Escape an ordinary character this way, though, and you match nothing ╮(╯▽╰)╭
gsub('\\.', "", "$Peace.Love")
## [1] "$PeaceLove"
gsub('\\$', "", "$Peace.Love")
## [1] "Peace.Love"
# \d matches digit characters
gsub('\\d', "", "$Peace.Love0102")
## [1] "$Peace.Love"
# \D matches non-digit characters
gsub('\\D', "", "$Peace.Love2012012")
## [1] "2012012"
```
Since I don't have many chances to use this kind of replacement and text search in my work and study at this stage, I'll skip the details here and fill them in when I need them. Quiz 4 is simple; I breezed through it.
In short, the notes for this Getting and Cleaning Data course are finally done! In February I plan to take Exploratory Data Analysis and Reproducible Research. But the Chinese subtitles for Exploratory Data Analysis are gone, so I expect each course to take about two months from here on, with more time spent watching the videos. For the first four courses, I must thank the subtitle group for their selfless contribution!
Finally, here is a practical application of getting and cleaning data: using the XML package to read the PDF handouts listed on the Coursera course page and download them in bulk.
First, open the Coursera course page and save all of its HTML source to a local HTML file.
A keyword search of that source shows the PDF links are stored as plain text, so the XML package can read and split them out directly.
```r
# Load the XML package and read the data
library(XML)
url1 <- "lecture.html"
html <- htmlTreeParse(url1, useInternalNodes = TRUE)
# In the English version, each PDF sits in a link whose title is "Lecture Notes".
# The following code is based on the getNodeSet demo: grab the list of lecture
# nodes (this should also be doable with regular expressions)
notes <- getNodeSet(html, "/html//a[@title='Lecture Notes']")
head(notes)
# Use xmlGetAttr to pull out the corresponding attribute
pdf <- sapply(notes, xmlGetAttr, "href")
pdf
# This is only a demonstration. I am on Windows, hence mode = "wb";
# I don't know about other systems.
# From here a simple loop downloads all the PDFs. You could also extract the
# PDF file names in bulk using the regular expressions discussed above.
download.file(pdf[[1]], "test1.pdf", mode = "wb")
```

```r
> pdf
[1] "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_01_obtainingDataMotivation.pdf"
[2] "https://d396qusza40orc.cloudfront.net/getdata/lecture_slides/01_02_rawAndProcessedData.pdf"
```
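The loop mentioned in the comments might look like this minimal sketch (the local file name is derived from each URL with `sub`, tying back to the regular-expression section; `basename()` would work equally well):

```r
# Download every lecture PDF, naming each file after the last path segment
for (u in pdf) {
  fname <- sub(".*/", "", u)             # strip everything up to the final "/"
  download.file(u, fname, mode = "wb")   # mode = "wb" for Windows, as above
}
```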
Notes for each week:
Week one: Getting data from different sources (HTML, CSV, xlsx) http://www.cnblogs.com/weibaar/p/4217495.html
Week two: Getting data from APIs and web pages http://www.cnblogs.com/weibaar/p/4230868.html
Week three: Tidying data (dplyr, tidyr, lubridate) http://www.cnblogs.com/weibaar/p/4273636.html
Week four: Text search and regular expressions http://www.cnblogs.com/weibaar/p/4285082.html