Quiz 1Question 1
The American Community Survey distributes downloadable data about the states communities. Download The 2006 microdata survey about housing for the state of Idaho using Download.file () from here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names are here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
How many properties is worth $1,000,000 or more?
Question 2
Use the data, loaded from Question 1. Consider the variable FES in the Code book. Which of the "tidy data" principles does this variable violate?
- Tidy data has variable values, that is internally consistent.
- Tidy data has one variable per column. (Answer)
- Tidy data has no missing values.
- Tidy data has one observation per row.
Question 3
Download the EXCEL spreadsheet on Natural gas aquisition program here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:
dat
What is the value of:
sum(dat$Zip*dat$Ext,na.rm=T)
(Original data Source:http://catalog.data.gov/dataset/natural-gas-acquisition-program)
Question 4
Read the XML data on Baltimore restaurants from here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml
How many restaurants has zipcode 21231?
Question 5
The American Community Survey distributes downloadable data about the states communities. Download The 2006 microdata survey about housing for the state of Idaho using Download.file () from here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv
Using the fread () command load the data into an R object
DT
Which of the following is the fastest-to calculate the average value of the variable
pwgtp15
Broken down by sex using the The data.table package?
Quiz1. R
# set working directory
SETWD ('/home/mhwong/coursera/getting and cleaning data/quiz1 ')
# Question 1
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
DestFile = "Q1.csv",
method = "Curl")
# Read CSV file
Q1 <-read.csv (file = "Q1.csv", Header = TRUE)
# Count properties with value more than 1,000,000
Nrow (Q1[which (q1$val = = 24),])
# Question 3
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx",
DestFile = "Q3.xlsx",
method = "Curl")
# Load XLSX Library
Library (XLSX)
# Read xlsx file
Colindex <-7:15
RowIndex <-18:23
DAT <-read.xlsx ("q3.xlsx", Sheetindex = 1, Header = TRUE, Colindex = colindex, RowIndex = RowIndex)
# sum
SUM (dat$zip*dat$ext,na.rm=t)
# Question 4
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml",
DestFile = "Q4.xml",
method = "Curl")
# Load Library
Library (XML)
# Read XML file
Q4 <-xmltreeparse (file = "Q4.xml", useinternalnodes = TRUE)
# Get root node
RootNode <-XmlRoot (Q4)
# sum the ZipCode = = 21231
Sum (xpathsapply (doc = rootNode, path = "//zipcode", fun = xmlvalue) = = 21231)
# Question 5
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
DestFile = "Q5.csv",
method = "Curl")
# Load Library
Library (data.table)
# read CSV file using Fread
DT <-fread ("Q5.csv", Header = TRUE)
# best-of-calculate mean by sex
Dt[,mean (PWGTP15), By=sex]
Quiz 2Question 1
Register an application with the Github API here https://github.com/settings/applications. Access the API to get information on your instructors repositories (Hint:this are the URL you want "https://api.github.com /users/jtleek/repos "). Use this data to find the time, the datasharing repo was created. What time is it created? This tutorial is useful (HTTPS://GITHUB.COM/HADLEY/HTTR/BLOB/MASTER/DEMO/OAUTH2-GITHUB.R). Also need to run the code in the base R package and not R studio.
Question 2
The SQLDF package allows for execution of SQL commands on R data frames. We'll use the Sqldf-practice the queries we might send with the Dbsendquery command in Rmysql. Download the American Community Survey data and load it into an R object called
acs
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv
Which of the following commands would select only the data for the probability weights PWGTP1 with ages less than 50?
Question 3
Using the same data frame created in the previous problem, what's the equivalent function to unique (ACS$AGEP)
Question 4
How many characters is in the 10th, 20th, 30th and 100th lines of HTML from this page:
Http://biostat.jhsph.edu/~jleek/contact.html
(hint:the nchar () function in R could be helpful)
Question 5
Read this data set into R and report the sum of the numbers in the fourth of the nine columns.
Https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for
Original Source of the Data:http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
(Hint This is a fixed width file format)
Quiz2. R
# set working directory
SETWD ('/home/mhwong/coursera/getting and cleaning data/quiz2 ')
# Question 1
# Load Library
Library (HTTR)
Library (HTTPUV)
# GitHub OAuth Settings
MyApp <-Oauth_app ("GitHub",
Key = "9ac78a3b36b3862ed435",
Secret = "e1c3be242976383cc5f28dae177d41eef5445e3f")
# Get OAuth Credentials
Github_token <-Oauth2.0_token (oauth_endpoints ("GitHub"), MyApp)
# Use API
Gtoken <-Config (token = github_token)
Req <-GET ("Https://api.github.com/users/jtleek/repos", Gtoken)
Stop_for_status (req)
# Convert unstructured JSON to structured JSON
Library (Jsonlite)
Jsondata <-Fromjson (ToJSON (Content (req)))
# Find out the created time of datasharing repo
Subset (jsondata, name = = "Datasharing", select = C (created_at))
# Question 2
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
DestFile = "Q2.csv",
method = "Curl")
# Load Library
Library (SQLDF)
# Read data from File
ACS <-Read.csv ("Q2.csv", Header = TRUE)
# Select the probability weights PWGTP1 with ages less than 50
Sqldf ("Select PWGTP1 from ACS where AGEP < 50")
# Question3
# Same SQL as unique (ACS$AGEP)
Sqldf ("Select DISTINCT Agep from ACS")
# Question 4
# read from HTML
Q4 <-URL ("http://biostat.jhsph.edu/~jleek/contact.html")
Htmlcode <-ReadLines (Q4, n = 100)
# Find numbers of characters of line 10th, 20th, 30th, 100th
NCHAR (HTMLCODE[10])
NCHAR (HTMLCODE[20])
NCHAR (htmlcode[30])
NCHAR (htmlcode[100])
# Question 5
# Read Fixed width format
Q5 <-READ.FWF (file = "Https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for",
Skip = 4,
Widths = C (12, 7, 4, 9, 4, 9, 4, 9,4))
# Find the sum of Forth column
Sum (q5[, 4])
Quiz 3Question 1
The American Community Survey distributes downloadable data about the states communities. Download The 2006 microdata survey about housing for the state of Idaho using Download.file () from here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names are here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
Create a logical vector that identifies the households on greater than ten acres who sold more than $ worth of AGRICU Lture products. Assign that logical vectors to the variable agriculturelogical. Apply the which () function like this to identify the rows of the data frame where the logical vector is TRUE. which (agriculturelogical) what is the first 3 values that result?
Question 2
Using the JPEG package, read in the following, your instructor into R
Https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg
Use the parameter native=true. What is the 30th and 80th Quantiles of the resulting data? (Some Linux systems may produce a answer 638 different for the 30th quantile)
Question 3
Load the Gross domestic Product data for the ranked countries in this data set:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Load the educational data from this data set:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so, States is). What's the 13th country in the resulting data frame?
Original Data Sources:
Http://data.worldbank.org/data-catalog/GDP-ranking-table
Http://data.worldbank.org/data-catalog/ed-stats
Question 4
What's the average GDP ranking for the ' High Income:oecd ' and ' High Income:nonoecd ' group?
Question 5
Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.group. How many countries is Lower middle income but among the nations with highest GDP?
Quiz3. R
# set working directory
SETWD ('/home/mhwong/coursera/getting and cleaning data/quiz3 ')
# Question 1
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
DestFile = "Q1.csv",
method = "Curl")
# Read CSV file
Q1 <-read.csv ("Q1.csv", Header = TRUE)
# Load Library
Library (PLYR)
Library (DPLYR)
# Create a logical vector
Q1 <-Mutate (Q1, agriculturelogical=factor (ACR = = 3 & AGS = = 6), levels = C (TRUE, FALSE)))
# show the first 3 row names which the logical value is TRUE
Head (Row.names (Q1[which (q1$agriculturelogical = = TRUE),]), 3)
# Question 2
# download JPEG file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg",
DestFile = "Q2.jpg",
method = "Curl")
# Load Library
Library (JPEG)
# Read JPEG file
Q2 <-readjpeg (Source = "Q2.jpg",
Native = TRUE)
# show the 30th and 80th Quantiles
Quantile (Q2, C (0.3, 0.8))
# Question 3
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv",
DestFile = "Q3_1.csv",
method = "Curl")
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv",
DestFile = "Q3_2.csv",
method = "Curl")
# Load the Datasets
Q3_1 <-read.csv ("Q3_1.csv", Header = TRUE, skip = 3, Sep = ",")
Q3_2 <-read.csv ("Q3_2.csv", Header = TRUE)
# Reshaping data
Q3_1 <-q3_1[2:191, C (1,2,4,5)]
Rownames (q3_1) <-NULL
Q3_1 <-Rename (q3_1, CountryCode = X)
# merge Datasets
Q3_merge <-Join (q3_1, Q3_2)
# show the number of matches
SUM (!is.na (Unique (q3_merge$ranking)))
# Convert the data type of Ranking
Q3_merge$ranking <-as.numeric (As.character (q3_merge$ranking))
# show the 13th country after sort decending
Q3_merge <-Arrange (q3_merge, desc (Ranking))
q3_merge[13,3]
# Question 4
# Group Q3_merge by Income.group
Income_group <-group_by (Q3_merge, Income.group)
Summarise (income_group, Avg = mean (Ranking, na.rm = TRUE))
# Question 5
# cut Ranking into 5 quantile groups
Q3_merge$rankinggroup <-Cut (q3_merge$ranking, breaks = 5)
# Make a table vs Income.group
Table (Q3_merge$rankinggroup, Q3_merge$income.group)
Quiz 4Question 1
The American Community Survey distributes downloadable data about the states communities. Download The 2006 microdata survey about housing for the state of Idaho using Download.file () from here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names are here:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
Apply Strsplit () to split all the names of the data frame on the characters "WGTP". What is the value of the 123 element of the resulting list?
Question 2
Load the Gross domestic Product data for the ranked countries in this data set:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Remove the commas from the GDP numbers in millions of dollars and average them. What's the average?
Original Data sources:http://data.worldbank.org/data-catalog/gdp-ranking-table
Question 3
In the data set from Question 2 What is a regular expression, would allow you to count the number of countries whose n Ame begins with "the"? Assume the variable with the country names in it is named Countrynames. How do many countries begin with?
Question 4
Load the Gross domestic Product data for the ranked countries in this data set:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Load the educational data from this data set:
Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
Match the data based on the country shortcode. Of the countries for which the end of the fiscal are available, how many end in June?
Original Data Sources:
Http://data.worldbank.org/data-catalog/GDP-ranking-table
Http://data.worldbank.org/data-catalog/ed-stats
Question 5
You can use the Quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies O n the NASDAQ and NYSE. Use the following code to download data on Amazon's stock price and get the times of the data was sampled.
How many values were collected in 2012? How many values were collected on Mondays in 2012?
Quiz4. R
# set working directory
SETWD ('/home/mhwong/coursera/getting and cleaning data/quiz4 ')
# Question 1
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
DestFile = "Q1.csv",
method = "Curl")
# load CSV file
Q1 <-read.csv ("Q1.csv", Header = TRUE)
# Apply Strsplit () and check the 123th element
Strsplit (x = names (q1), split = "WGTP") [[123]]
# Question 2
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv",
DestFile = "Q2.csv",
method = "Curl")
# Load Library
Library (PLYR)
Library (DPLYR)
# load CSV file
Q2 <-Read.csv ("Q2.csv", Header = TRUE, skip = 3)
# Reshape Data
Q2 <-q2[2:191, C (1,2,4,5)]
Rownames (Q2) <-NULL
Q2 <-Rename (q2, CountryCode = X)
Names (Q2) <-gsub (pattern = "\ \",
Replacement = "",
x = names (Q2))
Names (Q2) <-ToLower (names (Q2))
# Remove the comma in column Usdollars
Q2$usdollars <-gsub (pattern = ",",
Replacement = "",
x = q2$usdollars)
# convert Usdollars to numeric
Q2$usdollars <-as.numeric (As.character (Q2$usdollars))
# Calculate the average of usdollars
Mean (x = q2$usdollars, na.rm = TRUE)
# quesetion 3
# Change the column name to match the question
Q2 <-Rename (q2, countrynames = economy)
# find countries whose name begins with "the"
grep (pattern = "^united", x = q2$countrynames)
# count the countries whose name begins with "the
SUM (GREPL (pattern = "^united",
x = q2$countrynames) = = TRUE)
# Question 4
# download file from server
Download.file (url = "Https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv",
DestFile = "Q4.csv",
method = "Curl")
# Load the CSV file
Q4_1 <-Q2
Q4_2 <-read.csv ("Q4.csv", Header = TRUE)
# Reshape the data
Names (Q4_2) <-ToLower (gsub (pattern = "\ \",
Replacement = "",
x = names (q4_2)))
# Merge the data
Q4_merge <-Join (x = q4_1, y = q4_2)
# Count the fiscal year end in June
SUM (GREPL (pattern = "^[ff]iscal" (. *) [Yy]ear (. *) [Ee]nd (.) *[jj]une ",
x = q4_merge$specialnotes) = = TRUE)
# Question 5
# Launch given code
Library (QUANTMOD)
AMZN = Getsymbols ("AMZN", Auto.assign=false)
Sampletimes = Index (AMZN)
# Create a data fram from AMZN
AMZN <-data.frame (AMZN)
# count the values collected in 2012
SUM (Format (AS). Date (x = Rownames (AMZN), format = "%y-%m-%d"), "%Y") = = 2012)
# count the values collected in Mondays 2012
SUM (Format (AS). Date (x = Rownames (AMZN), format = "%y-%m-%d"), "%y%a") = = "2012Mon")
All this code is from: http://mhwong2007.logdown.com/
Thank you!
[Coursera] Getting and cleaning Data Quiz