Big data is all the rage lately, and it has become a hot technology among us programmers. With a Hadoop environment set up, we read all sorts of Hadoop books and browse material on Hadoop, Hive, Storm, and related technologies. After a while, we want to put these techniques into practice on real data. But faced with test data downloaded from the Internet, we either have no idea where to begin, or we throw a statistical regression model at it without a second thought.
We are clueless about big data and big data analytics, even confused about the technology itself, and so we tend to shrink from it.
So what should we do once we have the data? If we don't know where to begin, we can start with exploratory analysis.
Data analysis can be divided into two stages: exploration and validation. Exploratory data analysis (hereinafter EDA) means exploring data that is already in hand (especially raw data gathered by survey or observation) under as few prior assumptions as possible. EDA is particularly useful when we have little experience with the information in the data and do not know which traditional statistical methods to apply.
Exploratory analysis is typically represented by histograms and stem-and-leaf plots. Its basic tools are graphs, tables, and summary statistics. In general, EDA is a systematic way of going through the data: it shows the distribution of every variable, plots time series and transformed variables, uses scatterplot matrices to show the pairwise relationships between variables, and produces all the summary statistics. In other words, you compute the mean, maximum, minimum, and upper and lower quartiles, and identify outliers.
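To make the summary statistics concrete, here is a minimal sketch in base R. The vector `x` is synthetic and only for illustration, and the 1.5 × IQR cutoff (Tukey's rule) is one common convention for flagging outliers, not something prescribed by the data set used later in this post.

```r
# Synthetic values for illustration only; not taken from the article's data set.
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 40)

# Minimum, lower quartile, median, mean, upper quartile, maximum.
print(summary(x))

# Lower and upper quartiles and the interquartile range.
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Tukey's rule (an assumed convention): flag points more than
# 1.5 * IQR below the lower quartile or above the upper quartile.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- x[x < lower | x > upper]
print(outliers)

# Text-mode stem-and-leaf plot of the same values.
stem(x)
```

A different cutoff may suit your data better, but the pattern stays the same: compute the quartiles first, then derive the fences from them.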
Enough talk; let's take an example, with implementations in both R and SPSS.
The attached data contains 5 columns: age, gender, number of ad impressions, number of clicks, and whether the user is signed in.
Implementation in R:
    # Column names below (Age, Gender, Impressions, Clicks, Signed_In) are
    # assumed to match the header row of nyt1.csv; adjust if yours differ.
    root <- "f:/dds_datasets/dds_ch2_nyt/"
    setwd(root)
    file <- paste(root, "nyt1.csv", sep = "")
    nytdata <- read.csv(file)
    head(nytdata)
    # Bin ages into categories
    nytdata$agecat <- cut(nytdata$Age, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))
    summary(nytdata)

    install.packages("doBy")
    library("doBy")
    # Count, minimum, mean, and maximum of a vector
    siterange <- function(x) { c(length(x), min(x), mean(x), max(x)) }
    summaryBy(Age ~ agecat, data = nytdata, FUN = siterange)
    summaryBy(Gender + Signed_In + Impressions + Clicks ~ agecat, data = nytdata)

    ## First, draw the histogram
    install.packages("ggplot2")
    library("ggplot2")

    ggplot(nytdata, aes(x = Impressions, fill = agecat)) + geom_histogram()
    # ggplot(nytdata, aes(x = Impressions, y = agecat, fill = agecat)) + geom_area()
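If you want to check how the age binning behaves before running it on the full file, the following sketch may help. It uses a tiny synthetic data frame (the `demo` rows are made up, and the column names are assumed to mirror the columns described above). It also shows base R's `aggregate` as a possible alternative to `summaryBy` when the doBy package is not installed.

```r
# Synthetic rows for illustration only; column names are assumptions
# that mirror the columns described in the post.
demo <- data.frame(
  Age         = c(0, 17, 18, 19, 70),
  Impressions = c(3, 5, 4, 6, 2),
  Clicks      = c(0, 1, 0, 2, 0)
)

# Same break points as above. cut() intervals are closed on the right
# by default, so Age 18 falls into the (0,18] bin, not (18,24].
demo$agecat <- cut(demo$Age, c(-Inf, 0, 18, 24, 34, 44, 54, 64, Inf))
print(table(demo$agecat))

# Base-R alternative to summaryBy: mean impressions and clicks per age bin.
# Bins with no rows are dropped from the result.
by_age <- aggregate(cbind(Impressions, Clicks) ~ agecat, data = demo, FUN = mean)
print(by_age)
```

The right-closed intervals are worth keeping in mind when reading the per-group summaries, since boundary ages such as 18 or 24 land in the lower bin.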
The analysis results are as follows:
The SPSS implementation is relatively simple: import the data through the wizard, then choose Analyze > Descriptive Statistics > Explore.
I am a programmer too, and a beginner with big data. I started learning R a while ago; anyone interested is welcome to get in touch and exchange ideas.
I can't figure out how to upload attachments here... Please contact me if you need the data.
Big Data Analysis (1): Exploratory Analysis