Machine Learning for Hackers reading notes (II): data analysis

Source: Internet
Author: User
Tags: ggplot

# Mean: the sum of the values divided by the number of values

mean()

# Median: sort the values. If the count is odd, take the middle value of the sorted sequence. If the count is even, take the average of the two middle values

median()
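
# The two definitions above can be sketched by hand in base R. The sample vector and the helper names my.mean and my.median are hypothetical, introduced only for illustration; in practice mean() and median() do the same job.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(3, 1, 4, 1, 5, 9)

# Mean: sum divided by the number of values
my.mean <- function(v) sum(v) / length(v)

# Median: middle of the sorted values, or the average of the two middle values
my.median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

my.mean(x)    # agrees with mean(x)
my.median(x)  # agrees with median(x)
```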

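# R has no built-in function for the statistical mode (the base function mode() reports an object's storage mode instead)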
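
# Since base R lacks a statistical mode function, one possible sketch counts the values with table() and keeps the most frequent ones. The name stat.mode is hypothetical, and returning every tied value is a design choice of this sketch, not something the notes prescribe.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
stat.mode <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  # names() returns character strings; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

stat.mode(c(1, 2, 2, 3, 3, 3))  # a single mode
stat.mode(c(1, 1, 2, 2))        # a tie: both values are returned
stat.mode(c("a", "b", "b"))     # works for character data too
```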

# Quantiles

quantile(data)  # returns the values at the 0%, 25%, 50%, 75% and 100% positions

# You can set the percentages yourself

quantile(data, probs = 0.975)
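
# As an illustration of custom percentages, the sketch below takes the 2.5% and 97.5% quantiles to bracket the central 95% of a sample. The data are simulated with hypothetical parameters as a stand-in for real measurements.

```r
set.seed(1)
# Simulated stand-in data; the mean and sd are hypothetical
x <- rnorm(1000, mean = 66, sd = 4)

quantile(x)  # default: 0%, 25%, 50%, 75%, 100%

# Bounds that enclose the central 95% of the sample
bounds <- quantile(x, probs = c(0.025, 0.975))

# Fraction of values that fall inside those bounds
inside <- mean(x >= bounds[1] & x <= bounds[2])
inside
```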

# Variance: measures the average squared deviation of the values in a dataset from their mean

var()

# Standard deviation: the square root of the variance

sd()
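
# A short worked check of how the two relate. R's var() divides by n - 1 (the sample variance), and sd() is its square root. The sample vector is hypothetical.

```r
# Hypothetical sample values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual.var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
manual.sd <- sqrt(manual.var)

all.equal(manual.var, var(x))
all.equal(manual.sd, sd(x))
```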

# Histogram; binwidth = 1 sets the bin width to 1
# (assumes the heights.weights data frame has already been loaded)

library(ggplot2)
ggplot(heights.weights, aes(x = Height)) + geom_histogram(binwidth = 1)

# The histogram looks symmetric. Remember when using a histogram: the bin width is an external structure imposed on the data, but it also reveals the internal structure of the data

# Change the bin width to 5

ggplot(heights.weights, aes(x = Height)) + geom_histogram(binwidth = 5)

# In this plot the symmetry is no longer visible; this is called oversmoothing. The opposite case is called undersmoothing, for example:

ggplot(heights.weights, aes(x = Height)) + geom_histogram(binwidth = 0.01)

# So a good histogram requires tuning the bin width. Alternatively, visualize the data with a density plot

ggplot(heights.weights, aes(x = Height)) + geom_density()

# The peak looks flat, so try splitting the data by gender

ggplot(heights.weights, aes(x = Height, fill = Gender)) + geom_density()

# Mixture model: a non-standard distribution formed by mixing two standard distributions
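
# A minimal simulation of that idea, assuming two normal components with hypothetical means and standard deviations (these are not the values of the book's dataset). Pooling the two groups gives a wider, flatter distribution than either group alone.

```r
set.seed(1)
n <- 10000

# Randomly assign each observation to one of two components
is.a <- rbinom(n, 1, 0.5) == 1

# Hypothetical component parameters, for illustration only
mixed <- ifelse(is.a,
                rnorm(n, mean = 64, sd = 2.5),
                rnorm(n, mean = 69, sd = 2.5))

# Each component is narrower than the pooled mixture
sd(mixed[is.a])
sd(mixed[!is.a])
sd(mixed)

# Density estimate of the pooled data; plot(d) would draw it
d <- density(mixed)
```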

# Normal distribution, also called the bell curve or Gaussian distribution

# Facet by gender

ggplot(heights.weights, aes(x = Weight, fill = Gender)) + geom_density() + facet_grid(Gender ~ .)

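# The following code sets the mean and standard deviation of the distribution (rnorm() takes the standard deviation, not the variance). Adjusting m and s only moves the center or stretches the width

m <- 0
s <- 1
ggplot(data.frame(x = rnorm(100000, m, s)), aes(x = x)) + geom_density()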

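# This draws the density curve; the mode sits at the peak of the bell

# The mode of a normal distribution is also its mean and its median

# A distribution with a single mode is called unimodal, one with two modes bimodal, and one with more than two multimodal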

# One qualitative distinction is between symmetric and skewed distributions

# Symmetric distribution: the left and right sides have the same shape, as in the normal distribution

# This means that observing a value below the mode is as likely as observing a value above it.

# Skewed distribution: extreme values are more likely on one side of the mode than on the other. The gamma distribution is an example, skewed to the right

# Another qualitative distinction separates thin-tailed and heavy-tailed distributions

# Thin-tailed distribution: the values it generates usually fall close to the mean; for the normal distribution, about 99% of values lie within three standard deviations of the mean

# Cauchy distribution: only about 90% of values fall within three standard deviations, and the farther from the center, the more the two distributions differ

# The normal distribution almost never produces a value 6 standard deviations from the mean, while the Cauchy distribution does so about 5% of the time.

# Generate random numbers from the normal and Cauchy distributions

set.seed(1)
normal.values <- rnorm(250, 0, 1)
cauchy.values <- rcauchy(250, 0, 1)
range(normal.values)
range(cauchy.values)
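
# To put a rough number on the difference in tails, the sketch below counts how many simulated values land more than 3 units from the center. One assumption to flag: the Cauchy distribution has no finite standard deviation, so this sketch uses its scale parameter (1 here) as the unit. The fractions therefore depend on how spread is measured and need not match the percentages quoted in these notes.

```r
set.seed(1)
normal.values <- rnorm(250, 0, 1)
cauchy.values <- rcauchy(250, 0, 1)

# Fraction of values more than 3 units away from the center
normal.far <- mean(abs(normal.values) > 3)
cauchy.far <- mean(abs(cauchy.values) > 3)

# Width of the observed range for each sample
normal.width <- diff(range(normal.values))
cauchy.width <- diff(range(cauchy.values))

c(normal = normal.far, cauchy = cauchy.far)
c(normal = normal.width, cauchy = cauchy.width)
```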

# Plot both samples

ggplot(data.frame(x = normal.values), aes(x = x)) + geom_density()

ggplot(data.frame(x = cauchy.values), aes(x = x)) + geom_density()

# Normal distribution: unimodal, symmetric, bell-shaped, thin-tailed

# Cauchy distribution: unimodal, symmetric, bell-shaped, heavy-tailed

# Generate random numbers from the gamma distribution

gamma.values <- rgamma(100000, 1, 0.001)

ggplot(data.frame(x = gamma.values), aes(x = x)) + geom_density()

# A lot of game data fits a gamma distribution

# The gamma distribution produces only positive values

# Exponential distribution: the mode is 0, and only non-negative values occur

# For example, enterprise call centers often find that the time between two incoming calls appears to be exponentially distributed
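
# The call-center example can be simulated with rexp(). The rate of 2 calls per minute is a hypothetical figure chosen for illustration, not a value from the notes.

```r
set.seed(1)

# Hypothetical rate: on average 2 calls per minute
rate <- 2
gaps <- rexp(10000, rate = rate)  # simulated minutes between consecutive calls

mean(gaps)       # close to 1 / rate
all(gaps >= 0)   # only non-negative values occur

# Short gaps are the most common: compare two equally wide intervals
short <- mean(gaps < 0.5)
longer <- mean(gaps >= 0.5 & gaps < 1)
c(short = short, longer = longer)
```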

# Scatterplot

ggplot(heights.weights, aes(x = Height, y = Weight)) + geom_point()

# Add a smoothed trend line

ggplot(heights.weights, aes(x = Height, y = Weight)) + geom_point() + geom_smooth()

# Repeat with the first 20, 200 and 2000 rows to see the trend sharpen as data is added

ggplot(heights.weights[1:20, ], aes(x = Height, y = Weight)) + geom_point() + geom_smooth()

ggplot(heights.weights[1:200, ], aes(x = Height, y = Weight)) + geom_point() + geom_smooth()

ggplot(heights.weights[1:2000, ], aes(x = Height, y = Weight)) + geom_point() + geom_smooth()

# Color the points by gender
ggplot(heights.weights, aes(x = Height, y = Weight)) +
  geom_point(aes(color = Gender, alpha = 0.25)) +
  scale_alpha(guide = "none") +
  scale_color_manual(values = c("Male" = "black", "Female" = "gray")) +
  theme_bw()

# An alternative using bright colors.
ggplot(heights.weights, aes(x = Height, y = Weight, color = Gender)) +
  geom_point()

#
# Snippet 35
#

# Encode gender as a 0/1 variable
heights.weights <- transform(heights.weights,
                             Male = ifelse(Gender == 'Male', 1, 0))

# Logistic regression predicting gender from weight and height
logit.model <- glm(Male ~ Weight + Height,
                   data = heights.weights,
                   family = binomial(link = 'logit'))

# Draw the decision boundary, the line where the predicted probability is 0.5
# geom_abline() is used here because recent ggplot2 versions no longer provide stat_abline()
ggplot(heights.weights, aes(x = Height, y = Weight)) +
  geom_point(aes(color = Gender, alpha = 0.25)) +
  scale_alpha(guide = "none") +
  scale_color_manual(values = c("Male" = "black", "Female" = "gray")) +
  theme_bw() +
  geom_abline(intercept = -coef(logit.model)[1] / coef(logit.model)[2],
              slope = -coef(logit.model)[3] / coef(logit.model)[2],
              color = 'black')
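
# The intercept and slope of that line come from setting the model's linear predictor to zero: b0 + b1 * Weight + b2 * Height = 0 gives Weight = -b0 / b1 - (b2 / b1) * Height. The sketch below checks this on simulated data. The data frame sim and all of its parameters are hypothetical stand-ins for heights.weights, so the fitted numbers will differ from the real dataset.

```r
set.seed(1)
n <- 500

# Simulated stand-in data; all means and sds are hypothetical
sim <- data.frame(
  Gender = rep(c("Male", "Female"), each = n),
  Height = c(rnorm(n, 69, 2.8), rnorm(n, 64, 2.7)),
  Weight = c(rnorm(n, 187, 20), rnorm(n, 136, 19))
)
sim$Male <- ifelse(sim$Gender == "Male", 1, 0)

fit <- glm(Male ~ Weight + Height, data = sim,
           family = binomial(link = "logit"))

# Classify at the 0.5 probability threshold and measure accuracy
p <- predict(fit, type = "response")
pred <- ifelse(p > 0.5, "Male", "Female")
accuracy <- mean(pred == sim$Gender)

# Decision boundary: Weight = intercept + slope * Height
b <- unname(coef(fit))
intercept <- -b[1] / b[2]
slope <- -b[3] / b[2]

# A point on the boundary should get a predicted probability of 0.5
on.line <- data.frame(Height = 66, Weight = intercept + slope * 66)
p.line <- unname(predict(fit, newdata = on.line, type = "response"))

accuracy
p.line
```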
