"Data Analysis R Language Practice" study notes the sixth chapter parameter estimation and r implementation (above)

Source: Internet
Author: User

6.1 Point Estimation and R Implementation

6.1.1 Moment Estimation

Equation-solving functions in R:

Function @ package:

uniroot() @stats: finds the root of a univariate (nonlinear) equation

multiroot() @rootSolve: given n (nonlinear) equations, solves for n roots

uniroot.all() @rootSolve: finds multiple roots of an equation within an interval

BBsolve() @BB: solves nonlinear systems using Barzilai-Borwein step sizes

uniroot(f, interval, ..., lower = min(interval), upper = max(interval),
        f.lower = f(lower, ...), f.upper = f(upper, ...),
        extendInt = c("no", "yes", "downX", "upX"), check.conv = FALSE,
        tol = .Machine$double.eps^0.25, maxiter = 1000, trace = 0)

Here f specifies the function whose root is sought; interval is a numeric vector giving the interval in which to search for the root (alternatively, the two endpoints can be given with lower and upper); tol is the desired accuracy (convergence tolerance); maxiter is the maximum number of iterations.
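As a minimal illustration (not from the book), uniroot() finds the root of a univariate function inside a bracketing interval; the function f and the interval [1, 2] here are illustrative assumptions:

```r
# Sketch: solve x^3 - x - 2 = 0 on the interval [1, 2],
# where f(1) = -2 and f(2) = 4 bracket a sign change
f = function(x) x^3 - x - 2
root = uniroot(f, interval = c(1, 2), tol = 1e-9)
root$root   # about 1.5214, where f crosses zero
```

The interval endpoints must give function values of opposite sign, otherwise uniroot() stops with an error unless extendInt is set.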

For a system of several equations in several unknowns, we use the function multiroot() from the rootSolve package. multiroot() solves n nonlinear equations by the Newton-Raphson method, which requires a full Jacobian matrix. Its calling format is:

multiroot(f, start, maxiter = 100,
          rtol = 1e-6, atol = 1e-8, ctol = 1e-8,
          useFortran = TRUE, positive = FALSE,
          jacfunc = NULL, jactype = "fullint",
          verbose = FALSE, bandup = 1, banddown = 1,
          parms = NULL, ...)

f specifies the function defining the equations. Because Newton's iteration is used, initial values for the roots must be passed via start, whose names attribute, if present, also labels the output variables; maxiter is the maximum number of allowed iterations; rtol and atol are the relative and absolute error tolerances, normally left at their defaults; ctol is another scalar controlling the iteration: if the largest change between two successive iterations is smaller than ctol, the iteration stops and the roots of the system are returned.

For example, the losses of an insurance product during one policy year are recorded below, giving the number of policies for each loss count; we estimate the distribution of the number of losses. The distribution is known to be Poisson, and the sample mean is the moment estimate of the parameter λ.

Number of losses     0     1    2    3   4   5
Number of policies   1532  581  179  41  10  4

> num = c(rep(0:5, c(1532, 581, 179, 41, 10, 4)))  # rep() builds the sample: each value in 0:5 is repeated according to the counts in the second vector
> lambda = mean(num)
> lambda
[1] 0.4780571

Plot to compare the estimated and sample values of the loss counts:

> k = 0:5
> ppois = dpois(k, lambda)
> poisnum = ppois * length(num)        # loss counts implied by the Poisson distribution
> plot(k, poisnum, ylim = c(0, 1600))  # for a better plot, fix the vertical range with ylim: minimum 0, maximum above the largest sample count, here 1600
> samplenum = as.vector(table(num))    # observed loss counts in the sample
> points(k, samplenum, type = "p", col = 2)
> legend(4, 1000, legend = c("num", "Poisson"), col = 1:2, pch = "0")

The function multiroot() from the rootSolve package is used to solve the system of moment equations:
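For a uniform distribution U(a, b), the mean is (a + b)/2 and the variance is (b − a)²/12, so equating these to the sample moments m1 and m2 gives exactly the system encoded in the model() function of the transcript:

```latex
f_1 = a + b - 2m_1 = 0, \qquad f_2 = \frac{(b-a)^2}{12} - m_2 = 0
```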

> x = c(4, 5, 4, 3, 9, 9, 5, 7, 9, 8, 0, 3, 8, 0, 8, 7, 2, 1, 1, 2)
> m1 = mean(x)
> m2 = var(x)
> model = function(x, m1, m2) {
+   c(f1 = x[1] + x[2] - 2*m1,
+     f2 = (x[2] - x[1])^2/12 - m2)
+ }
> library(rootSolve)
> multiroot(f = model, start = c(0, 10), m1 = m1, m2 = m2)
$root
[1] -0.7523918 10.2523918    # the two parameters of the uniform distribution: [-0.75, 10.25]
$f.root
           f1            f2
-5.153211e-12  1.121688e-09
$iter
[1] 4
$estim.precis
[1] 5.634204e-10

  

Verify that:

> m1 - sqrt(3*m2); m1 + sqrt(3*m2)
[1] -0.7523918
[1] 10.25239

  

6.1.2 Maximum Likelihood Estimation

Functions for computing extrema in R (stats package):

optimize(): computes maximum likelihood estimates for single-parameter distributions

optim(): computes maximum likelihood estimates for multi-parameter distributions

nlm(): computes the minimum point of a nonlinear function

nlminb(): nonlinear minimization subject to box constraints

1. The function optimize()

When the distribution contains only one parameter, the extremum-finding function optimize() can be used to obtain the maximum likelihood estimate.

optimize(f, interval, ..., lower = min(interval),
         upper = max(interval), maximum = FALSE,
         tol = .Machine$double.eps^0.25)
Here f is the likelihood function; interval specifies the range of the parameter; lower and upper are its lower and upper bounds; maximum defaults to FALSE, meaning the minimum of the function is sought, while TRUE requests the maximum; tol is the computational accuracy.
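As a sketch (not from the book), maximum likelihood for a one-parameter distribution via optimize(); the simulated sample x, the function negloglik and the true rate 2 are illustrative assumptions:

```r
# Sketch: MLE of the rate of an exponential sample with optimize(),
# by minimizing the negative log-likelihood over a parameter interval
set.seed(1)
x = rexp(200, rate = 2)                 # simulated exponential sample
negloglik = function(lambda) -sum(dexp(x, rate = lambda, log = TRUE))
est = optimize(negloglik, interval = c(0.01, 10))
est$minimum                             # close to the analytic MLE 1/mean(x)
```

Since maximum = FALSE is the default, maximizing a likelihood is done by minimizing its negative, as here; alternatively, pass the log-likelihood itself with maximum = TRUE.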

2. The functions optim() and nlm()

When the distribution contains multiple parameters, the maximum point of the likelihood function is computed with optim() or nlm().

optim(par, fn, gr = NULL, ...,
      method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
                 "Brent"),
      lower = -Inf, upper = Inf,
      control = list(), hessian = FALSE)

par sets the initial values of the parameters; fn is the likelihood function; method selects the algorithm used to compute the extremum.

nlm(f, p, ..., hessian = FALSE, typsize = rep(1, length(p)),
    fscale = 1, print.level = 0, ndigit = 12, gradtol = 1e-6,
    stepmax = max(1000 * sqrt(sum((p/typsize)^2)), 1000),
    steptol = 1e-6, iterlim = 100, check.analyticals = TRUE)
nlm() is a nonlinear minimization function that uses a Newton-type algorithm to find the minimum point of a function iteratively. Usually only the first two arguments need to be set: f is the function to be minimized; p sets the initial values of the parameters.
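A sketch (not from the book) of nlm() minimizing a negative log-likelihood with two parameters; the simulated sample y and the log-sd reparameterization are illustrative assumptions, with exp(p[2]) keeping the standard deviation positive during the iteration:

```r
# Sketch: estimate the mean and sd of a normal sample with nlm(),
# minimizing the negative log-likelihood; p = (mu, log sd)
set.seed(2)
y = rnorm(300, mean = 5, sd = 2)
nll = function(p) -sum(dnorm(y, p[1], exp(p[2]), log = TRUE))
fit = nlm(nll, p = c(0, 0))
fit$estimate[1]        # close to mean(y)
exp(fit$estimate[2])   # close to sd(y)
```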

3. The function nlminb()

In practical applications with large data sets or more complex computations, the three basic functions above may not suffice, and the optimization function nlminb() is used instead.

nlminb(start, objective, gradient = NULL, hessian = NULL, ...,
       scale = 1, control = list(), lower = -Inf, upper = Inf)

The argument start is a numeric vector of initial parameter values; objective specifies the function to be optimized; gradient and hessian set the gradient and Hessian of the log-likelihood and are usually left at their defaults; control is a list of control parameters; lower and upper set lower and upper bounds on the parameters and, if unspecified, all parameters are assumed unconstrained.

Example:

> library(MASS)
> head(geyser, 5)
  waiting duration
1      80 4.016667
2      71 2.150000
3      57 4.000000
4      80 4.000000
5      75 4.000000
> attach(geyser)
> hist(waiting, freq = FALSE)  # inspect the shape of the distribution with a histogram

  

We guess that the distribution is a mixture of two normal distributions, which requires estimating 5 parameters: p, μ1, μ2, σ1, σ2.
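Written out, the mixture density with these parameters, and the log-likelihood to be maximized, are:

```latex
f(x) = p\,\phi(x;\mu_1,\sigma_1) + (1-p)\,\phi(x;\mu_2,\sigma_2), \qquad
\ell(p,\mu_1,\sigma_1,\mu_2,\sigma_2) = \sum_{i=1}^{n} \log f(x_i)
```

where φ(x; μ, σ) denotes the normal density with mean μ and standard deviation σ.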

When writing the log-likelihood function in R, the 5 parameters are stored in a vector para; since nlminb() computes a minimum, the function returns the negative of the log-likelihood.

> L1 = function(para)
+ {
+   f1 = dnorm(waiting, para[2], para[3])
+   f2 = dnorm(waiting, para[4], para[5])
+   f = para[1]*f1 + (1 - para[1])*f2
+   l1 = sum(log(f))
+   return(-l1)
+ }

  

Before using nlminb() for parameter estimation, the most important step is choosing initial values: the closer they are to the true values, the more accurate the result. We suspect the data follow a two-component normal mixture, so the probability p is simply started at 0.5. The x-axis positions of the two peaks in the histogram (approximately 50 and 80) serve as initial values for μ1 and μ2. Since p lies in the interval (0, 1), and σ1, σ2 are normal standard deviations that must be positive, the parameters are constrained through the lower and upper arguments.

> geyser.est = nlminb(c(0.5, 50, 10, 80, 10), L1,
+                     lower = c(0.0001, -Inf, 0.0001, -Inf, 0.0001),
+                     upper = c(0.9999, Inf, Inf, Inf, Inf))
> options(digits = 3)
> geyser.est$par
[1]  0.308 54.203  4.952 80.360  7.508
> p = geyser.est$par[1]
> mu1 = geyser.est$par[2]; sigma1 = geyser.est$par[3]
> mu2 = geyser.est$par[4]; sigma2 = geyser.est$par[5]
> x = seq(40, 120)
> f = p*dnorm(x, mu1, sigma1) + (1-p)*dnorm(x, mu2, sigma2)  # substitute the estimated parameters into the density function
> hist(waiting, freq = FALSE)
> lines(x, f)

  

(2) Calculation using the maximum likelihood estimation function maxLik()

The function maxLik() in the package of the same name, maxLik, computes the maximum likelihood estimate directly. Its calling format is:

maxLik(logLik, grad = NULL, hess = NULL, start, method,
       constraints = NULL, ...)

logLik is the log-likelihood function; grad and hess set the gradient and Hessian of the log-likelihood and usually need not be set, keeping the default NULL; start is a numeric vector of initial parameter values; method selects the maximization method, among "Newton-Raphson", "BFGS", "BFGSR", "BHHH", "SANN" and "Nelder-Mead"; if unset, a suitable method is chosen automatically; constraints specifies constraints on the likelihood estimation.

Example:

A two-parameter negative binomial distribution is fitted by maximum likelihood, as an example of fitting a discrete distribution:

To write the R program, we first write the log-likelihood function loglik, using R's negative binomial density dnbinom(), whose parameters are r and p. Since we want to estimate β, the parameterization is converted via p = 1/(1 + β).

> num = c(rep(0:5, c(1532, 581, 179, 41, 10, 4)))
> loglik = function(para)
+ {
+   f = dnbinom(num, para[1], 1/(1 + para[2]))
+   l1 = sum(log(f))
+   return(l1)
+ }
> library(maxLik)
> para = maxLik(loglik, start = c(0.5, 0.4))$estimate
> r = para[1]; beta = para[2]

  

Use a plot to assess the estimated fit, comparing the sample and estimated values of the loss counts:

> l = length(num)
> nbinomnum = dnbinom(0:5, r, 1/(1 + beta)) * l
> nbinomnum
[1] 1530.12  588.08  170.66   44.17   10.74    2.51
> plot(0:5, as.vector(table(num)), ylim = c(0, 1600))   # sample counts
> points(0:5, nbinomnum, type = "p", col = 2)           # fitted counts
> legend(3, 1000, legend = c("num", "nbinom"), col = 1:2, lty = 1)

  

It can be seen that the negative binomial maximum likelihood fit is very good: the estimated values almost coincide with the sample values, so we can conclude that the number of losses follows a negative binomial distribution.

6.2 Interval Estimation for a Single Normal Population

6.2.1 Interval Estimation of the Mean μ

(1) σ2 known

R has no built-in function for the confidence interval of the mean when the variance is known, so we write one ourselves:

conf.int = function(x, sigma, alpha) {
  mean = mean(x)
  n = length(x)
  z = qnorm(1 - alpha/2, mean = 0, sd = 1, lower.tail = TRUE)
  c(mean - sigma*z/sqrt(n), mean + sigma*z/sqrt(n))
}

Here x is the data sample; sigma is the known population standard deviation; alpha is the significance level. Usually we estimate a two-sided confidence interval, since it provides both reference values for the parameter being estimated. If a one-sided confidence bound is wanted instead, the theory is the same as in the two-sided case, except that the α quantile of the standard normal distribution is used, and the function is changed accordingly.
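As a sketch of that one-sided change (an assumption, not code from the book), replacing the α/2 quantile with the α quantile yields a lower confidence bound; conf.lower is a hypothetical name modeled on the conf.int() function above:

```r
# Sketch: one-sided lower confidence bound for the mean, variance known;
# the resulting interval is (lower bound, +Inf) at level 1 - alpha
conf.lower = function(x, sigma, alpha) {
  n = length(x)
  z = qnorm(1 - alpha, mean = 0, sd = 1, lower.tail = TRUE)
  mean(x) - sigma*z/sqrt(n)
}
```

An upper bound is obtained symmetrically as mean(x) + sigma*z/sqrt(n).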

The package BSDA (Basic Statistics and Data Analysis) provides the function z.test(), which performs hypothesis tests and interval estimation for one or two samples based on the normal distribution. It is used as follows:

z.test(x, y = NULL, alternative = "two.sided", mu = 0, sigma.x = NULL,
       sigma.y = NULL, conf.level = 0.95)

Here x and y are numeric vectors; the default y = NULL means a one-sample test. alternative specifies the type of confidence interval: the default "two.sided" gives a two-sided interval, "less" gives a one-sided interval with an upper confidence bound, and "greater" one with a lower confidence bound. mu is the mean under the null hypothesis, used only in hypothesis testing, with default 0. sigma.x and sigma.y specify the population standard deviations of the two samples; conf.level sets the confidence level of the interval estimate.

The function simple.z.test() in the package UsingR is designed specifically for interval estimation of the sample mean; it differs from z.test() in that it only computes the confidence interval and cannot perform the Z test. simple.z.test() is used as follows:

simple.z.test(x, sigma, conf.level = 0.95)

where x is the data vector; sigma is the known population standard deviation; conf.level specifies the confidence level of the interval estimate, defaulting to 95%.

Example:

Draw 20 observations from a normal population with mean 10 and standard deviation 2, giving a normal sample with known variance. Compute the confidence interval of x at the 95% confidence level, first calling the self-written function conf.int():

> conf.int = function(x, sigma, alpha) {
+   mean = mean(x)
+   n = length(x)
+   z = qnorm(1 - alpha/2, mean = 0, sd = 1, lower.tail = TRUE)
+   c(mean - sigma*z/sqrt(n), mean + sigma*z/sqrt(n))
+ }
> set.seed(111)
> x = rnorm(20, 10, 2)
> conf.int(x, 2, 0.05)
[1]  8.42 10.17

  

This result can also be obtained directly with the function z.test():

> library(BSDA)
> z.test(x, sigma.x = 2)$conf.int
[1]  8.42 10.17
attr(,"conf.level")
[1] 0.95

  

simple.z.test() gives the interval estimate directly:

> library(UsingR)
> simple.z.test(x, 2)
[1]  8.42 10.17

  

All three methods show that the 95% confidence interval for the mean of this sample is [8.42, 10.17].

(2) σ2 unknown

When the population variance is unknown, the t-distributed statistic replaces z, and the sample variance S² replaces the population variance.

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)

Here x is the sample data; if x and y are both supplied, a two-sample t test is performed. alternative specifies the type of confidence interval: the default "two.sided" gives a two-sided interval, "less" a one-sided interval with an upper confidence bound, "greater" one with a lower confidence bound. mu is the mean under the null hypothesis, used only in hypothesis testing, with default 0.

Still using the vector x from the previous example, but now assuming the population variance unknown, the confidence interval is computed with the function t.test():

> t.test(x)

        One Sample t-test

data:  x
t = 22.6, df = 19, p-value = 3.407e-15
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  8.43 10.15
sample estimates:
mean of x
     9.29

  

If only the interval estimate is needed, extract the conf.int component with the "$" operator:

> t.test(x)$conf.int
[1]  8.43 10.15
attr(,"conf.level")
[1] 0.95

  

6.2.2 Interval Estimation of the Variance σ²

(1) μ known

(2) μ unknown
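The standard chi-square confidence intervals for the two cases, which the self-written function then implements, are:

```latex
\mu\ \text{known:}\quad
\left[\frac{n\hat\sigma^2}{\chi^2_{1-\alpha/2}(n)},\ \frac{n\hat\sigma^2}{\chi^2_{\alpha/2}(n)}\right],
\qquad \hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
\\[6pt]
\mu\ \text{unknown:}\quad
\left[\frac{(n-1)S^2}{\chi^2_{1-\alpha/2}(n-1)},\ \frac{(n-1)S^2}{\chi^2_{\alpha/2}(n-1)}\right]
```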

R has no function that computes this confidence interval directly, but we can write both cases into one function and distinguish them with an if statement, so that any interval estimate of the variance calls the same function. When writing a function in R, a parameter can be given a default value: here mu = Inf marks the mean as unknown, so calling the function without specifying mu computes the interval for the unknown-mean case, while a known mean is supplied by reassigning mu in the call.

> var.conf.int = function(x, mu = Inf, alpha) {
+   n = length(x)
+   if (mu < Inf) {
+     s2 = sum((x - mu)^2)/n
+     df = n
+   }
+   else {
+     s2 = var(x)
+     df = n - 1
+   }
+   c(df*s2/qchisq(1 - alpha/2, df), df*s2/qchisq(alpha/2, df))
+ }
> var.conf.int(x, alpha = 0.05)
[1]  5.35 39.50

  

The 95% confidence interval for the population variance is [5.35, 39.50].

"Data Analysis R Language Practice" study notes the sixth chapter parameter estimation and r implementation (above)
