68–95–99.7 rule

For the normal distribution, the values less than one standard deviation away from the mean account for 68.27% of the set; while two standard deviations from the mean account for 95.45%; and three standard deviations account for 99.73%.

Prediction interval (on the y-axis) given from the standard score (on the x-axis). The y-axis is logarithmically scaled (but the values on it are not modified).

In statistics, the 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of one, two and three standard deviations, respectively; more accurately, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively. In mathematical notation, these facts can be expressed as follows, where $x$ is an observation from a normally distributed random variable, $μ$ is the mean of the distribution, and $σ$ is its standard deviation:

{\begin{aligned}\Pr(\mu -\;\,\sigma \leq x\leq \mu +\;\,\sigma )&\approx 0.6827\\\Pr(\mu -2\sigma \leq x\leq \mu +2\sigma )&\approx 0.9545\\\Pr(\mu -3\sigma \leq x\leq \mu +3\sigma )&\approx 0.9973\end{aligned}}

In the empirical sciences the so-called three-sigma rule of thumb expresses a conventional heuristic that "nearly all" values are taken to lie within three standard deviations of the mean, i.e. that it is empirically useful to treat 99.7% probability as "near certainty".^[1] The usefulness of this heuristic of course depends significantly on the question under consideration, and there are other conventions, e.g. in the social sciences a result may be considered "significant" if its confidence level is of the order of a two-sigma effect (95%), while in particle physics, there is a convention of a five-sigma effect (99.99994% confidence) being required to qualify as a "discovery".

The "three sigma rule of thumb" is related to a result also known as the three-sigma rule, which states that even for non-normally distributed variables, at least 98% of cases should fall within properly-calculated three-sigma intervals.^[2]

Cumulative distribution function

Diagram showing the cumulative distribution function for the normal distribution with mean (µ) 0 and variance (σ²) 1.

These numerical values "68%, 95%, 99.7%" come from the cumulative distribution function of the normal distribution.

The prediction interval for any standard score corresponds numerically to (1−(1−Φ_µ,σ²(standard score))·2).

For example, $Φ(2) \approx 0.9772$ , or $Pr(x \leq μ + 2 σ) \approx 0.9772$ , corresponding to a prediction interval of (1 − (1 − 0.97725)·2) = 0.9545 = 95.45%. Note that this is not a symmetrical interval – this is merely the probability that an observation is less than $μ + 2 σ$ . To compute the probability that an observation is within two standard deviations of the mean (small differences due to rounding):

\Pr(\mu -2\sigma \leq x\leq \mu +2\sigma )=\Phi (2)-\Phi (-2)\approx 0.9772-(1-0.9772)\approx 0.9545

This is related to confidence interval as used in statistics: $x̅ \pm 2 σ$ is approximately a 95% confidence interval when $x̅$ is the average of a sample.

Normality tests

Main article: Normality test

The "68–95–99.7 rule" is often used to quickly get a rough probability estimate of something, given its standard deviation, if the population is assumed to be normal. It is also as a simple test for outliers if the population is assumed normal, and as a normality test if the population is potentially not normal.

To pass from a sample to a number of standard deviations, one first computes the deviation, either the error or residual depending on whether one knows the population mean or only estimates it. The next step is standardizing (dividing by the population standard deviation), if the population parameters are known, or studentizing (dividing by an estimate of the standard deviation), if the parameters are unknown and only estimated.

To use as a test for outliers or a normality test, one computes the size of deviations in terms of standard deviations, and compares this to expected frequency. Given a sample set, one can compute the studentized residuals and compare these to the expected frequency: points that fall more than 3 standard deviations from the norm are likely outliers (unless the sample size is significantly large, by which point one expects a sample this extreme), and if there are many points more than 3 standard deviations from the norm, one likely has reason to question the assumed normality of the distribution. This holds ever more strongly for moves of 4 or more standard deviations.

One can compute more precisely, approximating the number of extreme moves of a given magnitude or greater by a Poisson distribution, but simply, if one has multiple 4 standard deviation moves in a sample of size 1,000, one has strong reason to consider these outliers or question the assumed normality of the distribution.

For example, a 6σ event corresponds to a chance of about two parts per billion. For illustration, if events are taken to occur daily, this would correspond to an event expected every 1.4 million years. This gives a simple normality test: if one witnesses a 6σ in daily data and significantly fewer than 1 million years have passed, then a normal distribution most likely does not provide a good model for the magnitude or frequency of large deviations in this respect.

In The Black Swan, Nassim Nicholas Taleb gives the example of risk models according to which the Black Monday crash would correspond to a 36-σ event: the occurrence of such an event should instantly suggest that the model is flawed, i.e. that the process under consideration is not satisfactorily modelled by a normal distribution. Refined models should then be considered, e.g. by the introduction of stochastic volatility. In such discussions it is important to be aware of problem of the gambler's fallacy, which states that a single observation of a rare event does not contradict that the event is in fact rare. It is the observation a plurality of purportedly rare events that increasingly undermines the hypothesis that they are rare, i.e. the validity of the assumed model. A proper modelling of this process of gradual loss of confidence in a hypothesis would involve the designation of prior probability not just to the hypothesis itself but to all possible alternative hypotheses. For this reason, statistical hypothesis testing works not so much by confirming a hypothesis considered to be likely, but by refuting hypotheses considered unlikely.

Table of numerical values

Because of the exponential tails of the normal distribution, odds of higher deviations decrease very quickly. From the rules for normally distributed data for a daily event:

Range	Expected Fraction of Population Inside Range	Approximate Expected Frequency Outside Range	Approximate Frequency for Daily Event
μ ± 0.5σ	0.382924922548026	2 in 3	Four times a week
μ ± σ	0.682689492137086	1 in 3	Twice a week
μ ± 1.5σ	0.866385597462284	1 in 7	Weekly
μ ± 2σ	0.954499736103642	1 in 22	Every three weeks
μ ± 2.5σ	0.987580669348448	1 in 81	Quarterly
μ ± 3σ	0.997300203936740	1 in 370	Yearly
μ ± 3.5σ	0.999534741841929	1 in 2149	Every six years
μ ± 4σ	0.999936657516334	1 in 7004157870000000000♠15787	Every 43 years (twice in a lifetime)
μ ± 4.5σ	0.999993204653751	1 in 7005147160000000000♠147160	Every 403 years (once in the modern era)
μ ± 5σ	0.999999426696856	1 in 7006174427800000000♠1744278	Every 7003477600000000000♠4776 years (once in recorded history)
μ ± 5.5σ	0.999999962020875	1 in 7007263302540000000♠26330254	Every 7004720900000000000♠72090 years (thrice in history of modern humankind)
μ ± 6σ	0.999999998026825	1 in 7008506797346000000♠506797346	Every 1.38 million years (twice in history of humankind)
μ ± 6.5σ	0.999999999919680	1 in 7010124501973930000♠12450197393	Every 34 million years (twice since the extinction of dinosaurs)
μ ± 7σ	0.999999999997440	1 in 7011390682215445000♠390682215445	Every 1.07 billion years (a quarter of Earth's history)
μ ± $x$ σ	$\textstyle \operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)$	1 in $\textstyle {\frac {1}{1-\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)}}$	Every $\textstyle {\frac {1}{1-\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)}}$ days

References

↑ this usage of "three-sigma rule" entered common usage in the 2000s, e.g. cited in Schaum's Outline of Business Statistics. McGraw Hill Professional. 2003. p. 359 , and in Grafarend, Erik W. (2006). Linear and Nonlinear Models: Fixed Effects, Random Effects, and Mixed Models. Walter de Gruyter. p. 553.
↑
See:
- Wheeler, D. J.; Chambers, D. S. (1992). Understanding Statistical Process Control. SPC Press.
- Czitrom, Veronica; Spagon, Patrick D. (1997). Statistical Case Studies for Industrial Process Improvement. SIAM. p. 342.
- Pukelsheim, F. (1994). "The Three Sigma Rule". American Statistician. 48: 88–91.

External links

"The Normal Distribution" by Balasubramanian Narasimhan
"Calculate percentage proportion within x sigmas at WolframAlpha

Probability distributions

List

Discrete univariate with finite support	Benford Bernoulli beta-binomial binomial categorical hypergeometric Poisson binomial Rademacher discrete uniform Zipf Zipf–Mandelbrot

Discrete univariate with infinite support	beta negative binomial Borel Conway–Maxwell–Poisson discrete phase-type Delaporte extended negative binomial Gauss–Kuzmin geometric logarithmic negative binomial parabolic fractal Poisson Skellam Yule–Simon zeta

Continuous univariate supported on a bounded interval	arcsine ARGUS Balding–Nichols Bates beta beta rectangular Irwin–Hall Kumaraswamy logit-normal noncentral beta raised cosine reciprocal triangular U-quadratic uniform Wigner semicircle

Continuous univariate supported on a semi-infinite interval	Benini Benktander 1st kind Benktander 2nd kind beta prime Burr chi-squared chi Dagum Davis exponential-logarithmic Erlang exponential F folded normal Flory–Schulz Fréchet gamma gamma/Gompertz generalized inverse Gaussian Gompertz half-logistic half-normal Hotelling's T-squared hyper-Erlang hyperexponential hypoexponential inverse chi-squared scaled inverse chi-squared inverse Gaussian inverse gamma Kolmogorov Lévy log-Cauchy log-Laplace log-logistic log-normal matrix-exponential Maxwell–Boltzmann Maxwell–Jüttner Mittag-Leffler Nakagami noncentral chi-squared Pareto phase-type poly-Weibull Rayleigh relativistic Breit–Wigner Rice shifted Gompertz truncated normal type-2 Gumbel Weibull Discrete Weibull Wilks's lambda

Continuous univariate supported on the whole real line	Cauchy exponential power Fisher's z Gaussian q generalized normal generalized hyperbolic geometric stable Gumbel Holtsmark hyperbolic secant Johnson's S_U Landau Laplace asymmetric Laplace logistic noncentral t normal (Gaussian) normal-inverse Gaussian skew normal slash stable Student's t type-1 Gumbel Tracy–Widom variance-gamma Voigt

Continuous univariate with support whose type varies	generalized extreme value generalized Pareto Tukey lambda q-Gaussian q-exponential q-Weibull shifted log-logistic

Mixed continuous-discrete univariate	rectified Gaussian

Multivariate (joint)	Discrete Ewens multinomial Dirichlet-multinomial negative multinomial Continuous Dirichlet generalized Dirichlet multivariate normal multivariate stable multivariate t normal-inverse-gamma normal-gamma Matrix-valued inverse matrix gamma inverse-Wishart matrix normal matrix t matrix gamma normal-inverse-Wishart normal-Wishart Wishart

Directional	Univariate (circular) directional Circular uniform univariate von Mises wrapped normal wrapped Cauchy wrapped exponential wrapped asymmetric Laplace wrapped Lévy Bivariate (spherical) Kent Bivariate (toroidal) bivariate von Mises Multivariate von Mises–Fisher Bingham

Degenerate and singular	Degenerate Dirac delta function Singular Cantor

Families	Circular compound Poisson elliptical exponential natural exponential location-scale maximum entropy mixture Pearson Tweedie wrapped

This article is issued from Wikipedia - version of the 11/18/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.