Statistical Formulas For Programmers

DRAFT: May 19, 2013

Being able to apply statistics is like having a secret superpower.

Where most people see averages, you see confidence intervals.

When someone says “7 is greater than 5,” you declare that they're really the same.

In a cacophony of noise, you hear a cry for help.

Unfortunately, not enough programmers have this superpower. That's a shame, because the application of statistics can almost always enhance the display and interpretation of data.

As my modest contribution to developer-kind, I've collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.

Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors' web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I've also added links and references, so that even if you're unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.

Send suggestions and corrections to emmiller@gmail.com

1. Formulas For Reporting Averages

One of the first programming lessons in any language is to compute an average. But rarely does anyone stop to ask: what does the average actually tell us about the underlying data?

1.1 Corrected Standard Deviation

The standard deviation is a single number that reflects how spread out the data actually is. It should be reported alongside the average (unless the user will be confused).

$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Where:

$N$ is the number of observations
$x_i$ is the value of the $i$-th observation
$\bar{x}$ is the average value of $x_i$
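As a rough illustration, the corrected standard deviation can be sketched in Python as below. The function name `corrected_std` and the sample data are made up for this example; the standard library's `statistics.stdev` applies the same N - 1 correction and serves as a cross-check.

```python
import math
import statistics

def corrected_std(xs):
    """Sample standard deviation with the N - 1 correction."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    # Sum of squared deviations from the mean, divided by N - 1 rather than N.
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
s = corrected_std(data)
same_as_stdlib = math.isclose(s, statistics.stdev(data))
```

In practice `statistics.stdev` is probably the better choice; the hand-written version is only there to make the division by N - 1 visible.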

Reference: Standard deviation (Wikipedia)

1.2 Standard Error of the Mean

From a statistical point of view, the "average" is really just an estimate of an underlying population mean. That estimate has uncertainty that is summarized by the standard error.

$$\mathrm{SE} = \frac{s}{\sqrt{N}}$$
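A minimal Python sketch of the standard error of the mean: the corrected standard deviation divided by the square root of the number of observations. The function name and data are illustrative assumptions.

```python
import math
import statistics

def standard_error(xs):
    """Standard error of the mean: corrected std. dev. over sqrt(N)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
se = standard_error(data)
```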

Reference: Standard error (Wikipedia)

1.3 Confidence Interval Around the Mean

A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data. It is a multiple of the standard error added to and subtracted from the mean.

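$$\mathrm{CI} = \bar{x} \pm t_{\alpha/2} \, \mathrm{SE}$$

Where:

$\alpha$ is the significance level, typically 5% (one minus the confidence level)
$t_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a t-distribution with $N - 1$ degrees of freedom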
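Here is one way the interval might be computed in Python. The standard library has no t-distribution quantile function, so this sketch assumes the caller looks up the critical value (from a printed table or a statistics package) and passes it in as `t_crit`. The value 2.365 used below is the tabulated 0.975 quantile for 7 degrees of freedom.

```python
import math
import statistics

def mean_confidence_interval(xs, t_crit):
    """Return (low, high): the mean plus or minus t_crit standard errors."""
    mean = statistics.fmean(xs)
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return mean - t_crit * se, mean + t_crit * se

data = [2, 4, 4, 4, 5, 5, 7, 9]  # 8 observations, so 7 degrees of freedom
low, high = mean_confidence_interval(data, t_crit=2.365)
```

With only eight observations the interval is wide (roughly 3.2 to 6.8 around a mean of 5), which is exactly the information a bare average hides.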

Reference: Confidence interval (Wikipedia)

1.4 Two-Sample T-Test

A two-sample t-test can tell whether two groups of observations differ in their mean.

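The test statistic is given by:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

Here $\bar{x}_j$, $s_j$, and $n_j$ are the mean, corrected standard deviation, and number of observations in group $j$.

The hypothesis of equal means is rejected if $|t|$ exceeds the $1 - \alpha/2$ quantile of a t-distribution with degrees of freedom equal to:

$$\mathrm{df} = \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\left( s_1^2/n_1 \right)^2/(n_1 - 1) + \left( s_2^2/n_2 \right)^2/(n_2 - 1)}$$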
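The sketch below computes a two-sample t statistic with unequal variances (often called Welch's t-test) along with its approximate degrees of freedom. The function name `welch_t` and the two small groups are assumptions made for illustration; the final comparison against a t quantile is left to a table or a statistics library.

```python
import math
import statistics

def welch_t(xs, ys):
    """Return (t, df) for a two-sample t-test with unequal variances."""
    n1, n2 = len(xs), len(ys)
    v1 = statistics.variance(xs) / n1  # s1^2 / n1
    v2 = statistics.variance(ys) / n2  # s2^2 / n2
    t = (statistics.fmean(xs) - statistics.fmean(ys)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, df = welch_t(group_a, group_b)
```

For these groups the statistic is about -1.90 with about 5.9 degrees of freedom. The tabulated 0.975 quantiles for 5 and 6 degrees of freedom are 2.571 and 2.447, so equal means would not be rejected at the 5% level even though one average is double the other.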

You can see a demonstration of these concepts in Evan's Awesome Two-Sample T-Test.

Reference: Student's t-test (Wikipedia)

2. Formulas For Reporting Proportions

It's common to report the relative proportions of binary outcomes or categorical data, but in general these are meaningless without confidence intervals and tests of independence.

2.1 Confidence Interval of a Bernoulli Parameter

A Bernoulli parameter is the proportion underlying a binary-outcome event (for example, the percent of the time a coin comes up heads). The confidence interval is given by:

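$$\mathrm{CI} = \left( p + \frac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2} \sqrt{\frac{p(1-p) + z_{\alpha/2}^2/(4N)}{N}} \right) \Bigg/ \left( 1 + \frac{z_{\alpha/2}^2}{N} \right)$$

Where:

$p$ is the observed proportion of interest
$z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a normal distribution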

This formula can also be used as a sorting criterion.
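As a hedged example, the following Python computes the Wilson score interval for a binomial proportion, one common choice of confidence interval for a Bernoulli parameter, and then sorts some made-up items by the lower limit. Sorting by the lower limit is an assumption about how the interval would be used as a sorting criterion; the item names and counts are invented.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a proportion, returned as (low, high)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # normal quantile
    p = successes / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

low, high = wilson_interval(8, 10)

# Hypothetical items mapped to (positive ratings, total ratings).
items = {"a": (8, 10), "b": (80, 100), "c": (1, 1)}
ranked = sorted(items, key=lambda k: wilson_interval(*items[k])[0], reverse=True)
```

Item "c" has a perfect 100% score but only one rating, so its lower limit is small and it sorts last; "b" and "a" share the same 80% proportion, but "b" ranks first because it has ten times the data.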

Reference: Binomial proportion confidence interval (Wikipedia)

2.2 Multinomial Confidence Intervals

If you have more than two categories, a multinomial confidence interval supplies upper and lower confidence limits on all of the category proportions at once. The formula is nearly identical to the preceding one.

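$$\mathrm{CI}_j = \left( p_j + \frac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2} \sqrt{\frac{p_j(1-p_j) + z_{\alpha/2}^2/(4N)}{N}} \right) \Bigg/ \left( 1 + \frac{z_{\alpha/2}^2}{N} \right)$$

Where:

$p_j$ is the observed proportion of the $j$-th category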

Reference: Confidence Intervals for Multinomial Proportions

2.3 Chi-Squared Test

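Pearson's chi-squared test can detect whether the distribution of row counts seems to differ across columns (or vice versa). It is useful when comparing two or more sets of category proportions.

The test statistic, called $X^2$, is computed as:

$$X^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

Where:

$n$ is the number of rows
$m$ is the number of columns
$O_{i,j}$ is the observed count in row $i$ and column $j$
$E_{i,j}$ is the expected count in row $i$ and column $j$

The expected count is given by:

$$E_{i,j} = \frac{\left( \sum_{k=1}^{m} O_{i,k} \right) \left( \sum_{l=1}^{n} O_{l,j} \right)}{N}$$

That is, the row total times the column total, divided by the total count $N$.

A statistical dependence exists if $X^2$ is greater than the $1 - \alpha$ quantile of a $\chi^2$ distribution with $(m - 1) \times (n - 1)$ degrees of freedom.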
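A small Python sketch of Pearson's chi-squared statistic for a table of counts, where each expected count is the row total times the column total divided by the grand total. The function name and the 2x2 table are illustrative assumptions, and the sketch does not guard against empty rows or columns.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom) for a list-of-rows count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

counts = [[10, 20],
          [20, 10]]
stat, dof = chi_squared(counts)
```

Here the statistic is about 6.67 with 1 degree of freedom. The tabulated 0.95 quantile of a chi-squared distribution with 1 degree of freedom is 3.841, so this table would indicate a dependence at the 5% level.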

You can see a 2x2 demonstration of these concepts in Evan's Awesome Chi-Squared Test.

Reference: Pearson's chi-squared test (Wikipedia)

3. Formulas For Reporting Count Data

If the incoming events are independent, their counts are well-described by a Poisson distribution. A Poisson distribution takes a parameter $\lambda$, which is the distribution's mean; that is, the average arrival rate of events per unit time.

3.1. Standard Deviation of a Poisson Distribution

The standard deviation of Poisson data usually doesn't need to be explicitly calculated. Instead it can be inferred from the Poisson parameter:

$$\sigma = \sqrt{\lambda}$$

This fact can be used to read an unlabeled sales chart, for example.

Reference: Poisson distribution (Wikipedia)

3.2. Confidence Interval Around the Poisson Parameter

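The confidence interval around the Poisson parameter represents the set of arrival rates that can't be rejected by the data. It can be inferred from a single data point of $c$ events observed over $t$ time periods with the following formula:

$$\mathrm{CI} = \left( \frac{\gamma^{-1}(\alpha/2,\, c)}{t},\; \frac{\gamma^{-1}(1 - \alpha/2,\, c + 1)}{t} \right)$$

Where:

$\gamma^{-1}(p, c)$ is the inverse of the (regularized) lower incomplete gamma function with shape parameter $c$, evaluated at probability $p$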
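The Python standard library has no inverse incomplete gamma function, so the sketch below takes a different but equivalent route: it finds the two rates at which the observed count sits exactly in the lower or upper `alpha/2` tail of a Poisson distribution, using bisection on the Poisson c.d.f. All names here are assumptions for illustration, and because `math.exp(-lam)` underflows for very large rates, this is only meant for modest counts (up to a few hundred events).

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def _bisect(f, lo, hi, tol=1e-10):
    """Root of a decreasing function f on [lo, hi], with f(lo) > 0 > f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_rate_interval(c, t, alpha=0.05):
    """Confidence interval for the rate, given c events in t time periods."""
    bracket = 10 * (c + 10)  # generous upper search bound (an assumption)
    if c == 0:
        low = 0.0
    else:
        # Lower limit: the rate at which P(X >= c) equals alpha / 2.
        low = _bisect(lambda lam: poisson_cdf(c - 1, lam) - (1 - alpha / 2),
                      0.0, bracket)
    # Upper limit: the rate at which P(X <= c) equals alpha / 2.
    high = _bisect(lambda lam: poisson_cdf(c, lam) - alpha / 2, 0.0, bracket)
    return low / t, high / t

low, high = poisson_rate_interval(5, 1)
```

A single observation of 5 events gives an interval of roughly 1.6 to 11.7 events per period. If SciPy is available, `scipy.special.gammaincinv` should give the same limits directly and without the underflow limitation.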

Reference: Confidence Intervals for the Mean of a Poisson Distribution

3.3. Conditional Test of Two Poisson Parameters

Please never highlight one raw count as greater than another without testing the difference.

From a statistical point of view, 5 events is indistinguishable from 7 events. Before reporting in bright red text that one count is greater than another, it's best to perform a test of the two Poisson means.

The p-value is given by:

$$p = 2 \times \frac{c!}{t^c} \times \min \left\{ \sum_{i=0}^{c_1} \frac{t_1^i \, t_2^{c-i}}{i! \, (c-i)!},\; \sum_{i=c_1}^{c} \frac{t_1^i \, t_2^{c-i}}{i! \, (c-i)!} \right\}$$

Where:

Observation 1 consists of $c_1$ events over $t_1$ time periods
Observation 2 consists of $c_2$ events over $t_2$ time periods
$c = c_1 + c_2$ and $t = t_1 + t_2$
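The conditional test treats the first count as binomial, given the combined count, with success probability equal to the first observation's share of the total time. A minimal Python sketch under that reading follows; the function name is an assumption, and capping the result at 1 is an addition of this sketch, since doubling the smaller tail can otherwise exceed 1.

```python
import math

def poisson_means_p_value(c1, t1, c2, t2):
    """Two-sided conditional test that two Poisson rates are equal."""
    c = c1 + c2
    t = t1 + t2

    def prob(i):
        # Binomial probability that i of the c events fall in observation 1.
        return math.comb(c, i) * (t1 / t) ** i * (t2 / t) ** (c - i)

    lower_tail = sum(prob(i) for i in range(0, c1 + 1))
    upper_tail = sum(prob(i) for i in range(c1, c + 1))
    return min(1.0, 2 * min(lower_tail, upper_tail))

# 5 events versus 7 events, each observed over one time period.
p = poisson_means_p_value(5, 1, 7, 1)
```

The p-value comes out near 0.77, nowhere close to a conventional 5% threshold, which is the sense in which 5 events cannot be told apart from 7.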

You can see a demonstration of these concepts in Evan's Awesome Poisson Means Test.

Reference: A more powerful test for comparing two Poisson means (PDF)

4. Formulas For Comparing Distributions

If you want to test whether groups of observations come from the same (unknown) distribution, or if a single group of observations comes from a known distribution, you'll need a Kolmogorov-Smirnov test. A K-S test will test the entire distribution for equality, not just the distribution mean.

4.1. Comparing An Empirical Distribution to a Known Distribution

The simplest version is a one-sample K-S test, which compares a sample of $n$ points having an observed cumulative distribution function $F$ to a known distribution function having a c.d.f. of $G$. The test statistic is:

$$D_n = \sup_x |F(x) - G(x)|$$

In plain English, $D_n$ is the absolute value of the largest difference in the two c.d.f.s for any value of $x$.

The critical value of $D_n$ is given by $K_\alpha / \sqrt{n}$, where $K_\alpha$ is the value of $x$ that solves:

$$1 - \alpha = \frac{\sqrt{2\pi}}{x} \sum_{k=1}^{\infty} \exp \left( -\frac{(2k-1)^2 \pi^2}{8 x^2} \right)$$

The critical value must be solved iteratively, e.g. by Newton's method. If only the p-value is needed, it can be computed directly by setting $x = \sqrt{n} \, D_n$ and solving the above for $\alpha$.
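The following Python sketch covers the three pieces of a one-sample K-S test: the statistic, the limiting Kolmogorov c.d.f. as a truncated series, and the critical value. It uses bisection rather than Newton's method for the iterative step, purely to keep the code short. The function names, the 100-term truncation, and the search bracket are assumptions, and the series is a large-sample approximation.

```python
import math

def ks_statistic(sample, cdf):
    """Largest gap between the empirical c.d.f. of sample and a known cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        g = cdf(x)
        # The empirical c.d.f. jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - g), abs(i / n - g))
    return d

def kolmogorov_cdf(x, terms=100):
    """Truncated series for the limiting distribution of sqrt(n) * D_n."""
    if x <= 0:
        return 0.0
    total = sum(math.exp(-((2 * k - 1) ** 2) * math.pi ** 2 / (8 * x * x))
                for k in range(1, terms + 1))
    return math.sqrt(2 * math.pi) / x * total

def kolmogorov_critical(alpha, lo=0.2, hi=5.0):
    """Solve kolmogorov_cdf(x) = 1 - alpha by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if kolmogorov_cdf(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def ks_p_value(d, n):
    """Approximate p-value for an observed statistic d with n points."""
    return 1.0 - kolmogorov_cdf(math.sqrt(n) * d)

# Compare three made-up points against a uniform distribution on [0, 1].
d = ks_statistic([0.1, 0.4, 0.7], cdf=lambda x: x)
k_05 = kolmogorov_critical(0.05)
```

The 5% value of the constant comes out near 1.358, so the critical value of the statistic is about 1.358 divided by the square root of the sample size.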

Reference: Kolmogorov-Smirnov Test (Wikipedia)

4.2. Comparing Two Empirical Distributions

The two-sample version is similar, except the test statistic is given by:

$$D_{n_1,n_2} = \sup_x |F_1(x) - F_2(x)|$$

Where $F_1$ and $F_2$ are the empirical c.d.f.s of the two samples, having $n_1$ and $n_2$ observations, respectively. The critical value of the test statistic is $K_\alpha \sqrt{(n_1 + n_2)/(n_1 n_2)}$ with the same value of $K_\alpha$ above.
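A short Python sketch of the two-sample statistic and its large-sample critical value. Because both empirical c.d.f.s are step functions, it is enough to compare them at the observed data points. The default constant 1.358 is the commonly tabulated 5% value for the Kolmogorov distribution; the function names and data are illustrative assumptions.

```python
import math
from bisect import bisect_right

def ks_two_sample(xs, ys):
    """Largest gap between the empirical c.d.f.s of two samples."""
    xs, ys = sorted(xs), sorted(ys)
    n1, n2 = len(xs), len(ys)
    d = 0.0
    for v in xs + ys:
        f1 = bisect_right(xs, v) / n1  # fraction of xs that are <= v
        f2 = bisect_right(ys, v) / n2  # fraction of ys that are <= v
        d = max(d, abs(f1 - f2))
    return d

def ks_two_sample_critical(n1, n2, k_alpha=1.358):
    """Large-sample critical value; k_alpha defaults to the 5% constant."""
    return k_alpha * math.sqrt((n1 + n2) / (n1 * n2))

first = [1, 2, 3, 4]
second = [3, 4, 5, 6]
d = ks_two_sample(first, second)
critical = ks_two_sample_critical(len(first), len(second))
```

With only four points per sample the statistic (0.5) falls well short of the critical value (about 0.96), and the large-sample approximation itself is rough at this size.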

Reference: Kolmogorov-Smirnov Test (Wikipedia)

4.3. Comparing Three or More Empirical Distributions

A $k$-sample extension of Kolmogorov-Smirnov was described by J. Kiefer in a 1959 paper. The test statistic is:

$$T = \sup_x \sum_{j=1}^{k} n_j \left( F_j(x) - \bar{F}(x) \right)^2$$

Where $\bar{F}$ is the c.d.f. of the combined samples. The critical value of $T$ is $a^2$ where $a$ solves:

$$1 - \alpha = \frac{4}{\Gamma\left(\frac{h}{2}\right) 2^{h/2} a^h} \sum_{n=1}^{\infty} \frac{\left( \gamma_{(h-2)/2,n} \right)^{h-2} \exp \left[ -\left( \gamma_{(h-2)/2,n} \right)^2 / (2a^2) \right]}{\left[ J_{h/2}\left( \gamma_{(h-2)/2,n} \right) \right]^2}$$

Where:

$h = k - 1$
$J_{h/2}$ is a Bessel function of the first kind with order $h/2$
$\gamma_{(h-2)/2,n}$ is the $n$-th zero of $J_{(h-2)/2}$

To compute the critical value, this equation must also be solved iteratively. When $k = 2$, the equation reduces to a two-sample Kolmogorov-Smirnov test. The case of $k = 4$ can also be reduced to a simpler form, but for other values of $k$, the equation cannot be reduced.
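The sketch below computes only the k-sample test statistic: the largest, over all data points, of the size-weighted squared gaps between each sample's empirical c.d.f. and the pooled c.d.f. Solving for the critical value needs Bessel functions and their zeros, which the Python standard library does not provide, so that step is left out. The function name and data are assumptions.

```python
from bisect import bisect_right

def kiefer_statistic(samples):
    """k-sample statistic: sup over x of sum_j n_j * (F_j(x) - F_bar(x))^2."""
    sorted_samples = [sorted(s) for s in samples]
    pooled = sorted(v for s in sorted_samples for v in s)
    total = len(pooled)
    t_stat = 0.0
    for v in pooled:
        f_bar = bisect_right(pooled, v) / total  # pooled empirical c.d.f.
        value = sum(len(s) * (bisect_right(s, v) / len(s) - f_bar) ** 2
                    for s in sorted_samples)
        t_stat = max(t_stat, value)
    return t_stat

t_two = kiefer_statistic([[1, 2, 3, 4], [3, 4, 5, 6]])
t_three = kiefer_statistic([[1, 2, 3], [2, 3, 4], [7, 8, 9]])
```

As a sanity check, with two samples the statistic equals the squared two-sample K-S statistic multiplied by `n1 * n2 / (n1 + n2)`. For the first call that is 2 times 0.25, or 0.5.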

Reference: K-sample analogues of the Kolmogorov-Smirnov and Cramer-v. Mises tests (JSTOR)

5. Formulas For Drawing a Trend Line

Trend lines (or best-fit lines) can be used to establish a relationship between two variables and predict future values.

5.1. Slope of a Best-Fit Line

The slope of a best-fit (least squares) line is:

$$m = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Where:

$x_1, \ldots, x_N$ is the independent variable with sample mean $\bar{x}$
$y_1, \ldots, y_N$ is the dependent variable with sample mean $\bar{y}$
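A minimal Python sketch of the least-squares slope: the sum of cross-deviations divided by the sum of squared deviations of the independent variable. The function name and data points are made up for illustration.

```python
def best_fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m = best_fit_slope(xs, ys)
```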

5.2. Standard Error of the Slope

The standard error around the estimated slope is:

$$\mathrm{SE} = \frac{\sqrt{\sum_{i=1}^{N} \left( y_i - \bar{y} - m (x_i - \bar{x}) \right)^2 / (N - 2)}}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}}$$
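Continuing in the same spirit, this sketch computes the standard error of a least-squares slope from the residuals of the fitted line, dividing the residual sum of squares by N - 2. The function name and data are illustrative assumptions.

```python
import math

def slope_standard_error(xs, ys):
    """Standard error of the least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    # Residuals of the fitted line, which passes through (x_bar, y_bar).
    rss = sum((y - y_bar - m * (x - x_bar)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / (n - 2)) / math.sqrt(sxx)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
se = slope_standard_error(xs, ys)
```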

5.3. Confidence Interval Around the Slope

The confidence interval is constructed as:

$$\mathrm{CI} = m \pm t_{\alpha/2} \, \mathrm{SE}$$

Where:

$\alpha$ is the significance level, typically 5% (one minus the confidence level)
$t_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a t-distribution with $N - 2$ degrees of freedom
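Putting the pieces together, here is a hedged Python sketch of a confidence interval around a least-squares slope. As the standard library has no t-distribution quantiles, the critical value is passed in by the caller; 3.182 is the tabulated 0.975 quantile for 3 degrees of freedom, matching the five made-up data points.

```python
import math

def slope_confidence_interval(xs, ys, t_crit):
    """Return (low, high): the slope plus or minus t_crit standard errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    rss = sum((y - y_bar - m * (x - x_bar)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2)) / math.sqrt(sxx)
    return m - t_crit * se, m + t_crit * se

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
low, high = slope_confidence_interval(xs, ys, t_crit=3.182)
```

The interval runs from about -0.3 to 1.5. Since it includes zero, these five points do not establish an upward trend, even though the fitted slope is 0.6.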

Reference: Simple linear regression (Wikipedia)

