Statistical Formulas For Programmers

By Evan Miller

DRAFT: May 19, 2013

Being able to apply statistics is like having a secret superpower.

Where most people see averages, you see confidence intervals.

When someone says “7 is greater than 5,” you declare that they're really the same.

In a cacophony of noise, you hear a cry for help.

Unfortunately, not enough programmers have this superpower. That's a shame, because the application of statistics can almost always enhance the display and interpretation of data.

As my modest contribution to developer-kind, I've collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.

Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors' web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I've also added links and references, so that even if you're unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.

Send suggestions and corrections to emmiller@gmail.com


Table of Contents

  1. Formulas For Reporting Averages
    1. Corrected Standard Deviation
    2. Standard Error of the Mean
    3. Confidence Interval Around the Mean
    4. Two-Sample T-Test
  2. Formulas For Reporting Proportions
    1. Confidence Interval of a Bernoulli Parameter
    2. Multinomial Confidence Intervals
    3. Chi-Squared Test
  3. Formulas For Reporting Count Data
    1. Standard Deviation of a Poisson Distribution
    2. Confidence Interval Around the Poisson Parameter
    3. Conditional Test of Two Poisson Parameters
  4. Formulas For Comparing Distributions
    1. Comparing an Empirical Distribution to a Known Distribution
    2. Comparing Two Empirical Distributions
    3. Comparing Three or More Empirical Distributions
  5. Formulas For Drawing a Trend Line
    1. Slope of a Best-Fit Trend Line
    2. Standard Error of the Slope
    3. Confidence Interval Around the Slope

1. Formulas For Reporting Averages

One of the first programming lessons in any language is to compute an average. But rarely does anyone stop to ask: what does the average actually tell us about the underlying data?

1.1 Corrected Standard Deviation

The standard deviation is a single number that reflects how spread out the data actually is. It should be reported alongside the average (unless the user will be confused).

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

Where:

  • $N$ is the number of observations
  • $x_i$ is the value of the $i$th observation
  • $\bar{x}$ is the average value of the $x_i$
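
As a rough illustration, the corrected standard deviation can be computed in a few lines of Python. The function name and the sample data below are made up for the example.

```python
import math

def corrected_std(xs):
    """Sample standard deviation with the N - 1 correction."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_deviations / (n - 1))

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
s = corrected_std(sample)
print(round(s, 4))
```

Python's standard library offers the same calculation as statistics.stdev, which also divides by N - 1.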

Reference: Standard deviation (Wikipedia)

1.2 Standard Error of the Mean

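From a statistical point of view, the "average" is really just an estimate of an underlying population mean. That estimate has uncertainty that is summarized by the standard error.

$$ SE = \frac{s}{\sqrt{N}} $$

Here $s$ is the corrected standard deviation and $N$ is the number of observations.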
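
A minimal sketch of the standard error in Python, assuming the data fits in a list. The function name and sample values are illustrative only.

```python
import math
import statistics

def standard_error(xs):
    """Corrected standard deviation divided by the square root of N."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
se = standard_error(sample)
print(round(se, 4))
```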

Reference: Standard error (Wikipedia)

1.3 Confidence Interval Around the Mean

A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data. It is a multiple of the standard error added to and subtracted from the mean.

$$ CI = \bar{x} \pm t_{\alpha/2} \, SE $$

Where:

  • $\alpha$ is the significance level, typically 5% (one minus the confidence level)
  • $t_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a t-distribution with $N - 1$ degrees of freedom
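
The interval can be sketched as follows. Python's standard library has no t-distribution quantile function, so this sketch assumes the critical value is looked up in a table (or computed by a statistics package) and passed in. The constant 2.365 is the tabulated 0.975 quantile for 7 degrees of freedom; the names and data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(xs, t_crit):
    """Mean plus or minus t_crit standard errors.

    t_crit must be the 1 - alpha/2 quantile of a t-distribution
    with len(xs) - 1 degrees of freedom, supplied by the caller.
    """
    mean = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return mean - t_crit * se, mean + t_crit * se

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data, N = 8
T_CRIT_DF7 = 2.365  # table value: 0.975 quantile, 7 degrees of freedom
low, high = mean_confidence_interval(sample, T_CRIT_DF7)
print(round(low, 3), round(high, 3))
```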

Reference: Confidence interval (Wikipedia)

1.4 Two-Sample T-Test

A two-sample t-test can tell whether two groups of observations differ in their mean.

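The test statistic is given by:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $$

The hypothesis of equal means is rejected if $|t|$ exceeds the $1 - \alpha/2$ quantile of a t distribution with degrees of freedom equal to:

$$ df = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1)} $$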
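
To make the two quantities concrete, here is a small Python sketch that computes the test statistic and the degrees of freedom for two samples with sizes n1 and n2 and corrected standard deviations s1 and s2. The function name and data are made up, and the comparison against the t quantile is left to a table or a statistics package.

```python
import math
import statistics

def two_sample_t(xs, ys):
    """Return (t, df) for the unequal-variance two-sample t-test."""
    n1, n2 = len(xs), len(ys)
    v1 = statistics.variance(xs) / n1  # s1^2 / n1
    v2 = statistics.variance(ys) / n2  # s2^2 / n2
    t = (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]  # hypothetical data
group_b = [2, 3, 4, 5, 6]
t, df = two_sample_t(group_a, group_b)
print(t, df)
```

The degrees of freedom are usually not a whole number, so a table lookup means rounding down to stay conservative.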

You can see a demonstration of these concepts in Evan's Awesome Two-Sample T-Test.

Reference: Student's t-test (Wikipedia)


2. Formulas For Reporting Proportions

It's common to report the relative proportions of binary outcomes or categorical data, but in general these are meaningless without confidence intervals and tests of independence.

2.1 Confidence Interval of a Bernoulli Parameter

A Bernoulli parameter is the proportion underlying a binary-outcome event (for example, the percent of the time a coin comes up heads). The confidence interval is given by:

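$$ CI = \left( p + \frac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2} \sqrt{\left[ p(1 - p) + z_{\alpha/2}^2/(4N) \right] / N} \right) \Big/ \left( 1 + z_{\alpha/2}^2/N \right) $$

Where:

  • $p$ is the observed proportion of interest
  • $z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a normal distribution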

This formula can also be used as a sorting criterion.
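
As a sketch of both uses, the snippet below computes the interval (commonly called the Wilson score interval) and then ranks a few items by the lower bound. The item names and counts are invented for the example, and N is taken to be the number of trials.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, alpha=0.05):
    """Confidence interval for a proportion from successes out of n trials."""
    if n <= 0:
        raise ValueError("need at least one trial")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    center = p + z * z / (2 * n)
    spread = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

# hypothetical items: (positive ratings, total ratings)
items = {"a": (9, 10), "b": (80, 100), "c": (1, 1)}
ranked = sorted(items, key=lambda k: wilson_interval(*items[k])[0], reverse=True)
print(ranked)
```

Sorting by the lower bound favors items whose high proportion is backed by many observations, so a single 1-out-of-1 rating does not jump to the top.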

Reference: Binomial proportion confidence interval (Wikipedia)

2.2 Multinomial Confidence Intervals

If you have more than two categories, a multinomial confidence interval supplies upper and lower confidence limits on all of the category proportions at once. The formula is nearly identical to the preceding one.

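$$ CI = \left( p_j + \frac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2} \sqrt{\left[ p_j(1 - p_j) + z_{\alpha/2}^2/(4N) \right] / N} \right) \Big/ \left( 1 + z_{\alpha/2}^2/N \right) $$

Where:

  • $p_j$ is the observed proportion of the $j$th category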

Reference: Confidence Intervals for Multinomial Proportions

2.3 Chi-Squared Test

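Pearson's chi-squared test can detect whether the distribution of row counts seems to differ across columns (or vice versa). It is useful when comparing two or more sets of category proportions.

The test statistic, called $X^2$, is computed as:

$$ X^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} $$

Where:

  • $n$ is the number of rows
  • $m$ is the number of columns
  • $O_{i,j}$ is the observed count in row $i$ and column $j$
  • $E_{i,j}$ is the expected count in row $i$ and column $j$

The expected count is given by:

$$ E_{i,j} = \frac{\left( \sum_{k=1}^{n} O_{k,j} \right) \left( \sum_{l=1}^{m} O_{i,l} \right)}{N} $$

where $N$ is the total of all observed counts.

A statistical dependence exists if $X^2$ is greater than the ($1 - \alpha$) quantile of a $\chi^2$ distribution with $(m - 1) \times (n - 1)$ degrees of freedom.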
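
Here is a hedged Python sketch of the calculation for a table stored as a list of rows. The table is invented. The critical value is computed only for the one-degree-of-freedom case, using the fact that a chi-squared variable with one degree of freedom is the square of a standard normal; larger tables need a chi-squared table or a statistics package.

```python
from statistics import NormalDist

def chi_squared(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

table = [[20, 30], [30, 20]]  # hypothetical 2x2 counts
stat, df = chi_squared(table)

alpha = 0.05
# Valid for df == 1 only: the chi-squared quantile equals a squared normal quantile.
critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
dependent = stat > critical
print(stat, df, round(critical, 3), dependent)
```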

You can see a 2x2 demonstration of these concepts in Evan's Awesome Chi-Squared Test.

Reference: Pearson's chi-squared test (Wikipedia)


3. Formulas For Reporting Count Data

If the incoming events are independent, their counts are well-described by a Poisson distribution. A Poisson distribution takes a parameter $\lambda$, which is the distribution's mean: that is, the average arrival rate of events per unit time.

3.1. Standard Deviation of a Poisson Distribution

The standard deviation of Poisson data usually doesn't need to be explicitly calculated. Instead it can be inferred from the Poisson parameter:

$$ \sigma = \sqrt{\lambda} $$

This fact can be used to read an unlabeled sales chart, for example.

Reference: Poisson distribution (Wikipedia)

3.2. Confidence Interval Around the Poisson Parameter

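The confidence interval around the Poisson parameter represents the set of arrival rates that can't be rejected by the data. It can be inferred from a single data point of $c$ events observed over $t$ time periods with the following formula:

$$ CI = \left( \frac{\gamma^{-1}(\alpha/2,\, c)}{t},\ \frac{\gamma^{-1}(1 - \alpha/2,\, c + 1)}{t} \right) $$

Where:

  • $\gamma^{-1}(p, c)$ is the inverse of the (regularized) lower incomplete gamma function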
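
Python's standard library has no inverse incomplete gamma function, so the sketch below swaps in an equivalent route: it finds the two limits by bisection on the Poisson cumulative distribution, using the identity that the regularized lower incomplete gamma function at (c, λ) equals the probability of seeing c or more events at rate λ. All names are hypothetical, and the bracket used for the search is an assumption that suits moderate counts.

```python
import math

def poisson_cdf(c, lam):
    """P(X <= c) for X ~ Poisson(lam)."""
    term = math.exp(-lam)
    total = term
    for i in range(1, c + 1):
        term *= lam / i
        total += term
    return total

def _root_of_decreasing(f, lo, hi, iterations=200):
    """Bisection for a function that falls from positive to negative on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_rate_interval(c, t=1.0, alpha=0.05):
    """Interval for the arrival rate, given c events over t time periods."""
    upper_bracket = c + 10 * math.sqrt(c + 1) + 20  # assumed wide enough
    if c == 0:
        lower = 0.0
    else:
        # P(X >= c | lam) = alpha / 2
        lower = _root_of_decreasing(
            lambda lam: poisson_cdf(c - 1, lam) - (1 - alpha / 2), 0.0, upper_bracket)
    # P(X <= c | lam) = alpha / 2
    upper = _root_of_decreasing(
        lambda lam: poisson_cdf(c, lam) - alpha / 2, 0.0, upper_bracket)
    return lower / t, upper / t

low, high = poisson_rate_interval(10)
print(round(low, 3), round(high, 3))
```

With a numerical library available, calling its inverse regularized incomplete gamma function directly would be shorter and faster.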

Reference: Confidence Intervals for the Mean of a Poisson Distribution

3.3. Conditional Test of Two Poisson Parameters

Please never do this:

From a statistical point of view, 5 events is indistinguishable from 7 events. Before reporting in bright red text that one count is greater than another, it's best to perform a test of the two Poisson means.

The p-value is given by:

$$ p = 2 \times \frac{c!}{t^c} \times \sum_{i=0}^{c_1} \frac{t_1^i \, t_2^{c-i}}{i! \, (c-i)!} $$

Where:

  • Observation 1 consists of $c_1$ events over $t_1$ time periods
  • Observation 2 consists of $c_2$ events over $t_2$ time periods
  • $c = c_1 + c_2$ and $t = t_1 + t_2$
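
A short Python sketch of this test follows. Each term of the sum is a binomial probability with c trials and success probability t1/t, which is how the code computes it. One liberty taken here, as an assumption of the sketch: it doubles the smaller of the two tails and caps the result at 1, so the answer does not depend on which observation is listed first.

```python
from math import comb

def poisson_means_p_value(c1, t1, c2, t2):
    """Two-sided conditional test that two Poisson rates are equal."""
    c = c1 + c2
    share = t1 / (t1 + t2)  # chance an event lands in observation 1 under equal rates
    pmf = [comb(c, i) * share ** i * (1 - share) ** (c - i) for i in range(c + 1)]
    lower_tail = sum(pmf[: c1 + 1])  # the sum from i = 0 to c1
    upper_tail = sum(pmf[c1:])       # mirror-image tail (assumption of this sketch)
    return min(1.0, 2 * min(lower_tail, upper_tail))

p = poisson_means_p_value(5, 1, 7, 1)
print(round(p, 4))  # 0.7744
```

For 5 events against 7 over equal time periods the p-value is about 0.77, nowhere near significance.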

You can see a demonstration of these concepts in Evan's Awesome Poisson Means Test.

Reference: A more powerful test for comparing two Poisson means (PDF)


4. Formulas For Comparing Distributions

If you want to test whether groups of observations come from the same (unknown) distribution, or if a single group of observations comes from a known distribution, you'll need a Kolmogorov-Smirnov test. A K-S test will test the entire distribution for equality, not just the distribution mean.

4.1. Comparing An Empirical Distribution to a Known Distribution

The simplest version is a one-sample K-S test, which compares a sample of $n$ points having an observed cumulative distribution function $F$ to a known distribution function having a c.d.f. of $G$. The test statistic is:

$$ D_n = \max_x \left| F(x) - G(x) \right| $$

In plain English, $D_n$ is the absolute value of the largest difference in the two c.d.f.s for any value of $x$.

The critical value of $D_n$ is given by $K_\alpha / \sqrt{n}$, where $K_\alpha$ is the value of $x$ that solves:

$$ 1 - \alpha = \frac{\sqrt{2\pi}}{x} \sum_{k=1}^{\infty} \exp\left( -\frac{(2k-1)^2 \pi^2}{8x^2} \right) $$

The critical value must be solved for iteratively, e.g. by Newton's method. If only the p-value is needed, it can be computed directly by solving the above for $\alpha$.
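
The following Python sketch covers the statistic, the p-value, and the critical value. It uses bisection rather than Newton's method because it needs no derivative, and it truncates the infinite series at 100 terms, which is an assumption that holds up for the range of x that matters in practice. All function names are hypothetical.

```python
import math

def ks_statistic(sample, cdf):
    """Largest gap between the empirical c.d.f. and a known c.d.f."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        g = cdf(x)
        # the empirical c.d.f. jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - g, g - i / n)
    return d

def series_value(x, terms=100):
    """Right-hand side of the equation: 1 - alpha as a function of x."""
    if x < 0.05:
        return 0.0  # the series is numerically zero this far down
    total = sum(math.exp(-((2 * k - 1) ** 2) * math.pi ** 2 / (8 * x * x))
                for k in range(1, terms + 1))
    return math.sqrt(2 * math.pi) / x * total

def ks_p_value(d, n):
    """Solve for alpha at x = d * sqrt(n)."""
    return max(0.0, 1.0 - series_value(d * math.sqrt(n)))

def critical_k(alpha=0.05):
    """Bisection for the x where the series equals 1 - alpha."""
    lo, hi = 0.05, 5.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if series_value(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

d = ks_statistic([0.25, 0.5, 0.75], lambda x: x)  # compare with Uniform(0, 1)
print(d, round(critical_k(0.05), 3))
```

Dividing critical_k(alpha) by the square root of the sample size gives the critical value of the statistic.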

Reference: Kolmogorov-Smirnov Test (Wikipedia)

4.2. Comparing Two Empirical Distributions

The two-sample version is similar, except the test statistic is given by:

$$ D_{n_1,n_2} = \max_x \left| F_1(x) - F_2(x) \right| $$

Where $F_1$ and $F_2$ are the empirical c.d.f.s of the two samples, having $n_1$ and $n_2$ observations, respectively. The critical value of the test statistic is $K_\alpha \Big/ \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$ with the same value of $K_\alpha$ above.
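
A possible Python sketch of the two-sample statistic walks through both sorted samples at once. The default of 1.358 for the 5% value of K is an assumption taken from standard tables of the Kolmogorov distribution; the function names are made up.

```python
import math

def ks_two_sample(a, b):
    """Largest gap between the empirical c.d.f.s of two samples."""
    xs, ys = sorted(a), sorted(b)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x = min(xs[i], ys[j])
        # step both c.d.f.s past every observation equal to x (handles ties)
        while i < n1 and xs[i] <= x:
            i += 1
        while j < n2 and ys[j] <= x:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d

def critical_d(n1, n2, k_alpha=1.358):
    """Critical value of the statistic; k_alpha defaults to the 5% table value."""
    return k_alpha / math.sqrt(n1 * n2 / (n1 + n2))

d = ks_two_sample([1, 3], [2, 4])
print(d, round(critical_d(50, 50), 4))
```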

Reference: Kolmogorov-Smirnov Test (Wikipedia)

4.3. Comparing Three or More Empirical Distributions

A $k$-sample extension of Kolmogorov-Smirnov was described by J. Kiefer in a 1959 paper. The test statistic is:

$$ T = \max_x \sum_{j=1}^{k} n_j \left( F_j(x) - \bar{F}(x) \right)^2 $$

Where $\bar{F}$ is the c.d.f. of the combined samples. The critical value of $T$ is $a^2$ where $a$ solves:

$$ 1 - \alpha = \frac{4}{\Gamma(h/2) \, 2^{h/2} \, a^h} \sum_{n=1}^{\infty} \frac{(\gamma_{(h-2)/2,n})^{h-2} \exp\left[ -(\gamma_{(h-2)/2,n})^2 / (2a^2) \right]}{\left[ J_{h/2}(\gamma_{(h-2)/2,n}) \right]^2} $$

Where:

  • $h = k - 1$
  • $J_{h/2}$ is a Bessel function of the first kind with order $h/2$
  • $\gamma_{(h-2)/2,n}$ is the $n$th zero of $J_{(h-2)/2}$

To compute the critical value, this equation must also be solved iteratively. When $k = 2$, the equation reduces to a two-sample Kolmogorov-Smirnov test. The case of $k = 4$ can also be reduced to a simpler form, but for other values of $k$, the equation cannot be reduced.

Reference: K-sample analogues of the Kolmogorov-Smirnov and Cramer-v. Mises tests (JSTOR)


5. Formulas For Drawing a Trend Line

Trend lines (or best-fit lines) can be used to establish a relationship between two variables and predict future values.

5.1. Slope of a Best-Fit Line

The slope of a best-fit (least squares) line is:

$$ \hat{m} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2} $$

Where:

  • $x_1, \ldots, x_N$ is the independent variable with sample mean $\bar{x}$
  • $y_1, \ldots, y_N$ is the dependent variable with sample mean $\bar{y}$
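
In Python the slope might be computed like this. The function name and data points are invented for the example.

```python
def best_fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

slope = best_fit_slope([1, 2, 3, 4], [3, 5, 7, 9])  # points on the line y = 2x + 1
print(slope)  # 2.0
```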

5.2. Standard Error of the Slope

The standard error around the estimated slope is:

$$ SE = \frac{\sqrt{\frac{1}{N-2} \sum_{i=1}^{N} \left( y_i - \bar{y} - \hat{m}(x_i - \bar{x}) \right)^2}}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}} $$

5.3. Confidence Interval Around the Slope

The confidence interval is constructed as:

$$ CI = \hat{m} \pm t_{\alpha/2} \, SE $$

Where:

  • $\alpha$ is the significance level, typically 5% (one minus the confidence level)
  • $t_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a t-distribution with $N - 2$ degrees of freedom
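
Putting the slope, its standard error, and the interval together, a minimal Python sketch could look like the one below. As the standard library has no t quantile function, the critical value is assumed to come from a table: 3.182 is the tabulated 0.975 quantile for 3 degrees of freedom. The names and data are hypothetical.

```python
import math

def slope_with_interval(xs, ys, t_crit):
    """Return (slope, standard error, (low, high)) for a least-squares line.

    t_crit must be the 1 - alpha/2 quantile of a t-distribution
    with len(xs) - 2 degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    residual_ss = sum((y - y_bar - m * (x - x_bar)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(residual_ss / (n - 2)) / math.sqrt(sxx)
    return m, se, (m - t_crit * se, m + t_crit * se)

xs = [1, 2, 3, 4, 5]  # hypothetical data, N = 5
ys = [2, 4, 5, 4, 5]
T_CRIT_DF3 = 3.182  # table value: 0.975 quantile, 3 degrees of freedom
m, se, (low, high) = slope_with_interval(xs, ys, T_CRIT_DF3)
print(round(m, 3), round(se, 3), round(low, 3), round(high, 3))
```

In this made-up example the interval includes zero, so the upward trend could not be distinguished from a flat line.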

Reference: Simple linear regression (Wikipedia)
