Benford’s Law

Benford’s Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the digit 1 occurs as the leading digit about 30% of the time, while larger digits occur in that position less frequently: 9 appears as the first digit less than 5% of the time. Benford’s Law also describes the expected distribution of digits beyond the first, which approaches a uniform distribution.

This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.

It is named after physicist Frank Benford, who stated it in 1938,[1] although it had been previously stated by Simon Newcomb in 1881.

Mathematical statement

A logarithmic scale bar. Picking a random x position uniformly on this number line, roughly 30% of the time the first digit of the number will be 1.

A set of numbers is said to satisfy Benford’s Law if the leading digit d (d ∈ {1, …, 9}) occurs with probability

P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(\frac{d+1}{d}\right) = \log_{10}\left(1 + \frac{1}{d}\right).

Numerically, the leading digits have the following distribution in Benford’s Law, where d is the leading digit and P(d) the probability:

d    P(d)
1    30.1%
2    17.6%
3    12.5%
4     9.7%
5     7.9%
6     6.7%
7     5.8%
8     5.1%
9     4.6%
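
As a quick check, the table can be reproduced directly from the formula. A minimal sketch in Python (the article itself prescribes no code; the language choice here is illustrative):

    import math

    # Leading-digit probabilities under Benford's Law; reproduces the table above.
    for d in range(1, 10):
        print(f"{d}: {math.log10(1 + 1 / d):.1%}")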

The quantity P(d) is proportional to the space between d and d + 1 on a logarithmic scale. Therefore, this is the distribution expected if the mantissae of the logarithms of the numbers (but not the numbers themselves) are uniformly and randomly distributed. For example, a number x, constrained to lie between 1 and 10, starts with the digit 1 if 1 ≤ x < 2, and starts with the digit 9 if 9 ≤ x < 10. Therefore, x starts with the digit 1 if log 1 ≤ log x < log 2, or starts with 9 if log 9 ≤ log x < log 10. The interval [log 1, log 2] is much wider than the interval [log 9, log 10] (0.30 and 0.05 respectively); therefore if log x is uniformly and randomly distributed, it is much more likely to fall into the wider interval than the narrower interval, i.e. more likely to start with 1 than with 9. The probabilities are proportional to the interval widths, and this gives the equation above. (The above discussion assumed x is between 1 and 10, but the result is the same no matter how many digits x has before the decimal point.)
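
This argument can also be checked numerically: if the mantissa of log10 x is uniformly distributed, then sampling x = 10^u with u uniform on [0, 1) should reproduce the Benford frequencies. A short sketch (the sample size and seed are arbitrary choices):

    import math
    import random

    # If log10(x) mod 1 is uniform, first digits should follow Benford's Law.
    random.seed(0)
    N = 100_000
    counts = [0] * 10
    for _ in range(N):
        x = 10 ** random.random()      # x in [1, 10); log10(x) uniform on [0, 1)
        counts[int(str(x)[0])] += 1

    for d in range(1, 10):
        print(f"{d}: observed {counts[d] / N:.3f}, expected {math.log10(1 + 1 / d):.3f}")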

An extension of Benford’s Law predicts the distribution of first digits in other bases besides decimal; in fact, any base b ≥ 2. The general form is:

P(d) = \log_{b}(d+1) - \log_{b}(d) = \log_{b}\left(1 + \frac{1}{d}\right).

For b = 2 (the binary number system), Benford’s Law is true but trivial: All binary numbers (except for 0) start with the digit 1. (On the other hand, the generalization of Benford’s law to second and later digits is not trivial, even for binary numbers.) Also, Benford’s Law does not apply to unary systems such as tally marks.
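
The base-b form translates directly into code. A minimal sketch (the function name is ours, for illustration):

    import math

    def benford_pmf(d: int, base: int = 10) -> float:
        """Probability that d is the leading digit in the given base (1 <= d < base)."""
        if not 1 <= d < base:
            raise ValueError("leading digit must lie between 1 and base - 1")
        return math.log(1 + 1 / d, base)

    print(benford_pmf(1, base=2))    # 1.0: in binary, every number leads with 1
    print(benford_pmf(1, base=16))   # 0.25: in hexadecimal, 1 leads 25% of the time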

History

The discovery of Benford’s Law goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm tables (used at that time to perform calculations) the earlier pages (which contained numbers that started with 1) were much more worn than the other pages.[2] Newcomb’s published result is the first known instance of this observation, and it includes a distribution on the second digit as well. Newcomb proposed a law stating that the probability of a single digit N being the first digit of a number was equal to log(N + 1) − log(N).

The phenomenon was again noted in 1938 by the physicist Frank Benford,[1] who tested it on data from 20 different domains. His data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of Reader’s Digest, the street addresses of the first 342 persons listed in American Men of Science, and 418 death rates. The total number of observations used in the paper was 20,229. The discovery was later named after Benford, making it an example of Stigler’s Law.

Explanations

Benford’s Law has been explained in various ways.

Outcomes of exponential growth processes

The precise form of Benford’s Law can be explained if one assumes that the mantissae of the logarithms of the numbers are uniformly distributed; this is likely to be approximately true if the numbers range over several orders of magnitude. For many sets of numbers, especially sets that grow exponentially, such as incomes and stock prices, this is a reasonable assumption.

For example, if a quantity increases continuously and doubles every year, then it will be twice its original value after one year, four times its original value after two years, eight times its original value after three years, and so on. When this quantity reaches a value of 100, the value will have a leading digit of 1 for a year, reaching 200 at the end of the year. Over the course of the next year, the value increases from 200 to 400; it will have a leading digit of 2 for a little over seven months, and 3 for the remaining five months. In the third year, the leading digit will pass through 4, 5, 6, and 7, spending less and less time with each succeeding digit, reaching 800 at the end of the year. Early in the fourth year, the leading digit will pass through 8 and 9. The leading digit returns to 1 when the value reaches 1000, and the process starts again, taking a year to double from 1000 to 2000. From this example, it can be seen that if the value is sampled at uniformly distributed random times throughout those years, it is more likely to be measured when the leading digit is 1, and successively less likely to be measured with higher leading digits.

This example makes it plausible that data tables that involve measurements of exponentially growing quantities will agree with Benford’s Law. But the law also appears to hold for many cases where an exponential growth pattern is not obvious.
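
The doubling example is easy to simulate. The sketch below tallies the leading digit of a quantity that starts at 100 and doubles every year, sampled at uniformly random times (the window length and sample size are arbitrary illustrative choices):

    import math
    import random

    random.seed(0)
    N = 100_000
    counts = [0] * 10
    for _ in range(N):
        t = random.uniform(0, 10)          # a random instant in a 10-year window
        value = 100 * 2 ** t               # starts at 100, doubles every year
        counts[int(f"{value:e}"[0])] += 1  # leading digit via scientific notation

    for d in range(1, 10):
        print(f"{d}: observed {counts[d] / N:.3f}, Benford {math.log10(1 + 1 / d):.3f}")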

Scale invariance

If there is a list of lengths, the distribution of first digits of numbers in the list may be generally similar regardless of whether all the lengths are expressed in metres, or yards, or feet, or inches, etc. To the extent that the distribution of first digits of a data set is scale invariant, the distribution of first digits is the same regardless of the units that the data are expressed in. This implies that the distribution of first digits is given by Benford’s Law.[4][5] To be sure of approximate agreement with Benford’s Law, the data has to be approximately invariant when scaled up by any factor up to 10. A lognormally distributed data set with wide dispersion has this approximate property, as do some of the examples mentioned above.

This means that if one converts from feet to yards (multiplication by a constant), for example, the distribution of first digits must be unchanged; the distribution is scale invariant, and the only continuous distribution with this property is one whose logarithm is uniformly distributed. For example, the first (non-zero) digit of the lengths or distances of objects should have the same distribution whether the unit of measurement is feet or yards, or anything else. But there are three feet in a yard, so the probability that the first digit of a length in yards is 1 must be the same as the probability that the first digit of a length in feet is 3, 4, or 5. Applying this reasoning to all possible measurement scales gives a logarithmic distribution, and combining it with the fact that log10(1) = 0 and log10(10) = 1 gives Benford’s Law. In other words, if the distribution of first digits is to be the same regardless of the measuring units used, Benford’s Law is the only distribution that fits.
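
A numerical illustration of this invariance, using synthetic Benford-distributed “lengths” (the data are generated for the demonstration, not drawn from real measurements):

    import math
    import random

    def leading_digit(x: float) -> int:
        return int(f"{abs(x):e}"[0])

    # Draw lengths whose log-mantissae are uniform, "in feet", then convert
    # to yards; the first-digit profile should be essentially unchanged.
    random.seed(0)
    feet = [10 ** random.uniform(0, 5) for _ in range(100_000)]
    yards = [x / 3 for x in feet]

    for name, data in (("feet", feet), ("yards", yards)):
        counts = [0] * 10
        for x in data:
            counts[leading_digit(x)] += 1
        print(name, [round(counts[d] / len(data), 3) for d in range(1, 10)])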

Multiple probability distributions

For each positive integer n, this graph shows the probability that a random integer between 1 and n starts with each of the nine possible digits. For any particular value of n, the probabilities do not precisely satisfy Benford’s Law; however, looking at a variety of different values of n and averaging the probabilities for each, the resulting probabilities do exactly satisfy Benford’s Law.

For numbers drawn from certain distributions (IQ scores, human heights) the Law fails to hold, because these variates follow a normal distribution, which is known not to satisfy Benford’s Law:[6] normal distributions cannot span several orders of magnitude, so the mantissae of their logarithms will not be (even approximately) uniformly distributed.

However, if one “mixes” numbers from those distributions, for example by taking numbers from newspaper articles, Benford’s Law reappears. This can also be proven mathematically: if one repeatedly “randomly” chooses a probability distribution (from an uncorrelated set) and then randomly chooses a number according to that distribution, the resulting list of numbers will obey Benford’s Law.[3][7] A similar probabilistic explanation for the appearance of Benford’s Law in everyday-life numbers has been advanced by showing that it arises naturally when one considers mixtures of uniform distributions.[8]
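
One way to see this numerically is to mix uniform distributions whose scales themselves vary over several orders of magnitude; the particular mixture below is an illustrative choice, not the construction used in the cited proofs:

    import math
    import random

    def leading_digit(x: float) -> int:
        return int(f"{abs(x):e}"[0])

    # Repeatedly pick a distribution (uniform on [0, M] with a random scale M),
    # then pick a number from it, and tally the first digits.
    random.seed(0)
    counts = [0] * 10
    n = 0
    for _ in range(100_000):
        upper = 10 ** random.uniform(0, 4)   # random scale for this draw
        x = random.uniform(0, upper)
        if x > 0:
            counts[leading_digit(x)] += 1
            n += 1

    for d in range(1, 10):
        print(f"{d}: observed {counts[d] / n:.3f}, Benford {math.log10(1 + 1 / d):.3f}")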

Applications

Accounting fraud detection

In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford’s Law ought to show up any anomalous results.[9] Following this idea, Mark Nigrini showed that Benford’s Law could be used in forensic accounting and auditing as an indicator of accounting and expenses fraud.[10] In practice, applications of Benford’s Law for fraud detection routinely use more than the first digit.[10]
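
In its simplest form, such a test just tabulates observed first-digit frequencies against the Benford expectations. A sketch with hypothetical expense amounts (real applications, as noted, examine more than the first digit and use formal test statistics):

    import math
    from collections import Counter

    def first_digit_profile(amounts):
        """Observed first-digit frequencies versus Benford's expectations."""
        digits = [int(f"{abs(a):e}"[0]) for a in amounts if a != 0]
        observed = Counter(digits)
        n = len(digits)
        return [(d, observed[d] / n, math.log10(1 + 1 / d)) for d in range(1, 10)]

    # Hypothetical claim amounts; a marked excess for any digit (here, 8)
    # would flag the list for closer review.
    claims = [812.50, 974.10, 880.00, 932.75, 815.99, 127.40, 841.00, 899.99]
    for d, obs, exp in first_digit_profile(claims):
        print(f"digit {d}: observed {obs:.2f}, expected {exp:.2f}")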

Legal status

In the United States, evidence based on Benford’s Law has been admitted in criminal cases at the federal, state, and local levels.[11]

Election data

Benford’s Law has been invoked as evidence of fraud in the 2009 Iranian elections,[12] and also used to analyze other election results. However, other experts consider Benford’s Law essentially useless as a statistical indicator of election fraud in general.[13][14]

Macroeconomic data

Similarly, the macroeconomic data the Greek government reported to the European Union before entering the Euro Zone was shown to be probably fraudulent using Benford’s Law, albeit years after the country joined.[15]

Genome data

The number of open reading frames and their relationship to genome size differs between eukaryotes and prokaryotes, with the former showing a log-linear relationship and the latter a linear relationship. Benford’s Law has been used to test this observation, with an excellent fit to the data in both cases.[16]

Scientific fraud detection

A test of regression coefficients in published papers showed agreement with Benford’s Law.[17] As a comparison group, subjects were asked to fabricate statistical estimates; the fabricated results failed to obey Benford’s Law.

Limitations

Benford’s law can only be applied to data that are distributed across multiple orders of magnitude. For instance, one might expect that Benford’s law would apply to a list of numbers representing the populations of UK villages beginning with ‘A’, or representing the values of small insurance claims. But if a “village” is defined as a settlement with population between 300 and 999, or a “small insurance claim” is defined as a claim between $50 and $100, then Benford’s law will not apply.[18][19] More generally, if there is any cut-off which excludes a portion of the underlying data above a maximum value or below a minimum value, then the law will not apply.
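
The village example is easy to verify numerically: with populations confined to the range 300-999, the digits 1 and 2 can never lead, so the Benford frequencies cannot possibly hold. A sketch:

    import math
    import random

    random.seed(0)
    populations = [random.randint(300, 999) for _ in range(100_000)]
    counts = [0] * 10
    for p in populations:
        counts[int(str(p)[0])] += 1

    for d in range(1, 10):
        print(f"{d}: observed {counts[d] / len(populations):.3f}, "
              f"Benford {math.log10(1 + 1 / d):.3f}")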

Consider the probability distributions shown below, plotted on a log scale.[20] In each case, the total area in red is the relative probability that the first digit is 1, and the total area in blue is the relative probability that the first digit is 8.

A broad probability distribution on a log scale

A narrow probability distribution on a log scale

For the broad distribution (first figure), the sizes of the red and blue areas are approximately proportional to the widths of the red and blue bars. Therefore the numbers drawn from this distribution will approximately follow Benford’s law. On the other hand, for the narrow distribution (second figure), the ratio of the areas of red and blue is very different from the ratio of the widths of the red and blue bars. Rather, the relative areas of red and blue are determined more by the heights of the bars than by their widths. The heights, unlike the widths, do not satisfy the universal relationship of Benford’s law; instead, they are determined entirely by the shape of the distribution in question. Accordingly, the first digits in this distribution do not satisfy Benford’s law at all.[19]

Thus, real-world distributions that span several orders of magnitude rather smoothly (e.g. populations of settlements, provided that there is no lower limit) are likely to satisfy Benford’s law to a very good approximation. On the other hand, a distribution that covers only one or two orders of magnitude (e.g. heights of human adults, or IQ scores) is unlikely to satisfy Benford’s law well.[18][19]

Statistical tests

Statistical tests examining the fit of Benford’s law to data have more power when the data values span several orders of magnitude. Since many data samples typically do not have this range, numerical transformation of the data to a base other than 10 may be useful before testing.

Although the chi-squared test has been used to test for compliance with Benford’s law, it has low statistical power when used with small samples.

The Kolmogorov–Smirnov test and the Kuiper test are more powerful when the sample size is small, particularly when Stephens’s corrective factor is used.[21] These tests may be overly conservative when applied to discrete distributions. Critical values of the test statistics, generated by Morrow,[22] are shown below:

                           α = 0.10    α = 0.05    α = 0.01
Kuiper test                1.191       1.321       1.579
Kolmogorov–Smirnov test    1.012       1.148       1.420
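
As a sketch of how such a statistic might be computed for first-digit data in Python (the scaling by √N is our assumption about how the tabulated critical values are applied; a real analysis should consult Morrow’s paper):

    import math

    BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

    def ks_benford(first_digits):
        """sqrt(N)-scaled Kolmogorov-Smirnov distance between the observed
        first-digit distribution and Benford's Law (assumed scaling)."""
        n = len(first_digits)
        d_max = cum_obs = cum_exp = 0.0
        for d in range(1, 10):
            cum_obs += first_digits.count(d) / n
            cum_exp += BENFORD[d - 1]
            d_max = max(d_max, abs(cum_obs - cum_exp))
        return math.sqrt(n) * d_max

    # Example: compare the result with 1.148 (α = 0.05 in the table above).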

Two alternative tests specific to this law have been published: first, the max (m) statistic[23] is given by

m = \sqrt{N} \cdot \max_{i=1}^{9} \left\{ \left| \Pr(X \text{ has FSD} = i) - \log_{10}(1 + 1/i) \right| \right\},

and secondly, the distance (d) statistic[24] is given by

d = \sqrt{N \cdot \sum_{i=1}^{9} \left[ \Pr(X \text{ has FSD} = i) - \log_{10}(1 + 1/i) \right]^{2}},

where FSD is the First Significant Digit. Morrow has determined the critical values for both these statistics, which are shown below:[22]

                 α = 0.10    α = 0.05    α = 0.01
Leemis’ m        0.851       0.967       1.212
Cho–Gaines’ d    1.212       1.330       1.569
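
Both statistics translate directly into code. A minimal sketch, assuming the observed first digits have already been extracted into a list:

    import math

    BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

    def leemis_m(first_digits):
        n = len(first_digits)
        devs = [abs(first_digits.count(d) / n - BENFORD[d - 1])
                for d in range(1, 10)]
        return math.sqrt(n) * max(devs)

    def cho_gaines_d(first_digits):
        n = len(first_digits)
        sq = sum((first_digits.count(d) / n - BENFORD[d - 1]) ** 2
                 for d in range(1, 10))
        return math.sqrt(n * sq)

    # Reject conformance at α = 0.05 when m > 0.967 or d > 1.330
    # (critical values from the table above).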

Nigrini[25] has suggested the use of a z statistic

z = \frac{\left| p_o - p_e \right| - \frac{1}{2n}}{s_i}

with

s_i = \left[ \frac{p_e (1 - p_e)}{n} \right]^{1/2},

where |x| denotes the absolute value of x, n is the sample size, 1/(2n) is a continuity correction factor, p_e is the proportion expected from Benford’s Law, and p_o is the observed proportion in the sample.
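
In code, the z statistic for a single digit might look like the following sketch:

    import math

    def nigrini_z(p_obs: float, p_exp: float, n: int) -> float:
        """z statistic for one digit's observed vs. expected proportion,
        with the 1/(2n) continuity correction, as defined above."""
        s = math.sqrt(p_exp * (1 - p_exp) / n)
        return (abs(p_obs - p_exp) - 1 / (2 * n)) / s

    # Example: 35% of 1,000 records lead with digit 1, against 30.1% expected.
    print(nigrini_z(0.35, math.log10(2), 1000))   # ≈ 3.34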

Morrow has also shown that for any random variable X (with a continuous pdf) divided by its standard deviation σ, a value A can be found such that the distribution of the first significant digit of the random variable (X/σ)^A differs from Benford’s Law by less than ε, for any given ε > 0.[22] The value of A depends on the value of ε and the distribution of the random variable.
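
The effect of such a power transformation can be illustrated numerically; the sketch below is only an illustration of the idea, not Morrow’s construction. Raising a normalized variable to a growing power spreads it over more orders of magnitude and pulls its first digits toward Benford’s Law:

    import math
    import random

    def leading_digit(x: float) -> int:
        return int(f"{abs(x):e}"[0])

    random.seed(0)
    xs = [abs(random.gauss(0, 1)) for _ in range(100_000)]  # |X| / σ with σ = 1
    for A in (1, 3, 9):
        counts = [0] * 10
        for x in xs:
            if x > 0:
                counts[leading_digit(x ** A)] += 1
        print(f"A = {A}:", [round(counts[d] / len(xs), 3) for d in range(1, 10)])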

A method of accounting fraud detection based on bootstrapping and regression has been proposed.[26]

Generalization to digits beyond the first

It is possible to extend the law to digits beyond the first.[27] In particular, the probability of encountering a number starting with the string of digits n is given by:

\log_{10}(n + 1) - \log_{10}(n) = \log_{10}\left(1 + \frac{1}{n}\right)

(For example, the probability that a number starts with the digits 3, 1, 4 is log10(1 + 1/314) ≈ 0.0014.) This result can be used to find the probability that a particular digit occurs at a given position within a number. For instance, the probability that a “2” is encountered as the second digit is[27]

\log_{10}\left(1 + \frac{1}{12}\right) + \log_{10}\left(1 + \frac{1}{22}\right) + \cdots + \log_{10}\left(1 + \frac{1}{92}\right) \approx 0.109

And the probability that d (d = 0, 1, …, 9) is encountered as the n-th (n > 1) digit is

\sum_{k=10^{n-2}}^{10^{n-1}-1} \log_{10}\left(1 + \frac{1}{10k + d}\right)

The distribution of the n-th digit, as n increases, rapidly approaches a uniform distribution of 10% for each of the ten digits.[27] Four digits is often enough to assume a uniform distribution of 10%, as ‘0’ appears 10.0176% of the time in the fourth digit while ‘9’ appears 9.9824% of the time.
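
These summation formulas are straightforward to evaluate; a short Python sketch reproducing the figures quoted above (the function name is ours):

    import math

    def prob_digit_at_position(d: int, n: int) -> float:
        """Probability that digit d (0-9) appears in the n-th position (n > 1)."""
        return sum(math.log10(1 + 1 / (10 * k + d))
                   for k in range(10 ** (n - 2), 10 ** (n - 1)))

    print(prob_digit_at_position(2, 2))   # ≈ 0.109
    print(prob_digit_at_position(0, 4))   # ≈ 0.100176
    print(prob_digit_at_position(9, 4))   # ≈ 0.099824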
