Benford's Law
-------------
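To examine the distribution of the mantissas of a dataset, we can
look at the fractional parts of the common (base-10) logarithms of
the data. Here the mantissa of a number is its significand in
scientific notation, a value of at least 1 and less than 10. The
fractional part of the common logarithm of a number is the common
logarithm of its mantissa.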
For example, consider numbers with mantissa 2.5:
log(.025) = .39794000867203760957 - 2
log(.25) = .39794000867203760957 - 1
log(2.5) = .39794000867203760957 + 0
log(25) = .39794000867203760957 + 1
log(250) = .39794000867203760957 + 2
log(2500) = .39794000867203760957 + 3
Thus, if the mantissa of a number is 2.5, the fractional part of its
common logarithm is .39794000867203760957.
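The table above can be checked numerically. The following is a minimal
Python sketch; the helper name log_fraction is ours and is not part of
the original text, and floating-point results agree with the value
above only to about 15 digits.

```python
import math

def log_fraction(x):
    """Return the fractional part of the common logarithm of x (x > 0)."""
    # The % operator returns a non-negative result for a positive modulus,
    # so log10(0.025) = -1.602... maps to 0.397..., matching ".397... - 2".
    return math.log10(x) % 1.0

# Every number with mantissa 2.5 gives (up to rounding) the same fraction.
for x in (0.025, 0.25, 2.5, 25, 250, 2500):
    print(x, log_fraction(x))
```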
If the data spans several decades (powers of 10, not years; see
Decade (log scale)), the unevenness within any one decade tends to
even out when we combine the data from all of the decades. Thus, the
fractional parts of the common logarithms of the data should be
approximately evenly distributed between 0 and 1.
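This evening-out can be illustrated with a small simulation. The sketch
below is only an illustration under an assumption of ours: the common
logarithm of the data is normally distributed with mean 2 and standard
deviation 1.5, so the data spans several decades. Real datasets need
not look like this.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumed data model (for illustration only): log10 of each value is
# drawn from a normal distribution with mean 2 and standard deviation 1.5.
samples = [10 ** random.gauss(2.0, 1.5) for _ in range(100_000)]

# Histogram of the fractional parts of the common logarithms, 10 equal bins.
counts = [0] * 10
for x in samples:
    frac = math.log10(x) % 1.0
    counts[min(int(frac * 10), 9)] += 1  # min() guards against frac rounding to 1.0

shares = [c / len(samples) for c in counts]
for i, share in enumerate(shares):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {share:.4f}")
```

Under this assumed model, each bin should hold roughly one tenth of the
samples.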
For example, suppose the common logarithm of the data is distributed
across 4 decades as shown below:

Summing the distributions of the fractional parts of the logarithms,
that is, moving the decades on top of each other and adding curves,
we get the black curve at the top of the image below, which is close
to evenly distributed:
Thus, we arrive at the principal assumption of Benford's Law: the
common logarithm of the mantissa of data that spans several decades
is typically distributed evenly between 0 and 1.
Note the hash marks on the bottoms of the graphs above. These marks
separate where the different leading digits of the mantissa live.
On the line segment below, we expand these hash marks and align the
leading digit of the mantissa with the fractional part of the common
logarithm. The leading digit of the mantissa is 1 if the fractional
part of the common logarithm is between 0 and .30103; the leading
digit is 2 if the fractional part is between .30103 and .47712; and
so on.
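The correspondence between the fractional part of the common logarithm
and the leading digit can be sketched in Python as follows. The names
BOUNDARIES and leading_digit_from_fraction are hypothetical, chosen for
this illustration.

```python
import math
from bisect import bisect_right

# log10(2), log10(3), ..., log10(9) split the interval [0, 1) into nine
# pieces, one for each possible leading digit.
BOUNDARIES = [math.log10(d) for d in range(2, 10)]

def leading_digit_from_fraction(frac):
    """Map a fractional part (0 <= frac < 1) of a common logarithm to a leading digit."""
    # The number of boundaries at or below frac, plus one, is the digit.
    return bisect_right(BOUNDARIES, frac) + 1

print(leading_digit_from_fraction(0.39794))  # fraction for mantissa 2.5
```

A lookup by bisection avoids computing 10 ** frac, which can land just
below an integer because of floating-point rounding.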
If the principal assumption of Benford's Law holds, the fractional
part of the common logarithm is evenly distributed. In view of the
previous diagram, it is obvious that the probability of 1 being the
leading digit is greater than that of 2 being the leading digit; the
probability of 2 is greater than that of 3; and so on. This is made
precise below.
Data that has a mantissa starting with the digit 1 has a common
logarithm whose fractional part ranges from log(1) to log(2). If
the fractional part of the common logarithm of the data is evenly
distributed, then the portion of the data that starts with 1 would
be
log(2) - log(1)
---------------- = .30102999566398119521
log(10) - log(1)
Similarly, data that has a mantissa starting with the digit 2 has
a common logarithm whose fractional part ranges from log(2) to
log(3). Thus, the portion of the data starting with 2 would be
log(3) - log(2)
---------------- = .17609125905568124208
log(10) - log(1)
In the same manner, data that has a mantissa starting with the
digit d has a common logarithm whose fractional part ranges from
log(d) to log(d+1). Thus, the portion of the data starting with d
would be
log(d+1) - log(d)
----------------- [1]
log(10) - log(1)
Using [1], we can compute the probability that such data will start
with the digit d:
d P(d)
- ----
1 .30102999566398119521
2 .17609125905568124208
3 .12493873660829995313
4 .096910013008056414359
5 .079181246047624827723
6 .066946789630613198203
7 .057991946977686754929
8 .051152522447381288949
9 .045757490560675125410
This distribution of leading digits is called Benford's Law.
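Formula [1] is easy to evaluate directly. The sketch below (the
function name benford_first_digit is ours) reproduces the table above
to double precision, which is about 15 to 16 significant digits rather
than the 20 shown.

```python
import math

def benford_first_digit(d):
    """Probability that the leading digit is d (1 <= d <= 9), from formula [1]."""
    # The denominator log(10) - log(1) equals 1, so it is left out.
    return math.log10(d + 1) - math.log10(d)

for d in range(1, 10):
    print(d, benford_first_digit(d))
```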
Further Digits
--------------
The probability that the first two digits are 10 is
log(11) - log(10)
------------------ = .041392685158225040750
log(100) - log(10)
The probability that the first two digits are 20 is
log(21) - log(20)
------------------ = .021189299069938072794
log(100) - log(10)
Adding the probabilities that the first two digits are 10, 20, ...,
90 (one term for each possible first digit), we find that the
probability that the second digit is 0 is .11967926859688076667.
In this manner, we can compute the probability that the second digit
is d:
d P(d)
- ----
0 .11967926859688076667
1 .11389010340755643889
2 .10882149900550836859
3 .10432956023095946693
4 .10030820226757934031
5 .096677235802322528359
6 .093374735783036121570
7 .090351989269603369600
8 .087570053578861399175
9 .084997352057692199898
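The second-digit computation can be sketched in Python as a sum over
the nine possible first digits. The function names below are ours, and
the results are accurate only to double precision.

```python
import math

def first_two_digits(n):
    """Probability that the first two digits form the number n (10 <= n <= 99)."""
    # The denominator log(100) - log(10) equals 1, so it is left out.
    return math.log10(n + 1) - math.log10(n)

def second_digit(d):
    """Probability that the second digit is d (0 <= d <= 9)."""
    # Add the two-digit probabilities over every possible first digit.
    return sum(first_two_digits(10 * first + d) for first in range(1, 10))

for d in range(10):
    print(d, second_digit(d))
```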
For reference, here are the probabilities that the third digit is d:
d P(d)
- ----
0 .10178436464421710175
1 .10137597744780144287
2 .10097219813704129959
3 .10057293211092617495
4 .10017808762794737592
5 .099787575692177452606
6 .099401309944962084127
7 .099019206561896092170
8 .098641184154777437875
9 .098267163678253538152
Here are the probabilities that the fourth digit is d:
d P(d)
- ----
0 .10017614693993552632
1 .10013688811757926504
2 .10009767259461432585
3 .10005850028348653742
4 .10001937109690488020
5 .099980284947840433784
6 .099941241749525329518
7 .099902241415451708313
8 .099863283859370683672
9 .099824368995291309873
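The same summation extends to any digit position: to find the
probability that the n-th digit is d, add the probabilities of every
n-digit prefix that ends in d. The following is a minimal sketch of
that idea; the function name nth_digit is ours, and the results carry
ordinary floating-point rounding error.

```python
import math

def nth_digit(n, d):
    """Probability that the n-th significant digit is d (n >= 1, 0 <= d <= 9)."""
    if n == 1:
        # A leading digit is never 0.
        return math.log10(1 + 1 / d) if d >= 1 else 0.0
    lo = 10 ** (n - 2)
    # k runs over every possible block of the first n-1 digits; 10*k + d is
    # then an n-digit prefix whose last digit is d.
    return math.fsum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, 10 * lo))

for n in (3, 4):
    print(n, [round(nth_digit(n, d), 6) for d in range(10)])
```

As the tables suggest, the probabilities approach .1 for every digit as
n grows.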