Benford's Law
-------------
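To study the distribution of the mantissas of a dataset, we can
examine the fractional parts of the common logarithms of the data.
(Here the mantissa of a positive number is that number scaled by a
power of 10 so that it lies between 1 and 10.)  This works because
the fractional part of the common logarithm of a number is the
common logarithm of its mantissa.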

For example, consider numbers with mantissa 2.5:

    log(.025) = .39794000867203760957 - 2
    log(.25)  = .39794000867203760957 - 1
    log(2.5)  = .39794000867203760957 + 0
    log(25)   = .39794000867203760957 + 1
    log(250)  = .39794000867203760957 + 2
    log(2500) = .39794000867203760957 + 3

Thus, if the mantissa of a number is 2.5, the fractional part of its
common logarithm is .39794000867203760957.
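
The example above can be checked numerically.  The following is a
minimal Python sketch; the helper name log_fraction is an
illustrative assumption, not something defined in this article.

```python
import math

def log_fraction(x):
    """Return the fractional part of the common logarithm of x (x > 0)."""
    log_x = math.log10(x)
    # floor (not truncation) keeps the result in [0, 1) for x < 1 too
    return log_x - math.floor(log_x)

# Every number with mantissa 2.5 should give the same fractional part.
for x in (0.025, 0.25, 2.5, 25, 250, 2500):
    print(x, round(log_fraction(x), 10))
```

Using floor rather than truncation matters for numbers below 1,
whose common logarithms are negative.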

If the data spans several decades (powers of 10, not years; see
Decade (log scale)), the unevenness within any single decade tends
to even out when we combine the data from all of the decades.  Thus,
the fractional parts of the common logarithms of the data should be
approximately evenly distributed.
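
This evening-out can be illustrated with a small simulation.  The
sketch below is only an illustration under stated assumptions: the
data is hypothetical, with common logarithms drawn from a normal
distribution whose standard deviation is 1.5 decades, and the seed
and sample size are arbitrary.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: log10 of each value is normally distributed with
# a standard deviation of 1.5 decades, so the data spans several decades.
logs = [random.gauss(3.0, 1.5) for _ in range(100_000)]

# Fold every decade onto [0, 1) by keeping only the fractional part.
fractions = [v - math.floor(v) for v in logs]

# Count how many fractional parts land in each tenth of [0, 1).
counts = [0] * 10
for f in fractions:
    counts[min(int(f * 10), 9)] += 1  # min() guards against rounding to 1.0

shares = [c / len(fractions) for c in counts]
for i, share in enumerate(shares):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {share:.3f}")
```

Each tenth of the interval should receive a share close to 0.1.  Data
confined to a narrow range (a much smaller standard deviation in this
sketch) would not even out in the same way.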

For example, suppose the common logarithm of the data is distributed
across 4 decades as shown below:

Summing the distributions of the fractional parts of the logarithms,
that is, moving the decades on top of each other and adding curves,
we get the black curve at the top of the image below, which is close
to evenly distributed:

Thus, we arrive at the principal assumption of Benford's Law: the
logarithm of the mantissa of data which spans several decades is
typically distributed evenly.

Note the hash marks on the bottoms of the graphs above.  These marks
separate where the different leading digits of the mantissa live.  On
the line segment below, we expand these hash marks and align the
leading digit of the mantissa with the fractional part of the common
logarithm.  The leading digit of the mantissa is 1 if the fractional
part of the common logarithm is between 0 and .30103; the leading
digit is 2 if the fractional part is between .30103 and .47712; and
so on.

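The correspondence between the fractional part of the common
logarithm and the leading digit can also be written out in code.
This is a minimal sketch; the function name
leading_digit_from_fraction is hypothetical and chosen only for
illustration.

```python
import math

def leading_digit_from_fraction(frac):
    """Map a fractional part in [0, 1) to the leading digit 1..9."""
    for d in range(1, 10):
        # Digit d owns the interval from log10(d) up to log10(d + 1).
        if frac < math.log10(d + 1):
            return d
    return 9  # only reached if frac was rounded up to 1.0

print(leading_digit_from_fraction(0.2))      # below .30103, so digit 1
print(leading_digit_from_fraction(0.39794))  # mantissa 2.5, so digit 2
```
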
If the principal assumption of Benford's Law holds, the fractional
part of the common logarithm is evenly distributed.  In view of the
previous diagram, it is obvious that the probability of 1 being the
leading digit is greater than that of 2 being the leading digit; the
probability of 2 is greater than that of 3; and so on.  This is made
precise below.

Data that has a mantissa starting with the digit 1 has a common
logarithm whose fractional part ranges from log(1) to log(2).  If the
fractional part of the common logarithm of the data is evenly
distributed, then the portion of the data that starts with 1 would be

     log(2) - log(1)
    ---------------- = .30102999566398119521
    log(10) - log(1)

Similarly, data that has a mantissa starting with the digit 2 has a
common logarithm whose fractional part ranges from log(2) to log(3).
Thus, the portion of the data starting with 2 would be

     log(3) - log(2)
    ---------------- = .17609125905568124208
    log(10) - log(1)

In the same manner, data that has a mantissa starting with the digit
d has a common logarithm whose fractional part ranges from log(d) to
log(d+1).  Thus, the portion of the data starting with d would be

    log(d+1) - log(d)
    -----------------        [1]
    log(10) - log(1)

Using [1], we can compute the probability that such data will start
with the digit d:

    d  P(d)
    -  ----
    1  .30102999566398119521
    2  .17609125905568124208
    3  .12493873660829995313
    4  .096910013008056414359
    5  .079181246047624827723
    6  .066946789630613198203
    7  .057991946977686754929
    8  .051152522447381288949
    9  .045757490560675125410

This distribution of leading digits is called Benford's Law.

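The first-digit probabilities can be reproduced, to double
precision, with a few lines of Python.  This is a minimal sketch and
the function name first_digit_probability is an illustrative
assumption.  Because log(10) - log(1) = 1, the denominator of the
fraction is omitted.

```python
import math

def first_digit_probability(d):
    """Benford probability that the leading digit is d (1 <= d <= 9)."""
    # The denominator log10(10) - log10(1) equals 1, so it is dropped.
    return math.log10(d + 1) - math.log10(d)

for d in range(1, 10):
    print(d, f"{first_digit_probability(d):.15f}")

# The nine probabilities cover every possible leading digit.
total = sum(first_digit_probability(d) for d in range(1, 10))
print("total:", round(total, 12))
```

Double-precision floats carry about 15 to 16 significant digits, so
the printed values agree with the leading digits of the figures
quoted above rather than with all 20 of them.
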
Further Digits
--------------
The probability that the first two digits are 10 is

    log(11) - log(10)
    ------------------ = .041392685158225040750
    log(100) - log(10)

The probability that the first two digits are 20 is

    log(21) - log(20)
    ------------------ = .021189299069938072794
    log(100) - log(10)

Adding these probabilities over all nine possible first digits (that
is, over the two-digit beginnings 10, 20, ..., 90), we can compute
the probability that the second digit is 0 to be
.11967926859688076667.  In this manner, we can compute the
probability that the second digit is d:

    d  P(d)
    -  ----
    0  .11967926859688076667
    1  .11389010340755643889
    2  .10882149900550836859
    3  .10432956023095946693
    4  .10030820226757934031
    5  .096677235802322528359
    6  .093374735783036121570
    7  .090351989269603369600
    8  .087570053578861399175
    9  .084997352057692199898

For reference, here are the probabilities that the third digit is d:

    d  P(d)
    -  ----
    0  .10178436464421710175
    1  .10137597744780144287
    2  .10097219813704129959
    3  .10057293211092617495
    4  .10017808762794737592
    5  .099787575692177452606
    6  .099401309944962084127
    7  .099019206561896092170
    8  .098641184154777437875
    9  .098267163678253538152

Here are the probabilities that the fourth digit is d:

    d  P(d)
    -  ----
    0  .10017614693993552632
    1  .10013688811757926504
    2  .10009767259461432585
    3  .10005850028348653742
    4  .10001937109690488020
    5  .099980284947840433784
    6  .099941241749525329518
    7  .099902241415451708313
    8  .099863283859370683672
    9  .099824368995291309873
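
The same summation extends to any later digit: for the n-th digit,
add the probabilities over every possible run of n-1 preceding
digits.  The sketch below assumes this reading of the procedure
described above; the function name nth_digit_probability is
hypothetical.

```python
import math

def nth_digit_probability(n, d):
    """Probability that the n-th significant digit (n >= 2) equals d."""
    # k runs over every possible block of the first n-1 digits,
    # for example 1..9 when n = 2 and 10..99 when n = 3.
    start = 10 ** (n - 2)
    total = 0.0
    for k in range(start, 10 * start):
        first_n_digits = 10 * k + d
        # log10(m + 1) - log10(m), written as a single logarithm
        total += math.log10(1 + 1 / first_n_digits)
    return total

for n in (2, 3, 4):
    print(n, [round(nth_digit_probability(n, d), 6) for d in range(10)])
```

As n grows the ten probabilities approach .1 each, which matches the
trend in the tables above.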