
Critique

Richard Dawkins’ argument for ‘Why there almost certainly is no God’ (Chapter 4, The God Delusion) is mathematical. He proposes that there is no mathematical solution to ‘the problem of improbability’ of God, whereas there is a mathematical solution to ‘the problem of improbability’ of a large stage of Darwinian evolution.

The problem occurs when an improbability is prohibitively improbable. According to Dawkins, the solution to a problem of improbability is to replace its complementary probability with a series of the factors of its complementary probability. Each improbability, complementary to the probability of a factor, is ‘slightly improbable, but not prohibitively so’ (p. 121, The God Delusion).

The problem of improbability of the success of natural selection for a large stage of Darwinian evolution is solved by replacing the large stage with a series of smaller sub-stages. The series of sub-stages represents the staged evolution of the ultimate mutation. In this gradualism, for each sub-stage there is a mutation, which survives the test of natural selection for that sub-stage. In contrast, for the overall, large stage, all mutations are subjected to but one, the ultimate test of natural selection. This test of natural selection represents a much larger and prohibitive improbability than that of the test of natural selection within each sub-stage in the series.

In order for there to be a solution to the improbability of God, God would have to come into being by a series of sub-stages, where each sub-stage improbability is not prohibitively large as is the single stage improbability of the existence of God. Obviously, God would not be God, if he came into existence gradually. Therefore, there is no solution to the improbability of God.

Dawkins’ Elucidation of the Role of Gradualism Is Self-Criticism

In delineating the role of gradualism in Darwinian evolution, Dawkins demonstrated that it has no effect on the probability of the evolutionary success of natural selection. He showed that the role of gradualism is to increase the efficiency of mutation. Thus, he disproved that gradualism is the solution to the mathematical problem of improbability. In criticizing his own solution to the problem of improbability, Dawkins disproved his rationale for why there almost certainly is no God.

To illustrate the role of gradualism in Darwinian evolution, Dawkins chose an example of three mutation sites of six mutations each. He accordingly noted that this defines 6 × 6 × 6 = 216 different mutations. If the one mutation of these 216 which is capable of surviving natural selection is unknown, then a minimum of one copy of each would have to be generated non-randomly to ensure 100% evolutionary success of natural selection.

He compared this large stage of non-random mutation and natural selection with a series of three sub-stages, each affecting one of the three mutation sites. Each sub-stage would require the generation of six non-random mutations to ensure 100% evolutionary success. That would be a total of 18 non-random mutations for 100% overall success of natural selection for the series of sub-stages.

The difference between the single, overall stage and the series of sub-stages is not in the success of natural selection. For both, the success of natural selection is 100%. The difference is in the total number of non-random mutations required to ensure 100% success, namely 216 versus 18 total mutations. The series is mutationally more efficient by a factor of 216/18 = 12, at 100% probability of success of natural selection.

If the pools of mutations subjected to natural selection in the illustration are generated by random mutation, rather than non-random mutation, a similar efficiency in the number of mutations is achieved, without any change in the probability of success of natural selection.

A pool of 19 randomly generated mutations in each of three sub-cycles would yield a probability of success of natural selection of 96.9% for each cycle and an overall probability of success of natural selection of 90.9%. For a single cycle of random mutation involving all three mutation sites, a pool of 516 random mutations would be required to yield a probability of success of natural selection of 90.9%. The efficiency factor in random mutations due to gradualism would be 516/57 ≈ 9, with no change in the 90.9% probability of success of natural selection.

The probability, P, of at least one copy of the mutation surviving natural selection in a pool of x randomly generated mutations with a base of n different mutations is: P = 1 − ((n − 1)/n)^x.
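The figures quoted above follow directly from this formula. A minimal sketch in Python, assuming nothing beyond the formula itself, reproduces them:

```python
# Sketch checking the essay's arithmetic for random mutation pools.
# P = 1 - ((n - 1)/n)**x is the probability that a pool of x randomly
# generated mutations contains at least one copy of the single mutation
# (out of n possibilities) that survives natural selection.

def p_success(n, x):
    return 1 - ((n - 1) / n) ** x

per_stage = p_success(6, 19)   # each of three sub-stages, pool of 19: ~96.9%
overall = per_stage ** 3       # all three sub-stages succeeding: ~90.9%
single = p_success(216, 516)   # one large stage, pool of 516: ~90.9%

print(round(per_stage, 3), round(overall, 3), round(single, 3))
```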

Thus, as his own critic, Dawkins has disproved his claim that the problem of improbability of success of a large stage of Darwinian evolution is solved by replacing it with a series of sub-stages.

Dawkins’ Criticism of ‘The Problem of Improbability’

Dawkins has also demonstrated that there is no such problem as ‘the problem of improbability’. He has labeled those who propose such a problem persons of a ‘discontinuous mind’.

Dawkins noted that some variables, which are defined over a range of 0 to 1, are essentially fractions of a whole of some thing or property. The range is 0% to 100%. Implicitly, any two values of such a variable differ from one another by degree, not by kind. Dawkins claims that only a person with a ‘discontinuous mind’ would propose that there is some value in the range which distinguishes two kinds of the variable. One kind would span from 0% to an arbitrary point marking a discontinuity; the second kind would span from that point of discontinuity to 100%.

Of course, wearing his mathematician’s hat, Dawkins is correct. In being correct he has demonstrated that there is no valid definition of a ‘problem of improbability’. Defining the problem requires an arbitrary point demarcating a discontinuity in the range of improbability of 0% to 100%, thereby forming two kinds of improbability, non-prohibitive and prohibitive. It is these two kinds of improbability, which form the basis of his discussion of ‘the problem of improbability’ of Darwinian evolution and of why there almost certainly is no God.

Conclusion

Dawkins has proved
1. Gradualism does not solve ‘the problem of improbability’ of the success of natural selection in a large stage of Darwinian evolution. It has no effect on probability. It merely increases the efficiency of mutation.
2. ‘The problem of improbability’ is a self-contradiction. It proposes a distinction of kind between two subsets, within the defined range of a continuous variable whose values vary by degree, not kind.


Joe Average bowled in a recreational league on Tuesday nights from April through August. On Wednesday mornings, relying on his memory, Joe entered his three game scores in a log before breakfast. After breakfast it was his habit to read the box scores of those American League baseball games played on Tuesdays and reported in the Wednesday morning edition of USA Today. That typically consisted of seven games with six data each, namely the runs, hits and errors of the two teams. That was a total of forty-two baseball data compared to the three data, which were Joe’s bowling scores.


(E,J) is the number of Joe’s logged bowling scores that are erroneous.

E is the sum of Joe’s erroneous scores plus reported erroneous AL box scores.

J is the total of Joe’s logged scores, erroneous plus correct.

T is the grand total of scores, Joe’s logged bowling scores plus the reported AL box scores.

Given:

(E,J) / E = X = 2/3

E/T = Y = 1/300

J/T = Z = 1/15

What is the probability that a bowling score in Joe’s log is erroneous?

Answer:

Employing Bayes’ Theorem,

(E,J) / J = (X * Y) / Z

(E,J) / J = ((2/3) * (1/300)) / (1/15)

(E,J) / J = 1/30

The fraction of erroneous bowling scores in Joe’s log was 1/30 ≈ 0.0333.
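The calculation above can be reproduced exactly with rational arithmetic; a minimal sketch, using only the three given ratios:

```python
from fractions import Fraction

# The three given ratios from the example
X = Fraction(2, 3)    # (E,J)/E : Joe's errors as a share of all errors
Y = Fraction(1, 300)  # E/T     : errors as a share of the grand total
Z = Fraction(1, 15)   # J/T     : Joe's scores as a share of the grand total

# Bayes' theorem: (E,J)/J = (X * Y) / Z
ej_over_j = (X * Y) / Z
print(ej_over_j)  # 1/30, i.e. ~0.0333
```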

Comments:

If the time period consisted of twenty weeks, Joe would have recorded 60 scores, of which the fraction 1/30, i.e. 60 × (1/30) = 2, were in error. If over the same time period USA Today reported 42 × 20 = 840 AL box score data, then the total data in the population would be 60 + 840 = 900, of which the fraction 1/300, i.e. 900 × (1/300) = 3, were in error. Thus, there was one typo in the AL box scores reported by USA Today over the same time period.

By Bayes’ theorem, the probability of error in Joe’s logging of his bowling scores was calculated to be 3.333%.

Could we conclude that the box score data of the American League determined the probability of Joe’s making an error in his bowling log?

What would be the standard of comparison for determining the correctness or error of a datum in Joe’s bowling log? Could the standard of comparison inherently be the data of American League baseball box scores?

To apply Bayes’ theorem a population of data must be partitioned by two independent criteria. In the above example, one criterion partitioned the population into Joe’s data and non-Joe’s data. The other criterion partitioned the population into erroneous data and non-erroneous data.

What is often lost sight of in applying Bayes’ theorem is that the theorem does not treat subsets as antithetical to one another. Rather, it deals with subsets as compatible, as complementary in forming a whole. In the illustration, the baseball scores are not treated as baseball data, but as non-Joe’s data, the complement of Joe’s data.

In Proving History, page 50 ff, Richard Carrier partitions a population of data into historical reports from Source A and reports from non-Source A. Carrier’s other criterion partitions the population of reports into true reports and non-true reports. He then employs Bayes’ theorem to calculate the probability of true reports among all the reports of Source A. That is not what he indicates he has done. He indicates that what he has done is to evaluate the truth of a Source A report where the evaluation is based on the content of non-Source A reports. That would be comparable to claiming that a datum, in Joe’s bowling log, could be determined to be correct or erroneous based on the content of American League baseball box scores as reported in USA Today by employing Bayes’ theorem.

Both Bayes’ theorem and the reports of the American League box scores are pertinent to calculating the probability of errors in Joe’s bowling log. That probability is the fraction of his logged scores which are erroneous. The pertinence is due to the fact that both Bayes’ theorem and probability deal with complementary subsets. In this instance, the complementary subsets are: Some of Joe’s logged scores are erroneous. Some are non-erroneous.

Neither Bayes’ theorem nor the reports of the American League box scores are pertinent to determining whether any particular score in Joe’s log is erroneous or correct. That distinction is between antithetical propositions: This score is erroneous. This score is not erroneous.

Subsets subject to Bayes’ theorem may be nominally antithetical, such as true and non-true, and, in that sense, incompatible. Yet, relevant to Bayes’ theorem, such subsets are merely complementary and in that sense compatible. Their sum equals the entire set. It is their compatibility as complementary which renders the subsets subject to Bayes’ theorem.

Carrier in Proving History, p. 50 ff, by conflating antithetical with different, while ignoring the complementarity of subsets, completely misrepresents Bayes’ theorem and its utility.

For an algebraic validation of Bayes’ theorem see the first five paragraphs of the essay.

On page 50 of Proving History, Richard Carrier states,

Notice that the bottom expression (the denominator) represents the sum total of all possibilities, and the top expression (the numerator) represents your theory (or whatever theory you are testing the merit of), so we have a standard calculation of odds: your theory in ratio to all theories.

Carrier is proposing that Bayes’ theorem can be used to determine the truth of your theory which is one among many theories. Carrier implicitly claims that Bayes’ theorem can be used to determine the truth of your theory according to the numerical value of the probability of your theory with respect to all theories, i.e. ‘your theory in ratio to all theories’.

If there are n theories of which yours is one, then the probability of your theory is 1/n, but so too is the probability of every other theory in the set of all theories. Consequently, such a probability is no indication of the truth or non-truth of your theory. If Carrier’s statement of what is calculated by Bayes’ theorem were true, then Bayes’ theorem would have no relevance to determining the truth of your theory.

What Probabilities of Your Theory(s) are Determinable by Bayes’ Theorem?

Probability is the ratio of a subset to a set. Thus, what we are asking is what ratios, within the context of Bayes’ theorem, have your theory(s) alone in the numerator and your theory(s) plus other theories in the denominator.

The population of elements to which Bayes’ theorem applies, may be viewed as a surface over which the population density varies. A Bayesian population is divided into two portions by each of two independent criteria. One criterion may be viewed as dividing the population into two horizontal portions, while the other criterion divides it into two vertical portions. The result is the formation of four quadrants, which differ in population due to the non-uniformity of the population density.

The two portions formed by the horizontal division may be distinguished as the horizontal top row, HT, and the horizontal bottom row, HB. The two portions formed by the vertical division may be distinguished as the vertical left column, VL, and the vertical right column, VR. The two portions, HT + HB add up to the total, T, as do the two portions, VL and VR. The quadrants are designated as Q1 through Q4. Each of the portions is the sum of two quadrants, e.g. HT = Q1 + Q2 and VL = Q1 + Q3.

Tabulation of a Bayesian Population

In the illustrated Bayesian population, the column VR has the role of non-VL. Thus, rather than being one column, VR may be any number of columns, whose sum is the complement of VL. Analogously, the row, HB, has the role of non-HT. Consequently, Bayes’ theorem is applicable to any number of rows and any number of columns, where the additional rows and columns may be treated in their sum as non-HT and non-VL, i.e. as HB and VR, respectively.

Bayes’ theorem, in its algebraic expression, which focuses on Q1, is:

Q1/VL = ((Q1/HT) / (VL/T)) * (HT/T) Eq. 1

The two terms, HT, cancel out as do the two terms, T. This leaves the identity, Q1/VL ≡ Q1/VL, which proves the validity of Bayes’ theorem. In the application of Bayes’ theorem the numerical values of the numerators and the denominators of the fractions are not given. What is given are the numerical values of the three fractions on the right hand side of the equation, which permits the calculation of the numerical value of the fraction, Q1/VL, as a fraction.
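Eq. 1 can be confirmed numerically on any 2 × 2 tabulation. A sketch with hypothetical quadrant counts (the counts are illustrative, not from the essay), using exact fractions:

```python
from fractions import Fraction as F

# Hypothetical quadrant counts for illustration
Q1, Q2, Q3, Q4 = F(12), F(28), F(18), F(42)
HT = Q1 + Q2              # horizontal top row
VL = Q1 + Q3              # vertical left column
T = Q1 + Q2 + Q3 + Q4     # grand total

lhs = Q1 / VL                            # left hand side of Eq. 1
rhs = ((Q1 / HT) / (VL / T)) * (HT / T)  # right hand side of Eq. 1
print(lhs == rhs, lhs)  # True 2/5
```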

In the context of the quotation of Carrier: HT are true theories and HB are non-true theories; VL are your theories and VR are non-your or others’ theories. Thus Q1/VL, which is calculated by Bayes’ theorem is the probability of your true theories in ratio to all of your theories. This is what Carrier falsely states is ‘your theory in ratio to all theories’. (I will substantiate that Carrier is referring to Q1/VL later in this essay.)

Let me first list the other probabilities of your theory(s) calculable using Bayes’ theorem, Eq. 1. We can solve Eq. 1 for three other probabilities of your true theories, and of your theories, besides Q1/VL. They are Q1/HT, VL/T and Q1/T.

Q1/HT = (Q1/VL) * ((VL/T) / (HT/T)) Eq. 2

VL/T = ((Q1/HT) / (Q1/VL)) * (HT/T) Eq. 3

Q1/T = (Q1/VL) * (VL/T) Eq. 4
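Eqs. 2 to 4 can be checked the same way; a sketch with the same style of hypothetical counts (illustrative, not from the essay):

```python
from fractions import Fraction as F

# Hypothetical quadrant counts for illustration
Q1, Q2, Q3, Q4 = F(12), F(28), F(18), F(42)
HT, VL = Q1 + Q2, Q1 + Q3
T = Q1 + Q2 + Q3 + Q4

eq2 = (Q1 / VL) * ((VL / T) / (HT / T))   # Eq. 2: equals Q1/HT
eq3 = ((Q1 / HT) / (Q1 / VL)) * (HT / T)  # Eq. 3: equals VL/T
eq4 = (Q1 / VL) * (VL / T)                # Eq. 4: equals Q1/T

print(eq2 == Q1 / HT, eq3 == VL / T, eq4 == Q1 / T)  # True True True
```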

To What Bayesian Ratio is Carrier Referring as ‘your theory in ratio to all theories’?

In Eq. 2, Q1/HT is the probability of your true theory(s) in ratio to all true theories. This probability is restricted to true theories. If this probability were what Carrier is referring to by ‘your theory in ratio to all theories’, he would be granting that your theory is true, and is not a ‘theory you are testing the merit of’.

In Eq. 3, VL/T is the probability of all of your theories, true and non-true, in ratio to all theories. This probability lumps your true and non-true theories together, so it could not be a test of the merit of your theory(s). For example, if you have ten theories, true or non-true, the fact that there are five or a million other theories has no relevance to the merit of your theory(s).

In Eq. 4, Q1/T is the probability of your true theory(s) in ratio to all theories. This ratio, which acknowledges the truth of your true theory, cannot be a test of the merit of your theory. Nevertheless, Q1/T appears close to ‘your theory in ratio to all theories’. It lacks the word, true, after the word, your. However, as shown below, Carrier cannot be referring to Q1/T, but must be referring to Q1/VL.

Carrier in the Quote is Referring to Q1/VL

The common expression of Bayes’ theorem is Eq. 1, which calculates Q1/VL.

Q1/VL is the probability of your true theories in ratio to all of your theories. It is this which Carrier falsely labels ‘your theory in ratio to all theories’. Admittedly, Carrier’s expression, ‘your theory’ can be understood as your true theory(s), but it is obvious that by the words, ‘all theories’, Carrier means all theories and does not mean only all of your theories.

We must ask whether Carrier could have been referring to Q1/T, as expressed in Eq. 4, rather than Q1/VL, as expressed in Eq. 1. The reason that it is Q1/VL becomes apparent from his verbal presentation of Bayes’ theorem as,

[Image: Carrier’s verbal rendition of Bayes’ theorem]

Typically, Bayes’ theorem is expressed as Eq. 1. In Eq. 1, the denominator is VL/T. However, VL/T is often expressed as the sum,

VL/T = (Q1/HT) * (HT/T) + (Q3/HB) * (HB/T) Eq. 5

The denominator of Carrier’s verbalized version of the Bayesian equation is undeniably an attempt to express this sum.

The validity of Eq. 5 is apparent in that,

VL/T = Q1/T + Q3/T = (Q1 + Q3)/T, where Q1 + Q3 = VL
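This decomposition can likewise be confirmed numerically; a sketch with hypothetical quadrant counts (illustrative, not from the essay):

```python
from fractions import Fraction as F

# Hypothetical quadrant counts for illustration
Q1, Q2, Q3, Q4 = F(12), F(28), F(18), F(42)
HT, HB = Q1 + Q2, Q3 + Q4
VL, T = Q1 + Q3, Q1 + Q2 + Q3 + Q4

# Eq. 5: VL/T expanded as a sum over the two rows
rhs = (Q1 / HT) * (HT / T) + (Q3 / HB) * (HB / T)
print(VL / T == rhs)  # True
```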

Due to the fact that Carrier is attempting to verbalize the standard expression of Bayes’ theorem, i.e. Eq. 1, then the denominator is VL/T. VL/T is the ratio of all your theories to all theories. It cannot be in any way construed to be simply ‘all theories’ as Carrier claims. VL/T is obviously a ratio, in which T is all theories.

If Carrier had meant to express, Q1/T, as in Eq. 4, by his verbalization, the term VL/T, expressed as a sum, would then be a direct factor as it is in Eq. 4. VL/T would not be in the denominator, i.e. an inverse factor, as it is in Carrier’s verbalization and as it is in Eq. 1.

There is another reason that it is apparent that Carrier’s verbalization is expressing Q1/VL as in Eq. 1. The numerator of Eq. 1 is (Q1/HT) * (HT/T). This is the first term of VL/T when VL/T is expressed as a sum as in Eq. 5. In his verbalization, Carrier acknowledges that the first term of the sum of his denominator is his numerator. Thus, Carrier’s verbalization is meant to express Eq. 1, where the denominator, VL/T, is not ‘all theories’, as Carrier claims. VL/T is the ratio of all your theories to all theories.

Also, it should be noted that the numerator of Bayes’ theorem, Eq. 1, which is (Q1/HT) * (HT/T), is Q1/T. Thus, the numerator of Bayes’ theorem is the probability of your true theories over all theories, and is not as Carrier claims simply ‘your theory’.

Conclusion

Carrier’s explanation of Bayes’ theorem on page 50, of Proving History as ‘your theory in ratio to all theories’ is completely erroneous.

Bayes’ theorem is a simple algebraic relationship among fractions of a set or population of elements. Based on common expositions of it, one would think that it was complicated in itself and that it resolved a mystery through its implications.

The population of elements to which Bayes’ theorem applies, may be viewed as a surface over which the population density varies. A Bayesian surface is partitioned by two independent criteria. One criterion may be viewed as dividing the surface into two horizontal rows, while the other criterion divides it into two vertical columns. The result is the formation of four quadrants, which differ in population due to the non-uniformity of the population density. One important thing is that the four quadrants are mutually related. Each may be expressed by the same algebraic formulation in its relationships to the other three.

The two rows formed by the horizontal partitioning may be distinguished as the horizontal top row, HT, and the horizontal bottom row, HB. The two columns formed by the vertical partitioning may be distinguished as the vertical left column, VL, and the vertical right column, VR. The two rows, HT + HB, add up to the total, T, as do the two columns, VL and VR. The quadrants are designated as Q1 through Q4. Each row or column is the sum of two quadrants, e.g. HT = Q1 + Q2 and VL = Q1 + Q3.

Tabulation of a Bayesian Population

In the Tabulated Bayesian Population, the column VR has the role of non-VL. Thus, rather than being one column, VR may be any number of columns, whose sum is the complement of VL. Analogously, the row, HB, has the role of non-HT. Consequently, Bayes’ theorem is applicable to any number of rows and any number of columns, where the additional rows and columns may be treated in their sum as non-HT and non-VL, i.e. as HB and VR, respectively.

Bayes’ theorem, in its algebraic expression, which focuses on Q1, is:

Q1/VL = ((Q1/HT) / (VL/T)) * (HT/T) Eq. 1

The two terms, HT, cancel out as do the two terms, T. This leaves the identity, Q1/VL ≡ Q1/VL, which proves the validity of Bayes’ theorem.

In the application of Bayes’ theorem the numerical values of the numerators and the denominators of the fractions are not given. What is given are the numerical values of the three fractions on the right hand side of Eq. 1, which permits the calculation of the numerical value of the fraction, Q1/VL, as a fraction.

Reciprocity of Various Expressions of Bayes’ Theorem

Eq. 1 expresses Bayes’ algebraic formulation by focusing on the top, left quadrant, Q1. However, it must be remembered that the same algebraic formulation of relationships with the other three quadrants, could be applied to any quadrant. This can be seen in that each of the other three quadrants can be successively designated as quadrant, Q1, by rotating the population surface in increments of 90 degrees.

In the application of Bayes’ theorem, Eq. 1 is viewed as representing Q1/VL as directly proportional to HT/T, where the constant of proportionality is (Q1/HT) / (VL/T). Because each of the fractions of Eq. 1 is the ratio of a subset to a set, each of the fractions is a probability. Expressing the direct proportionality of Eq. 1 using the word, probability, rather than the word, fraction, yields: The probability of quadrant Q1 with respect to the column VL is directly proportional to the probability of the row HT with respect to the total population, T.

Typically, the numerical value of the probability, HT/T, is given along with the numerical value of the constant of proportionality. The numerical value of the probability, Q1/VL, is calculated. Common jargon refers to the given probability, HT/T, as the prior probability and the calculated probability, Q1/VL, as the posterior or the final probability.

If the numerical value of Q1/VL were given along with the constant of proportionality, then the probability HT/T could be calculated. We would be viewing Eq. 1 in the form,

HT/T = ((VL/T) / (Q1/HT)) * (Q1/VL) Eq. 2

Common jargon would then label Q1/VL as the prior probability and HT/T as the posterior or final probability, i.e. vice versa to the common jargon applied to Eq. 1.

Eq. 1 and Eq. 2 are fully equivalent. With respect to Eq. 1, common jargon, in determining the probability of a hypothesis, would claim that the prior probability of row HT with respect to the total was revised to the posterior or final probability of Q1 with respect to column VL.

With respect to Eq. 2, common jargon would claim that the prior probability of Q1 with respect to column VL was revised to the posterior or final probability of row HT with respect to the total.

What this apparently contradictory jargon means is (1) that given the constant of proportionality and HT/T, then Q1/VL can be calculated, while (2) given the constant of proportionality, and Q1/VL, then HT/T can be calculated. Both probabilities remain completely distinct. Neither replaces the other or is revised to equal the other.

A numerical value, which is given, is prior in our knowledge to a numerical value, which is calculated. But in no sense does one replace the other or is one revised to be the other. To use the words, replace and/or revise is to use misleading jargon.

Identifying one probability within Bayes’ equation as prior and one as posterior, where the posterior replaces or supersedes the prior, is a misleading mystification of simple algebra, where the two probabilities are distinct and do not change in their algebraic relationship to one another.

An Illustration of Bayes’ Theorem

Let us use an easily comprehended set of elements to illustrate Bayes’ theorem. That set is a bunch of playing cards. Not a standard deck, a bunch. All of the cards in the set, i.e. the bunch, are not of the customary thirteen ranks, but of only two ranks, Kings and Queens. All of the cards in the set are not of four, but of only two suits, Diamonds and Spades.

Let us view Bayes’ theorem as telling us that Q1/VL, is directly proportional to HT/T. The constant of proportionality would then be (Q1/HT) / (VL/T).

Q1/VL = ((Q1/HT) / (VL/T)) * (HT/T) Eq. 1

In this example the elements of the set are cards. T is the total number of cards. HT is the total number of Kings. VL is the total number of Diamonds. Q1 is the number of cards that are both Kings and Diamonds.

The person, who formed the set of cards, tells us that 70% of the Kings are Diamonds; that 50% of the cards are Diamonds and that 40% of the cards are Kings. Referring to Eq. 1: (1) If 70% of the Kings are Diamonds, then Q1/HT = 0.7. (2) If 50% of the cards are Diamonds, then VL/T = 0.5. (3) If 40% of the cards are Kings, then HT/T = 0.4. The constant of proportionality, (Q1/HT) / (VL/T), equals 0.7/0.5 = 1.4.

The fraction of Diamonds that are Kings, Q1/VL is directly proportional to HT/T, the fraction of all cards that are Kings.

The fraction of Diamonds that are Kings = (.7/.5) * the fraction of all cards that are Kings.
Q1/VL = (.7/.5) * (HT/T)

The fraction of Diamonds that are Kings = (1.4) * 0.4 = 0.56 = 56%
Q1/VL = 56%
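The illustration's arithmetic can be written out as a short sketch, using exact fractions for the three given percentages:

```python
from fractions import Fraction as F

q1_over_ht = F(7, 10)  # 70% of the Kings are Diamonds
vl_over_t = F(1, 2)    # 50% of the cards are Diamonds
ht_over_t = F(2, 5)    # 40% of the cards are Kings

# Eq. 1: the fraction of Diamonds that are Kings
q1_over_vl = (q1_over_ht / vl_over_t) * ht_over_t
print(q1_over_vl)  # 14/25, i.e. 56%
```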

Verbalization of Bayes’ Theorem

In the illustration, common jargon would state that the prior probability of a card’s being a King, HT/T or 40%, is revised to the posterior probability, namely the probability of a Diamond’s being a King, Q1/VL or 56%. However, if Q1/VL were the given and HT/T were calculated, then, based on the same equation, common jargon would have to state that the prior probability of a Diamond’s being a King, or 56%, was revised to the posterior probability, namely the probability of a card’s being a King, or 40%.

It is easy to fall into the rut of such jargon if HT/T is thought of as the probability of a generic card’s being a King, and Q1/VL as the probability that a card specified as being a Diamond is a King. It is as if the generic were being replaced by the specific. Such a nuanced inference is not warranted by the mathematics, because the reciprocal relation is equally valid: given the numerical value of the specific, the numerical value of the generic can be calculated.

Caution

The use of replace and revise in common jargon confuses a displacement based on inequality with a replacement based on equality. Such a displacement of inequality does not elucidate Bayes’ theorem, which is the equality expressed by, Eq. 1.

The criticism of common jargon in this essay does not preclude the successive iteration of an algorithm based on Bayes’ theorem, which could involve a displacement. In such a case, the succeeding iteration uses the specific probability of the prior iteration as its generic probability. The iteration of the algorithm calculates a new specific probability based on some added or omitted characteristic. It thereby calculates a partitioning, i.e. a probability, not of the prior population, but, of a newly limited sub-population.

It should be noted that it is inappropriate and misleading to identify as Bayes’ theorem an algorithm, which iteratively employs Bayes’ theorem, just as it would be inappropriate and misleading to identify as the Pythagorean theorem an algorithm, which iteratively employs the Pythagorean theorem.

Common jargon confuses Bayes’ theorem with its algorithmic iteration.

Bayes’ theorem is a fraction, expressed algebraically in terms of other fractions.

The theorem applies to a set of data that may be tabulated in a two by two format. The data set consists of two rows by two columns. Tabulated data, with more than two rows and/or more than two columns, may be reduced to the two by two format. All rows, but the top row may be combined to form a single, bottom row as the complement of the top row. Similarly, all columns, but the left column may be combined to form a single, right column, as the complement of the left column.

Let the rows be labeled X and non-X. Let the columns be labeled A and non-A. The table presents four quadrants of data. Let the upper left quadrant be identified as (X,A). Let the total of row X be labeled TX, the total of column A be labeled TA and the grand total of the data be labeled T.

The Algebraic Form: Fractions

Bayes’ theorem or Bayes’ equation is,

(X,A) / TA = ((TX / T) * ((X,A) / TX)) / (TA / T) Eq. 1

The validity of Bayes’ equation can easily be demonstrated in that both T and TX cancel out on the right hand side of the equation, leaving the identity, (X,A) / TA ≡ (X,A) / TA

In accord with the fact that (X,A) + (non-X,A) = TA, the denominator, TA / T, is often expressed as,

(TX / T) * ((X,A) / TX) + ((Tnon-X) / T) * ((non-X,A) / (Tnon-X)) Eq. 2
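Both Eq. 1 and the expanded denominator of Eq. 2 can be verified on a hypothetical tabulation; a sketch with illustrative cell counts (not from the essay):

```python
from fractions import Fraction as F

# Hypothetical cell counts for illustration
XA, X_nonA = F(12), F(28)          # row X, split across columns A and non-A
nonX_A, nonX_nonA = F(18), F(42)   # row non-X, split the same way
TX, TnonX = XA + X_nonA, nonX_A + nonX_nonA
TA = XA + nonX_A
T = TX + TnonX

# Eq. 1
lhs = XA / TA
rhs = ((TX / T) * (XA / TX)) / (TA / T)

# Eq. 2: the denominator TA/T expanded as a sum over the two rows
denom = (TX / T) * (XA / TX) + (TnonX / T) * (nonX_A / TnonX)

print(lhs == rhs, denom == TA / T)  # True True
```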

The Verbal Form: Fractions

Verbalizing Eq. 1, we have,
Cell (X,A) as a fraction of Column A equals
(Row X as a fraction of the grand total, times Cell (X,A) as a fraction of row X) divided by column A as a fraction of the grand total.

Eq. 2, the denominator, i.e. column A as a fraction of the grand total, may be expressed as,
(Row X as a fraction of the grand total, times the Cell (X,A) as a fraction of row X) plus
(Row non-X as a fraction of the grand total, times Cell (non-X,A) as a fraction of row non-X)

Replacing the Row, Column and Element Labels

On page 50 of Proving History, Richard Carrier replaces the row, column and element labels. In place of the row labels, X and non-X, he uses ‘true’ and ‘isn’t true’. In place of the column label, A, he uses ‘our’. Instead of referring to the data elements of the table as elements, Carrier refers to them as explanations. The only data in a Bayesian analysis are the elements of the table. Consequently, the only evidence considered in a Bayesian analysis is the data. In Carrier’s terminology, the only data, thus the only ‘evidence’, are the ‘explanations’.

Carrier’s Terminology for the Fractions of Bayes’ Theorem

Probability is the fraction or ratio of a subset with respect to a set. Thus, probability is a synonym for those fractions, which are the ratio of a subset to a set. Each fraction in Bayes’ theorem is a probability, the ratio of a subset to a set.

Accordingly, Carrier uses the word, probability, for the lone fraction on the left hand side of Eq. 1. However, on the right hand side of the equation, he does not use the word, probability. He uses synonyms for it. He refers to a probability as ‘how typical’ the subset is with respect to the set, or as ‘how expected’ the subset is with respect to the set.

Probability and improbability are complements of one, just as the paired subsets in Bayes’ theorem are complements of the set. Thus, the probability of a subset with respect to a set may be referred to as the improbability of the complementary subset. Carrier does not use the expression, improbability. Instead of referring to the improbability of the complementary subset, he refers to ‘how atypical’ is the complementary subset.

Carrier’s Verbalization of Bayes’ Theorem

[Image: verbal-3]

The Left Hand Side, Eq. 1

Adopting Carrier’s terminology, ‘Cell (X,A) as a fraction of Column A’ would be, ‘the probability of our true explanations with respect to our total explanations’. Carrier renders it, ‘the probability our explanation is true’. It is as if probability primarily referred to just one isolated explanation rather than a subset of explanations as a fraction of a set of explanations to which the subset belongs.

The Right Hand Side, Eq. 1, the Numerator

Adopting Carrier’s terminology, the first term of the numerator, ‘Row X as a fraction of the grand total’, would be ‘how typical all true explanations are with respect to total explanations’, i.e. the fraction TX/T. Carrier renders it ‘how typical our explanation is’. Thus, Carrier would have it be TA/T, rather than TX/T.

In Carrier’s terminology the second term of the numerator, ‘Cell (X,A) as a fraction of row X’ would be ‘how expected are our true explanations among the set of all true explanations’. Carrier renders it ‘how expected the evidence is, if our explanation is true’. The evidence, i.e. the data, that our explanations are true, is Cell (X,A). Carrier’s rendition is thus, ‘how expected are our true explanations among the set of our true explanations’. That would be the ratio, Cell (X,A) / Cell (X,A), and not Cell (X,A) / TX.

The Right Hand Side, Eq. 1, the Denominator as Eq. 2

The first two terms of Eq. 2 are the same as the numerator of Eq. 1. Thus, there are only two more terms to be considered, namely the two terms of Eq. 2 after the ‘plus’. The first is ‘Row non-X as a fraction of the grand total’. Adopting Carrier’s terminology, this would be ‘how atypical true explanations are with respect to total explanations’, i.e. the fraction (Tnon-X)/T, which is the improbability (i.e. the atypicality) of TX/T. Carrier renders it ‘how atypical our explanation is’. Carrier would have it be (Tnon-A)/T, which is the improbability of TA/T, rather than the improbability of TX/T.

The other term is ‘Cell (non-X,A) as a fraction of row non-X’. Adopting Carrier’s terminology, this would be, ‘how expected are our non-true explanations among the set of all non-true explanations’. Carrier renders it, ‘how expected the evidence is, if our explanation isn’t true’. The evidence, i.e. the data, that our explanations aren’t true, is Cell (non-X,A). Carrier’s rendition is thus, ‘how expected are our non-true explanations among the set of our non-true explanations’. That would be the ratio, Cell (non-X,A) / Cell (non-X,A), and not Cell (non-X,A) / Tnon-X.

Valid, but Obscurant

Each fraction in Bayes’ theorem may be expressed as a probability, but also as an improbability or an atypicality. For a Bayesian tabulation of explanations, where the top row is ‘true’ and the left column is ‘our’, Bayes’ theorem gives the probability of true explanations among our explanations. It is also the atypicality, or the improbability, of non-true explanations among our explanations. However, the words ‘atypicality’ and ‘improbability’ can obscure rather than elucidate the meaning of Bayes’ theorem.

Conclusion

Bayes’ theorem can be verbalized using much of Carrier’s terminology including, probability, our, explanations, true, typical, expected and atypical. However, Carrier’s actual use of his terminology does not merely obscure, but totally obliterates the algebraic and intentional meaning of Bayes’ theorem.

On page 58 of Proving History, Richard C. Carrier states,

“So even if there was only a 1% chance that such a claim would turn out to be true, that is a prior probability of merely 0.01, the evidence in this case (e1983) would entail a final probability of at least 99.9% that this particular claim is nevertheless true. . . . Thus, even extremely low prior probabilities can be overcome with adequate evidence.”

The tabulated population data implied by Carrier’s numerical calculation, which uses Bayes’ theorem, is of the form:

[Table: carrier-1]

Bayes’ theorem permits the calculation of Cell(X,A) / Col A by the formula,

((Row X / Total Sum) * (Cell(X,A) / Row X)) / (Col A / Total Sum)
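The formula above can be sketched in a few lines of code. This is a minimal illustration, not Carrier’s own tabulation: the table counts are invented, and the helper name `bayes_from_table` is my own. The point is that the Bayesian calculation must agree with the direct ratio Cell(X,A) / Col A read off the table.

```python
# Sketch: Bayes' theorem as a ratio of cells in a 2x2 population table.
# Rows: X, non-X; columns: A, B. Cell values are element counts (illustrative).
def bayes_from_table(table):
    """Return Cell(X,A) / Col A computed via Bayes' theorem."""
    total = sum(sum(row.values()) for row in table.values())
    row_x = sum(table["X"].values())
    col_a = sum(table[r]["A"] for r in table)
    prior = row_x / total                 # Row X as a fraction of the grand total
    likelihood = table["X"]["A"] / row_x  # Cell (X,A) as a fraction of Row X
    evidence = col_a / total              # Column A as a fraction of the grand total
    return prior * likelihood / evidence

table = {"X": {"A": 30, "B": 10}, "non-X": {"A": 20, "B": 40}}
# The direct ratio Cell(X,A) / Col A is 30/50 = 0.6; Bayes' theorem must agree.
assert abs(bayes_from_table(table) - 30 / 50) < 1e-12
```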

The numerical values, listed within the equations on page 58, imply,

[Table: carrier-2]

From these, the remaining values of the table can be determined as,

[Table: carrier-3]

Carrier’s application of Bayes’ theorem in calculating the final probability, and his identification of the prior probability, are straightforward and without error.

How Error Slips In

In Bayesian jargon, the ‘prior’ probability of X is the sum of Row X divided by the Total Sum. It is 0.01, or 1%. The final probability, more commonly called the consequent or posterior probability, is the probability of X based solely on Column A, completely ignoring Column B. The probability of X, considering only Column A, is 0.01/0.0100099, or 99.9%. One may call this the final probability, the consequent probability, the posterior probability or anything else one pleases, but to pretend it is anything other than a probability based on a scope that excludes Column B is foolishness. It is in no sense ‘the overcoming of a low prior probability with sufficient evidence’, unless one is willing to claim that the proverbial ostrich, by putting its head in the sand, gains a better view of its surroundings by restricting the scope of its view to the sand.

The way this foolishness comes about is this. The prior probability is defined as the probability that ‘this’ element is a member of the subpopulation X, simply because it is a member of the overall population. The consequent or posterior probability (or, as Carrier says, the final probability) is the probability consequent or posterior to identifying the element no longer as merely a generic member of the overall population, but as an element of subpopulation A. The probability calculated by Bayes’ theorem is that of the sub-subpopulation, Cell(X,A), as a fraction of subpopulation A, thereby having nothing directly to do with Column B or the total population. In Bayesian jargon, we say the prior probability of X of 1% is revised to the probability of X of 99.9%, posterior to the observation that ‘this’ element is a member of the subpopulation A and not merely a generic member of the overall population.
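The revision of the prior into the posterior can be sketched numerically. This sketch is built from the figures quoted above; the value assigned to Cell(X,A) is an assumption consistent with those figures (all of Row X falling within Column A), not a value taken from Carrier’s table.

```python
# Figures implied by the page-58 quotation, with the Total Sum normalized to 1.
prior = 0.01                 # Row X / Total Sum: the prior probability of X
col_a = 0.0100099            # Column A / Total Sum (the restricted scope)
cell_xa = 0.01               # Cell(X,A) / Total Sum: assumed, all of Row X in Column A
posterior = cell_xa / col_a  # probability of X within the restricted scope, Column A
# The 'final probability of at least 99.9%' is just this restricted-scope ratio.
assert posterior > 0.999
```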

Clarification of the Terminology

The terminology, ‘prior probability’ and ‘posterior probability’, refers to before and after the restriction of the scope of consideration from a population to a subpopulation. The population is one which is divided into subsets by two independent criteria. This classifies the population into subsets which may be displayed in a rectangular tabulation. One criterion identifies the rows. The second criterion identifies the columns of the table. Each member of the population belongs to one and only one of the cells of the tabulation, where a cell is a subset identified by a row and a column.

An Example

A good example of such a population would be the students of a high school. Let the first criterion identify two rows: those who ate oatmeal for breakfast this morning and those who did not. The second criterion, which identifies the columns, will be the four classes: freshmen, sophomores, juniors and seniors. Notice that the sum of the subsets of each criterion is the total population. In other words, the subsets of each criterion are complements forming the population.

In the high school example, the prior probability is the fraction of the students of the entire high school who ate oatmeal for breakfast. The prior is the scope of consideration before we restrict that scope to one of the subsets of the second criterion. Let that subset of the second criterion be the sophomore class. We restrict our scope from the entire high school down to the sophomore class. The posterior probability is the fraction of sophomores who ate oatmeal for breakfast. Notice the posterior probability eliminates from consideration the freshmen, junior and senior classes. They are irrelevant to the posterior fraction.

In Bayesian jargon, prior refers to the full scope of the population, prior to restricting the scope. Posterior refers to after restricting the scope. The posterior renders anything outside of the restricted scope irrelevant.

In Carrier’s example, the full scope covers all years, prior to restricting that scope to the year, 1983, thereby ignoring all other years. This is parallel to the high school example, where the full scope covers all class years, prior to restricting that scope to the class year, sophomores, thereby ignoring all other class years.

By some quirk, let it be that 75% of the sophomore class ate oatmeal for breakfast, but none of the students of the other three classes did so. Let the four class sizes be equal. We would then say, à la Carrier, “The low prior probability (18.75%) of the truth that a student ate oatmeal for breakfast was overcome with adequate evidence, so that the final probability of the truth that a sophomore student ate oatmeal for breakfast was 75%.” Note that this ‘adequate evidence’ consists in ignoring any evidence concerning the freshmen, juniors and seniors, which evidence was considered in determining the prior.
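The arithmetic of the high school example can be checked in a few lines. This is a sketch of the figures given above: four equal classes, 75% of sophomores eating oatmeal, and no one else.

```python
# The oatmeal example: per-class fractions who ate oatmeal (from the text).
classes = {"freshman": 0.0, "sophomore": 0.75, "junior": 0.0, "senior": 0.0}
class_weight = 1 / len(classes)  # equal class sizes

# Prior: full scope, the whole high school.
prior = sum(p * class_weight for p in classes.values())
# Posterior: scope restricted to the sophomore class alone.
posterior = classes["sophomore"]

assert abs(prior - 0.1875) < 1e-12    # 18.75%, the 'low prior'
assert abs(posterior - 0.75) < 1e-12  # 75%, the 'final' probability
```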

This conclusion of ‘adequate evidence’ contrasts a fraction based on a full scope of the population, ‘the prior’, to a fraction based on a restricted scope of the population, ‘the final’. The final does not consider further evidence. The final simply ignores everything about the population outside the restricted scope.

Prejudice as a Better Jargon

A more lucid conclusion, based on the restriction of scope, may be made in terms of prejudice. The following conclusion adopts the terminology of prejudice. It is based on the same data used in the discussion above.

Knowledge of the fraction of students in this high school, who ate oatmeal, serves as the basis for our prejudging ‘this’ high school student. We know the prior probability of the truth that ‘this’ student is ‘one of them’, i.e. those who ate oatmeal for breakfast, is 18.75%. Upon further review, in noting that ‘this’ student is a sophomore, we can hone our prejudice by restricting it in scope to the sophomore class. We can now restrict the scope upon which our original prejudice was based, by ignoring all of the other subsets of the population, but the sophomore class. We now know the final probability of the truth of our prejudice that ‘this’ student is ‘one of them’ is 75%, based on his belonging to the sophomore class.

This is what Carrier is doing. His prior is the prejudice, i.e. the probability based on all years of the population. His final is the prejudice, which ignores evidence from all years except 1983.

We can now see more clearly what Carrier means by adequate evidence. He means considering only knowledge labeled 1983 and ignoring knowledge from other years. Similarly, adequate evidence for increasing our prejudice that this student ate oatmeal would mean considering only the knowledge that he is in the sophomore year and ignoring knowledge from the other class years. It was the consideration of all class years upon which our prior prejudice was based. Similarly, it was all years, including 1983, upon which Carrier’s prior prejudice was based.

To form our prior prejudice, we consider the total tabulated count. We restrict the scope of our consideration of the tabulated count to a subset in order to form our final or posterior prejudice.

We refine our prejudice by restricting the scope of its application from the whole population to a named subpopulation. Is this what is conveyed by saying that even a low chance of a statement’s being true can be increased by evidence, or, that the low probability of its truth was overcome by adequate evidence? To me, that is not what is conveyed. From the appellations of truth and evidence, I would infer that more data were being introduced into the tabulation, or at least more of the tabulated data was being considered, rather than that much of the tabulated data was being ignored.

Conclusion

Carrier’s discussion of Bayes’ theorem gives the impression that the final probability of the 1983 data depends intrinsically upon the tabulated data from all the other years. In fact, the data from all the other years are completely extrinsic, i.e. irrelevant, to the final probability of the 1983 data. The ‘final’ probability is the ratio of one subset of the 1983 data to the full set of 1983 data, ignoring all other data.

Probability is the ratio of a subpopulation of data to a population of data. In Carrier’s discussion, the population of his ‘prior’ is the entire data set. The population of his ‘final’ is solely the 1983 data, ignoring all else. He is not evaluating the 1983 data, or any sub-portion of it, in light of non-1983 evidence.

One can easily be misled by the jargon of the ‘prior probability’ of ‘the truth’, the ‘final probability’ of ‘the truth’ and ‘adequate evidence’.

In an exchange of comments with Phil Rimmer on the website, StrangeNotions.com, I attempted to explain the distinction between probability and efficiency. The topic deserves this fuller exposition.

I have argued that Richard Dawkins does not understand Darwinian evolution because he claims that the role of replacing a single stage of random mutation and natural selection with a series of sub-stages increases the probability of evolutionary success. In The God Delusion (p 121) he titles this ‘solving the problem of improbability’, i.e. the problem of low probability. My claim is that replacing the single stage with the series of sub-stages increases the efficiency of mutation while having no effect upon the probability of success.

Using Dawkins’ example of three mutation sites of six mutations each, I have illustrated the efficiency at a level of probability of 85.15%, where the series requires only 54 random mutations, while the single stage requires 478.

It may be noted that at a given number of mutations, the probability of success is greater for the series than for the single stage. A numerical example is at 54 total mutations: for the series, the probability of success is 85.15%, whereas for the single stage it is only 22.17%. The series has the greater probability of success at a total of 54 mutations.

This would appear to be a mortal blow to my argument. It would seem that Richard Dawkins correctly identifies the role of the series of sub-stages as increasing the probability of success, while not denying its role of increasing the efficiency of mutation. It would seem that Bob Drury errs, not in identifying the role of the series as increasing the efficiency of mutation, but in denying its role in increasing the probability of evolutionary success.

Hereby, I address this apparently valid criticism of my position.

The Two Equations of Probability as a Function of Random Mutations

The probability of evolutionary success for the single stage, PSS, as a function of the total number of random mutations, MSS, is:
PSS = 1 – (215/216)^MSS

The probability of evolutionary success for the series of three sub-stages, PSR, as a function of the total number of random mutations per sub-stage, MSUB, is:
PSR = (1 – (5/6)^MSUB)^3.

For the series, the total number of mutations is 3 x MSUB.
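The two equations can be compared directly in code. This is a sketch under the stated conditions of Dawkins’ example (three sites, six mutations each); the formulas are those given above, and the function names are my own.

```python
# Sketch of the two probability functions as given in the text.
def p_single(m_ss):
    """Single stage: probability the one specific 1-in-216 combination
    has appeared after m_ss random mutations."""
    return 1 - (215 / 216) ** m_ss

def p_series(m_sub):
    """Series of three sub-stages, m_sub random mutations per sub-stage;
    each site must hit its 1-in-6 target."""
    return (1 - (5 / 6) ** m_sub) ** 3

# Both processes start at the same positive probability, 1/216 = 0.46%.
assert abs(p_single(1) - 1 / 216) < 1e-12
assert abs(p_series(1) - 1 / 216) < 1e-12
# At 54 total mutations (18 per sub-stage), the series is far ahead.
assert p_series(18) > p_single(54)
```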

Comparison of Probability at the Initial Condition

At zero mutations, both probabilities are zero. Initially, the probability of both processes, namely the single stage and the series of sub-stages, is the same.

For the single stage at one random mutation, which is the minimum for a positive value of probability, the probability of success is 1/216 = 0.46%.

For the series of three stages, at one random mutation per stage, which is the minimum for a positive value of probability, the probability of success is (1/6)^3 = 1/216 = 0.46%. At this level of probability, the single stage has the greater mutational efficiency. It takes the series three random mutations to achieve the same probability of success as the single stage achieves in one random mutation.

Comparison of the Limit of Probability

For both the single stage and for the series of three stages, the limit of probability with the increasing number of mutations is the asymptotic value of 100% probability.

Comparison of the Method of Increasing Probability

For both the single stage and for the series of three stages, the method of increasing the probability is the same, namely increasing the number of random mutations. For both, probability is a function of the number of random mutations.

Comparison of the Intermediate Values between the Initial Condition and the Limit

For both the single stage and for the series of three stages, probability varies, but continually increases from the initial condition toward the limit.

Except for values of total mutations less than six, i.e. two per sub-stage, at every level of probability the series requires fewer mutations than does the single stage. Correspondingly, at any number of mutations greater than six, the series has a higher value of probability than the single stage. Thus, if the comparison is at a constant value of probability, the series requires fewer mutations. If the comparison is at a constant value of mutations, the series has a higher value of probability.

Apparent Conclusion

Richard Dawkins is right in that the series increases the probability of success, without denying that it also increases the efficiency of mutation. Bob Drury is wrong in denying the increase in probability.

The Apparent Conclusion Is False, in Consideration of the Concept of Efficiency

Both the single stage and the series of sub-stages are able to achieve any value of probability over the range from zero toward the asymptotic limit.

Efficiency is the ratio of output to input. One system or process is more efficient than another if its efficiency is numerically greater. There is no difficulty in comparing two processes where the efficiency of both is constant. In such a case, output starts at zero when input equals zero. Output is a linear function of input, having a constant positive slope. The process with the greater positive slope is the more efficient. However, in cases where the efficiencies vary, the comparison of efficiencies must be made at the same value of the numerator of the efficiency ratio, i.e. the output, or at the same value of the denominator, the input.

In this comparison of the single stage vs. the series of sub-stages, the output is probability and the input is the number of random mutations. Remember both processes increase probability by the same means, namely by increasing the number of random mutations. That is, output increases with increasing input. Also, remember that both processes do not differ in that they both approach the same limit of probability asymptotically.

Dawkins’ comparison of replacing the single stage with a series of sub-stages is the comparison of two processes.

In the numerical examples above, we can calculate and compare the efficiencies of the two processes at a constant output, e.g. of 85.15% probability, and at a constant input, e.g. of 54 mutations.

At the constant output of 85.15%, the efficiency for the single stage is 85.15/478 = 0.18. For the series of sub-stages, the efficiency is 85.15/54 = 1.57. The mutational efficiency is greater for the series than for the single stage at the constant output of 85.15% for both processes.

At the constant input of 54 mutations, the probability for the single stage is P = 1 – (215/216)^54 = 22.17%. Therefore, its efficiency is 22.17/54 = 0.41. At this constant input, the efficiency for the series is 85.15/54 = 1.57. The mutational efficiency is greater for the series than for the single stage at the constant input of 54 mutations for both processes.

At the 85.15% probability level, the series is greater in mutational efficiency than the single stage by a factor of 478/54 = 8.8.
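The efficiency comparison can be sketched by solving each equation for the number of mutations needed to reach a target probability. This is my own sketch (the function names are assumptions); solving the continuous formulas and rounding up may differ by a mutation or two from the illustrative counts quoted in the text, but the ordering is what matters.

```python
import math

# Sketch: smallest number of mutations needed to reach probability p,
# from the continuous solution of each formula, rounded up.
def mutations_single(p):
    # Solve p = 1 - (215/216)^M for M.
    return math.ceil(math.log(1 - p) / math.log(215 / 216))

def mutations_series(p):
    # Solve p = (1 - (5/6)^m)^3 for m per sub-stage; total is 3m.
    per_sub = math.ceil(math.log(1 - p ** (1 / 3)) / math.log(5 / 6))
    return 3 * per_sub

# At every probability level, the series needs fewer mutations:
# greater mutational efficiency, not greater attainable probability.
for p in (0.5, 0.85, 0.99, 0.999999):
    assert mutations_series(p) < mutations_single(p)

# At 99.9999%, the efficiency factor is close to the text's 12.1.
ratio = mutations_single(0.999999) / mutations_series(0.999999)
assert 11.5 < ratio < 12.5
```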

Further evidence that Dawkins is illustrating an increase in efficiency, and not an increase in probability, is that he compares the temporal efficiencies of two computer programs. For both programs, the input of the number of random mutations is equated with the time of operation from initiation to termination. Termination is upon the random inclusion of one specific mutation. The sub-stage based program typically wins the race against the single stage based program. This demonstrates the greater mutational efficiency of the series of sub-stages, not a greater probability of success.

In the numerical example of three sites of six mutations each, the specific mutation would be one of 216. Let us modify the computer program races slightly. This will give us a greater insight into the meaning of probability and the meaning of efficiency.

Let each program be terminated after 54 and 478 mutations for the series and the single stage, respectively. If the comparison is performed 10,000 times, one would anticipate that on the average, both programs would contain at least one copy of the specific mutation in 8,515 of the trials and no copies in 1,485 of the trials. The series program would be more efficient because it took only 54 mutations or units of time, compared to 478 mutations or units of time for the single stage program to achieve a probability of 85.15%.

For the numerical illustration of three mutation sites of six mutations each, both the single stage and the series of sub-stages have the same initial positive probability of success, namely 0.46%. Both can achieve any value of probability short of the asymptotic value of 100%. They do not differ in the probability of success attainable.

It does not matter whether we compare the relative efficiencies of the series vs. the single stage at a constant output or at a constant input: the series has the greater mutational efficiency for total mutations greater than six.

For the numerical illustration of three mutation sites of six mutations each, at a probability of 85.15%, the series is greater in mutational efficiency by a factor of 8.8. At 90% probability, the factor of efficiency is 8.9 in favor of the series. At a probability of 99.9999%, the factor of efficiency is 12.1 in favor of the series.

Analogy to a Different Set of Two Equations

Let the distance traveled by two autos be plotted as a function of fuel consumption. Distance increases with the amount of fuel consumed. Let the distance traveled at every value of fuel consumption be greater for auto two than auto one. Similarly, at every value of distance traveled, auto two would have used less fuel than auto one. My understanding would be inadequate and lacking comprehension, if I said that replacing auto one with auto two increases the distance traveled. It would be equally inane to say that auto two solves the problem of too low a distance. My understanding would be complete and lucid, if I said that replacing auto one with auto two increases fuel efficiency.

There is a distinction between distance and fuel efficiency. Understanding the comparison between the two autos is recognizing it as a comparison of fuel efficiency. Believing it to be a comparison of distances is a failure to understand the comparison.

For both the single stage and the series of sub-stages of evolution, probability increases with the number of random mutations. Except for the minimum number for the sub-series, at every greater number of random mutations, the probability is greater for the series of sub-stages than for the single stage of evolution. Similarly, except for the minimum positive value, at every value of probability, the series requires fewer random mutations. My understanding would be inadequate and lacking comprehension, if I said that replacing the single stage with the series increases the probability attained. It would be equally inane to say that the series solves the problem of too low a probability. My understanding would be complete and lucid, if I said that replacing the single stage with the series increases mutational efficiency.

The role of a series of sub-stages in replacing a single stage of random mutation and natural selection is to increase the efficiency of random mutation while having no effect on the probability of evolutionary success. This is evident by comparing the equations of probability for the series and for the single stage as functions of the number of random mutations. This is the very comparison proposed by Richard Dawkins for the sake of understanding evolution. He misunderstood it as “a solution to the problem of improbability” (The God Delusion, page 121), i.e. as solving the problem of too low a probability.

There is a distinction between probability and mutational efficiency. Understanding the comparison between the series of sub-stages and the single stage is recognizing it as a comparison of mutational efficiency. Believing it to be a comparison of probabilities is a failure to understand the comparison.