
Monthly Archives: October 2016

Bayes’ theorem is a fraction, expressed algebraically in terms of other fractions.

The theorem applies to a set of data that may be tabulated in a two by two format. The data set consists of two rows and two columns. Tabulated data with more than two rows and/or more than two columns may be reduced to the two by two format. All rows but the top row may be combined to form a single bottom row, the complement of the top row. Similarly, all columns but the left column may be combined to form a single right column, the complement of the left column.
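
As a sketch of this reduction in code, with purely hypothetical counts (the numbers and the Python rendering are illustrative, not part of the original):

    # Reduce a 3 by 3 table of counts to the two by two format:
    # keep the top row and left column; combine the rest into complements.
    rows = [
        [10, 20, 30],   # top row, X
        [5, 15, 25],
        [1, 2, 3],
    ]
    bottom = [sum(col) for col in zip(*rows[1:])]   # combined rows form non-X

    def collapse(row):
        return [row[0], sum(row[1:])]   # left column A, combined columns non-A

    table = [collapse(rows[0]), collapse(bottom)]
    print(table)   # [[10, 50], [6, 45]]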

Let the rows be labeled X and non-X. Let the columns be labeled A and non-A. The table presents four quadrants of data. Let the upper left quadrant be identified as (X,A). Let the total of row X be labeled TX, the total of column A be labeled TA and the grand total of the data be labeled T.

The Algebraic Form: Fractions

Bayes’ theorem or Bayes’ equation is,

(X,A) / TA = ((TX / T) * ((X,A) / TX)) / (TA / T) Eq. 1

The validity of Bayes’ equation can easily be demonstrated in that both T and TX cancel out on the right hand side of the equation, leaving the identity, (X,A) / TA ≡ (X,A) / TA.

In accord with the fact that (X,A) + (non-X,A) = TA, the denominator, TA / T, is often expressed as,

((TX / T) * ((X,A) / TX)) + ((Tnon-X / T) * ((non-X,A) / Tnon-X))     Eq. 2
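
As a numerical check, here is a short Python sketch with a hypothetical two by two table of counts, confirming that Eq. 1, with either form of the denominator, reproduces (X,A) / TA:

    # Hypothetical counts for the four quadrants.
    XA, XnonA = 30, 40        # row X
    nXA, nXnonA = 10, 20      # row non-X

    TX = XA + XnonA           # total of row X
    TnonX = nXA + nXnonA      # total of row non-X
    TA = XA + nXA             # total of column A
    T = TX + TnonX            # grand total

    lhs = XA / TA                                    # left hand side of Eq. 1
    rhs = ((TX / T) * (XA / TX)) / (TA / T)          # right hand side of Eq. 1
    denom = (TX / T) * (XA / TX) + (TnonX / T) * (nXA / TnonX)   # Eq. 2
    print(lhs, rhs, ((TX / T) * (XA / TX)) / denom)  # all three print 0.75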

The Verbal Form: Fractions

Verbalizing Eq. 1, we have,
Cell (X,A) as a fraction of Column A equals
(Row X as a fraction of the grand total, times Cell (X,A) as a fraction of row X) divided by column A as a fraction of the grand total.

Per Eq. 2, the denominator, i.e. column A as a fraction of the grand total, may be expressed as,
(Row X as a fraction of the grand total, times the Cell (X,A) as a fraction of row X) plus
(Row non-X as a fraction of the grand total, times Cell (non-X,A) as a fraction of row non-X)

Replacing the Row, Column and Element Labels

On page 50 of Proving History, Richard Carrier replaces the row, column and element labels. In place of the row labels, X and non-X, he uses ‘true’ and ‘isn’t true’. In place of the column label, A, he uses ‘our’. Instead of referring to the data elements of the table as elements, Carrier refers to them as explanations. The only data in a Bayesian analysis are the elements of the table. Consequently, the only evidence considered in a Bayesian analysis is the data. In Carrier’s terminology, the only data, thus the only ‘evidence’, are the ‘explanations’.

Carrier’s Terminology for the Fractions of Bayes’ Theorem

Probability is the fraction or ratio of a subset with respect to a set. Thus, probability is a synonym for any fraction which is the ratio of a subset to a set. Each fraction in Bayes’ theorem is such a ratio, and therefore a probability.

Accordingly, Carrier uses the word, probability, for the lone fraction on the left hand side of Eq. 1. However, on the right hand side of the equation, he does not use the word, probability; he uses synonyms for it. He refers to a probability as ‘how typical’ the subset is with respect to the set, or as ‘how expected’ the subset is with respect to the set.

Probability and improbability are complements of one, just as the paired subsets in Bayes’ theorem are complements of the set. Thus, the probability of a subset with respect to a set may be referred to as the improbability of the complementary subset. Carrier does not use the expression, improbability. Instead of referring to the improbability of the complementary subset, he refers to ‘how atypical’ the complementary subset is.

Carrier’s Verbalization of Bayes’ Theorem

The Left Hand Side, Eq. 1

Adopting Carrier’s terminology, ‘Cell (X,A) as a fraction of Column A’ would be, ‘the probability of our true explanations with respect to our total explanations’. Carrier renders it, ‘the probability our explanation is true’. It is as if probability primarily referred to just one isolated explanation rather than a subset of explanations as a fraction of a set of explanations to which the subset belongs.

The Right Hand Side, Eq. 1, the Numerator

Adopting Carrier’s terminology, the first term of the numerator, ‘Row X as a fraction of the grand total’, would be ‘how typical all true explanations are with respect to total explanations’, i.e. the fraction is TX/T. Carrier renders it ‘how typical our explanation is’. Thus, Carrier would have it be TA/T, rather than TX/T.

In Carrier’s terminology the second term of the numerator, ‘Cell (X,A) as a fraction of row X’ would be ‘how expected are our true explanations among the set of all true explanations’. Carrier renders it ‘how expected the evidence is, if our explanation is true’. The evidence, i.e. the data, that our explanations are true, is Cell (X,A). Carrier’s rendition is thus, ‘how expected are our true explanations among the set of our true explanations’. That would be the ratio, Cell (X,A) / Cell (X,A), and not Cell (X,A) / TX.

The Right Hand Side, Eq. 1, the Denominator as Eq. 2

The first two terms of Eq. 2 are the same as the numerator of Eq. 1. Thus, there are only two more terms to be considered, namely the two terms of Eq. 2 after the ‘plus’. The first is ‘Row non-X as a fraction of the grand total’. Adopting Carrier’s terminology, this would be ‘how atypical true explanations are with respect to total explanations’, i.e. the fraction is (Tnon-X)/T, which is the improbability (i.e. the atypicality) of TX/T. Carrier renders it ‘how atypical our explanation is’. Carrier would have it be (Tnon-A)/T, which is the improbability of TA/T, rather than the improbability of TX/T.

The other term is ‘Cell (non-X,A) as a fraction of row non-X’. Adopting Carrier’s terminology, this would be, ‘how expected are our non-true explanations among the set of all non-true explanations’. Carrier renders it, ‘how expected the evidence is, if our explanation isn’t true’. The evidence, i.e. the data, that our explanations aren’t true, is Cell (non-X,A). Carrier’s rendition is thus, ‘how expected are our non-true explanations among the set of our non-true explanations’. That would be the ratio, Cell (non-X,A) / Cell (non-X,A), and not Cell (non-X,A) / Tnon-X.

Valid, but Obscurant

Each fraction in Bayes’ theorem may be expressed as a probability, but also as an improbability or an atypicality. For a Bayesian tabulation of explanations, where the top row is ‘true’ and the left column is ‘our’, Bayes’ theorem gives the probability of true explanations among our explanations. It is also the atypicality, or the improbability, of non-true explanations among our explanations. However, the words, atypicality and improbability, can obscure rather than elucidate the meaning of Bayes’ theorem.

Conclusion

Bayes’ theorem can be verbalized using much of Carrier’s terminology including, probability, our, explanations, true, typical, expected and atypical. However, Carrier’s actual use of his terminology does not merely obscure, but totally obliterates the algebraic and intentional meaning of Bayes’ theorem.


On page 58 of Proving History, Richard C. Carrier states,

“So even if there was only a 1% chance that such a claim would turn out to be true, that is a prior probability of merely 0.01, the evidence in this case (e1983) would entail a final probability of at least 99.9% that this particular claim is nevertheless true. . . . Thus, even extremely low prior probabilities can be overcome with adequate evidence.”

The tabulated population data implied by Carrier’s numerical calculation, which uses Bayes’ theorem, is of the form:

                   Col A (1983)       Col B (other years)    Sum Row
Row X (true)       Cell(X,A)          Cell(X,B)              Sum Row X
Row non-X          Cell(non-X,A)      Cell(non-X,B)          Sum Row non-X
Sum Col            Sum Col A          Sum Col B              Total Sum

Bayes’ theorem permits the calculation of Cell(X,A) / Col A by the formula,

((Row X / Total Sum) * (Cell(X,A) / Row X)) / (Col A / Total Sum)

The numerical values, listed within the equations on page 58, imply,

Sum Row X / Total Sum = 0.01 (the prior probability of 1%)
Cell(X,A) / Sum Row X = 1
Cell(non-X,A) / Sum Row non-X = 0.00001

From these, the remaining values of the table can be determined as,

                   Col A (1983)    Col B (other years)    Sum Row
Row X (true)       0.01            0                      0.01
Row non-X          0.0000099       0.9899901              0.99
Sum Col            0.0100099       0.9899901              1

Carrier’s application of Bayes’ theorem, in calculating the final probability and in identifying the prior probability, is straightforward and without error.

How Error Slips In

In Bayesian jargon, the ‘prior’ probability of X is the Sum of Row X divided by the Total Sum. It is 0.01 or 1%. The final probability, more commonly called the consequent or posterior probability, is the probability of X based solely on Column A, completely ignoring Column B. The probability of X, considering only Column A, is 0.01/0.0100099 or 99.9%. One may call this the final probability, the consequent probability, the posterior probability or anything else one pleases, but to pretend it is anything other than a probability based on a scope which excludes Column B is foolishness. It is in no sense ‘the overcoming of a low prior probability with sufficient evidence’, unless one is willing to claim that the proverbial ostrich, by putting its head in the sand, gains a better view of its surroundings by restricting the scope of its view to the sand.

The way this foolishness comes about is this. The prior probability is defined as the probability that ‘this’ element is a member of the subpopulation X, simply because it is a member of the overall population. The consequent or posterior probability (or as Carrier says, the final probability) is the probability consequent or posterior to identifying the element, no longer as merely a generic member of the overall population, but now identifying it as an element of subpopulation A. The probability calculated by Bayes’ theorem is that of sub-subpopulation, Cell(X,A), as a fraction of subpopulation A, thereby having nothing directly to do with Column B or the total population. In Bayesian jargon we say the prior probability of X of 1% is revised to the probability of X of 99.9%, posterior to the observation that ‘this element’ is a member of the subpopulation A and not merely a generic member of the overall population.
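
A sketch of the arithmetic behind this revision, assuming, consistently with the fraction 0.01/0.0100099 above, that all of Row X falls within Column A and that Cell(non-X,A) is 0.00001 of Row non-X:

    prior = 0.01              # Sum Row X / Total Sum
    p_A_given_X = 1.0         # Cell(X,A) / Sum Row X
    p_A_given_nonX = 0.00001  # Cell(non-X,A) / Sum Row non-X

    col_A = prior * p_A_given_X + (1 - prior) * p_A_given_nonX
    posterior = (prior * p_A_given_X) / col_A
    print(col_A)              # 0.0100099, Sum Col A as a fraction of the total
    print(posterior)          # 0.99901..., i.e. at least 99.9%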

Clarification of the Terminology

The terminology, ‘prior probability’ and ‘posterior probability’, refers to before and after the restriction of the scope of consideration from a population to a subpopulation. The population is one which is divided into subsets by two independent criteria. This classifies the population into subsets which may be displayed in a rectangular tabulation. One criterion identifies the rows. The second criterion identifies the columns of the table. Each member of the population belongs to one and only one of the cells of the tabulation, where a cell is a subset identified by a row and a column.

An Example

A good example of such a population would be the students of a high school. Let the first criterion identify two rows: those who ate oatmeal for breakfast this morning and those who did not. The second criterion, which identifies the columns, will be the four classes: freshmen, sophomores, juniors and seniors. Notice that the sum of the subsets of each criterion is the total population. In other words, the subsets of each criterion are complements forming the population.

In the high school example, the prior probability is the fraction of the students of the entire high school who ate oatmeal for breakfast. The prior is the scope of consideration before we restrict that scope to one of the subsets of the second criterion. Let that subset of the second criterion be the sophomore class. We restrict our scope from the entire high school down to the sophomore class. The posterior probability is the fraction of sophomores who ate oatmeal for breakfast. Notice the posterior probability eliminates from consideration the freshman, junior and senior classes. They are irrelevant to the posterior fraction.

In Bayesian jargon, prior refers to the full scope of the population, prior to restricting the scope. Posterior refers to after restricting the scope. The posterior renders anything outside of the restricted scope irrelevant.

In Carrier’s example, the full scope covers all years, prior to restricting that scope to the year, 1983, thereby ignoring all other years. This is parallel to the high school example, where the full scope covers all class years, prior to restricting that scope to the class year, sophomores, thereby ignoring all other class years.

By some quirk, let it be that 75% of the sophomore class ate oatmeal for breakfast, but none of the students of the other three classes did so. Let the four class sizes be equal. We would then say, à la Carrier, “The low prior probability (18.75%) of the truth that a student ate oatmeal for breakfast was overcome with adequate evidence, so that the final probability of the truth that a sophomore student ate oatmeal for breakfast was 75%.” Note that this ‘adequate evidence’ consists in ignoring any evidence concerning the freshmen, juniors and seniors, which evidence was considered in determining the prior.
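
In code, the prior and the final probability of the oatmeal example are simply ratios taken over two different scopes; the class size of 100 is a hypothetical convenience:

    class_size = 100   # hypothetical, equal for all four classes
    oatmeal = {"freshman": 0, "sophomore": 75, "junior": 0, "senior": 0}

    prior = sum(oatmeal.values()) / (4 * class_size)   # whole school: 0.1875
    final = oatmeal["sophomore"] / class_size          # sophomores only: 0.75
    print(prior, final)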

This conclusion of ‘adequate evidence’ contrasts a fraction based on a full scope of the population, ‘the prior’, to a fraction based on a restricted scope of the population, ‘the final’. The final does not consider further evidence. The final simply ignores everything about the population outside the restricted scope.

Prejudice as a Better Jargon

A more lucid conclusion, based on the restriction of scope, may be made in terms of prejudice. The following conclusion adopts the terminology of prejudice. It is based on the same data used in the discussion above.

Knowledge of the fraction of students in this high school who ate oatmeal serves as the basis for our prejudging ‘this’ high school student. We know the prior probability of the truth that ‘this’ student is ‘one of them’, i.e. those who ate oatmeal for breakfast, is 18.75%. Upon further review, in noting that ‘this’ student is a sophomore, we can hone our prejudice by restricting its scope to the sophomore class, ignoring all of the other subsets of the population. We now know the final probability of the truth of our prejudice that ‘this’ student is ‘one of them’ is 75%, based on his belonging to the sophomore class.

This is what Carrier is doing. His prior is the prejudice, i.e. the probability based on all years of the population. His final is the prejudice, which ignores evidence from all years except 1983.

We can now see more clearly what Carrier means by adequate evidence. He means considering only knowledge labeled 1983 and ignoring knowledge from other years. Similarly, adequate evidence for increasing our prejudice that this student ate oatmeal would mean considering only the knowledge that he is in the sophomore class and ignoring knowledge of the other class years. Yet it was the consideration of all class years upon which our prior prejudice was based, just as it was all years, including 1983, upon which Carrier’s prior prejudice was based.

To form our prior prejudice, we consider the total tabulated count. We restrict the scope of our consideration of the tabulated count to a subset in order to form our final or posterior prejudice.

We refine our prejudice by restricting the scope of its application from the whole population to a named subpopulation. Is this what is conveyed by saying that even a low chance of a statement’s being true can be increased by evidence, or that the low probability of its truth was overcome by adequate evidence? To me, that is not what is conveyed. From the appellations of truth and evidence, I would infer that more data were being introduced into the tabulation, or at least that more of the tabulated data were being considered, rather than that much of the tabulated data was being ignored.

Conclusion

Carrier’s discussion of Bayes’ theorem gives the impression that the final probability of the 1983 data depends intrinsically upon the tabulated data from all the other years. In fact, the data from all the other years are completely extrinsic, i.e. irrelevant, to the final probability of the 1983 data. The ‘final’ probability is the ratio of one subset of the 1983 data divided by the set of 1983 data, ignoring all other data.

Probability is the ratio of a subpopulation of data to a population of data. In Carrier’s discussion, the population of his ‘prior’ is the entire data set. The population of his ‘final’ is solely the 1983 data, ignoring all else. He is not evaluating the 1983 data, or any sub-portion of it, in light of non-1983 evidence.

One can easily be misled by the jargon of the ‘prior probability’ of ‘the truth’, the ‘final probability’ of ‘the truth’ and ‘adequate evidence’.

In the previous essay, Bayes’ theorem was illustrated in the case of continuous sets. This essay focuses on sets of discrete elements in a tabulated format.

Let a set be visualized as a column of tiers, i.e. a vertical array of subsets. Let the column have the header or marker, A. Let the subsets or tiers have headers or markers X, Y, Z etc. Let the number of elements per subset or cell of the vertical array be Cell(i), where i = X, Y, Z etc. The total number of elements in this single column is Sum Column A. There are no overlapping subsets, so Bayes’ theorem is inapplicable.

In expanding the set by introducing more columns with markers or headers B, C, D etc., we would then have a two dimensional array of subsets or cells. The array would be one of multiple rows and multiple columns. Each i would be the header of a row. Each column would be identified by a header or marker, j. The number of elements in each subset or cell of the two dimensional array would be Cell(i,j). The total number of elements in any row, e.g. row Y, would be designated Sum Row Y. Geometrically, this two dimensional array of cells could be extended by only one more orthogonal category of IDs or markers. Algebraically, however, it can be extended to any number of independent categories of IDs or markers.

Each subset, identified by a specific i and j, is a cell, the overlap of a row and a column. Bayes’ theorem may be applied to such a two dimensional array because of the overlap of rows and columns which form cells each identified by two markers, i and j. For illustrative simplicity we will use a two by two tabulated array,

            Col A          Col B          Sum Row
Row X       Cell(X,A)      Cell(X,B)      Sum Row X
Row Y       Cell(Y,A)      Cell(Y,B)      Sum Row Y
Sum Col     Sum Col A      Sum Col B      Total Sum

The Form and Formation of Bayes’ Theorem

Bayes’ theorem depends upon an identity of the following algebraic form.

(R/C) ≡ (R/C)

We can then multiply both sides of the identity by 1, thereby preserving the equality. We multiply the left side numerator and denominator by L/(RC) and the right side numerator and denominator by 1/T. This yields,

(L / C) / (L / R) = (R / T) / (C / T)

Multiplying both sides by L/R yields,

(L/C) = ((L / R) * (R / T)) / (C/T)

By replacing L with Cell(X,A); C with Sum Col A; R with Sum Row X and T with Total Sum, we have Bayes’ theorem as applied to our illustrative table.

(Cell(X,A) / Sum Col A) =
((Cell(X,A) / Sum Row X) * (Sum Row X / Total Sum)) / (Sum Col A / Total Sum)     Eq. 1

However, the denominator, Sum Col A / Total Sum, is usually modified to,

((Cell(X,A) / Sum Row X) * (Sum Row X / Total Sum)) +
((Cell(Y,A) / Sum Row Y) * (Sum Row Y / Total Sum))
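
For readers who wish to check the algebra mechanically, here is a minimal symbolic sketch; it assumes the sympy library, which is no part of the original derivation:

    from sympy import symbols, simplify

    L, R, C, T = symbols("L R C T", positive=True)

    # Right hand side of Bayes' theorem minus the left hand side.
    diff = ((L / R) * (R / T)) / (C / T) - L / C
    print(simplify(diff))   # prints 0, confirming the identity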

Bayes’ theorem is used to calculate (Cell(X,A) / Sum Col A). However, if we had the data from the table, we would just use those two table values for the calculation; we would not use Bayes’ theorem. We do use Bayes’ theorem when the numerical information we have is limited to the fractions of the right hand side of the equation. For illustration, let these numerical values be:

Cell(X,A) / Sum Row X = 0.7857
Sum Row X / Total = 0.7 (N.B. therefore Sum Row Y / Total = 0.3)
Cell(Y,A) / Sum Row Y = 0.3333

From these values, Bayes’ theorem, Eq. 1, with the denominator modified, is

(Cell(X,A) / Sum Col A) =
(0.7857 * 0.7) / ((0.7857 * 0.7) + (0.3333 * 0.3)) = 0.54999/0.64998 ≈ 55/65, i.e. approximately 84.6%
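
The same arithmetic as a short Python sketch:

    p_A_given_X = 0.7857   # Cell(X,A) / Sum Row X
    p_X = 0.7              # Sum Row X / Total Sum
    p_A_given_Y = 0.3333   # Cell(Y,A) / Sum Row Y
    p_Y = 0.3              # Sum Row Y / Total Sum

    cell_XA = p_A_given_X * p_X            # ~0.55 of the Total Sum
    col_A = cell_XA + p_A_given_Y * p_Y    # ~0.65 of the Total Sum
    print(cell_XA / col_A)                 # ~0.846, i.e. 55/65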

From this information we can construct the table as percentages of the Total Sum of 100%, beginning with the values known from above and completing the rest by subtraction.

            Col A    Col B    Sum Row
Row X       55%      15%      70%
Row Y       10%      20%      30%
Sum Col     65%      35%      100%

Verbally, Eq. 1 is:
The fraction of column A, that is also identified by row marker X
equals
(the fraction of row X, that is also identified by column marker A)
times
(the fraction of the total set, that is identified by row marker, X)
with this product divided by
(the fraction of the total set, that is identified by column marker A)

Probability is the fractional concentration of an element in a logical set. Consequently, a verbal expression of Eq. 1 is:

The probability of both marker A and marker X with respect to the subset, marker A
equals
(the probability of both marker A and marker X with respect to the subset, marker X)
times
(the overall probability of the marker X)
with this product divided by
(the overall probability of marker A)

However, this is not how it is usually expressed.

Misleading Common Jargon

In one instance of common jargon, Bayes’ theorem is expressed as:
Given the truth of A, the truth of belief X is credible to a degree which can be calculated by Bayes’ theorem.

Another expression in common jargon is:
Bayes’ theorem expresses the probability of X, posterior to the observation of A, in contrast to the probability of X prior to the observation of A. In other words, the prior probability of X, which was 70%, is revised to approximately 84.6% (i.e. 55/65), due to the observation of A.

Another expression is: The Bayesian inference derives the posterior probability of approximately 84.6% as a consequence of two antecedents, the prior probability of 70% and the likelihood function, which numerically is 0.7857, normalized by the probability of A, which is 65%.

The Bayesian inference is also viewed as the likelihood of event X, given the observation of event A. The inference is based on three priors. The priors are the probability of event A given event X, 55/70 or 78.57%, the probability of event X, 70%, and the probability of event A, 65%.
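
Numerically, the ‘three priors’ formulation yields the same value as Eq. 1:

    # P(X|A) = P(A|X) * P(X) / P(A)
    print(0.7857 * 0.7 / 0.65)   # ~0.846, i.e. 55/65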

Evaluation of Common Jargon

To label A an observed fact of evidence in support of the truth of belief X is gratuitous, because the meanings of evidence and belief imply extrapolation beyond the context of nominal markers. Philosophical conclusions, e.g. when labeled beliefs, are not nominal Bayesian markers. It is also gratuitous to label elements of sets ‘events’.

Probability is the fractional concentration of an element in a logical set. The IDs of the elements are purely nominal because none of the characteristics associated with the ID is relevant to probability. The only characteristic of an element that matters within the context of probability is its capacity to be counted.

A Proper View

From a valid Bayesian perspective, some markers of elements of sets are observed, in the example, A and B, while some markers are not observed, in the example, X and Y. Remember that the ID of each element is a pair of markers, one from a row and one from a column. The Bayesian inference provides quantification of the prejudice that an element has one of the unobserved markers, such as X, where the prejudice is based upon observing that this element has one of the observable markers, such as A.

Bayesian inference is the quantification of prejudice, not the provision of evidence of the truth of a verbal belief.

Such quantification of prejudice is useful in making prudential decisions, e.g. in industry, where the past performances (X, Y etc.) of a variety of material processing methods (A, B etc.) serve as the basis of predicting their future performance. There are a variety of other areas in which Bayesian analysis may be incorporated into algorithms for reaching decisions. Of course, prudence is not the determination of truth, but the selection of a good course of action to achieve a goal.

The Bayesian quantification of prejudice can be harmful in social and employment settings.

A Contrast of ‘Truth’ Jargon vs. ‘Prejudice’ Jargon

Using the table for a numerical example, focused on Cell(X,A):

In common jargon, the Bayesian inference is: Given the truth of observation A, the probability of unobserved belief X is revised from the prior value for the probability of belief X, which was 70%, to the posterior probability that belief X is true, given the truth of A. The posterior probability is approximately 84.6%, i.e. 55/65.

To more clearly elucidate the Bayesian inference, I prefer the jargon: The observation of marker A prejudices the presence of the unobserved marker, X, at a quantitative level of approximately 84.6%. If the presence of A is not specified, the probability of marker X for the population as a whole is 70%.

A Further Critique of the Jargon of the Truth of Observation Leading to the Truth of Belief

Bayes’ theorem applies to a population of elements in which each element is identified by two markers, one from each of two categories of markers. The first category of markers comprises the IDs of the rows of a rectangular display of the elements of the population. The second category comprises the IDs of the columns of the rectangular display. In such a rectangular or orthogonal distribution of a population, the distribution of the elements with respect to the one category of markers is independent of the distribution of the elements with respect to the other category.

Bayes’ theorem expresses the probability of one marker of category one, i.e. a row marker, with respect to the entire column of a given marker of category two, i.e. a column marker.

The probabilities summed, row plus row, within a column equal the probability of the marker of that column. In other words, as is typical of probability, the probabilities of the rows within a column are complements forming the probability of the column as a whole. Consequently, the marker of one row cannot be the antithesis of the marker of another row. Subsets which add to form a column must be compatible as parts of a whole.

Complements are of the forms, ‘Some are red’ and ‘Some are non-red’. Their sum is the whole. The row markers of a population to which Bayes’ theorem can be applied may be just two. However, the row markers of such a population may be any number, the sum of which within a column composes that column as a whole. Just as the elements, row by row within a column, sum to the column as a whole, so too do the probabilities. Likewise, the sum of the column probabilities within a row equals the probability of the row.

No row marker, in an orthogonally identified population to which Bayes’ theorem is applied, may be viewed as the antithesis of another row marker. The markers may be nominally antithetical, as true and false. However, in the context of Bayes’ theorem, subsets labeled true and false must be compatible as complements.

The rows of an orthogonal population distribution within a column are complementary, as the parts of the column as a whole. The rows of an orthogonal population distribution overall are the complementary parts of the whole population.

From this perspective, identifying a column marker as true, and thereby arriving at a judgment of the degree of certitude in the truth of a row marker taken as a belief, is misleading to say the least. It confounds mathematical probability with probability in the sense of human certitude.

Mathematical probability is the fractional concentration of an element in a logical set. Probability as human certitude is a quality characterizing one’s own subjective judgment even when one employs a quasi-quantitative value to express his subjectivity.

In contrast, for a given element of a population, a column marker may be said to be observed in the element and thereby a Bayesian calculation may be said to determine the degree of prejudice of the presence of an unobserved row marker in that element.

Note: What I have mistakenly called contradictories in this essay, are, in classical logic, called contraries (every vs. none). What I have called contraries, are classically called sub-contraries (some are vs. some are not).