Probability and induction.
‘Probability’ is an ambiguous word. In the history of ideas, it has
been used with many different senses giving rise to different concepts
of probability. Being associated with games of chance and gambling,
death tolls and insurance policies, statistical inferences and the
chancy world of modern physics, probabilities have been made susceptible
to different interpretations. These interpretations may reflect on
probabilities the objectivity of logic or the subjectivity of a person’s
belief and lack of knowledge, the frequencies of observed data or the
real tendency of a system to yield an outcome. Commonly, but not always,
they are considered to be interpretations of the mathematical concept of
probability, which by itself and in itself has no empirical meaning.
The article attempts to present the different meanings of
‘probability’ and provide an introductory topography of the conceptual
landscape. Although it does not try to provide a history of the idea, it takes
historical elements into account. Also, since an exhaustive
treatment would be difficult, we focus mainly on the discussion
of induction and confirmation. The article is intended as a companion to
another entry on IEP in which we discuss The Problem of Induction
(Psillos and Stergiou, 2022); this explains why we do not deal here with
Hans Reichenbach’s major contribution to the interpretation of
probability theory.
Table of Contents
- Elements of Probability Theory and its Interpretations
- On Mathematical Probability
- Interpretations of Probability
- What is Probability?
- The Classical Interpretation
- Probability as a Measure of Ignorance
- Probabilities as Frequencies
- Are Propensities Probabilities?
- Probability as the Logic of Induction
- Keynes and The Logical Concept of Probability
- The Principle of Indifference
- Keynes on the Problem of Induction
- On the Rule of Succession
- Carnap’s Inductive Logic
- Two Concepts of Probability
- C-functions
- The Continuum of Inductive Methods
- Subjective Probability and Bayesianism
- Probabilities as Degrees of Belief
- Dutch Books
- Bayesian Induction
- Too Subjective?
- Some Success Stories
- Appendices
- Lindenbaum algebra and probability in sentential logic.
- The Rule of Succession: a mathematical proof
- The mathematics of Keynes’s account of Pure Induction
- References and further reading
Elements of Probability and its Interpretations
On Mathematical Probability
In the monograph Foundations of the Theory of Probability,
first published in German in 1933, the Soviet mathematician A. N.
Kolmogorov presented the definitive form of what is nowadays regarded as the
axiomatization of mathematical probability.
The challenge of axiomatization was set by D. Hilbert in the
sixth of his famous twenty-three problems at the beginning of the twentieth
century (1902):
…to treat in the same manner [as geometry], by means of axioms, those
physical sciences in which mathematics plays an important part; in the
first rank are the theory of probabilities and mechanics.
Kolmogorov, addressing the problem, developed a theory of probability
as a mathematical discipline “from axioms in exactly the same way as
Geometry and Algebra” (1933:1). In his axiomatization, probability and
the other primary concepts, devoid of any empirical meaning, are defined
implicitly in terms of consistent and independent axioms in a
set-theoretic setting. Thus, modern mathematical probability theory grew
within the branch of mathematics called measure theory.
Kolmogorov called elementary theory of probability “that
part of the theory in which we have to deal with probabilities of only a
finite number of events.” (ibid). A random event is an
element of an event space; the latter is formalized by the
set-theoretic concept of field, introduced by Hausdorff in
Set Theory (1927). A field is a non-empty collection 𝒮 of subsets
of a given non-empty set 𝛤 that has the following properties:
- (a) for every pair of elements 𝐴, 𝐵 of 𝒮, their union, 𝐴 ∪ 𝐵, belongs to 𝒮;
- (b) for every element 𝐴 of 𝒮, its complement with respect to 𝛤, $A^c = \Gamma \setminus A$, is in 𝒮.
In probability theory the set 𝛤 is called the sample space.
To understand the above formalization, consider the simple example of
tossing a die. Let 𝛤 be the set of the six possible outcomes:
𝐸1, 𝐸2, 𝐸3, 𝐸4, 𝐸5, 𝐸6.
The collection 𝒮 of all $2^6 = 64$ subsets of 𝛤,
∅, {𝐸1}, {𝐸2}, …, {𝐸6}, {𝐸1, 𝐸2}, {𝐸1, 𝐸3}, …, {𝐸5, 𝐸6}, {𝐸1, 𝐸2, 𝐸3}, …, {𝐸4, 𝐸5, 𝐸6}, {𝐸1, 𝐸2, 𝐸3, 𝐸4}, …, {𝐸3, 𝐸4, 𝐸5, 𝐸6}, {𝐸1, 𝐸2, 𝐸3, 𝐸4, 𝐸5}, …, {𝐸2, 𝐸3, 𝐸4, 𝐸5, 𝐸6}, {𝐸1, 𝐸2, 𝐸3, 𝐸4, 𝐸5, 𝐸6},
satisfies conditions (a) and (b); 𝒮 is a field. The subsets of 𝛤
represent different possibilities that can be realized in tossing a
single die: the empty set, ∅, is a random event that represents an
impossible happening. The singletons, {𝐸1}, {𝐸2},
… , {𝐸6}, are the elementary events, since any other
random event (except ∅) is a disjunction of these events, expressed by
taking the set-theoretic union of the respective singletons. Finally, 𝛤
= {𝐸1, 𝐸2, 𝐸3, 𝐸4,
𝐸5, 𝐸6}, is an event that represents the
realization of any possibility.
A function from a field 𝒮 to the set of real numbers, ℝ,
𝑝: 𝒮 → ℝ,
is called a probability function on 𝒮, if it satisfies the
following axioms:
- (i) 𝑝(𝐴) ≥ 0, for 𝐴 ∈ 𝒮;
- (ii) 𝑝(𝛤) = 1;
- (iii) 𝑝(𝐴 ∪ 𝐵) = 𝑝(𝐴) + 𝑝(𝐵), for 𝐴 ∩ 𝐵 = ∅.
In the simple example of tossing a die, a probability function 𝑝
would assign a non-negative real number 𝑝(𝐸) to each element 𝐸 of 𝒮, according to axiom
(i). Axiom (ii) requires that the random event which describes any
possible outcome has probability 1, 𝑝(𝛤) = 1. Axiom (iii), commonly
called the finite additivity property, tells us how to calculate
the probability value of any random event from the probability values of
elementary events, for instance,
𝑝({𝐸1, 𝐸2, 𝐸3, 𝐸4}) =
𝑝({𝐸1, 𝐸2}) + 𝑝({𝐸3, 𝐸4}) =
𝑝({𝐸1}) + 𝑝({𝐸2}) +
𝑝({𝐸3}) + 𝑝({𝐸4}).
Notice that there are infinitely many admissible probability
functions on the event space of the tossing of a die and that only one
of them corresponds to a fair die, the
one with $p(\{E_i\}) = \frac{1}{6}$ for 𝑖 = 1, …, 6.
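As a minimal sketch of the die example, the following Python snippet builds the field of all 64 subsets of 𝛤 and checks the three axioms by brute force. The function names `power_set`, `make_probability` and `satisfies_axioms`, and the weights chosen for the loaded die, are illustrative assumptions and do not come from Kolmogorov's text.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for one toss of a die; the labels E1..E6 follow the text.
GAMMA = frozenset({"E1", "E2", "E3", "E4", "E5", "E6"})

def power_set(outcomes):
    """Return every subset of `outcomes` as a frozenset (the field S)."""
    items = sorted(outcomes)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

def make_probability(weights):
    """Build p on the field from weights on the elementary events,
    summing the weights of the outcomes in an event (finite additivity)."""
    def p(event):
        return sum((weights[e] for e in event), Fraction(0))
    return p

def satisfies_axioms(p, field, gamma):
    """Check non-negativity, normalization and finite additivity."""
    non_negative = all(p(a) >= 0 for a in field)
    normalized = p(gamma) == 1
    additive = all(
        p(a | b) == p(a) + p(b)
        for a in field for b in field if not (a & b)
    )
    return non_negative and normalized and additive

FIELD = power_set(GAMMA)
fair = make_probability({e: Fraction(1, 6) for e in GAMMA})
# A hypothetical loaded die: admissible as a probability function, but not fair.
loaded = make_probability({
    "E1": Fraction(1, 2), "E2": Fraction(1, 10), "E3": Fraction(1, 10),
    "E4": Fraction(1, 10), "E5": Fraction(1, 10), "E6": Fraction(1, 10),
})

print(len(FIELD), satisfies_axioms(fair, FIELD, GAMMA), satisfies_axioms(loaded, FIELD, GAMMA))
```

Both `fair` and `loaded` pass the check, which illustrates that the axioms by themselves do not single out the fair die.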
Problems concerning a countably infinite number of random events
require an additional axiom and the formalization of the event space as a
σ-field. A field 𝒮 is a σ-field if and only if it satisfies
the following condition:
- for every infinite sequence of elements of 𝒮, $\{A_n\}_{n \in \mathbb{N}}$, the countably infinite union of these sets, $\bigcup_{n=1}^{\infty} A_n$, belongs to 𝒮.
Every field 𝒮 of finite cardinality is a σ-field since any infinite
sequence in 𝒮 consists of a finite number of different subsets of 𝛤 and
their union is always in 𝒮, according to (a). Yet this may not be the
case if the field is constructed from a countably infinite set 𝛤.
Imagine, for instance, a die with infinitely many faces, where the set 𝛤 of
possible outcomes is:
𝐸1, 𝐸2, 𝐸3, …
Let the collection 𝒮 consist of the subsets 𝐴 of 𝛤 which are either of
finite cardinality or whose complement, $A^c = \Gamma \setminus A$, is of
finite cardinality:
$\mathcal{S} = \{A \subseteq \Gamma : A \text{ is finite or } A^c \text{ is finite}\}$.
It is easy to show that 𝒮 is a field. Yet it is not a σ-field, since
the set
$\bigcup_{n \in \mathbb{N}} \{E_{2n}\}$,
which is the infinite union of the singletons $\{E_{2n}\}$, 𝑛 ∈ ℕ, does not
belong to 𝒮: neither the set of even-numbered outcomes nor its complement is finite.
A probability function on a σ-field 𝒮,
𝑝: 𝒮 → ℝ,
satisfies the following axioms:
i΄. 𝑝(𝐴) ≥ 0, for 𝐴 ∈ 𝒮;
ii΄. 𝑝(𝛤) = 1;
iii΄. $p\left(\bigcup_{n=1}^{\infty} A_n\right) = p(A_1) + \cdots + p(A_n) + \cdots = \sum_{n=1}^{\infty} p(A_n)$, for $A_i \cap A_j = \emptyset$, 𝑖 ≠ 𝑗.
It is evident that axiom (iii΄), commonly called the countable
additivity property of the probability function, extends finite
additivity to the case of a countably infinite family of events.
Originally, Kolmogorov suggested a different axiom, equivalent to
countable additivity, the axiom of continuity (1933: 14):
iii΄΄. For a monotone sequence of events $\{A_n\}_{n \in \mathbb{N}}$, with $A_n \supseteq A_{n+1}$, 𝑛 ≥ 1, such that $\bigcap_{n=1}^{\infty} A_n = \emptyset$, $p(A_n) \to 0$ when 𝑛 → ∞.
In what follows we will see that many interpretations of mathematical
probabilities are actually interpretations of elementary probability
theory, and that they face serious problems when applied to mathematical
probability theory formulated in σ-fields.
A special probability function 𝑝(⦁|𝐴): 𝒮 → ℝ can be defined on 𝒮, if
we are given a probability function 𝑝 on 𝒮 and a random event 𝐴 ∈ 𝒮 such that 𝑝(𝐴) ≠ 0:
$$p(B|A) = \frac{p(B \cap A)}{p(A)}, \quad \text{for } B \in \mathcal{S}.$$
𝑝(⦁|𝐴) determines the conditional probability 𝑝(𝐵|𝐴) of
some event 𝐵 ∈ 𝒮 given an event 𝐴, while 𝑝(𝐵) is
the unconditional probability of 𝐵.
The conditional probability 𝑝(𝐵|𝐴) of any random event 𝐵 ∈ 𝒮 given an event 𝐴 ∈ 𝒮
can be understood as the unconditional probability $p_A(D)$ of an event 𝐷,
determined by a probability function $p_A$ on
a reduced event space $\mathcal{S}_A$ consisting of subsets of the event 𝐴
∈ 𝒮 we conditionalize on; namely, $p_A: \mathcal{S}_A \to \mathbb{R}$,
$p_A(D) = p(B|A)$ for $D = B \cap A$, where $\mathcal{S}_A = \{D : D = B \cap A, \text{ for } B \in \mathcal{S}\}$.
In the tossing of a fair die example, the conditional probability of
any outcome, event 𝐵 = {𝐸𝑖}, 𝑖 = 1, … 6, given that it is an
even number, event 𝐴 = {𝐸2, 𝐸4, 𝐸6}, is
provided by the conditional probability function 𝑝(⦁|𝐴), defined on the
σ-field 𝒮.
Since the die is fair, $p(\{E_i\}) = \frac{1}{6}$ for 𝑖 = 1, …, 6; also, $p(B \cap A) = \frac{1}{6}$ for 𝐵 = {𝐸𝑖}, 𝑖 = 2, 4, 6, while 𝑝(𝐵 ∩ 𝐴) = 0 otherwise; using the
finite additivity axiom,
$$p(A) = p(\{E_2\}) + p(\{E_4\}) + p(\{E_6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2};$$
so, $p(B|A) = \frac{1}{3}$ for 𝐵 = {𝐸𝑖}, 𝑖 = 2, 4, 6, and 𝑝(𝐵|𝐴) = 0 otherwise. Now, consider
the reduced event space $\mathcal{S}_A$ consisting of the subsets of $\{E_2, E_4, E_6\}$. Since the die is fair,
$p_A(\{E_i\}) = \frac{1}{3}$ for 𝑖 = 2, 4, 6 and $p_A(B) = \frac{1}{3} = p(B|A)$ for 𝐵 = {𝐸𝑖}, 𝑖 = 2, 4, 6, while
$p_A(\emptyset) = 0 = p(B|A)$ otherwise.
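The agreement between conditional probability on the full event space and unconditional probability on the reduced event space can be checked with a short sketch. It assumes the fair die of the example, encodes the outcome 𝐸𝑖 as the integer `i`, and uses the hypothetical helper names `p`, `conditional` and `p_reduced`.

```python
from fractions import Fraction

GAMMA = frozenset(range(1, 7))   # outcome i stands for E_i
A = frozenset({2, 4, 6})         # the event "the outcome is even"

def p(event):
    """Fair-die probability: each elementary event has probability 1/6."""
    return Fraction(len(event & GAMMA), 6)

def conditional(b, a):
    """p(B|A) = p(B ∩ A) / p(A), defined only when p(A) != 0."""
    if p(a) == 0:
        raise ValueError("cannot conditionalize on an event of probability 0")
    return p(b & a) / p(a)

def p_reduced(d, a=A):
    """p_A on the reduced event space: uniform over the outcomes in A."""
    return Fraction(len(d & a), len(a))

for i in sorted(GAMMA):
    b = frozenset({i})
    # The two printed values coincide for every elementary event.
    print(i, conditional(b, A), p_reduced(b & A))
```

Exact fractions are used instead of floating-point numbers so that the equality of the two values is not obscured by rounding.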
Kolmogorov’s axiomatic account, the standard mathematical textbook
account of probability theory, explicates the concepts of random event
and event space in terms of set theory. Yet, Boole proposed
… another form under which all questions in the theory of
probabilities may be viewed; and this form consists in substituting for
events the propositions which assert that those events have occurred, or
will occur; and viewing the element of numerical probability as having
reference to the truth of those propositions, not to the occurrence of
the events concerning which they make assertion. (1853:190)
This formulation of probability theory is very common in
philosophical contexts, especially when discussing inductive inference.
It typically concerns elementary probability theory, presented
in the language of sentential logic. Elements of this account can be
found in Appendix 6.a and the reader may also consult (Howson and Urbach
2006: Ch.2). Here, we present just a few propositions of elementary
probability theory as formulated in this setting that will be found
useful in what follows:
- Probability 1 is assigned to tautologies and probability 0 to contradictions. All other sentences have probability values between 0 and 1.
- The probability of the negation of a sentence 𝑎 is 1 − 𝑝(𝑎).
- The probability of the disjunction of two mutually inconsistent sentences 𝑎, 𝑏 is the sum of the probabilities of the sentences: 𝑝(𝑎 ∨ 𝑏) = 𝑝(𝑎) + 𝑝(𝑏).
- The conditional probability of a sentence 𝑎 given the truth of a sentence 𝑏 is: $p(a|b) = \frac{p(a \wedge b)}{p(b)}$, 𝑝(𝑏) ≠ 0.
- Bayes’s Theorem. The posterior probability of a hypothesis ℎ – i.e., the probability of ℎ conditional on evidence 𝑒 – is: $p(h|e) = \frac{p(e|h)\,p(h)}{p(e)}$, with 𝑝(ℎ), 𝑝(𝑒) > 0. Here 𝑝(𝑒|ℎ) is called the likelihood of the hypothesis and expresses the probability of the evidence conditional on the hypothesis; 𝑝(ℎ) is called the prior probability of the hypothesis; and 𝑝(𝑒) is the probability of the evidence.
We conclude this brief introduction to mathematical probability with
the following instructive application of Bayes’s theorem. A factory uses
three engines 𝐴1, 𝐴2, 𝐴3 to produce a
product. The first engine, 𝐴1, produces 1000 items, the
second, 𝐴2, 2000 items and the third, 𝐴3, 3000
items, per day. Of these items, 4%, 2% and 4%, respectively, are faulty.
What is the probability of a faulty product having been produced by a
given engine in a day? Let ℎ𝑖 be the hypothesis: “A product
has been produced by engine 𝐴𝑖 in a day”, for 𝑖 = 1,2,3, and
𝑒: “A faulty product has been produced in a day”. Then the prior probabilities of ℎ𝑖
are $p(h_1) = \frac{1}{6}$, $p(h_2) = \frac{1}{3}$, $p(h_3) = \frac{1}{2}$, and the likelihoods are $p(e|h_1) = 0.04$, $p(e|h_2) = 0.02$, $p(e|h_3) = 0.04$, respectively. Using the theorem of total probability (see Appendix 6.a), we can calculate
$$p(e) = p(h_1)p(e|h_1) + p(h_2)p(e|h_2) + p(h_3)p(e|h_3) = \frac{1}{6} \cdot 0.04 + \frac{1}{3} \cdot 0.02 + \frac{1}{2} \cdot 0.04 = \frac{1}{30}.$$
By applying Bayes’s theorem we obtain the posterior probability
for each hypothesis, $p(h_1|e) = 0.20$, $p(h_2|e) = 0.20$, $p(h_3|e) = 0.60$, that is, the
probability that a faulty product has been produced by a given
engine in a day.
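The arithmetic of the factory example can be reproduced with exact fractions. The sketch below is only an illustration; the dictionary names `output`, `faulty_rate`, `prior` and `posterior` are assumptions introduced here, while the figures are those given in the text.

```python
from fractions import Fraction

# Daily output per engine and the fraction of faulty items, as in the text.
output = {"A1": 1000, "A2": 2000, "A3": 3000}
faulty_rate = {
    "A1": Fraction(4, 100),   # likelihood p(e|h1)
    "A2": Fraction(2, 100),   # likelihood p(e|h2)
    "A3": Fraction(4, 100),   # likelihood p(e|h3)
}

# Prior p(h_i): the share of the day's items produced by engine A_i.
total = sum(output.values())
prior = {k: Fraction(n, total) for k, n in output.items()}

# Theorem of total probability: p(e) = sum_i p(h_i) * p(e|h_i).
p_e = sum(prior[k] * faulty_rate[k] for k in output)

# Bayes's theorem: p(h_i|e) = p(e|h_i) * p(h_i) / p(e).
posterior = {k: faulty_rate[k] * prior[k] / p_e for k in output}

print(p_e, posterior)
```

Note that the third engine ends up with the highest posterior probability even though its fault rate equals that of the first, because it has the largest prior.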
Interpretations of Probabilities
Like any other part of mathematics, probability theory does not have on
its own any empirical meaning and cannot be applied to games of chance,
to the study of physical or biological systems, to risk evaluation or
insurance policies and, in general, to empirical science and practical
issues, unless we provide an interpretation of its axioms and theorems.
This is what Wesley Salmon (1966: 63) dubbed the philosophical
problem of probability:
It is the problem of finding one or more interpretations of the
probability calculus that yield a concept of probability, or several
concepts of probability, which do justice to the important applications
of probability in empirical science and in practical affairs. Such
interpretations whether one or several would provide an explication of
the familiar notion of probability.
Salmon suggested three criteria that an interpretation of probability
should ideally satisfy. The first one is called admissibility,
and it requires that the probability concepts satisfy the mathematical
relations of the calculus of probability, i.e., the axioms of
Kolmogorov. This is a minimal requirement for the concept of probability
to be an interpretation of mathematical probability but not a trivial
one, since countable additivity may be a problem for some
interpretations of probability (see 2.a.i and 2.b), while in others,
Kolmogorov’s axioms are supposed to follow naturally
from the practice of gambling (see 5.a and 5.b). The second
criterion is ascertainability. It requires that there should be
a method by which, in principle at least, we can ascertain values of
probabilities. If it is impossible to find out what the values of
probability are, then the concept of probability is useless. Again, not
all suggested interpretations satisfy this requirement. According to
Salmon, Reichenbach’s frequency interpretation fails to meet this
requirement (1966: 89ff.). Finally, applicability is the third
criterion: a concept of probability should be applicable, i.e., it
should have a practical predictive significance. The force of this
criterion is manifested in everyday life, in science as well as in the
logical structure of science. The concept of scientific confirmation
provides a venerable example of application of probability theory.
Interpretations of probability theory may be classified under two
general families: inductive and physical probability.
The classical, the logical and the subjective interpretations of
probability are deemed inductive, while the frequency and the propensity
interpretations yield physical probabilities. To illustrate the
difference between inductive and physical probability, an example may be
instructive (Maher, 2006). Think of a coin that you know is either
two-headed or two-tailed, but you have no information about which it is.
What is the probability that it would land heads, if
tossed? One possible answer would be that the probability is
1/2, since there are two possibilities, and we have no evidence which one is going to be
realized. Another answer would say that the probability is either 0, if the coin is
two-tailed, or 1, if two-headed, but we do not know which. Maher suggests that if
‘1/2’ occurs as a natural answer, then we understand ‘probability’ in the sense of inductive
probability, while the sense in which ‘0 or 1’ occurs as a natural answer is
physical probability. What is the difference between the two
meanings? Inductive probability is relative to available evidence, and
it does not depend on how the unknown part of the world is, i.e., on
unknown facts of the matter. Thus, if in this example we come to know
that the coin tossed has a head on one side, we should revise the
probability estimate in the light of new evidence and claim that now the
inductive probability is 1. On the other hand, physical probability is
not relative to evidence, and it depends on facts that may be unknown.
This is why the further piece of information we entertained does not
alter the physical probability (it is still ‘0 or 1’).
What is Probability?
The Classical Interpretation
Pierre Simon Laplace proposed what has come to be known as the
classical interpretation in his work, The Analytical Theory
of Probabilities (1812), and in the much shorter A
Philosophical Essay on Probabilities (1814), a book based on a
lecture on probabilities he delivered at the École Normale in
1795. His deterministic view of the universe, Laplacian determinism, is
legendary. Not only did he believe that every aspect of the world, any
event that takes place in the universe, is governed by the principle
of sufficient reason, “…the evident principle that a thing cannot occur
without a cause which produces it” (1814: 3), but also that “[w]e
ought … to regard the present state of the universe as the effect of its
anterior state and as the cause of the one which is to follow.” (1814:
4). Moreover, he claimed that the universe is knowable, in principle,
and that a supreme intelligence that:
could comprehend all the forces by which nature is animated and the
respective situation of the beings who compose it—an intelligence
sufficiently vast to submit these data to analysis—it would embrace in
the same formula the movements of the greatest bodies of the universe and those of the
lightest atom. (ibid)
However, human intelligence is weak. It cannot provide an adequate
unified picture of the world and subsume the macroscopic and microscopic
realm under the province of a single formula. Nor can it give the causes
of all events that occur and render them predictable. Thus, ignorance
emerges as an expression of human limitation. Laplace stressed that:
[t]he curve described by a simple molecule of air or vapor is
regulated in a manner just as certain as the planetary orbits; the only
difference between them is that which comes from our ignorance. (1814:
6)
Due to ignorance of the true causes, he claimed, people believe in
final causation, or they make chance (‘hazard’ in Laplacian terminology)
an objective feature of the world. “[B]ut these imaginary causes”
explains Laplace, “have gradually receded with the widening bounds of
knowledge and disappear entirely before sound philosophy, which sees in
them only the expression of our ignorance of the true causes.” (1814:
3)
i. Probability as a Measure of Ignorance
In this context, Laplace interpreted probability as a measure of our
ignorance, making it dependent on the evidence one is aware of or on the lack
of such evidence:
Probability is relative, in part to this ignorance, in part to our
knowledge. We know that of three or a greater number of events a single
one ought to occur; but nothing induces us to believe that one of them
will occur rather than the others. In this state of indecision, it is
impossible for us to announce their occurrence with certainty. It is,
however, probable that one of these events, chosen at will, will not
occur because we see several cases equally possible which exclude its
occurrence, while only a single one favors it. (1814: 6)
The measure of probability of an event is determined by considering
equally probable cases that either favor or exclude its
occurrence and the concept of probability is reduced to the notion of
equally probable events:
The theory of chance consists in reducing all the events of the same
kind to a certain number of cases equally possible, that is to say, to
such as we may be equally undecided about in regard to their existence,
and in determining the number of cases favorable to the event whose
probability is sought. The ratio of this number to that of all the cases
possible is the measure of this probability, which is thus simply a
fraction whose numerator is the number of favorable cases and whose
denominator is the number of all the cases possible. (1814: 6-7)
Laplace claims that the probability of an event is the ratio of the
number of favorable cases to that of all possible cases. And this
principle of the calculus of probability has for Laplace the
status of a definition:
First Principle.—The first of these principles is the definition
itself of probability, which, as has been seen, is the ratio of the
number of favorable cases to that of all the cases possible. (1814:
11)
In the jargon of the mathematical theory of probability, one may
consider a partition $\{A_k\}_{k=1,\dots,n}$ of the sample space 𝛤 into events of 𝒮, i.e., a
family of mutually exclusive subsets
exhaustive of the sample space, $A_i \cap A_j = \emptyset$ for 𝑖 ≠ 𝑗 and
$\bigcup_{k=1}^{n} A_k = \Gamma$, and assume equal
probability for all random events $A_k$, $p(A_i) = p(A_j)$, for every 1 ≤ 𝑖, 𝑗 ≤ 𝑛.
Now, for every event 𝐸 that is decomposable into any sub-family
$\{A_{k_l}\}_{l=1,\dots,m} \subseteq \{A_k\}_{k=1,\dots,n}$, $E = \bigcup_{l=1}^{m} A_{k_l}$,
the probability of 𝐸 is
$$p(E) = \frac{m}{n} = \frac{\text{number of favorable cases for } E}{\text{number of possible cases}}.$$
We can easily show that a function defined in this way satisfies the
axioms of elementary probability theory: 𝑝(𝐴) ≥ 0, for 𝐴 ∈ 𝒮; 𝑝(𝛤) = 1;
𝑝(𝐴 ∪ 𝐵) = 𝑝(𝐴) +
𝑝(𝐵), for 𝐴 ∩ 𝐵 = ∅. Hence, Laplace’s first principle suggests an
admissible, in Salmon’s sense, interpretation of the elementary
theory.
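Laplace's first principle can be sketched as a simple counting rule. The example of two distinguishable dice with 36 equally possible cases is a hypothetical illustration that does not appear in Laplace's text, and the name `classical_probability` is likewise an assumption.

```python
from fractions import Fraction
from itertools import product

# Hypothetical illustration: two distinguishable dice, 36 equally possible cases.
CASES = frozenset(product(range(1, 7), repeat=2))

def classical_probability(event, cases=CASES):
    """Ratio of the number of favorable cases to the number of possible cases."""
    favorable = event & cases
    return Fraction(len(favorable), len(cases))

sum_is_seven = frozenset(c for c in CASES if sum(c) == 7)
doubles = frozenset(c for c in CASES if c[0] == c[1])

# The two events are mutually exclusive, so their probabilities add up.
print(classical_probability(sum_is_seven),
      classical_probability(doubles),
      classical_probability(sum_is_seven | doubles))
```

Because the rule only ever returns a ratio of two integers, every value it produces is a rational number.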
Countable additivity (axiom iii΄), on the other hand, is not
satisfied when the sample space is divided into countably infinitely many equally probable cases. To show
this, consider an infinite partition
$\{A_k\}_{k=1,2,\dots}$ and assign equal probability to all
$A_k$s, $p(A_k) = c \geq 0$. Then by employing axioms i΄ and
ii΄ along with the equal probability condition and countable additivity
(axiom iii΄), we are led to the following absurdity:
$$1 = p(\Gamma) = p\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} p(A_k) = 0 \text{ or } \infty,$$
where the third equality holds by countable additivity, and the sum is 0 if 𝑐 = 0 and ∞ if 𝑐 > 0.
Hence, the classical interpretation is not an admissible interpretation
of the mathematical theory of probability in general. It singles out
only certain models of probability theory (elementary theory) in which
the cardinality of the event space is finite.
Another criticism raised against the classical interpretation (Hajek,
2019) is related to its applicability. The classical interpretation of
probability allows only rational-valued probability functions, defined
in terms of a ratio of integers. However, in many branches of science,
theories (for instance, quantum mechanics) assign irrational
probability values to events. In these cases, one cannot interpret a probability
value as the ratio of the number of favorable cases to the total
number of cases.
As we have already discussed, in the definition of probability,
Laplace presupposes that all cases are equally probable. This fact gives
rise to a well-known criticism, namely, that of circularity of the
definition of probability: if the relation of equiprobability of two
events depends conceptually on what probability is, then the definition
of probability is circular. To avoid this criticism, the Soviet
mathematician Boris Gnedenko, a student of Kolmogorov, considered the
notion of equal probability a primitive notion “which is …basic and is
not subject to a formal definition.” (1978: 23)
Laplace, in several places, wrote about “equally possible” cases as
if ‘possibility’ and ‘probability’ were terms that could be used
interchangeably. To assume that is to commit a category mistake, as
Hajek has pointed out, since possibilities do not come in degrees.
Nevertheless, as we shall see in section 3.a.i, the connection between
possibility and probability can be established in terms of Keynes’s
principle of indifference. In the same section we will discuss the
paradoxes of indifference that also undermine Laplace’s idea of
probability.
Probabilities as Frequencies
The frequency interpretation of probability can be traced back to the
work of R. L. Ellis and John Venn, in the middle of the nineteenth century,
and it has been described as “a ‘British Empiricist’ reaction to the
‘Continental rationalism’ of Laplace” (Gillies 2000: 88). In Ellis’s
article “On the Foundations of the Theory of Probability” (1842) we
identify the rudiments of this interpretation:
If the probability of a given event be correctly determined, the
event will, on a long run of trials, tend to recur with frequency
proportional to this probability.
Venn presented his own account, a few years later, in 1866, in
The Logic of Chance:
we may define the probability or chance … of the event happening in
that particular way as the numerical fraction which represents [the proportion] between
the two different classes in the long run. (the quote is from the
3rd edition, 1888: 163)
The real boost, however, for the frequency interpretation was
given in the early twentieth century, with the advent of Logical
Empiricism, by Richard von Mises, in Vienna, and Hans Reichenbach, in
Berlin. The first, in his work Probability, Statistics and
Truth, published in German in 1928, provides a thorough
mathematical and operationalist account of probability theory as
an empirical science, like empirical geometry and the science of
mechanics. The account was presented more rigorously in von Mises’s
posthumously published work, entitled Mathematical Theory of
Probability and Statistics (1964). Reichenbach presented his mature
views on probability in the work The Theory of Probability: an
inquiry into the logical and mathematical foundations of the calculus of
probability originally published in Turkey, in 1935. In this work,
Reichenbach attempted to establish a probability logic, based on the
relation of probability implication, which is governed by four
axioms.
Relative frequencies of sub-series of events in a larger series are
interpreted as probabilities and they are shown to satisfy the axioms of
probability logic. However, Reichenbach’s milestone contribution
concerns the connection between probability theory and the problem of
induction. In this section, we will focus, mainly, on the frequency
interpretation of probability as suggested by von Mises while for
Reichenbach’s views the reader may consult our IEP entry on The Problem
of Induction (Psillos and Stergiou, 2022).
Von Mises claimed that the subject matter of probability theory is
repetitive events – “same event that repeats itself again and again”
– and mass phenomena – “a great number of uniform elements …
[occurring] at the same time” (1928: 11).
Probability, according to von Mises, is defined in terms of a
collective, a concept which “denotes a sequence of uniform
events or processes which differ by certain observable attributes, say
colors, numbers or anything else” (1928: 12). For example, take a plant
coming from a given seed as a single instance of a collective which
consists of a large number of plants coming from the given type of seed.
All members of the collective differ from each other with respect to
some attribute, say the color of the flower or the height of the plant.
Similarly, in the case of tossing a die the collective consists of
the long series of tosses and the attribute which distinguishes the
instances is the number that appears on the face of the die. The
mathematical representation of such finite empirical collectives is given in terms
of their idealized counterpart, the infinite ordered sequences of
events, which exhibit attributes that are subsets of the attribute
space of the collective (which is no different from what we have
called sample space).
Yet, to be an empirical collective, a sequence of events should
satisfy two empirically well-confirmed laws that dictate the
mathematical axioms of probability theory in the ideal case of the
infinite sequences. The first law, dubbed by Keynes (1921: 336) the Law
of Stability of Statistical Frequencies, requires that:
the relative frequencies of certain attributes become more and more
stable as the number of observations is increased. (von Mises 1928:
12)
Thus, if 𝛺 is the attribute space, 𝐴 ⊆ 𝛺 is an attribute and 𝑚(𝐴) is
the number of manifestations of 𝐴 in the first 𝑛 members of the
collective, the relative frequency,
$\frac{m(A)}{n}$, tends to a fixed number as the number 𝑛 of
observations increases. According to
von Mises, the Law of Stability of Statistical Frequencies is
confirmed by
observations in all games of chance (dice, roulette, lotteries,
etc.), in data from insurance companies, in biological statistics, and
so on (von Mises 1928: 16-21). This empirical law gives rise to the
axiom of convergence for infinite sequences of events:
for an arbitrary attribute 𝐴 of a collective 𝐶, $\lim_{n \to \infty} \frac{m(A)}{n}$ exists.
This law can be traced back to the views of von Mises’s predecessors.
For instance, Venn thought that probability is about “a large number or
succession of objects, or, as [we] shall term it, series of them”
(1888: 5). This series should be ‘indefinitely numerous’ and it should
“combine[s] individual irregularity with aggregate regularity” (1888:
4). All series, for Venn, initially exhibit irregularity, if one
considers only their first elements, while, subsequently, a regularity
may be attested. This regularity, however, can be unstable and it can be
destroyed in the long run, in the “ultimate stage” of the series.
According to Venn, a series is of the fixed type if it
preserves the uniformity while it is of the fluctuating type if
“the uniformity is found at last to fluctuate.” (1888: 17). Probability
is defined only for series of the fixed type; if a series is of the
fluctuating type, it is not the subject of science (1888: 163). But what
does it mean, in terms of relative frequencies, that a series is of the
fixed type? “The one [fixed type] tends without any irregular variation
towards a fixed numerical proportion in its uniformity” (ibid).
In more detail:
[a]s we keep on taking more terms of the series we shall find the
proportion still fluctuating a little, but its fluctuations will grow
less. The proportion, in fact, will gradually approach towards some
fixed numerical value, what mathematicians term its limit. (1888:
164)
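The stabilization of relative frequencies described above can be illustrated numerically. The sketch below simulates tosses of a fair die with Python's pseudo-random generator as a hypothetical, finite stand-in for a collective; the function name `relative_frequencies` and the seed are assumptions. A finite simulation can only suggest convergence and cannot establish that a limit exists.

```python
import random

def relative_frequencies(sequence, attribute):
    """Return m(A)/n for n = 1, ..., len(sequence), where m(A) counts how many
    of the first n members exhibit an outcome belonging to `attribute`."""
    freqs, count = [], 0
    for n, outcome in enumerate(sequence, start=1):
        if outcome in attribute:
            count += 1
        freqs.append(count / n)
    return freqs

# Hypothetical stand-in for a collective: simulated tosses of a fair die.
rng = random.Random(0)   # fixed seed, used only to make the run reproducible
tosses = [rng.randint(1, 6) for _ in range(100_000)]
freqs = relative_frequencies(tosses, attribute={6})

# Relative frequency of the attribute "six" after increasingly many tosses.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

In a typical run the early values fluctuate noticeably, while the later ones settle close to 1/6, in line with Venn's description of fluctuations that "grow less".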
The second presupposition for a sequence to be a collective is an
original contribution of von Mises. Apart from the existence of limiting
relative frequencies in infinite sequences, he demanded the sequence to
be random in the sense that there is no rule-governed selection
of a subsequence of the original sequence that would yield a different
relative frequency of the attribute in question from the one obtained in
the original sequence. In von Mises’s own words (1957: 29):
…these fixed limits are not affected by place selection. That is to
say, if we calculate the relative frequency of some attribute not in the
original sequence, but in a partial set, selected according to some
fixed rule, then we require that the relative frequency so calculated
should tend to the same limit as it does in the original set… The
fulfilment of the condition…will be [described] as the Principle of Randomness or
the Principle of Impossibility of a Gambling System.
In a more detailed account of how the subsequence is obtained by
place selection, von Mises (1964: 9) explained that in inspecting all
elements of the original sequence, the decision to keep the nth element
in or to reject it from the subsequence depends either on the ordinal
number 𝑛 of this element or on the attributes manifested in the (𝑛 − 1)
preceding elements. This decision does not depend on the attribute
exhibited by the nth or by any subsequent element.
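A place selection of this kind can be sketched as a function whose decision about the 𝑛th element sees only 𝑛 and the preceding elements. The strictly alternating sequence below is a hypothetical example of a sequence that has a stable relative frequency and yet fails the randomness requirement; the names `place_selection` and `relative_frequency` are assumptions.

```python
def relative_frequency(sequence, attribute):
    """Relative frequency of `attribute` in a finite sequence."""
    return sum(1 for x in sequence if x in attribute) / len(sequence)

def place_selection(sequence, keep):
    """Select a subsequence. `keep(n, previous)` decides whether to retain the
    n-th element (1-based) using only n and the n-1 preceding elements,
    never the n-th element itself or any later one."""
    selected = []
    for n, x in enumerate(sequence, start=1):
        if keep(n, sequence[: n - 1]):
            selected.append(x)
    return selected

# Hypothetical sequence: tails and heads strictly alternating (T, H, T, H, ...).
alternating = ["T", "H"] * 1000

# Rule depending only on the ordinal number n: keep the even-numbered places.
even_places = place_selection(alternating, lambda n, prev: n % 2 == 0)

# Rule depending only on earlier attributes: keep an element if its predecessor was tails.
after_tails = place_selection(
    alternating, lambda n, prev: bool(prev) and prev[-1] == "T"
)

print(relative_frequency(alternating, {"H"}),
      relative_frequency(even_places, {"H"}),
      relative_frequency(after_tails, {"H"}))
```

Both rules change the relative frequency of heads from one half to one, so the alternating sequence admits a gambling system and would not count as a collective in von Mises's sense.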
Von Mises suggested that we should understand the Principle of
Impossibility of a Gambling System by analogy to the Principle of
Conservation of Energy. As the energy principle is well-confirmed by
empirical data about physical systems, so the principle of randomness is
well-confirmed for random sequences manifested in games of chance and in
data from insurance companies. Moreover, as the principle of
conservation of energy prohibits the construction of a perpetual motion
machine, the principle of impossibility of a gambling system prohibits
the realization of a rule-governed strategy in games of chance that
would yield perpetual wealth to the gambler:
We can characterize these two principles, as well as all far-reaching
laws of nature, by saying that they are restrictions which we impose on
the basis of our previous experience, upon our expectation of the
further course of natural events. (1928: 26)
Having defined the concept of a collective that is appropriate for
the theory of probability in terms of the two aforementioned laws, we
may, now, define the ‘probability of an attribute 𝐴 within a given
collective 𝐶’ in terms of the limiting value of relative frequency of
the given attribute in the collective:
$$p_C(A) = \lim_{n \to \infty} \frac{m(A)}{n}.$$
Thus defined, probabilities are always conditional to a given
collective. Does, however, this definition provide an admissible concept
of probability in compliance with Kolmogorov’s axioms?
It is straightforward that axioms (i) and (ii) are satisfied. Namely,
since for every $n \in \mathbb{N}$, $0 \le \frac{m(A)}{n} \le 1$, it
follows that $0 \le p_C(A) \le 1$. And if the attribute examined
consists in the entire attribute space $\Omega$, then it will be
satisfied by any member of the sequence,
$\frac{m(\Omega)}{n} = \frac{n}{n} = 1$, so, taking limits,
$p_C(\Omega) = 1$.
Regarding the axiom of finite additivity, (iii), we have that for any
pair of mutually exclusive attributes, 𝐴, 𝐵, the number of times that
either 𝐴 or 𝐵 occurs is the sum of the occurrences of 𝐴 and 𝐵, since the
two cannot occur together:
$$m(A \cup B) = m(A) + m(B) \;\Rightarrow\; \frac{m(A \cup B)}{n} = \frac{m(A)}{n} + \frac{m(B)}{n}.$$
By taking limits:
$$p_C(A \cup B) = p_C(A) + p_C(B).$$
However, von Mises’ concept of probability does not satisfy the axiom
of countable additivity (axiom iii΄). To show that, consider the
following infinite attribute space $\Omega = \{A_1, \dots, A_k, \dots\}$
and assume that each attribute $A_k$ appears only once in the course of
an infinite sequence of repetitions of the experiment; then
$p_C(A_k) = 0$ for every $k \in \mathbb{N}$. If the countable additivity
condition were true, then
$p_C(\Omega) = p_C(A_1) + \dots + p_C(A_k) + \dots = 0$. However,
this is absurd, since it violates the normalization condition
$p_C(\Omega) = 1$.
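The counterexample can be checked on finite prefixes. The sketch below assumes, for illustration only, that the sequence of outcomes is simply 1, 2, 3, ..., so that attribute A_k is exhibited by the k-th element and by no other; the function names are hypothetical.

```python
from fractions import Fraction

def freq_of_attribute(k, n):
    """m(A_k)/n in the first n terms of the sequence 1, 2, 3, ...

    Attribute A_k occurs exactly once (at position k), so the count is
    1 if the prefix already contains position k and 0 otherwise.
    """
    count = 1 if k <= n else 0
    return Fraction(count, n)

def freq_of_whole_space(n):
    """m(Omega)/n: every element exhibits some attribute of Omega."""
    return Fraction(n, n)

for n in (10, 1_000, 100_000):
    # Each single attribute's frequency shrinks like 1/n ...
    single = freq_of_attribute(3, n)
    # ... while, for fixed n, the frequencies still add up to 1.
    total = sum(freq_of_attribute(k, n) for k in range(1, n + 1))
    print(n, single, total, freq_of_whole_space(n))
```

For every fixed n the frequencies sum to 1, but each individual frequency tends to 0 as n grows. Taking the limit term by term and then summing gives 0, whereas summing first and then taking the limit gives 1, which is why countable additivity fails here.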
To provide a probability theory that satisfies all Kolmogorov axioms,
von Mises restricted further the scope of a collective. In addition to
the Law of Stability of Statistical Frequencies and the Principle of
Randomness, in his Mathematical Theory of Probability he
required a third, independent, condition that a collective should
satisfy (von Mises 1964: 12). Namely, that for a denumerable attribute
space $\Omega = \{A_1, \dots, A_k, \dots\}$:
$$\sum_{k=1}^{\infty} \lim_{n \to \infty} \frac{m(A_k)}{n} = 1.$$
To define conditional probability, we may begin with a given
collective $C$ and pick out all elements that exhibit some attribute
$B$. Assuming that they form a new collective $C_B$, we calculate the
limiting relative frequency of $A$ in $C_B$:
$$p_{C_B}(A) = \lim_{n \to \infty} \frac{m(A)}{n} \quad \text{in } C_B.$$
The conditional probability of $A$ given $B$ in the collective $C$ is
then:
$$p_C(A|B) = p_{C_B}(A).$$
In case attribute $B$ is manifested only a finite number of times in
$C$, then $C_B$ is a set of finite cardinality; hence, it does not
qualify as a collective and conditional probability is not defined. To
avoid this ill-defined case, Gillies suggested that we require that
$p_C(B) \neq 0$. Given this condition, he shows that all prerequisites
for $C_B$ to be a collective are satisfied and conditional
probability can be defined (Gillies 2000: 112).
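The construction of the sub-collective can be mimicked on finite data. The following sketch assumes a hypothetical collective of simulated fair die rolls, with attribute B read as "the outcome is even" and attribute A as "the outcome is six"; these choices are illustrative and not taken from von Mises or Gillies.

```python
import random

# Assumption: a seeded finite run of die rolls stands in for a collective C.
rng = random.Random(1)
rolls = [rng.randint(1, 6) for _ in range(120_000)]

def is_even(x):   # attribute B (hypothetical choice)
    return x % 2 == 0

def is_six(x):    # attribute A (hypothetical choice)
    return x == 6

def frequency(seq, attribute):
    """Relative frequency m(attribute)/n in a finite sequence."""
    return sum(1 for x in seq if attribute(x)) / len(seq)

# C_B: keep only the elements of C that exhibit B.
sub_collective = [x for x in rolls if is_even(x)]

# Relative frequency of A inside C_B, the finite analogue of p_C(A|B).
freq_A_in_CB = frequency(sub_collective, is_six)

# The familiar ratio, computed in the original sequence.
freq_A_and_B = frequency(rolls, lambda x: is_six(x) and is_even(x))
ratio = freq_A_and_B / frequency(rolls, is_even)

print(freq_A_in_CB, ratio)
```

On finite data the two numbers agree up to floating-point rounding, since both equal the count of sixes divided by the count of even outcomes. The ratio is defined only when B occurs at all, which is the finite counterpart of the requirement that the probability of B be non-zero.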
Von Mises’s account of probability has been criticized as being too
narrow with respect to the common use of the term ‘probability’: there
are important situations in which we apply the term although we cannot
define a collective. Take for instance, von Mises’s question “Is there a
probability of Germany being at some time in the future involved in a
war with Liberia?” (1928: 9) Since we do not refer to repetitive or mass
events, we cannot define a collective and, in the frequency
interpretation, the question is meaningless, since ‘probability’ is
meaningfully used only with reference to a collective. Hence, many
common uses of ‘probability’ in ordinary language become illegitimate if
we think in terms of the empirical science of probability as delineated
by von Mises.
Some may think that this is not an objection at all: von Mises
explicates probability in a way that legitimizes only some uses
of the term as it occurs in ordinary language and in this way he deals
with the problem of single-case probabilities that burdens the
frequency interpretation: associating probability with (limiting)
relative frequency
yields trivial certainty (probability equal to 1) for all unrepeated
or unrepeatable events. The solution offered by von Mises is to exclude
definitionally such events from the domain of application of the concept
of probability.
Of course, there are alternative ways to understand probability, not
as relative frequency, that render its application to unrepeated or
unrepeatable events legitimate. Take, for instance, the subjectivist
account (see section 5), which considers probability as a measure of the
degree of belief. On this conception, the question acquires meaning: it
asks for the degree of belief an agent would assign to that proposition.
In addition,
to be on the safe side and avoid paradoxes, one may request coherence
from the agent, i.e., that their degrees of belief satisfy Kolmogorov’s
axioms of probability.
A criticism raised against von Mises’s account by de Finetti
underlines that the theory fails to deal with the role of probability in
induction and confirmation:
If an essential philosophical value is attributed to probability
theory, it can only be by assigning to it the task of deepening,
explaining or justifying the reasoning by induction. This is not done by
von Mises… (De Finetti 1936)
In response to investigations on probability that aim to produce a
theory of induction, von Mises claims that probability theory itself is
an inductive science and it would be circular to try to justify
inductive methodology by means of a science that applies it, or to use
that science to provide any degree of confirmation for any other branch
of science:
According to the basic viewpoint of this book, the theory of
probability in its application to reality is itself an inductive
science; its results and formulas cannot serve to found the inductive
process as such, much less to provide numerical values for the
plausibility of any other branch of inductive science, say the general
theory of relativity. (1928: vii)
However, this does not mean that the frequency interpretation in
general has nothing to contribute to the problem of induction. As we
have examined elsewhere [IEP entry on The Problem of Induction (Psillos
and Stergiou, 2022)], Reichenbach thought that the frequency
interpretation of probability theory provides a new context for
understanding the problem of induction.
Are Propensities Probabilities?
The propensity interpretations are a family of accounts of physical
probability. They aim to provide an account of objective chance in terms
of probability theory.
Originally, this interpretation was developed by Karl Popper
(1959), but later David Miller, James Fetzer, Donald Gillies and others
developed their own accounts (see Gillies 2000). Paul Humphreys (1985)
describes propensities as:
[I]ndeterministic dispositions possessed by systems in a particular
environment, exemplified perhaps by such quite different phenomena as a
radioactive atom’s propensity to decay and my neighbor’s propensity to
shout at his wife on hot summer days.
The problems that guided Popper to abandon the frequency
interpretation of probability and to develop this new account had to do,
on the one hand, with the interpretation of quantum theory and, on the
other, with objective single-case probabilities.
To deal with the problem of single-case probabilities, Popper
suggested that probabilities should be associated not with sequences of
events but with the generating conditions of these sequences i.e., “the
set of conditions whose repeated realisation produces the elements of
the sequence” (1959). He claimed that “probability may … be said to be
a property of the generating conditions” (ibid). This
was not just an analysis of the meaning of the term ‘probability’.
Popper claimed to have proposed, “a new physical hypothesis (or
perhaps a metaphysical hypothesis) analogous to the hypothesis of
Newtonian forces. It is the hypothesis that every experimental
arrangement (and therefore every state of the system) generates physical
propensities which can be tested by frequencies.” (ibid).
The propensity interpretation is supposed to avoid a number of
problems faced by the frequency interpretation; for instance, it avoids
the problem of inferring probabilities in the limit. But, especially in
Popper’s version, it faces the problem of specifying the conditions on
the basis of which propensities are calculated – the ascertainability
requirement fails. Given that an event can be part of widely different
conditions, its propensity will vary according to the conditions. Does
it then make sense to talk about the true objective singular probability
of an event?
Even if this problem is not taken seriously (after all, the advocate
of propensities may well claim that propensities are the sort of thing
that varies with the conditions), it has been argued on other grounds
that probabilities cannot be identified with propensities. Namely, the
so-called inverse probabilities, although they are
mathematically well-defined, remain uninterpreted since it does not make
sense to talk about inverse propensities. Suppose, for instance, that a
factory produces red socks and blue socks and uses two machines (Red and
Blue) one for each color.
Suppose also that some socks are faulty and that each machine has a
definite probability of producing a faulty sock, say one out of ten socks
produced by the Red machine is faulty. We can meaningfully say that the
Red machine has a one-tenth propensity to produce faulty socks. But we
can also ask the question: given an arbitrary faulty sock, what is the
probability that it has been produced by the Red machine? From a
mathematical point of view, the question is well-posed and has a
definite answer [for a detailed computation of probabilities in a
similar example, see section 1a above]. But we cannot make sense of this
answer under the propensity interpretation. We cannot meaningfully ask:
what is the propensity of an arbitrary faulty sock to have been produced
by the Red machine? Propensities, as dispositions, possess the asymmetry
of the cause-and-effect relation that cannot be adequately expressed in
terms of the symmetric conditional probabilities. Thus, there are
well-defined mathematical probabilities that cannot be interpreted as
propensities (see Humphreys 1985).
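The inverse probability in the sock example can be computed with Bayes’s theorem once the missing figures are supplied. The text gives only the Red machine’s fault rate (one in ten); the Blue machine’s fault rate and the share of socks made by each machine are assumptions introduced here purely for illustration.

```python
from fractions import Fraction

# From the text: one out of ten Red-machine socks is faulty.
p_faulty_given_red = Fraction(1, 10)

# Assumptions (not in the text): Blue's fault rate and the output shares.
p_faulty_given_blue = Fraction(1, 20)
p_red = Fraction(1, 2)
p_blue = 1 - p_red

# Total probability that an arbitrary sock is faulty.
p_faulty = p_faulty_given_red * p_red + p_faulty_given_blue * p_blue

# Bayes's theorem: probability that a faulty sock came from the Red machine.
p_red_given_faulty = p_faulty_given_red * p_red / p_faulty

print(p_faulty, p_red_given_faulty)
```

Under these assumed figures the calculation is mathematically unproblematic. The difficulty raised by Humphreys concerns the interpretation of the result: the faulty sock has no propensity to have been produced by one machine rather than the other.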
Is this really a problem for the propensity interpretation? We would
say ‘yes’ if a probability interpretation aspires to conform with
Kolmogorov’s axioms (admissibility requirement) and, also,
claims to provide a complete interpretation of probability
calculus. But this condition is not universally accepted. One may
suggest that probability interpretations are partial
interpretations of the probability calculus or even take the more
radical position of abandoning the criterion of admissibility, as
Humphreys suggested.
Probability as the Logic of Induction
Keynes and The Logical Concept of Probability
John Maynard Keynes presented his account of probability in the work
titled A Treatise on Probability (1921). He attempted
to provide a logical foundation for probability based on the concept of
partial entailment. In deductive logic, entailment, considered
semantically, expresses the validity of an inference and partial
entailment is meant to be its extension to inductive logic. From a
semantical point of view, partial entailment expresses a probability
relation between the conclusion of an inference and its premises, i.e.,
that the conclusion is rendered likely true (or more likely to
be true) given the truth of the premises. Here is how Keynes (1921: 52)
understood this extension and its relation to probability:
Inasmuch as it is always assumed that we can sometimes judge directly
that a conclusion follows from a premiss, it is no great extension of
this assumption to suppose that we can sometimes recognise that a
conclusion partially follows from, or stands in a relation of
probability to a premiss.
And:
We are claiming, in fact, to cognise correctly a logical connection
between one set of propositions which we call our evidence and which we
suppose ourselves to know, and another set which we call our
conclusions, and to which we attach more or less weight according to the
grounds supplied by the first. It is not straining the use of words to
speak of this as the relation of probability. (Keynes 1921: 5–6)
Thus, partial entailment rests on an analogy with deductive (full)
entailment and both concepts express logical relations, the former of
inductive and the latter of deductive logic. Here is an example: the
conjunction (p and q) entails deductively p; by analogy, it is said
that, though proposition p does not (deductively) entail the conjunction
(p and q), it entails it partially, since it entails one of its
conjuncts (for instance, p). The difference between the two kinds of
entailment
stems from the fact that validity of an inference, expressed in
deductive entailment, is a yes-or-no question, while the probability
relation, expressed in partial entailment, comes in degrees. Keynes
(1921: 4) considered probability to be the degree of rational
belief that a future occurrence of an event under specified
circumstances is partially entailed from past evidence for the
occurrence of similar events under similar circumstances:
Let our premises consist of any set of propositions ℎ, and our
conclusion consist of any set of propositions 𝑎, then, if a knowledge of
ℎ justifies a rational belief in 𝑎 of degree 𝛼, we say that there is a
probability-relation of degree 𝛼 between 𝑎 and ℎ.
To say that the probability of a conclusion is high or low given a
set of premises is not for Keynes a matter of subjective evaluation of
the believer. It shares the objectivity of any other logical relation
between propositions. That is why Keynes (1921: 4) talks about the
degree of rational belief and not simply of a degree
of belief:
… in the sense important to logic, probability is not subjective.
It is not, that is to say, subject to human caprice. A proposition is
not probable because we think it so. When once the facts are given which
determine our knowledge, what is probable or improbable in these
circumstances has been fixed objectively, and is independent of our opinion. The Theory of
Probability is logical, therefore, because it is concerned with the
degree of belief which it is rational to entertain in given
conditions, and not merely with the actual beliefs of particular
individuals, which may or may not be rational.
It should be noted that Keynes based his defense of the logical
character of the probability relations on what he called “logical
intuition”, viz., a certain capacity possessed by agents in virtue of
which they can simply “see” the logical relation between the evidence
and the hypothesis. It is in virtue of this shared intuition that
different agents can have the same rational degree of belief in a
certain hypothesis in light of certain evidence. This view was
immediately challenged by Frank Ramsey, who, referring to Keynes’s
“logical relations” between statements, noted: “I do not perceive them
and if I am to be persuaded that they exist it must be by argument”
(1926: 63).
It should be clear that for Keynes probability is not always
quantitative. He believed that qualitative probabilities are meaningful
as well and that the totality of probabilities, or of degrees of
rational belief, may include both numbers and non-numerical elements.
In the usual numerical probabilities, all probabilities lie within the
unit interval and they are all comparable in terms of the relation
‘being greater than or equal to’ as defined in real numbers. This
relation induces a complete ordering to the unit interval which acquires
the structure of a completely ordered set. Since for Keynes
probabilities may not be numerical, a different interpretation of the
relation “being more probable than or equally probable to” expressing
the comparability of probabilities is required. In the class of
probabilities, Keynes defines a relation of ‘between’:
𝐴 is between 𝐵 and 𝐶, (𝐴, 𝐵, 𝐶)
where, for any three probabilities 𝐴, 𝐵, 𝐶 the relation, if
satisfied, is satisfied by a unique ordered triple (𝐴, 𝐵, 𝐶). He
identifies two distinguished probabilities, impossibility, 𝑂, and
certainty, 𝐼, between which all other probabilities lie.
Finally, he used the relation of betweenness to compare
probabilities:
If 𝐴 is between 𝑂 and 𝐵, the probability 𝐵 is said to be greater than
the probability 𝐴.
To illustrate these relations among probabilities, Keynes suggested
the following diagram. In this diagram, all probabilities comparable in
terms of the ‘greater than’ relation are connected with a continuous
path:

In Keynes’s (1921: 39) words:
𝑂 represents impossibility, 𝐼 certainty, and 𝐴 a numerically
measurable probability intermediate between 𝑂 and 𝐼; 𝑈, 𝑉, 𝑊, 𝑋, 𝑌, 𝑍
are nonnumerical probabilities, of which, however, 𝑉 is less than the
numerical probability 𝐴, and is also less than 𝑊, 𝑋, and 𝑌. 𝑋, and 𝑌 are
both greater than 𝑊, and greater than 𝑉, but are not comparable with one
another, or with 𝐴. 𝑉 and 𝑍 are both less than 𝑊, 𝑋, and 𝑌, but are not
comparable with one another; 𝑈 is not quantitatively comparable with any
of the probabilities 𝑉, 𝑊, 𝑋, 𝑌, 𝑍. Probabilities which are numerically comparable will all belong to one
series, and the path of this series, which we may call the numerical
path or strand, will be represented by 𝑂𝐴𝐼.
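The ordering described in this passage is a partial order, and it can be encoded as a small directed graph. The sketch below is an assumed encoding: it records only the "less than" relations stated in the quotation (plus the fact that every probability lies between impossibility and certainty) and treats any pair not connected by a chain of such relations as not comparable. The names `LESS_THAN`, `is_less` and `comparable` are hypothetical.

```python
# Pairs (x, y) meaning "x is less than y", read off Keynes's description.
# Assumption: relations not stated in the passage are simply left out.
LESS_THAN = {
    ("O", "V"), ("O", "Z"), ("O", "U"), ("O", "A"),
    ("V", "A"), ("V", "W"), ("Z", "W"),
    ("W", "X"), ("W", "Y"),
    ("A", "I"), ("X", "I"), ("Y", "I"), ("U", "I"),
}

def is_less(x, y):
    """True if y is reachable from x along a chain of 'less than' links."""
    stack, seen = [x], set()
    while stack:
        current = stack.pop()
        for lo, hi in LESS_THAN:
            if lo == current and hi not in seen:
                if hi == y:
                    return True
                seen.add(hi)
                stack.append(hi)
    return False

def comparable(x, y):
    """Two probabilities are comparable if one is less than the other."""
    return x == y or is_less(x, y) or is_less(y, x)

print(is_less("V", "X"))      # V < W < X
print(comparable("X", "Y"))   # both exceed W, yet neither exceeds the other
print(comparable("U", "W"))   # U is linked only to O and I
```

The point of the encoding is that "greater than" is read as the existence of a connecting path, so two probabilities can both lie between impossibility and certainty without either being greater than the other.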
The Principle of Indifference
To have numerical probabilities between alternative cases, Keynes
(1921: 41) believed that equiprobability of the alternatives is
required:
In order that numerical measurement may be possible, we must be given
a number of equally probable alternatives.
And:
It has always been agreed that a numerical measure can actually be
obtained in those cases only in which a reduction to a set of exclusive
and exhaustive equiprobable alternatives is practicable. (1921: 65)
In the terminology of the mathematical theory of probability, Keynes
stipulates that a real number $p(E|H)$ denotes the numerical
probability of an event $E$ given the truth of some
hypotheses $H$, assigned by a function $p$ satisfying
Kolmogorov’s axioms, only if $p(E|H)$ can be deduced from, or
reduced to, some initial numerical probabilities $p(A_k|H)$ assigned to
the members of a partition $\{A_k\}_{k=1,\dots,n}$ of the event space
$\mathcal{S}$ that satisfy the equiprobability condition:
$$p(A_k|H) = p(A_j|H), \quad k, j = 1, \dots, n.$$
What is the basis of equiprobability and how can it be justified?
Keynes (1921: 45) suggested that the justification of equiprobability
follows from the Principle of Indifference which states that:
if there is no known reason for predicating of our subject one rather
than another of several alternatives, then relatively to such knowledge
the assertions of each of these alternatives have an equal probability.
Thus, equal probabilities must be assigned to each of several arguments,
if there is an absence of positive ground for assigning unequal
ones.
The term ‘Principle of Indifference’ was coined by Keynes in the
Treatise on Probability. According to Ian Hacking (1971), this
principle can be traced back to Leibniz’s paper “De incerti
aestimatione” (1678). In this, Leibniz, anticipating Laplace, claimed
that:
Probability is the degree of possibility. Hope is the probability of
having. Fear is the probability of losing.
Leibniz considered the above claim as an axiom—something very similar
to the Principle of Indifference:
Axiom. If players do similar things in such a way that no distinction
can be drawn between them, with the sole exception of the outcome, there
is the same proportion of hope to fear.
Moreover, he suggested that we understand this axiom as having its
source in metaphysics, which seems to be an allusion to the Principle of
Sufficient Reason and, in particular, to the claim that God does, or
creates, nothing without a sufficient reason. Applying this metaphysical
principle to the expectations of rational agents, i.e., ‘players’, we
get the foregoing axiom, as Hacking suggested (1975: 126):
If several players engage in the same contest in such a way that no
difference can be ascribed to them (except insofar as they win or lose)
then each player has exactly the same ground for ‘fear or hope’.
Keynes, however, traces the principle of indifference to Jacques
(James) Bernoulli’s Principle of Non-Sufficient Reason (1921:
41). Bernoulli in his Ars Conjectandi, attempted to
calculate the “degree of certainty, or probability, that the argument
generates” [Notice that by ‘argument’ he meant a piece of evidence.] and
he assumed that “all cases are equally possible, or can happen with
equal ease.” There are examples, however, in which a case happens more
‘easily’ than others. Then, according to Bernoulli (1713: 219), we need
to make a correction:
For any case that happens more easily than the others as many more
cases must be counted as it more easily happens. For example, in place
of a case three times as easy I count three cases each of which may
happen as easily as the rest.
Thus, Bernoulli suggested that to save equiprobability we should
consider a finer partition of the sample space by subdividing the
ill-behaved case into distinct cases.
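Bernoulli’s correction amounts to refining the partition until all cases are equally easy. A minimal sketch, assuming the "ease" of each case is given as a whole number and using hypothetical case labels:

```python
from fractions import Fraction

def equiprobable_cases(ease):
    """Replace each case by as many equally easy sub-cases as its ease.

    `ease` maps a case label to a positive integer: a case that happens
    three times as easily as the rest is counted as three cases.
    """
    cases = []
    for label, weight in ease.items():
        cases.extend([label] * weight)
    return cases

def probability(label, ease):
    """Ratio of favourable sub-cases to all sub-cases after refinement."""
    cases = equiprobable_cases(ease)
    return Fraction(cases.count(label), len(cases))

# Hypothetical example: case "a" is three times as easy as "b" or "c".
ease = {"a": 3, "b": 1, "c": 1}
print(equiprobable_cases(ease))
print(probability("a", ease), probability("b", ease))
```

After the refinement there are five equally easy cases, three of which favour "a", so the classical ratio of favourable to possible cases can be applied again.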
Keynes was aware that the principle faces a number of difficulties
which take the form of paradoxes: it yields contradictory evaluations
of probabilities in specific cases. To resolve these paradoxes and avoid
ill-behaved cases, he attempted to restrict the application of
the principle of indifference.
The first paradox is known as the Book Paradox. Consider a book of
unknown cover color. We have no reason to believe that its color is red
rather than not red. Hence, by the principle of indifference, the
probability of its being red is $\frac{1}{2}$. In a similar vein, the
probabilities of its being green, yellow or blue are all $\frac{1}{2}$,
which contradicts the theorem of probability that the sum of the
probabilities of mutually exclusive events is less than or equal to 1.
The second paradox is the Specific Volume Paradox. Consider the
specific volume $v$ of a given liquid and assume that $1 \le v \le 3$ in
some system of units. Given that there is no reason to assume that
$1 \le v \le 2$ rather than $2 \le v \le 3$, by the principle of
indifference it is equally likely for the specific volume to lie in each
one of these intervals. Next, consider the specific density
$d = \frac{1}{v}$. Given our original assumption, we are justified to
infer that $\frac{1}{3} \le d \le 1$. Similarly, the principle of
indifference maintains that it is equally likely for the specific
density to have a value $\frac{1}{3} \le d \le \frac{2}{3}$ or to have a
value $\frac{2}{3} \le d \le 1$. Turning now to considerations about
specific volume, we find that it is equally likely that
$1 \le v \le \frac{3}{2}$ or $\frac{3}{2} \le v \le 3$. But we have
already shown that it is as likely for $v$ to lie between 1 and 2 as
between 2 and 3.
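The clash can be made explicit by computing the probability of one and the same event under the two applications of the principle. The sketch assumes that "equally likely" is read as a uniform distribution over the relevant interval; the helper `uniform_prob` is hypothetical.

```python
from fractions import Fraction

def uniform_prob(lo, hi, a, b):
    """P(a <= x <= b) when x is uniform on [lo, hi] and [a, b] lies inside it."""
    return Fraction(b - a) / Fraction(hi - lo)

# Event of interest: the specific volume lies between 1 and 3/2.
# Indifference applied to the volume v, uniform on [1, 3]:
p_via_volume = uniform_prob(Fraction(1), Fraction(3),
                            Fraction(1), Fraction(3, 2))

# The same event in terms of density d = 1/v is 2/3 <= d <= 1.
# Indifference applied to the density d, uniform on [1/3, 1]:
p_via_density = uniform_prob(Fraction(1, 3), Fraction(1),
                             Fraction(2, 3), Fraction(1))

print(p_via_volume, p_via_density)
```

The same event receives probability one quarter when indifference is applied to volume and one half when it is applied to density, so the two applications of the principle cannot both be right.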
The third paradox that seems to challenge the principle of
indifference is Bertrand’s paradox. Bertrand in his Calcul des
Probabilités (1888) argues that the principle of indifference can
be applied in more than one way in cases with infinitely many
possibilities giving rise to contradictory outcomes regarding the
evaluation of probabilities. In support of his argument he presented,
among other examples, his famous paradox: We trace at random a chord in
a circle. What is the probability that it would be longer than the side
of the inscribed equilateral triangle? Here are some different ways to
apply the principle of indifference to solve the problem, each leading
to different probability values. The first solution assumes that one end
of the requested chord is at a vertex of the triangle and the other lies
on the circumference.
The circumference is divided in three equal arcs by the vertices of
the triangle. From all possible chords traced from the given vertex,
only those that lie in the arc which subtends the angle at that vertex
are longer than the side of the equilateral triangle.
Therefore, the probability is $\frac{1}{3}$. For the second solution,
we assume that the chord is parallel to a side of the triangle. From
these parallel chords only the ones whose distance from the center is
less than one-half of the circle’s radius will have a length greater
than the side of the inscribed equilateral triangle. Thus, the requested
probability is $\frac{1}{2}$. Finally, we obtain a third solution by
assuming that the chord is defined by its midpoint. Then a chord is
longer than the side of the triangle if its midpoint falls within a
concentric circle of radius one-half that of the outer circle. The
probability is calculated as the ratio of the areas of the two circles
and is found to be $\frac{1}{4}$. Notice that Bertrand’s Paradox can
undermine the principle of indifference if and only if the problem at
hand is a determinate
problem with no unique solution. But there is no agreement on that!
Many believe that the problem is ambiguous or underspecified and, in
this sense indeterminate. They claim that once we select the set of
chords from which we draw one at random, the problem has a unique
solution by applying the principle of indifference. [For an interesting
discussion, see Shackel, 2007].
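The three solutions correspond to three different ways of drawing a chord "at random", and each can be simulated. The sketch below assumes a circle of radius 1, so that the side of the inscribed equilateral triangle has length √3; the sampler names and the number of trials are illustrative choices.

```python
import math
import random

SIDE = math.sqrt(3)  # side of the equilateral triangle inscribed in a unit circle

def chord_from_endpoints(rng):
    """First solution: fix one endpoint, pick the other uniformly on the circle."""
    theta = rng.uniform(0, 2 * math.pi)   # central angle between the endpoints
    return 2 * math.sin(theta / 2)

def chord_from_radial_distance(rng):
    """Second solution: chords parallel to a fixed direction, with the
    distance from the centre picked uniformly between 0 and the radius."""
    d = rng.uniform(0, 1)
    return 2 * math.sqrt(1 - d * d)

def chord_from_midpoint(rng):
    """Third solution: pick the chord's midpoint uniformly inside the disc
    (by rejection sampling from the enclosing square)."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        r2 = x * x + y * y
        if r2 <= 1:
            return 2 * math.sqrt(1 - r2)

def estimate(sampler, trials=200_000, seed=0):
    """Fraction of sampled chords longer than the triangle's side."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if sampler(rng) > SIDE)
    return hits / trials

p_endpoints = estimate(chord_from_endpoints)
p_radial = estimate(chord_from_radial_distance)
p_midpoint = estimate(chord_from_midpoint)
print(p_endpoints, p_radial, p_midpoint)
```

The three estimates should settle near one third, one half and one quarter respectively. The simulation does not resolve the paradox; it only shows that each sampling procedure is a perfectly definite random experiment, which is the point made by those who regard the original question as underspecified.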
To address the Book and the Specific Volume Paradoxes, Keynes
suggested that we should place a restriction on the application of the
Principle of Indifference. We should require that given our state of
knowledge, the partition of the sample space, i.e., the number
of alternative cases, is finite, and each alternative cannot be split up
further into a pair of mutually exclusive sub-alternatives which have
non-zero probability to occur (see 1921: 60). Now it is obvious that the
class of books with a non-red cover can be further subdivided into the
class of books with a blue cover and those with a non-blue cover and so
on; thus the adequacy condition for the application of the principle is
not satisfied. Similarly, in the case of the ranges of values of the
specific volume and the specific density, the principle does not apply
since there is no range of values which does not contain within itself
two similar ranges. Finally, for Bertrand’s paradox, since areas, arcs
and segments can be subdivided further into
non-overlapping parts without a limit, the principle of indifference
is not applicable (see 1921: 62). Yet, for the geometric example, Keynes
suggested a solution. Instead of considering as an alternative a point
in a continuous line, we may divide that line into a finite number of 𝑚
segments, no matter how small, and take as an alternative
the segment in which the point under consideration lies. Then we can
apply the principle of indifference to the 𝑚 alternatives which we
consider indivisible.
However, Keynes’s solution is not at all clear. The number 𝑚 can be as
great as one desires on the condition that we keep it finite. Hence, who
decides the number of alternatives to which the principle of
indifference is applied? If, on the other hand, we allow 𝑚 to increase
indefinitely, then we get the continuous case we sought to avoid (see
Childers 2013: 126).
Keynes on the Problem of Induction
For Keynes, probability is the part of logic that deals with rational
but inconclusive arguments; and since inductive reasoning is both
inconclusive and rational, induction becomes inductive logic. The key
question, of course, is the following: on what grounds is one justified
in believing that induction is rational?
According to Keynes, though Hume’s skeptical claims are usually
associated with causation, the real object of his attack is
induction i.e., the inference from past particulars to future
generalizations (see 1921: 312).
Keynes’s argument is the following:
- A constant conjunction between two events has been observed in
the past. This is a fact. Hume does not challenge this at all.
- What Hume challenges is whether we are justified to infer from a
past constant conjunction between two events that it will also hold in
the future.
- This kind of inference is called inductive.
- So, Hume is concerned with the problem of induction.
To see Keynes’s reaction to the problem of induction, let’s first
clarify what an inductive argument is for him (1921: 251):
It will be useful to call arguments inductive which depend in any way
on the methods of Analogy and Pure Induction.
Arguments from analogy are based on similarities among the objects of
a collection, on their likeness, while Pure Induction is induction by
enumeration. As Keynes (ibid) put it:
[w]e argue from … Pure Induction when we trust the number of the
experiments.
Keynes criticized Hume for not taking into account the analogical
dimension of an inductive argument by considering the observed instances
which serve as premises, as absolutely uniform (see 1921: 252). Instead,
Keynes suggested that the basis of Pure Induction is the likeness of
instances in certain respects (positive analogies) and their
dissimilarity in others (negative analogies). Only after having verified
such a likeness can we single out some features and predict the
occurrence of other features or infer a generalization of the sort “all
A is B”. Hence (1921: 253):
In an inductive argument, therefore, we start with a number of
instances similar in some respects AB, dissimilar in others C. We pick
out one or more respects A in which the instances are similar, and argue
that some of the other respects B in which they are also similar are likely to be associated
with the characteristics A in other unexamined cases.
So, assume that a finite number, $n$, of instances exhibits a certain
group of qualities, $a_1, \dots, a_r$, and single out two subgroups:
$a_1, a_2, a_3$ and $a_{r-1}, a_r$.
An inductive argument, for Keynes, would conclude that in every
instance of $a_1, a_2, a_3$, qualities $a_{r-1}, a_r$ are also
exhibited; or that $a_{r-1}, a_r$ are “bound up” with qualities
$a_1, a_2, a_3$ (1921: 290). This account of induction
presupposes, claims Keynes (ibid), that qualities in
objects are exhibited in groups and “a sub-class of each group [is]
an infallible symptom of the coexistence of certain other members of it
also.”
However, the world may not co-operate to the success of an inductive
argument.
Keynes identifies three “open possibilities” that would compromise
inductive generalization:
- Some quality $a_{r-1}$ or $a_r$ may be independent
of all other qualities of the instances, i.e., there are no groups of
qualities that contain the said quality and at least some of the
others.
- There are no groups to which both $a_1, a_2, a_3$ and
$a_{r-1}, a_r$ belong.
- $a_1, a_2, a_3$ belong to groups that
include $a_{r-1}, a_r$ and to other groups that do not
include them.
In any of the three cases, “All $a_1, a_2, a_3$ are
$a_{r-1}, a_r$” fails. Hence
induction fails.
Keynes (1921: 291) suggested an assumption of probabilistic nature
that would save us from such ‘pathological’ cases and would lead to a
successful induction; namely:
If we find two sets of qualities in coexistence there is a finite
probability that they belong to the same group, and a finite probability
also that the first set specifies this group uniquely.
If we grant this assumption, then inductive methodology aims to
increase the prior probability and make it large, in the light of new
evidence. But to this point we will return later.
Keynes discusses the justificatory ground of this assumption and
shows that it requires an a priori commitment to the claim that
qualitative variety in nature is limited. Although the individuals do
differ qualitatively, “their characteristics, however numerous, cohere
together in groups of invariable connection, which are finite in number”
(1921: 285).
This idea is incorporated in the Principle of Limited Variety of a
finite system (PLV), which Keynes (1921: 286) stated thus:
the amount of variety in the universe is limited in such a way that
there is no one object so complex that its qualities fall into an
infinite number of independent groups (i.e. groups which might exist
independently as well as in conjunction); or rather that none of the
objects about which we generalise are as complex as this; or at least
that, though some objects may be infinitely complex, we sometimes have a
finite probability that an object about which we seek to generalise is
not infinitely complex.
The gist behind the role of PLV is this. Suppose that although a
group of properties, say 𝐴 , has been invariably associated with a group
of properties, 𝐵, in the past, there is an unlimited variety of groups
of properties, 𝐵1, … , 𝐵𝑛, such that it is
logically possible that future occurrences of A will be
accompanied by any of the 𝐵𝑖’s, instead of 𝐵. Then, if we
let 𝑛 (the variety index) tend to infinity, we cannot even start to say
how likely it is that 𝐵 will occur given 𝐴, and the past association of
𝐴s with 𝐵s. PLV excludes the possibility just envisaged.
As PLV stipulates, there are no infinitely complex objects;
alternatively, the qualities of an object cannot fall into an infinite
number of independent groups. For Keynes, the qualities of an object are
determined by a finite number of primitive qualities; the latter (and
their possible combinations) can generate all apparent qualities of an
object. Since the number of primitive qualities is finite, the number of
groups they generate alone or by being combined is finite. Hence, for
any two sets of apparent properties, Keynes (1921: 292) concludes, there
is, “in the absence of evidence to the contrary, a finite probability
that the second set will belong to the group specified by the first
set.”
In any case, Keynes takes it that a generalization of the form ‘All
𝐴s are 𝐵s’ should be read thus ‘It is probable that any given 𝐴 is 𝐵’
rather than thus ‘It is probable that all 𝐴s are 𝐵s’. So, the issue is
the next instance of the observed regularity and not whether it holds
generally (1921: 287-288).
The absolute assertion of the finiteness of a system under
consideration as expressed by the Principle of Limited Variety is called
Inductive Hypothesis (IH) (1921: 299), and provides one of the
premises of an inductive argument; namely, that the a priori
probability of our conclusion, 𝑝(𝐶|𝐼𝐻), has a finite value. Keynes
distinguished (IH) from Inductive Method (IM) which amounts to
the process of increasing the a priori probability of the
conclusion, 𝑝(𝐶|𝐼𝐻), by taking into account the evidence 𝑒:
𝑝(𝐶|𝑒&𝐼𝐻) > 𝑝(𝐶|𝐼𝐻).
[For the mathematics of Keynes’s account of inductive method and the
emergence of the need for the inductive hypothesis in order that new
evidence strengthen our belief in the truth of the conclusion of an
inductive argument, the reader may consult Appendix 6.c]
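The role played by a finite a priori probability can be illustrated numerically. The following is only a sketch: it is not Keynes’s own derivation, it simply assumes Bayes’ theorem, and the likelihood values and the helper name `update` are hypothetical choices made for the illustration. It suggests why the Inductive Method can raise the probability of a conclusion only if the Inductive Hypothesis first secures a non-zero starting value.

```python
from fractions import Fraction


def update(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' theorem: p(C | e & IH) from p(C | IH) and the two likelihoods of e."""
    numerator = prior * likelihood_if_true
    total = numerator + (1 - prior) * likelihood_if_false
    return numerator / total


# Hypothetical likelihoods: each favourable instance e is certain if C is true
# and has probability 1/2 if C is false.
finite_prior = Fraction(1, 100)
posterior = finite_prior
for _ in range(10):  # ten favourable instances in a row
    posterior = update(posterior, Fraction(1), Fraction(1, 2))

# With a zero a priori probability, no amount of evidence helps.
zero_posterior = Fraction(0)
for _ in range(10):
    zero_posterior = update(zero_posterior, Fraction(1), Fraction(1, 2))

print(posterior, zero_posterior)
```

Under these assumed numbers each favourable instance doubles the odds of the conclusion, whereas a conclusion that starts at zero stays at zero.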
Significantly, Keynes adds that the Inductive Method may be used to
strengthen the Inductive Hypothesis itself. Since 𝐼𝐻 is a hypothesis and
since 𝐼𝑀 is indifferent to the content/status of the hypothesis it
applies to, it can be applied to 𝐼𝐻 itself. In other words, 𝐼𝑀 brings
some evidence to bear on the truth of 𝐼𝐻. What Keynes suggests is
this:
𝑝(𝐼𝐻|𝑒′&𝐼𝐻′ ) > 𝑝(𝐼𝐻|𝐼𝐻′),
where 𝐼𝐻′ is another general hypothesis, “more primitive and less
far-reaching” than
𝐼𝐻 such that 𝑝(𝐼𝐻|𝐼𝐻′) has a finite value, and 𝑒′ other evidence. The
argument is non-circular since the justification of the inductive
hypothesis is not accomplished by the hypothesis itself but in terms of
some other hypothesis more fundamental, by means of inductive method. Of
course, the account runs the risk of exchanging circularity for infinite
regress unless there exists some primitive inductive hypothesis.
But what would such a primitive inductive hypothesis be? We are left in
the dark:
We need not lay aside the belief that this conviction gets its
invincible certainty from some valid principle darkly present to our
minds, even though it still eludes the peering eyes of philosophy.
(1921: 304)
However, at the end of the day, Keynes simply argues that a non-zero
(finite) a priori probability is assigned to the inductive
hypothesis 𝐼𝐻 (which is equivalent to PLV). What would be the reason to
assign an a priori non-zero probability to the inductive hypothesis 𝐼𝐻?
Keynes’s answer, honest to the bone, shows the limitations of all attempts
to satisfy the inductive sceptic: “It is because there has been so much
repetition and uniformity in our experience that we place great
confidence in it.” (1921: 289-290)
It seems we cannot do better than relying on past experience. The
Inductive Hypothesis that supports induction, PLV in Keynes’s case, is
neither a self-evident logical axiom nor an object of direct
acquaintance (1921: 304). But nevertheless, he insists that it is true
of some factual systems. How do we know this? By past experience!
On the Rule of Succession
Before we leave Keynes, let us consider his critique of Laplace’s Rule
of Succession, i.e., the theorem of mathematical probability which
claims that if an event has
occurred 𝑚 times in succession, then the probability that it will
occur again is (𝑚 + 1)/(𝑚 + 2).
As discussed elsewhere [see our entry in IEP on The Problem of
Induction (Psillos
and Stergiou, 2022)], Venn had reasons not to “take such a rule as
this seriously” (1888: 197), but Keynes’s criticism goes well beyond these
reasons.
The crux of Keynes’s criticism is that the derivation of the
rule of succession combines two different methods for the determination
of the probability of an event which yield different probability values.
Thus, their combination is inconsistent and includes a latent
contradiction.
Consider several possible events 𝐸1, 𝐸2, … ,
𝐸𝑛 that are alternatives, i.e., they are mutually exclusive
and exhaustive of the sample space, and choose any one of them,
𝐸𝑖.
The first method stipulates that “when we do not know anything about
an alternative, we must consider all the possible values of the
probability of the alternative; these possible values can form in their
turn a set of alternatives, and so on. But this method by itself can
lead to no final conclusion.” (1921: 426) Let the probability of the
alternative be 𝑝(𝐸𝑖). The method stipulates that we should
consider all probability values of 𝐸𝑖 assigned by any
admissible probability functions 𝑝. These probability values for
𝐸𝑖 form another set of alternatives, say,
𝑝1(𝐸𝑖), … , 𝑝𝑛(𝐸𝑖),… And the
same process may be repeated, again and again, involving us in an
infinite regress. Thus, the first method is inconclusive.
The second method applies the principle of indifference stipulating
that “when we know nothing about a set of alternatives, we suppose the
probabilities of each of them to be equal.” (ibid)
Thus, the second method concludes that 𝑝(𝐸1) = ⋯ =
𝑝(𝐸𝑛) = 1/𝑛.
Consider the event 𝐸1: “the sun will rise tomorrow”
and its alternative 𝐸2: “the sun will not rise
tomorrow”. If we apply the first method only, we reach no conclusion
about probability and we are involved in an infinite regress. If
we apply the second method only, we obtain 𝑝(𝐸1) = 𝑝(𝐸2) = 1/2.
Finally, in deriving the
rule of succession both methods are applied successively. Namely, the
probability of
𝐸1 is unknown, and any probability value is possible
according to the first method. Thus, we form a set of alternatives for
the probability of 𝐸1 which, at a second stage, are reduced to
the equal probability case by applying the second method. This reasoning
is presupposed by the rule of succession.
The latent contradiction included in the rule of succession is that
for its derivation it is assumed that the a priori probability
of the event can be any number in the interval [0,1], with all numbers
being equally probable, while by application of the
rule the a priori probability, calculated in the absence of
any observations (𝑚 = 0), is 1/2.
In Keynes’s (1921: 430) own words:
The principle’s conclusion is inconsistent with its premises. We
begin with the assumption that the a priori probability of an event,
about which we have no information and no experience, is unknown, and
that all values between 0 and 1 are equally probable. We end with the
conclusion that the a priori probability of such an
event is 1/2 … this contradiction was latent, as soon as the
Principle of Indifference was
superimposed on the principle of unknown probabilities.
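The two ingredients that Keynes finds in tension can be seen side by side in a short computation. The sketch below follows the standard textbook derivation of the rule (a uniform distribution over the unknown chance 𝑝, followed by integration), which is assumed here for illustration; the function names are hypothetical.

```python
from fractions import Fraction


def integral_of_power(k):
    """Exact value of the integral of p**k over [0, 1], i.e. 1 / (k + 1)."""
    return Fraction(1, k + 1)


def rule_of_succession(m):
    """Probability of one more occurrence after m occurrences in succession,
    assuming all values of the unknown chance p in [0, 1] are equally probable."""
    return integral_of_power(m + 1) / integral_of_power(m)


for m in (0, 1, 9):
    print(m, rule_of_succession(m))
```

With no observations at all (𝑚 = 0) the function returns exactly 1/2, although the derivation started from the assumption that every value between 0 and 1 is equally probable.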
Carnap’s Inductive Logic
Two Concepts of Probability
Carnap presented his views of probability and induction mainly in
two books, Logical Foundations of Probability
(1950) and The Continuum of Inductive Methods (1952), and in his
papers “A basic system of inductive logic, I, II” (1971 and 1980,
respectively) and “Replies and Systematic Expositions” (1963). For
Carnap, the theory and principles of inductive reasoning, inductive
logic, is the same as probability logic (1950: v), and the primary task
to be set toward an account of inductive logic is the explication of
probability.
Explication, according to Carnap (1950: 3), is the
transformation of an inexact, possibly prescientific concept, the
explicandum, into a new exact concept, the explicatum,
that obeys explicitly stated rules for its use. By means of this
transformation a concept of ordinary discourse or a metaphysical concept
may be incorporated into a well-structured body of logico-mathematical
or empirical concepts. Explication has a long history as a philosophical
method that, in a wide sense, may be traced back even to Plato’s
investigations on definitions. Strictly speaking, however, Carnap
borrowed the term “Explikation” from Kant and Husserl while Frege may be
considered his precursor in this method of philosophical analysis and
Goodman, Quine and Strawson among his prominent intellectual inheritors.
[For a general presentation of the notion of explication, consult IEP’s
entry on Explication (Cordes and Siegwart 2019).]
Two concepts are distinguished as explicanda of probability
according to Carnap: the logical or inductive probability, called
‘probability1’ and the statistical probability, called ‘probability2’.
Both concepts are important for science and lack of recognition of this
fact, Carnap claimed, has fueled many futile controversies among
philosophers. The meaning of probability2 is that of relative frequency
of a kind of event in a long sequence of events, and in science it is
applied to the description and statistical analysis of mass phenomena.
All sentences about statistical probability are factual, empirical.
The logical concept of probability, probability1, is the basis for
all inductive reasoning. For Carnap (1950: 2), the problem of induction
is the problem of the logical relation between a hypothesis and some
confirming evidence for it and
“inductive logic is the theory based upon what might be called the
degree of inducibility, that is, the degree of confirmation.” Hence, by
taking probability1 to mean “the degree of confirmation of a hypothesis
ℎ with respect to an evidence statement 𝑒, e.g., an observational
report” (1950: 19) Carnap made it the basis of inductive logic. As for
any logical sentence, the truth or falsity of sentences about
probability1 is independent of extralinguistic facts.
In addition, logical probability is an objective concept, i.e., “if a
certain probability1 value holds for a certain hypothesis with respect
to a certain evidence, then this value is entirely independent of what
any person may happen to think about these sentences, just as the
relation of logical consequence is independent in this respect” (1950:
43). Carnap recognized the objectivity of probability1 in the views
of Keynes and Jeffreys, who interpreted probability in terms of rational
degrees of belief, as distinguished from the subjective, actual degrees of
belief a person might have in the truth of a sentence given some
evidence. Later, he (1963: 967) came to accept the interpretation of
probability1 as “the degree to which [one]… is rationally entitled to
believe in ℎ on the basis of 𝑒.”
C-functions
Carnap suggested three different concepts of confirmation. The first is the
classificatory concept of confirmation, which expresses a
logical relation between a piece of evidence 𝑒 and a hypothesis ℎ and,
if satisfied, qualifies the former as a confirming instance of the
latter. To signify the explicatum of this concept, Carnap used
the symbol ‘ℭ’, where
ℭ(ℎ, 𝑒) corresponds to “ℎ is confirmed (or, supported) by 𝑒”. The
second concept of confirmation he employed is the comparative
concept, which compares the strength with which a piece of evidence
𝑒1 confirms a hypothesis ℎ1 with the corresponding
strength with which 𝑒2 confirms ℎ2. Thus,
comparative confirmation requires the underlying classificatory
confirmation and it is, in general, a tetradic relation. Its explicatum
is symbolized by ‘𝔐ℭ’, where 𝔐ℭ(ℎ1, 𝑒1,
ℎ2, 𝑒2) corresponds to the statement
“ℎ1 is confirmed by 𝑒1 at least as strongly
(i.e., either more, or equally, strongly) as ℎ2 by
𝑒2”. Finally, there is a quantitative (or, metrical)
concept of confirmation, the degree of confirmation, which
assigns a numerical value to the degree to which a hypothesis ℎ is
supported by given observational evidence 𝑒. The explicatum of this
concept is symbolized by ‘𝔠’, where ‘𝔠(ℎ, 𝑒) = 𝑟’ corresponds to the
statement “the degree of confirmation of ℎ with respect to 𝑒 is 𝑟”,
where ℎ and 𝑒 are sentences and 𝑟 is a real number in the unit
interval.
In this context, Carnap points out that Keynes’s objective conception
of probability is similar to the comparative concept of confirmation and
only in some special cases, when the principle of indifference is
applicable, it can be interpreted quantitatively similar to his concept
of degree of confirmation (1950: 45 & 205). Moreover, notice that
all three conceptions of confirmation Carnap (1950: 19) suggested are
semantical:
The concepts of confirmation to be dealt with in this book are
semantical, i.e., based upon meaning, and logical, i.e., independent of
facts.
The inductive relation the three concepts of confirmation attempt to
explicate is not determined by the form of the sentences, as Hempel
required in his syntactic account of confirmation (1945), nor does it depend on
the users of a language, as Goodman suggested in his pragmatic solution
of the new riddle of induction (1955) (see also our other entry in IEP
on The Problem of Induction (Psillos and Stergiou, 2022)). Rather:
[O]nce ℎ and 𝑒 are given, the question mentioned requires only that
we be able to understand them, that is, to grasp their meanings, and to
establish certain relations which are based upon their meanings (1950:
20).
Carnap begins with the construction of the language(s) in which
inductive logic is to be applied. He defines several language systems
each one characterized by the number of names (constants) it contains
(1950: 58). Each name refers to individuals in the corresponding
universe of discourse, be they things, events, or the like. Thus, he
considered an infinite language system 𝔏∞, having an infinite
number of names and a sequence 𝔏1, 𝔏2, … ,
𝔏𝑁, … of language systems each one characterized by the index
𝑁 that runs through all positive integers indicating the number of names
the system includes. Hence, 𝔏1 contains only ‘𝑎1’;
𝔏2 contains ‘𝑎1’ and ‘𝑎2’; etc. Notice
that any sentence of 𝔏∞ is contained in an infinite number of
finite language systems of the hierarchy since if ‘𝑎𝑁’ is the
name with highest subscript that appears in that sentence, then this
sentence will be represented in any language system 𝔏𝑛 with 𝑛
≥ 𝑁. Apart from names, 𝔏∞ contains a finite number of
primitive (atomic) predicates of any degree (unary, binary etc.)
designating properties and relations among individuals in the universe
of discourse. Carnap considered only three connectives as primitive for
his language systems: the negation ‘~’, the conjunction ‘&’ and the
inclusive disjunction ‘∨’ – and he defined implication and biconditional
in terms of these three. Each language system contains an infinite
number of variables, 𝑥, 𝑦, 𝑧, 𝑥1, 𝑥2 …, and two
quantifiers, the existential ‘(∃𝑥)’ and the universal one, ‘(𝑥)’. The
sentence ‘(𝑥)𝑃𝑥’ is taken to be logically equivalent to
‘𝑃𝑎1&𝑃𝑎2 … &𝑃𝑎𝑁’ in a language
𝔏𝑁, according to the semantics adopted. The same is not true
for the case of 𝔏∞ since in this case the conjunction of an
infinite number of sentences is not a well-formed formula of the
language. Apart from the atomic predicates, molecular predicates may be
defined. They are formed by atomic or more basic molecular predicates
with the help of connectives. For example, if 𝑃1,
𝑃2, 𝑃3 are atomic predicates, then
‘~𝑃1’ or ‘𝑃1&𝑃2’ or ‘𝑃1
∨ 𝑃3’ are molecular predicates understood as follows: for any
variable
𝑥, (~𝑃1)𝑥 stands for ‘~(𝑃1𝑥)’;
(𝑃1&𝑃2)𝑥 for
‘𝑃1(𝑥)&𝑃2(𝑥)’; and (𝑃1 ∨
𝑃3)𝑥 for ‘𝑃1(𝑥) ∨ 𝑃3(𝑥)’. Finally,
language systems contain an equality symbol ‘=’ designating identity of
individuals in the universe of discourse and a tautological sentence
‘𝑡’. Like any language, these language systems are equipped with
rules for the formation of well-formed formulas (sentences) and rules
of truth, i.e., a semantics.
A state description 𝔙 is an explication of the vague concept
of a state of affairs relativized to a given language system 𝔏 (1950:
70ff). It purports to describe possible states of the universe of
discourse of 𝔏. A state description describes for every individual
designated by some name ‘𝑎’ and for every property designated by an
atomic predicate ‘𝑃’ of 𝔏 whether or not this individual has that
property, and similarly for relations. Thus, a state description will
contain exactly one sentence from the pair ‘𝑃𝑎’, ‘~𝑃𝑎’: either ‘𝑃𝑎’ or
‘~𝑃𝑎’ but not both, and no other element (similarly for relations). In
the case of a finite language system 𝔏𝑁, a state description
has the form of a conjunction of sentences of the aforementioned sort,
while in the case of the infinite language system 𝔏∞, a state
description is a class of sentences that contains exactly one sentence
from each such pair. In both cases nothing more is included in a
state description. The class of all state descriptions in a given
system 𝔏 is designated by ‘𝑉𝔙’, while the null class is designated by
‘𝛬𝔙’.
For example, consider a language system 𝔏3 with names,
‘𝑎’, ‘𝑏’ and ‘𝑐’ and a single atomic unary predicate symbol ‘𝑃’. The
complete set of state descriptions is the following:
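| State description | Sentence |
|---|---|
| 𝔙1 | 𝑃𝑎&𝑃𝑏&𝑃𝑐 |
| 𝔙2 | ~𝑃𝑎&𝑃𝑏&𝑃𝑐 |
| 𝔙3 | 𝑃𝑎&~𝑃𝑏&𝑃𝑐 |
| 𝔙4 | 𝑃𝑎&𝑃𝑏&~𝑃𝑐 |
| 𝔙5 | ~𝑃𝑎&~𝑃𝑏&𝑃𝑐 |
| 𝔙6 | ~𝑃𝑎&𝑃𝑏&~𝑃𝑐 |
| 𝔙7 | 𝑃𝑎&~𝑃𝑏&~𝑃𝑐 |
| 𝔙8 | ~𝑃𝑎&~𝑃𝑏&~𝑃𝑐 |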
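The state descriptions of such a small language can also be generated mechanically. The sketch below is a minimal illustration; the function names are hypothetical, and the enumeration order (by number of negated conjuncts) is an assumption chosen so that the numbering agrees with the ranges quoted later in the text for 𝑃𝑎&𝑃𝑏 and 𝑃𝑎 ∨ 𝑃𝑏.

```python
from itertools import combinations


def state_descriptions(names):
    """Yield each state description as a dict mapping a name to True ('P holds')
    or False ('~P holds'), ordered by the number of negated conjuncts."""
    for k in range(len(names) + 1):
        for negated in combinations(names, k):
            yield {n: n not in negated for n in names}


def render(sd):
    """Write a state description as a conjunction such as 'Pa&~Pb&Pc'."""
    return "&".join(("P" if holds else "~P") + n for n, holds in sd.items())


descriptions = [render(sd) for sd in state_descriptions(["a", "b", "c"])]
for i, text in enumerate(descriptions, start=1):
    print(i, text)
```

For 𝑁 names and a single unary predicate the procedure yields 2 to the power 𝑁 conjunctions, eight in the present example.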
The adequacy of a language system 𝔏 for inductive logic requires
compliance with two important conditions: the requirement of logical
independence and the requirement of completeness. The first condition
aims at restricting the language system to bar contradictory state
descriptions. The requirement of logical independence stipulates (i)
that atomic sentences (i.e., sentences that consist of an 𝑛-place
predicate and 𝑛 names) are logically independent, i.e., a class
containing atomic sentences (e.g., sentences of the form 𝑃𝑎 for a
predicate ‘𝑃’ and a name ‘𝑎’) and the negations of other atomic
sentences does not logically entail another atomic sentence or
its negation; (ii) that names in 𝔏 designate different and separate
individuals; and
(iii) that atomic predicates are interpreted to designate logically
independent attributes.
The requirement of completeness of language stipulates that the set
of the atomic predicates of 𝔏 be sufficient for expressing every
qualitative attribute of the individuals in the universe of discourse of
𝔏. This requirement seemed absolutely necessary for the Carnapian
system, since the language systems affect the 𝔠-values in the theory of
inductive logic. For the time being, all we need to stress is that this
requirement implies that a language system 𝔏 mirrors its
universe of discourse.
Whatever there is in it can be exhaustively expressed within 𝔏. Here
is Carnap’s example (1950: 75). Take a language system 𝔏 with only two
predicates, ‘ 𝑃1’ and ‘𝑃2’ interpreted as Bright
and Hot. Then, every individual in the universe of discourse of 𝔏 should
differ only with respect to these two attributes. If a new predicate
‘𝑃3’,
interpreted as Hard, were added, the 𝔠 -values of hypotheses
concerning individuals in
𝔏 would change. Even if this simple scheme holds (or might hold) in a
simple language, can it be adequate for the language of natural
sciences? A similar requirement had been proposed by Keynes, in the form
of the Principle of Limited Variety (see section 3c above).
Later on, Carnap abandoned this requirement and replaced it with the
following: The value of the confirmation function 𝔠(ℎ, 𝑒) remains
unchanged if further families of predicates are added to the language
(see 1963: 975). According to this requirement, the value of 𝔠(ℎ, 𝑒)
depends only on the predicates occurring in ℎ and
𝑒. Hence, the addition of new predicates to the language does
not affect the value of
𝔠(ℎ, 𝑒). This new idea amounts to what Lakatos (1968: 325) called the
minimal language requirement, according to which the degree of
confirmation of a proposition depends only on the minimal language in
which the proposition can be expressed.
Another important concept defined by Carnap is that of the
range of a sentence or of a collection of sentences (1950: 78).
The range of a sentence 𝑖, ℜ(𝑖), is the class of those state
descriptions in which that sentence holds. A (molecular) sentence of the
form ‘𝑃𝑎’ or ‘~𝑃𝑎’, for an atomic predicate ‘𝑃’ and some name ‘𝑎’, holds in a
state description 𝔙 if it is either a conjunct in 𝔙’s defining
conjunction or it belongs to the class of sentences that define 𝔙.
Analogously, if a sentence is a conjunction of sentences, then all
components of the conjunction should hold in a state description,
while if it is a disjunction, at least one disjunct should hold in a
state description, for the state description to belong to the
sentence’s range. Notice that a tautology holds in all state
descriptions. For instance, in the previous example, the range of
𝑃𝑎&𝑃𝑏 is ℜ(𝑃𝑎&𝑃𝑏) = {𝔙1, 𝔙4}, while the range of 𝑃𝑎 ∨ 𝑃𝑏 is
ℜ(𝑃𝑎 ∨ 𝑃𝑏) =
{𝔙1, 𝔙2, 𝔙3, 𝔙4, 𝔙6, 𝔙7}. Finally, the range of a class of sentences
is the class of state descriptions in which every sentence of the class
holds.
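The two ranges just quoted can be recomputed directly. In the sketch below a state description is modelled, by assumption, as the set of names that have 𝑃, sentences are modelled as truth conditions on such sets, and the numbering of state descriptions is an assumed ordering (by number of negated conjuncts) that reproduces the indices given in the text.

```python
from itertools import combinations

NAMES = ("a", "b", "c")

# State descriptions as sets of the names that have P; the ordering is an assumption.
STATES = [frozenset(n for n in NAMES if n not in negated)
          for k in range(len(NAMES) + 1)
          for negated in combinations(NAMES, k)]


def sentence_range(holds):
    """Return the 1-based indices of the state descriptions in which the sentence holds."""
    return {i for i, sd in enumerate(STATES, start=1) if holds(sd)}


range_and = sentence_range(lambda sd: "a" in sd and "b" in sd)      # Pa & Pb
range_or = sentence_range(lambda sd: "a" in sd or "b" in sd)        # Pa v Pb
range_taut = sentence_range(lambda sd: "a" in sd or "a" not in sd)  # Pa v ~Pa

print(sorted(range_and), sorted(range_or), sorted(range_taut))
```

The range of the conjunction is the intersection of the ranges of its conjuncts, and the range of the disjunction is their union, which is why the tautology ends up holding in all eight state descriptions.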
As a final step before defining the 𝔠-function, we present Carnap’s
account of logical concepts in a system 𝔏 in terms of state descriptions
and the concept of range: a sentence 𝑖 is L-true in 𝔏 if and
only if ℜ(𝑖) is 𝑉𝔙 while it is L-false in 𝔏 if and
only if ℜ(𝑖) is 𝛬𝔙; a sentence 𝑖 L-implies 𝑗 in 𝔏 if
and only if ℜ(𝑖) ⊂ ℜ(𝑗); 𝑖 is L-equivalent to 𝑗 in 𝔏 if and
only if ℜ(𝑖) = ℜ(𝑗); 𝑗1, 𝑗2, … , 𝑗𝑛 (𝑛 ≥ 2) are L-disjunct with
one another in 𝔏 if and only if ℜ(𝑗1) ∪ ℜ(𝑗2) ∪ …
∪ ℜ(𝑗𝑛) is 𝑉𝔙; 𝑖 is L-exclusive of 𝑗 in 𝔏
if and only if ℜ(𝑖) ∩ ℜ(𝑗) is 𝛬𝔙; a class of sentences is
L-exclusive in pairs
if and only if every sentence of the class is L-exclusive of every other
sentence of that class. L-truth is the explicatum for logical truth or
analytical truth, while L-falsity is the explicatum for contradiction. L-implication is the
explicatum for logical entailment, while L-equivalence explicates mutual
deducibility and is the same as mutual L-implication. L-disjunctness
applied to a set of sentences explicates the idea that at least one of
those sentences is true, and L-exclusion explicates logical
incompatibility or logical impossibility of joint truth.
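Since each L-concept is defined by a condition on ranges, the definitions translate almost word for word into set operations. The sketch below is an illustration for the three-name, one-predicate language; the modelling of state descriptions as sets of names and all function names are assumptions, and ⊂ is read as inclusion that allows equality.

```python
from itertools import combinations

NAMES = ("a", "b", "c")
# Each state description is modelled as the set of names that have P.
STATES = [frozenset(n for n in NAMES if n not in negated)
          for k in range(len(NAMES) + 1)
          for negated in combinations(NAMES, k)]
ALL = frozenset(STATES)  # plays the role of the class of all state descriptions


def rng(holds):
    """Range of a sentence: the state descriptions in which it holds."""
    return frozenset(sd for sd in STATES if holds(sd))


def l_true(r):
    return r == ALL


def l_false(r):
    return len(r) == 0


def l_implies(ri, rj):
    return ri <= rj


def l_equivalent(ri, rj):
    return ri == rj


def l_exclusive(ri, rj):
    return not (ri & rj)


def l_disjunct(*ranges):
    return frozenset().union(*ranges) == ALL


pa = rng(lambda sd: "a" in sd)
not_pa = rng(lambda sd: "a" not in sd)
pa_and_pb = rng(lambda sd: "a" in sd and "b" in sd)
pa_or_pb = rng(lambda sd: "a" in sd or "b" in sd)

print(l_implies(pa_and_pb, pa_or_pb), l_exclusive(pa, not_pa), l_disjunct(pa, not_pa))
```

In this model 𝑃𝑎&𝑃𝑏 L-implies 𝑃𝑎 ∨ 𝑃𝑏 but not conversely, and 𝑃𝑎 and ~𝑃𝑎 are both L-exclusive and L-disjunct.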
For the sake of simplicity, in this presentation we focus on finite
language systems.
Thus, 𝔪 is a regular measure function (briefly, a regular
𝔪-function) for 𝔙 in 𝔏𝑁 if and only if it fulfills the
following two conditions: (a) for every 𝔙𝑖 in 𝔏𝑁,
𝔪(𝔙𝑖) is a positive real number;
(b) the sum of the values of 𝔪 for all 𝔙 in 𝔏𝑁 is 1, ∑𝔙𝑖
𝔪(𝔙𝑖) = 1. The regular 𝔪-function for 𝔙 can be extended to a
regular 𝔪-function for the sentences in 𝔏𝑁
by requiring the following: (a) for any L-false sentence 𝑗 in
𝔏𝑁, 𝔪(𝑗) = 0; (b) for any non-L-false sentence 𝑗, 𝔪(𝑗) =
∑𝔙∈ℜ(𝑗) 𝔪(𝔙) (Carnap 1950: 295).
In the example of the language system 𝔏3 considered
previously, a regular 𝔪-function for state descriptions may be defined as
follows:
𝔪(𝔙𝑖) = 1/12, for 𝑖 = 1, 3, 4, 7;
𝔪(𝔙𝑖) = 1/6, for 𝑖 = 2, 5, 6, 8.
It is extended to a regular 𝔪-function for sentences that assigns
numerical values to sentences, e.g.,
𝔪(𝑃𝑎&~𝑃𝑎) = 0; 𝔪(𝑃𝑎 ∨ ~𝑃𝑎) = 1;
𝔪(𝑃𝑎&𝑃𝑏) = ∑𝑖∈{1,4} 𝔪(𝔙𝑖) = 1/6;
𝔪(𝑃𝑎 ∨ 𝑃𝑏) = ∑𝑖∈{1,2,3,4,6,7} 𝔪(𝔙𝑖) = 2/3.
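These values can be checked with exact arithmetic. The sketch below uses the weights of the example (1/12 for 𝑖 = 1, 3, 4, 7 and 1/6 for 𝑖 = 2, 5, 6, 8); the representation of state descriptions as sets of names, their ordering, and the function name `m` are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations

NAMES = ("a", "b", "c")
# State descriptions as sets of names that have P, in the assumed order 1..8.
STATES = [frozenset(n for n in NAMES if n not in negated)
          for k in range(len(NAMES) + 1)
          for negated in combinations(NAMES, k)]
# Weights of the example: 1/12 for i = 1, 3, 4, 7 and 1/6 for i = 2, 5, 6, 8.
WEIGHTS = [Fraction(1, 12) if i in (1, 3, 4, 7) else Fraction(1, 6)
           for i in range(1, 9)]


def m(holds):
    """Measure of a sentence: the sum of the weights over its range (0 if the range is empty)."""
    return sum((w for sd, w in zip(STATES, WEIGHTS) if holds(sd)), Fraction(0))


m_contradiction = m(lambda sd: "a" in sd and "a" not in sd)  # Pa & ~Pa
m_tautology = m(lambda sd: "a" in sd or "a" not in sd)       # Pa v ~Pa
m_and = m(lambda sd: "a" in sd and "b" in sd)                # Pa & Pb
m_or = m(lambda sd: "a" in sd or "b" in sd)                  # Pa v Pb

print(m_contradiction, m_tautology, m_and, m_or)
```

Because an L-false sentence has an empty range, the clause 𝔪(𝑗) = 0 for L-false 𝑗 falls out of the summation as a special case in this representation.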
A regular confirmation function is defined as a two-argument function
for sentences on the basis of a regular 𝔪-function for sentences in
𝔏𝑁. Namely, let 𝔪 be a regular 𝔪-function for sentences in
𝔏𝑁, then 𝔠 is a regular confirmation function
(briefly, a regular 𝔠-function) for sentences in 𝔏𝑁
if and only if for any sentences 𝑒, ℎ in 𝔏𝑁,
𝔠(ℎ, 𝑒) = 𝔪(𝑒&ℎ)/𝔪(𝑒),
where 𝔪(𝑒) ≠ 0, while 𝔠(ℎ, 𝑒) has no value where 𝔪(𝑒) = 0 (Carnap 1950:
295). In the aforementioned example, if 𝑒 stands for the L-false
sentence ‘𝑃𝑎&~𝑃𝑎’,
𝔠(ℎ, 𝑒) is not defined for any hypothesis ℎ. L-false sentences cannot
be evidence for or against any hypothesis. However, if an L-false
sentence, e.g., ‘𝑃𝑎&~𝑃𝑎’, is taken as hypothesis ℎ, then 𝔠(ℎ, 𝑒) =
0, for any admissible piece of evidence 𝑒. Consider an L-true sentence,
such as ‘ 𝑃𝑎 ∨ ~𝑃𝑎’, as hypothesis ℎ. Then 𝔠(ℎ, 𝑒) = 1 no matter what
the admissible evidence might be; no evidence can increase or decrease
the degree of confirmation of a logical truth (obviously, 𝑒 is not
L-false). In other cases, e.g., for the hypothesis ℎ, ‘𝑃𝑎’, and the
evidence 𝑒, ‘𝑃𝑏’, 𝔠(𝑃𝑎, 𝑃𝑏) = 𝔪(𝑃𝑎&𝑃𝑏)/𝔪(𝑃𝑏) = (1/6)/(1/2) = 1/3.
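The cases just discussed can be reproduced in a few lines. The sketch below reuses the example weights and models an undefined degree of confirmation by returning `None`; that convention, the set-based representation of state descriptions, and the function names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations

NAMES = ("a", "b", "c")
STATES = [frozenset(n for n in NAMES if n not in negated)
          for k in range(len(NAMES) + 1)
          for negated in combinations(NAMES, k)]
WEIGHTS = [Fraction(1, 12) if i in (1, 3, 4, 7) else Fraction(1, 6)
           for i in range(1, 9)]


def m(holds):
    """Measure of a sentence: the sum of the weights over its range."""
    return sum((w for sd, w in zip(STATES, WEIGHTS) if holds(sd)), Fraction(0))


def c(h, e):
    """Degree of confirmation m(e & h) / m(e); None when m(e) = 0 (no value)."""
    m_e = m(e)
    if m_e == 0:
        return None
    return m(lambda sd: e(sd) and h(sd)) / m_e


def pa(sd):
    return "a" in sd


def pb(sd):
    return "b" in sd


def contradiction(sd):
    return pa(sd) and not pa(sd)


def tautology(sd):
    return pa(sd) or not pa(sd)


print(c(pa, pb), c(pa, contradiction), c(contradiction, pb), c(tautology, pb))
```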
A regular 𝔠-function is a conditional probability function in the
common parlance
of mathematical theory of probability since it satisfies Kolmogorov’s
axioms. This was a desideratum for Carnap who stipulated that
an adequate concept of degree of confirmation should fulfill the
following conditions (1950: 285):
- (a) L-equivalent evidences. If 𝑒 and 𝑒′ are L-equivalent, then 𝔠(ℎ, 𝑒) = 𝔠(ℎ, 𝑒′).
- (b) L-equivalent hypotheses. If ℎ and ℎ′ are L-equivalent, then 𝔠(ℎ, 𝑒) = 𝔠(ℎ′, 𝑒).
- (c) General Multiplication Principle. 𝔠(ℎ&𝑗, 𝑒) = 𝔠(ℎ, 𝑒) ∙ 𝔠(𝑗, 𝑒&ℎ).
- (d) Special Addition Principle. If 𝑒&ℎ&𝑗 is L-false, then 𝔠(ℎ ∨ 𝑗, 𝑒) = 𝔠(ℎ, 𝑒) + 𝔠(𝑗, 𝑒).
- (e) Maximum Value. For any not L-false 𝑒, 𝔠(𝑡, 𝑒) = 1,
where ℎ, ℎ′, 𝑒, 𝑒′, 𝑗 are any sentences in 𝔏𝑁 and 𝑡 is a
logical truth. Conditions (a) and (b) demand that the
explicatum of the degree of confirmation should respect logical
equivalence. The General Multiplication Principle is derived
mathematically directly from the definition of conditional probability.
The Special Addition Principle is recognized as the additivity axiom in
Kolmogorov’s formulation, which gives rise to the finite additivity
condition, and the Maximum Value condition corresponds to the fact that the
probability of the sample space is 1.
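A spot check of these conditions on the running example may be helpful. The sketch below verifies each condition for one particular choice of sentences only, so it illustrates the conditions rather than proving them; the weights are those of the example above, and the helper names (`has`, `conj`, `disj`, `neg`) are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

NAMES = ("a", "b", "c")
STATES = [frozenset(n for n in NAMES if n not in negated)
          for k in range(len(NAMES) + 1)
          for negated in combinations(NAMES, k)]
WEIGHTS = [Fraction(1, 12) if i in (1, 3, 4, 7) else Fraction(1, 6)
           for i in range(1, 9)]


def m(s):
    return sum((w for sd, w in zip(STATES, WEIGHTS) if s(sd)), Fraction(0))


def c(h, e):
    return m(lambda sd: h(sd) and e(sd)) / m(e)  # assumes m(e) != 0


# Small sentence-building helpers (sentences are truth conditions on state descriptions).
def has(name):
    return lambda sd: name in sd


def conj(x, y):
    return lambda sd: x(sd) and y(sd)


def disj(x, y):
    return lambda sd: x(sd) or y(sd)


def neg(x):
    return lambda sd: not x(sd)


pa, pb, pc = has("a"), has("b"), has("c")
t = disj(pa, neg(pa))  # a logical truth

equivalent_evidence = c(pa, pb) == c(pa, conj(pb, t))
equivalent_hypotheses = c(pa, pb) == c(conj(pa, t), pb)
multiplication = c(conj(pa, pb), pc) == c(pa, pc) * c(pb, conj(pc, pa))
addition = c(disj(pa, neg(pa)), pc) == c(pa, pc) + c(neg(pa), pc)
maximum = c(t, pc) == 1

print(equivalent_evidence, equivalent_hypotheses, multiplication, addition, maximum)
```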
To recover unconditional probability functions for sentences
in 𝔏𝑁, Carnap suggested considering the probability of any
sentence conditional on a tautology. Namely, if 𝔠 is a regular
confirmation function for 𝔏𝑁, then for every sentence 𝑗 in
𝔏𝑁, the null confirmation 𝔠0 is
𝔠0(𝑗) = 𝔠(𝑗, 𝑡). Moreover, he showed that 𝔠0(𝑗) =
𝔪(𝑗). The null confirmation represents the prior probability of a
sentence in the absence of any evidence (1950: 307-8).
In the example of the language system 𝔏3 considered
previously we suggested a regular 𝔪-function that assigns different real
numbers to different state descriptions, i.e., to different states in
the universe of discourse. However, is there any reason to believe that
these numbers should be unequal? Is there any reason to believe that one
state description weighs more than any other? Rather, by application of
the principle of indifference, it seems that we should demand equal
distribution of weight to all state descriptions,
𝔪+(𝔙) = 1/𝜁,
where 𝜁 is the number of the state descriptions in 𝔏𝑁
(Carnap, 1950: 564). Moreover, it is easy to show that for any given
piece of evidence 𝑒 and for every pair of state description
𝔙𝑖, 𝔙𝑗 compatible with 𝑒, it holds:
𝔠+(𝔙𝑖, 𝑒) = 𝔠+(𝔙𝑗,
𝑒).
Of course, the principle of indifference entails equiprobability only
for state descriptions and not for all sentences, in a way that Keynes
would appreciate, since he was the first to suggest restricted
application of the principle of indifference to possibilities that are
mutually exclusive and exhaustive of the sample space, to avoid the Book
paradox. Salmon (1966: 72) notes that Carnap’s “…explication of
probability in these terms has been thought to preserve the ‘valid core’
of the traditional principle of indifference”.
Nevertheless, Carnap has shown that to suggest a regular 𝔪-function
for 𝔙 in 𝔏 that assigns equal weight to all state descriptions, although
intuitively plausible, has deeply undesirable consequences: it inhibits
learning from experience. To see why, consider a language
𝔏𝑁+1 with a single unary atomic predicate 𝑃. We want to
calculate the degree of confirmation of the hypothesis that the (𝑁 +
1)th individual will have the property 𝑃, i.e., ℎ:
‘𝑃𝑎𝑁+1’, given the evidence that all individuals
examined so far had the property 𝑃, i.e., 𝑒: ‘𝑃𝑎𝑁& …
&𝑃𝑎1’. The number of state descriptions is
2^(𝑁+1); hence, the 𝔪+ regular 𝔪-function assigns
equal weight to all state descriptions,
𝔪+(𝔙) = 1/2^(𝑁+1).
First, notice that ℎ&𝑒 and ~ℎ&𝑒 are state
descriptions; hence,
𝔪+(ℎ&𝑒) = 𝔪+(~ℎ&𝑒) = 1/2^(𝑁+1).
Second, sentences 𝑒 and (ℎ&𝑒) ∨ (~ℎ&𝑒) are L-equivalent and, writing 𝔠0+
for the null confirmation that corresponds to 𝔠+, 𝔪+(𝑒) =
𝔠0+(𝑒) = 𝔠+(𝑒, 𝑡). By the L-equivalent-hypotheses
condition, 𝔪+(𝑒) = 𝔠+((ℎ&𝑒) ∨ (~ℎ&𝑒), 𝑡);
and by the Special Addition Principle,
𝔪+(𝑒) = 𝔠+(ℎ&𝑒, 𝑡) + 𝔠+(~ℎ&𝑒, 𝑡) = 𝔠0+(ℎ&𝑒) + 𝔠0+(~ℎ&𝑒) = 𝔪+(ℎ&𝑒) + 𝔪+(~ℎ&𝑒) = 1/2^(𝑁+1) + 1/2^(𝑁+1) = 2/2^(𝑁+1).
Hence,
𝔠+(ℎ, 𝑒) = 𝔪+(ℎ&𝑒)/𝔪+(𝑒) = (1/2^(𝑁+1)) ∙ (2^(𝑁+1)/2) = 1/2.
Moreover, by a simple calculation,
𝔠0+(ℎ) = 𝔪+(ℎ) = ∑𝔙∈ℜ(ℎ) 𝔪+(𝔙) = 2^𝑁 ∙ 1/2^(𝑁+1) = 1/2,
i.e.,
𝔠+(ℎ, 𝑒) = 𝔠0+(ℎ).
The last equality yields the desired conclusion: the degree of
confirmation of a hypothesis is independent of the evidence collected in
a given population. No matter how many positive instances of a given
property one observes in a population, their guess regarding the
appearance of the property in the next individual is not better
justified than if no observations were made; thus learning does not come
from experience (1950: 564-5).
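The result is easy to confirm by brute-force enumeration for small populations. The sketch below is an illustration under stated assumptions: state descriptions are modelled as tuples of truth values, one per individual, and, since all state descriptions receive the same weight under 𝔪+, ratios of measures are computed as ratios of counts. The function name `c_plus` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def c_plus(hypothesis, evidence, n_individuals):
    """Degree of confirmation when every state description has the same weight.

    A state description is a tuple of booleans (True = the individual has P).
    Returns None when the evidence holds in no state description."""
    states = list(product([True, False], repeat=n_individuals))
    count_e = sum(1 for s in states if evidence(s))
    count_he = sum(1 for s in states if evidence(s) and hypothesis(s))
    if count_e == 0:
        return None
    return Fraction(count_he, count_e)


results = {}
for n in range(1, 8):
    # h: the (n + 1)th individual has P; e: the first n individuals all have P.
    with_evidence = c_plus(lambda s: s[-1], lambda s: all(s[:-1]), n + 1)
    without_evidence = c_plus(lambda s: s[-1], lambda s: True, n + 1)
    results[n] = (with_evidence, without_evidence)

print(results)
```

For every population size tried, the value with the evidence and the value without it coincide at 1/2, which is the sense in which 𝔪+ blocks learning from experience.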
To avoid this difficulty, Carnap suggested applying the principle of
indifference in a different way. Instead of distinguishing states of
affairs in terms of properties and relations instantiated by certain
individuals, Carnap grouped all states of affairs
instantiating the same properties and relations,
independently of the individuals that instantiated them, and
distinguished only among these classes. Hence, we should not focus
anymore on state descriptions describing possible states of the universe
of discourse for a language system but on classes of such state
descriptions in which any two state descriptions are isomorphic
to one another. Two sentences 𝑖, 𝑗 in 𝔏𝑁 are isomorphic if 𝑗
is formed from 𝑖 by replacing each individual constant occurring in 𝑖 by
its correlate with respect to a one-to-one relation among all individual
constants in
𝔏𝑁. These classes are called structure
descriptions, 𝔖𝔱𝔯. They describe the common structure
attributed to the realm of individuals by a class of state descriptions.
For instance, a structure description may express the fact that there
are exactly two individuals in the universe of discourse possessing a
given property 𝑃, or that none of the individuals bears the relation 𝑅 to
itself, or that relation 𝑅 is satisfied by pairs of individuals
non-symmetrically (i.e., whenever 𝑅𝑎𝑏 is satisfied for individual constants 𝑎, 𝑏,
~𝑅𝑏𝑎 is satisfied as well), etc. Now the principle of indifference applies
in two stages: firstly, following the principle we assign equal weight
to all structure descriptions and, secondly, within each structure
description we assign equal weight to all isomorphic state descriptions.
Thus, for a state description 𝔙𝑖 in a language system
𝔏𝑁, if 𝜏 is the number of structure descriptions 𝔖𝔱𝔯 and
𝜁𝑖 the number of all state descriptions that are isomorphic
to 𝔙𝑖, we define (1950: 564) the regular 𝔪-function for
𝔙𝑖:
$$\mathfrak{m}^*(\mathfrak{V}_i) = \frac{1}{\tau \cdot \zeta_i}.$$
To illustrate the relation between state descriptions and structure
descriptions, and the difference between the values of the regular
𝔪-functions 𝔪+ and 𝔪∗, consult the following table, which
represents the example of 𝔏3 with a single predicate 𝑃:
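| STATE DESCRIPTIONS | 𝖒+ | STRUCTURE DESCRIPTIONS | WEIGHT | 𝖒∗ |
|---|---|---|---|---|
| 𝑃𝑎1&𝑃𝑎2&𝑃𝑎3 | 1/8 | all three individuals are 𝑃 | 1/4 | 1/4 |
| 𝑃𝑎1&𝑃𝑎2&~𝑃𝑎3 | 1/8 | exactly two individuals are 𝑃 | 1/4 | 1/12 |
| 𝑃𝑎1&~𝑃𝑎2&𝑃𝑎3 | 1/8 | | | 1/12 |
| ~𝑃𝑎1&𝑃𝑎2&𝑃𝑎3 | 1/8 | | | 1/12 |
| 𝑃𝑎1&~𝑃𝑎2&~𝑃𝑎3 | 1/8 | exactly one individual is 𝑃 | 1/4 | 1/12 |
| ~𝑃𝑎1&𝑃𝑎2&~𝑃𝑎3 | 1/8 | | | 1/12 |
| ~𝑃𝑎1&~𝑃𝑎2&𝑃𝑎3 | 1/8 | | | 1/12 |
| ~𝑃𝑎1&~𝑃𝑎2&~𝑃𝑎3 | 1/8 | no individual is 𝑃 | 1/4 | 1/4 |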
Let’s now revisit the problem of determining the degree of
confirmation of the hypothesis that the (𝑁 + 1)th individual
will have the property 𝑃, i.e., ℎ: ‘𝑃𝑎𝑁+1’, given the
evidence that all individuals examined so far had the property
𝑃, i.e., 𝑒: ‘𝑃𝑎𝑁& … &𝑃𝑎1’ in a language
𝔏𝑁+1 with a single unary predicate 𝑃. Since our language
contains 𝑁 + 1 individual constants, a structure description is
determined by the number of instances of the property 𝑃 we find in the
universe of discourse disregarding the identity of the individuals that
instantiate the property. Thus, all state descriptions that are
isomorphic to ‘𝑃𝑎𝑁+1&𝑃𝑎𝑁& …
&𝑃𝑎1’ correspond to the same structure description
characterized by 𝑁 + 1 property instances in the universe of discourse,
while all state descriptions that are isomorphic to
‘~𝑃𝑎𝑁+1&~𝑃𝑎𝑁& … &~𝑃𝑎1’
correspond to the same structure description characterized by 0 property
instances in the universe of discourse. Thus, we have different
structure descriptions corresponding to 0,1, … , 𝑁 + 1 occurrences of 𝑃
and the total number of structure descriptions is 𝜏 = 𝑁 + 2. To calculate
the number 𝜁𝑘 of state descriptions that are isomorphic to
𝔙𝑘, let us take 𝑘 to denote the number of occurrences of 𝑃 in
𝔙𝑘, i.e., 𝑘 = 0,1, … , 𝑁 + 1. Then 𝜁𝑘 is the number of
different ways of choosing, among the 𝑁 + 1 individuals, the 𝑘 that have 𝑃, i.e.,
$$\zeta_k = \binom{N+1}{k} = \frac{(N+1)!}{k!\,(N+1-k)!}.$$
Thus, we find that
$$\mathfrak{m}^*(\mathfrak{V}_k) = \frac{k!\,(N+1-k)!}{(N+2)!}$$
for 𝑘 = 0,1, … , 𝑁 + 1.
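The degree of confirmation of the hypothesis ℎ given evidence 𝑒 is
$$\mathfrak{c}^*(h, e) = \frac{\mathfrak{m}^*(h \& e)}{\mathfrak{m}^*(e)}.$$
Notice that ℎ&𝑒 is isomorphic to any state description 𝔙𝑁+1 and
$$\mathfrak{m}^*(h \& e) = \mathfrak{m}^*(\mathfrak{V}_{N+1}) = \frac{(N+1)!}{(N+2)!} = \frac{1}{N+2},$$
while ~ℎ&𝑒 is isomorphic to any state description 𝔙𝑁 and
$$\mathfrak{m}^*(\sim h \& e) = \mathfrak{m}^*(\mathfrak{V}_N) = \frac{N!}{(N+2)!}.$$
As before, sentence 𝑒 is L-equivalent to (ℎ&𝑒) ∨ (~ℎ&𝑒) and
$$\mathfrak{m}^*(e) = \mathfrak{m}^*(h \& e) + \mathfrak{m}^*(\sim h \& e) = \frac{N!}{(N+1)!} = \frac{1}{N+1}.$$
Thus,
$$\mathfrak{c}^*(h, e) = \frac{\mathfrak{m}^*(h \& e)}{\mathfrak{m}^*(e)} = \frac{N+1}{N+2}.$$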
Using the same reasoning, we may calculate, more generally, the
degree of confirmation of the hypothesis that the (𝑟 + 1)-th individual
𝑎𝑟+1 will exhibit property 𝑃, i.e., ℎ:
‘𝑃𝑎𝑟+1’, given the evidence that the 𝑟 individuals of the universe
of discourse examined so far have exhibited the same property 𝑃, i.e., 𝑒:
‘𝑃𝑎𝑟& … &𝑃𝑎1’,
$$\mathfrak{c}^*(h, e) = \frac{\mathfrak{m}^*(h \& e)}{\mathfrak{m}^*(e)} = \frac{r+1}{r+2}.$$
These results amount to the celebrated Laplace’s Rule of Succession,
which in Carnap’s theory of inductive logic has become a theorem.
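The two-stage assignment of weights can be checked by brute force for small languages. The following Python sketch is only an illustration and not part of Carnap's own presentation: it assumes a language with `n` individual constants and a single predicate 𝑃, models a state description as a tuple of truth values, and uses the hypothetical helper names `m_star`, `m_plus` and `confirmation`.

```python
from fractions import Fraction
from itertools import product
from math import comb

def m_star(state):
    """Weight under m*: equal weight for each structure description,
    shared equally among the state descriptions isomorphic to `state`."""
    n = len(state)        # number of individual constants
    k = sum(state)        # number of individuals that have P
    tau = n + 1           # structure descriptions: k = 0, 1, ..., n
    zeta = comb(n, k)     # state descriptions isomorphic to this one
    return Fraction(1, tau * zeta)

def m_plus(state):
    """Weight under m+: equal weight for every state description."""
    return Fraction(1, 2 ** len(state))

def confirmation(m, n, r):
    """c(h, e) = m(h & e) / m(e), where e says that the first r of the n
    individuals have P and h says that individual r + 1 has P."""
    if not 0 <= r < n:
        raise ValueError("need 0 <= r < n")
    states = list(product([True, False], repeat=n))
    m_e = sum(m(s) for s in states if all(s[:r]))
    m_he = sum(m(s) for s in states if all(s[:r + 1]))
    return m_he / m_e
```

Under these assumptions, `confirmation(m_star, 3, 2)` evaluates to `Fraction(3, 4)`, the value that the rule of succession gives for two positive instances, and the result does not change if `n` is increased. By contrast, `confirmation(m_plus, 3, 2)` evaluates to `Fraction(1, 2)`: with equal weights on state descriptions the evidence makes no difference.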
The Continuum of Inductive Methods
In the examples so far, we have examined three different regular
𝔠-functions: one determined by arbitrarily assigning weight to state
descriptions in 𝔏3; the other two,
𝔠+, 𝔠∗, determined by assigning equal weight to
state and structure descriptions, respectively, on the basis of the
principle of indifference. There are many alternative ways to assign
such a weight to the different possibilities and each one of them
results in a different regular 𝔠-function yielding a different degree of
confirmation 𝔠(ℎ, 𝑒) for a given hypothesis ℎ and evidence 𝑒 in a
language system 𝔏. Thus, there are many different inductive methods,
actually, a continuum of such possible methods (Carnap, 1952). For a
given language system each inductive method is characterized by the
value of a non-negative real parameter 𝜆. For a given 𝜆 the degree of
confirmation
𝔠(ℎ, 𝑒) is fixed for any hypothesis ℎ and with respect to any
evidence 𝑒 and any two inductive methods have the same 𝜆 only if they
agree on the value of 𝔠(ℎ, 𝑒).
To understand how the degree of confirmation is defined in terms of
the 𝜆- parameter, we need first to explain the concept of logical
width of a property (1950: 126-127). Consider any language system
𝔏𝑁 having 𝜋 unary atomic predicates. We may form molecular
predicates by taking the conjunction of 𝜋 predicates, each of which is either
an atomic predicate or its negation. In this way we form 𝜅 =
2^𝜋 molecular predicates (Q-predicates). Then any property 𝐹
expressible in 𝔏𝑁 is represented either by a Q-predicate or
by a disjunction of two or more Q-predicates. Logical width
characterizes the logical complexity of a property 𝐹. The greater the
logical width of a property, the greater is the number of possible
(non-contradictory) properties it admits. For example, the property
𝑃1 ∨ 𝑃2 is wider than 𝑃1 since property
~𝑃1&𝑃2 is admitted by the first but excluded
by the second. Thus, the logical width of a contradictory property is 0
while the logical width of a property represented by a Q-predicate is 1.
Any property 𝐹 that is expressed as a disjunction of Q-predicates has a
logical width 𝑤, with 𝜅 ≥ 𝑤 > 1, equal to the number of disjuncts.
Moreover, the relative width of 𝐹 is the ratio 𝑤/𝜅. Notice that
the relative width varies from 0, for a contradictory property, through
½, for any property represented by an atomic predicate, to 1 for a
logically necessary property.
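Logical width can be computed mechanically by counting Q-predicates. The sketch below is an illustration under stated assumptions: a property is modelled as a Python function of the truth values of the atomic predicates, a Q-predicate as one complete assignment of such truth values, and the names `logical_width` and `relative_width` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def logical_width(prop, n_atoms):
    """Count the Q-predicates compatible with `prop`.

    A Q-predicate is modelled as a tuple of n_atoms truth values, one per
    atomic predicate; `prop` maps such truth values to True or False."""
    return sum(1 for q in product([True, False], repeat=n_atoms) if prop(*q))

def relative_width(prop, n_atoms):
    """Ratio w / kappa, where kappa = 2 ** n_atoms is the number of Q-predicates."""
    return Fraction(logical_width(prop, n_atoms), 2 ** n_atoms)

# Two atomic predicates P1, P2, hence kappa = 4.
widths = {
    "P1": logical_width(lambda p1, p2: p1, 2),
    "P1 or P2": logical_width(lambda p1, p2: p1 or p2, 2),
    "not P1 and P2": logical_width(lambda p1, p2: (not p1) and p2, 2),
    "P1 and not P1": logical_width(lambda p1, p2: p1 and not p1, 2),
}
```

With two atomic predicates the widths come out as 2 for 𝑃1, 3 for 𝑃1 ∨ 𝑃2, 1 for the Q-predicate ~𝑃1&𝑃2 and 0 for the contradictory property, in line with the examples in the text.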
Let 𝑒 be the sentence expressing that out of 𝑠 individuals examined,
𝑠𝐹 had property 𝐹, and let ℎ be the hypothesis that a given
individual different from those examined so far also has 𝐹. Then the
degree of confirmation 𝔠(ℎ, 𝑒) is
$$\mathfrak{c}(h, e) = \left(\frac{s}{s+\lambda}\right)\frac{s_F}{s} + \left(\frac{\lambda}{s+\lambda}\right)\frac{w}{\kappa},$$
where 𝑠𝐹⁄𝑠 is the relative frequency of observed instances
of the property 𝐹 and 𝜆 a non-negative real number (Burks, 1953). The
relative frequency of observed instances, 𝑠𝐹⁄𝑠, is an
empirical fact while the relative width of the property is a logical
fact depending on the language system and the predicate that represents
the property. Hence, the degree of confirmation is determined as a
mixture of a logical factor and of an empirical factor (1952: 24):
$$\mathfrak{c}(h, e) = (1 - a)\frac{s_F}{s} + a\frac{w}{\kappa},$$
where 𝑎 = 𝜆⁄(𝑠 + 𝜆). If no observation has taken place, i.e., 𝑠 = 0,
then 𝔠(ℎ, 𝑒) = 𝑤⁄𝜅, and the
degree of confirmation is determined on logical grounds. As the
number of observations increases, the relative frequency of observed instances
acquires significance and the degree of confirmation tends toward 𝑠𝐹⁄𝑠.
Exactly how fast we learn from experience, that is, how fast 𝔠(ℎ, 𝑒) tends to 𝑠𝐹⁄𝑠,
depends on 𝜆. In the following table we
have summarized the degrees of confirmation that correspond to
different characteristic values of 𝜆:
| 𝝀 | 𝖈(𝒉, 𝒆) |
|---|---|
| 𝜆 = 0 | 𝑠𝐹⁄𝑠 |
| 𝜆 = 𝜅 | (𝑠𝐹 + 𝑤)⁄(𝑠 + 𝜅) |
| 𝜆 → ∞ | 𝑤⁄𝜅 |
For 𝜆 = 0, we have the straight rule, which stipulates that
the observed relative frequency is equal to the probability that an
unobserved individual has the property in question. Carnap says that the
straight rule is problematic since it yields complete certainty (𝔠 = 1)
if all examined individuals are found to possess the relevant property
(𝑠𝐹 = 𝑠), a conclusion that may be accepted if the size 𝑠 of
the sample is quite large but not otherwise (1950: 227). The second row
in our table (𝜆 = 𝜅) is better interpreted if we assume that our
language system consists of one atomic unary predicate only. Then 𝑤 = 1
and 𝜅 = 2, and we get Laplace’s rule of succession,
𝔠(ℎ, 𝑒) = 𝔠∗(ℎ, 𝑒) = (𝑠𝐹 + 1)⁄(𝑠 + 2). Finally, with the same assumptions
about the language system, for
𝜆 → ∞ the logical factor reigns and 𝔠(ℎ, 𝑒) = 𝔠+(ℎ, 𝑒) = 1⁄2, as calculated for
equiprobable state descriptions.
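The dependence on 𝜆 can be made concrete with a short computation. The sketch below is an illustration only: the function name `c_lambda` is hypothetical, and the formula is written in the algebraically equivalent form (𝑠𝐹 + 𝜆𝑤⁄𝜅)⁄(𝑠 + 𝜆), which avoids dividing by 𝑠 when no observation has been made.

```python
from fractions import Fraction

def c_lambda(s_F, s, w, kappa, lam):
    """Degree of confirmation in the lambda-continuum.

    s_F   : number of examined individuals that had property F
    s     : number of examined individuals
    w     : logical width of F
    kappa : number of Q-predicates
    lam   : the non-negative parameter lambda
    """
    lam = Fraction(lam)
    if s + lam == 0:
        raise ValueError("undefined for s = 0 and lambda = 0")
    return (s_F + lam * Fraction(w, kappa)) / (s + lam)

# One atomic predicate (w = 1, kappa = 2), three positive instances out of three.
straight_rule = c_lambda(3, 3, 1, 2, 0)       # lambda = 0
succession = c_lambda(3, 3, 1, 2, 2)          # lambda = kappa
nearly_logical = c_lambda(3, 3, 1, 2, 10**9)  # very large lambda
```

Here `straight_rule` is 1, `succession` is 4/5, and `nearly_logical` lies very close to 1/2, which matches the three characteristic cases discussed above.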
How can we decide which of the uncountable infinity of inductive
methods is the appropriate one? Carnap’s answer is based on two
important elements: (a) adopting an inductive method is a matter of
choice that we make; (b) this choice is made on a
priori grounds. Carnap agreed with Burks’ suggestion to apply to
induction the internal-external distinction concerning the adoption of
frameworks (1963: 982).
Thus, while the degree of confirmation for a given hypothesis on
given evidence is an internal question, it presupposes the
adoption of a 𝔠-function, the choice of which is an external
one; i.e., it is raised outside any inductive system and has to do with
the choice of a framework similar to the choice of a language system.
Richard Jeffrey (1992: 28) pointed out that:
Carnap counted the specification of 𝔠-functions among the semantical
rules for languages. Choice of a language was a framework question, a
practical choice that could be wise or foolish, and lucky or unlucky,
but not true or false.
The pragmatic (i.e., non-cognitive) nature of the scientist’s choice
of an inductive method becomes apparent in the passage below:
X may change this instrument [i.e., their inductive method]
just as he changes a saw or an automobile, and for similar reasons.
(Carnap 1952: 55)
It is up to scientists to make up their minds and to choose, among the
available inductive methods, the one that they feel is most appropriate for their purposes.
They can change them as they change their automobiles!
Assuming that a choice of an inductive method has been made and a
particular 𝔠-function has been defined, any statement of the sort “𝔠(ℎ,
𝑒) = 𝑝” for specified sentences ℎ, 𝑒, is analytic, if true (and
contradictory, if false), i.e., its truth or falsity
rests on definition and pure logic. This fact raises additional
problems regarding the justification of the applicability of the
inductive methods to practical issues: “The question is”, says Salmon
(1966:76), “How can statements that say nothing about any matters of
fact serve as ‘a guide of life’?” The observation that non-trivial
empirical content is introduced by the synthetic sentence 𝑒 expressing
evidence of past experience does not improve things very much. For one
may further require a justification for treating past evidence and
logico-mathematical facts about the degree of confirmation as a guide
to predictions and our future conduct. On what grounds do we
deem such a practice rational? Nevertheless, these last
questions seem to get us outside the limits of any framework since they
are reformulations of the external question about the choice of a
particular 𝔠-function, and can be answered neither from reason nor from
experience.
Where does all this leave Carnap’s project? The project of specifying
the inductive logic falls apart. There is no uniquely rational
way to determine the relations between evidence and hypotheses. Instead,
Carnap’s attitude seems to be captured by the following paraphrase of
Chairman Mao’s famous dictum: ‘Let a hundred inductive methods bloom’.
But even if we were to argue that we end up with a plurality of
inductive methods, they would still fall short of being inductive
logics. As we saw, the 𝔠-function depends on the parameter 𝜆.
But, as Howson and Urbach (1989: 55) have stated, the very idea of an
adjustable parameter 𝜆 “calls into question the fundamental role
assigned to his systems of inductive logic by Carnap. If their adequacy
is itself to be decided empirically, then the validity of whatever
criterion we use to assess that adequacy is in need of justification,
not something to be accepted uncritically”.
Subjective Probability and Bayesianism
Probabilities as Degrees of Belief
The subjective theory is a theory of inductive probability
proposed by the Cambridge Apostle F. P. Ramsey in his paper “Truth and
Probability”, written in 1926 and published in 1931, and,
independently, by the Italian mathematician Bruno de Finetti,
who proposed it somewhat later, in 1928, and published it in a series of
papers in 1930. In this conception, probability is the degree of
belief of an individual at a given time. The inductive nature of
the account is reflected in de Finetti’s (1972: 21) remark that:
[t]he subjectivists … maintain that a probability evaluation, being
but a measure of someone’s beliefs, is not susceptible of being proved
or disproved by the facts …
A major assumption of the theory is that beliefs, commonly conceived
as psychological states, are measurable; otherwise, as Ramsey
put it, “all our inquiry will be vain” (1926: 166). Thus, one needs to
specify a method of measuring belief to consider the sentence ‘the
degree of belief of X, at time t, is p’ meaningful. Ramsey
examined two such methods. The first one is based on the fact that the
degree of belief is perceptible by its owner, since one ascribes
different intensities of feelings of conviction to the different beliefs
that one holds. However, as Ramsey noted, we do not have strong feelings
for things we take for granted; such things are practically
accompanied by no feeling at all. Thus, this way of measuring degree of belief
seems inadequate. The second method rests on the supposition that the
degree of belief is a causal property and:
the difference [in the degree of belief] seems to me to lie in how
far we should act on these beliefs (ibid: 170).
To measure beliefs as bases for actions Ramsey (ibid: 172)
suggested:
to propose a bet and see what are the lowest odds which… [the agent]
will accept.
In a similar vein, de Finetti (1931) characterized probability as “the
psychological sensation of an individual” and also suggested using bets
to measure degrees of belief.
A bet on a hypothesis ℎ, with betting quotient
𝑝, at stake 𝑆, 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆), is defined by the
following conditions:
- if hypothesis ℎ is true, the gambler wins (1 − 𝑝)𝑆;
- if hypothesis ℎ is false, the gambler loses 𝑝𝑆,
where 𝑝 is any real number in the unit interval and 𝑆 any sum of
money.
We say that the odds in a bet on ℎ at stake 𝑆 are 𝑅: 𝑄
whenever the betting quotient
𝑝 = 𝑅/(𝑅 + 𝑄).
| 𝒉 | AGENT PAYS | AGENT RECEIVES | NET GAIN |
|---|---|---|---|
| True | 𝑝𝑆 | 𝑆 | (1 − 𝑝)𝑆 |
| False | 𝑝𝑆 | 0 | −𝑝𝑆 |
The actions that measure an agent’s degree of belief in a hypothesis
ℎ are the buying and selling of a bet on ℎ. In particular, the
degree of belief of an individual 𝑋 in a hypothesis ℎ is a
number 𝑝0 which, expressed in monetary values,
$𝑝0, is (i) the highest price at which 𝑋 is willing to buy a
bet that returns $1 if ℎ is true, and $0 if ℎ is false, and (ii) the
lowest price at which 𝑋 is willing to sell that same bet.
To better understand this definition, consider the set of all bets on
ℎ at stake $1. It can be characterized in terms of the betting quotients
as follows:
{𝑏𝑒𝑡(ℎ, 𝑝, $1): 𝑝 ∈ ℝ}
To buy any bet from this collection the bettor should pay $𝑝. But,
depending on ℎ, they are not willing to pay just any amount of money; on the
contrary, they seek to pay the least possible. The definition assumes
that the amount of money the agent is willing to pay to buy the bet is
bounded from above and its least upper bound is $𝑝0.
Similarly, the money an agent could earn from selling the bet is bounded
from below and the greatest lower bound is also $𝑝0. This
number 𝑝0 is the degree of belief of an agent in ℎ.
On this view, the conditional degree of belief of an individual 𝑋 in
a hypothesis ℎ
given some statement 𝑒, 𝑏𝑋(ℎ|𝑒) = 𝑝0, is defined in terms of the
following bet, bought for $𝑝0:
- if ℎ&𝑒 is true, the bettor wins $(1 − 𝑝0);
- if ~ℎ&𝑒 is true, the bettor loses $𝑝0;
- if 𝑒 is false, the bettor gets back the $𝑝0 paid.
The idea for this bet is that it is called off in case 𝑒 is false and
the agent gets a refund of $𝑝0 (Jeffrey 2004: 12).
The degree of belief 𝑝0 of an individual 𝑋 in a hypothesis
ℎ is confined within the unit interval. To see this, assume, first, that
𝑝0 < 0 and consider the agent selling a
bet to the bookie that pays $1 if ℎ is true, and $0 if ℎ is false,
for $𝑝0. Independently of the truth-value of ℎ, this bet is a
loss for the agent: the agent has a net gain of
$(−1 + 𝑝0) < 0 in case ℎ is true and $𝑝0
< 0 in case ℎ is false. In a similar vein, if
𝑝0 > 1, an agent buying a bet from the bookie
that pays $1 if ℎ is true, and $0 if ℎ is false, for $𝑝0, gains $(1 −
𝑝0) < 0 if ℎ is true, and −$𝑝0 < 0 if ℎ is
false, and the bet is, again, a loss for the agent. Hence, if an agent
assigns to any of their beliefs degrees that are either negative or
greater than 1, they are exposed to a betting situation with guaranteed
loss independently of the truth or the falsity of that belief. Such an
unwelcome bet or set of bets which “will with certainty result in a
loss” (de Finetti, 1974: 87) for the agent is called a Dutch
book. It is conjectured that the term can be traced back to the
introduction of the Lotto game in the Low Countries, at the beginning of
the 16th century where in the so-called “Dutch Lotto”, the organizer
had, in any event, a positive gain (de Finetti, 2008: 45). Hence, to
avoid a Dutch book, one should confine degrees of belief within the
interval [0,1].
A degree of belief function 𝑏𝑋 is an assignment
of degrees of belief to a person 𝑋’s beliefs, as represented by
propositions (or classes of logically equivalent sentences, in a
language-dependent context):
𝒮𝐿 ∋ ℎ ↦ 𝑏𝑋(ℎ) ∈ [0,1].
For an agent 𝑋 with an assignment of degrees of belief described by
the function 𝑏𝑋, we may define the expected winnings of
a 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆) for X, as a convex combination of the gains
and losses of the agent on this bet with coefficients determined by
their degree of belief in ℎ :
𝐸𝑊[ 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆), 𝑋] = 𝑏𝑋(ℎ)𝑉(ℎ) + (1 −
𝑏𝑋(ℎ))𝑉(~ℎ),
where 𝑉(ℎ) is the net payoff for the agent if ℎ is true and 𝑉(~ℎ)
the net payoff if ℎ is false. To understand this concept, think of 𝑉(ℎ)
and 𝑉(~ℎ) as the possible states in which an agent whose belief
function assigns 1 and 0 to ℎ, respectively, expects to be found if the
bet offered is accepted. Namely, an agent that is certain of the truth
of ℎ expects to gain 𝑉(ℎ), while an agent that is certain of the falsity of ℎ
expects to gain
𝑉(~ℎ) by accepting the bet. If the agent’s belief function assigns
any other number in the unit interval to ℎ, they will occupy an
intermediate state. Geometrically, 𝑉(ℎ) and 𝑉(~ℎ) may be thought of as the
extremities of a line segment and any other state as a point between these
extremities. Next, assume that the agent is placed on the midpoint of
the segment, equidistant from its extremities. Then the bet doesn’t give
any prevalence beforehand to the truth or the falsity of the hypothesis
for that particular agent and it is fair. If the agent’s belief function
places them closer to either of the extremities, 𝑉(ℎ) or 𝑉(~ℎ), then the
bet gives an unfair advantage for or against ℎ, for this agent. Thus, for
𝑏𝑋(ℎ) = 𝑝0, with 𝑉(ℎ) = (1 − 𝑝)𝑆 and 𝑉(~ℎ) = −𝑝𝑆, the expected winnings of a 𝑏𝑒𝑡(ℎ, 𝑝,
𝑆) for X are:
𝑝0(1 − 𝑝)𝑆 − (1 − 𝑝0)𝑝𝑆 = (𝑝0 − 𝑝)𝑆,
and this quantity measures how fair or unfair the bet is for that
particular agent. In this understanding, no commitment to a
probabilistic view of the belief function is required. It is sufficient
to treat belief quantitatively, to consider the degree of belief in a
hypothesis a number in the closed unit interval and to interpret the values 0
and 1 in terms of the belief in the falsity and truth of the hypothesis,
respectively.
Accordingly, we may now give the following definitions:
- We call 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆) a fair bet for 𝑋 if and only if 𝐸𝑊[ 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆), 𝑋] = 0.
- We call 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆) advantageous for 𝑋 if and only if 𝐸𝑊[ 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆), 𝑋] > 0.
- We call 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆) disadvantageous for 𝑋 if and only if 𝐸𝑊[ 𝑏𝑒𝑡(ℎ, 𝑝, 𝑆), 𝑋] < 0.
Notice that the Dutch book to which we would be vulnerable, were we to
consider degrees of belief outside the unit interval, is fair, since it
is defined in terms of buying and selling 𝑏𝑒𝑡(ℎ, 𝑝0, 𝑆), a fact that
makes its bite even worse.
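The notions of net payoff, expected winnings and fairness can be sketched in a few lines of Python. This is an illustration only; the function names `net_payoff`, `expected_winnings` and `classify` are hypothetical, and a negative stake is read, as in the text, as the agent selling rather than buying the bet.

```python
from fractions import Fraction

def net_payoff(p, S, h_true):
    """Net gain of the agent in bet(h, p, S): (1 - p)S if h is true, -pS if h is false."""
    return (1 - p) * S if h_true else -p * S

def expected_winnings(b_h, p, S):
    """EW = b(h) V(h) + (1 - b(h)) V(~h), which simplifies to (b(h) - p) S."""
    return b_h * net_payoff(p, S, True) + (1 - b_h) * net_payoff(p, S, False)

def classify(b_h, p, S):
    """Label the bet as fair, advantageous or disadvantageous for the agent."""
    ew = expected_winnings(b_h, p, S)
    if ew == 0:
        return "fair"
    return "advantageous" if ew > 0 else "disadvantageous"
```

For an agent with degree of belief 3/5 in ℎ, a bet at betting quotient 1/2 and stake 10 has expected winnings (3/5 − 1/2) · 10 = 1, so it counts as advantageous; at betting quotient 3/5 the same bet is fair whatever the stake.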
Dutch Books
Ramsey identified a connection between Dutch books and the laws of
mathematical probability. In “Truth and Probability” we read that (1926:
182):
If anyone’s mental condition violated these laws [of probability] …
[h]e could have a book made against him by a cunning bettor and would
then stand to lose in any event.
And conversely,
Having degrees of belief obeying the laws of probability implies a
further measure of consistency, namely such a consistency between the
odds acceptable on different propositions as shall prevent a book being
made against you (1926: 183).
Instead of Ramsey’s ‘consistency’, de Finetti (1974: 87) has spoken
of ‘coherence’ of degrees of belief. The degrees an agent assigns to
their beliefs are said to be coherent:
if among the combinations of bets which [y]ou have committed yourself
to accepting there are none for which the gains are all uniformly
negative.
Thus, if an agent is not vulnerable to a Dutch book with betting
quotients equal to their degrees of belief, the agent is said to have
coherent degrees of belief. In addition, an agent has coherent degrees
of belief if and only if their degrees of belief satisfy the axioms of
probability. This is the celebrated Ramsey – de Finetti or Dutch-Book
theorem:
Let 𝑏𝑋: 𝒮𝐿 ⟶ ℝ be a degree of
belief function of a person 𝑋. If 𝑏𝑋 does
not satisfy the axioms of probability, then there is a family of fair
bets 𝑏𝑒𝑡(ℎ𝑖, 𝑝𝑖, 𝑆𝑖), with
ℎ𝑖 ∈ 𝒮𝐿, 𝑝𝑖 = 𝑏𝑋(ℎ𝑖) and
𝑆𝑖 ∈ ℝ, for every 𝑖 = 1, … , 𝑛 (or ∞), which guarantees that the agent will suffer an overall
loss, independently of the truth-values of the hypotheses
ℎ𝑖.
The converse of that theorem has also been shown:
Let 𝑏𝑋: 𝒮𝐿 ⟶ ℝ be a degree of
belief function of a person 𝑋. If 𝑏𝑋
satisfies the axioms of probability, then there is no family of fair
bets 𝑏𝑒𝑡(ℎ𝑖, 𝑝𝑖, 𝑆𝑖),
with ℎ𝑖 ∈ 𝒮𝐿 , 𝑝𝑖 = 𝑏𝑋(ℎ𝑖)
and 𝑆𝑖 ∈ ℝ, for every 𝑖 = 1, … , 𝑛, which guarantees that
the agent will suffer an overall loss, independently of the
truth-values of the hypotheses ℎ𝑖.
We have already discussed the application of the Ramsey-de Finetti
theorem in the case of violation of the axiomatically imposed constraint
that probability values lie within the unit interval. The next example
illustrates how an agent will experience an overall loss if they hold
degrees of belief that do not comply with the finite additivity
axiom.
Consider the tossing of a die and assume that the degrees of belief
assigned by a person 𝑋 to the beliefs that they will obtain ‘6’ in a
single toss, ‘3’ in a single toss, and either ‘6’ or ‘3’ are 𝑞, 𝑟 and
𝑘, respectively. Moreover, let 𝑘 < 𝑟 + 𝑞, i.e., the finite additivity axiom is
violated. Then we may consider the following family of fair bets,
suggested to the agent:
𝑏𝑒𝑡(′6′, 𝑞, 1), 𝑏𝑒𝑡(′3′, 𝑟, 1),
𝑏𝑒𝑡(′6′𝑜𝑟′3′, 𝑘, −1).
The agent buys from the bookie 𝑏𝑒𝑡(′6′, 𝑞, 1) that pays $1,
if “′6′ is obtained” is true, and $0, if false, for $𝑞. Next, the agent
buys the second bet, 𝑏𝑒𝑡(′3′, 𝑟, 1), that pays $1, if “′3′ is
obtained” is true, and $0, if false, for $𝑟. Finally, in the third bet,
the agent sells to the bookie
𝑏𝑒𝑡(′6′𝑜𝑟′3′, 𝑘, −1) that pays $1, if
“′6′ or ′3′ is obtained” is true, and $0 if false, for $𝑘. The
following table shows the net gain for the agent in this
betting sequence:
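| “′𝟔′” | “′𝟑′” | “′𝟔′ OR ′𝟑′” | NET GAIN FOR THE AGENT |
|---|---|---|---|
| True | False | True | (1 − 𝑞) ⋅ 1 + (−𝑟) ⋅ 1 + (1 − 𝑘)(−1) = 𝑘 − (𝑟 + 𝑞) |
| False | True | True | (−𝑞) ⋅ 1 + (1 − 𝑟) ⋅ 1 + (1 − 𝑘)(−1) = 𝑘 − (𝑟 + 𝑞) |
| False | False | False | (−𝑞) ⋅ 1 + (−𝑟) ⋅ 1 + (−𝑘)(−1) = 𝑘 − (𝑟 + 𝑞) |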
As we can see, since 𝑘 < 𝑟 + 𝑞, this sequence of bets results in an overall loss for
the agent whatever the outcome of the toss. Thus, as the Ramsey-de Finetti theorem demands, an agent
whose degree of belief function violates the axiom of finite additivity
is exposed to a Dutch book.
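The die example can be verified outcome by outcome. The following sketch is an illustration under stated assumptions: the helper names `net_payoff` and `book_outcomes` are hypothetical, and the numerical degrees of belief in the usage lines are chosen only for the sake of the example.

```python
from fractions import Fraction

def net_payoff(p, S, h_true):
    """Net gain of the agent in bet(h, p, S): (1 - p)S if h is true, -pS if h is false."""
    return (1 - p) * S if h_true else -p * S

def book_outcomes(q, r, k):
    """Agent's total net gain for each face of the die, after buying the bets
    on '6' and on '3' (stake 1) and selling the bet on '6 or 3' (stake -1)."""
    totals = {}
    for face in range(1, 7):
        six, three = face == 6, face == 3
        totals[face] = (net_payoff(q, 1, six)
                        + net_payoff(r, 1, three)
                        + net_payoff(k, -1, six or three))
    return totals

# Illustrative degrees of belief with k < q + r (additivity violated).
incoherent = book_outcomes(Fraction(1, 6), Fraction(1, 6), Fraction(1, 4))
# Additive degrees of belief, k = q + r.
coherent = book_outcomes(Fraction(1, 6), Fraction(1, 6), Fraction(1, 3))
```

With the incoherent assignment every face of the die yields the same net gain 𝑘 − (𝑟 + 𝑞) = −1/12, a sure loss; with the additive assignment the same three bets net exactly 0 on every face.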
One could obtain a similar result for the violation of the countable
additivity axiom. In this case one needs to employ a countably infinite
family of bets. However, a criticism that follows such an assumption is
that it is unrealistic for any agent to be engaged in infinitely many
bets (Jeffrey, 2004: 8).
There have been attempts to extend the requirement of coherence from
the synchronic case, as expressed by the compliance of the
degrees of belief with the axioms of probability theory, to
diachronic coherence by stipulating rules for belief updating.
Learning from experience requires that the agent should change their
assignment of degree of belief (probability) on a given hypothesis in
response to the result of experiment or observation. The simplest, and
most common, rule for updating is the following:
In the light of new evidence, the agent should update their degrees
of belief by conditionalizing on this evidence.
Thus, assume that the belief function of a person 𝑋 before new
evidence 𝑒 is acquired is 𝑏𝑋𝑜𝑙𝑑 and 𝑏𝑋𝑛𝑒𝑤 is the
belief function after the acquisition of new evidence. The transition
from the old degree of belief to the new one is governed by the
rule:
𝑏𝑋𝑛𝑒𝑤(ℎ) = 𝑏𝑋𝑜𝑙𝑑(ℎ|𝑒)
where 𝑒 is the total evidence, and 𝑏𝑋𝑜𝑙𝑑(ℎ|𝑒) is the
posterior probability as determined by Bayes’s Theorem if we identify
the degree of belief function with the probability function.
This form of conditionalization is called strict
conditionalization and it takes the probability of the learned
evidence to be unity, i.e., 𝑏𝑋𝑛𝑒𝑤(𝑒) = 1. Jeffrey argued
that certainty is a very restrictive condition that does not conform
with the uncertainties of real empirical research in science and
everyday life. To show this, Jeffrey suggested the example of observing
the color of a piece of cloth by candlelight. The agent gets the
impression that the observed color is green, but they concede that it
may be blue or, less probably, violet. The experience causes us to change
our degrees of belief in propositions about the color of the object
but does not cause us to change them to 1. Hence, strict
conditionalization is inapplicable for updating our degrees of belief.
Jeffrey suggested another form of conditionalization that tackles the
problem, known as Jeffrey-conditionalization (or,
probability kinematics, as Jeffrey called it), which considers
evidence as providing probabilities to a partition of our set of
beliefs. In this case, the new degree of belief function is calculated
in terms of the old one,
$$b_X^{new}(h) = \sum_{i=1}^{n} b_X^{old}(h|e_i)\,p_i,$$
where {𝑒𝑖}, 𝑖 = 1, … , 𝑛, is a partition of our set of beliefs
consisting of mutually exclusive and jointly exhaustive propositions and 𝑝𝑖
= 𝑏𝑋𝑛𝑒𝑤(𝑒𝑖), 𝑖 = 1, … , 𝑛, are the probabilities
assigned to propositions 𝑒𝑖 by new evidence. As before,
𝑏𝑋𝑜𝑙𝑑(ℎ|𝑒𝑖) is calculated as the posterior
probability in Bayes’s Theorem.
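The two updating rules can be compared on a toy version of the cloth example. The sketch below rests on assumptions that are not in the text: belief functions are modelled as probability assignments over a finite set of possible worlds, propositions as sets of worlds, each world is a pair (color, truth value of some hypothesis ℎ), and all numbers are invented for illustration. The function names are hypothetical.

```python
from fractions import Fraction as F

def prob(b, prop):
    """Degree of belief in a proposition, modelled as a set of worlds."""
    return sum(b[w] for w in prop)

def conditional(b, h, e):
    """b(h | e), assuming b(e) > 0."""
    return prob(b, h & e) / prob(b, e)

def strict_update(b_old, e):
    """Strict conditionalization: the new belief in each world is b_old(world | e)."""
    p_e = prob(b_old, e)
    return {w: (b_old[w] / p_e if w in e else F(0)) for w in b_old}

def jeffrey_update(b_old, partition):
    """Jeffrey conditionalization on a partition given as (e_i, p_i) pairs,
    assuming b_old(e_i) > 0 for each cell and that the p_i sum to 1."""
    b_new = {}
    for e_i, p_i in partition:
        p_old = prob(b_old, e_i)
        for w in e_i:
            b_new[w] = p_i * b_old[w] / p_old
    return b_new

# Invented prior over worlds (color, h is true?).
b_old = {("green", True): F(3, 20), ("green", False): F(3, 20),
         ("blue", True): F(1, 10), ("blue", False): F(2, 10),
         ("violet", True): F(1, 10), ("violet", False): F(3, 10)}
color = {c: {w for w in b_old if w[0] == c} for c in ("green", "blue", "violet")}
h = {w for w in b_old if w[1]}

# The candlelight observation shifts the color probabilities without certainty.
partition = [(color["green"], F(7, 10)), (color["blue"], F(1, 4)),
             (color["violet"], F(1, 20))]
b_new = jeffrey_update(b_old, partition)
```

In this toy model `prob(b_new, h)` equals the weighted sum of the old conditional degrees of belief, as in the formula above, and strict conditionalization on "green" is recovered as the special case in which the partition assigns probability 1 to that cell.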
One difficulty with Jeffrey’s conditionalization is that while strict
conditionalization provides an assurance of convergence to truth,
Jeffrey’s conditionalization generally doesn’t. There is a family of
theorems, known as convergence theorems, with the most well-known being
that of Gaifman and Snir (1982), which claim that, under reasonable
assumptions, the probability of a hypothesis conditional on available
evidence converges to 1 in the limit of empirical research, if the
hypothesis is true. These theorems provide a vindication of Bayesianism
showing that it is guaranteed to find the truth eventually by applying
successively strict conditionalization.
Conditionalizing on the evidence is purely logical updating
of degrees of belief. It is not ampliative. It does not introduce new
content, nor does it modify the old one. It just assigns a new degree of
belief to an old opinion. The justification for the requirement of
conditionalization is supposed to be a diachronic version of the Dutch-book
theorem. It is supposed to be a canon of rationality (certainly a
necessary condition for it) that agents should update their degrees of
belief by conditionalizing on evidence. The penalty for not doing this
is liability to a Dutch-book strategy: the agent can be offered
a set of bets over time such that a) each of them taken
individually will seem fair to them at the time it is offered; but b)
taken collectively, they lead them to suffer a net loss, come what
may.
Bayesian Induction
In this context, induction rests on the degree of belief one assigns
to a hypothesis given a body of confirmatory evidence and on the process
of updating in the light of new evidence. Hence, the problem of
justification of induction gives way to the problem of justifying
conditionalization on the evidence. In general, Bayesian theories of
confirmation maintain the following theses:
- Belief is always a matter of degree; degrees of belief are
probability values and degree of belief functions are probability
functions.
- Confirmation is a relation of positive relevance, viz., a piece
of evidence confirms a hypothesis if it increases its
probability:
𝑒 confirms ℎ iff 𝑝(ℎ|𝑒) > 𝑝(ℎ), where 𝑝 is a probability
function. Similarly, we may define disconfirmation of a hypothesis by
a piece of evidence in terms of negative relevance (𝑝(ℎ|𝑒) < 𝑝(ℎ)),
as well as neutrality of a hypothesis with respect to a piece
of evidence in terms of irrelevance (𝑝(ℎ|𝑒) = 𝑝(ℎ)).
- The relation of confirmation is captured by Bayes’s theorem, which
dictates the change of the degree of belief in a given hypothesis in the
light of a piece of evidence:
$$p(h|e) = \frac{p(e|h)\,p(h)}{p(e)}, \quad \text{where } p(h), p(e) > 0.$$
- The only factors relevant to confirmation of a hypothesis are its
prior probability
𝑝(ℎ), the likelihood of the evidence given the hypothesis 𝑝(𝑒|ℎ), and
the probability of the evidence 𝑝(𝑒).
- The specification of the prior probability of (aka prior
degree of belief in) a hypothesis is a purely subjective
matter.
- The only (logical-rational) constraint on an assignment of prior
probabilities to several hypotheses should be that they obey the axioms
of the probability calculus.
- The reasonableness of a belief does not depend on its content;
nor, ultimately, on whether the belief is made reasonable by the
evidence.
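The positive-relevance criterion can be sketched as a short computation. The example is illustrative only: the function names are hypothetical, and 𝑝(𝑒) is obtained from the law of total probability, 𝑝(𝑒) = 𝑝(𝑒|ℎ)𝑝(ℎ) + 𝑝(𝑒|~ℎ)𝑝(~ℎ), which requires the additional input 𝑝(𝑒|~ℎ).

```python
from fractions import Fraction

def posterior(prior_h, like_e_given_h, like_e_given_not_h):
    """p(h | e) by Bayes's theorem, with p(e) computed by total probability."""
    p_e = like_e_given_h * prior_h + like_e_given_not_h * (1 - prior_h)
    if p_e == 0:
        raise ValueError("p(e) must be positive")
    return like_e_given_h * prior_h / p_e

def relevance(prior_h, like_e_given_h, like_e_given_not_h):
    """Classify e as confirming, disconfirming or neutral with respect to h."""
    post = posterior(prior_h, like_e_given_h, like_e_given_not_h)
    if post > prior_h:
        return "confirms"
    if post < prior_h:
        return "disconfirms"
    return "neutral"
```

For instance, with prior 1/5 and likelihoods 𝑝(𝑒|ℎ) = 9/10 and 𝑝(𝑒|~ℎ) = 3/10, the posterior is 3/7, so 𝑒 confirms ℎ; if the two likelihoods are equal, the posterior equals the prior and 𝑒 is neutral.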
Too Subjective?
In 1954, Savage discussed a criticism of subjective Bayesianism based
on the idea that science or scientific method aims at finding out “what
is probably true, by criteria on which all reasonable men agree.”
(1954:67). By applying intersubjectively accepted criteria, scientific
method is supposed to lead to an agreement between any two rational
agents on the probability for the truth of a hypothesis given the same
body of evidence. According to Savage this demand for intersubjectivity
has its source either in considering probabilistic entailment as a
generalization of logical entailment, or in considering probability an
objective property of certain physical systems. Yet, the criticism goes,
complete freedom in the choice of prior probabilities for a hypothesis
by two agents may yield different posterior probabilities for that
hypothesis given the same body of evidence. This fact compromises the
desideratum of intersubjectivity of criteria since it makes room for the
intrusion of idiosyncratic elements, non-cognitive values, or any other
source of subjective preferences, reflected in the disagreement of the
agents in the choice of priors, and, ultimately, in the value of
posterior probability of a hypothesis. Hence, what is “probably true” is
not evaluated by “criteria on which
all people agree”. In a nutshell, it is claimed that purely
subjective prior probabilities fail to capture the all-important notion
of rational or reasonable degrees of belief and that subjective
Bayesianism is too subjective to offer an adequate theory of
confirmation.
In defense of subjective probability, Savage claims that
this view
incorporates all the universally acceptable criteria for
reasonableness in judgement… [these criteria] do not guarantee agreement
on all questions among all honest and freely communicating people, even
in principle (ibid),
considering disagreements a non-distressful situation. Moreover,
anticipating what later became known as
convergence-to-certainty or merger-of-opinions
theorems, he showed that:
…in certain contexts any two opinions, provided that neither is
extreme in a technical sense, are almost sure to be brought very close
to one another by a sufficiently large body of evidence. (1954: 68; see
also 46f)
Yet, as Hesse (1975; see Earman 1992: 143) objected, Savage’s argument
makes assumptions that are valid for the case of flipping a coin but are
not typically valid in scientific inference. Gaifman and Snir (1982)
have proved important results which overcome the limitations of Savage’s
account. They have shown (Thm. 2.1) that for an infinite sequence of
empirical questions, 𝜑1,…, 𝜑𝑛, …, formulated in a
given language that satisfies certain conditions:
- Convergence-to-certainty: The limiting probability of a
true sentence 𝜓 in that language, given all the empirical evidence
collected in our world 𝑤 in response to the empirical questions stated,
$\varphi_1^w, \ldots, \varphi_n^w, \ldots$, equals 1:
$$\lim_{n\to\infty} \Pr\Big(\psi \,\Big|\, \mathop{\&}_{i\le n} \varphi_i^w\Big) = 1.$$
For a false sentence, the respective probability is 0:
$$\lim_{n\to\infty} \Pr\Big(\psi \,\Big|\, \mathop{\&}_{i\le n} \varphi_i^w\Big) = 0.$$
- Merger-of-opinions: The distance between any two
probability functions that agree to assign probability 0 to the same
sentences, i.e., they are equally dogmatic, converges to 0 in the limit
of empirical research, i.e.,
$$\lim_{n\to\infty} \sup_{\psi} \Big|\Pr\nolimits_1\Big(\psi \,\Big|\, \mathop{\&}_{i\le n} \varphi_i^w\Big) - \Pr\nolimits_2\Big(\psi \,\Big|\, \mathop{\&}_{i\le n} \varphi_i^w\Big)\Big| = 0.$$
The merger-of-opinions theorem is supposed to mitigate the excessive
subjectivity of Bayesianism in the choice of prior probabilities: the
actual values assigned to prior probabilities do not matter much since
they ‘wash out’ in the long run.
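The 'washing out' of priors can be illustrated with a minimal sketch. It is not the Gaifman and Snir setting: it assumes a simple coin-flipping model with conjugate Beta priors, and the two agents, their prior parameters, and the fixed share of heads in the data are all hypothetical choices made for illustration. The quantity tracked is only the gap between the agents' predictions for the next flip, not the supremum over all sentences.

```python
from fractions import Fraction

def predictive_probability(alpha, beta, heads, tails):
    """Probability of heads on the next flip under a Beta(alpha, beta) prior,
    after observing the given numbers of heads and tails."""
    return Fraction(alpha + heads, alpha + beta + heads + tails)

# Two hypothetical agents with opposed but non-dogmatic priors.
OPTIMIST = (9, 1)    # prior probability of heads: 9/10
PESSIMIST = (1, 9)   # prior probability of heads: 1/10

def disagreement(n_flips, heads_per_ten=7):
    """Gap between the agents' predictions after n_flips shared observations,
    of which an assumed share of 7 in 10 are heads."""
    heads = n_flips * heads_per_ten // 10
    tails = n_flips - heads
    return abs(predictive_probability(*OPTIMIST, heads, tails)
               - predictive_probability(*PESSIMIST, heads, tails))

sample_sizes = (0, 10, 100, 1000, 10000)
gaps = [disagreement(n) for n in sample_sizes]
for n, gap in zip(sample_sizes, gaps):
    print(n, float(gap))
```

In this toy model the gap equals 8/(10 + n), so it shrinks with the amount of shared evidence but is still substantial for small samples, which is the point of the first objection discussed next.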
Unfortunately, several criticisms of the theorem showed that the
objection of subjectivism is not fully addressed. Let us briefly review
some of these criticisms: The first objection is related to the
asymptotic character of convergence and merging and the fact that the
speed of convergence is unknown. The results do not apply to the
divergences of opinion induced by small and medium-sized sets of
evidence that have practical importance. The second objection is related
to the language-dependent nature of the theorems restricting them to
cases in which the predicates of the language are fixed. The theorems
cannot guarantee washing out the priors assigned by agents in different
linguistic contexts, as before and after a scientific revolution.
An important criticism stems from the fact that convergence in the
theorems is obtained almost everywhere, i.e., for all worlds 𝑤,
the actual world included, which belong to some set of possible worlds
with probability 1. In the authors’ own words:
… with probability 1, two persons holding mutually nondogmatic
initial views will, in the long run, judge similarly… Also the
convergence is guaranteed with probability 1, where “probability” refers
to the presupposed prior. (I) and (II) [referring to the two parts of the theorem] form an “inner
justification” but they do not constitute a justification of the
particular prior.
So, the theorem guarantees convergence to truth and merging of
opinions in every world except for some pathological cases that form
small sets of worlds of measure zero. But who decides what those sets of
worlds of measure zero would be? The Bayesian agent themselves, who,
through the choice of priors, is compelled to assign probability zero to
‘unpleasant’ scenarios. On these grounds, Earman claims that the
“impressiveness of these results disappears in the light of their
narcissistic character… ‘almost surely’ sometimes serves as a rug under
which some unpleasant facts are swept” (1992:147).
Extending on this criticism, Belot (2013; 2017) has argued that in
problems of convergence to truth, there are typical cases –
their typicality being defined in a topological sense without
measure-theoretic presuppositions – in which convergence to truth is
unsuccessful, a fact that a Bayesian agent is bound to ignore by
assigning prior probability zero to such cases. Thus, Belot concludes,
convergence and merger theorems “constitute a real liability for
Bayesianism by forbidding a reasonable epistemological modesty”
(2013). Belot’s arguments have prompted a variety of responses: some
philosophers were critical of Belot’s topological considerations as
being irrelevant to probability theory (Cisewski et al. 2018;
Huttegger 2015). Others focused on imprecise probabilities and finitely
additive probabilities to escape the charge of immodesty (Weatherson
2015; Elga 2016; Nielsen and Stewart 2019). Huttegger (2021) has shown
using non-standard analysis that “convergence to the truth fails with
(non-infinitesimal) positive probability for certain hypotheses … [a
fact] that creates a space for modesty within Bayesian epistemology.” As
regards the countable additivity of the probability function, the
convergence-to-certainty and merger-of-opinions theorems rely
essentially on this axiom. Prominent subjective Bayesians, on the other
hand, such as de Finetti and Savage, explicitly reject the countable
additivity axiom despite its theoretical fecundity. Yet Savage, as
mentioned above, has explored the possibility of theorems that, despite
their shortcomings, attempt to mitigate the extreme subjectivism of
Bayesianism. Recently, Nielsen (2021) has shown that there are
uncountably many merely finitely additive probabilities that converge to
the truth almost surely and in probability. As a general comment, we
would say that the area of convergence and merger theorems seems to have
many open problems to capture the interest of researchers.
Some Success Stories
Bayesian theory has a record of successful justifications of some
important common intuitions about confirmation – such as the belief that
a theory is confirmed by its observational consequences or the belief
that a theory is better confirmed if subject to strict tests – and it
has provided a solution to the famous ‘raven paradox’.
It is straightforward to show that hypotheses are confirmed by their
consequences. Assume that ℎ ⊢ 𝑒; then the likelihood of 𝑒 given ℎ is
𝑝(𝑒|ℎ) = 1 and, according to Bayes’s theorem,
$$p(h|e) = \frac{p(e|h)\,p(h)}{p(e)} = \frac{p(h)}{p(e)} > p(h),$$
given that 𝑒 is not trivially true (𝑝(𝑒) < 1) and that 𝑝(ℎ) > 0;
hence, 𝑒 confirms ℎ. This result justifies the
inference of the truth of a hypothesis on the basis of its observational
consequences, as the hypothetico-deductive method of confirmation
suggests. Although the inference commits the
formal fallacy of affirming the consequent, if considered
inductively, through the lens of Bayes’s theorem, it is fully
justified and the confirmatory nature of the hypothetico-deductive
method is explained. This is what Earman recognized as an important
“success story” of the Bayesian approach (1992: 233).
Another common methodological intuition that may be justified on
Bayesian grounds is related to the scientific practice of subjecting a
hypothesis to severe tests on the basis of improbable consequences. As
Deborah Mayo (2018: 14), following Popper, suggested in her Strong
Severity Principle:
We have evidence for a claim C just to the extent it survives a
stringent scrutiny. If C passes a test that was highly capable of
finding flaws or discrepancies from C, and yet none or few are found,
the passing result, x, is evidence for C.
Now, as before, consider a logical consequence 𝑒 of a hypothesis ℎ,
i.e., ℎ ⊢ 𝑒. A severe test of ℎ would be one in which 𝑝(~𝑒) is high
and, consequently, 𝑝(𝑒) is low. In this case 𝑒 would be evidence for ℎ.
Hence, a necessary condition for collecting evidence for a hypothesis,
according to the aforementioned principle, would be to test its
improbable consequences. Indeed, following Bayes’s theorem:
$$p(h|e) = \frac{p(e|h)\,p(h)}{p(e)} = \frac{p(h)}{p(e)}.$$
Thus, the more improbable the consequence 𝑒 is, the greater the
degree of confirmation, as measured by the ratio
$p(h|e)/p(h) = 1/p(e)$, is.
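A small numerical sketch may make both results concrete. It assumes that ℎ entails 𝑒, so that 𝑝(𝑒|ℎ) = 1 and 𝑝(ℎ) ≤ 𝑝(𝑒); the function name and the probability values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def posterior_given_consequence(prior_h, prob_e):
    """p(h|e) = p(h)/p(e) when h entails e, i.e. when p(e|h) = 1."""
    # If h entails e, coherence requires p(h) <= p(e).
    if not (0 < prior_h <= prob_e <= 1):
        raise ValueError("need 0 < p(h) <= p(e) <= 1 when h entails e")
    return prior_h / prob_e

prior = Fraction(1, 10)  # hypothetical prior p(h)
# Increasingly improbable consequences, i.e. increasingly severe tests.
results = {}
for prob_e in (Fraction(9, 10), Fraction(1, 2), Fraction(1, 5)):
    post = posterior_given_consequence(prior, prob_e)
    results[prob_e] = (post, post / prior)  # posterior and confirmation ratio
    print(f"p(e)={prob_e}  p(h|e)={post}  p(h|e)/p(h)={post / prior}")
```

In every case the posterior exceeds the prior, and the confirmation ratio 1/𝑝(𝑒) grows as the tested consequence becomes less probable.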
Another piece in the collection of trophies of the Bayesian account
is the resolution of the ravens paradox. This is a paradox of
confirmation, first noted by Carl Hempel, which took its name from the
example that Hempel used to illustrate it viz., all ravens are
black. The paradox emerges from the impossibility of having jointly
satisfied three intuitively compelling principles of confirmation. The
first is Nicod’s principle [named after the French philosopher Jean
Nicod]: a universal generalization is confirmed by its positive
instances. So, that all ravens are black is confirmed by the observation
of black ravens. Second, the principle of logical equivalence:
if a piece of evidence confirms a hypothesis, it also confirms its
logically equivalent hypotheses.
Third, the Principle of relevant empirical investigation:
hypotheses are confirmed by investigating empirically what they
assert.
To set up the paradox, take the hypothesis ℎ: All ravens are black.
The hypothesis ℎ′: All non-black things are
non-ravens is logically equivalent to ℎ. A positive instance of
ℎ′ is a white piece of chalk. Hence, by Nicod’s condition,
the observation
of the white piece of chalk confirms ℎ′. By the
principle of equivalence, it also confirms ℎ, that is that all
ravens are black. But then the principle of relevant empirical
investigation is violated. For, the hypothesis that all ravens are black
is confirmed not by examining the colour of ravens (or of any other
birds) but by examining seemingly irrelevant objects (like pieces of
chalk or red roses). So at least one of these three principles should be
abandoned, if the paradox is to be avoided.
To resolve the ravens paradox, a Bayesian may show that there is no
problem with accepting all three principles of confirmation since the
degree of confirmation conferred on the hypothesis ℎ by an instance of a
non-raven-non-black object is negligible in comparison with how much the
hypothesis is confirmed by an instance of a black raven. [According to
Howson and Urbach (2006: 100), a Bayesian analysis could also challenge
the adequacy of Nicod’s criterion as a universal principle of
confirmation.]
To see this, consider the hypotheses ℎ: ∀𝑥(𝑅𝑥 → 𝐵𝑥) and ℎ′: ∀𝑥(~𝐵𝑥 → ~𝑅𝑥)
and the evidence 𝑒: 𝑅𝑎&𝐵𝑎 and 𝑒′: ~𝐵𝑎&~𝑅𝑎, which are
positive instances of ℎ and ℎ′ respectively. We calculate the ratio
𝑝(ℎ|𝑒)/𝑝(ℎ|𝑒′), which, according to Bayes’s theorem and the easily
verifiable equality of the likelihoods of 𝑒 and 𝑒′ given ℎ, 𝑝(𝑒|ℎ) =
𝑝(𝑒′|ℎ), is
$$\frac{p(h|e)}{p(h|e')} = \frac{p(e')}{p(e)}.$$
But 𝑝(𝑒′) ≫ 𝑝(𝑒), because there are very many more things
which are non-black and non-ravens than black ravens. Hence, 𝑝(ℎ|𝑒) ≫
𝑝(ℎ|𝑒′),
i.e., 𝑒 confirms ℎ a lot more than 𝑒′ confirms ℎ′.
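The ratio argument can be checked with a short sketch. All numbers below are purely hypothetical; they are chosen so that the two likelihoods are equal, as the argument assumes, and so that both pieces of evidence raise the probability of ℎ.

```python
from fractions import Fraction

def posterior(prior_h, likelihood, prob_evidence):
    """Bayes's theorem: p(h|e) = p(e|h) p(h) / p(e)."""
    return likelihood * prior_h / prob_evidence

prior_h = Fraction(1, 10)              # hypothetical p(h)
likelihood = Fraction(1, 5)            # assumed p(e|h) = p(e'|h)
p_black_raven = Fraction(1, 40)        # hypothetical p(e), small
p_nonblack_nonraven = Fraction(4, 25)  # hypothetical p(e'), much larger

post_e = posterior(prior_h, likelihood, p_black_raven)
post_e_prime = posterior(prior_h, likelihood, p_nonblack_nonraven)

print(post_e, post_e_prime)                  # both exceed the prior 1/10
print(post_e / post_e_prime)                 # ratio of posteriors
print(p_nonblack_nonraven / p_black_raven)   # equals p(e')/p(e)
```

With equal likelihoods the ratio of the posteriors reduces to 𝑝(𝑒′)/𝑝(𝑒), so the black raven confirms the hypothesis far more than the non-black non-raven does, although both confirm it.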
We are closing this presentation of subjective probability and
Bayesian confirmation theory by referring to what has become known as
the old evidence problem. The problem was first
identified by Glymour (1980) and it underlines a
potential conflict between Bayesianism and scientific practice. Suppose
that a piece of evidence 𝑒 is already known (i.e., it is an old piece of
evidence relative to the hypothesis ℎ under test). Its probability,
then, is equal to unity, 𝑝(𝑒) = 1. Given Bayes’s theorem, it turns out
that this piece of evidence does not affect at all the posterior
probability, 𝑝(ℎ|𝑒), of the hypothesis given the evidence; the posterior
probability is equal to the prior probability, i.e., 𝑝(ℎ|𝑒) = 𝑝(ℎ).
This, it is argued, is clearly wrong since scientists typically use
known evidence to support their theories. This fact is demonstrated by
the use of the anomalous precession of Mercury’s perihelion, discovered
in the nineteenth century, as confirming evidence for Einstein’s General
Theory of Relativity. Therefore, the critics conclude, there must be
something wrong with Bayesian confirmation. Some Bayesians have replied
by adopting a counterfactual account of the relation between theory and
old evidence (Howson and Urbach 2006: 299). Suppose, they argue, that 𝐾
is the relevant background knowledge and 𝑒 is an old (known) piece of
evidence—that is, 𝑒 is actually part of 𝐾. In considering what kind of
support 𝑒 confers on a hypothesis ℎ, we subtract counterfactually the
known evidence 𝑒 from the background knowledge 𝐾. We therefore presume
that 𝑒 is not known and ask: what would the probability of 𝑒 given
𝐾\{𝑒} be? This will be less than one; hence, the evidence 𝑒 can affect
(that is, raise or lower) the posterior probability of the
hypothesis.
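The contrast between the two treatments of old evidence can be sketched numerically. The prior, the likelihood, and the counterfactual probability of the evidence used below are hypothetical values, not figures from the literature.

```python
from fractions import Fraction

def posterior(prior_h, likelihood, prob_e):
    """Bayes's theorem: p(h|e) = p(e|h) p(h) / p(e)."""
    if prob_e == 0:
        raise ZeroDivisionError("p(e) must be non-zero")
    return likelihood * prior_h / prob_e

prior_h = Fraction(1, 4)  # hypothetical prior p(h)

# Old evidence: p(e) = 1, which forces p(e|h) = 1 whenever p(h) > 0.
old = posterior(prior_h, Fraction(1), Fraction(1))

# Counterfactual reading: e is removed from K, so p(e | K\{e}) < 1.
# The values 9/10 and 3/10 are assumptions made for illustration.
counterfactual = posterior(prior_h, Fraction(9, 10), Fraction(3, 10))

print(old, counterfactual)
```

On the first reading the posterior equals the prior, so the evidence is inert; on the counterfactual reading the same evidence raises the probability of the hypothesis.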
Appendices
- Lindenbaum algebra and probability in sentential
logic.
In this appendix we show how one can assign probabilities, originally
defined in a set-theoretic framework, to sentences in the language of
sentential logic, 𝐿. We formulate Kolmogorov’s axioms of probability for
sentences and some important theorems.
In particular, consider the set of all well-formed formulas (wffs) of
𝐿 and define for every wff 𝜙 the equivalence class:
[𝜙] = {𝜓: ⊢𝐿 𝜙 ≡ 𝜓}.
In the set of all equivalence classes 𝒮𝐿, we define
set-theoretic operations that correspond to the sentential connectives
of the language. Thus, for every two wffs
𝜙, 𝜓:
[𝜙] ∪ [𝜓] = [𝜙 ∨ 𝜓]
[𝜙] ∩ [𝜓] = [𝜙 ∧ 𝜓]
[𝜙]^𝑐 = [~𝜙]
[⊥] = ∅
[𝑡] = {wffs of 𝐿}
where “⊥” designates a contradiction and “𝑡” a tautology. Constructed in
this way, the set of all equivalence classes, 𝒮𝐿, is a
field (and a Boolean algebra) (see section 1a), and it is called the
Lindenbaum algebra (Hailperin 1986: 30ff.). However, since in
the language of sentential logic, infinitary operations, like 𝜙1 ∨ … ∨
𝜙𝑛 ∨ … , cannot be applied to wffs 𝜙𝑖 to produce other wffs, we cannot
define in 𝒮𝐿 the countably infinite union of classes of wffs.
As a consequence, 𝒮𝐿 is not a σ-field and the probability
function that we are about to define does not satisfy countable
additivity. So, this is an account of elementary probability
theory. To discuss the full axiomatic apparatus of probability theory
one needs to work in richer languages, which for present purposes is not
deemed necessary.
So, we can define a probability function 𝑝 that satisfies
Kolmogorov’s axioms (i)-(iii) on 𝒮𝐿 and assign to each singular sentence
of the language 𝐿 the probability value of its equivalence class. Thus,
for any sentences 𝑎, 𝑏 and a tautology 𝑡 of 𝐿:
- (i) 𝑝(𝑎) ≥ 0;
- (ii) 𝑝(𝑡) = 1;
- (iii) 𝑝(𝑎 ∨ 𝑏) = 𝑝(𝑎) + 𝑝(𝑏), where 𝑎 ⊢𝐿 ~𝑏.
As for the conditional probability of a sentence 𝑎 given the truth
of a sentence 𝑏, we have:
$$p(a|b) = \frac{p(a \wedge b)}{p(b)}, \quad p(b) \neq 0.$$
It is obvious from the discussion above that logically equivalent
sentences have equal probability values:
if ⊢𝐿 𝑎 ≡ 𝑏, then 𝑝(𝑎) = 𝑝(𝑏).
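The construction can be sketched for a language with two atoms. The sketch relies on an assumption not made explicit above: by soundness and completeness of sentential logic, an equivalence class of sentences can be represented by the set of truth-value assignments (valuations) that satisfy its members. The atom names, the weights on valuations, and the helper functions are hypothetical.

```python
from fractions import Fraction
from itertools import product

ATOMS = ("p", "q")
# Every truth-value assignment to the atoms, in the order TT, TF, FT, FF.
VALUATIONS = [dict(zip(ATOMS, bits))
              for bits in product((True, False), repeat=len(ATOMS))]

# Hypothetical weights on the four valuations; any non-negative weights
# summing to 1 would serve equally well.
WEIGHTS = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]

def models(sentence):
    """Stand-in for the equivalence class of `sentence`: the set of indices
    of the valuations that satisfy it. A sentence is modelled as a function
    from a valuation to a truth value."""
    return frozenset(i for i, v in enumerate(VALUATIONS) if sentence(v))

def prob(sentence):
    """Probability of a sentence: total weight of its satisfying valuations."""
    return sum((WEIGHTS[i] for i in models(sentence)), Fraction(0))

def cond_prob(a, b):
    """p(a|b) = p(a and b) / p(b), defined only when p(b) is non-zero."""
    pb = prob(b)
    if pb == 0:
        raise ZeroDivisionError("p(b) must be non-zero")
    return prob(lambda v: a(v) and b(v)) / pb

p = lambda v: v["p"]
q = lambda v: v["q"]
p_implies_q = lambda v: (not v["p"]) or v["q"]      # p -> q
contrapositive = lambda v: v["q"] or (not v["p"])   # ~q -> ~p
tautology = lambda v: v["p"] or (not v["p"])
p_and_q = lambda v: v["p"] and v["q"]
p_and_not_q = lambda v: v["p"] and (not v["q"])

print(prob(p_implies_q), prob(contrapositive), cond_prob(q, p))
```

Logically equivalent sentences receive the same set of valuations and hence the same probability, and the three axioms can be checked directly on this finite structure.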
We conclude this appendix with some useful theorems of the
probability calculus which we state in sentence-based formalism, without
proof:
- The sum of the probability of a sentence and of its negation is
1:
𝑝(~𝑎) = 1 − 𝑝(𝑎).
- Contradictions (⊥) have zero probability:
𝑝(⊥) = 0.
- The probability function respects the entailment relation: if 𝑎
⊢𝐿 𝑏, then 𝑝(𝑎) ≤ 𝑝(𝑏).
- Probability values range between 0 and 1:
0 ≤ 𝑝(𝑎) ≤ 1.
- Finite Additivity Condition:
𝑝(𝑎1 ∨ … ∨ 𝑎𝑁) = 𝑝(𝑎1) + ⋯ +
𝑝(𝑎𝑁), where 𝑎𝑖 ⊢𝐿 ~𝑎𝑗, 1 ≤ 𝑖 <
𝑗 ≤ 𝑁.
Corollary:
If ⊢𝐿 𝑎1 ∨ … ∨ 𝑎𝑁 and 𝑎𝑖
⊢𝐿 ~𝑎𝑗, 1 ≤ 𝑖 < 𝑗 ≤ 𝑁, then 1 = 𝑝(𝑎1) + ⋯
+ 𝑝(𝑎𝑁).
- Theorem of total probability:
If 𝑝(𝑎1 ∨ … ∨ 𝑎𝑁) = 1, and 𝑎𝑖
⊢𝐿 ~𝑎𝑗, 𝑖 ≠ 𝑗, then 𝑝(𝑏) = 𝑝(𝑏 ∧ 𝑎1) +
⋯ + 𝑝(𝑏 ∧ 𝑎𝑁), for any sentence 𝑏.
Or in terms of conditional probabilities:
If 𝑝(𝑎1 ∨ … ∨ 𝑎𝑁) = 1, 𝑎𝑖
⊢𝐿 ~𝑎𝑗, 𝑖 ≠ 𝑗, and 𝑝(𝑎𝑖) > 0, then
𝑝(𝑏) = 𝑝(𝑏|𝑎1)𝑝(𝑎1) + ⋯ +
𝑝(𝑏|𝑎𝑁)𝑝(𝑎𝑁), for any sentence 𝑏.
Corollary 1:
If ⊢𝐿 𝑎1 ∨ … ∨ 𝑎𝑁 and 𝑎𝑖
⊢𝐿 ~𝑎𝑗, 𝑖 ≠ 𝑗, then 𝑝(𝑏) = 𝑝(𝑏 ∧ 𝑎1) +
⋯ + 𝑝(𝑏 ∧ 𝑎𝑁).
Corollary 2:
𝑝(𝑏) = 𝑝(𝑏|𝑐)𝑝(𝑐) + 𝑝(𝑏|~𝑐)𝑝(~𝑐), for any sentence 𝑐 with 0 < 𝑝(𝑐) <
1.
- Bayes’s Theorem. The famous theorem that took its name after the
eighteenth-century clergyman Thomas Bayes.
- First form (Thomas Bayes):
$$p(h|e) = \frac{p(e|h)\,p(h)}{p(e)}, \quad \text{where } p(h), p(e) > 0,$$
where 𝑝(ℎ|𝑒) is called posterior probability and expresses
the probability of the hypothesis ℎ conditional on the evidence 𝑒;
𝑝(𝑒|ℎ) is called likelihood of the hypothesis and expresses the
probability of the evidence conditional on the hypothesis; 𝑝(ℎ) is
called prior probability of the hypothesis; and 𝑝(𝑒) is the
probability of the evidence.
- Second form (Pierre Simon Laplace):
If 𝑝(ℎ1 ∨ … ∨ ℎ𝑁) = 1 and ℎ𝑖
⊢𝐿 ~ℎ𝑗, 𝑖 ≠ 𝑗, and 𝑝(ℎ𝑖), 𝑝(𝑒) > 0,
then
$$p(h_k|e) = \frac{p(e|h_k)\,p(h_k)}{\sum_{i=1}^{N} p(e|h_i)\,p(h_i)}, \quad \text{for each } k = 1, \ldots, N.$$
- Third form:
$$p(h|e) = \frac{p(h)}{p(h) + \dfrac{p(e|{\sim}h)}{p(e|h)}\,p({\sim}h)}.$$
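As a quick consistency check, the three forms of Bayes’s theorem can be evaluated on the same inputs. The sketch below uses the partition {ℎ, ~ℎ}; the function names and the numerical values of the prior and the likelihoods are hypothetical.

```python
from fractions import Fraction

def bayes_first(prior_h, lik_h, prob_e):
    """First form: p(h|e) = p(e|h) p(h) / p(e)."""
    return lik_h * prior_h / prob_e

def bayes_second(priors, likelihoods, k):
    """Second form, for exclusive and jointly exhaustive h_1, ..., h_N:
    p(h_k|e) = p(e|h_k) p(h_k) / sum_i p(e|h_i) p(h_i)."""
    total = sum(lik * pr for lik, pr in zip(likelihoods, priors))
    return likelihoods[k] * priors[k] / total

def bayes_third(prior_h, lik_h, lik_not_h):
    """Third form: p(h|e) = p(h) / (p(h) + [p(e|~h)/p(e|h)] p(~h))."""
    return prior_h / (prior_h + (lik_not_h / lik_h) * (1 - prior_h))

# Hypothetical inputs.
prior_h = Fraction(1, 100)
lik_h = Fraction(9, 10)       # p(e|h)
lik_not_h = Fraction(1, 20)   # p(e|~h)

# p(e) obtained from the theorem of total probability.
prob_e = lik_h * prior_h + lik_not_h * (1 - prior_h)

first = bayes_first(prior_h, lik_h, prob_e)
second = bayes_second([prior_h, 1 - prior_h], [lik_h, lik_not_h], 0)
third = bayes_third(prior_h, lik_h, lik_not_h)
print(first, second, third)
```

All three forms return the same posterior, since the second and third merely rewrite 𝑝(𝑒) by means of the theorem of total probability.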
- A sketch of proof for Laplace’s Rule of Succession
Assume that we want to calculate the probability that the sun will
rise tomorrow given that the sun has risen for the past 𝑁 days. We have
observation data about the sunrise in the past 𝑁 days but the
probability 𝑞 of the sunrise is unknown. By application of the principle
of indifference, we claim that it is equally likely that the probability
of sunrise be any number 𝑞 ∈ [0,1]. Hence, the distribution of
probability values of sunrise is uniform.
We take the sample space to consist of (𝑁+2)-tuples of the following
type:
$$\langle \underbrace{S, S, \ldots, F, \ldots, S}_{N+1},\, q \rangle,$$
where 𝑆, 𝐹 stand for ‘Success’ and ‘Failure’ of the sunrise,
respectively, and 𝑞 denotes a possible value for the probability of the
sun rising.
The subset of the sample space
$$E = \{\langle \underbrace{S, \ldots, S}_{N},\, x,\, q \rangle \mid x \in \{S, F\} \text{ and } q \in [0,1]\},$$
is a random event consistent with observations of the sun rising in
the past 𝑁 days, no matter what is going to happen on the (𝑁 + 1)th day or
what the probability 𝑞 of the sunrise is.
Since the parameter 𝑞 takes real values, we should not ask what the
probability of a given value 𝑘 of the parameter 𝑞 is, but what the
probability of 𝑞 being found within a given interval is:
$$p(q \le k|E).$$
To calculate this probability, we first apply Bayes’s rule:
$$p(q \le k|E) = \frac{p(q \le k) \cdot p(E|q \le k)}{p(E)}.$$
Since all values of 𝑞 in [0,1] are equiprobable:
$$p(q \le k) = k.$$
Since the sequence of past sunrises is a sequence of independent
trials, i.e., whether the sun has risen or not on a given day does not
influence the rising of the sun on subsequent days:
$$p(E|q \le k) = \frac{k^{N}}{N+1}$$
and
$$p(E) = \frac{1}{N+1}.$$
Hence:
$$p(q \le k|E) = k^{N+1}.$$
From here, we can calculate the probability density function for 𝑞 =
𝑘 conditional on 𝐸:
$$f(k) = (N+1)k^{N}.$$
The probability of the sun rising on the (𝑁 + 1)th day, given
that it has risen in the last 𝑁 days, no matter what the probability of
sunrise might be, is given by the following integral:
$$\int_0^1 k f(k)\,dk = \left.\frac{(N+1)k^{N+2}}{N+2}\right|_0^1 = \frac{N+1}{N+2}.$$
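The result can be reproduced with exact arithmetic. The sketch below assumes the uniform prior on 𝑞 and the independence of trials stated above, and it takes a slightly different but equivalent route: it divides the probability of 𝑁 + 1 sunrises by the probability of 𝑁 sunrises, each obtained by integrating a power of 𝑞 over [0,1]. The function names are hypothetical.

```python
from fractions import Fraction

def integral_of_power(exponent, upper=Fraction(1)):
    """Exact value of the integral of q**exponent over [0, upper]."""
    return upper ** (exponent + 1) / (exponent + 1)

def prob_next_sunrise(n):
    """p(sunrise on day N+1 | sunrises on days 1..N) under a uniform prior."""
    evidence = integral_of_power(n)   # p(E) = 1/(N+1)
    joint = integral_of_power(n + 1)  # p(E and sunrise on day N+1) = 1/(N+2)
    return joint / evidence

def posterior_cdf(k, n):
    """p(q <= k | E), which should equal k**(N+1)."""
    return integral_of_power(n, k) / integral_of_power(n)

for n in (0, 1, 9, 99):
    print(n, prob_next_sunrise(n))
```

With no observations the rule returns 1/2, and the value approaches 1 as the number of observed sunrises grows, without ever reaching it.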
- The Mathematics of Keynes’s Account of Pure Induction
Consider a generalization ℎ: “all 𝐴 is 𝐵” and 𝑛 positive instances
𝑒𝑖: “this 𝐴 is 𝐵”,
𝑖 = 1, … , 𝑛, that follow logically from ℎ, i.e., ℎ ⊢ 𝑒𝑖.
Let 𝑝(ℎ|𝐾) be the probability of ℎ prior to any evidence, relative to
background knowledge 𝐾. Background knowledge is understood as the body of
evidence which is related to the truth of the hypothesis, with the
exception of the evidence that is being considered explicitly. If 𝑛
positive instances 𝑒𝑖: “this 𝐴 is 𝐵”, 𝑖 = 1, … , 𝑛, and no negative
instances have been observed, the posterior probability of ℎ is
𝑝(ℎ|𝑒1& … &𝑒𝑛&𝐾).
To justify inductive inference, Keynes claims, we need to find the
conditions on which the posterior probability increases with the
accumulation of positive instances and the absence of negative instances
so that the inductive argument is strengthened and in the limit of
empirical investigation, hypothesis ℎ can be inferred with certainty on
the basis of empirical evidence:
$$\lim_{n\to\infty} p(h|e_1 \& \ldots \& e_n \& K) = 1.$$
From Bayes’s theorem we have:
$$p(h|e_1 \& \ldots \& e_n \& K) = \frac{p(h|K)\, p(e_1 \& \ldots \& e_n|h \& K)}{p(e_1 \& \ldots \& e_n|K)}.$$
Since ℎ ⊢ 𝑒𝑖, 𝑖 = 1, … , 𝑛:
$$p(e_1 \& \ldots \& e_n|h \& K) = 1 \qquad (1)$$
$$p(h|e_1 \& \ldots \& e_n \& K) = \frac{p(h|K)}{p(e_1 \& \ldots \& e_n|K)} \qquad (2)$$
From the law of total probability, we have:
$$p(e_1 \& \ldots \& e_n|K) = p(e_1 \& \ldots \& e_n|h \& K)\,p(h|K) + p(e_1 \& \ldots \& e_n|{\sim}h \& K)\,p({\sim}h|K)$$
and by (1),
$$p(e_1 \& \ldots \& e_n|K) = p(h|K) + p(e_1 \& \ldots \& e_n|{\sim}h \& K)\,p({\sim}h|K) \qquad (3)$$
Hence, by (2) and (3):
$$p(h|e_1 \& \ldots \& e_n \& K) = \frac{p(h|K)}{p(h|K) + p(e_1 \& \ldots \& e_n|{\sim}h \& K)\,p({\sim}h|K)}.$$
If
$$\lim_{n\to\infty} \frac{p(e_1 \& \ldots \& e_n|{\sim}h \& K)}{p(h|K)} = 0,$$
the requested condition of asymptotic certainty,
$\lim_{n\to\infty} p(h|e_1 \& \ldots \& e_n \& K) = 1$, is
satisfied. Since 𝑝(ℎ|𝐾) is the prior probability of the
hypothesis, which is independent of the evidence accumulated, it is a
fixed number.
Hence, the antecedent of the aforementioned conditional can be split
into the following two conditions:
$$p(h|K) \neq 0 \qquad (4)$$
and
$$\lim_{n\to\infty} p(e_1 \& \ldots \& e_n|{\sim}h \& K) = 0 \qquad (5)$$
Condition (5) can be analyzed in terms of the probability of a
positive instance 𝑒𝑗 given 𝑗 − 1 positive instances for ℎ,
𝑒1& … &𝑒𝑗−1, and that ℎ is false:
$$p(e_j|e_1 \& \ldots \& e_{j-1} \& {\sim}h \& K) = q_j, \quad j = 2, \ldots, n$$
$$p(e_1|{\sim}h \& K) = q_1.$$
The probability of 𝑛 positive instances and no negative instances
given that ℎ is false is:
$$p(e_1 \& \ldots \& e_n|{\sim}h \& K) = q_1 \cdot \ldots \cdot q_n.$$
Let $1 > M_n = \max\{q_1, \ldots, q_n\}$; then
$p(e_1 \& \ldots \& e_n|{\sim}h \& K) \le M_n^{\,n}$. The sequence
$\{M_n\}_{n\in\mathbb{N}}$ is bounded. If $M = \sup_{n\in\mathbb{N}} M_n$,
$0 < M < 1$, then:
$$\text{for every } n \in \mathbb{N}, \quad p(e_1 \& \ldots \& e_n|{\sim}h \& K) \le M_n^{\,n} \le M^{n}$$
and (5) follows:
$$\lim_{n\to\infty} p(e_1 \& \ldots \& e_n|{\sim}h \& K) \le \lim_{n\to\infty} M^{n} = 0.$$
By contraposition we infer that if condition (5) is not satisfied,
$\{M_n\}_{n\in\mathbb{N}}$ is not bounded by any number 𝑀, 0 < 𝑀
< 1. Thus, for every 𝑀 there is an $n_0 \in \mathbb{N}$ such that
$M_{n_0} > M$. Since $M_{n_0} = \max\{q_1, \ldots, q_{n_0}\}$, we infer
that for every 𝑀 there is a $k \in \mathbb{N}$, $k \le n_0$, such that:
$$1 > p(e_k|e_1 \& \ldots \& e_{k-1} \& {\sim}h \& K) = q_k > M,$$
and
$$\lim_{k\to\infty} p(e_k|e_1 \& \ldots \& e_{k-1} \& {\sim}h \& K) = 1. \qquad (6)$$
Hence, if (5) is false then (6) holds. But it is reasonable to demand that
a negative instance of ℎ, ~𝑒𝑘, should have non-zero
probability no matter how many positive
instances have been observed given the falsity of ℎ. Thus, Keynes
(1921: 275) suggested that (6) is false:
[given that] the generalisation is false, a finite uncertainty as to
its conclusion being satisfied by the next hitherto unexamined instance
which satisfies its premiss.
Or, as Russell commented referring to condition (5), “[i]t is
difficult to see how this condition can fail in empirical material.”
(1948: 455).
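The role of conditions (4) and (5) can be illustrated with a short sketch of the posterior formula 𝑝(ℎ|𝑒1& … &𝑒𝑛&𝐾) = 𝑝(ℎ|𝐾) / (𝑝(ℎ|𝐾) + 𝑞1 ∙ … ∙ 𝑞𝑛 ∙ 𝑝(~ℎ|𝐾)). The prior of 1/100 and the constant value 𝑞𝑗 = 9/10 (so that 𝑀 = 9/10 < 1) are hypothetical, and the function name is introduced only for this example.

```python
from fractions import Fraction

def keynes_posterior(prior, qs):
    """Posterior of h after len(qs) positive instances and no negative ones,
    where qs[j-1] = q_j is the probability of the j-th positive instance
    given the earlier ones and the falsity of h."""
    prob_instances_if_false = Fraction(1)
    for q in qs:
        prob_instances_if_false *= q   # q_1 * ... * q_n
    return prior / (prior + prob_instances_if_false * (1 - prior))

prior = Fraction(1, 100)   # hypothetical non-zero prior, condition (4)
q = Fraction(9, 10)        # hypothetical bound on every q_j, securing (5)

posteriors = [keynes_posterior(prior, [q] * n) for n in (0, 10, 50, 200)]
for n, post in zip((0, 10, 50, 200), posteriors):
    print(n, float(post))

# If condition (4) fails, no amount of positive evidence helps.
stuck = keynes_posterior(Fraction(0), [q] * 200)
```

The posterior starts at the prior and climbs towards 1 as instances accumulate, whereas a zero prior stays at zero however many instances are observed.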
Keynes justified the second condition, (4), by applying the principle
of limited independent variety and the principle of indifference (see
sections 3.a.1, 3.a.2).
According to the principle of limited independent variety, qualities
are classified into a finite number of groups so that two qualities that
belong in the same group have the same extension, i.e., they are
satisfied by the same individuals, and, in this sense, they are
equivalent. More precisely, [𝐴] is the set of all qualities that are
equivalent to 𝐴; it includes all qualities 𝐵 for which (∀𝑥)(𝐴𝑥 ≡ 𝐵𝑥).
Thus, generalization ℎ is entailed logically by the assumption that 𝐴, 𝐵
are equivalent properties. Moreover, the principle of limited variety
requires that the number of independent qualities that are inequivalent
is finite. Hence, if 𝑛 is the number of independent qualities, by the
principle of indifference we conclude that the probability of any two
properties 𝐴, 𝐵 belonging in the same group is 1/𝑛. Since ℎ is a
logical consequence of this fact, by a well-known theorem in probability
theory (see section 1.a),
$$p(h|K) \ge \frac{1}{n}, \quad n \text{ a fixed counting number.}$$
But this is exactly what the demand for finite prior probability,
condition (4), requires.
- References and further reading
- Belot, G., (2013). “Bayesian Orgulity”. Philosophy of Science 80 (4): pp. 483-503.
- Belot, G., (2017). “Objectivity and Bias”. Mind 126 (503): pp. 655-695.
- Bernoulli, J., (1713 [2006]). The Art of Conjecturing. Baltimore: The Johns Hopkins University Press.
- Boole, G., (1854). An Investigation of The Laws Of Thought, on Which Are Founded The Mathematical Theories Of Logic And Probabilities. London: Walton and Maberly.
- Burks, A.W., (1953). “Book Review: The Continuum of Inductive Methods. Rudolf Carnap.” Journal of Philosophy 50 (24): pp. 731-734.
- Carnap, R., (1950). Logical Foundations of Probability. London: Routledge and Kegan Paul, Ltd.
- Carnap, R., (1952). The Continuum of Inductive Methods. Chicago: University of Chicago Press.
- Carnap, R., (1963). “Replies and Systematic Expositions”. In Schilpp, P.A., (ed.). The Philosophy of Rudolf Carnap. Library of Living Philosophers, Volume XI. Illinois: Open Court Publishing Company, pp. 859-999.
- Carnap, R., (1971). “A basic system of inductive logic, I”. In Jeffrey, R., and Carnap, R., (eds.). Studies in Inductive Logic and Probability. Los Angeles: University of California Press, pp. 34-165.
- Carnap, R., (1980). “A basic system of inductive logic, II”. In Jeffrey, R., (ed.). Studies in Inductive Logic and Probability. Berkeley: University of California Press, pp. 2-7.
- Childers, T., (2013). Philosophy and Probability. Oxford: Oxford University Press.
- Cisewski, J., Kadane, J. B., Schervish, M. J., Seidenfeld, T. and Stern, R., (2018). “Standards for Modest Bayesian Credences”. Philosophy of Science 85 (1): pp. 53-78.
- de Finetti, B., (1931). “Probabilismo. Saggio critico sulla teoria delle probabilità e sul valore della scienza”. In: Logos. Napoli: F. Pezzella, pp. 163-219. English translation in Erkenntnis 31 (1989): pp. 169-223.
- de Finetti, B., (1936). “Statistica e Probabilita nella concezione di R. von Mises”. Supplemento Statistico ai Nuovi Problemi di Politica, Storia ed Economia Anno II, Fasc. 2-3, pp. 5-15.
- de Finetti, B., (1972). Probability and Induction. The art of guessing. London: Wiley.
- de Finetti, B., (1974). Theory of Probability: A Critical Introductory Treatment. Chichester: Wiley.
- de Finetti, B., (2008). Philosophical Lectures on Probability, collected, edited and annotated by A. Mura. Springer.
- Earman, J., (1992). Bayes or Bust: A critical examination of Bayesian Confirmation Theory. Cambridge, Massachusetts – London, England: The MIT Press.
- Elga, A., (2016). “Bayesian Humility”. Philosophy of Science 83: pp. 305–23.
- Ellis, R.L., (1842). “On the Foundations of the Theory of Probability”. In The Mathematical and Other Writings of Robert Leslie Ellis, 1862. Cambridge: Deighton, Bell, and Co., pp. 1-11.
- Gaifman, H., and Snir, M., (1982). “Probabilities Over Rich Languages, Testing and Randomness”. The Journal of Symbolic Logic 47 (3): pp. 495-548.
- Gillies, D., (2000). Philosophical Theories of Probability. London and New York: Routledge.
- Gnedenko, B.V., (1969 [1978]). The Theory of Probability. Moscow: Mir Publishers.
- Goodman, N., (1955 [1981]). Fact, Fiction and Forecast. Cambridge, MA: Harvard University Press.
- Hájek, A., (2019). “Interpretations of Probability”. The Stanford Encyclopedia of Philosophy (Fall 2019 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/fall2019/entries/probability-interpret/>.
- Hacking, I., (1971). “Equipossibility Theories of Probability”. The British Journal for the Philosophy of Science 22 (4): pp. 339-355.
- Hacking, I., (1975 [2006]). The Emergence of Probability: A philosophical study of early ideas about probability induction and statistical inference. Cambridge: Cambridge University Press.
- Hailperin, T., (1986). Boole’s Logic and Probability. Amsterdam: North-Holland.
- Hausdorff, F., (1914 [1957]). Set theory. New York: Chelsea Publishing Company.
- Hempel, C.G., (1945). “Studies in the logic of confirmation, I”. Mind 54 (213): pp. 1-26.
- Hempel, C.G., (1945). “Studies in the logic of confirmation, II”. Mind 54 (214): pp. 97-121.
- Hesse, M., (1975). “Bayesian Methods and the Initial Probability of Theories”. In Maxwell, G. and Anderson, R.M., (eds.). Induction, Probability and Confirmation. Minnesota Studies in the Philosophy of Science, vol. 6. Minneapolis: University of Minnesota Press.
- Hilbert, D., (1902). “Mathematical Problems”. Bull. Amer. Math. Soc. 8: pp. 437-479.
- Howson, C. and Urbach, P., (1989/2006). Scientific Reasoning: The Bayesian Approach. Chicago and La Salle, Illinois: Open Court.
- Humphreys, P., (1985). “Why Propensities cannot be Probabilities”. The Philosophical Review 94 (4): pp. 557-570.
- Huttegger, S. M., (2015). “Bayesian Convergence to the Truth and the Metaphysics of Possible Worlds”. Philosophy of Science 82: pp. 587–601.
- Huttegger, S. M., (2021). “Rethinking Convergence to the Truth”. The Journal of Philosophy 119: pp. 380–403.
- Jeffrey, R., (1992). Probability and the Art of Judgement. Cambridge: Cambridge University Press.
- Jeffrey, R., (2004). Subjective Probability: The Real Thing. Cambridge: Cambridge University Press.
- Kolmogorov, A. N., (1933 [1950]). Foundations of the Theory of Probability. New York: Chelsea Publishing Company.
- Keynes, J. M., (1921). A Treatise on Probability. London: Macmillan and Co., Limited.
- Lakatos, I., (1968). “Changes in the problem of inductive logic”. In Lakatos, I., (ed.). The Problem of Inductive Logic: Proceedings of the International Colloquium in the Philosophy of Science, London, 1965, vol. 2. Amsterdam: North Holland Pub. Co., pp. 315-417.
- Laplace, P. S., (1814 [1951]). A Philosophical Essay on Probabilities. New York: Dover Publications, Inc.
- Leibniz, G. W., (1678 [2004]). “On Estimating the Uncertain”. The Leibniz Review 14.
- Maher, P., (2006). “The Concept of Inductive Probability”. Erkenntnis 65: pp. 185–206.
- Nielsen, M., (2021). “Convergence to Truth without Countable Additivity”. Journal of Philosophical Logic 50: pp. 395–414.
- Nielsen, M. and Stewart, R.T., (2019). “Obligation, permission and Bayesian orgulity”. Ergo 6 (3).
- Popper, K., (1959). “The Propensity Interpretation of Probability”. The British Journal for the Philosophy of Science 10 (37): pp. 25-42.
- Psillos, S. and Stergiou, C., (2022). “The Problem of Induction”. The Internet Encyclopedia of Philosophy, ISSN 2161-0002, https://iep.utm.edu/problem-of-induction/#H8
- Ramsey, F. P., (1926). “Truth and Probability”. In The Foundations of Mathematics and other Logical Essays. London and New York: Routledge (1931), pp. 156-198.
- Reichenbach, H., (1934 [1949]). The Theory of Probability: An Inquiry into the Logical and Mathematical Foundations of the Calculus of Probability. Berkeley and Los Angeles: University of California Press.
- Russell, B., (1948 [1992]). Human Knowledge: Its Scope and Limits. London: Routledge.
- Salmon, W. C., (1966). The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press.
- Savage, L. J., (1954 [1972]). The Foundations of Statistics. New York: Dover Publications, Inc.
- Shackel, N., (2007). “Bertrand’s Paradox and the Principle of Indifference”. Philosophy of Science 74 (2): pp. 150–175.
- Venn, J., (1888). The Logic of Chance. London: Macmillan and Co.
- von Mises, R., (1928 [1981]). Probability, Statistics and Truth. New York: Dover Publications, Inc.
- von Mises, R., (1964). Mathematical Theory of Probability and Statistics. London and New York: Academic Press.
- Weatherson, B., (2015). “For Bayesians, Rational Modesty Requires Imprecision”. Ergo 2.
Author Information
Chrysovalantis Stergiou
E-mail: cstergiou@acg.edu
The American College of Greece – Deree Greece
Stathis Psillos
E-mail: psillos@phs.uoa.gr
University of Athens Greece