Offerings of KnowledgeToTheMax
______________________________________________________________________________________________________________________
This message targets and should be of
interest to: science policy makers; funders of scientific studies; research
workers; scientists; philosophers; educators; political leaders; physicians;
engineers; laymen.
________________________________________________________________________________________________
Abstract
In an advance of breathtaking importance, logic has been completed. The completion
was effected by extending logic from its roots in the deductive logic through
the inductive logic. In the extension of logic, the principles of reasoning
were discovered. One result is a newfound ability to build a scientific model
(aka scientific theory) under the principles of reasoning. Under these
principles, the construction of a model creates the maximum possible knowledge
from fixed resources.
The news of
this advance has reached few of the people who need to know about it. On this
Web site, the firm KnowledgeToTheMax
offers a primer on the completed logic. In key respects this primer is the
first to be published.
The
extension of logic makes it possible, for the first time, to eliminate logical
error from a model. Most models in use today, in fields of endeavor that
include medicine, engineering, law, business and government, are guilty of this
kind of error. Bad consequences for us result from the error.
Error
degrades the performance of a model. Conversely, an absence of error boosts the
performance of a model to the maximum possible level. In some cases, the degree
of boost is found to be of an astounding order of magnitude.
The
completion of logic exposes a gap in which very few scientists possess the
skills that are necessary for construction of a logically sound, optimally
effective model. KnowledgeToTheMax
offers its assistance to the scientific, educational and business communities
in filling this gap.
Table of contents
State transition probabilities
An early principle of reasoning
The principle of entropy maximization
Shannon’s theory of the optimal encoder
The principle of entropy minimization
Shannon’s theory of the optimal decoder
Christensen’s theory of knowledge
The principle of maximum entropy expectation
Creation of the maximum possible knowledge
Reduction to the deductive logic
The word
“science” comes to us from the Latin word “scientia,” meaning “demonstrable
knowledge.” For brevity, this document references “scientia” by the word
“knowledge.”
Prior to the
year 1975, science was undermined by the existence of unresolved foundational
issues. Among these were:
o The origins of patterns,
o The nature of knowledge and
o The principles of reasoning.
The delivery
system for mankind’s knowledge was its collection of scientific models. A model
was a procedure for making inferences. On each occasion on which an inference
was made, each element in a collection {a,
b…} of inferences was a candidate for
being made. Which candidate was correct? The model builder had to decide!
Logic was
the science of the principles that discriminated the one correct inference in
the collection {a, b, …} of inferences from the many incorrect
ones. These principles were called the “principles of reasoning.”
However,
logic was incomplete. Aristotle had described the principle of reasoning for
the deductive branch of logic but, in building a model, the builder had to
employ the inductive branch of logic. The inductive branch had yet to be
described.
The problem of extending logic from its deductive branch through its inductive branch
had come to be known as the “problem of induction.” Prior to 1975, this problem
was unsolved. In lieu of the principles of reasoning, model builders coped
through employment of intuitive rules of thumb called “heuristics” in the
determination of which inference in the collection {a, b…} of inferences was
correct. According to one frequently employed heuristic, the correct inference
was the one of greatest beauty.
The first
principle of reasoning was the law of non-contradiction. In employing
heuristics, scientists violated this law.
As a result, models were highly susceptible to making logical errors. The existence of
these errors degraded the performances of the associated models. The degree of
degradation could be and often was great. People died and suffered other
unpleasant consequences from the errors.
In 1948, the
mathematician and communications engineer Claude Shannon opened up a path out
of this quagmire. Shannon identified a measure which, it could be shown, was
the unique measure of an inference. The measure of an inference was the information
that was missing in this inference, for a deductive conclusion.
The
existence and uniqueness of the measure of an inference signified that the
identity of the correct inference could be established by optimization. In an
optimization, that inference in the collection {a, b…} of inferences that
were candidates for being made by a model was identified as correct for which
the missing information was (depending upon the type of inference) minimal or
maximal.
Shannon
described a pair of applications for the idea of optimization; these
applications were in the design of equipment for the telecommunications
industry. In one of them, the inference made by the device called an “encoder”
was optimized by maximizing the missing information in this inference. In the
other, the inference made by the device called a “decoder” was optimized by
minimizing the missing information in this inference.
The function of a decoder was to translate an encoded message such as “0110111…” to an
un-encoded message such as “Mary had a little lamb.” Early in the 1960s, the
engineer, lawyer and theoretical physicist Ronald Christensen got the idea that the
problem of induction could be solved by construing a model to be the algorithm
for the decoder of a “message” consisting of the sequence of outcomes of
statistical events.
There was a
barrier to implementation of this idea. Shannon’s theory of the optimal decoder
was incomplete. Christensen completed it.
Using his
idea, Christensen explained the origins of patterns, described the nature of
knowledge and enunciated the principles of reasoning. “Knowledge” was the
information which the user of a model obtained about the outcomes of
statistical events by recognition of the patterns that preceded them. In the
construction of a model, one of two principles of reasoning was to maximize the
knowledge. The other was to constrain the process of maximization by empirical
data and other sources of information. Patterns originated in the construction
of models under the principles of reasoning.
By 1985, Christensen’s idea had been reduced to practice, published and employed in a
large number of real-world applications. In comparison to models built by
heuristics, models built under Christensen’s principles of reasoning
consistently performed better. Often, they performed much better. This
consistent outperformance supported the belief that Christensen’s
principles of reasoning were the correct principles.
Today, there
is an anomaly in which, with near universality, communications engineers build
their models by optimization while, with near universality, research workers
build their models by the method of heuristics. The users of the models that
are constructed in the latter manner pay for the lack of optimality. Sometimes
they pay with their lives.
The firm KnowledgeToTheMax fills a gap in which
very few research workers are equipped to build a logically sound, optimally
effective model. The firm offers services that include conduct of tutorials,
consultation on curriculum reform in education and construction of models.
This
completes a summary of the offerings of KnowledgeToTheMax.
An expanded version of the same topic follows.
Introduction
In the
period of two years that ended in 1980, a signal event occurred in the history
of meteorology. At the start of this period, centuries of research had extended
the span of time over which the weather could be forecasted, with statistical
significance, to no more than 1 month. Two years later, this span had been
extended to 12 to 36 months, an improvement of a factor of 12 to 36.
The factor of
12 to 36 improvement had been effected through the use of a method of model
building that was new to meteorology. How could a mere switch in the method of
its construction effect such an enormous improvement in the performance of a
model? This question falls naturally within the context of logic.
A model is a procedure for making inferences. Each time an inference is made, each of the
possibilities a, b… is a candidate for being made. Which inference in the collection {a, b…}
of possible inferences is the correct one? The model builder must decide!
Logic is the
science of the principles that discriminate the one correct inference from the
many incorrect ones. These principles are called the “principles of reasoning.”
For the deductive branch of logic:
o there is a single principle of reasoning and,
o this principle has been known since Aristotle described it 23 centuries ago.
For the
deductive logic, the principle of reasoning dictates the conformity of
arguments to the form called modus ponens
or the form called modus tollens.
Discovery of
the principles of reasoning for the whole of logic is called “the problem of
induction” after “induction,” the process by which the model builder
generalizes from descriptions of observed events to descriptions of unobserved
ones. To be useful to us, a model must describe the unobserved ones.
The problem
of induction thwarted the best efforts of thinkers over several millennia. Many
otherwise well informed people believe the problem remains unsolved. In many
ways, our society is organized as if this belief were true. For example, while
it is impermissible to publish an illogical deductive argument in a
mathematical journal, it is permissible to publish an illogical inductive
argument in a scientific journal. However, over the period of four centuries
that ended in 1975, a solution was found to the problem of induction. Solving
this problem made the principles of reasoning known and available for
construction of a model.
The ideas
that were to foster a solution were those of measure, inferences, optimization and missing
information. If an
inference had a unique measure then the one correct inference in the collection
{a, b…} of inferences that were possibilities for being made by a model
could be identified by optimization. In the optimization of an inference, that
possibility would be identified as correct whose measure was minimal or
maximal. In time, this measure was discovered in a generalization from the
deductive logic; this generalization came to be known as the “probabilistic
logic.”
Over the
period of about 4 centuries that ended in 1975, it was discovered that the
probabilistic logic held the unique measure of an inference. The unique measure
was the missing information in this inference, for a deductive conclusion. This
discovery yielded a pair of principles of reasoning.
Inferences
of infinite number were candidates for being made by a model. All but a few of
these inferences were incorrect. Thus, if a model builder were not guided by
the principles of reasoning, the associated model was virtually certain to make
incorrect inferences. The incorrect inferences degraded the performance of the
model. The magnitude of the degradation could be and often was great.
Over the
period of 27 years that ended in 1975, it became possible to eliminate the
incorrect inferences by optimization of inferences. Immediately, communications
engineers seized this opportunity. This was to revolutionize the communications
industry. HDTV was to become one of the fruits from this revolution.
Inexplicably, research workers failed to seize the same opportunity.
As a result, the lion’s share of today’s models make incorrect inferences. We employ
these inferences in making decisions on issues of importance to us. For
example, we employ them in making decisions on medical issues for which “alive” and “dead” are the outcomes.
That few of
us are aware of the principles of reasoning or how they operate is a barrier to
eradication of the incorrect inferences that plague us. To gain this awareness,
one must delve into the details of logic. The rudiments of logic are presented
in the primer that follows. The primer can be speed-read in about half an hour,
probably without attaining full understanding of the mathematical details. More
thorough study of this topic, preferably with the help of a competent tutor, is
advised for scientists, philosophers, educators, intellectuals, professionals,
business leaders and political leaders, among others.
In one’s
study of the details of logic, the place to start is with the probabilistic
logic.
Working in
the sixteenth century, the mathematician, physician and habitual gambler
Girolamo Cardano discovered or anticipated the discovery of a surprisingly
large proportion of the ideas that were to play key roles in solving the
problem of induction. He described his ideas in Liber de Ludo Aleae (“The Book
of Games of Chance”). One of Cardano’s ideas was the probabilistic logic.
In the
deductive logic, every proposition was true in either 0% or 100% of the
instances in which this proposition was asserted. In reality, though, a proposition
might be true in a proportion of instances in which it was asserted lying
between 0% and 100%. Cardano got the idea that logic could be freed from the
unrealistic restriction to either 0% or 100%. By this idea, he created the
generalization from the deductive logic that came to be known as the
“probabilistic logic.” In this logic, the proportion of instances in which a
proposition was asserted in which it was true was called the “probability” of
this proposition. Cardano’s innovation made it possible for logic to express
the important idea that information needed for a deductive conclusion from an
inference could be missing; for example, information needed for a deductive
conclusion about the outcome of a horse race could be missing.
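The definition of “probability” given above can be sketched in a few lines of Python. This is only an illustration: the function name and the record of outcomes are hypothetical and do not come from Cardano or from this primer. It shows that a proposition of the deductive logic sits at 0 or 1, while other propositions fall in between.

```python
from fractions import Fraction

def probability(outcomes):
    """Proportion of the instances in which a proposition was asserted
    in which it turned out to be true."""
    if not outcomes:
        raise ValueError("need at least one instance")
    return Fraction(sum(1 for o in outcomes if o), len(outcomes))

# Hypothetical record: "this horse wins" was asserted 8 times, true twice.
record = [True, False, False, True, False, False, False, False]
p = probability(record)

# Deductive special cases: always true, or never true.
always = probability([True] * 5)
never = probability([False] * 3)
print(p, always, never)
```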
States
This topic
of “states” and the ten topics that follow it are largely devoted to tedious
but necessary definitions of terms.
The
propositions that are referenced by the probabilistic logic are examples of
states. A “state” is a description of a physical object or “body.” Cloudy is an example of a state; it
describes a body that is a region of the Earth.
That a
proposition is a state signifies that this proposition can be validated or
invalidated by observation. A proposition that is a state is validated each time
the associated body is observed and found to be in the state that is claimed
for it; otherwise, it is invalidated. The ability of one to validate or
invalidate its propositions in this way ties the probabilistic logic to
science. In science, a model is invalidated if a single one of its propositions
is invalidated.
A complete
set of alternate descriptions of a body is called a “state-space” for this
body.
If the state
in a state-space is observed, the associated state-space is said to be “observed.”
The set { cloudy, not cloudy } is an example of an
observed state-space.
If the state
in a state-space is unobserved, the associated state-space is said to be
“unobserved.” The set { rain, no rain } is an example of an unobserved
state-space.
An
“inference” is an extrapolation from a state in an observed state-space of a
body to a state in an unobserved state-space of the same body. For example, it
is an extrapolation from the state cloudy
in the observed state-space { cloudy,
not cloudy } of a region of the Earth
to the state rain in the unobserved
state-space { rain, no rain } of the same region.
An “event”
is a pairing of a state in the unobserved state-space Y with a state in the observed state-space that participates with Y in making an inference. The pairing { cloudy, rain } is an example of one of them.
An “observed event” is a datum that specifies the state in the observed state-space of an event
and the state in the unobserved state-space of the same event. An example of an observed event is “cloudy, rain.”
An “unobserved event” is a datum that specifies the state in the observed
state-space of an event but not the state in the unobserved state-space of the
same event. An example of an unobserved event is “cloudy, x,” where the x designates a variable which takes on the state in the unobserved
state-space as its value.
A complete set
of observed and unobserved events is called a “population.”
A subset of
a population is called a “sample.” The observed events belong to a sample.
A state in an unobserved state-space may be conditional upon a state in an observed
state-space. For example, the state rain
may be conditional upon the state cloudy.
A state that is formed in this way is called a “conditional state.” Rain given
cloudy is an example of a conditional
state.
State transition probabilities
As a conditional
state is an example of a proposition, under the probabilistic logic a
conditional state has a probability. This probability is called a
“state-transition probability.” Pr( rain given cloudy ) is an example of such a probability, where Pr(.) signifies the probability.
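To make the terms concrete, the sketch below stores a hypothetical sample of observed events as (state in X, state in Y) pairs and estimates Pr( rain given cloudy ) as a simple relative frequency. The data are invented, and the relative-frequency estimate is an assumption made for illustration only; it is not the assignment rule that the principles of reasoning described later would produce.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample of observed events: (observed state, unobserved state).
sample = [
    ("cloudy", "rain"), ("cloudy", "rain"), ("cloudy", "no rain"),
    ("not cloudy", "no rain"), ("not cloudy", "no rain"),
    ("cloudy", "rain"), ("not cloudy", "rain"), ("cloudy", "no rain"),
]

def transition_probability(events, y_state, x_state):
    """Relative-frequency estimate of Pr(y_state given x_state)."""
    pair_counts = Counter(events)
    x_count = sum(n for (x, _), n in pair_counts.items() if x == x_state)
    if x_count == 0:
        raise ValueError(f"state {x_state!r} never observed")
    return Fraction(pair_counts[(x_state, y_state)], x_count)

pr_rain_given_cloudy = transition_probability(sample, "rain", "cloudy")
print(pr_rain_given_cloudy)  # 3/5
```

In this sample the state cloudy appears in five events, three of which pair it with rain, so the estimate is 3/5.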
An early principle of reasoning
A logic
needed principles of reasoning that identified the one correct inference in the
set {a, b…} of alternatives, when an inference was made by a model. In Liber de Ludo Aleae, Cardano supplied a
principle of reasoning whose scope was limited to models of games of chance.
Under this principle, equal numerical values were assigned to the probabilities
of the ways in which an outcome
could occur in a game of chance, provided that this game was fair. Many
centuries later, a generalization from Cardano’s principle would become the
principle of entropy maximization; soon, you’ll learn about this principle.
The first
principle of reasoning is the law of non-contradiction. This law cannot be
derived. Instead, it serves as a portion of the definition of what it means to
be “logical.” The law states that a proposition is false if it contradicts itself. For example, the proposition “That
card is the ace of spades and is not the ace of spades” is false because it violates the law of non-contradiction.
As the
principles of reasoning identify the one correct inference in the set {a, b…}
of alternatives for being made by a model, the proposition is false that these principles identify
more than one inference as the one correct inference, by the law of
non-contradiction. Over the years in which the problem of induction remained
unsolved, the major barrier to solution was to satisfy the law of
non-contradiction.
A
“heuristic” is an intuitive rule of thumb that identifies the one correct
inference in the set {a, b…} of alternatives for being made by a
model. In every instance in which a heuristic identifies the one correct
inference, at least one different heuristic identifies a different inference as
the one correct inference. As it identifies more than one inference as the one
correct inference, the method of heuristics violates the law of
non-contradiction.
Prior to the
year 1948, model builders lacked an alternative to the method of heuristics. In
that year, the idea of optimizing inferences was published by Claude Shannon.
Under optimization, the correct inference was the one whose unique measure was
minimal or maximal. By identifying the correct inference uniquely, optimization
satisfied the law of non-contradiction.
In the
construction of a model, a key idea is that of abstraction. A model is
abstracted (removed) from some of the details of the real world.
To be more
precise, a state A is said to be
“abstracted” from the states B, C… if and only if A is the inclusive disjunction of B, C…; in other words, A is the equivalent of the state B OR C
OR… A is said to be “abstracted” from
B, C… because the description provided by A is removed from the descriptions provided by B, C… For example, the
state male OR female is removed from the gender difference between the state male and the state female.
How to
abstract his/her model from the details is one of the problems faced by the
builder of a model. This problem is solved by the principles of reasoning.
A state that
is abstracted from no other state is called a “way in which a state can occur” or
“way” for short.
In the example provided above, the states male and female are
examples of ways. The state male OR female is not a way, for it is abstracted from the two other states.
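One way to picture abstraction is to represent each state as the set of ways it covers, so that the inclusive disjunction becomes a set union. This representation is an assumption adopted for the sketch, and the function names are hypothetical.

```python
def abstract(*states):
    """Inclusive disjunction (B OR C OR ...) of states, where each state
    is modeled as a frozenset of the ways it covers."""
    result = frozenset()
    for state in states:
        result |= state
    return result

def is_way(state):
    """A way is abstracted from no other state: it covers exactly one way."""
    return len(state) == 1

male = frozenset({"male"})
female = frozenset({"female"})
male_or_female = abstract(male, female)

print(is_way(male), is_way(female), is_way(male_or_female))
```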
If an
inference were to be optimized, it had to possess a unique measure. Under
Cardano’s definition of “probability,” a probability was an example of a
measure. It was not, however, the measure of an inference. Early in the
twentieth century, the mathematician Henri Lebesgue generalized Cardano’s idea
to measures in general. Lebesgue’s generalization set the stage for discovery
of the measure of an inference.
Lebesgue’s generalization is called “measure theory.” Under measure theory, a measure is a
mathematical function that maps each set in a collection of “measurable sets”
to a non-negative real number. That this function “maps” signifies that, for
every set in the collection, there is exactly one non-negative real number.
Under a precept
of measure theory, the measure of an empty set is nil. Under the precept called
“additivity,” the measure of the union of disjoint sets is the sum of the
measures of the individual sets.
The union of
several sets is the set of all elements that belong to at least one set. Sets
are said to be “disjoint” if they do not intersect.
Under
precepts of measure theory governing membership in the collection of measurable
sets, if a collection contains the set A
and the set B then this collection
also contains the sets B – A and B∩A.
B – A is
called the “set difference.” It is the set of elements of B that do not belong to A.
B∩A is called the “set
intersection.” It is the set of elements of B
that also belong to A.
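The precepts above can be checked on a small finite example. The sketch assumes a uniform probability measure over the six faces of a fair die, which is an illustrative choice and not part of the original text. Because B – A and B∩A are disjoint and their union is B, additivity requires their measures to sum to the measure of B.

```python
from fractions import Fraction

def pr(event, space):
    """Uniform probability measure on a finite space (assumed example)."""
    return Fraction(len(event & space), len(space))

space = frozenset(range(1, 7))   # faces of a fair die
A = frozenset({2, 4, 6})         # even faces
B = frozenset({4, 5, 6})         # faces greater than 3

difference = B - A               # elements of B that do not belong to A
intersection = B & A             # elements of B that also belong to A

empty_measure = pr(frozenset(), space)
additive = pr(difference, space) + pr(intersection, space) == pr(B, space)
print(empty_measure, additive)
```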
Probability
is an example of a measure. Let the probability measure be designated by the
function Pr(.). The dot between the
parentheses symbolizes that element in the collection of measurable sets which
is measured by Pr(.).
Halfway through the twentieth century, the mathematician and communications engineer
Claude Shannon described the measure that came to be named after him:
“Shannon’s measure.” Let Shannon’s measure be designated by the function Sh(.).
The
collection of sets which were measurable by Sh(.)
included an unobserved state-space plus the observed state-space which
participated with this unobserved state-space in making an inference. Let the
unobserved state-space be designated by Y
and the observed state-space be designated by X.
By the precepts
of measure theory, the collection of measurable sets also contained Y – X
and Y∩X. Under the precept of additivity,
Sh( Y – X ) = Sh( Y ) – Sh( Y∩X )     (1)
Shannon
stipulated that Sh( Y – X
) was a function with a specific mathematical form that was known to
Shannon; by this formula, Sh( Y – X
) was the measure of an inference. Thus, the set being measured by Sh(.), namely Y – X, had to be an
inference. It could be shown that, under the probabilistic logic, Sh( Y
– X ) was the unique measure of an
inference. Though Shannon did not realize it, the existence of the unique
measure of an inference signified that the problem of induction could be solved
by optimization.
In the
circumstance that Sh( Y∩X ) was nil, it followed from equation (1) that
Sh( Y – X ) = Sh( Y ) – Sh( Y∩X ) = Sh( Y )     (2)
By
inspection of equation (2), Sh( Y – X
) reduced to Sh( Y ) in the circumstance that Sh( Y∩X ) was nil.
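Equations (1) and (2) can be checked numerically. The sketch below assumes the standard textbook formulas: Sh( Y ) as the Shannon entropy of Y, Sh( Y – X ) as the conditional entropy of Y given X, and Sh( Y∩X ) as the mutual information between X and Y. The two joint distributions are hypothetical. Each of the three quantities is computed from its own definition, so the agreement with equation (1) is a genuine check and not a tautology.

```python
from math import isclose, log2

def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def shannon_measures(joint):
    """joint maps (x, y) -> Pr(x, y). Returns (Sh(Y), Sh(Y - X), Sh(Y∩X))."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    sh_y = entropy(py.values())
    # Sh(Y - X): Pr(x)-weighted entropy of Y given each observed state x.
    sh_y_minus_x = sum(
        px[x0] * entropy([p / px[x0] for (x, _), p in joint.items() if x == x0])
        for x0 in px if px[x0] > 0
    )
    # Sh(Y∩X): information about Y given X, from its own definition.
    sh_y_and_x = sum(
        p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )
    return sh_y, sh_y_minus_x, sh_y_and_x

# Hypothetical joint probabilities in which cloudiness is informative.
weather = {("cloudy", "rain"): 0.3, ("cloudy", "no rain"): 0.2,
           ("not cloudy", "rain"): 0.1, ("not cloudy", "no rain"): 0.4}
sh_y, cond, info = shannon_measures(weather)
eq1_holds = isclose(cond, sh_y - info, abs_tol=1e-9)

# Hypothetical joint probabilities in which X tells nothing about Y.
independent = {("cloudy", "rain"): 0.2, ("cloudy", "no rain"): 0.3,
               ("not cloudy", "rain"): 0.2, ("not cloudy", "no rain"): 0.3}
sh_y2, cond2, info2 = shannon_measures(independent)
eq2_holds = abs(info2) < 1e-9 and isclose(cond2, sh_y2, abs_tol=1e-9)
print(eq1_holds, eq2_holds)
```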
Refresher:
o Pr(.) designates the probability measure.
o Sh(.) designates Shannon’s measure.
Under the
probabilistic logic, Pr(.) existed
and was the unique measure of a state. Under the same logic, Sh(.) existed and was the unique measure
of an inference.
The function Sh( Y ) had a specific mathematical form. A colleague informed Shannon that, in
developing the modern theory of heat (aka thermodynamics), physicists had named
this form the “entropy.” In this way, Sh(
Y ) became known as the “entropy.”
The entropy was the measure of an inference from the observed state-space X to the unobserved state-space Y, in which X contained a single state. The single state in X was abstracted from the states in Y.
The “conditional entropy”
By its form, the function Sh(
Y – X ) was the measure of an inference from X to Y, in which X contained several states. By analogy
to the term “entropy,” this function came to be termed the “conditional
entropy.”
Shannon
worked in the field of communications engineering. Communications firms, such
as telephone companies, understood that their role was to move information.
However, they did not know what “information” was! That they did not know precluded
optimization of their operations. Shannon suggested that the mathematical
formula for the “information” moved by communications firms was given by the
function Sh( Y∩X ). In
particular, Sh( Y∩X ) was the
“information” about the state in Y,
given the state in X.
Under
equation (1), Sh( Y∩X ) varied inversely with Sh(
Y – X ). When Sh( Y∩X ) was at its maximum value, Sh(
Y – X ) was at its minimum value. Conversely, when Sh( Y∩X ) was at its minimum value, Sh( Y
– X ) was at its maximum value.
As Sh( Y∩X ) was called the “information” about
the state in Y, given the state in X, it was apt to call Sh(Y
– X) the “missing information” in the
inference from X to Y. As Sh(Y – X) was also called the “conditional
entropy,” the “conditional entropy” of an inference was synonymous with the
“missing information” in this inference. Similarly, the “entropy” of an
inference was synonymous with the “missing information” in this inference.
The phrase
“missing information” was shorthand for “missing information for a deductive
conclusion.” Under the probabilistic logic, the inductive differed from the
deductive logic in the respect that there was missing information for a
deductive conclusion in the former branch of logic but not the latter. In this
way, the probabilistic logic answered the previously unanswered question of the
essential difference between the deductive and inductive branches of logic.
The missing
information has a precise mathematical formula that you can look up on the Web.
To avoid getting embroiled in too many details, we’ll skip the formula and
illustrate the idea with a story.
The story is about a race among 8 equally matched horses. With the winner unknown, the
identity of the winner is conveyed by the three-bit binary number _ _ _ . Each of the underscore
characters of the number represents the erasure of a binary digit, that is, a “0” or a “1”.
One bit of information about the winner is
lost when a binary digit is erased. One bit of information is gained when an
erasure is replaced by a binary digit.
The information about the winner, measured in “bits,” is the number of binary digits.
The missing information about the winner, also measured in bits, is the number
of erasures. Thus, for example, in the number “_0_,” there is one bit of
information about the winner and there are two bits of missing information
about the winner.
Why is it
that the missing information about the winner can be measured by counting the
erasures and the information about the winner can be measured by counting the
binary digits? This is a consequence of the precept of measure theory called
“additivity.” This precept gives the missing information and the information
their unique functional forms. If we were to define the missing information and
the information in any other way, Shannon’s “measure” would no longer be a
measure.
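The horse-race story can be reproduced with the standard formula for Shannon’s entropy, the negative sum of p times log2(p) over the possible winners. The sketch assumes that “equally matched” means each horse still consistent with the known digits is equally likely to win; the function names are hypothetical. Under that assumption the computed missing information equals the number of erasures.

```python
from math import log2

def missing_information(probs):
    """Shannon entropy in bits: -sum of p * log2(p)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def winner_distribution(pattern):
    """Equal probabilities over the horses consistent with a pattern
    such as '_0_', where '_' marks an erased binary digit."""
    horses = [format(i, "03b") for i in range(8)]
    consistent = [h for h in horses
                  if all(c == "_" or c == d for c, d in zip(pattern, h))]
    return [1 / len(consistent)] * len(consistent)

def check(pattern):
    """True if the entropy equals the count of erasures in the pattern."""
    bits = missing_information(winner_distribution(pattern))
    return abs(bits - pattern.count("_")) < 1e-12

for pattern in ("___", "_0_", "10_", "101"):
    print(pattern, pattern.count("_"), check(pattern))
```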
The principle of entropy maximization
In the
nineteenth century, physicists described a principle of reasoning for the
modern theory of heat: When a body was isolated from its environment, the
entropy of an inference to the ways in which a state
could occur in the unobserved state space of this body was maximized. In the
twentieth century, Shannon described a similar principle for the modern theory
of communication. Subsequently, various theorists generalized the two
principles to a principle of reasoning for models, in general.
This
principle is entropy maximization. An abbreviated derivation of this principle
follows.
If and only
if the observed state-space of an inference contains a single state then, under
the probabilistic logic, the unique measure of this inference is its entropy.
If and only if the states in the unobserved state-space of this inference are
examples of ways, the entropy possesses a maximum. The
entropy may be pushed downward from the maximum by constraints, expressed
mathematically, on entropy maximization. The amount of the reduction in the
entropy, from the constraints, is called “the available information.”
Thus, the
probabilistic logic holds a “principle of entropy maximization.” It states
The entropy of the inference to the ways in which a state can occur is maximized, under
constraints expressing the available information.
Maximization
of the entropy, under the constraints, identifies the one correct inference in
the set {a, b…} of alternatives for being made by a model. The correct
inference is the one that maximizes its own entropy. Thus, the principle of
entropy maximization is a principle of reasoning. This principle assigns a unique
numerical value to the probability of each way in which the
state can occur.
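A small numerical sketch may help here. It assumes, purely for illustration, that the ways are the six faces of a die and that the available information takes the form of a known average face value; neither assumption comes from the primer. For a constraint of this kind the entropy-maximizing probabilities are known to be proportional to exp(lam * value), and the sketch finds lam by bisection.

```python
from math import exp, log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum-entropy probabilities over `values` subject to a fixed mean.
    The solution has the form p_i proportional to exp(lam * v_i)."""
    if not min(values) < target_mean < max(values):
        raise ValueError("target mean must lie strictly inside the range")

    def dist(lam):
        shift = max(lam * v for v in values)   # for numerical stability
        weights = [exp(lam * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(dist(lam), values))

    for _ in range(iters):                     # mean(lam) increases with lam
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

faces = [1, 2, 3, 4, 5, 6]
unconstrained = maxent_with_mean(faces, 3.5)   # constraint carries no information
constrained = maxent_with_mean(faces, 4.5)     # hypothetical known average
available_information = log2(len(faces)) - entropy(constrained)
print(unconstrained, constrained, available_information)
```

With a mean of 3.5 the result is the equal assignment of 1/6 to each face. With a mean of 4.5 the probabilities tilt toward the higher faces, and the drop in entropy from its maximum is the available information in the sense defined above.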
The reader
should understand that, under the probabilistic logic, the principle of entropy
maximization is a fact and not a theory, conjecture or empirical finding. Thus,
this principle provides a portion of the bedrock upon which a model may be
founded. Conversely, to violate this principle in the construction of a model
is to commit a logical error.
The principle of entropy maximization has been called the “principle of honesty in
inferences.” Often, model builders violate this principle by reducing the
entropy to a level that is lower than is justified by the available
information. When this happens, the result is the same as when a dishonest research
worker fabricates empirical data. A consequence is for the model to fail in
service from making false assertions.
Under
Cardano’s theory of fair gambling devices, equal numerical values are assigned
to the probabilities of the ways in which an outcome can
occur in a game of chance. Cardano’s theory arises from the principle of
entropy maximization. It arises in the following manner.
Suppose a
model makes an inference from an observed state-space containing a single state
to the unobserved state-space that participates with the observed state-space
in making an inference. This inference assigns a numerical value to the
probability of each way in which an outcome can occur,
in a game of chance. The observed state-space contains a single state, which is
abstracted from the ways in the unobserved state-space.
The numerical values which are assigned to the probabilities of the various ways form a set.
Infinitely many such sets of numerical values are possible. Each possibility defines a different inference. Which
inference is correct?
The
principle of entropy maximization applies to the situation described. Under
this principle, the correct inference maximizes its own entropy, under
constraints expressing the available information.
The
information about the way in which the outcome will occur is nil, by the
definition of a fair gambling device. Thus, the entropy of the one correct
inference is maximized without constraints. Maximization of the entropy assigns
equal numerical values to the probabilities of the various ways.
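The unconstrained case can be checked numerically. The following is a minimal sketch, assuming a hypothetical device with six ways and two hypothetical unequal assignments for comparison; none of these numbers come from the text. It shows only that the equal assignment has the greatest entropy of the three.

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability assignment."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical device with six ways (for example, the faces of a die).
uniform = [1 / 6] * 6
loaded_a = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]      # hypothetical unequal assignment
loaded_b = [0.25, 0.25, 0.2, 0.1, 0.1, 0.1]    # hypothetical unequal assignment

for name, dist in [("uniform", uniform), ("loaded_a", loaded_a), ("loaded_b", loaded_b)]:
    print(f"{name}: {entropy(dist):.4f} bits")
```

With no constraints, the equal assignment attains log2(6) bits, which is the largest entropy available over six ways.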
The modern
theory of heat, aka thermodynamics, arises from the principle of entropy
maximization. The manner in which it arises is identical to the manner in which
the theory of fair gambling devices arises, with the exception of the
identities of the states in the unobserved state-space. In the theory of fair
gambling devices, these states are the ways in which an
outcome can occur in a game of chance. In the theory of heat, they are the
“accessible microstates” for a body at thermodynamic equilibrium; the
accessible microstates are the ways in which an outcome
can occur, for a body at thermodynamic equilibrium.
Shannon’s theory of the optimal encoder
Shannon’s
theory of the optimal encoder is an application of the principle of entropy
maximization. Shannon’s theory features an inference-making device called an
“encoder.” An encoder translates an un-encoded message, such as “Mary had a
little lamb,” to an encoded message, such as “101110…” In doing so, an encoder
makes an inference to an unobserved state-space; this state-space is the
alphabet of the encoded message. The inference assigns a numerical value to the
probability of each state in the unobserved state-space. The numerical values
that are assigned to the probabilities of the various states form a set.
Infinitely many such sets are possibilities for assignment. Which one is
correct? The designer of the encoder must decide!
Each state
in the unobserved state-space of the inference is an example of a way. The observed state-space that participates with the
unobserved state-space in making the inference contains a single state; this
state is abstracted from the ways in the unobserved
state-space.
The
principle of entropy maximization applies to the situation described. The
correct inference is the one that maximizes its own entropy, under constraints
expressing the available information. Designing an encoder in this manner
eliminates logical error from the inference that is made by this encoder. An
encoder that is designed in this manner is called an “optimal encoder.”
The principle of entropy minimization
Shannon
described a second principle of reasoning for the modern theory of
communication. Various theorists generalized this principle to a principle of
reasoning for models, in general.
This
principle is entropy minimization. A derivation of the principle follows.
If the
observed state space of an inference contains several states, under the
probabilistic logic the unique measure of this inference is its conditional
entropy. The conditional entropy of an inference is the missing information in
this inference, for a deductive conclusion.
The observed
state-space of the inference can be defined in many different ways. Each way
defines a different inference. Thus, each inference in the set { a, b… }
of alternatives is a candidate for being made by a model. Which inference is
correct? The model builder must decide!
The
principle of entropy maximization does not apply to this situation, for its
task is to assign numerical values to the probabilities of the states in an
unobserved state-space. Here, the task is to determine the descriptions that
are provided by the states in an observed state-space. Minimization of the
conditional entropy is the optimization that determines the descriptions. Minimization
of the conditional entropy uniquely determines the correct inference. Thus,
minimization of the conditional entropy is a principle of reasoning. This
principle is called “entropy minimization.”
Shannon’s theory of the optimal decoder
Shannon’s
theory of the optimal decoder applies the principle of entropy minimization.
Under Shannon’s theory, a device called a “decoder” translates an encoded
message, such as “001110…,” to an un-encoded message, such as “Mary had a
little lamb.” In doing so, a decoder makes an inference from an observed to an
unobserved state-space. The unobserved state-space is the alphabet of the
un-encoded message.
A variety of
descriptions can be provided by the states in the observed state-space of this
inference. Each description defines a different inference. Which inference in
the set {a, b…} of alternatives for being made by the decoder is correct? The
designer of the decoder must decide!
The
principle of entropy minimization applies to the situation described. That
inference is correct which minimizes its own conditional entropy.
In the
vernacular of communications engineering, the conditional entropy is attributed
to the “noise.” Lightning strikes to telephone lines are a source of noise, for
they add to the conditional entropy. Minimization of the conditional entropy
through the design features of the decoder minimizes the deleterious effects of
this noise.
Conformity
to the principle of entropy minimization eliminates logical error from the
inference that is made by a decoder. A decoder that is free from logical error
is called an “optimal” decoder.
Shannon’s theory of communication
Under
Shannon’s theory of communication (Shannon, 1948), the designer of a
communications system maximizes the capacity of this system by combining an
optimal encoder with an optimal decoder. The effect is to maximize the missing
information about the encoded message at the encoder of this message and
minimize the missing information about the un-encoded message at the decoder of
the same message. Shannon’s ideas underlie the designs of nearly all modern
communications devices.
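The combined effect can be illustrated with the textbook binary symmetric channel, which is an assumption introduced here and not an example from the text. In this sketch the hypothetical parameter `flip` is the probability that noise inverts a symbol, and the search is over the probability that the encoder emits the symbol 1.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(p_one, flip):
    """Information in bits that a received symbol carries about the sent symbol.

    p_one: probability that the encoder emits the symbol 1
    flip:  probability that noise inverts a symbol (hypothetical value)
    """
    p_received_one = p_one * (1 - flip) + (1 - p_one) * flip
    # Entropy of the received symbol minus the conditional entropy due to noise.
    return h2(p_received_one) - h2(flip)

flip = 0.1
candidates = [k / 100 for k in range(101)]
best = max(candidates, key=lambda p: mutual_information(p, flip))
capacity = mutual_information(best, flip)
print(best, round(capacity, 4))
```

Under these assumptions the search settles on equal probabilities for the two encoded symbols, which is the maximum entropy assignment, and the resulting capacity is 1 minus the conditional entropy attributed to the noise.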
Christensen’s theory of
knowledge
By 1963, the
problem of induction remained unsolved. In that year, the engineer, lawyer and
theoretical physicist Ronald Christensen got the idea that the problem of
induction could be solved by construing a model to be the algorithm for an
optimal decoder of a “message” from nature. This “message” consisted of the
sequence of the outcomes of statistical events for which the model was
designed. It consisted, for example, of the sequence: rain, rain, no rain,
rain, ….
In their
quest for knowledge, research workers were hampered by the fact that
“knowledge” was an undefined concept. Christensen’s idea supplied a definition
that was uniquely logical. In doing so, it generated a logical theory of
knowledge that was the only such theory. Going forward, this theory will be
called “Christensen’s theory of knowledge.”
In the
construction of a model, the issue repeatedly arose of which inference in a set
{ a, b…} of inferences that were candidates for being made by a model
was the one correct inference. Under Christensen’s theory, each such issue was
resolved by measuring the various candidates by Shannon’s measure and selecting
that candidate whose measure was minimal or maximal.
This line of
thinking yielded a pair of principles of reasoning. Christensen called these
principles “entropy minimax.” Acting under entropy minimax, the builder of a
model discovered patterns in empirical data. Christensen called the process of
discovery “entropy minimax pattern discovery.”
Christensen’s
theory employs an abundance of mathematical ideas. To keep unambiguous track of
these ideas, it is necessary to employ mathematical symbols in referencing some
of them.
Toward the
end of keeping track, let the set O
designate the set of outcomes of statistical events to which an inference is
made by a model; O is an example of
an unobserved state-space. Let C
designate the observed state-space that participates with O in making an inference. The “knowledge” of Christensen’s theory
is Sh( O∩C ); it is the
information about the state in O,
given the state in C. “Knowledge”
must be defined in this way because, under the probabilistic logic, there is no
other way to define it.
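A minimal sketch of this definition follows. It assumes that Sh( O∩C ) can be read as the missing information about O minus the conditional entropy of O given C, and it uses hypothetical joint probabilities for O = { rain, no rain } and C = { cloudy, not cloudy }; the numbers are illustrative, not measured.

```python
import math

# Hypothetical joint probabilities Pr(outcome AND condition); not measured data.
joint = {
    ("rain", "cloudy"): 0.30,
    ("no rain", "cloudy"): 0.20,
    ("rain", "not cloudy"): 0.05,
    ("no rain", "not cloudy"): 0.45,
}

def entropy(probabilities):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def marginal(table, axis):
    """Sum the joint probabilities over the other member of each pair."""
    totals = {}
    for key, p in table.items():
        totals[key[axis]] = totals.get(key[axis], 0.0) + p
    return totals

pr_o = marginal(joint, 0)   # probabilities of the states in O
pr_c = marginal(joint, 1)   # probabilities of the states in C

h_o = entropy(pr_o.values())                                    # missing information about O
h_o_given_c = entropy(joint.values()) - entropy(pr_c.values())  # conditional entropy of O given C
knowledge = h_o - h_o_given_c                                   # information about O, given C
print(round(h_o, 3), round(h_o_given_c, 3), round(knowledge, 3))
```

In this reading, the knowledge is zero when the state in C says nothing about the state in O and equals the full entropy of O when the state in C determines the state in O.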
If the
construction of a model is to create knowledge, this model must be built upon
one or more independent variables. Each such variable is a measured variable or
is computed from one or more measured variables. If an inference is to be made
from C to O, a value must have been assigned to each independent variable at
or before the time at which this inference is made.
A result
from satisfaction of this requirement is for the set of independent variables
to take on a value for each of its variables in the period before an inference
is made. The set which contains a value for each of the independent variables
is called a “tuple.” The complete set of tuples is called the “independent
variable space” for the model.
For
concreteness, let’s take a look at a simplified example. In the example, the
model has two independent variables. The values of one of these variables are
the elements of the set { heavy, light }. The values of the other
variable are the elements of the set { long,
short }. The associated independent
variable space is the set { heavy-long,
heavy-short, light-long, light-short }. Heavy-long
is one of the four tuples in this space.
The
independent variable space of a model may be divided into parts. In the case of
our example, this space may be divided into the part { heavy-long } and the part { heavy-short, light-long, light-short }. The complete set of these parts is
called a “partition” of the independent variable space. Each element of C is a tuple or is abstracted from the tuples in a part of this
partition.
If C contains two or more states, these
states are called “conditions,” for they are conditions on the independent
variable space. The state heavy-short
OR light-long OR light-short is an example of a condition; it
is abstracted from the elements of the part { heavy-short, light-long, light-short } of the partition of the independent variable space that was described in
the previous paragraph.
In practice,
there are a great many possible partitions of the independent variable space.
If at least one of the independent variables is continuous, the number of
partitions is infinite. Each partition generates a different set of
descriptions for the states in C. Each
such set defines a different inference from C
to O. Which of these inferences is
correct? The model builder must decide!
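For the two-variable example, the independent variable space and its partitions can be enumerated directly. The sketch below is illustrative; the helper name `partitions` is hypothetical. It builds the four tuples and counts every way of dividing them into parts.

```python
from itertools import product

def partitions(items):
    """Yield every partition of a list as a list of parts (each part a list)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing part in turn...
        for i, part in enumerate(smaller):
            yield smaller[:i] + [[first] + part] + smaller[i + 1:]
        # ...or give it a part of its own.
        yield [[first]] + smaller

tuples = [f"{w}-{l}" for w, l in product(["heavy", "light"], ["long", "short"])]
all_partitions = list(partitions(tuples))
print(len(tuples), "tuples,", len(all_partitions), "partitions")
# prints: 4 tuples, 15 partitions
```

Even this small space of four tuples has fifteen partitions, each generating a different set of conditions; the count grows rapidly as tuples are added.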
The
principle of entropy minimization applies to the situation described. That
inference is correct which minimizes its own conditional entropy. With
“knowledge” defined as previously described, the principle of entropy
minimization is the equivalent of the principle that the model builder shall
Maximize the knowledge.
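The selection among candidate partitions can be sketched as follows. The joint probabilities, the outcome labels and the two candidate partitions are hypothetical, chosen only to show the comparison; the sketch also takes the probabilities as already known, which sets aside the question of how they are assigned.

```python
import math

# Hypothetical joint probabilities Pr(outcome AND tuple); not measured data.
joint = {
    "heavy-long":  {"outcome 1": 0.20, "outcome 2": 0.05},
    "heavy-short": {"outcome 1": 0.15, "outcome 2": 0.10},
    "light-long":  {"outcome 1": 0.05, "outcome 2": 0.20},
    "light-short": {"outcome 1": 0.05, "outcome 2": 0.20},
}

def conditional_entropy(partition):
    """Conditional entropy in bits; each part of the partition is one condition."""
    total = 0.0
    for part in partition:
        # Pool the joint probabilities of the tuples abstracted into this condition.
        pooled = {}
        for tup in part:
            for outcome, p in joint[tup].items():
                pooled[outcome] = pooled.get(outcome, 0.0) + p
        pr_condition = sum(pooled.values())
        for p in pooled.values():
            if p > 0:
                total -= p * math.log2(p / pr_condition)
    return total

candidates = {
    "heavy-long vs rest": [["heavy-long"], ["heavy-short", "light-long", "light-short"]],
    "heavy vs light": [["heavy-long", "heavy-short"], ["light-long", "light-short"]],
}
best_name = min(candidates, key=lambda name: conditional_entropy(candidates[name]))
for name, partition in candidates.items():
    print(name, round(conditional_entropy(partition), 4))
print("selected:", best_name)
```

With these hypothetical numbers the partition by weight leaves less missing information about the outcome, so it is the one selected.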
Maximization
of the knowledge is Christensen’s first principle of reasoning. His second
principle of reasoning constrains the process of maximization of the knowledge
by the availability of information for this purpose. With the availability of
unlimited information, perfect knowledge is created by the application of Christensen’s
first and second principles of reasoning. With the availability of no
information, no knowledge is created. In practice, it is usually true that some
but not perfect knowledge is created.
The
foregoing description of Christensen’s second principle of reasoning is
accurate; however, it is vague in the sense of failing to describe how
information may be turned into knowledge. Elimination of this vagueness comes
at the expense of exposing the student to mathematical details that are
complicated and that may be confusing. In view of the potential for confusion,
it would be wise and cost-effective for the student to engage a competent
tutor. For those who wish to attempt to learn of the details without a tutor,
the following self-guided tutorial is provided.
Shannon’s theory of the optimal decoder contained an omission: it provided no
means for assignment of a number to the probability of a state in C or to the
probability of a state in O, given a state in C. These assignments had to be
made in order for the knowledge to be computed and maximized.
To assign a
number to each probability, one needed a solution to a so-called “inverse
problem.” The problem was that, while model builders had to assign values to
probabilities, all that experimental science gave to model builders was
frequency ratios in statistical samples.
The major
barrier to solving the inverse problem was the question of how to avoid
violation of the law of non-contradiction. In response, Christensen developed a
strategy that answered this question. The strategy was to set up the problem
such that on each occasion on which the identity of the correct inference was
at issue, this issue was resolved by the principle of entropy maximization. By
this strategy, Christensen solved the inverse problem. A result from this
strategy is Christensen’s second principle of reasoning.
Christensen’s
strategy is rich with mathematical ideas. In keeping track of the ideas, it
helps to reference them by symbols. Toward this end, let T designate an unobserved state-space and let U designate the observed state-space that participates with T in making an inference.
Refresher: O
designates the set of outcomes of statistical
events that is referenced by a model. O
is an example of an unobserved state-space. C
designates the observed state-space that
participates with O in making an
inference. Provided that C
contains two or more states, these states are called “conditions.”
The
description provided in the previous paragraph deliberately employs
terminological sloppiness in which the state-space C, previously described as an “observed state-space,” can also be
described as an “unobserved state-space.” The question of which kind of
state-space C is in any given context
is resolved by this context.
For
concreteness, let’s look at a couple of examples. In both examples, O is the state-space { rain, no rain } while C is the
state-space { cloudy, not cloudy }.
In the first
example, the variable T takes on the
value O and the variable U takes on the value C; thus, T is the state-space { rain,
no rain } while U is the state-space { cloudy,
not cloudy }. In the second example, T takes on the value C and U contains a single state that is abstracted from the states in C; thus, T is the state-space { cloudy,
not cloudy } while U is the state-space { cloudy OR not cloudy }.
As the reader may recall, T is an example of an unobserved state-space while U
is the observed state-space that participates with T in making an inference.
Let { Tl, Um } designate the pairing of an unspecified state in T with an
unspecified state in U. The count of the elements of a statistical sample that
are observed to be in state Um is an example of a “frequency”; let this
frequency be designated by n. The count of the elements that are observed to be
in state Tl AND Um is another example of a frequency; let this frequency be
designated by x. By definition, the two frequencies form the “frequency ratio”
of the state Tl given Um. Let this frequency ratio be designated by ‘x in n’.
Let V designate a statistical sample. If the frequency ratio of Tl given Um in V is ‘x in n’
what number shall be assigned to Pr( Tl given Um )? To answer this question, one needs a solution to
the inverse problem. Often, model builders have assumed the “straight rule” to
be this solution.
Under the
straight rule, the number assigned to Pr(
Tl given Um ) is x/n. x/n
is the “relative frequency” of the state
Tl given Um
in the sample V. The relative
frequency is the value that makes the frequency ratio ‘x in n’ most likely.
Thus, it is an example of a maximum likelihood estimator.
The straight
rule is illogical, for it violates the principle of entropy maximization. This
deficiency is most apparent in the circumstance that n is small.
To pick a
specific example, if 1 swan was observed and it was white, the frequency ratio
of the state white given swan is ‘1 in 1’, the relative frequency
of this state is 1/1 and 1 is assigned to Pr(
white given swan ) under the straight rule. This is the equivalent of the
conclusion that “all swans are white.”
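The maximum likelihood character of the straight rule can be checked numerically. This sketch assumes that the trials are independent, so that the probability of a frequency ratio follows the binomial formula; the frequency ratio ‘3 in 10’ is a hypothetical value, while ‘1 in 1’ is the swan example.

```python
from math import comb

def likelihood(p, x, n):
    """Binomial probability of observing the frequency ratio 'x in n'."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

grid = [k / 100 for k in range(101)]   # candidate values for the probability

# Hypothetical frequency ratio '3 in 10'.
most_likely = max(grid, key=lambda p: likelihood(p, 3, 10))

# The swan example: frequency ratio '1 in 1'.
swan = max(grid, key=lambda p: likelihood(p, 1, 1))

print(most_likely, swan)
```

On this grid the most likely value coincides with x/n in both cases, and for the single white swan it is 1, the value that the straight rule assigns.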
Is it
logical to conclude that all swans are white on the basis of a sighting of a
single white swan? No, it is not. One cannot logically state that all swans are
white, for information is missing about the colors of the unobserved swans.
Nonetheless, prior to 1957, statisticians were firm believers in the straight
rule. A result, still present in the language of mathematical statistics, is
use of the superlative “unbiased estimator” in reference to the result from the
straight rule. Using the meaning of “biased” in common English, one would have
to say that to assign the value of 1 to the probability of a white swan on the
basis of a sighting of a single white swan is extremely biased. It is biased in
the direction of presuming extremely more information than is possessed by the
model builder about the colors of the unobserved swans.
The straight
rule may be tested for its conformity to reality. In one such test, the state
space T of the model contained a pair
of outcomes. One of these outcomes was a
hit in a time at bat in the game of baseball. The other outcome was not a hit. The state space U contained 18 conditions on the model’s
independent variables. Each condition was the identity of the major league
player who was the batter. The results of the test were published in the
periodical Scientific American (Efron
and Morris, 1977).
In the
language of baseball, a player’s relative frequency of the state a hit in a time at bat is called this
player’s “batting average.” In the test of the straight rule, the performances
of 18 major league players were measured in the 1970 season. Each player’s
batting average in his first 45 times at bat was compared to this player’s
batting average in the remainder of the season.
If the
straight rule were consistent with reality, the two batting averages would have
been of similar magnitude. It was found, however, that in the remainder of the
season, the various players’ batting averages had shrunk far from their batting
averages in the first 45 times at bat and close to the grand average of the
eighteen players in their first 45 times at bat. This phenomenon became known
as “shrinkage.” With shrinkage, the numbers assigned to probabilities by the
straight rule were wrong. Thus the straight rule was invalidated as a general
guide to model building. The article called this phenomenon “Stein’s paradox,”
after the statistician who had discovered it.
Under the
probabilistic logic, the shrinkage has a cause. This cause is overestimation of
the information one gets about the probability of a state in T,
from knowing the state in U. If
this information is nil, then the probability of a state in T is independent of the state in U and is called the “base-rate.” The
shrinkage that is observed in empirical studies is toward the base-rate. The
straight rule neglects the shrinkage toward base-rate when the information
about the state in T, given the state
in U, is less than perfect.
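The cost of neglecting the shrinkage can be sketched with a short calculation. The sketch assumes an estimate of the form base-rate + weight × (sample average − base-rate), where `weight` is a hypothetical parameter: a weight of 1 reproduces the straight rule and a weight below 1 shrinks toward the base-rate. All numerical values are illustrative and are not the published 1970 data.

```python
def expected_squared_error(true_p, n, base_rate, weight):
    """Expected squared error of base_rate + weight * (average - base_rate).

    The sample average over n independent trials has variance
    true_p * (1 - true_p) / n; shrinking it introduces a bias of
    (1 - weight) * (base_rate - true_p).
    """
    variance = weight**2 * true_p * (1 - true_p) / n
    bias = (1 - weight) * (base_rate - true_p)
    return variance + bias**2

# Hypothetical values chosen for illustration; not the published 1970 data.
true_p, n, base_rate = 0.30, 45, 0.265
straight_rule = expected_squared_error(true_p, n, base_rate, weight=1.0)
shrunk = expected_squared_error(true_p, n, base_rate, weight=0.2)
print(round(straight_rule, 6), round(shrunk, 6))
```

With 45 trials and a true probability near the base-rate, the shrunken estimate has a much smaller expected squared error than the straight rule under these assumptions.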
The
shrinkage can be eliminated. To accomplish this, the model builder eliminates
the overestimation of the information about the probability of a state in T from knowing the state in U. This elimination is effected by
conformity to the principle of entropy maximization. To settle every issue of
which inference is correct by the principle of entropy maximization is the idea
that underlies a principle of reasoning which Christensen discovered. He calls
this principle “maximum entropy expectation.”
The principle of maximum entropy expectation
Maximum
entropy expectation solves the inverse problem. In setting up the inverse
problem for solution, Christensen describes inferences by a carefully contrived
strategy. This strategy is designed to render every issue of which of several
inferences is correct decidable, under the principle of entropy maximization.
To review,
the principle of entropy maximization applies if and only if:
o an inference to an unobserved state-space assigns a numerical value to the
probability of each of the states in this state-space and,
o the observed state-space that participates with the unobserved state-space in
making the inference contains a single state and,
o the elements of the unobserved state-space are examples of ways.
The details
of Christensen’s strategy result from the necessity for conforming to the three
bulleted requirements for the principle of entropy maximization to be applicable.
Maximum
entropy expectation assigns a number to Pr( Tl given Um ) over each value of the
index l and the index m. The manner in which it computes each
such number is the topic of the following derivation.
The
derivation features the set { W1,
W2,… } of ways in which a state can occur in the unobserved
state-space { T1 given Um, T2 given Um…
}. The elements of { W1, W2,… } may be paired. Let an
arbitrarily selected pair be designated by { Wi, Wj }, where the index i does not equal the index j.
Let it be stipulated that { Wi,
Wj } is the unobserved
state-space for an inference. Let it be stipulated that the observed
state-space which participates with { Wi,
Wj } in making this
inference is { Wi OR Wj }.
Wi given Wi OR Wj
is an example of a state. Let E2
designate the evidence that is available for the assignment of a
numerical value to Pr( Wi given Wi OR Wj ). Let Pr[
( Wi given Wi OR Wj ) given E2
] designate this value. If Pr[ ( Wi given Wi OR Wj
) given E2 ] can be computed
over all of the values of the indices i and j then,
it can be shown, sufficient data are available for assignment of a numerical
value to Pr( Tl given Um )
over each value of the index l and
each value of the index m as required
for completion of the derivation.
How shall a
number be assigned to Pr[ ( Wi given Wi OR Wj
) given E2 ]? As
Christensen has set up the problem, it responds to the principle of entropy
maximization. In the assignment of a number, an inference is made from the
observed state-space { Wi OR Wj } to the unobserved state-space { Wi, Wj }.
Infinitely many inferences are possibilities. Each inference assigns a number
to Pr[ ( Wi given Wi OR Wj ) given E2 ] and a different number to
Pr[ ( Wj given Wi OR Wj ) given E2 ]. Which inference is correct? The model
builder must decide!
Under the
principle of entropy maximization, the correct inference is the one that
maximizes its own entropy, under constraints expressing the available
information.
What are the
natures of the constraints? In answering this question, Christensen makes a
second application of the principle of entropy maximization.
In
developing this idea, we conduct a thought experiment. In each trial of this
experiment, we observe whether the state is Wi,
given that the state is Wi
OR Wj.
In 1 trial
of this experiment, it is a fact that the relative frequency of Wi given Wi OR Wj
will be 0 OR 1. In 2 trials, the relative frequency will be 0 OR ½ OR 1. In 3
trials, the relative frequency will be 0 OR 1/3 OR 2/3 OR 1. In N trials, the relative frequency will be
0 OR 1/N OR 2/N OR 3/N…OR 1. Note that
the relative frequency will surely be one of the elements in the sequence of
numbers 0, 1/N, 2/N, 3/N,…,1.
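The possible values can be listed exactly for small numbers of trials. This is a minimal sketch; the function name is hypothetical.

```python
from fractions import Fraction

def possible_relative_frequencies(n_trials):
    """All values the relative frequency can take after n_trials trials."""
    return [Fraction(k, n_trials) for k in range(n_trials + 1)]

for n_trials in (1, 2, 3):
    print(n_trials, [str(f) for f in possible_relative_frequencies(n_trials)])
# prints:
# 1 ['0', '1']
# 2 ['0', '1/2', '1']
# 3 ['0', '1/3', '2/3', '1']
```

For N trials the list has N + 1 elements, spaced 1/N apart.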
Now, let the
number of trials N increase without
limit. The relative frequency becomes known as the “limiting relative
frequency.” Let the limiting relative frequency of the state Wi given Wi OR Wj
be designated by Fr( Wi given Wi OR Wj
).
Each element
of the set { 0, 1/N, 2/N,…,1 } matches the description of a way in which Fr( Wi given Wi OR Wj
) can occur. Let this set be designated by F(
Wi given Wi OR Wj ). F( Wi given Wi OR Wj
) is a variable whose true but thus far undetermined value is Fr( Wi
given Wi OR Wj ). The values taken on by F( Wi
given Wi OR Wj ) are in the sequence 0,
1/N, 2/N, 3/N…,1. The distance
between adjacent values is fixed at 1/N and
this distance is infinitesimal.
We stipulate
that F( Wi given Wi
OR Wj ) is the unobserved
state-space for an inference. The observed state-space of this inference is { 0
OR 1/N OR 2/N OR 3/N OR…OR 1 }. The
numerical values that are assigned to the probabilities of the elements of F( Wi
given Wi OR Wj ) by this inference form a
set. Sets of infinite number are possibilities. Each set defines a different
inference. Which inference is correct? The model builder must decide!
The
principle of entropy maximization applies to the situation described. The
correct inference to F( Wi given Wi OR Wj
) is the one which maximizes its own entropy, under constraints expressing the
available information.
For reasons
that will become clear, it is convenient to stipulate that there are two
sources for the available information. One of these is the piece of evidence
we’ve already seen, namely E2.
The other is the additional piece of evidence E1. Under constraints expressing the available
information in E1,
maximization of the entropy yields the probability distribution function Pr[ F(Wi given Wi OR Wj
) given E1 ]. Under
constraints expressing the available information in E2, entropy maximization yields the probability
distribution function Pr[ F(Wi
given Wi OR Wj ) given E2 ].
By
tradition, the set E1 is
assumed to be empty and the set E2
is assumed to contain the frequency ratio ‘x
in n’ in a sample. Under this
tradition, Pr[ F( Wi given Wi OR Wj ) given E1
] is called the “prior” probability
distribution function while Pr[ F(Wi
given Wi OR Wj ) given E2 ] is called the
“posterior” probability distribution function. A theorem, proved independently
in the eighteenth century by Thomas Bayes and Pierre-Simon Laplace and called
“Bayes’ theorem,” maps the “prior” function plus the frequency ratio to the
“posterior” function.
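The traditional mapping can be sketched on the grid 0, 1/N, 2/N,…,1. The sketch assumes independent trials, so that the probability of the frequency ratio given a grid value follows the binomial formula, and it uses a uniform “prior” function with a hypothetical grid resolution of N = 100.

```python
from math import comb

def posterior_on_grid(prior, x, n):
    """Apply Bayes' theorem on the grid 0, 1/N, ..., 1.

    prior: list of N + 1 probabilities, one per grid value of F
    x, n:  the frequency ratio 'x in n' observed in a sample
    """
    big_n = len(prior) - 1
    grid = [k / big_n for k in range(big_n + 1)]
    # Weight each grid value by its prior probability times the binomial likelihood.
    weights = [pr * comb(n, x) * f**x * (1 - f) ** (n - x)
               for pr, f in zip(prior, grid)]
    total = sum(weights)
    return [w / total for w in weights]

N = 100                                    # hypothetical grid resolution
uniform_prior = [1 / (N + 1)] * (N + 1)    # the traditional "prior" function
posterior = posterior_on_grid(uniform_prior, x=1, n=1)
mode = max(range(N + 1), key=lambda k: posterior[k]) / N
print(mode)
```

With the frequency ratio ‘1 in 1’ the “posterior” function rises linearly across the grid and peaks at 1; with an empty sample (‘0 in 0’) it remains equal to the uniform “prior” function.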
Bayes’
theorem is logically impeccable. However, under the tradition, the application
of it is illogical for violating the law of non-contradiction.
Non-contradiction is violated by the arbitrariness of the “prior” function.
If the
principle of entropy maximization is employed in the determination of the
“prior” function, this eliminates the arbitrariness. However, acting under the
tradition, one concludes that the “prior” function is uniform. Usually, a
result from acting on this conclusion is for the model to fail from the
resulting shrinkage.
Either the
principle of entropy maximization is empirically invalidated or the tradition
is empirically invalidated. However, under the probabilistic logic, the
principle of entropy maximization is a fact. Thus, it must be the tradition
that is empirically invalidated.
Acting on
the conclusion that the tradition is empirically invalidated, Christensen
defines E1 and E2 outside the tradition. In
particular:
o E1 contains the frequency ratio ‘x1 in n1’ plus the function
Pr[ F(Wi given Wi OR Wj ) given E2 ] and,
o E2 contains the frequency ratio ‘x2 in n2’ plus the function
Pr[ F(Wi given Wi OR Wj ) given E1 ],
where ‘x1 in n1’ designates the frequency ratio that is measured in a
sample and ‘x2 in n2’ designates the frequency
ratio that is measured in a different sample from the same population.
Bayes’
theorem maps ‘x2 in n2’ and Pr[ F(Wi given Wi OR Wj
) given E1 ] to Pr[ F(Wi given Wi OR Wj
) given E2 ]. A feedback
loop devised by Christensen maps ‘x1
in n1’ and Pr[ F(Wi given Wi OR Wj
) given E2 ] to Pr[
F(Wi
given Wi OR Wj ) given E1 ].
The evidence
E1 pushes the entropy
downward. The evidence E2 pulls
the entropy upward. It is the absence of the evidence E2 that causes the model to fail, under the tradition.
The feedback loop varies the portion of E2
which is Pr[ F(Wi given Wi OR Wj ) given E1
] in such a way as to minimize a measure of the error when a number is assigned
to Pr[ ( Wi given Wi
OR Wj ) given E2 ] over each of the values
of the indices i and j. By this strategy, the functions Pr[ F(Wi given Wi OR Wj
) given E1 ] and Pr[ F(Wi given Wi OR Wj
) given E2 ] are uniquely
determined, the available information is precisely represented and the cause of
shrinkage is eliminated.
With the
definitions for E1 and E2, as modified by
Christensen, the traditional terminological convention in which Pr[ F(Wi given Wi OR Wj
) given E1 ] is called the
“prior” function and Pr[ F(Wi
given Wi OR Wj ) given E2 ] is called the
“posterior” function becomes misleading and inappropriate, for Pr[ F(Wi given Wi OR Wj
) given E1 ] is dependent
upon, rather than being prior to, observational data. Adherence to this convention
has misled a large body of methodologists into the logically erroneous conclusion
that conformity to Bayes’ theorem must be avoided when, for logical
consistency, this conformity must be preserved.
A procedure
has been described for determination of the function Pr[ F(Wi given Wi OR Wj
) given E2 ]. The next
step in the derivation is to describe means for assignment of a number to the
different function Pr( Wi given Wi OR Wj
), over each of the values of the indices i
and j. With the availability of
these means, the derivation can be completed.
In the
context of this problem it is pertinent that the function Pr[ F(Wi given Wi OR Wj
) given E2 ] contains all of the information that is
available for the assignment of a number to
Pr(Wi given Wi OR Wj ). Our strategy is to discover a measure of Pr[ F(Wi given Wi OR Wj
) given E2 ] with the properties that are required
of the number which is assigned to Pr(
Wi given Wi OR Wj ).
Let the
function g(.) designate this measure.
If the identity of g(.) is
determined, then the required means are available for the assignment of a
number to Pr( Wi given Wi
OR Wj ).
What is the
identity of the measure g(.)? In
addressing this question, it is pertinent that, under the circumstance that the
missing information about the value of the variable F( Wi given Wi OR Wj ) is reduced to nil, this value is Fr( Wi
given Wi OR Wj ). Thus, under this
circumstance,
g{ Pr[ F(Wi given Wi OR Wj ) given E2 ] } = Fr( Wi given Wi OR Wj )     (3)
Equation (3)
imposes the first of two constraints on the form of the measure g(.). On the assumption that the
function Pr[ F(Wi given Wi OR Wj ) given E2
] has a single maximum, a form for g(.)
that is consistent with this constraint is that g(.) is the value of F( Wi given Wi OR Wj
) at this maximum. However, this assumption is inconsistent with the second of
the two constraints.
The second
of the constraints arises in the following way. To review, { Wi, Wj } designates the unobserved state-space of a kind of
inference. The observed state-space that participates with { Wi, Wj } in making this inference is { Wi OR Wj
}. Each such inference assigns a number to the probability of Wi and a different number to
the probability of Wj.
To continue
the review, F( Wi given Wi
OR Wj ) designates the
unobserved state-space of a different kind of inference. F( Wi given Wi OR Wj ) is a variable which takes on the values in the
sequence 0, 1/N, 2/N, 3/N…,
1, where N is a large positive
integer. The observed state-space
that participates with F(Wi given Wi OR Wj
) in making this inference is { 0 OR 1/N OR
2/N OR 3/N OR…OR 1 }. Each such inference assigns a number to the
probability of each element of F( Wi given Wi OR Wj
).
Many
inferences of the first and second kinds are possibilities. Which ones are
correct?
The
principle of entropy maximization applies to the situation. Among the several
possibilities for being made by a model, that inference is correct which
maximizes its own entropy, under constraints expressing the available
information.
Now, let it
be stipulated that the bases for inferences of the two kinds are entirely
empirical. In the circumstance that the available information about the value
of F( Wi given Wi
OR Wj ) is nil, the
available information about the state in { Wi,
Wj } must be nil, for
there is a complete lack of information about it. It follows that the value of F( Wi
given Wi OR Wj ) is ½; however, ½ is not
the value for which Pr[ F(Wi
given Wi OR Wj ) given E2 ] is at the maximum. In
fact, under the stated circumstances, Pr[
F(Wi
given Wi OR Wj ) given E2 ] is a constant. Thus, a
single value of F(Wi given Wi OR Wj
) does not exist for which Pr[ F(Wi
given Wi OR Wj ) given E2 ] is at the maximum but
rather all of the values in the interval between 0 and 1 are at the maximum.
Given that a
slight amount of information is available, the value assigned to F(Wi
given Wi OR Wj ) must, by the definition
of “information,” be close to ½. Under this circumstance, Pr[ F(Wi given Wi OR Wj
) given E2 ] may possess a
maximum, but the value of F(Wi given Wi OR Wj ) at which this
maximum occurs is not necessarily close to ½. On these grounds, the proposition that
the measure g(.) is the value of F(Wi
given Wi OR Wj ) for which Pr[ F(Wi given Wi OR Wj
) given E2 ] is at the
maximum is rejected.
In the
search for an acceptable alternative, it may be noted that the function for
which we require a measure, namely Pr[
F(Wi
given Wi OR Wj ) given E2 ], is a set of pairs. Each
such pair consists of an element of the set [ F(Wi given Wi OR Wj ) given E2
] and the element of the set Pr(.)
to which it maps. We stipulate that the collection of sets that are measurable
by g(.) contains the set of these
pairs and that the elements of this set are non-overlapping.
As the
missing information about the true value of the variable F(Wi given Wi OR Wj ) approaches nil, the function Pr[ F(Wi given Wi OR Wj
) given E2 ] reduces to a
Dirac delta function that is centered on the true value of F(Wi given Wi OR Wj ), namely Fr(Wi given Wi OR Wj
). By a property of the Dirac delta function, it must be true that the measure g(.) of a single pair in Pr[ F(Wi given Wi OR Wj
) given E2 ] is f(Wi
given Wi OR Wj ) Pr[ f(Wi given Wi OR Wj
) given E2 ], where f(Wi
given Wi OR Wj ) designates an element of
F(Wi
given Wi OR Wj ).
Under the
precept of measure theory called additivity, the measure of the union of the
pairs in the collection of measurable sets is the sum of the measures of the
individual pairs. In establishing the identity of this sum, it is convenient to
employ the substitution in which
Pr[ f( Wi given Wi OR Wj ) given E2 ] = Prʹ[ f( Wi given Wi OR Wj ) given E2 ] df( Wi given Wi OR Wj ),
where Prʹ[ f( Wi given Wi OR Wj ) given E2 ] is an example of a probability density function and
df( Wi given Wi OR Wj ) is the distance between adjacent elements of F( Wi given Wi OR Wj ).
With this substitution, the measure g(.) of the union of the pairs in the collection of
measurable sets is
Σ f( Wi given Wi OR Wj ) Prʹ[ f( Wi given Wi OR Wj ) given E2 ] df( Wi given Wi OR Wj ),
where the sum is taken over the elements f( Wi given Wi OR Wj ) of F( Wi given Wi OR Wj ).
The above
quantity is the expected value of F( Wi
given Wi OR Wj ) in the probability
distribution function Pr[ F( Wi
given Wi OR Wj ) given E2 ], by the definition of
“the expected value.” With no available information about the true value of F(Wi
given Wi OR Wj ), the expected value of
it is ½; this value of ½ maximizes the
entropy of the inference to { Wi,
Wj }, as required under
the principle of entropy maximization. On these grounds, we accept the
proposition that the number assigned to Pr(
Wi given Wi OR Wj ) is the expected value of the variable F(Wi
given Wi OR Wj ) in the function Pr[ F(Wi given Wi OR Wj
) given E2 ].
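The expected value described above can be sketched numerically. The example below builds the grid 0, 1/N, 2/N, ..., 1 for the variable F and computes its expected value under two distributions: a flat one, standing in for the case of no available information, and one concentrated on a single grid point, standing in for the case in which the missing information approaches nil. The function names, the choice N = 100 and the value 3/10 for the concentrated case are hypothetical and serve only this illustration.

```python
from fractions import Fraction


def grid(n):
    """The values 0, 1/N, 2/N, ..., 1 that the variable F may take."""
    return [Fraction(k, n) for k in range(n + 1)]


def expected_value(values, probabilities):
    """Sum of each value of F weighted by the probability assigned to it."""
    return sum(v * p for v, p in zip(values, probabilities))


N = 100  # hypothetical grid resolution
values = grid(N)

# No available information: every value of F is equally probable.
flat = [Fraction(1, len(values))] * len(values)
mean_flat = expected_value(values, flat)

# Missing information near nil: all probability sits on one value (here 3/10).
true_value = Fraction(3, 10)  # hypothetical
delta = [Fraction(1) if v == true_value else Fraction(0) for v in values]
mean_delta = expected_value(values, delta)

print("flat distribution:", mean_flat)
print("concentrated distribution:", mean_delta)
```

Exact fractions are used so that the flat case yields ½ without rounding error. Note that the flat distribution has no single most probable value, whereas its expected value is well defined.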
A procedure
has been described that assigns a numerical value to Pr( Wi given Wi OR Wj ), where Wi
and Wj are ways in which a state can occur in the unobserved
state-space { T1 given Um, T2 given Um…
} of an inference. It can be shown that, by variation of the values of the
indices i and j over their ranges and assignment of a value to Pr( Wi
given Wi OR Wj ) by this procedure, one
generates sufficient data for assignment of a numerical value to Pr( Tl
given Um ) over each value
of the index l and each value of the
index m.
This
completes our derivation of maximum entropy expectation. It is apt to classify
maximum entropy expectation as a principle of reasoning because: a) it is based
upon the principle of entropy maximization and b) the latter principle is a
fact, under the probabilistic logic.
In the
nineteenth century, a group of logicians advocated a reform in the field of mathematical
statistics. This reform became known as “frequentism.”
They
advocated this reform for the purpose of eliminating the violations of the law
of non-contradiction that resulted from the arbitrariness of the “prior”
probability distribution function. In the twentieth century, this reform became
doctrine in the field of mathematical statistics and was embraced by most
scientists. However, this reform could not have been more damaging to the
advancement of science for, under frequentism, patterns could not be discovered
and knowledge could not be created.
Frequentism
is the idea that the constant numerical value of the limiting relative
frequency of a state is assigned to the probability of this state. In
particular,
Pr( Wi given Wi OR Wj ) := c          (4)
where the
constant c is a real number. Under
this method of assignment, the value which is assigned to Pr( Wi given Wi OR Wj ) is insensitive to the nature of the “prior”
probability distribution function. The insensitivity eliminates the violation
of the law of non-contradiction. However, there is an unsavory side effect.
What is unsavory
is best told by way of an example. In the example, c is 0.6931…, Wi is rain given
cloudy and Wj is no rain
given cloudy. Thus, Wi given Wi OR Wj
is { (rain given cloudy ) given [ (rain given
cloudy ) OR ( no rain given cloudy ) ]
}. The relationship between { (rain given
cloudy ) given [ (rain given cloudy ) OR ( no rain
given cloudy ) ] } and 0.6931… is
one-to-one. Thus, the value 0.6931… determines that the state is { (rain given cloudy ) given [ (rain given
cloudy ) OR ( no rain given cloudy ) ] }.
The unsavory
side effect is that, with the state { (rain
given cloudy ) given [ (rain given cloudy ) OR ( no rain
given cloudy ) ] } determined, the
observed state-space { cloudy, not cloudy } cannot be swapped out and another observed state-space, say { barometric pressure falling, barometric pressure rising }, swapped in
to determine whether this would reduce the conditional
entropy of the inference to { rain, no rain }. This cannot be done, for to
do so would be to change the value from 0.6931… to some other value but this
value is fixed under the defining precept of frequentism. To generalize from
this example, pattern discovery cannot take place under frequentism nor can
knowledge be created because frequentism implies the states in observed
state-spaces to be of fixed description. Viewed from the perspective of the
probabilistic logic, frequentism fails from violation of the principle of
entropy minimization.
Under
maximum entropy expectation, the situation is different. The numerical value
assigned to Pr( Wi given Wi OR Wj ) is the expected
value of F( Wi given Wi OR Wj ) in the function Pr[ F( Wi given Wi OR Wj ) given E2 ]
and, rather than being fixed, the expected value varies with the
evidence E2 and with the
identity of the state Wi
given Wi OR Wj. Thus, swapping in a new
observed state space simply changes the expected value. It follows that, under
maximum entropy expectation, patterns can be discovered, knowledge can be
created and the principle of entropy minimization can be followed.
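The comparison of observed state-spaces by conditional entropy can be sketched as follows. Each candidate state-space is described by the probability of each observed state together with the probabilities of rain and no rain given that state. All of the numbers are hypothetical values chosen for this illustration. In the approach described in the text they would be assigned by maximum entropy expectation rather than typed in by hand.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def conditional_entropy(state_space):
    """Entropy of the outcome averaged over the observed states.

    state_space is a list of pairs:
    (Pr(observed state), [Pr(rain given state), Pr(no rain given state)]).
    """
    return sum(weight * entropy(outcome_probs)
               for weight, outcome_probs in state_space)


# Hypothetical probabilities for { cloudy, not cloudy }.
cloud_space = [(0.5, [0.6, 0.4]), (0.5, [0.2, 0.8])]

# Hypothetical probabilities for { pressure falling, pressure rising }.
pressure_space = [(0.5, [0.7, 0.3]), (0.5, [0.1, 0.9])]

h_cloud = conditional_entropy(cloud_space)
h_pressure = conditional_entropy(pressure_space)

print("conditional entropy, cloud state-space:", h_cloud)
print("conditional entropy, pressure state-space:", h_pressure)
```

With these made-up numbers the pressure-based state-space yields the lower conditional entropy, so it would be the one preferred under entropy minimization. The point of the sketch is only that the comparison requires the probabilities to change when the observed state-space changes.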
Under the
probabilistic logic, two principles of reasoning discriminate the one correct
inference from the many incorrect ones in the construction of a model. The
first is entropy minimization. The second is maximum entropy expectation.
Christensen calls this pair of principles “entropy minimax.”
When the
construction of a model is guided by entropy minimax, the elements of the
observed state-space C are optimized
abstractions. If there are two or more of them, it is customary and apt to call
these abstractions “patterns.” Thus, in the construction of a model under
entropy minimax, patterns are discovered. The process by which they are
discovered is called “entropy minimax pattern discovery”.
Creation of the maximum possible knowledge
If knowledge
is created by a scientific investigation, this is by the construction of a
model. When this model is constructed under entropy minimax, the greatest
possible knowledge is created from fixed resources, as “knowledge” is defined
in Christensen’s theory of knowledge. Christensen’s theory has the merit of
providing the only known logical approach to the construction of a model.
When a model
is constructed under entropy minimax, logical errors are eliminated from it. It
follows that the degradation in the performance of this model that would result
from these errors is eliminated. Thus, the model performs at the highest
possible level.
Reduction to the deductive logic
If the
principles of reasoning are entropy minimax for the whole of logic, this logic
must reduce to the deductive logic in the circumstance that every kind of
missing information is reduced to nil. This is the case.
The
reduction to the deductive logic occurs in the following manner. Associated
with the deductive logic is a single principle of reasoning. This principle
states that an argument is correct if and only if it matches the abstract
argument called modus ponens or the
abstract argument called modus tollens.
Modus ponens states:
Major premise: A implies B
Minor premise: A
Conclusion: B
Modus tollens states:
Major premise: A implies B
Minor premise: NOT B
Conclusion: NOT A
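The two argument forms can be checked by brute force over a truth table. The sketch below assumes that "A implies B" is read as material implication, which is false only when A is true and B is false. The helper names are hypothetical. A third form, affirming the consequent, is included only as a contrast to show what the check rejects.

```python
from itertools import product


def implies(a, b):
    """Material implication: false only when a is true and b is false."""
    return (not a) or b


def is_valid(premises, conclusion):
    """True if the conclusion holds in every case where all premises hold."""
    return all(
        conclusion(a, b)
        for a, b in product([False, True], repeat=2)
        if all(premise(a, b) for premise in premises)
    )


# Major premise: A implies B. Minor premise: A. Conclusion: B.
modus_ponens = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: a],
    lambda a, b: b,
)

# Major premise: A implies B. Minor premise: NOT B. Conclusion: NOT A.
modus_tollens = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: not b],
    lambda a, b: not a,
)

# Contrast: A implies B, B, therefore A (not a valid form).
affirming_consequent = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: b],
    lambda a, b: a,
)

print(modus_ponens, modus_tollens, affirming_consequent)
```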
In the
language of Christensen’s theory of knowledge, A is an example of a pattern while B is an example of an outcome. Modus
ponens and modus tollens express
the one-to-one relationship between patterns and outcomes, under entropy
minimax, when the missing information for a deductive conclusion is reduced to
nil.
Mathematics
results from repeated application of the two arguments. Thus, mathematics
results from the conformity of its arguments to entropy minimax.
If
Christensen’s theory of knowledge is correct, perfect knowledge can be
expressed under entropy minimax pattern discovery. This is the case. Perfect
knowledge results from reduction of the logic to the deductive logic by the
elimination of all missing information for a deductive conclusion.
If
Christensen’s theory of knowledge is correct, perfect ignorance can be
expressed under entropy minimax pattern discovery. This is the case. Perfect
ignorance results from failure to discover patterns.
No
observational data conflict with the thesis that the principles of reasoning
are entropy minimax. The quantity of observational data that bear on this issue
is great. Much of this data results from tests of models or arguments that
were, in effect, built under the principle of entropy minimization, the
principle of entropy maximization or the principle of maximum entropy
expectation, that have large domains of validity and that have worked perfectly
under intensive testing over periods ranging from decades to millennia. They
are:
o Modus ponens and modus tollens plus all of the works that are built by these arguments, including mathematics,
o the theory of fair gambling devices,
o the modern theory of heat aka thermodynamics and,
o the modern theory of communication.
A study
(Christensen, 1986a) compares the performances of models built by entropy
minimax pattern discovery to the performances of models built under the method
of heuristics. It is reported that the former models consistently outperformed
the latter. In certain cases, the degree of outperformance was very great.
Questions
addressed by entropy minimax pattern discovery have included:
o whether a drug will be found to retard lymphoid leukemia, lymphocytic leukemia or melanocarcinoma in mice, based on this drug’s physical, chemical and biological features,
o whether a patient will be found to have heart disease subsequent to his/her electrocardiogram, ECG,
o whether an ECG waveform indicates a normal or an abnormal heartbeat,
o which features of an ECG or other waveform contain the most information about outcomes,
o whether a biopsy will reveal prostate cancer, conditioned on a patient’s level of prostate specific antigen, PSA, plus the values of other independent variables,
o whether a biopsy will reveal cervical cancer, based on spectral analysis of data from tissue fluorescence,
o whether a biopsy will reveal breast cancer, based on electrical potentials produced by the patient’s heart beats,
o whether patients with lymphoma, chronic granulocytic leukemia or prostate cancer have high or low survival risks,
o whether patients surgically treated for coronary artery disease have high or low survival risks, based on catheterization and clinical data,
o whether a paroled prison inmate will return to prison,
o whether depression and related psychological states are related to early childhood memories,
o whether nuclear reactor fuel will be sufficiently deformed under accident conditions to obstruct coolant flow,
o whether nuclear reactor fuel will be found to be leaking radioactive substances if removed from a reactor and tested,
o whether a gasoline storage tank will be found to be leaking a carcinogen into an aquifer or an explosive into adjacent basements, if dug up and tested and,
o how a photographic or video image should be classified as to type.
Decisions
that have been supported by models built by entropy minimax pattern discovery
include:
o the course of treatment for non-Hodgkin’s lymphoma,
o the course of treatment for disorders of the cervical spine,
o which factors, in addition to
o which factors (now referenced in medicine as International Prognostic Indices, IPIs) indicate high risk for patients with lymphoma, chronic granulocytic leukemia or prostate cancer, for consideration in treatment selection,
o whether to submit a request for approval of a diagnostic technique for breast cancer to the U.S. Food and Drug Administration, FDA,
o whether to submit a request for approval of a diagnostic technique for cervical cancer to the FDA,
o whether the U.S. Nuclear Regulatory Commission should require further research before certifying that nuclear reactors are adequately safe from loss of coolant accidents,
o whether to restart a nuclear reactor containing parts that might fail in service,
o whether to suspend licensing of nuclear reactors,
o when to replace a leakage-prone gasoline storage tank,
o the level of water that should be kept behind a dam, in light of the long range forecast for precipitation,
o how an electric utility should plan for demand for air conditioning and,
o whether an electric utility’s rate should be adjusted, in light of the long range forecast for precipitation.
Factors
discovered by entropy minimax pattern discovery are embedded in the medical
standard for the classification of patients with non-Hodgkin’s lymphoma.
The
situation in which logic has been completed but neither the scientific nor
academic community has come to grips with this advance leaves a great deal of
work to be done and very few people or organizations with the ability to do
this work. KnowledgeToTheMax offers
its help in filling this gap through services that include:
o teaching of logic, not excluding the inductive logic,
o consultancy on science policy,
o consultancy on curriculum reform in education,
o management of theoretical aspects of scientific studies and,
o construction of ultra-optimized, logical, maximally effective models.
Currently,
the staff of KnowledgeToTheMax consists
solely of the firm’s owner, Terry Oldberg. From 1975 to 1982 Oldberg managed
the theoretical side of the research program of the electric utilities of the
U.S. on the performance of materials in the cores of their nuclear reactors. In
this capacity, he, Dr. Ronald Christensen and their colleagues pioneered the
application of entropy minimax pattern discovery in engineering research.
Oldberg has held positions in research, management and engineering with the
Lawrence Livermore National Laboratory, General Electric Company, Electric
Power Research Institute, Alltel Healthcare Information Systems and Picturetel
Corporation.
Oldberg
holds the B.M.E. degree in mechanical engineering from Cornell University, the
M.S.E. degree in mechanical engineering from the University of Michigan and the
M.S.E.E. degree in electrical engineering from Santa Clara University. He is a
registered professional engineer in nuclear engineering in the State of
California.
A list of
Oldberg’s publications relating to entropy minimax pattern discovery is
available in the bibliography.
Limitation
In building
models, KnowledgeToTheMax operates
under a technology sharing agreement with the developer of entropy minimax
pattern discovery, Ronald Christensen. Under this agreement, KnowledgeToTheMax may freely employ
proprietary technology owned by Christensen for the benefit of non-profit
organizations and may employ the same technology for the benefit of others with
Christensen’s permission.
In some
cases, the scientific community possesses a degree of mechanistic understanding
of the phenomenon being modeled. In these cases, the amalgamation of a
mechanistic model with an empirical one created by entropy minimax pattern
discovery provides the benefit of greater knowledge or a larger domain of
applicability.
In the
amalgamated model, the mechanistic model plays two roles. First, certain of its
independent variables may provide independent variables for the empirical
model. Second, the inferences that are made to the outcomes of events by the
mechanistic model may serve as a constraint on entropy maximization.
Probability
theory assumes that the sets in the collection of sets that are measured by Pr(.) are crisply defined. In the
construction of a model, though, the builder often encounters situations in
which these sets are not crisply defined. In these situations, a similar logic
applies but with set theory replaced by fuzzy set theory and the various
measures by their fuzzy equivalents.
On Oct. 23,
2008, Terry Oldberg of KnowledgeToTheMax presented
a lecture entitled “Information Theory: Maximizing Knowledge” to a meeting of
the American Nuclear Society in San Francisco, California.
On Nov. 20,
2008, Oldberg presented a lecture entitled “Maximizing Knowledge” to a meeting
of the American Chemical Society in Santa Clara, California. The announcement
for the meeting is posted here.
On Feb. 11,
2009, Oldberg presented a lecture entitled “Maximizing Knowledge” to a meeting
of the American Society for Quality in Santa Clara, California.
On May 7,
2009, Oldberg presented a lecture entitled “Maximizing Knowledge” to a meeting
of the American Institute of Chemical Engineers in Berkeley, California.
A
bibliography is available by clicking here. The
literature is large and not completely user friendly. Proofs of some theorems
are sketchy or absent. Hence, it would be far more cost effective to engage a
tutor than to attempt to climb the learning curve unaided.
For further
information, please contact the owner of KnowledgeToTheMax,
Terry Oldberg. He may be reached at terry@KnowledgeToTheMax.com
(Los Altos Hills, California).
Title: Offerings of KnowledgeToTheMax, Third Edition
Author: Terry Oldberg
Publisher: KnowledgeToTheMax,
Los Altos Hills, CA
Publication
date: November 14, 2009
COPYRIGHT ©
2008, 2009 by Terry Oldberg
ALL RIGHTS RESERVED
No part of this work may be reproduced or used in any form or by any
means - including Web distribution or information storage and retrieval systems
– without the written permission of the author. To request permission, contact
the author at terry@KnowledgeToTheMax.com.