Offerings of KnowledgeToTheMax

 

______________________________________________________________________________________________________________________

This message targets and should be of interest to: science policy makers; funders of scientific studies; research workers; scientists; philosophers; educators; political leaders; physicians; engineers; laymen.

________________________________________________________________________________________________

 

Abstract

In an advance of breathtaking importance, logic has been completed. The completion was effected by extending logic from its roots in the deductive logic through the inductive logic. In the extension of logic, the principles of reasoning were discovered. One result is a new found ability to build a scientific model (aka scientific theory) under the principles of reasoning. Under these principles, the construction of a model creates the maximum possible knowledge from fixed resources.

The news of this advance has reached few of the people who need to know about it. On this Web site, the firm KnowledgeToTheMax offers a primer on the completed logic. In key respects this primer is the first to be published.

The extension of logic makes it possible, for the first time, to eliminate logical error from a model. Most models in use today, in fields of endeavor that include medicine, engineering, law, business and government, are guilty of this kind of error. Bad consequences for us result from the error.

Error degrades the performance of a model. Conversely, an absence of error boosts the performance of a model to the maximum possible level. In some cases, the degree of boost is found to be of an astounding order of magnitude.

The completion of logic exposes a gap: very few scientists possess the skills that are necessary for construction of a logically sound, optimally effective model. KnowledgeToTheMax offers its assistance to the scientific, educational and business communities in filling this gap.

 

Table of contents

Summary

Introduction

Logical context

The probabilistic logic

States

State-spaces

Inferences

Events

Observed events

Unobserved events

Population

Sample

Conditional states

State transition probabilities

An early principle of reasoning

The law of non-contradiction

The method of heuristics

The advent of optimization

Abstraction

Ways

Measure theory

The probability measure

Shannon’s measure

Existence and uniqueness

The “entropy”

The “conditional entropy”

The “information”

The “missing information”

The principle of entropy maximization

An application

Another application

Shannon’s theory of the optimal encoder

The principle of entropy minimization

Shannon’s theory of the optimal decoder

Shannon’s theory of communication

Christensen’s theory of knowledge

The straight rule

The principle of maximum entropy expectation

Frequentism

Entropy minimax

Pattern discovery

Creation of the maximum possible knowledge

Peak performance

Reduction to the deductive logic

Perfect knowledge

Perfect ignorance

Empirical basis

Past applications

Offerings

Staff

Limitation

Adding a mechanistic model

Fuzzy sets

News

Bibliography

Contacting us

Citing this work

Copyright

 

Summary

The word “science” comes to us from the Latin word “scientia,” meaning “demonstrable knowledge.” For brevity, this document references “scientia” by the word “knowledge.”

Prior to the year 1975, science was undermined by the existence of unresolved foundational issues. Among these were:

o   The origins of patterns,

o   The nature of knowledge and

o   The principles of reasoning.

The delivery system for mankind’s knowledge was its collection of scientific models. A model was a procedure for making inferences. On each occasion on which an inference was made, each element in a collection {a, b…} of inferences was a candidate for being made. Which candidate was correct? The model builder had to decide!

Logic was the science of the principles that discriminated the one correct inference in the collection {a, b, …} of inferences from the many incorrect ones. These principles were called the “principles of reasoning.”

However, logic was incomplete. Aristotle had described the principle of reasoning for the deductive branch of logic but, in building a model, the builder had to employ the inductive branch of logic. The inductive branch had yet to be described.

The problem of extending logic from its deductive branch through its inductive branch had come to be known as the “problem of induction.” Prior to 1975, this problem was unsolved. In lieu of the principles of reasoning, model builders coped by employing intuitive rules of thumb called “heuristics” to determine which inference in the collection {a, b…} of inferences was correct. According to one frequently employed heuristic, the correct inference was the one of greatest beauty.

The first principle of reasoning was the law of non-contradiction. In employing heuristics, scientists violated this law.

As a result, models were highly susceptible to making logical errors. The existence of these errors degraded the performances of the associated models. The degree of degradation could be and often was great. People died and suffered other unpleasant consequences from the errors.

In 1948, the mathematician and communications engineer Claude Shannon opened up a path out of this quagmire. Shannon identified a measure which, it could be shown, was the unique measure of an inference. The measure of an inference was the information that was missing in this inference, for a deductive conclusion.

The existence and uniqueness of the measure of an inference signified that the identity of the correct inference could be established by optimization. In an optimization, the inference identified as correct was the one, among the candidates in the collection {a, b…} of inferences that could be made by a model, for which the missing information was (depending upon the type of inference) minimal or maximal.

Shannon described a pair of applications for the idea of optimization; these applications were in the design of equipment for the telecommunications industry. In one of them, the inference made by the device called an “encoder” was optimized by maximizing the missing information in this inference. In the other, the inference made by the device called a “decoder” was optimized by minimizing the missing information in this inference.

The function of a decoder was to translate an encoded message such as “0110111…” to an un-encoded message such as “Mary had a little lamb.” Early in the 1960s, the engineer, lawyer and theoretical physicist Ronald Christensen got the idea that the problem of induction could be solved by construing a model to be the algorithm for the decoder of a “message” consisting of the sequence of outcomes of statistical events.

There was a barrier to implementation of this idea. Shannon’s theory of the optimal decoder was incomplete. Christensen completed it.

Using his idea, Christensen explained the origins of patterns, described the nature of knowledge and enunciated the principles of reasoning. “Knowledge” was the information which the user of a model obtained about the outcomes of statistical events by recognition of the patterns that preceded them. In the construction of a model, one of two principles of reasoning was to maximize the knowledge. The other was to constrain the process of maximization by empirical data and other sources of information. Patterns originated in the construction of models under the principles of reasoning.

By 1985, Christensen’s idea had been reduced to practice, published and employed in a large number of real-world applications. In comparison to models built by heuristics, models built under Christensen’s principles of reasoning consistently performed better. Often, they performed much better. The consistent outperformance supported the belief that Christensen’s principles of reasoning were the genuine principles of reasoning.

Today, there is an anomaly in which, with near universality, communications engineers build their models by optimization while, with near universality, research workers build their models by the method of heuristics. The users of the models that are constructed in the latter manner pay for the lack of optimality. Sometimes they pay with their lives.

The firm KnowledgeToTheMax fills a gap: very few research workers are equipped to build a logically sound, optimally effective model. The firm offers services that include the conduct of tutorials, consultation on curriculum reform in education and construction of models.

This completes a summary of the offerings of KnowledgeToTheMax. An expanded version of the same topic follows.

 

Introduction

In the period of two years that ended in 1980, a signal event occurred in the history of meteorology. At the start of this period, centuries of research had extended the span of time over which the weather could be forecast, with statistical significance, to no more than 1 month. Two years later, this span had been extended to 12 to 36 months – an improvement by a factor of 12 to 36. At this point, it would be instructive and motivating for the reader to learn of the features of this model. To do so, please click here.

The factor of 12 to 36 improvement had been effected through the use of a method of model building that was new to meteorology. How could a mere switch in the method of its construction effect such an enormous improvement in the performance of a model? This question falls naturally within the context of logic.

  

Logical context

A model is a procedure for making inferences. Each time an inference is made, there are possibilities a, b… that could be made. Which inference in the collection { a, b… } of possible inferences is the correct one? The model builder must decide!

Logic is the science of the principles that discriminate the one correct inference from the many incorrect ones. These principles are called the “principles of reasoning.” For the deductive branch of logic:

o   there is a single principle of reasoning and,

o   this principle has been known since Aristotle described it 23 centuries ago.

For the deductive logic, the principle of reasoning dictates the conformity of arguments to the form called modus ponens or the form called modus tollens.

Discovery of the principles of reasoning for the whole of logic is called “the problem of induction” after “induction,” the process by which the model builder generalizes from descriptions of observed events to descriptions of unobserved ones. To be useful to us, a model must describe the unobserved ones.

The problem of induction thwarted the best efforts of thinkers over several millennia. Many otherwise well informed people believe the problem remains unsolved. In many ways, our society is organized as if this belief were true. For example, while it is impermissible to publish an illogical deductive argument in a mathematical journal, it is permissible to publish an illogical inductive argument in a scientific journal. However, over the period of four centuries that ended in 1975, a solution was found to the problem of induction. Solving this problem made the principles of reasoning known and available for construction of a model.

The ideas that were to foster a solution were those of measure, inference, optimization and missing information. If an inference had a unique measure, then the one correct inference in the collection {a, b…} of inferences that were possibilities for being made by a model could be identified by optimization. In the optimization of an inference, the possibility identified as correct would be the one whose measure was minimal or maximal. In time, this measure was discovered in a generalization from the deductive logic; this generalization came to be known as the “probabilistic logic.”

Over the period of about 4 centuries that ended in 1975, it was discovered that the probabilistic logic held the unique measure of an inference. The unique measure was the missing information in this inference, for a deductive conclusion. This discovery yielded a pair of principles of reasoning.

Inferences of infinite number were candidates for being made by a model. All but a few of these inferences were incorrect. Thus, if a model builder were not guided by the principles of reasoning, the associated model was virtually certain to make incorrect inferences. The incorrect inferences degraded the performance of the model. The magnitude of the degradation could be and often was great.

Over the period of 27 years that ended in 1975, it became possible to eliminate the incorrect inferences by optimization of inferences. Immediately, communications engineers seized this opportunity. This was to revolutionize the communications industry. HDTV was to become one of the fruits from this revolution. Inexplicably, research workers failed to seize the same opportunity.

As a result, the lion’s share of today’s models make incorrect inferences. We employ these inferences in making decisions on issues of importance to us. For example, we employ them in making decisions on medical issues for which alive and dead are the outcomes.

That few of us are aware of the principles of reasoning or how they operate is a barrier to eradication of the incorrect inferences that plague us. To gain this awareness, one must delve into the details of logic. The rudiments of logic are presented in the primer that follows. The primer can be speed-read in about half an hour, probably without attaining full understanding of the mathematical details. More thorough study of this topic, preferably with the help of a competent tutor, is advised for scientists, philosophers, educators, intellectuals, professionals, business leaders and political leaders, among others.

In one’s study of the details of logic, the place to start is with the probabilistic logic.

 

The probabilistic logic

Working in the sixteenth century, the mathematician, physician and habitual gambler Girolamo Cardano discovered or anticipated the discovery of a surprisingly large proportion of the ideas that were to play key roles in solving the problem of induction. He described his ideas in Liber de Ludo Aleae (“The Book of Games of Chance”). One of Cardano’s ideas was the probabilistic logic.

In the deductive logic, every proposition was true in either 0% or 100% of the instances in which this proposition was asserted. In reality, though, a proposition might be true in a proportion of instances in which it was asserted lying between 0% and 100%. Cardano got the idea that logic could be freed from the unrealistic restriction to either 0% or 100%. By this idea, he created the generalization from the deductive logic that came to be known as the “probabilistic logic.” In this logic, the “probability” of a proposition was the proportion of the instances in which this proposition was asserted in which it was true. Cardano’s innovation made it possible for logic to express the important idea that information needed for a deductive conclusion from an inference could be missing; for example, information needed for a deductive conclusion about the outcome of a horse race could be missing.

 

States

This topic of “states” and the ten topics that follow it are largely devoted to tedious but necessary definitions of terms.

The propositions that are referenced by the probabilistic logic are examples of states. A “state” is a description of a physical object or “body.” Cloudy is an example of a state; it describes a body that is a region of the Earth.

That a proposition is a state signifies that this proposition can be validated or invalidated by observation. A proposition that is a state is validated each time the associated body is observed and found to be in the state that is claimed for it; otherwise, it is invalidated. The ability of one to validate or invalidate its propositions in this way ties the probabilistic logic to science. In science, a model is invalidated if a single one of its propositions is invalidated.

 

State-spaces

A complete set of alternate descriptions of a body is called a “state-space” for this body.

If the state in a state-space is observed, the associated state-space is said to be “observed.” The set { cloudy, not cloudy } is an example of an observed state-space.

If the state in a state-space is unobserved, the associated state-space is said to be “unobserved.” The set { rain, no rain } is an example of an unobserved state-space.

 

Inferences

An “inference” is an extrapolation from a state in an observed state-space of a body to a state in an unobserved state-space of the same body. For example, it is an extrapolation from the state cloudy in the observed state-space { cloudy, not cloudy } of a region of the Earth to the state rain in the unobserved state-space { rain, no rain } of the same region.

 

Events

An “event” is a pairing of a state in the unobserved state-space Y with a state in the observed state-space that participates with Y in making an inference. The pairing { cloudy, rain } is an example of an event.

 

Observed events

An “observed event” is a datum that specifies the state in the observed state-space of an event and the state in the unobserved state-space of the same event. An example of an observed event is “cloudy, rain.”

 

Unobserved events

An “unobserved event” is a datum that specifies the state in the observed state-space of an event but not the state in the unobserved state-space of the same event. An example of an unobserved event is “cloudy, x,” where x designates a variable which takes on the state in the unobserved state-space as its value.

 

Population

A complete set of observed and unobserved events is called a “population.”

 

Sample

A subset of a population is called a “sample.” The observed events belong to a sample.

 

Conditional states

A state in an unobserved state-space may be conditional upon a state in an observed state-space. For example, the state rain may be conditional upon the state cloudy. A state that is formed in this way is called a “conditional state.” Rain given cloudy is an example of a conditional state.

 

State transition probabilities

As a conditional state is an example of a proposition, under the probabilistic logic a conditional state has a probability. This probability is called a “state-transition probability.” Pr( rain given cloudy ) is an example of such a probability, where Pr(.) signifies the probability.

 

An early principle of reasoning

A logic needed principles of reasoning that identified the one correct inference in the set {a, b…} of alternatives, when an inference was made by a model. In Liber de Ludo Aleae, Cardano supplied a principle of reasoning whose scope was limited to models of games of chance. Under this principle, equal numerical values were assigned to the probabilities of the ways in which an outcome could occur in a game of chance, provided that this game was fair. Many centuries later, a generalization from Cardano’s principle would become the principle of entropy maximization; soon, you’ll learn about this principle.

 

The law of non-contradiction

The first principle of reasoning is the law of non-contradiction. This law cannot be derived. Instead, it serves as a portion of the definition of what it means to be “logical.” The law states that a proposition is false if it contradicts itself. For example, the proposition “That card is the ace of spades and is not the ace of spades” is false because it violates the law of non-contradiction.

As the principles of reasoning identify the one correct inference in the set {a, b…} of alternatives for being made by a model, by the law of non-contradiction the proposition is false that these principles identify more than one inference as the one correct inference. Over the years in which the problem of induction remained unsolved, the major barrier to a solution was satisfying the law of non-contradiction.

 

The method of heuristics

A “heuristic” is an intuitive rule of thumb that identifies the one correct inference in the set {a, b…} of alternatives for being made by a model. In every instance in which a heuristic identifies the one correct inference, at least one different heuristic identifies a different inference as the one correct inference. As it identifies more than one inference as the one correct inference, the method of heuristics violates the law of non-contradiction.

 

The advent of optimization

Prior to the year 1948, model builders lacked an alternative to the method of heuristics. In that year, the idea of optimizing inferences was published by Claude Shannon. Under optimization, the correct inference was the one whose unique measure was minimal or maximal. By identifying the correct inference uniquely, optimization satisfied the law of non-contradiction.

 

Abstraction

In the construction of a model, a key idea is that of abstraction. A model is abstracted (removed) from some of the details of the real world.

To be more precise, a state A is said to be “abstracted” from the states B, C… if and only if A is the inclusive disjunction of B, C…; in other words, A is the equivalent of the state B OR C OR… A is said to be “abstracted” from B, C… because the description provided by A is removed from the descriptions provided by B, C… For example, the state male OR female is removed from the gender difference between the state male and the state female.

How to abstract his/her model from the details is one of the problems faced by the builder of a model. This problem is solved by the principles of reasoning.

 

Ways

A state that is abstracted from no other state is called a “way in which a state can occur,” or “way” for short. In the example provided above, the states male and female are examples of ways. The state male OR female is not a way, for it is abstracted from the other two states.

 

Measure theory

If an inference were to be optimized, it had to possess a unique measure. Under Cardano’s definition of “probability,” a probability was an example of a measure. It was not, however, the measure of an inference. Early in the twentieth century, the mathematician Henri Lebesgue generalized Cardano’s idea to measures in general. Lebesgue’s generalization set the stage for discovery of the measure of an inference.

Lebesgue’s generalization is called “measure theory.” Under measure theory, a measure is a mathematical function that maps each set in a collection of “measurable sets” to a non-negative real number. That this function “maps” signifies that, for every set in the collection, there is exactly one non-negative real number.

Under a precept of measure theory, the measure of an empty set is nil. Under the precept called “additivity,” the measure of the union of disjoint sets is the sum of the measures of the individual sets.

The union of several sets is the set of all elements that belong to at least one set. Sets are said to be “disjoint” if they do not intersect.

Under precepts of measure theory governing membership in the collection of measurable sets, if a collection contains the set A and the set B then this collection also contains the sets B\A and B∩A.

B\A is called the “set difference.” It is the set of elements of B that do not belong to A. B∩A is called the “set intersection.” It is the set of elements of B that also belong to A.
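
For the reader who wishes to see these precepts in action, the following sketch (in Python) checks them on a small finite example. The uniform probability measure over the six faces of a die is an illustrative assumption, not part of the primer.

    # A sketch of the precepts of measure theory, assuming (for illustration only) the
    # uniform probability measure over the six faces of a die.
    omega = {1, 2, 3, 4, 5, 6}

    def pr(s):
        # the measure: it maps each measurable set to a non-negative real number
        return len(s) / len(omega)

    a, b = {1, 2}, {5, 6}                  # two disjoint measurable sets
    print(pr(set()) == 0)                  # the measure of the empty set is nil
    print(pr(a | b) == pr(a) + pr(b))      # additivity: the measure of the union is the sum
    print(pr(b - a), pr(b & a))            # the set difference B\A and the intersection B∩A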

 

The probability measure

Probability is an example of a measure. Let the probability measure be designated by the function Pr(.). The dot between the parentheses symbolizes that element in the collection of measurable sets which is measured by Pr(.).

 

Shannon’s measure

Halfway through the twentieth century, the mathematician and communications engineer Claude Shannon described the measure that came to be named after him: “Shannon’s measure.” Let Shannon’s measure be designated by the function Sh(.).

The collection of sets which were measurable by Sh(.) included an unobserved state-space plus the observed state-space which participated with this unobserved state-space in making an inference. Let the unobserved state-space be designated by Y and the observed state-space be designated by X.

By the precepts of measure theory, the collection of measurable sets also contained Y\X and Y∩X. Under the precept of additivity,

 Sh( Y\X ) = Sh( Y ) – Sh( Y∩X )                                                               (1)

Shannon stipulated that Sh( Y\X ) was a function with a specific mathematical form that was known to Shannon; by this formula, Sh( Y\X ) was the measure of an inference. Thus, the set being measured by Sh(.), namely Y\X, had to be an inference. It could be shown that, under the probabilistic logic, Sh( Y\X ) was the unique measure of an inference. Though Shannon did not realize it, the existence of the unique measure of an inference signified that the problem of induction could be solved by optimization.

In the circumstance that Sh( Y∩X ) was nil, it followed from equation (1) that

Sh( Y\X ) = Sh( Y )                                                                         (2)

By inspection of equation (2), Sh( Y\X ) reduced to Sh( Y ) in the circumstance that Sh( Y∩X ) was nil.
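
For the reader who wishes to verify equation (1) numerically, the following sketch (in Python) does so on an invented joint distribution over the cloudy/rain example; the numbers are illustrative only.

    import numpy as np

    # Invented joint probabilities for X = { cloudy, not cloudy } and Y = { rain, no rain };
    # rows are states of X, columns are states of Y.
    joint = np.array([[0.30, 0.20],
                      [0.05, 0.45]])
    px = joint.sum(axis=1)                 # marginal probabilities of the states in X
    py = joint.sum(axis=0)                 # marginal probabilities of the states in Y

    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    sh_y = h(py)                                                        # Sh( Y ), the entropy
    sh_y_minus_x = sum(px[i] * h(joint[i] / px[i]) for i in range(2))   # Sh( Y\X ), the conditional entropy
    sh_y_and_x = np.sum(joint * np.log2(joint / np.outer(px, py)))      # Sh( Y∩X ), the information

    print(np.isclose(sh_y_minus_x, sh_y - sh_y_and_x))                  # equation (1): True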

 

Refresher:

o    Pr(.) designates the probability measure.

o    Sh(.) designates Shannon’s measure.

Existence and uniqueness

Under the probabilistic logic, Pr(.) existed and was the unique measure of a state. Under the same logic, Sh(.) existed and was the unique measure of an inference.

 

The “entropy”

The function Sh( Y ) had a specific mathematical form. A colleague informed Shannon that, in developing the modern theory of heat (aka thermodynamics), physicists had named this form the “entropy.” In this way, Sh( Y ) became known as the “entropy.” The entropy was the measure of an inference from the observed state-space X to the unobserved state-space Y, in which X contained a single state. The single state in X was abstracted from the states in Y.

 

The “conditional entropy”

By its form, the function Sh( Y\X ) was the measure of an inference from X to Y, in which X contained several states. By analogy to the term “entropy,” this function came to be termed the “conditional entropy.”

 

The “information”

Shannon worked in the field of communications engineering. Communications firms, such as telephone companies, understood that their role was to move information. However, they did not know what “information” was! That they did not know precluded optimization of their operations. Shannon suggested that the mathematical formula for the “information” moved by communications firms was given by the function Sh( Y∩X ). In particular, Sh( Y∩X ) was the “information” about the state in Y, given the state in X.

 

The “missing information”

Under equation (1), Sh( Y\X ) varied inversely with Sh( Y∩X ). When Sh( Y∩X ) was at its maximum value, Sh( Y\X ) was at its minimum value. Conversely, when Sh( Y∩X ) was at its minimum value, Sh( Y\X ) was at its maximum value.

As Sh( Y∩X ) was called the “information” about the state in Y, given the state in X, it was apt to call Sh( Y\X ) the “missing information” in the inference from X to Y. As Sh( Y\X ) was also called the “conditional entropy,” the “conditional entropy” of an inference was synonymous with the “missing information” in this inference. Similarly, the “entropy” of an inference was synonymous with the “missing information” in this inference.

The phrase “missing information” was shorthand for “missing information for a deductive conclusion.” Under the probabilistic logic, the inductive logic differed from the deductive logic in the respect that there was missing information for a deductive conclusion in the former branch of logic but not in the latter. In this way, the probabilistic logic answered the previously unanswered question of the essential difference between the deductive and inductive branches of logic.

The missing information has a precise mathematical formula that you can look up on the Web. To avoid getting embroiled in too many details, we’ll skip the formula and illustrate the idea with a story.

The story is about a race among 8 equally matched horses. With the winner unknown, the identity of the winner is conveyed by the three-bit binary number _ _ _ . Each of the underscore characters of the number represents the erasure of a binary digit, that is, a “0” or a “1”. One bit of information about the winner is lost when a binary digit is erased. One bit of information is gained when an erasure is replaced by a binary digit.

The information about the winner, measured in “bits,” is the number of binary digits. The missing information about the winner, also measured in bits, is the number of erasures. Thus, for example, in the number “_0_,” there is one bit of information about the winner and there are two bits of missing information about the winner.

Why is it that the missing information about the winner can be measured by counting the erasures and the information about the winner can be measured by counting the binary digits? This is a consequence of the precept of measure theory called “additivity.” This precept gives the missing information and the information their unique functional forms. If we were to define the missing information and the information in any other way, Shannon’s “measure” would no longer be a measure.
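
The counting in the story can be reproduced in a few lines; the sketch below (in Python) merely restates the example of the 8 equally matched horses.

    import math

    horses = 8
    missing_all = math.log2(horses)        # 3.0 bits: all three binary digits erased ( _ _ _ )
    digits_revealed = 1                    # the number "_0_" reveals one binary digit
    print(digits_revealed)                 # 1 bit of information about the winner
    print(missing_all - digits_revealed)   # 2.0 bits of missing information about the winner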

 

The principle of entropy maximization

In the nineteenth century, physicists described a principle of reasoning for the modern theory of heat: When a body was isolated from its environment, the entropy of an inference to the ways in which a state could occur in the unobserved state space of this body was maximized. In the twentieth century, Shannon described a similar principle for the modern theory of communication. Subsequently, various theorists generalized the two principles to a principle of reasoning for models, in general.

This principle is entropy maximization. An abbreviated derivation of this principle follows.

If and only if the observed state-space of an inference contains a single state then, under the probabilistic logic, the unique measure of this inference is its entropy. If and only if the states in the unobserved state-space of this inference are examples of ways, the entropy possesses a maximum. The entropy may be pushed downward from the maximum by constraints, expressed mathematically, on entropy maximization. The amount of the reduction in the entropy, from the constraints, is called “the available information.”

Thus, the probabilistic logic holds a “principle of entropy maximization.” It states

The entropy of the inference to the ways in which a state can occur is maximized, under constraints expressing the available information.

Maximization of the entropy, under the constraints, identifies the one correct inference in the set {a, b…} of alternatives for being made by a model. The correct inference is the one that maximizes its own entropy. Thus, the principle of entropy maximization is a principle of reasoning. This principle assigns a unique numerical value to the probability of each way in which the state can occur.

The reader should understand that, under the probabilistic logic, the principle of entropy maximization is a fact and not a theory, conjecture or empirical finding. Thus, this principle provides a portion of the bedrock upon which a model may be founded. Conversely, to violate this principle in the construction of a model is to commit a logical error.

The principle of entropy maximization has been called the “principle of honesty in inferences.” Often, model builders violate this principle by reducing the entropy to a level that is lower than is justified by the available information. When this happens, the result is the same as when a dishonest research worker fabricates empirical data. A consequence is that the model fails in service from making false assertions.
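
For the reader who wishes to experiment with the principle, the following sketch (in Python) maximizes the entropy of a six-sided die under an illustrative constraint, namely that the mean face value is 4.5. The constraint and the use of a general-purpose numerical optimizer are assumptions made for the sake of the example.

    import numpy as np
    from scipy.optimize import minimize

    faces = np.arange(1, 7)

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return np.sum(p * np.log2(p))       # the negative of the entropy, to be minimized

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},         # probabilities sum to one
        {"type": "eq", "fun": lambda p: np.dot(p, faces) - 4.5},  # the available information: mean face is 4.5
    ]
    p0 = np.full(6, 1.0 / 6.0)              # start from the unconstrained answer, the uniform distribution
    result = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 6,
                      constraints=constraints, method="SLSQP")
    print(result.x)                         # probabilities rise smoothly toward the higher faces

With no constraint beyond the probabilities summing to one, the same optimization returns the uniform distribution, which is Cardano’s assignment for a fair gambling device.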

 

An application

Under Cardano’s theory of fair gambling devices, equal numerical values are assigned to the probabilities of the ways in which an outcome can occur in a game of chance. Cardano’s theory arises from the principle of entropy maximization. It arises in the following manner.

Suppose a model makes an inference from an observed state-space containing a single state to the unobserved state-space that participates with the observed state-space in making an inference. This inference assigns a numerical value to the probability of each way in which an outcome can occur, in a game of chance. The observed state-space contains a single state, which is abstracted from the ways in the unobserved state-space.

The numerical values which are assigned to the probabilities of the various ways form a set. Sets of numerical values of infinite number are possibilities. Each possibility defines a different inference. Which inference is correct?

The principle of entropy maximization applies to the situation described. Under this principle, the correct inference maximizes its own entropy, under constraints expressing the available information.

The information about the way in which the outcome will occur is nil, by the definition of a fair gambling device. Thus, the entropy of the one correct inference is maximized without constraints. Maximization of the entropy assigns equal numerical values to the probabilities of the various ways.

 

Another application

The modern theory of heat, aka thermodynamics, arises from the principle of entropy maximization. The manner in which it arises is identical to the manner in which the theory of fair gambling devices arises, with the exception of the identities of the states in the unobserved state-space. In the theory of fair gambling devices, these states are the ways in which an outcome can occur in a game of chance. In the theory of heat, they are the “accessible microstates” for a body at thermodynamic equilibrium; the accessible microstates are the ways in which an outcome can occur, for a body at thermodynamic equilibrium.

 

Shannon’s theory of the optimal encoder

Shannon’s theory of the optimal encoder is an application of the principle of entropy maximization. Shannon’s theory features an inference-making device called an “encoder.” An encoder translates an un-encoded message, such as “Mary had a little lamb,” to an encoded message, such as “101110…” In doing so, an encoder makes an inference to an unobserved state-space; this state-space is the alphabet of the encoded message. The inference assigns a numerical value to the probability of each state in the unobserved state-space. The numerical values that are assigned to the probabilities of the various states form a set. Sets of infinite number are possibilities for assignment. Which one is correct? The designer of the encoder must decide!

Each state in the unobserved state-space of the inference is an example of a way. The observed state-space that participates with the unobserved state-space in making the inference contains a single state; this state is abstracted from the ways in the unobserved state-space.

The principle of entropy maximization applies to the situation described. The correct inference is the one that maximizes its own entropy, under constraints expressing the available information. Designing an encoder in this manner eliminates logical error from the inference that is made by this encoder. An encoder that is designed in this manner is called an “optimal encoder.”

 

The principle of entropy minimization

Shannon described a second principle of reasoning for the modern theory of communication. Various theorists generalized this principle to a principle of reasoning for models, in general.

This principle is entropy minimization. A derivation of the principle follows.

If the observed state space of an inference contains several states, under the probabilistic logic the unique measure of this inference is its conditional entropy. The conditional entropy of an inference is the missing information in this inference, for a deductive conclusion.

The observed state-space of the inference can be defined in many different ways. Each way defines a different inference. Thus, each inference in the set { a, b… } of alternatives is a candidate for being made by a model. Which inference is correct? The model builder must decide!

The principle of entropy maximization does not apply to this situation, for its task is to assign numerical values to the probabilities of the states in an unobserved state-space. Here, the task is to determine the descriptions that are provided by the states in an observed state-space. Minimization of the conditional entropy is the optimization that determines the descriptions. Minimization of the conditional entropy uniquely determines the correct inference. Thus, minimization of the conditional entropy is a principle of reasoning. This principle is called “entropy minimization.”

 

Shannon’s theory of the optimal decoder

Shannon’s theory of the optimal decoder applies the principle of entropy minimization. Under Shannon’s theory, a device called a “decoder” translates an encoded message, such as “001110…,” to an un-encoded message, such as “Mary had a little lamb.” In doing so, a decoder makes an inference from an observed to an unobserved state-space. The unobserved state-space is the alphabet of the un-encoded message.

A variety of descriptions can be provided by the states in the observed state-space of this inference. Each description defines a different inference. Which inference in the set {a, b…} of alternatives for being made by the decoder is correct? The designer of the decoder must decide!

The principle of entropy minimization applies to the situation described. That inference is correct which minimizes its own conditional entropy.

In the vernacular of communications engineering, the conditional entropy is attributed to the “noise.” Lightning strikes to telephone lines are a source of noise, for they add to the conditional entropy. Minimization of the conditional entropy through the design features of the decoder minimizes the deleterious effects of this noise.

Conformity to the principle of entropy minimization eliminates logical error from the inference that is made by a decoder. A decoder that is free from logical error is called an “optimal” decoder.

 

Shannon’s theory of communication

Under Shannon’s theory of communication (Shannon, 1948), the designer of a communications system maximizes the capacity of this system by combining an optimal encoder with an optimal decoder. The effect is to maximize the missing information about the encoded message at the encoder of this message and minimize the missing information about the un-encoded message at the decoder of the same message. Shannon’s ideas underlie the designs of nearly all modern communications devices.

 

Christensen’s theory of knowledge

By 1963, the problem of induction remained unsolved. In that year, the engineer, lawyer and theoretical physicist Ronald Christensen got the idea that the problem of induction could be solved by construing a model to be the algorithm for an optimal decoder of a “message” from nature. This “message” consisted of the sequence of the outcomes of statistical events for which the model was designed. It consisted, for example, of the sequence: rain, rain, no rain, rain….

In their quest for knowledge, research workers were hampered by the fact that “knowledge” was an undefined concept. Christensen’s idea supplied a definition that was uniquely logical. In doing so, it generated a logical theory of knowledge that was the only such theory. Going forward, this theory will be called “Christensen’s theory of knowledge.”

In the construction of a model, the issue repeatedly arose of which inference in a set { a, b…} of inferences that were candidates for being made by a model was the one correct inference. Under Christensen’s theory, each such issue was resolved by measuring the various candidates by Shannon’s measure and selecting that candidate whose measure was minimal or maximal.

This line of thinking yielded a pair of principles of reasoning. Christensen called these principles “entropy minimax.” Acting under entropy minimax, the builder of a model discovered patterns in empirical data. Christensen called the process of discovery “entropy minimax pattern discovery.”

Christensen’s theory employs an abundance of mathematical ideas. To keep unambiguous track of these ideas, it is necessary to employ mathematical symbols in referencing some of them.

Toward the end of keeping track, let the set O designate the set of outcomes of statistical events to which an inference is made by a model; O is an example of an unobserved state-space. Let C designate the observed state-space that participates with O in making an inference. The “knowledge” of Christensen’s theory is Sh( O∩C ); it is the information about the state in O, given the state in C. “Knowledge” must be defined in this way because, under the probabilistic logic, there is no other way to define it.

If the construction of a model is to create knowledge, this model must be built upon one or more independent variables. Each such variable is a measured variable or is computed from one or more measured variables. If an inference is to be made from C to O, a value must have been assigned to each independent variable at or before the time at which this inference is made.

A result of satisfying this requirement is that the set of independent variables takes on a value for each of its variables in the period before an inference is made. The set which contains a value for each of the independent variables is called a “tuple.” The complete set of tuples is called the “independent variable space” for the model.

For concreteness, let’s take a look at a simplified example. In the example, the model has two independent variables. The values of one of these variables are the elements of the set { heavy, light }. The values of the other variable are the elements of the set { long, short }. The associated independent variable space is the set { heavy-long, heavy-short, light-long, light-short }. Heavy-long is one of the four tuples in this space.
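
The independent variable space of the example can be written out as a Cartesian product; a minimal sketch (in Python) follows.

    from itertools import product

    # the two independent variables of the simplified example
    weights = ["heavy", "light"]
    lengths = ["long", "short"]
    space = [f"{w}-{l}" for w, l in product(weights, lengths)]
    print(space)   # ['heavy-long', 'heavy-short', 'light-long', 'light-short']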

The independent variable space of a model may be divided into parts. In the case of our example, this space may be divided into the part { heavy-long } and the part  { heavy-short, light-long, light-short }. The complete set of these parts is called a “partition” of the independent variable space. Each element of C is a tuple or is abstracted from the tuples in a part of this partition.

If C contains two or more states, these states are called “conditions,” for they are conditions on the independent variable space. The state heavy-short OR light-long OR light-short is an example of a condition; it is abstracted from the elements of the part { heavy-short, light-long, light-short } of the partition of the independent variable space that was described in the previous paragraph.

In practice, there are a great many possible partitions of the independent variable space. If at least one of the independent variables is continuous, the number of partitions is infinite. Each partition generates a different set of descriptions for the states in C. Each such set defines a different inference from C to O. Which of these inferences is correct? The model builder must decide!

The principle of entropy minimization applies to the situation described. That inference is correct which minimizes its own conditional entropy. With “knowledge” defined as previously described, the principle of entropy minimization is the equivalent of the principle that the model builder shall

Maximize the knowledge.  

Maximization of the knowledge is Christensen’s first principle of reasoning. His second principle of reasoning constrains the process of maximization of the knowledge by the availability of information for this purpose. With the availability of unlimited information, perfect knowledge is created by the application of Christensen’s first and second principles of reasoning. With the availability of no information, no knowledge is created. In practice, it is usually true that some but not perfect knowledge is created.
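
For the reader who wishes to see the first principle in action, the following sketch (in Python) computes the knowledge Sh( O∩C ) for two candidate partitions of the independent variable space of the simplified example and selects the partition that maximizes it. The sample of observed events is invented for illustration, and the sketch omits Christensen’s second principle entirely, so it should be read as a toy rather than as entropy minimax.

    import numpy as np
    from collections import Counter

    # Invented observed events: each pairs a tuple from the independent variable
    # space with an outcome in O = { rain, no rain }.
    sample = [("heavy-long", "rain"), ("heavy-long", "rain"), ("heavy-long", "rain"),
              ("heavy-short", "no rain"), ("light-long", "rain"), ("light-long", "no rain"),
              ("light-short", "no rain"), ("light-short", "no rain")]

    def entropy(counts):
        p = np.array(list(counts), dtype=float)
        p = p / p.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(sample, partition):
        # partition: a list of parts; each part yields one condition (state in C)
        n = len(sample)
        total = 0.0
        for part in partition:
            outcomes = [o for (t, o) in sample if t in part]
            if outcomes:
                total += len(outcomes) / n * entropy(Counter(outcomes).values())
        return total

    h_o = entropy(Counter(o for (_, o) in sample).values())      # Sh( O ), before conditioning

    partition_a = [{"heavy-long"}, {"heavy-short", "light-long", "light-short"}]
    partition_b = [{"heavy-long", "light-short"}, {"heavy-short", "light-long"}]
    for name, part in [("A", partition_a), ("B", partition_b)]:
        knowledge = h_o - conditional_entropy(sample, part)      # Sh( O∩C ), the knowledge
        print(name, round(knowledge, 3))                         # the partition with more knowledge wins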

The foregoing description of Christensen’s second principle of reasoning is accurate; however, it is vague in the sense of failing to describe how information may be turned into knowledge. Elimination of this vagueness comes at the expense of exposing the student to mathematical details that are complicated and that may be confusing. In view of the potential for confusion, it would be wise and cost-effective for the student to engage a competent tutor. For those who wish to attempt to learn of the details without a tutor, the following self-guided tutorial is provided.

Shannon’s theory of the optimal decoder contained an omission: it lacked a means for assignment of a number to the probability of a state in C or to the probability of a state in O, given a state in C. These assignments had to be made in order for the knowledge to be computed and maximized.

To assign a number to each probability, one needed a solution to a so-called “inverse problem.” The problem was that, while model builders had to assign values to probabilities, all that experimental science gave to model builders was frequency ratios in statistical samples.

The major barrier to solving the inverse problem was the question of how to avoid violation of the law of non-contradiction. In response, Christensen developed a strategy that answered this question. The strategy was to set up the problem such that on each occasion on which the identity of the correct inference was at issue, this issue was resolved by the principle of entropy maximization. By this strategy, Christensen solved the inverse problem. A result from this strategy is Christensen’s second principle of reasoning. 

Christensen’s strategy is rich with mathematical ideas. In keeping track of the ideas, it helps to reference them by symbols. Toward this end, let T designate an unobserved state-space and let U designate the observed state-space that participates with T in making an inference.

Refresher:

O designates the set of outcomes of statistical events that is referenced by a model. O is an example of an unobserved state-space.

C designates the observed state-space that participates with O in making an inference. Provided that C contains two or more states, these states are called “conditions.”

T and U are variables, each of which takes on two values. T takes on the values of O and C. If T takes on the value of O then U takes on the value of C. If T takes on the value of C then U contains a single state and this state is abstracted from the states in C.

The description provided in the previous paragraph deliberately employs terminological sloppiness in which the state-space C, previously described as an “observed state-space,” can also be described as an “unobserved state-space.” The question of which kind of state-space C is in any given context is resolved by this context.

For concreteness, let’s look at a couple of examples. In both examples, O is the state-space { rain, no rain } while C is the state-space { cloudy, not cloudy }.

In the first example, the variable T takes on the value O and the variable U takes on the value C; thus, T is the state-space { rain, no rain } while U is the state-space { cloudy, not cloudy }. In the second example, T takes on the value C and U contains a single state that is abstracted from the states in C; thus, T is the state-space { cloudy, not cloudy } while U is the state-space { cloudy OR not cloudy }.

As the reader may recall, T is an example of an unobserved state-space while U is the observed state-space that participates with T in making an inference. Let { Tl,  Um } designate the pairing of an unspecified state in T with an unspecified state in U. The count of the elements of a statistical sample that are observed to be in state Um is an example of a “frequency”; let this frequency be designated by n. The count of the elements that are observed to be in state Tl AND Um is another example of a frequency; let this frequency be designated by x. By definition, the two frequencies form the “frequency ratio” of the state Tl given Um. Let this frequency ratio be designated by ‘x in n’.

Let V designate a statistical sample. If the frequency ratio of Tl given Um in V is ‘x in n’ what number shall be assigned to Pr( Tl given Um )? To answer this question, one needs a solution to the inverse problem. Often, model builders have assumed the “straight rule” to be this solution.

 

The straight rule

Under the straight rule, the number assigned to Pr( Tl given Um ) is x/n. x/n is the “relative frequency” of the state Tl given Um in the sample V. The relative frequency is the value that makes the frequency ratio ‘x in n’ most likely. Thus, it is an example of a maximum likelihood estimator.

The straight rule is illogical, for it violates the principle of entropy maximization. This deficiency is most apparent in the circumstance that n is small.

To pick a specific example, if 1 swan was observed and it was white, the frequency ratio of the state white given swan is ‘1 in 1’, the relative frequency of this state is 1/1 and 1 is assigned to Pr( white given swan ) under the straight rule. This is the equivalent of the conclusion that “all swans are white.”

Is it logical to conclude that all swans are white on the basis of a sighting of a single white swan? No it’s not. One cannot logically state that all swans are white, for information is missing about the colors of the unobserved swans. Nonetheless, prior to 1957, statisticians were firm believers in the straight rule. A result, still present in the language of mathematical statistics, is use of the superlative “unbiased estimator” in reference to the result from the straight rule. Using the meaning of “biased” in common English, one would have to say that to assign the value of 1 to the probability of a white swan on the basis of a sighting of a single white swan is extremely biased. It is biased in the direction of presuming extremely more information than is possessed by the model builder about the colors of the unobserved swans.
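
The arithmetic of the swan example is shown below (in Python), together with a classical alternative (Laplace’s rule of succession) included purely for contrast; the alternative is not Christensen’s principle of maximum entropy expectation.

    # The swan example under the straight rule, contrasted with Laplace's classical rule of
    # succession; the latter is shown only for contrast.
    x, n = 1, 1                              # frequency ratio '1 in 1': one swan seen, and it was white
    straight_rule = x / n                    # assigns 1.0 to Pr( white given swan ): "all swans are white"
    rule_of_succession = (x + 1) / (n + 2)   # assigns 2/3, leaving room for unobserved swans of other colors
    print(straight_rule, rule_of_succession)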

The straight rule may be tested for its conformity to reality. In one such test, the state space T of the model contained a pair of outcomes. One of these outcomes was a hit in a time at bat in the game of baseball. The other outcome was not a hit. The state space U contained 18 conditions on the model’s independent variables. Each condition was the identity of the major league player who was the batter. The results of the test were published in the periodical Scientific American (Efron and Morris, 1977).

In the language of baseball, a player’s relative frequency of the state a hit in a time at bat is called this player’s “batting average.” In the test of the straight rule, the performances of 18 major league players were measured in the 1970 season. Each player’s batting average in his first 45 times at bat was compared to this player’s batting average in the remainder of the season.

If the straight rule were consistent with reality, the two batting averages would have been of similar magnitude. It was found, however, that in the remainder of the season, the various players’ batting averages had shrunk far from their batting averages in the first 45 times at bat and close to the grand average of the eighteen players in their first 45 times at bat. This phenomenon became known as “shrinkage.” With shrinkage, the numbers assigned to probabilities by the straight rule were wrong. Thus the straight rule was invalidated as a general guide to model building. The article called this phenomenon “Stein’s paradox,” after the statistician who had discovered it.

Under the probabilistic logic, the shrinkage has a cause. This cause is overestimation of the information one gets about the probability of a state in T, from knowing the state in U. If this information is nil, then the probability of a state in T is independent of the state in U and is called the “base-rate.” The shrinkage that is observed in empirical studies is toward the base-rate. The straight rule neglects the shrinkage toward base-rate when the information about the state in T, given the state in U, is less than perfect.
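
The shrinkage phenomenon can be reproduced in a simulation with invented numbers; the sketch below (in Python) does not use the 1970 data of Efron and Morris, and its fixed 50% shrinkage is an arbitrary illustration of the idea that estimates pulled toward the grand average tend to track the rest of the season better than the straight rule does.

    import numpy as np

    # A simulation of shrinkage with invented "true" batting abilities.
    rng = np.random.default_rng(0)
    true_p = rng.uniform(0.20, 0.32, size=18)            # hypothetical abilities of 18 players
    first_45 = rng.binomial(45, true_p) / 45              # straight-rule estimates from 45 at-bats
    rest = rng.binomial(400, true_p) / 400                # batting averages over the rest of the season

    grand = first_45.mean()                               # the grand average, a stand-in for the base-rate
    shrunk = grand + 0.5 * (first_45 - grand)             # a crude 50% shrinkage toward the grand average

    print(np.mean((first_45 - rest) ** 2))                # squared error of the straight rule
    print(np.mean((shrunk - rest) ** 2))                  # typically smaller: shrinkage helps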

The shrinkage can be eliminated. To accomplish this, the model builder eliminates the overestimation of the information about the probability of a state in T from knowing the state in U. This elimination is effected by conformity to the principle of entropy maximization. To settle every issue of which inference is correct by the principle of entropy maximization is the idea that underlies a principle of reasoning which Christensen discovered. He calls this principle “maximum entropy expectation.”

 

The principle of maximum entropy expectation

Maximum entropy expectation solves the inverse problem. In setting up the inverse problem for solution, Christensen describes inferences by a carefully contrived strategy. This strategy is designed to render every issue of which of several inferences is correct decidable, under the principle of entropy maximization.

To review, the principle of entropy maximization applies if and only if:

o   an inference to an unobserved state-space assigns a numerical value to the probability of each of the states in this state-space and,

o   the observed state-space that participates with the unobserved state-space in making the inference contains a single state and,

o   the elements of the unobserved state-space are examples of ways.

The details of Christensen’s strategy result from the necessity for conforming to the three bulleted requirements for the principle of entropy maximization to be applicable.

Maximum entropy expectation assigns a number to Pr( Tl given Um ) over each value of the index l and the index m. The manner in which it computes each such number is the topic of the following derivation.  

The derivation features the set { W1, W2,… } of ways in which a state can occur in the unobserved state-space { T1 given Um, T2 given Um… }. The elements of { W1, W2,… } may be paired. Let an arbitrarily selected pair be designated by { Wi, Wj  }, where the index i does not equal the index j. Let it be stipulated that { Wi, Wj } is the unobserved state-space for an inference. Let it be stipulated that the observed state-space which participates with { Wi, Wj } in making this inference is { Wi OR Wj }.

Wi given Wi OR Wj is an example of a state. Let E2 designate the evidence that is available for the assignment of a numerical value to Pr( Wi given Wi OR Wj ). Let Pr[ ( Wi given Wi OR Wj ) given E2 ] designate this value. If Pr[ ( Wi given Wi OR Wj ) given E2 ] can be computed over all of the values of the indices i and j  then, it can be shown, sufficient data are available for assignment of a numerical value to Pr( Tl given Um ) over each value of the index l and each value of the index m as required for completion of the derivation.

How shall a number be assigned to Pr[ ( Wi given Wi OR Wj ) given E2 ]? As Christensen has set up the problem, it responds to the principle of entropy maximization. In the assignment of a number, an inference is made from the observed state-space { Wi OR Wj } to the unobserved state-space { Wi, Wj }. Inferences of infinite number are possibilities. Each inference assigns a number to Pr[ ( Wi given Wi OR Wj ) given E2 ] and a different number to Pr[ ( Wj given Wi OR Wj ) given E2 ]. Which inference is correct? The model builder must decide!

Under the principle of entropy maximization, the correct inference is the one that maximizes its own entropy, under constraints expressing the available information.

What are the natures of the constraints? In answering this question, Christensen makes a second application of the principle of entropy maximization.

In developing this idea, we conduct a thought experiment. In each trial of this experiment, we observe whether the state is Wi, given that the state is Wi OR Wj.

In 1 trial of this experiment, it is a fact that the relative frequency of Wi given Wi OR Wj will be 0 OR 1. In 2 trials, the relative frequency will be 0 OR ½ OR 1. In 3 trials, the relative frequency will be 0 OR 1/3 OR 2/3 OR 1. In N trials, the relative frequency will be 0 OR 1/N OR 2/N OR 3/N…OR 1. Note that the relative frequency will surely be one of the elements in the sequence of numbers 0, 1/N, 2/N, 3/N,…,1.

Now, let the number of trials N increase without limit. The relative frequency becomes known as the “limiting relative frequency.” Let the limiting relative frequency of the state Wi given Wi OR Wj be designated by Fr( Wi given Wi OR Wj ).

Each element of the set { 0, 1/N, 2/N,…,1 } matches the description of a way in which Fr( Wi given Wi OR Wj ) can occur. Let this set be designated by F( Wi given Wi OR Wj ). F( Wi given Wi OR Wj ) is a variable whose true but thus far undetermined value is Fr( Wi given Wi OR Wj ). The values taken on by F( Wi given Wi OR Wj ) lie in the sequence 0, 1/N, 2/N, 3/N,…,1. The distance between adjacent values is 1/N; as N increases without limit, this distance becomes infinitesimal.
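
A minimal Python sketch of the thought experiment follows. The number of trials N = 20 and the chance 0.3 used to simulate the trials are placeholders of ours; the point is only that the observed relative frequency always lies on the grid 0, 1/N, 2/N,…, 1 that constitutes F( Wi given Wi OR Wj ).

import random

N = 20                               # number of trials; the grid spacing is 1/N
F = [k / N for k in range(N + 1)]    # the state-space F(Wi given Wi OR Wj)

# Simulate N trials in which Wi occurs with some chance unknown to the model.
random.seed(0)
hits = sum(random.random() < 0.3 for _ in range(N))
relative_frequency = hits / N
print(relative_frequency in F)       # True: the observed value lies on the grid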

We stipulate that F( Wi given Wi OR Wj ) is the unobserved state-space for an inference. The observed state-space of this inference is { 0 OR 1/N OR 2/N OR 3/N OR…OR 1 }. The numerical values that are assigned to the probabilities of the elements of F( Wi given Wi OR Wj ) by this inference form a set. Sets of infinite number are possibilities. Each set defines a different inference. Which inference is correct? The model builder must decide!

The principle of entropy maximization applies to the situation described. The correct inference to F( Wi given Wi OR Wj ) is the one which maximizes its own entropy, under constraints expressing the available information.

For reasons that will become clear, it is convenient to stipulate that there are two sources for the available information. One of these is the piece of evidence we’ve already seen, namely E2. The other is the additional piece of evidence E1. Under constraints expressing the available information in E1, maximization of the entropy yields the probability distribution function Pr[ F(Wi given Wi OR Wj ) given E1 ]. Under constraints expressing the available information in E2, entropy maximization yields the probability distribution function Pr[ F(Wi given Wi OR Wj ) given E2 ].

By tradition, the set E1 is assumed to be empty and the set E2 is assumed to contain the frequency ratio ‘x in n’ in a sample. Under this tradition, Pr[ F( Wi given Wi OR Wj ) given E1 ] is called the “prior” probability distribution function while Pr[ F(Wi given Wi OR Wj ) given E2 ] is called the “posterior” probability distribution function. A theorem, proved independently in the eighteenth century by Thomas Bayes and Pierre-Simon Laplace and called “Bayes’ theorem,” maps the “prior” function plus the frequency ratio to the “posterior” function.
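
Bayes’ theorem itself can be sketched on the discretized state-space F( Wi given Wi OR Wj ). The sketch below adopts the traditional assumptions described above: a uniform “prior” function and a binomial likelihood for the frequency ratio ‘x in n’. The grid size and the sample values are placeholders of ours.

import math

def bayes_update(prior, grid, x, n):
    """Map a prior over the grid of possible limiting relative frequencies,
    plus a frequency ratio 'x in n', to the posterior (Bayes' theorem)."""
    likelihood = [math.comb(n, x) * f**x * (1.0 - f)**(n - x) for f in grid]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

N = 100
grid = [k / N for k in range(N + 1)]
prior = [1.0 / len(grid)] * len(grid)            # the traditional uniform "prior"
posterior = bayes_update(prior, grid, x=7, n=10)
print(max(range(len(grid)), key=lambda k: posterior[k]) / N)   # mode at 0.7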

Bayes’ theorem is logically impeccable. However, under the tradition, its application is illogical, for it violates the law of non-contradiction. Non-contradiction is violated by the arbitrariness of the “prior” function.

If the principle of entropy maximization is employed in the determination of the “prior” function, the arbitrariness is eliminated. However, acting under the tradition, in which E1 is empty, one concludes that the “prior” function is uniform. Usually, a model built on this conclusion fails from the resulting shrinkage.

Either the principle of entropy maximization is empirically invalidated or the tradition is empirically invalidated. However, under the probabilistic logic, the principle of entropy maximization is a fact. Thus, it must be the tradition that is empirically invalidated.

Acting on the conclusion that the tradition is empirically invalidated, Christensen defines E1 and E2 outside the tradition. In particular:

o   E1 contains the frequency ratio ‘x1 in n1’ plus the function Pr[ F(Wi given Wi OR Wj ) given E2 ] and,

o   E2 contains the frequency ratio ‘x2 in n2’ plus the function Pr[ F(Wi given Wi OR Wj ) given E1 ].

where ‘x1 in n1’ designates the frequency ratio that is measured in a sample and ‘x2 in n2’ designates the frequency ratio that is measured in a different sample from the same population.

Bayes’ theorem maps ‘x2 in n2’ and Pr[ F(Wi given Wi OR Wj ) given E1 ] to Pr[ F(Wi given Wi OR Wj ) given E2 ]. A feedback loop devised by Christensen maps ‘x1 in n1’ and Pr[ F(Wi given Wi OR Wj ) given E2 ] to Pr[ F(Wi given Wi OR Wj ) given E1 ].

The evidence E1 pushes the entropy downward. The evidence E2 pulls the entropy upward. It is the absence of the evidence E2 that causes the model to fail, under the tradition. The feedback loop varies the portion of E2 which is Pr[ F(Wi given Wi OR Wj ) given E1 ] in such a way as to minimize a measure of the error when a number is assigned to Pr[ ( Wi given Wi OR Wj ) given E2 ] over each of the values of the indices i and j. By this strategy, the functions Pr[ F(Wi given Wi OR Wj ) given E1 ] and Pr[ F(Wi given Wi OR Wj ) given E2 ] are uniquely determined, the available information is precisely represented and the cause of shrinkage is eliminated.
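
Christensen’s feedback loop is not specified in this primer, and the sketch below is not his algorithm. It is only one hedged reading of the idea that Pr[ F(Wi given Wi OR Wj ) given E1 ] is varied so as to minimize a measure of the error: the candidate family of functions, the squared-error score and the sample sizes are placeholders of ours.

import math

N = 100
grid = [k / N for k in range(N + 1)]

def bayes_update(prior, x, n):
    """Bayes' theorem on the grid: prior plus frequency ratio 'x in n' -> posterior."""
    unnorm = [p * math.comb(n, x) * f**x * (1 - f)**(n - x) for p, f in zip(prior, grid)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_value(dist):
    return sum(f * p for f, p in zip(grid, dist))

def beta_shaped(a, b):
    """Placeholder candidate for Pr[F given E1]: a Beta(a, b)-shaped function on the grid."""
    unnorm = [f**(a - 1) * (1 - f)**(b - 1) if 0 < f < 1 else 0.0 for f in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]

x1, n1 = 6, 10     # frequency ratio measured in the first sample  (placeholder)
x2, n2 = 13, 20    # frequency ratio measured in a second sample   (placeholder)

best = None
for a in (1, 2, 4, 8):
    for b in (1, 2, 4, 8):
        candidate_e1 = bayes_update(beta_shaped(a, b), x1, n1)     # candidate Pr[F given E1]
        candidate_e2 = bayes_update(candidate_e1, x2, n2)          # Bayes: E1 -> E2
        error = (expected_value(candidate_e2) - x1 / n1) ** 2      # placeholder error measure
        if best is None or error < best[0]:
            best = (error, a, b)
print(best)   # the candidate shape with the smallest placeholder error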

With the definitions for E1 and E2, as modified by Christensen, the traditional terminological convention in which Pr[ F(Wi given Wi OR Wj ) given E1 ] is called the “prior” function and Pr[ F(Wi given Wi OR Wj ) given E2 ] is called the “posterior” function becomes misleading and inappropriate, for Pr[ F(Wi given Wi OR Wj ) given E1 ] is dependent upon, rather than being prior to, observational data. Adherence to this convention has misled a large body of methodologists into the logically erroneous conclusion that conformity to Bayes’ theorem must be avoided when, for logical consistency, this conformity must be preserved.

A procedure has been described for determination of the function Pr[ F(Wi given Wi OR Wj ) given E2 ]. The next step in the derivation is to describe the means for assignment of a number to the distinct quantity Pr( Wi given Wi OR Wj ), over each of the values of the indices i and j. With these means available, the derivation can be completed.

In the context of this problem, it is pertinent that the function Pr[ F(Wi given Wi OR Wj ) given E2 ] contains all of the information that is available for the assignment of a number to Pr( Wi given Wi OR Wj ). Our strategy is to discover a measure of Pr[ F(Wi given Wi OR Wj ) given E2 ] with the properties that are required of the number which is assigned to Pr( Wi given Wi OR Wj ).

Let the function g(.) designate this measure. If the identity of g(.) is determined, then the required means are available for the assignment of a number to Pr( Wi given Wi OR Wj ).

What is the identity of the measure g(.)? In addressing this question, it is pertinent that, under the circumstance that the missing information about the value of the variable F( Wi given Wi OR Wj ) is reduced to nil, this value is Fr( Wi given Wi OR Wj ). Thus, under this circumstance,

g{ Pr[ F(Wi given Wi OR Wj ) given E2 ] } = Fr( Wi given Wi OR Wj )                              (3)

Equation (3) imposes the first of two constraints on the form of the measure g(.). On the assumption that the function Pr[ F(Wi given Wi OR Wj ) given E2 ] has a single maximum, a form for g(.) that is consistent with this constraint is that g(.) is the value of F( Wi given Wi OR Wj ) at this maximum. However, this assumption is inconsistent with the second of the two constraints.

The second of the constraints arises in the following way. To review, { Wi, Wj } designates the unobserved state-space of a kind of inference. The observed state-space that participates with { Wi, Wj } in making this inference is { Wi OR Wj }. Each such inference assigns a number to the probability of Wi and a different number to the probability of Wj.

To continue the review, F( Wi given Wi OR Wj ) designates the unobserved state-space of a different kind of inference. F( Wi given Wi OR Wj ) is a variable which takes on the values in the sequence 0, 1/N, 2/N, 3/N…, 1, where N is a large positive integer. The observed state-space that participates with F(Wi given Wi OR Wj ) in making this inference is { 0 OR 1/N OR 2/N OR 3/N OR…OR 1 }. Each such inference assigns a number to the probability of each element of F( Wi given Wi OR Wj ).

Many inferences of the first and second kinds are possibilities. Which ones are correct?

The principle of entropy maximization applies to the situation. Among the several inferences that a model might make, the correct one is the inference that maximizes its own entropy, under constraints expressing the available information.

Now, let it be stipulated that the bases for inferences of the two kinds are entirely empirical. In the circumstance that the available information about the value of F( Wi given Wi OR Wj ) is nil, the available information about the state in { Wi, Wj } is also nil. It follows that the value assigned to F( Wi given Wi OR Wj ) is ½; however, under the stated circumstances, Pr[ F(Wi given Wi OR Wj ) given E2 ] is a constant. Thus, there is no single value of F(Wi given Wi OR Wj ) for which Pr[ F(Wi given Wi OR Wj ) given E2 ] is at the maximum; rather, all of the values in the interval between 0 and 1 are at the maximum.

Given that only a slight amount of information is available, the value assigned to F(Wi given Wi OR Wj ) must, by the definition of “information,” be close to ½. Under this circumstance, Pr[ F(Wi given Wi OR Wj ) given E2 ] may possess a maximum, but the value of F(Wi given Wi OR Wj ) at this maximum is not necessarily close to ½. On these grounds, the proposition that the measure g(.) is the value of F(Wi given Wi OR Wj ) for which Pr[ F(Wi given Wi OR Wj ) given E2 ] is at the maximum is rejected.

In the search for an acceptable alternative, it may be noted that the function for which we require a measure, namely Pr[ F(Wi given Wi OR Wj ) given E2 ], is a set of pairs. Each such pair consists of an element of the set F(Wi given Wi OR Wj ) and the value of Pr(.) to which it maps. We stipulate that the collection of sets that are measurable by g(.) contains the set of these pairs and that the elements of this set are non-overlapping.

As the missing information about the true value of the variable F(Wi given Wi OR Wj ) approaches nil, the function Pr[ F(Wi given Wi OR Wj ) given E2 ] reduces to a Dirac delta function that is centered on the true value of F(Wi given Wi OR Wj ), namely Fr(Wi given Wi OR Wj ). By a property of the Dirac delta function, it must be true that the measure g(.) of a single pair in Pr[ F(Wi given Wi OR Wj ) given E2 ] is f(Wi given Wi OR Wj ) Pr[ f(Wi given Wi OR Wj ) given E2 ], where f(Wi given Wi OR Wj ) designates an element of F(Wi given Wi OR Wj ).

Under the precept of measure theory called additivity, the measure of the union of the pairs in the collection of measurable sets is the sum of the measures of the individual pairs. In establishing the identity of this sum, it is convenient to employ the substitution in which

Pr[ f( Wi given Wi OR Wj ) given E2 ] = Prʹ[ f( Wi given Wi OR Wj ) given E2 ] df( Wi given Wi OR Wj ),

where Prʹ[ f( Wi given Wi OR Wj ) given E2 ] is an example of a probability density function and df( Wi given Wi OR Wj ) is the distance between adjacent elements of F(Wi given Wi OR Wj ). With this substitution, the measure g(.) of the union of the pairs in the collection of measurable sets is

g{ Pr[ F( Wi given Wi OR Wj ) given E2 ] } = Σ f( Wi given Wi OR Wj ) Prʹ[ f( Wi given Wi OR Wj ) given E2 ] df( Wi given Wi OR Wj ),

where the sum runs over the elements f( Wi given Wi OR Wj ) of F( Wi given Wi OR Wj ).

The above quantity is the expected value of F( Wi given Wi OR Wj ) in the probability distribution function Pr[ F( Wi given Wi OR Wj ) given E2 ], by the definition of “the expected value.” With no available information about the true value of F(Wi given Wi OR Wj ), the expected value of it is ½; this value of ½ maximizes the entropy of the inference to { Wi, Wj }, as required under the principle of entropy maximization. On these grounds, we accept the proposition that the number assigned to Pr( Wi given Wi OR Wj ) is the expected value of the variable F(Wi given Wi OR Wj ) in the function Pr[ F(Wi given Wi OR Wj ) given E2 ].
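
A hedged numerical check of this conclusion: under a uniform Pr[ F(Wi given Wi OR Wj ) given E2 ] the expected value is ½, and under a Pr[ F(Wi given Wi OR Wj ) given E2 ] that has been concentrated by data the expected value moves toward the observed frequency. The grid size and the frequency ratio ‘7 in 10’ are placeholders of ours.

import math

N = 100
grid = [k / N for k in range(N + 1)]

def expected_value(dist):
    """g(.): the expected value of F(Wi given Wi OR Wj) in the function Pr[F given E2]."""
    return sum(f * p for f, p in zip(grid, dist))

uniform = [1.0 / len(grid)] * len(grid)
print(round(expected_value(uniform), 3))    # 0.5 -- maximizes the entropy of the inference

# Concentrate Pr[F given E2] with a binomial likelihood for a frequency ratio '7 in 10'.
posterior = [p * math.comb(10, 7) * f**7 * (1 - f)**3 for p, f in zip(uniform, grid)]
total = sum(posterior)
posterior = [u / total for u in posterior]
print(round(expected_value(posterior), 3))  # close to (7 + 1) / (10 + 2), about 0.667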

A procedure has been described that assigns a numerical value to Pr( Wi given Wi OR Wj ), where Wi and Wj are ways in which a state can occur in the unobserved state-space { T1 given Um, T2 given Um… } of an inference. It can be shown that, by variation of the values of the indices i and j over their ranges and assignment of a value to Pr( Wi given Wi OR Wj ) by this procedure, one generates sufficient data for assignment of a numerical value to Pr( Tl given Um ) over each value of the index l and each value of the index m.

This completes our derivation of maximum entropy expectation. It is apt to classify maximum entropy expectation as a principle of reasoning because: a) it is based upon the principle of entropy maximization and b) the latter principle is a fact, under the probabilistic logic.

 

Frequentism

In the nineteenth century, a group of logicians advocated a reform in the field of mathematical statistics. This reform became known as “frequentism.”

They advocated this reform for the purpose of eliminating the violations of the law of non-contradiction that resulted from the arbitrariness of the “prior” probability distribution function. In the twentieth century, this reform became doctrine in the field of mathematical statistics and was embraced by most scientists. However, this reform could not have been more damaging to the advancement of science, for under frequentism patterns could not be discovered and knowledge could not be created.

Frequentism is the idea that the constant numerical value of the limiting relative frequency of a state is assigned to the probability of this state. In particular,

 Pr( Wi given Wi OR Wj ) := c                                                                (4)

where the constant c is the numerical value of the limiting relative frequency Fr( Wi given Wi OR Wj ). Under this method of assignment, the value which is assigned to Pr( Wi given Wi OR Wj ) is insensitive to the nature of the “prior” probability distribution function. The insensitivity eliminates the violation of the law of non-contradiction. However, there is an unsavory side effect.

The story of what is unsavory is best told by way of an example. In the example, c is 0.6931…, Wi is rain given cloudy and Wj is no rain given cloudy. Thus, Wi given Wi OR Wj is { (rain given cloudy ) given [ (rain given cloudy ) OR ( no rain given cloudy ) ] }. The relationship between { (rain given cloudy ) given [ (rain given cloudy ) OR ( no rain given cloudy ) ] } and 0.6931… is one-to-one. Thus, the fact that the value is 0.6931… determines that the state is { (rain given cloudy ) given [ (rain given cloudy ) OR ( no rain given cloudy ) ] }.

The unsavory side effect is that, with the state { (rain given cloudy ) given [ (rain given cloudy ) OR ( no rain given cloudy ) ] } determined, the observed state-space { cloudy, not cloudy } cannot be swapped out and another state-description, say { barometric pressure falling, barometric pressure rising }, swapped in for the purpose of determining whether this would reduce the conditional entropy of the inference to { rain, no rain }. To make this swap would be to change the value from 0.6931… to some other value, but this value is fixed under the defining precept of frequentism. To generalize from this example, pattern discovery cannot take place under frequentism, nor can knowledge be created, because frequentism implies that the states in observed state-spaces are of fixed description. Viewed from the perspective of the probabilistic logic, frequentism fails from violation of the principle of entropy minimization.

Under maximum entropy expectation, the situation is different. The numerical value assigned to Pr( Wi given Wi OR Wj ) is the expected value of F( Wi given Wi OR Wj ) in the function Pr[ F( Wi given Wi OR Wj ) given E2 ]; rather than being fixed, this expected value varies with the evidence E2 and with the identity of the state Wi given Wi OR Wj. Thus, swapping in a new observed state-space simply changes the expected value. It follows that, under maximum entropy expectation, patterns can be discovered, knowledge can be created and the principle of entropy minimization can be followed.
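
The contrast can be made concrete with a short Python sketch. The frequency ratios below are placeholders of ours; the point is that the assignment under maximum entropy expectation moves as the evidence E2 changes, while the frequentist constant c does not.

import math

N = 100
grid = [k / N for k in range(N + 1)]

def maxent_expectation(x, n):
    """Expected value of F under Pr[F given E2] for a frequency ratio 'x in n',
    starting from a uniform function over the grid."""
    unnorm = [math.comb(n, x) * f**x * (1 - f)**(n - x) for f in grid]
    total = sum(unnorm)
    dist = [u / total for u in unnorm]
    return sum(f * p for f, p in zip(grid, dist))

c = 0.6931                       # frequentism: the assignment is this fixed constant (equation 4)
for x, n in [(2, 3), (14, 20), (70, 100)]:
    print(n, round(maxent_expectation(x, n), 3), c)   # the expectation varies; c does not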

 

Entropy minimax

Under the probabilistic logic, two principles of reasoning discriminate the one correct inference from the many incorrect ones in the construction of a model. The first is entropy minimization. The second is maximum entropy expectation. Christensen calls this pair of principles “entropy minimax.”

 

Pattern discovery

When the construction of a model is guided by entropy minimax, the elements of the observed state-space C are optimized abstractions. If there are two or more of them, it is customary and apt to call these abstractions “patterns.” Thus, in the construction of a model under entropy minimax, patterns are discovered. The process by which they are discovered is called “entropy minimax pattern discovery”.
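
What it means for the discovered abstractions to be optimized can be pictured with a hedged Python sketch: two candidate observed state-spaces are scored by the conditional entropy of the inference to { rain, no rain }, and the one with the lower score is retained. The joint counts are invented placeholders of ours, not data from any study.

import math

def conditional_entropy(joint):
    """Conditional entropy H(outcome given condition), in nats, from a table of
    joint counts joint[condition][outcome]."""
    total = sum(sum(row.values()) for row in joint.values())
    h = 0.0
    for row in joint.values():
        row_total = sum(row.values())
        for count in row.values():
            if count:
                h -= (count / total) * math.log(count / row_total)
    return h

# Invented counts of (condition, outcome) observations for two candidate state-spaces.
by_cloudiness = {"cloudy": {"rain": 30, "no rain": 20}, "not cloudy": {"rain": 10, "no rain": 40}}
by_pressure = {"falling": {"rain": 36, "no rain": 9}, "rising": {"rain": 4, "no rain": 51}}

candidates = {"cloudiness": by_cloudiness, "pressure": by_pressure}
pattern = min(candidates, key=lambda name: conditional_entropy(candidates[name]))
print(pattern)   # the candidate with the lower conditional entropy is retained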

 

Creation of the maximum possible knowledge

If knowledge is created by a scientific investigation, this is by the construction of a model. When this model is constructed under entropy minimax, the greatest possible knowledge is created from fixed resources, as “knowledge” is defined in Christensen’s theory of knowledge. Christensen’s theory has the merit of providing the only known logical approach to the construction of a model.

 

Peak performance

When a model is constructed under entropy minimax, logical errors are eliminated from it. It follows that the degradation in the performance of this model that would result from these errors is eliminated. Thus, the model performs at the highest possible level.

 

Reduction to the deductive logic

If entropy minimax supplies the principles of reasoning for the whole of logic, then this logic must reduce to the deductive logic in the circumstance that every kind of missing information is reduced to nil. This is the case.

The reduction to the deductive logic occurs in the following manner. Associated with the deductive logic is a single principle of reasoning. This principle states that an argument is correct if and only if it matches the abstract argument called modus ponens or the abstract argument called modus tollens.

Modus ponens states:

                                    Major premise: A implies B

                                    Minor premise: A

                                    Conclusion:      B

Modus tollens states:

                                    Major premise: A implies B

                                    Minor premise: NOT B

                                    Conclusion:      NOT A

In the language of Christensen’s theory of knowledge, A is an example of a pattern while B is an example of an outcome. Modus ponens and modus tollens express the one-to-one relationship between patterns and outcomes, under entropy minimax, when the missing information for a deductive conclusion is reduced to nil.
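
A minimal sketch of this limiting deductive case follows. The representation of propositions as labeled values and the example rule linking a “cloudy pattern” to a “rain outcome” are illustrative choices of ours.

def modus_ponens(implication, fact):
    """From the major premise (antecedent, consequent) and the minor premise
    'the antecedent holds', conclude the consequent."""
    antecedent, consequent = implication
    if fact == antecedent:
        return consequent
    raise ValueError("minor premise does not match the antecedent")

def modus_tollens(implication, negated_fact):
    """From the major premise (antecedent, consequent) and the minor premise
    'NOT consequent', conclude 'NOT antecedent'."""
    antecedent, consequent = implication
    if negated_fact == ("NOT", consequent):
        return ("NOT", antecedent)
    raise ValueError("minor premise does not negate the consequent")

rule = ("cloudy pattern", "rain outcome")             # A implies B, pattern to outcome
print(modus_ponens(rule, "cloudy pattern"))           # -> "rain outcome"
print(modus_tollens(rule, ("NOT", "rain outcome")))   # -> ("NOT", "cloudy pattern")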

Mathematics results from repeated application of the two arguments. Thus, mathematics results from the conformity of its arguments to entropy minimax.

 

Perfect knowledge

If Christensen’s theory of knowledge is correct, perfect knowledge can be expressed under entropy minimax pattern discovery. This is the case. Perfect knowledge results from reduction of the logic to the deductive logic by the elimination of all missing information for a deductive conclusion.

 

Perfect ignorance

If Christensen’s theory of knowledge is correct, perfect ignorance can be expressed under entropy minimax pattern discovery. This is the case. Perfect ignorance results from failure to discover patterns.

 

Empirical basis

No observational data conflict with the thesis that the principles of reasoning are entropy minimax. The quantity of observational data that bear on this issue is great. Much of these data result from tests of models or arguments that were, in effect, built under the principle of entropy minimization, the principle of entropy maximization or the principle of maximum entropy expectation; these models and arguments have large domains of validity and have worked perfectly under intensive testing over periods ranging from decades to millennia. They are:

o   Modus ponens and modus tollens plus all of the works that are built by these arguments, including mathematics,

o   the theory of fair gambling devices,

o   the modern theory of heat aka thermodynamics and,

o   the modern theory of communication.

A study (Christensen, 1986a) compares the performances of models built by entropy minimax pattern discovery to the performances of models built under the method of heuristics. It is reported that the former models consistently outperformed the latter. In certain cases, the degree of outperformance was very great.

 

Past applications

Questions addressed by entropy minimax pattern discovery have included:

o   whether a drug will be found to retard lymphoid leukemia, lymphocytic leukemia or melanocarcinoma in mice, based on this drug’s physical, chemical and biological features,

 

o   whether a patient will be found to have heart disease subsequent to his/her electrocardiogram, ECG,

 

o   whether an ECG waveform indicates a normal or an abnormal heartbeat,

 

o   which features of an ECG or other waveform contain the most information about outcomes,

 

o   whether a biopsy will reveal prostate cancer, conditioned on a patient’s level of prostate specific antigen, PSA, plus the values of other independent variables,

 

o   whether a biopsy will reveal cervical cancer, based on spectral analysis of data from tissue fluorescence,

 

o   whether a biopsy will reveal breast cancer, based on electrical potentials produced by the patient’s heart beats,

 

o   whether patients with lymphoma, chronic granulocytic leukemia or prostate cancer have high or low survival risks,

 

o   whether patients surgically treated for coronary artery disease have high or low survival risks, based on catheterization and clinical data,

 

o   whether a paroled prison inmate will return to prison,

 

o   whether depression and related psychological states are related to early childhood memories,

 

o   whether nuclear reactor fuel will be sufficiently deformed under accident conditions to obstruct coolant flow,

 

o   whether nuclear reactor fuel will be found to be leaking radioactive substances if removed from a reactor and tested,

 

o   whether a gasoline storage tank will be found to be leaking a carcinogen into an aquifer or an explosive into adjacent basements, if dug up and tested and,

 

o   how a photographic or video image should be classified as to type.

 

Decisions that have been supported by models built by entropy minimax pattern discovery include:

o   the course of treatment for non-Hodgkin’s lymphoma,

 

o   the course of treatment for disorders of the cervical spine,

 

o   which factors, in addition to PSA, improve the reliability of prostate cancer diagnosis,

 

o   which factors (now referenced in medicine as International Prognostic Indices, IPIs) indicate high risk for patients with lymphoma, chronic granulocytic leukemia or prostate cancer, for consideration in treatment selection,

 

o   whether to submit a request for approval of a diagnostic technique for breast cancer to the U.S. Food and Drug Administration, FDA,

 

o   whether to submit a request for approval of a diagnostic technique for cervical cancer to the FDA,

 

o   whether the U.S. Nuclear Regulatory Commission should require further research before certifying that nuclear reactors are adequately safe from loss of coolant accidents,

 

o   whether to restart a nuclear reactor containing parts that might fail in service,

 

o   whether to suspend licensing of nuclear reactors,

 

o   when to replace a leakage-prone gasoline storage tank,

 

o   the level of water that should be kept behind a dam, in light of the long range forecast for precipitation,

 

o   how an electric utility should plan for demand for air conditioning and,

 

o   whether an electric utility’s rate should be adjusted, in light of the long range forecast for precipitation.

 

Factors discovered by entropy minimax pattern discovery are embedded in the medical standard for the classification of patients with non-Hodgkin’s lymphoma.

 

Offerings

Logic has been completed, but neither the scientific nor the academic community has come to grips with this advance. This situation leaves a great deal of work to be done and very few people or organizations with the ability to do this work. KnowledgeToTheMax offers its help in filling this gap through services that include:

o   teaching of logic, not excluding the inductive logic,

o   consultancy on science policy,

o   consultancy on curriculum reform in education,

o   management of theoretical aspects of scientific studies and,

o   construction of ultra-optimized, logical, maximally effective models.

 

Staff

Currently, the staff of KnowledgeToTheMax consists solely of the firm’s owner, Terry Oldberg. From 1975 to 1982 Oldberg managed the theoretical side of the research program of the electric utilities of the U.S. on the performance of materials in the cores of their nuclear reactors. In this capacity, he, Dr. Ronald Christensen and their colleagues pioneered the application of entropy minimax pattern discovery in engineering research. Oldberg has held positions in research, management and engineering with the Lawrence Livermore National Laboratory, General Electric Company, Electric Power Research Institute, Alltel Healthcare Information Systems and Picturetel Corporation.

Oldberg holds the B.M.E. degree in mechanical engineering from Cornell University, the M.S.E. degree in mechanical engineering from the University of Michigan and the M.S.E.E. degree in electrical engineering from Santa Clara University. He is a registered professional engineer in nuclear engineering in the State of California.

A list of Oldberg’s publications relating to entropy minimax pattern discovery is available in the bibliography.


 

Limitation

In building models, KnowledgeToTheMax operates under a technology sharing agreement with the developer of entropy minimax pattern discovery, Ronald Christensen. Under this agreement, KnowledgeToTheMax may freely employ proprietary technology owned by Christensen for the benefit of non-profit organizations and may employ the same technology for the benefit of others with Christensen’s permission.

 

Adding a mechanistic model

In some cases, the scientific community possesses a degree of mechanistic understanding of the phenomenon being modeled. In these cases, the amalgamation of a mechanistic model with an empirical one created by entropy minimax pattern discovery provides the benefit of greater knowledge or a larger domain of applicability.

In the amalgamated model, the mechanistic model plays two roles. First, certain of its independent variables may provide independent variables for the empirical model. Second, the inferences that are made to the outcomes of events by the mechanistic model may serve as a constraint on entropy maximization.
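
One hedged way to picture the second role is to maximize the entropy of the inference subject to a constraint that the expected outcome equal the mechanistic model’s prediction; the constrained maximum takes the familiar exponential form. The outcome values and the predicted mean in the sketch below are placeholders of ours.

import math

outcomes = [0.0, 1.0, 2.0, 3.0]     # placeholder outcome values of an event
mechanistic_mean = 1.2              # placeholder prediction by the mechanistic model

def maxent_with_mean(values, target, lo=-50.0, hi=50.0, tol=1e-10):
    """Maximum-entropy distribution over 'values' whose expected value equals 'target'.
    The solution is p_k proportional to exp(lam * value_k); lam is found by bisection."""
    def mean_for(lam):
        weights = [math.exp(lam * v) for v in values]
        total = sum(weights)
        return sum(v * w for v, w in zip(values, weights)) / total
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    weights = [math.exp(lam * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

dist = maxent_with_mean(outcomes, mechanistic_mean)
print([round(p, 3) for p in dist])                                   # constrained maximum-entropy distribution
print(round(sum(v * p for v, p in zip(outcomes, dist)), 3))          # its mean matches the mechanistic prediction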

 

Fuzzy sets

Probability theory assumes that the sets in the collection of sets that are measured by Pr(.) are crisply defined. In the construction of a model, though, the builder often encounters situations in which these sets are not crisply defined. In these situations, a similar logic applies, but with set theory replaced by fuzzy set theory and the various measures replaced by their fuzzy equivalents.
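
A hedged sketch of one such fuzzy equivalent follows. It uses Zadeh’s definition of the probability of a fuzzy event, namely the expectation of the membership function; whether this is the particular replacement Christensen employs is not asserted here, and the membership grades and probabilities are placeholders of ours.

# Probability of a fuzzy event as the expectation of its membership function
# (Zadeh's definition; offered only as an illustration of a "fuzzy equivalent").
states = ["clear", "hazy", "overcast"]
probability = {"clear": 0.5, "hazy": 0.3, "overcast": 0.2}        # crisp probability measure (placeholder)
membership_cloudy = {"clear": 0.0, "hazy": 0.6, "overcast": 1.0}  # fuzzy set "cloudy" (placeholder)

pr_cloudy = sum(membership_cloudy[s] * probability[s] for s in states)
print(pr_cloudy)   # 0.38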

 

News

On Oct. 23, 2008, Terry Oldberg of KnowledgeToTheMax presented a lecture entitled “Information Theory: Maximizing Knowledge” to a meeting of the American Nuclear Society in San Francisco, California.

On Nov. 20, 2008, Oldberg presented a lecture entitled “Maximizing Knowledge” to a meeting of the American Chemical Society in Santa Clara, California.

On Feb. 11, 2009, Oldberg presented a lecture entitled “Maximizing Knowledge” to a meeting of the American Society for Quality in Santa Clara, California.

On May 7, 2009, Oldberg presented a lecture entitled “Maximizing Knowledge” to a meeting of the American Institute of Chemical Engineers in Berkeley, California.

 

Bibliography

A bibliography is available by clicking here. The literature is large and not completely user friendly. Proofs of some theorems are sketchy or absent. Hence, it would be far more cost effective to engage a tutor than to attempt to climb the learning curve unaided.

 

Contacting us

For further information, please contact the owner of KnowledgeToTheMax, Terry Oldberg. He may be reached at terry@KnowledgeToTheMax.com (Los Altos Hills, California).

 

Citing this work

Title:                   Offerings of KnowledgeToTheMax, Third Edition

Author:                Terry Oldberg

Publisher:           KnowledgeToTheMax, Los Altos Hills, CA

Publication date: November 14, 2009

 

Copyright

COPYRIGHT © 2008, 2009 by Terry Oldberg

 

ALL RIGHTS RESERVED

 

No part of this work may be reproduced or used in any form or by any means, including Web distribution or information storage and retrieval systems, without the written permission of the author. To request permission, contact the author at terry@KnowledgeToTheMax.com.

 
