Learning to Reason

(1)

Learning to Reason

Fredrik Rømming

Thesis submitted for the degree of

Master in Informatics: Programming and System Architecture

60 credits

Department of Informatics

The Faculty of Mathematics and Natural Sciences

(2)

(3)

Learning to Reason

Fredrik Rømming

(4)

Printed: Reprosentralen, University of Oslo

(5)

Abstract

Theorem proving formalizes the notion of deductive reasoning, while machine learning formalizes the notion of inductive reasoning. In this thesis, we present an overview of the current state of machine learning guided first-order automated theorem proving systems and outline a novel high-level modular object-oriented framework for combining arbitrary machine learning models with arbitrary proof calculi. Additionally, we present an example implementation in the aforementioned framework taking a novel approach to combining graph neural networks with the first- order connection calculus, generating a new Python implementation of the leanCoP theorem prover as a by-product.

(6)

(7)

Introduction

“If we had it [a characteristica universalis], we should be able to reason in metaphysics and morals in much the same way as in geometry and analysis.

If controversies were to arise, there would be no more need of disputation between two philosophers than between two accountants. For it would suffice to take their pencils in their hands, to sit down to their slates, and to say to each other . . . : Let us calculate.”

- Gottfried Wilhelm Leibniz Reasoning is the ability to make steps of inference from premises to consequences. We traditionally divide inference into two subcategories:

deduction and induction.

Definition 1.0.1 (Deduction). Deductive inference makes steps from premises, which are known or assumed to be true, to logical conclusions by following rules of valid reasoning.

An example of deductive inference is the ancient Greek syllogism:

All men are mortal.

Socrates is a man.

Therefore, Socrates is mortal. ∴

The two sentences above the horizontal line are premises which combined imply the sentence below the line, which is a logical conclusion.

Definition 1.0.2 (Induction). Inductive inference (Not to be confused with "mathematical induction") makes steps from premises, which are observations providing some evidence, to conclusions saying something general about the phenomenon underlying the particular observations.

An example of inductive inference is the generalization:

(10)

The proportion Q of the sample has attribute A

Therefore, the proportion Q of the population has attribute A The sentence above the line states some specific observations, while the sentence below the line says something general about all possible potential observations.

In other words, deduction is concluding the particular from the general, while induction is concluding the general from the particular.

Systems automating the process of reasoning play a central role in software and hardware verification, and have applications in any area where logical reasoning is required, such as in artificial general intelligence, mathematics and philosophy. Although the overall goal of automated reasoning is to mechanize different forms of reasoning, the term has largely been identified with deductive reasoning as practiced in mathematics and formal logic. In this thesis, we explore the use of automated inductive reasoning (machine learning) to guide automated deductive reasoning (theorem proving).

In Section 2, 3, and 4 we introduce the fields of automated theorem proving and machine learning, and review what has been done to combine the two. In Section 5 we propose a high-level framework for machine learning guided automated theorem proving, and in Section 6 and 7 we describe how we developed a machine learning guided automated theorem prover in the aforementioned framework.

(11)

Chapter 2

Automated Theorem Proving (ATP)

2.1 Formalizing Deductive Reasoning

Deduction is the type of reasoning concerned with necessity, where each conclusion is a logical consequence of the premise. But what exactly does it mean for a statement to be a "logical consequence"? What are "valid rules of inference"? To formalize our natural notion of this, we use what are known as formal and logical systems:

Definition 2.1.1(Formal system). A formal system, consists of:

1. A formal language, which is formed by:

- A set of symbols, the alphabet, strings of which are called formulas.

- Rules for creating syntactically valid formulas (syntax).

2. A deductive system (proof calculus), which consists of:

- A set of axioms. Formulas providing a start or end point for our deduction.

- A set of inference rules operating on formulas in the formal language, determining what are valid inferences.

Definition 2.1.2(Logical system). A logical system, consists of 1. A formal system.

2. Rules for interpreting formulas in the formal language of the formal system (semantics).

Using these notions, we can formalize the process of deduction as mechanical theorem proving. We will therefore use the terms "theorem

(12)

To reach conclusions through deduction in this way, we need to use logical systems whose syntax and semantics are expressive enough and whose deductive systems are strong enough to prove theorems corresponding to the statements in our domain of discourse. What formal framework to use is highly dependent on the nature of the problem/theorem domain. There is a wide range of imaginable domains in which we would like to perform automated deduction. Table 2.1 shows some example domains and their most common formalizations [1]:

Domain Common formalizations

General-purpose theorem proving and problem solving First-order logic, Simple type theory Program verification Typed First-order logic, Higher-order logic Distributed and concurrent systems Modal logic, Temporal logic

Program synthesis Intuitionistic logic

Hardware verification Higher-order logic, Propositional logic

Logic programming Horn logic

Constraint satisfaction Propositional logic, Satisfiabilty Modulo Theories

Computational metaphysics Higher-order modal logic

Table 2.1: Deduction domains and common formalizations

We will now have a look at the two most well-known categories of logical frameworks: propositional logic and first-order logic. For a more detailed overview and proofs of associated theorems, see [2].

2.2 Propositional Logic

Propositional logic, also called zeroth-order logic, is an umbrella term for a set of equally expressive logical systems which deal with propositions as the atomic entity.

Definition 2.2.1(Proposition). A proposition is a statement which is either true or false.

The statement "It is raining" is an example of a proposition.

2.2.1 Syntax

To formalize the notion of a proposition in a formal language for propositional logic, we build propositional formulae using two atomic types of symbols, logical connectives and propositional variables:

Definition 2.2.2(logical connectives). The logical connectives are the symbols¬(negation),∧(conjunction), and∨(disjunction),→(implication).

Definition 2.2.3 (Propositional variables). A propositional variable is a symbol which is not a logical connective, usually a lower case letterp,q,r, etc.

Using these we can build a formal language which will be a set of well- formed propositional formulae recursively defined as follows:

(13)

Definition 2.2.4 (Propositional formulae). A propositional variable is an atomic propositional formula. Any atomic propositional formula is a propositional formula. If F and G are propositional formulae then the following are also propositional formulae:

- ¬F - (F∧G) - (F∨G) - (F→G)

Some examples of well-formed formulas are:

(¬p →(q→ p)) and ((p→q)∧(q→ p))

It is common to impose a precedence, , over the connectives to decrease the number of parentheses needed to write a formula. If• ◦, thenp•q◦r means ((p•q)◦r) and not (p•(q◦r)). The standard precendence is:

¬ ∧ ∨ →. This precedence will be assumed throughout the rest of this thesis.

2.2.2 Semantics

To make sense of the formulae we defined in the previous subsection, we introduce some semantics to interpret them.

Definition 2.2.5 (Truth values and propositional interpretations). An interpretation, I, is an assignment of truth values, true or false, to each propositional variable.

We extend this notion of interpretation to formulas of the formal language described in the last subsection recursively as follows:

Definition 2.2.6(Interpretation of propositional formulae). IfFis an atomic formula then I(F) = true precisely when we have assigned true to F, otherwise I(F)= false. Otherwise if:

- F=¬G, then I(F)= true ifI(G)= false, otherwiseI(F)= true.

- F= (G∧H), thenI(F)= true when bothI(G)= true andI(H)= true, otherwiseI(F)= false.

- F = (G∨H), then I(F)= false when both I(G)= false and I(H)= false, otherwiseI(F)= true.

- F = (G → H), then I(F)= false when both I(G)= true and I(H)= false, otherwiseI(F)= true.

This way, the symbols: ¬,∧,∨ and → match our intuitive notions of the natural language logical connectives: "not", "and", "or", and "implies"

respectively. To make this more concrete, let pbe a propositional variable representing the proposition "It rains" and q be a propositional variable representing the proposition "I go outside". The formula p → ¬ q then

(14)

precisely when the following sentence is true: "It is not the case that it rains and I go outside". In propositional syntax that is: ¬(p∧q)_.

Definition 2.2.7 (Semantic equivalence). Two formulae F and G are semantically equivalent ifI(F)=I(G)for all possible interpretationsI. This is usually writtenF ≡G.

Definition 2.2.8 (Semantic consequence). Formula F is a logical consequence of the set of formulae Γ, if whenever I(G) = true for allG ∈ _Γ simultaneously, then I(F)= true, for all interpretations I. This is usually writtenΓ ^F.

Definition 2.2.9(Tautology). Formula F is a tautology if it is true for every interpretation.

In propositional logic, the formulaeFandGare semantically equivalent if for all possible assignments of truth values to their propositional variables, ifFis true, thenGis true, and ifFis false, thenGis false. We can show this using what are known as truth tables (T = true and F = false):

p q p → ¬ q ¬( q ∧ p) ¬ p ¬ p →( p → ¬ q)

T T T F F T F T T T F T F T T T F F T

T F T T T F T F F T F T F T T T T T F

F T F T F T T T F F T F T F T F T F T

F F F T T F T F F F T F T F T F T T F

Table 2.2: Truth table

The leftmost columns of Table 2.2 indicate truth assignments to the propositional variables pandq. Two variables and two truth values (true, false) gives 2² = 4 possible interpretations, hence there are 4 rows of truth values in the table. Row 1: p= T andq= T, Row 2: p= T andq= F, etc. The remaining columns indicate truth values of the above substituent formulae as defined in Definition 2.2.5, with the columns marked in red indicating the truth values with respect to the outermost connectives and hence the entire formula. We can now formally say that ¬ (q∧ p)and p → ¬qare semantically equivalent, since their truth values are the same for all truth assignments to p and q. We also see that both ¬ (q∧ p) and p → ¬ q are, according to Definition 2.2.8, semantic consequences of ¬ p, since if we know that I(¬ p) =true, then both ¬(q∧ p) = true and p → ¬ q= true. Checking with our intuition, if we know that "It is not raining", then we know that "It is not the case that it rains and I go outside". We are not satisfying both sides of the "and" connective. Lastly we see that¬ p→(p

→ ¬q) is a tautology because it is true irrespective of interpretation.

Notice: To determine the truth value of a formula, it does not matter what the constituent propositional variables represent, only what their truth values are.

(15)

2.2.3 Deductive systems

We previously defined a semantic notion of what logical consequence is by semantic consequence. We now introduce a purely syntactic way to determine whether conclusions follow from premises without considering interpretations.

Definition 2.2.10 (Syntactic consequence). Given a formal system as defined in Definition 2.1.1, a formula Fis a syntactic consequence of a set of formulae Γ, written Γ ` F, if F can be inferred by Γ according to the inference rules of the formal system.

Definition 2.2.11(Formal Proof). Given a formal system, a formal proof is a sequence of formulas, each of which is an axiom of the formal system or is a syntactic consequence of the preceding formulae in the sequence. The last formula in a formal proof is called a theorem.

This notion of a formal proof will not mean much unless our logical system has two core properties:

Definition 2.2.12(Semantical soundness). A logical system is called sound, iff any theorem is a tautology of the system.

Definition 2.2.13 (Semantical completeness). A logical system is called complete, iff any tautology is a theorem of the system.

Hence we want to connect the semantic notion of truth with the syntactic notion of proof, by defining a sound and complete logical system.

There are many formal systems which when combined with the standard semantics for propositional logic create a sound and complete logical system. First we will introduce some standard syntax for all deductive systems (proof calculi) to make it easier to compare different proof calculi:

Definition 2.2.14 (General calculus and proof syntax). A proof calculus consists of:

- Axioms written: (Axiom name)

w

- Rules written: w₁ w₂ · · · w_n

(Rule name)

w

Where w₁, . . . , wnare premises and w is the conclusion. A derivation of w is then a tree growing upwards such that:

- Nodes are axioms or rules of the calculus - The conclusion in the root node is w

- The conclusion of each inner node is a premise of its parent node A proof of w is a derivation of w for which all leaves are axioms.

(16)

Table 2.3 shows an example of a proof caclulus for propositional logic:

(axiom 1)

F→(G→G)

(axiom 2)

((F →(G→H))→((F→G)→(F→H)))

(axiom 3)

((¬F→ ¬G)→(G→F))

F F→ G (Modus Ponens)

G

Table 2.3: Łukasiewicz’ calculus for propositional logic [3]

Notice that Łukasiewicz’ calculus only uses the¬and→connectives. For this calculus to be sound and complete we must first translate all formulae containing∧and∨as follows:

- Any formula of the formF∨Gbecomes¬F→ G - Any formula of the formF∧Gbecomes¬(F→ ¬G)

The proof is omitted here, but one can verify that Łukasiewicz’ calculus (Table 2.3) with standard propositional semantics is sound and complete.

2.3 First-order Logic

Propositional logic treats propositions as atomic entities. It turns out, however, that it is very limiting to only be able to reason about propositions.

In propositional logic, we are easily able to reason about singular things that may or may not be true like whether it is raining. However, to reason that when it rains, grass gets wet, we would need to have a proposition for every single strain of grass in the world saying that it is wet if it rains.

First-order logic is a collection of equivalently expressive formal systems which extends propositional logic by breaking up propositions to allow for quantification of things in the world.

2.3.1 Syntax

We build first-order formal languages as follows:

Definition 2.3.1 (First-order symbols). Any first-order formal language is built using the following disjoint sets of of symbols, the logical symbols:

- Logical connectives:¬_,∧_,∨_,→_.

- Quantifiers:∀(universal),∃(existential)

- Variables:v₁,v₂,v₃, .. (countably infinitely many)

(17)

and the non-logical symbols:

- Constants: c₁,c2,c3, ...

- Functions: f₁,f₂,f₃, ...

- Predicates: P₁,P2,P3, ...

Definition 2.3.2(First-order signature). The non-logical symbols of a first- order language make up what is called the signature of the language. A signature is defined as a tupleσ = (Scon,S_{f un},S_rel,ar)such that:

- S_conis the set of constant symbols - S_{f un}is the set of function symbols - S_preis the set of predicate symbols

- ar : S_{f un}∪Spre → _N,arassigns a natural number called the arity to every function and predicate symbol

Definition 2.3.3(First-order terms). We define the set of terms inductively as the smallest set such that:

- Every variable and constant is a term denoted byt₁,t₂,t₃, ...

- If f is a function of arityar(f) = nandt₁, ...,tnare terms, then so is f(t₁, ...,t_n)

Definition 2.3.4(Atomic first-order formula). We define the set of atomic first-order formulae inductively as the smallest set such that for each predicate symbolP:

- IfPis a predicate symbol of arity ar(P) = n= 0, thenPis an atomic formula

- IfPis a predicate symbol of arityar(P) =n>0 andt₁, ...,tnare terms thenP(t₁, ...,t_n)is an atomic formula

Definition 2.3.5 (First-order formulae). We define the set of formulae inductively as the smallest set such that:

- All atomic formulae are formulae

- IfFandGare formulae, then¬F,(F∧G),(F∨G), and(F →G)are formulae

- IfFis a formula andvis a variable, then∀vFand∃vFare formulae All occurences of the variablevin a formula F, are said to be "bounded" in the formulae∀vF and∃vF and "inside the scope" of the respective quantifier in each formula.

As with propositional logic, there is a standard precedence for first-order formulae defined in order to reduce parentheses. The precedence is: ∀

∃ ¬ ∧ ∨ →. Where "" means same precedence, and "" is as before.

(18)

2.3.2 Semantics

Again, as with propositional logic, we want to make sense of the formulae we just defined. We are no longer dealing with propositions as atomic entities, instead we will be interested in the truth values of the first-order analogs of propositions, predicates:

Definition 2.3.6(First-order predicates). A predicate is an atomic formula containing one or more placeholders and has a truth value of true or false depending on the value of the placeholders.

Definition 2.3.7(First-order interpretation). A first-order interpretation of a first-order language with signatureσis a tupleI = (D,ν)such that:

- Dis a set of elements, called the domain of discourse - νis an interpretation function, such that:

- For every constant symbolc∈σ,ν(c)∈ D

- For every function symbol f ∈σof arityar(f) =n,ν(f): Dⁿ→ D

- For every relation symbolP∈σof arityar(P) =n,ν(P)⊆ Dⁿ Definition 2.3.8(First-order substitution and closed formulae). A substitu- tionτis a total mapping from the set of variables to terms. If F is a formula, we writeτ(F) to mean the result of substituting each free (not in scope of quantifier∀v) occurrence of all variablesvin the domain ofτwithτ(v)in F. We also write F[v/t] to mean the substitution of each free (not in scope of quantifier∀v) occurrence of the variablevwith the termtin F.

Definition 2.3.9 (Closed formula). A first-order formula F is said to be closed if it contains no free variables. We say F is closed under the substitutionτifτ(F)contains no free variables.

Definition 2.3.10 (Interpretation of first-order terms). Given a first-order interpretation,I = (D,ν), we interpret a termtrecursively as follows, if:

- tis a constant, thenv(t) =a, wherea∈ D

- t is a function, f, and ar(f) = m, then ν(f(t₁, ...,tm)) = ν(f)(ν(t₁), ...,ν(t_m))

Definition 2.3.11 (Interpretation of closed first-order formulae). Given a first-order interpretation,I, we can define what it means for a formula F to be true in the interpretation I = (D,ν), recursively as follows:

- A closed atomic formula F = P(t₁, ...,tn) is true in I if (ν(t₁), ...,ν(t_n))∈ ν(P).

- Otherwise if:

- F=¬G, then F is true in IifGis not true in I - F=G∧His true inIif bothGandHare true inI

(19)

- F=G∨His true inIif not bothGandHare false inI

- F = G → His true in Iif its not the case that Gis true in Iand His false in I

- F=∀xGifG[x/a]_{is true in}Ifor alla∈ D

- F=∃xGifG[x/a]is true inIfor at least onea∈D

Definition 2.3.12 (First-order satisfiability and validity). If F is true in an interpretationI, we say thatIsatisfies F. If there is an interpretationIwhich satisfies F, then we say F is satisfiable, otherwise we say it is unsatisfiable.

If a formula is true in every interpretation, then we say it is valid, otherwise we say it is invalid.

2.3.3 Deductive system

The previously defined notions of semantic and syntactic consequence hold for first-order logic as well. Table 2.4 shows Hilbert’s proof calculus for first-order logic leading to a semantically sound and complete first-order logical system.

(axiom 1)

F→(G→ G)

(axiom 2)

((F→(G→ H))→((F→G)→(F →H)))

(axiom 3)

((¬F→ ¬G)→(G→F))

(axiom 4)

∀xF→ F[x/t]

(axiom 5)

∀x(F→G)→(∀xF→ ∀xG)

(axiom 6)

F→ ∀xF

F F→G (Modus Ponens)

G

Table 2.4: A simplified Hilbert calculus [3]

Notice that, just like Łukasiewicz’ calculus for propositional calculus, the simplified Hilbert calculus uses a limited number of connectives. For this calculus to be sound and complete we must first translate all formulae containing∧,∨, and∃as follows:

- Any formula of the formF∨Gbecomes¬F→G - Any formula of the formF∧Gbecomes¬(F→ ¬G)

(20)

Combining Hilbert’s deductive system with the first-order syntax and semantics we have defined so far, we have a semantically sound and complete logical system expanding the propositional system with quantification. First-order logic is a key logic for a multitude of reasons:

Theorem 2.3.1(Compactness). A set of formulae,Γ, has a satisfying interpretation, if and only if every finite subset,Γ⁰ ⊆Γ, has a satisfying interpretation.

Theorem 2.3.2((downward) Löwenheim–Skolem). If a set of formulas,Γ, has a satisfying interpretation, it is satisfiable in an interpretation with a countable domain.

Theorem 2.3.3 (Lindström’s theorem). First-order logic is the most expressive logic having both the compactness property and the (downward) Löwen- heim–Skolem property.

Definition 2.3.13(Syntactic completeness). A formal system is syntactically complete if for each formula,F, in the formula language, eitherFor¬Fis provable.

Theorem 2.3.4(Gödel’s second incompleteness theorem). First-order logic is not syntactically complete. That is, there is no algorithm that always terminates and decides whether an arbitrary first-order proposition is a theorem or not. We say that first order logic is "undecidable".

First-order logic can express a large part of mathematics and encode any computable problem. Therefore, Lindström’s theorem and the semantical soundness and completeness of first-order logic make it an interesting system for automation of proof search. Unfortunately, as Gödel’s theorem states, first-order logic is undecidable.

However, since first-order logic is semantically complete, there is an algorithm that given a valid first-order proposition terminates with a proof of this formula, but given an invalid formula might not terminate. In this case we say first-order logic validity is semi-decidable (or Turing recognizable).

2.4 First-Order ATP

The goal of automating theorem proving procedures was an early and major impetus for development of the field of computer science. The Turing machine was invented to prove that first-order logic is undecidable [4] and the formalization of the most important problem in computer science, the P/NP problem, is based on ATP in propositional logic [5]. We will now have a look at the state of the art algorithms which semi-decide validity of first-order formulae.

Automating first-order theorem proving boils down to automating a first- order deductive system.

(21)

2.4.1 Natural Deduction and Sequent Calculus

The calculi presented earlier in Tables 2.3 and 2.4 are examples of so called Hilbert-style systems. These came bundled with the early exploration of modern logic in the late-19th/early-20th century by Frege, Hilbert, and Russell and are quite compact and elegant; only using the modus ponens rule and a few axioms. However, these calculi are quite far from the way humans normally reason. Wishing to construct a formalism closer to human reasoning, Gerhard Gentzen proposed the natural deduction system in 1934 [6]. In this system Gentzen addresses the fact that humans do not normally start proofs from axioms, but instead make claims under some assumptions and then analyze the assumptions and claims separately and combine them later in the proof. Assumptions are combined into claims by what are known as introduction rules, while assumptions are split into its parts by elimination rules. However, this two-way search, makes natural deduction not well suited for automation.

Gentzen also later realized that he could convert this into one-way ordeal by making assumptions local and encoding the derivability relations in natural deduction by what he called sequents. A sequent, writtenΓ ` _∆, consists of a succedentΓand an antescedent∆, where ∆andΓare sets of formulae. Instead of formulae, the words of the calculus are sequents.

(axiom)

Γ,A` A,∆ Γ,A,B ` _∆

(∧-left)

Γ,A∧B ` _∆

Γ ` A,∆ Γ ` B,∆

(∧-right)

Γ ` A∧B,∆ Γ,A ` _∆ _Γ,B ` _∆

(∨-left)

Γ,A∨B ` _∆

Γ ` A,B,∆

(∧-left)

Γ ` A∨B,∆ Γ ` A,∆ Γ,B ` _∆

(→-left)

Γ,A→B ` _∆

Γ,A ` B,∆ ₍_→_-right) Γ ` A→B,∆

Γ ` A,∆

(¬-left)

Γ,¬A ` _∆

Γ,A ` _∆

(¬-right)

Γ ` ¬A,∆ Γ,A[x\t_i]_,∀xA ` _∆

(∀-left)

Γ,∀xA ` _∆

Γ ` A[x\t_j]_,_∆

(∀-right)

Γ ` ∀xA,∆ Γ,A[x\t_j] ` _∆

(∃-left)

Γ,∃xA ` _∆

Γ ` ∃xA,A[x\t_i],∆

(∃-right)

Γ ` ∃xA,∆

(For∀-right and∃-right,tjmust not occur in the conclusion)

Table 2.5: Sequent Calculus LK

(22)

Theorem 2.4.1(Correctness and Completeness of LK). A formula F is valid iff there is a proof for "` F".

When applying the ∀ and∃ rules in the Table 2.5 LK calculus above, we are currently choosing arbitrary terms t_i and t_j for x. We would like to do something more intelligent than random guessing of the terms we are substituting in so that we are more likely to end up with two syntactically equivalent formulae in the antecedent and succedent. That is, more likely to end up in an axiom.

Definition 2.4.1 (Unification and most general unifier). We say that two terms s and t are unifiable if there exists a substitution such that σ(s) _is syntactically equivalent to σ(t), written σ(s) = σ(t). e.g. s = f(x) and t= f(a)are unifiable using the substitutionσ={x\a}. We callσa "unifier"

forsandt. A unifierσ₁is a "most general unifier" (mgu) forsandtif:

- σ₁is a unifier forsandt

- for every unifierσ₂ ofsandt, there exists a substitutionτ, such that σ2=τ(σ₁)

Keeping track of an mgu σ and using so called "free variables", allows us to delay making substitution decisions, instantiating terms, until they are absolutely necessary and thus reduces the search space. The following calculi we will look at, all use unification and are more machine oriented than the previous calculi. Most real world ATP systems for first-order logic use unification.

2.4.2 Method of Analytic Tableaux

The semantics of sequents are such that a sequent Γ ` _∆ is true in every interpretation, if all interpretations that satisfy all formulae in Γsatisfy at least one formula in∆. In classical logics such as first-order logic, one way to prove that a formula F is true in every interpretation is to show that

¬F is false in every interpretation. This is a consequence of "the law of the excluded middle"; either a statement is true or it’s negation is true.

Expressed in terms of sequents, F is true in every interpretation if the sequent{¬F} ` {}is true in every interpretation. A proof of this is called a "Proof by contradiction" or "refutation".

By negating every formula in the succedent of a sequent and moving them to the anticedent we can create a sequent calculus where all rules have empty succedents in both the premises and conclusions. These are called

"one-sided sequent calculi" or "block tableaux". We will see an example of this shortly, but first we will have a look at some further useful notions.

Definition 2.4.2 (Negation Normal Form). Any first-order formula F can be translated into a semantically equivalent formulaF⁰in negation normal form (NNF). Formulae in NNF do not contain →, and ¬ only occurs (directly) in front of atomic formulae. Formulae are translated into NNF as follows:

(23)

- Any formula of the formA→Bbecomes¬A∨B - Any formula of the form¬(A∧B)becomes(¬A∨ ¬B) - Any formula of the form¬(A∨B)becomes(¬A∧ ¬B) - Any formula of the form¬∀xAbecomes∃x¬A

- Any formula of the form¬∃xAbecomes∀x¬A - Any formula of the form¬¬AbecomesA

Definition 2.4.3 (Skolemization). Through Skolemization, any first-order formula F⁰ can be translated into an equisatisfiable formula F⁰⁰ which does not contain ∃. Equisatisfiable means that F is satisfiable if and only if F⁰ is satisfiable, however they may be satisfied by different variable instantiations so they are not necessarily semantically equivalent. A formulaF⁰in NNF is Skolemized as follows:

- Any formula of the form∀y₁, . . . ,∀y_n∃xAbecomes

∀y1, . . . ,∀ynA[_x\f(_y₁_{, . . . ,}_y_n)]_{, where} _f is a new function symbol.

Combining the notions of one-sided sequents, unification, and Skolemized negation normal form we can build the fairly compact sound and complete block tableau calculus shown in Table 2.6 with the`sign omitted.

(axiom)

L₁,¬L₂,∆

(Such thatσis a m.g.u ofL1andL2)

F,G,∆

(α-rule)

F∧G,∆

F,∆ G,∆ _(β-rule) F∨G,∆

F[x\x^∗],∀xF,∆

(γ-rule)

∀xF,∆

(x^∗is a new variable)

Table 2.6: Block Tableau Calculus

Theorem 2.4.2 (Correctness and Completeness of the Block Tableau Calculus). A formula F is valid iff there is a Block Tableau proof for the Skolemized negation normal form of¬F for a single substitutionσ.

This calculus has "tableau" in the name because it has a quite nice (unambiguous) graphical representation due to only having one branching rule and two non-branching rules. For example a proof tree for refuting the propositional set of formulae: {¬p∨q,(p∨r)∨q}can be drawn as shown in Figure 2.1.

(24)

¬p∧q,(p∨r)∨ ¬q

¬p∧q,p∨r,¬q

¬p,p∨r,¬q

¬p,p,r,¬q

q,p∨r,¬q

¬p∧q,(p∨r)∨ ¬q p∨r

¬q

¬p r p

q

Figure 2.1: Proof tableau, Left: Explicit, Right: Implicit [7]

In the left tree in Fig. 2.1 we write ∆ explicitly in each node. We have reached an axiom when a node contains both Aand¬A. In the right tree, we only write the new formulae generated by the rule application in each node. An axiom is then reached when both A and ¬A are on the same branch.

2.4.3 Connection Calculus

Building on the formula transformations defined earlier we arrive at the normal form which is used for most state-of-the-art theorem provers.

Definition 2.4.4 (Clausal Form). Any formulaF⁰ in Skolemized NNF can be translated into an equisatisfiable set of atomic formulae F⁰⁰ in clausal form. The translation is as follows:

1. Any subformulae of the formA∨(B∧C)_becomes(A∨B)∧(A∨C) 2. For any subformula of the form∀xAinF⁰,AbecomesA[x\x^∗]andF⁰

becomes∀x^∗F⁰, wherex∗is a new unique variable.

3. The resulting formula should now be of the form ∀₁x₁, . . . ,∀_nxnM, where M (called the matrix) is a conjunction of disjunctions. For every conjunction in M create a set of its constituent atomic formulae. F⁰⁰ is then the set of all these sets.

Using the clausal form of a formula we can define another proof calculus reminiscent of the block tableau calculus called the connection calculus.

(25)

(axiom)

{},M,Path C₂,M,Path

(start)

e,M,e C,M,Path∪ {L2}

(reduction)

C∪ {L₁},M,Path∪ {L₂} C₂\ {L₂}_,M,Path∪ {L₁} C,M,Path

(extention)

C∪ {L₁},M,Path

In all casesC2is a copy ofC1∈M,L2∈C2, andσ(L1) =σ(L2)

Table 2.7: Connection Calculus [8]

The complementLof a literalLisPifLis of the form¬P, and¬Lotherwise.

In Table 2.7, the words of the calculus are tuples “C,M,Path”, where M is the matrix of the given formula in clausal form, while C and Path are sets whose elements are removed and added according to the rules of the calculus.

Theorem 2.4.3(Correctness Completeness of the Connection Calculus). A formula F in clausal form is valid iff there is a Connection proof for "e,M,e", where M is the matrix of¬F, for a single substitutionσ.

Further description of the intuition behind these sets and proofs of soundness and completeness can be found in Bibel’s book [8]. This calculus is the basis of the leanCoP [9, 10] prover which we will have a closer look at later in Section 4.

2.4.4 Resolution Calculus

A common heuristic for writing efficient programs is to restrict branching computation as much as possible. The resolution calculus has been the standard basis for first-order proof search since it was introduced by Davis and Putnam in 1960 [11] and improved by the introduction of unification in 1965 by Robinson [12].

(26)

(axiom)

C₁, . . . ,{}, . . . ,C_n C₁, . . . ,C_i∪ {L₁}, . . . ,C_j∪ {L₂}, . . . ,C_n,σ(C_i∪C_j)

(resolution)

C₁, . . . ,C_i∪ {L₁}, . . . ,C_j∪ {L₂}, . . . ,C_n

(Whereσis a m.g.u ofL1andL2)

C₁, . . . ,C_i∪ {L₁, . . . ,Lm}, . . . ,Cn,σ(C_i∪ {L₁})

(factorization)

C₁, . . . ,C_i∪ {L₁, . . . ,L_m}, . . . ,C_n

(Whereσis a mgu ofL1, . . . ,Lm)

Table 2.8: Resolution Calculus

Theorem 2.4.4(Correctness and Completeness of the Resoultion Calculus).

A formula F in clausal form is valid iff there is a Resolution proof for "¬F".

Saturation style calculi based on resolution are at the core of all the currently most efficient first-order ATP systems such as Vampire [13] and E [14] (measured on a subset of the TPTP theorem database during the CADE ATP System Competitions [15]). These are both based on the superposition calculus which is an extension of resolution.

(27)

Chapter 3

Machine Learning (ML)

3.1 Formalizing Inductive Reasoning

Machine Learning (ML) is the subfield of Artificial Intelligence (AI) concerned with algorithms which improve automatically through experience.

More concretely, these algorithms are capable of generalization. That is, they provide sensible outputs for inputs not encountered previously by extracting relevant information from previous inputs and supervision signals, and applying it to analyze new inputs. The algorithms build mathematical models based on sample data, known as "training data", in order to make predictions or decisions they haven’t been explicitly programmed to make. This section is loosely inspired by, and contains paraphrasings from, the following textbooks: [16–19].

3.2 ML Paradigms

Machine learning approaches have traditionally been divided into three categories, depending on the nature of the training data available and the use case of the algorithm.

Supervised Unsupervised Reinforcement

Critic Critic

Error

Reward

Input Input

Input

Output

Output Output

Figure 3.1: Comparison of ML paradigms [20]

(28)

Supervised learning

While all machine learning algorithms learn input/output behaviour by optimizing a function, called the objective function or loss function, and are in this sense "supervised" by that goal, "supervised" in the case of supervised learning means that the training data consists of both example inputs and their desired outputs. The goal is to learn a function that best generalizes this input→ output behaviour. Most of the examples we will see in the next subsection are in this category.

Definition 3.2.1 (Regression). Supervised learning task where the output space is continuous.

Definition 3.2.2(Classification). Supervised learning task where the output space is discrete.

Unsupervised learning

In unsupervised learning, while we are still "supervised" by our goal of optimizing with respect to an objective function, our training data only consists of example inputs no example outputs. The goal of unsupervised algorithms is to discover hidden patterns in inputs and produce output summarizing found patterns.

Definition 3.2.3 (Dimensionality reduction). Unsupervised learning task where the output space is continuous.

Definition 3.2.4(Clustering). Unsupervised learning task where the output space is discrete.

Reinforcement learning

Reinforcement learning is perhaps the most general of the three paradigms, and is the category concerned with agents which interact with environ- ments. In reinforcement learning, training data consists of example inputs, outputs, and rewards for giving the respective outputs for the respective inputs. The goal is to learn a function from input to output which maxim- izes reward.

Definition 3.2.5(Optimal control/decision-making). Reinforcement learning task where the output space can be both discrete and continuous (actions).

3.3 ML Algorithms and Models

Machine learning algorithms build models of input → output behaviour.

These models describe which output to give for any given input. ML algorithms use training data to build models during what’s called "training".

Statistically any machine learning problem can be expressed as follows:

(29)

Y_i =g(X_i) +e_i

Y_i is the observed output, gis the underlying function describing the real relationship between input and output, X_i = (x₁, . . . ,x_n)is the input, and e_irepresents un-modeled determinants ofY_ior random statistical noise.

Our job is to find a model, f, which resemblesgas closely as possible. Dif- ferent algorithms build different types of models, the choice of algorithm will depend on the requirements of the model. The models differentiate themselves in terms of expressive power, efficiency, and interpretability.

3.3.1 Non-parametric Models

Non parametric models do not make any assumptions about the structure of the function approximator or underlying data distributions, and can fit a large variety of functions. They are therefore good when there is little prior knowledge about the data.

K-nearest neighbors (kNN)

This model has no training phase. Instead, to find f(X_i), it measures the distance between X_i and every point X_j in its training data according to some metric (e.g. L2-distance) and considers the k closest points.

Classification is done by a plurality vote of the label of the k nearest neighbors, while regression is done by averaging the label of the k nearest neighbors.

Decision Trees (DT)

Decision trees are more efficient at prediction time than kNN in exchange for having a training phase. In the decision tree model, during training, we split the feature space into regions which best splits the training data according to some metric (usually based on information gain/entropy).

To find f(X_i), during prediction, instead of looking at the entire training data set, we only have to determine which regionX_i belongs to. Figure 3.2 shows a decision tree boundary for a decision tree classifier and the data points used to train it. A decision boundary is the threshold where points are classified as class 0 on one side and class 1 on the other side.

(30)

Figure 3.2: Decision tree training data and decision boundary We can visualize the splits of the feature space shown in Figure 3.2 as a tree as shown in Figure 3.3, hence the name "decision tree". f(X_i)is the leaf found by following the correct branch corresponding to the values of the features ofX_i.

Yes

No

No Class 0

Class 1

Class 0

Figure 3.3: Decision tree prediction

In general, finding an optimal decision tree for a task is infeasible.

Therefore it is quite common to use ensemble methods which construct more than one tree (e.g. boosting and random forests).

At the time of writing, most data science contests, such as the ones on Kaggle.com, are won either by Neural Network models (which will be

(31)

discussed in detail in the next subsection) or Gradient boosting models (usually ensemble of decision trees) [21]. These models are heavily applied in real-world applications and research since they tend to show good results on general data.

3.3.2 Parametric Models

Unlike non-parametric models, parametric models have a predetermined finite number of parameters, w. In particular, w ∈ _{Ω, where} _Ω is a finite-dimensional parameter space. In these models, the same number of parameters will be fit to approximate the input/output function, f, regardless of the nature and size of the training data. In all the models we will look at, wis commonly referred to as the "weights" of the model.

The intuition is that the weights represent the discriminatory information contained in the training data.

Fitting the function f (training) is usually done by either maximizing a likelihood function, so that under the assumed conditions the observed data is most probable, or by more general Bayesian methods.

Generalized Linear Models (GLM)

In these models, the predicted output Y_i is assumed to have an error distribution belonging to the exponential family. The exponential family is a large class of statistical distributions including, among others, the normal, binomial, and Poisson distributions. We let the range of f be the expected value ofY_i given the input:

f(X_i;w) =E(Y_i|X_i)

Furthermore, f assumes the output is given by a linear combination of the input features, X_i, determined by the parameters w. This linear combination is then fed to a mean function h, giving the mean µ of the assumed error distribution of the output. In general:

f(X_i;w) =E(Y_i|X_i) =µ= h(X_i·w)

If the predicted Y_i is assumed to have a normally distributed error distribution,his the identity function:

h(z) =z Our function approximator, f, then becomes:

f(X;w) =h(X ·w) =X ·w

(32)

The retrieved model is what is known as "Linear regression".

Figure 3.4: Linear regression with one feature The function found by linear regression in Figure 3.4 is:

f(X_i) =1.8x₁+0.97

That is, every data point is described as a vector X_i = (x0,x₁) = (1,x₁), wherex₁is the feature of our data we use to predict, andx₀is a bias term to ensure the found function is affine. In terms ofw, we sayfis parameterized by:

w= (w₀,w₁) = (0.97, 1.8)

Now if the predicted Y_i is assumed to have a Bernoulli distributed error distribution,his the logistic function:

h(z) =σ(z) = ¹ 1+e⁻^z Our function approximator, f, then becomes:

f(X_i;) =h(X_i·w) =σ(X_i·w)

The retrieved model is what is known as "Logistic regression". This model is commonly used for binary classification by introducing a threshold function,g, to the above f, say:

g(f(X_i)) =

(1 if f(X_i)>0.5 0 if f(X_i)≤0.5

(33)

Figure 3.5: Logistic regression with two features The function found by logistic regression in Figure 3.5 is:

f(X_i) =σ(1.67x₂+1.30x₁−3.97)

We now have two features of the data we use to predict, so every point is described as a vectorX_i = (x₀,x₁,x₂) = (1,x₁,x₂), and f is parameterized by:

w= (w₀,w₁,w₂) = (−3.97, 1.30, 1.67)

The blue line in Figure 3.5 shows where f(z) = 0.5. That is, the decision boundary. The decision boundary generalizes to a hyperplane in higher dimensions if we predict using more features.

Perceptron and Neural Networks (NN):

We will now look at some function approximators which take on the same form as the GLM’s, that is:

f(X_i;w) =h(X_i·w)

However, they do not restrict the function, h, to depend on the error distribution of the output belonging to the exponential family. Instead they draw inspiration from the brain. Here synapses carry signals between neurons and neurons fire signals as a function of its received signals. In this scenario, we typically refer tohas the "activation function", and its output is called an "activation", denoted bya. All of this is illustrated in Figure 3.6.

(34)

Figure 3.6: Neuron model

The simplest model based on this concept is called the "Perceptron". This model uses the step function as, h:

h(z) =

(1 ifz >0 0 ifz ≤0

Hence if the sum of incoming signals is greater than 0, the neuron

"fires", otherwise it doesn’t. This yields a simpler linear model for binary classification than the logistic regression model, but would have similar weights and plot if trained on the same data as the classifier in Figure 3.5.

The idea of neural networks is to compose these neurons to create a non- linear model. Neural networks compose neurons in two ways, in width and depth. We compose in width by making w a matrix (denoted W) instead of a vector, where each column of W corresponds to the weights wof one neuron, andha vector valued function. Letting X_iW = z, f now becomes the vector valued function:

f(X_i;W) =h(X_iW) = (h0(z0), . . . ,h_l(z_l)) = (a0, . . . ,a_l)

This is shown graphically in Figure 3.7. Here is an explanation of the notation used in these figures 3.7 and 3.8:

- w_i,jcorresponds to the entry ofWat rowiand columnj.

- x_i means thei-th entry of vectorx.

- x⁽^d⁾means the version ofxat depthd.

- w⁽^d,q⁾is theq-th column ofW⁽^d⁾

(35)

Figure 3.7: Neurons composed in width

To compose the neurons depth-wise, we introduce a temporal element to the model, by composing functions. f is now defined as follows:

a⁽⁰⁾ =X_i

a⁽^j⁾ =h⁽^j⁾(a⁽^j⁻¹⁾W⁽^j⁾) f(X_i;W⁽^1,...,d⁾) =a⁽^d⁾

(36)

Composing neurons both in width and depth, we arrive at a family of models called "Artificial Neural Networks" (ANN). In practice most ANNs typically have a simpler structure than what is strictly possible. The sameh functions are typically used for every inner neuron; they are usually simple and provide a smooth (differentiable) transition as input values change. A commonly used activation function is the rectified linear unit:

h(z) =ReLU(z) =max(0,z)

The defining advantage of neural networks is that, by the universal approximation theorem, they can approximate effectively any function to any given precision just by increasing their width and depth enough. That includes non-linear functions (due to the non-linear activation functions)!

Graph Embeddings, Convolutions and Attention Networks

Deep learning is usually said to mean neural networks with many layers (large d in the previous equations). In addition to heavily increased computing power and available data, the recent explosion in interest for machine learning can be largely attributed to developments in the design of ANNs; exploiting the general a priori structure of input data. Introducing convolutions for spatially structured data such as images has shown great results for tasks such as image classification. Using the output of a previous input as supplementary input for temporally structured data such as sentences or audio has shown great results for tasks such as translation and speech recognition. These spatial and temporal data structures are very simple data structures, which can be viewed as special types of graphs. Graphs in themselves are very simple, but also very powerful data structures generalizing the notion of relational data.

Definition 3.3.1(Graph). A graph is a 2-tupleG= (V,E)where:

- Vis a set of vertices.

- E⊆ {(x,y)|(x,y)∈ V²}, is a set of edges.

In figure 3.9, vertices are values at certain time steps and neighbouring time steps are connected by edges. In Figure 3.10, vertices are coordinates/pixels and neighbouring coordinates are connected by edges. In figure 3.11 vertices and edges represents an arbitrary relational structure.

In order to process these types of data with any of our previously introduced models, we need to somehow represent the data as constant- shape tensorsX.

For temporal data such as sentences or audio, a common solution is to represent the nodes with some constant-dimensional node feature vector X_t and have the model process neighbours sequentially using previous output as input, i.e. h_t = f(X_t,h_t −1) where f is a neural network.

(37)

Figure 3.9: Temporal data

Figure 3.10: Spatial (2D) data

Figure 3.11: Arbitrary relational data

These architectures, recursing in a linear fashion, are called recurrent neural networks.

For spatial data such as images, a common solution is to again have a constant-dimensional feature vector for each node (pixel), i.e. RGB values, luminocity, etc, but consider neighbours simultaneously by convolving a constant sized set of weights over the picture representation.

The goal in both of these cases is to produce what’s called an embedding of the graph. These are vector representations of the graph, such that similar graphs are close in the vector space. The embeddings can then be used to discriminate/generate in any machine learning task.

Graph neural networks (GNN) are an attempt at generalizing these concepts to arbitrary graph inputs G = (V,E), such that our model is invariant to node order and number of nodes and edges. The goal is to learn an embedding h_v ∈ _R^s for every node v ∈ V which contains information about the neighborhood ofv. In general,h^t_vis given by:

h⁰_v= Xv

h^t_v= q^t



h^t_v⁻¹, ^[

u∈N(v)

f^t

h^t_v⁻¹,h^t_u⁻¹,e_u→v





This is shown graphically for one node in the Figure 3.11 graph in Figure 3.12.

(38)

A B

C D

E F

INPUT

Target node

Embedding of node at

Figure 3.12: Graph neural network node embedding

Where f^t is a differentiable message function, extracting the most important information from each neighbour of the current node. ^S is a permutation-invariant function to aggregate the messages (like the sum or the average), N(_v)are the neighbors of v, and eu→v is the edge attribute of edge u→ v. q^t is a differentiable update function, taking the previous value ofh_vand the aggregation of the attended neighbours to produce the embedding ofvat the next time step.

3.4 Reinforcement Learning (RL)

RL is used in scenarios where we have no chance of knowing the correct input/output pairs a priori, but we can reward some good input/output pairs when they pay of, such as with complex decision-making problems where we don’t know long term consequences right away. We therefore typically view the reinforcement learning problem as an intelligent agent, the decision maker, trying to maximize long term reward by interacting with a reward-giving environment.

(39)

3.4.1 Markov Decision Processes (MDP)

The environment of an RL problem is usually formalized as a Markov Decision Process. A Markov decision process is a discrete-time stochastic control process, which means it provides a mathematical framework for modeling decision making where outcomes are partly random and partly controlled by a decision maker. A Markov decision process is a 4-tuple (S,A,Pa,Ra), where:

- Sis a set of states called the "state space", - Ais a set of actions called the "action space",

- P_a(s,s⁰)is the probability that taking actiona when in states at time t will lead to state s⁰ at time t +1. This is called the "transition probability",

- R_a(s,s⁰)is the reward received for transitioning from state sto state s⁰by actiona.

Agent MDP (Environment)

Action State

Reward

Figure 3.13: RL as interplay between an intelligent agent and an MDP The goal of the agent in Figure 3.13 is to build a model, π : S → A, a policy for which actions to take from any given state. In general, we want to maximize discounted expected reward over a potentially infinite time frame:

E

"

∑

∞ t=0

γ^tR_π₍_s_t₎(st,s_t+1)

#

Where:

- the expectation is taken overs_t+1∼ P_π₍_s_t₎(s_t,s_t+1),

- πis the optimal policy andπ(st)is the optimal action taken at timet, - 0≤γ≤1 is a discount factor motivating taking actions early.

The field of reinforcement learning studies algorithms for finding optimal in different scenarios, depending on the nature of the 4 components of

(40)

3.4.2 Model-free vs. Model-based RL Algorithms

The term "Model" in "model-free vs. model-based RL", refers to the MDP of the RL problem. In model-free RL we do not model the MDP. That is, we do not try to find explicit expressions definingPandR, we rely on sampling and simulation to estimate the optimal policy. We do not need to know the inner workings of the MDP. Intuitively, model-based RL is the opposite. In model-based RL we use the MDP directly to estimate the optimal policyπ.

Model-based methods are thought to be more sample efficient than model- free methods since they are able to explicitly plan ahead and therefore don’t need to test as many trajectories as model-free methods to find the optimal policy.

Model-Free RL

RL Algorithms

Model-Based RL

Policy Optimization Q-Learning

TRPO

Learn the Model Given the Model

I2A

World Models AlphaZero

MBMF C51

QR-DQN DQN

HER PPO

A2C / A3C Policy Gradient

SAC TD3 DDPG

MBVE

Figure 3.14: A non-exhaustive, but useful taxonomy of algorithms [22–33]

in modern RL from [34]

The algorithms in Figure 3.14 differ by what they learn and how they use what they learn to determine a policy. They can learn:

- policies (which action to take from a state)

- action-value functions (quantifies how good it is to take an action from a state)

- value functions (quantifies how good a state is) - environment models (how the world works) 3.4.3 Exploration vs. Exploitation

A key concept in all of AI, and especially in RL, is the trade-off between how much to explore new solutions vs. how much to exploit known solutions. In the case of RL, this refers to how the trajectories the agent explores are generated. If the agent is making actions in the environment completely randomly, it has high exploration and low exploitation, if the

Learning to Reason

Learning to Reason

Fredrik Rømming

Thesis submitted for the degree of

Master in Informatics: Programming and System Architecture

60 credits

Department of Informatics

The Faculty of Mathematics and Natural Sciences

Learning to Reason

Fredrik Rømming

Abstract

Contents

Chapter 1

Introduction

Chapter 2

Automated Theorem Proving (ATP)

2.1 Formalizing Deductive Reasoning

2.2 Propositional Logic

2.3 First-order Logic

2.4 First-Order ATP

Chapter 3

Machine Learning (ML)

3.1 Formalizing Inductive Reasoning

3.2 ML Paradigms

3.3 ML Algorithms and Models

3.4 Reinforcement Learning (RL)

∑