\pdfoutput=1
\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{EMNLP2023}
\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage{inconsolata}

\title{Symbolic Logic is Also Needed}

\author{Greg Coppola \\ {\em Apocalypse Like Right Now} \\ \texttt{greg@coppola.ai}}

\begin{document}
\maketitle

\begin{abstract}
ChatGPT has shown that, surprisingly, a GPT language model can learn from unsupervised text something approximating a logical model of accurate world knowledge. But ChatGPT also has the well-known ability to ``hallucinate'', and give facts or cite sources that do not exist (i.e., were not in the training data). Furthermore, ChatGPT has the limitation that it is not able to reason in a principled way: empirically, it makes reasoning mistakes, and theoretically, it is not connected to a complete or consistent proof system. We propose to address both of these problems using a unified solution paradigm we call {\em SymbolicGPT}, characterized by: 1) the incorporation of {\em discrete logical forms} into a generative pre-trained transformer model, 2) the postulation of syntactic parses and logical forms as latent variables, and 3) a probabilistic model that can assign probabilities to discrete symbolic logical statements, based on a knowledge base of probabilistic discrete logical statements. While the model proposed is only theoretical, in that we do not have results from running the model, this paper does contribute several examples of empirical data on ChatGPT's performance in important relevant cases.
\end{abstract}

\section{Introduction}

\citet{vaswani:attention-is-all-you-need:2017}, which introduced the popular {\em transformer} architecture, used the famously memified title {\em Attention Is All You Need}. The literal meaning of this intuitively humorous phrase is that, given attention, one can dispense with ``complex recurrent or convolutional neural networks.'' However, implicit in the idea that this title is a joke is that, at some level of analysis, attention will eventually {\em not} be all that is needed. Our analysis is that the problems with ChatGPT as it currently stands are: 1) it hallucinates, and 2) it cannot maintain a logically coherent worldview. We propose that what is also needed at this time is a probabilistic knowledge base expressed in {\em symbolic logic}.

\section{Analysis of ChatGPT}

\subsection{The Ability to Build a World Model and Converse}

The popularity of the ChatGPT application is due to the fact that ChatGPT allows better access to information than any available alternative, implying both a useful world model and a useful interface. It has been suggested that ChatGPT is passing the ``Turing Test'' \citep{bengio:eye-on-ai:2023}.

\subsection{The Problem of Hallucination}

The problem of {\em hallucination} is a central one for ChatGPT.

\subsubsection{An Experiment Showing Hallucination}

Here is an example from the preparation of this paper. In response to the question {\em what are some citations for a logical probabilistic database?}, ChatGPT responded with a citation that seems not to exist:
\begin{quote}
Kersting, K., \& De Raedt, L. (2007). Probabilistic logic programming. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 1623-1630.
\end{quote}
It seems, according to {\em Google Search} and {\em Google Scholar}, that this article does not exist, but there is a similar existing source:
\begin{quote}
{\em Probabilistic inductive logic programming}, Luc De Raedt, Kristian Kersting, 2008, Probabilistic inductive logic programming: theory and applications, 1-27, Springer Berlin Heidelberg.
\end{quote}
In other words, the title is {\em Probabilistic inductive logic programming}, not {\em Probabilistic logic programming}, and apparently the venue, publisher, pages and year are wrong. This is a typical example of hallucination, in which ChatGPT returns a result that is similar to something that exists, but is not itself something that exists. The fact that ChatGPT's results cannot be relied upon is a major barrier to its wider use.

\subsubsection{Sutskever's Take on Hallucination}

In {\em Forbes}, OpenAI Chief Scientist Ilya Sutskever recently gave his thoughts on the hallucination issue:
\begin{quote}
Now let's talk about the limitations. {\em It is indeed the case that these neural networks have a tendency to hallucinate}. That's because a language model is great for learning about the world, but it is a little bit less great for producing good outputs. And there are various technical reasons for that. There are technical reasons why a language model is much better at learning about the world, learning incredible representations of ideas, of concepts, of people, of processes that exist, but its outputs aren't quite as good as one would hope, or rather as good as they could be. \citep{sutskever:forbes:2023}
\end{quote}
Clearly, then, OpenAI views hallucination as a primary limitation of ChatGPT. Sutskever theorizes that the ChatGPT model {\em is} good at {\em learning} about the world, but that the problem is with the {\em outputs}, and goes on to mention that {\em reinforcement learning} can be used to better control the output.

\subsubsection{An Experiment on ChatGPT's Conception of Truth}

We believe that the transformer model does not have a clear distinction between true and false. In support of this hypothesis, consider ChatGPT's response in the following exchange:
\begin{quote}
{\bf paper author}: how does a gpt model represent a difference between ``true'' and ``false''?

{\bf ChatGPT}: {\em A GPT model does not inherently understand the concepts of ``true'' and ``false'' in the same way humans do}. Instead, it learns statistical patterns from large amounts of text data to generate outputs that are likely to be coherent and grammatically correct given the input prompt.
\end{quote}

\subsubsection{Aristotle's Law of the Excluded Middle}

Aristotle's law of the excluded middle states that:
\begin{itemize}
\item Aristotle's Law of Excluded Middle: For any proposition $P$, either $P$ or $\neg P$ is true.
\item Probabilistic Version: For any proposition $P$, $p(P) + p(\neg P) = 1$.
\end{itemize}
There is no clear analog to this in the transformer, either in practice or in theory, and even ChatGPT does not believe there is an analog. The generative GPT model has only a fixed number of continuous parameters, with no discrete representation of truth values; this intuitively suggests something that needs fixing.

\subsubsection{Discrete Line Between Known and Unknown}

In 2002, Donald Rumsfeld famously popularized a distinction between {\em known unknowns} and {\em unknown unknowns}.
That is, there are three kinds of epistemic states for a database:
\begin{itemize}
\item {\em known knowns}: both the question and the answer are known
\item {\em known unknowns}: the question is known but the answer is unknown
\item {\em unknown unknowns}: a relevant factor exists, but neither the factor itself nor even the question to ask about it was known
\end{itemize}
We observe that ChatGPT does not make this distinction. It seems only to have parameters that help it predict the next token, without clearly demarcating these distinctions between known and unknown.

\subsection{The Problem of a Logically Consistent Worldview}

Evidently, the key-value mechanism of attention is able to encode some kind of functional relationships that, when expanded large enough, can learn a representation of the ``world'' that is good enough to store information, and make it available again to consumers on a wide array of tasks. ChatGPT can answer logical questions, and shows signs of logical reasoning in much of its behavior.

\subsubsection{Hinton Stresses the Problem of a Logically Consistent Worldview}

Recently, in an appearance on {\em CBS Mornings}, Geoff Hinton discussed the problem of creating a {\em logically consistent worldview}:
\begin{quote}
We're at a transition point now where ChatGPT is this kind of idiot savant, and it also doesn't really understand about truth. It's being trained on lots of inconsistent data it's trying to predict what someone will say next on the web. And people have different opinions and it [ChatGPT] has to have a kind of blend of all these opinions so that it can model what anybody might say. It's very different from a person who tries to have a consistent world view. Particularly, if you want to act in the world, it's good to have a consistent world view. And, I think what's good one thing that's going to happen is we're going to move towards systems that can understand different world views and can understand that okay if you have this world view then this is the answer and if you have this other world view then that's the answer. \citep{hinton:cbs-mornings:2023}
\end{quote}

\subsubsection{Aspects of a Logically Consistent Worldview}

The ability to hold a ``logically consistent'' worldview includes the following competencies:
\begin{itemize}
\item the ability to expand a set of premises $P$ to some $P' \supset P$ using a consistent set of {\em inference rules}
\item the ability to detect a contradiction in a set $P$ of sentences
\item the ability to partition a contradictory set of sentences $P$ into maximally consistent sets of sentences, $P_1, ..., P_N$
\item the ability to compare two theories $P$ and $Q$ and determine, e.g., which is a better fit for the data
\item the ability to explain a piece of data based on multiple consistent worldviews, i.e., to compute the probability of an observation under multiple different worldviews
\end{itemize}

\subsubsection{An Experiment in Reasoning with ChatGPT}

There is no formal guarantee that the generative word model's explanatory parameters amount to logical inference. Indeed, consider the following interaction we recorded:
\begin{quote}
{\bf paper author}: if Alice is shorter than Bob and Bob is shorter than Charlie, then who is the tallest and why?

{\bf answer}: Based on the information given, it is not possible to determine who is the tallest among Alice, Bob, and Charlie. All we know is that Alice is shorter than Bob, and Bob is shorter than Charlie.
However, we do not have any information about the actual heights of any of the three individuals, so we cannot make any conclusions about who is the tallest. For example, it is possible that Alice is 5 feet tall, Bob is 5'5" tall, and Charlie is 6 feet tall. In this case, Charlie would be the tallest. Alternatively, it is possible that Alice is 6 feet tall, Bob is 5'10" tall, and Charlie is 5'5" tall. In this case, Alice would be the tallest. Therefore, without additional information about their heights, we cannot determine who is the tallest among Alice, Bob, and Charlie.
\end{quote}
To analyze this mistake, we can first of all say that the answer should have been that {\em Charlie} is the tallest. This is true because the relation of being taller/shorter is transitive, so from the definitions of the words and the two relations given, we can conclude that Charlie is the tallest. ChatGPT also suggests a supposedly possible scenario in which Alice is 6 feet tall but Bob is 5'10", contradicting the assumption that {\em Alice is shorter than Bob}. Clearly, ChatGPT is not thinking in terms of rigorously consistent logical structures. We believe this is related to the fact that there is no clear distinction between true and false in the generative pre-trained model, and to the fact that there is no proof system inherent in the transformer that guarantees consistency: there is no mechanism for detecting a logical contradiction.

\section{A Symbolic Probability Database}

\subsection{The Concept of a Discrete Probabilistic Database}

We can refer generally to a {\em discrete logical database} as a set of pairs $K = \left\{(z, p) \in \ell \times [0,1] \right\}$, where each $z$ is a statement in a logical language $\ell$ and each $p$ is its associated probability. One example of such a database would be a {\em Bayesian Network} \citep{koller:propabilistic-graphical-models:2009}. A Bayesian Network could satisfy all of the properties we need for evaluating and learning from sentences in $\ell$, except for the problem of {\em universal quantification}, i.e., sentences of the form $\forall x\ f(x) \rightarrow g(x)$. More generally, we can look to various theories of {\em logic programming} \citep{deraedt:probabilistic-inductive-logic-programming:2008}, which do attempt to address the quantification issue. So far, it seems, there is no clear sense of unity or standardization in theories of logical inference like there is in neural network inference.

\subsection{Motivation for a Discrete Probabilistic Database}

We propose that by making the knowledge database discrete, we will avoid the problem of hallucinations, because we have two clearly stored dimensions:
\begin{itemize}
\item discrete database membership: for a given sentence $z$, either $(z, p) \in K$ for some $p$, or there is no such pair. We can thus detect whether a fact has actually been seen, and incorporated, before.
\item ability to evaluate likelihood: for a given sentence $z$ that is in $K$, we can evaluate whether its associated probability $p$ is high or low
\end{itemize}
If this information is correctly constructed, hallucination could be avoided, because unseen facts are simply not in the database.
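To make the intended use of these two dimensions concrete, the following is a minimal illustrative sketch, in Python, of a discrete probabilistic knowledge base supporting exactly the membership and likelihood operations described above. The class and method names, and the use of plain strings for logical forms, are hypothetical simplifications and not part of the formal proposal:
\begin{verbatim}
# Minimal sketch: a discrete probabilistic
# knowledge base K of (statement, probability)
# pairs.  Logical forms are strings here.
class DiscreteProbabilisticKB:
    def __init__(self):
        self.facts = {}  # maps z to p

    def add(self, z, p):
        assert 0.0 <= p <= 1.0
        self.facts[z] = p

    def contains(self, z):
        # Discrete membership: (z, p) is in
        # K for some p, or it is not.
        return z in self.facts

    def probability(self, z):
        # Defined only for stored facts; an
        # unseen statement is reported as
        # unknown, not guessed.
        if not self.contains(z):
            raise KeyError("unknown: " + z)
        return self.facts[z]

kb = DiscreteProbabilisticKB()
kb.add("shorter(alice, bob)", 0.99)
kb.contains("published(kersting, 2007)")
# -> False
\end{verbatim}
Under such a scheme, a statement like the fabricated 2007 citation above would simply fail the membership test, rather than being generated with fluent confidence.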
\subsection{Requirements of a Logical Database}

In order to incorporate a discrete probabilistic model into our generative story, we must assume a system that has the following properties:
\begin{itemize}
\item given a model $\Phi$, we can efficiently assign a probability $P_\Phi(z)$ for any $z \in \ell$
\item given a database of logical facts $\left\{z_n\right\}_{n \in W}$, covering all facts observed in the world $W$, we can call $\Phi = \textsc{ReTrain}(\left\{z_n\right\}_{n \in W})$
\end{itemize}
Presumably, we would want to be able to make ``on-line'' updates to the model, and it will not be practical to re-estimate all parameters all of the time. However, we want to leave $\Phi$ an abstract object, and so avoid considerations of how updates could be made online, to avoid restraining the model class prematurely, and leave optimizations for future work.

\section{Syntax and Semantics}

\subsection{Traditional Conception of Linguistics}

The traditional conception of linguistics was that a layer called {\em syntax} served to intermediate between the two interfaces of {\em logical form}, in which reasoning is done, and {\em surface form}, corresponding to what is written or spoken \citep{sausure:course-in-general-linguistics:1916,chomsky:minimalist-program:2014,steedman:syntactic-process:2001}. Thus, the fact that ChatGPT does not rely on syntactic structures is perhaps its most surprising feature, and has on these grounds drawn criticism from Chomsky \citep{chomsky:debunking-the-great-ai-lie:2023}. We believe that the fact that linguistics has long been seen as involving a layer of syntactic analysis is itself a major kind of evidence that the syntactic layer has relevance. It is an amazing achievement of ChatGPT that it can build a world knowledge model while avoiding this layer. However, the ever-present intuition that syntactic analysis is relevant suggests that, in the future, syntax must make a comeback in natural language processing.

\subsection{Mapping Logical Form to Surface Form}

Let $z_n$ be a logical form, $x_n$ be a paragraph of surface tokens, and $y_n$ be a parse that mediates between them. For a given paragraph $x_n$, the set of candidate parses $C(x_n)$ will grow more than exponentially in the length of $x_n$. However, in practice this space can be pruned, if necessary, using $k$-best pruning from an existing parsing model \citep{charniak:coarse-to-fine:2005,huang:forest-reranking:2008}. We will assume that a parse $y$ uniquely determines a logical form $z$, so that we can write the set of candidate parses for a sentence $x$ as $C(x)$, a set of pairs $(y, z)$, where $y$ is a parse that yields $x$, and $z$ is the unique logical form determined by $y$.

\section{Generating Text with Latent Logical Forms}

\subsection{ChatGPT's Generative Story}

Let ${\bf x} = \left[x_n\right]_{n=1}^N$ be a document of $N$ paragraphs, each of length $W$ tokens. A ChatGPT-style language model \citep{radford:improving-language-understanding-by-generative:2018} models the probability of a document as:
\[ p({\bf x}) = \prod_{n=1}^N p(x_n \mid x_{n-1}) \]
That is, in the generative story, one paragraph is created based only on the previous one, with no hypothesized latent state of any kind.
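For concreteness, the following is a minimal sketch, in Python, of how this paragraph-level factorization scores a document; the function \texttt{cond\_prob} is a hypothetical stand-in for the transformer's conditional $p(x_n \mid x_{n-1})$, and nothing here is specific to our proposal. The latent-variable story introduced next replaces this single factor with a decomposition over logical forms $z_n$ and parses $y_n$.
\begin{verbatim}
import math

# Sketch of the baseline factorization:
# log p(x) = sum_n log p(x_n | x_{n-1}).
def document_log_prob(paragraphs, cond_prob):
    total = 0.0
    prev = None  # x_0: no previous paragraph
    for x_n in paragraphs:
        total += math.log(cond_prob(x_n, prev))
        prev = x_n
    return total

doc = ["Alice is shorter than Bob.",
       "Bob is shorter than Charlie."]
# Toy usage with a dummy uniform conditional:
print(document_log_prob(doc,
                        lambda x, prev: 0.5))
\end{verbatim}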
\subsection{High-Level Flow of a Latent Variable Generative Story}

We want to create a generative story, which we call {\em SymbolicGPT}, in which, to create $x_n$ based on $x_{n-1}$, we proceed as follows:
\begin{itemize}
\item with probability $p(z_n \mid x_{n-1}; \theta, \Phi)$, choose a logical form $z_n$, based on (encoding vectors from) the previous paragraph $x_{n-1}$, the neural parameters $\theta$, and the prior likelihood of articulating the logical concept $z_n$ according to the logical database $\Phi$
\item with probability $p(y_n \mid x_{n-1}, z_n; \theta)$, choose a syntactic analysis $y_n$, based on $z_n$, (encoding vectors from) the previous paragraph $x_{n-1}$, and the neural parameters $\theta$
\item the output text $x_n$ is determined uniquely by $y_n$
\end{itemize}

\subsection{Factorization of Logical Forms}

We suppose that the $x_n$ are {\em paragraphs}, which is to say sequences of sentences. We assume that we receive the document or corpus segmented into paragraphs $x_n$, and that each $x_n$ contains within it logically separable sentences $[s_1, ..., s_{S_{x_n}}]$, but that tokenization internal to $x_n$ is to be done inside the model we are describing. The idea is that, since each $x_n$ represents a sequence of logically separable sentences, we can use the distribution over interpretations of one logical sentence within $x_n$ to help inform the interpretation of another logically separable sentence within $x_n$. That is, in the generative story, we include a factor $p(z_n \mid x_{n-1})$, rather than $p(z_n \mid z_{n-1}, x_{n-1})$. That is, $z_n$ does {\em not} depend on $z_{n-1}$, in order to avoid a blow-up in complexity, both conceptual and run-time, of the computational graph.

\subsection{Generating the Logical Form}

First, we choose a logical form $z_n$, based on $x_{n-1}$, $\theta$, and $\Phi$:
\[ p(z_n \mid x_{n-1}; \theta, \Phi) \]
One straightforward way to factor this is as a product of the syntactic likelihood of producing $z_n$ (as a syntactic object) in the given context:
\[ p(z_n \mid x_{n-1}; \theta) \]
and the logical likelihood of saying $z_n$ at all, according to $\Phi$:
\[ p(z_n \mid \Phi) \]

\subsection{Generating the Parse}

We next choose the parse $y_n$ in the normal transformer way:
\[ p(y_n \mid z_n, x_{n-1}; \theta) \]
There are a number of different parse formalisms (CFG parsing, unlabeled dependency parsing, labeled dependency parsing), and we are not currently proposing to choose between them. The only requirement of the parsing formalism is that it must map a surface form $x_n$ to some $z_n$ that we can reason with. One can use a discriminative parser to constrain the space, and this would create a parse in a bottom-up fashion. Once the space is pruned and structured, the generative story can be told in a top-down fashion, as in the original parsing models \citep{collins:a-new-statistical-parser:1996,charniak:statistical-parsing-with-a-cfg:1997}.

\subsection{Generating the Text}

The text $x_n$, we reiterate, is a part of the parse $y_n$, and so is determined once $y_n$ is chosen.

\section{Parameter Estimation}

The problem of parameter estimation in the ChatGPT-style GPT model is the problem of estimating what we are calling $\theta$, the continuously valued parameters of the neural network part of the model. In our case, we must also estimate $\Phi$, the discrete logical model, and we must hypothesize latent syntactic/logical parses.
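To be explicit about what is being estimated, and assuming that the candidate set $C(x_n)$ defined above contains only pairs $(y_n, z_n)$ whose parse yields $x_n$ (so that the final text-generation step is deterministic), one natural way to write the marginal likelihood that training would target is:
\[ p({\bf x}) = \prod_{n=1}^{N} \sum_{(y_n, z_n) \in C(x_n)} p(z_n \mid x_{n-1}; \theta, \Phi)\, p(y_n \mid x_{n-1}, z_n; \theta) \]
This is only a sketch of the objective, since the exact parameterization of the two factors is left open above; but it makes clear that $\theta$ and $\Phi$ are the parameters to be estimated, while the pairs $(y_n, z_n)$ are summed out as latent variables, which motivates the use of expectation maximization below.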
\subsection{Expectation Maximization}

The first problem to be solved is that the logical forms are {\em not} part of the given data, and so must be modeled as {\em latent variables}. The ChatGPT transformer does not use latent variables in the way that they were understood in the context of part-of-speech tagging or syntactic parsing, and this greatly simplifies its training. However, in order to realize the traditional vision of syntax as mapping logical forms to surface form \citep{montague:universal-grammar:1970,chomsky:minimalist-program:2014,steedman:syntactic-process:2000}, we will have to involve a logical representation, which is not present in the data, and therefore is latent. In order to train with latent variables, we can use {\em expectation maximization} \citep{dempster:maximum-likelihood:1977}. In the {\em expectation} step, we assign distributions over possible parses $y_n$, each of which implies an associated logical representation $z_n$. In the {\em maximization} step, we optimize $\theta$ and $\Phi$. Bootstrapping this process will perhaps be difficult. One option is to start by bootstrapping from a high-resource language, e.g., English. A syntactic parser can be discriminatively trained based on a GPT generative model, the way that few-shot learning happens today. If logical world knowledge can be encoded based on English, then other languages can map onto this logical space.

\subsection{Heterogeneous Training}

The neural network part of the model can be trained using back-propagation. It is not clear exactly how the logical model would be trained, or whether back-propagation would be appropriate. We have specified that we require the abstract interface $\Phi = \textsc{ReTrain}(\left\{z_n\right\}_{n \in W})$ to retrain $\Phi$, but how this would be done is difficult future work. However, in order to allow the most generality, we can train the neural network part in a heterogeneous way compared to the logical model. The predictions of the logical model can be treated as exogenous features, and back-propagation can work through the exclusively neural part.

\subsection{The Logical Model}

The fundamental difficulty in estimating a model using symbolic logic is the question of what the symbols are. In other words, the problem for the Bayesian network formulation is that we must identify both a graphical structure $G$, not known a priori, and a probability model $\Phi$ over that structure. The question of how this would work must be left to future work. However, for a fixed directed Bayesian network, the problem of parameter estimation is trivial, and just involves counting.

\section{Discussion}

We are not able to evaluate this model at the present time. We only record the logic and the mathematics in order to receive feedback on the equations. However, we reiterate the benefits we expect to see from this model. Hallucinations can hopefully be removed through the use of a discrete membership database, in which a fact is either in the database or not, and, if it is, has an explicitly stored high or low probability. The discrete logical database will also allow for logical reasoning, and an ability to detect whether statements actually contradict one another.

\section{Conclusion}

We have introduced {\em SymbolicGPT}, a new architecture that proposes to integrate a symbolic logic grammar into a ChatGPT-style generative pre-trained transformer, with the goal of eliminating hallucinations and creating a logically consistent worldview.
This new architecture requires the postulation of latent syntactic and logical forms, which can be learned through expectation maximization. Of course, this is only a theoretical proposal, and not an empirical result. The obvious future work is to instantiate the parameters of this theory and get it actually working. A narrower piece of future work would be simply to refine the logical reasoning engine.

\bibliography{anthology,custom}
\bibliographystyle{acl_natbib}

\end{document}