Keywords

1 Introduction

In recent years, large language models (LLMs) have made significant progress in AI and Natural Language Processing (NLP) research, and their reasoning abilities have been evaluated in various ways. In particular, LLMs such as BERT [7] and GPT [3] are often claimed to perform natural language inferences at a level comparable to human reasoning. However, it remains unclear whether LLMs can perform logical reasoning accurately. Previous studies [1, 6, 8, 19] have shown that LLMs often exhibit reasoning biases similar to those identified in humans by cognitive science research [10, 15, 24]. This suggests that there is room for improvement in the logical reasoning abilities of current LLMs.

In cognitive science, extensive research has been conducted to elucidate the cognitive role of diagrammatic representations, as opposed to sentential representations [2, 12, 23]. These studies explore the cognitive effects of diagrammatic representations and the reasons behind these effects, with a specific focus on the role of the Euler diagrams in syllogistic reasoning. Sato et al. [22] empirically tested the efficacy of Euler diagrams in human syllogistic reasoning. Their findings suggest that humans use diagrammatic representations as aids in logical reasoning, thereby mitigating the effects of reasoning biases that can impede accurate logical reasoning.

To date, the evaluation of the reasoning ability of LLMs has concentrated on linguistic reasoning, and sufficient research has not been done on reasoning with diagrams. Based on the findings in cognitive science research on diagrams, we may expect that LLMs perform logical reasoning tasks better when aided by diagrams, potentially avoiding reasoning biases just like humans. In this paper, therefore, we systematically examine how accurately LLMs can perform logical reasoning by using diagrams as auxiliary input, and whether they exhibit reasoning biases similar to those of humans. In particular, we focus on syllogistic reasoning, which is known to include various reasoning biases, and examine whether Euler diagrams are effective for error-prone syllogistic reasoning. In doing so, we construct a syllogism dataset where each premise sentence is paired with corresponding Euler diagrams, and each syllogism is annotated with labels indicating reasoning biases (Sect. 3). We will report experimental results and analyses from evaluating the current state-of-the-art LLMs using this dataset under various conditions (Sect. 4).

2 Background

In this section, we first introduce syllogisms and their representation in Euler diagrams (Sect. 2.1). Then, we explain the background of the cognitive science study of human reasoning using Euler diagrams (Sect. 2.2) and the recent research on LLMs’ reasoning ability with sentences and diagrams (Sect. 2.3).

Table 1. Four types of categorical sentences and Euler diagrams.

2.1 Syllogistic Reasoning and Euler Diagrams

A syllogism is an inference that consists of two premises and one conclusion, where the premises and the conclusion are composed of four basic types of quantified sentences as shown in Table 1.

Each syllogism is classified by the order of terms in the premises, traditionally called figure. Given three terms, S, P, and M, we assume that the terms in the conclusion always follow the order SP. Based on this, there are four possible figures, as shown in Table 2. For instance, in figure 1, the terms appear in the premises in the order MP and SM. This implies that each form of syllogism is identified by the figure and the types of its two premises. For example, (1) below is the syllogism where Premise 1 (P1) is of A-type, Premise 2 (P2) is of A-type, and the conclusion (C) is also of A-type, with its figure being 1 (the MP, SM order). This is abbreviated as AA1A. (2) is a syllogism where Premise 1 (P1) is of A-type, Premise 2 (P2) is of E-type, and the conclusion (C) is also of E-type, with its figure being 2 (the PM, SM order). This is abbreviated as AE2E. Both are valid syllogisms.

Table 2. Four figures of syllogisms.
figure a

Each sentence is assigned a set-theoretic meaning as shown in Table 1 and represented in its corresponding Euler diagram. In this paper, we adopt the representation system of Euler diagrams presented in Sato et al. [21, 22]. The sentence “All S are P” is represented by the inclusion relation between circles, while “No S are P” is represented by the circles that are disjoint. In addition, this system uses the symbol \(\times \) to represent the existence of elements in a region. Consequently, the sentences “Some S are P” and “Some S are not P” can be uniquely represented by the diagrams shown in Table 1.

2.2 Human Reasoning with Euler Diagrams

Syllogistic reasoning is often challenging for humans, and it is widely studied in cognitive science which types of syllogisms are more likely to cause human reasoning errors, i.e., which involve reasoning biases [10, 15, 24]. Some typical biases of syllogism will be introduced in Sect. 3.

Based on prior studies on the cognitive role of diagrams in human deductive reasoning [2, 12], Sato et al. [22] empirically investigated the cognitive differences between reasoning with Euler diagrams and reasoning with Venn diagrams, and analyzed the role that each diagram plays in the reasoning process. Sato et al. [22] conducted an experiment with human subjects to test whether the auxiliary use of diagrams suppresses reasoning biases. The results showed that among the two types of diagrams, Euler diagrams not only contribute to subjects’ correct interpretation of categorical sentences, but also play an important role in the reasoning process itself. In particular, the results suggested that Euler diagrams work effectively in problems that cause reasoning biases.

2.3 Logical and Diagrammatic Reasoning Abilities of LLMs

Advanced deep learning models including LLMs have made significant progress in recent AI research. Given their considerable natural language reasoning abilities, extensive research has been devoted to solving complex logical inferences including syllogisms using deep learning algorithms in the field of natural language processing [20, 29]. However, the extent to which these models can accurately perform logical reasoning remains uncertain.

Dasgupta et al. [6] showed that in reasoning tasks including syllogism and Wason’s selection task, LLMs reason more accurately about believable or realistic situations. According to their study, LLMs tend to judge inferences with believable content as valid and those with content inconsistent with our ordinary beliefs as invalid regardless of forms of inferences, thus failing to separate forms from contents (the content effects). Eisape et al. [8] demonstrated that LLMs exhibit error tendencies similar to those of humans, including the figural effect, in syllogistic reasoning. Ando et al. [1] and Ozeki et al. [19] worked on a syllogism dataset called NeuBAROCO, a bilingual parallel dataset of syllogisms in English and Japanese with annotations related to three types of reasoning biases. Evaluating LLMs on this dataset revealed that they exhibit reasoning biases similar to humans, indicating a need for improvement in their logical reasoning capabilities.

As one of the few pioneering studies in diagrammatic reasoning combined with deep learning research, Wang et al. [26] and Wang [25] introduced a method for utilizing deep learning models to learn continuous vector representations of diagrams, and presented Euler-Net, a visual reasoning system for solving syllogisms with Euler diagrams. Euler-Net takes diagrammatic premises as input and generates either categorical or diagrammatic conclusions with high accuracy (99.5%) without human intervention.

While Wang [25] focuses on purely visual (spatial) reasoning with Euler diagrams, we focus on hybrid reasoning combining sentences and diagrams, with focus on the effectiveness of diagrams as auxiliary means of linguistic inference. We also focus on the reasoning abilities of current state-of-the-art LLMs such as GPT-4 [17] and GPT-4-Vision [16] in the zero-shot learning setting with and without Chain-of-Thought prompting [27], which is currently being actively studied. We systematically examine whether Euler diagrams are effective in reasoning in which humans and LLMs are likely to make mistakes.

3 Dataset for Reasoning with Euler Diagrams

We present our constructed syllogism dataset, where the premises of each syllogism are associated with Euler diagrams.Footnote 1 We synthetically generated syllogism problems, as explained in Sect. 3.1, assigning correct answer labels (gold labels), inference types, and labels related to reasoning biases to each problem.

3.1 Two Types of Inference Problems

The dataset we constructed consists of two types of inference problems: Multiple Choice (MC) problems and Validity Checking (VC) problems. Figure 1 shows some examples. Multiple Choice problems are a format frequently used in psychological experiments on syllogistic reasoning (see [4, 11, 14] for an overview). This format is also employed in studies on diagrammatic reasoning, such as those conducted in [21, 22]. It involves selecting the correct answer from five possible conclusions (hypotheses), given two premises. The hypotheses are sentences that combine two terms in forms based on the four quantifiers, All (A), No (E), Some (I), Some ... not (O). Additionally, it is possible to answer with ‘None of them’, which means that there is no valid conclusion drawn from the two premises.

Fig. 1.
figure 1

Two types of syllogism problems: Multiple Choice and Validity Checking

Validity Checking problems are known as the Recognizing Textual Entailment (RTE) [5] or Natural Language Inference (NLI) [28] task in NLP research. In the case of syllogism, this format requires checking whether the set of two premises (P1, P2) entails the conclusion (C). There are three types of answers: entailment, contradiction, and neutral. If the premises entail the conclusion (i.e., if the premises are true, then the conclusion is also true), the inference is labeled entailment. If the premises contradict the conclusion (i.e., if the premises are true, then the conclusion is false), the inference is labeled contradiction. If the relationship between the premises and the conclusion is neither entailment nor contradiction, the inference is labeled neutral.

As shown in Table 2, with three terms S, M, and P forming a syllogism and the conclusion where the order of the terms is fixed to SP, there are a total of 64 inference patterns. Out of these, 15 patterns are considered valid according to modern predicate logic, while the rest are considered invalid.Footnote 2 For the MC problems, we adopt a total of 32 patterns to ensure a balance between valid and invalid patterns, following Sato et al. [22]. This includes the 15 valid patterns and the 17 invalid patterns that share the same figure as any of those valid patterns (See Table 11 for all 32 patterns). For the VC problems, the 15 valid inference patterns are labeled as entailment. From these patterns, those that are obtained by exchanging the forms of conclusions between All (A-type) and Some...not (O-type), or between Some (I-type) and No (E-type), result in contradictions and are labeled as contradiction. The rest of the patterns are labeled as neutral. We describe the detailed construction of the dataset in Sect. 3.3.

Table 3. Examples of syllogisms labeled as Symbolic, Congruent, and Incongruent. All examples are an instance of entailment.

3.2 Inference Types and Reasoning Biases

We distinguish three types of inferences, symbolic, congruent, and incongruent, in terms of what kind of content the sentences appearing in the inference have as follows. Table 3 shows some examples.

Symbolic. When all terms are abstract symbols such as A, B, and C, the inference is labeled as symbol.

Congruent. If there is no inconsistency with common-sense beliefs in all premises and conclusions, the inference is labeled Congruent.

Incongruent. If at least one of the premises or the conclusion does not align with common-sense beliefs, the inference is labeled Incongruent.

The class of Incongruent problems may cause belief-bias effects in human reasoning, one of the well-known biases in cognitive psychology [9, 10]. This is the tendency of humans to endorse inferences whose conclusions they believe and to reject inferences whose conclusions they do not believe, regardless of their logical validity. In addition to Incongruent problems, we consider two other types of reasoning biases in syllogistic reasoning.

Table 4. Examples of conversion errors: Both are labeled as neutral, but when P1 is interpreted as shown in parentheses, the label changes to entailment.

Conversion Errors. Conversion errors are the errors that occur by mistakenly reversing the order of the two terms appearing in a quantified sentence [10, 11]. For example, All A are B and Some A are not B are misinterpreted as All B are A and Some B are not A, respectively. Each of these two sentences may appear similar, but their logical meanings are different. We labeled as Conversion those neutral inferences where A-type (All) or O-type (Some...not) sentences appear in the premises and their correct answer changes from neutral to entailment if the order of the terms appearing in at least one premise is converted.Footnote 3 Typical examples are shown in Table 4.

In Euler diagrams, A-type and O-type sentences are represented as shown in Fig. 2. These diagrams directly illustrate that the relationships between the two circles are asymmetric in both the A-type and the O-type pairs, indicating that the two terms cannot be substituted interchangeably. Consistent with this observation, Sato et al. [22] reported the results of human experiments showing that providing Euler diagrams with syllogisms can mitigate conversion errors.

Fig. 2.
figure 2

Euler diagrams for A-type sentences and O-type sentences: (a) and (b) for A-type (All) and (c) and (d) for O-type (Some...not).

Table 5. Examples of figural effects.

Figural Effects. Human reasoning is sensitive to the order in which terms appear, and it has been observed that deriving the conclusion of a syllogism in the order of C–A is easier with figure 1 (the B–A; C–B order) than with figure 4 (the A–B; B–C order), an effect known as the figural effect [13]. Following [22], we focus on the difference between the EI1O-type and the EI4O-type. Examples of the two types of syllogisms are shown in Table 5. In EI1O, the order of terms when combining the two premises by identifying the middle term B matches the order of terms in the conclusion (both in C–A order). However, in EI4O, the order differs: the premises are arranged in A–C order when combining them by identifying the middle term B, while the conclusion is in C–A order.

As noted in [22], one reason for the difference for humans between EI1O and EI4O is the difficulty in understanding the logical equivalence between the E-type sentences No A are B and No B are A, as well as between the I-type sentences Some A are B and Some B are A. Sato et al. [22] reported that figural effects can be mitigated by using the Euler diagrams, as shown in Fig. 3. This is because it can be immediately seen that the two circles in the E-type pair and the O-type pair are symmetric, thereby conveying the logically equivalent information.

While the belief bias is a bias related to the content of syllogisms, conversion errors and figural effects are biases related to their form. We examine whether these biases can be mitigated in the context of reasoning by LLMs, as observed in human reasoning.

3.3 Overview of the Dataset

For MC problems, we first created a list of term triples, consisting of one triple of abstract symbols (A, B, C) and 21 common noun triples, such as (fruits, foods, stones). These triples were used to fill prepared sentence templates to formulate two premises. The templates are based on all 64 syllogistic patterns. For each problem, four hypotheses determined by the figure of the syllogism, plus an option for ‘none of them’, were numbered from 1 to 5. The number of the correct answer was also included as a gold label. The order of the options was randomized so that the gold labels were evenly distributed. In order to classify the non-symbolic problems into Congruent and Incongruent ones, we annotated the relations between the common nouns of every triple in the initial list into four categories: subset, superset, overlap, and disjoint. We sorted out whether the sentences of the types A, E, I, and O are consistent with common belief or not, as shown in Table 6. Finally, we adopted the 32 syllogistic patterns of problems discussed in Sect. 3.1 from the resulting problems. Given that Incongruent problems outnumbered Congruent problems in the dataset, we balanced it by randomly sampling from the former.

Fig. 3.
figure 3

Euler diagrams for E-type sentences and I-type sentences: (a) and (b) for E-type (No) and (c) and (d) for I-type (Some).

Table 6. The classification of sentences according to noun-pairs and sentence types in terms of belief congruence. The sentences in the gray cells are labeled Incongruent and the others are labeled Congruent.

For VC problems, we created four distinct problems from each MC problem described above. These problems were derived by combining the premises with a hypothesis from the MC problems. We then annotated the problems with gold labels (entailment, contradiction, and neutral) along with other labels. Moreover, within each class of an inference type, we ensured an even distribution of gold labels by randomly sampling the problems, with the constraint that all possible combinations of sentence and figure types were covered in the final dataset.

Our dataset consists of 194 MC and 285 VC problems. Table 7 shows the counts of the MC problems for each inference type, as well as the numbers of valid syllogisms and those with no valid conclusion in the options. Table 8 shows the numbers of the VC problems for each label. Each VC problem is annotated with (i) entailment labels (entailment, contradiction, or neutral) and (ii) inference type labels (Symbolic, Congruent, Incongruent).

Table 7. Number of MC problems by type and label. Valid means that the problem has a valid conclusion in the options. NoValid means that the problem has no valid conclusion in the options (i.e., the correct answer is ‘none of them’.)
Table 8. Number of VC problems by type and label.

4 Experiments

4.1 Evaluated Models and Experimental Settings

In our experiment, we used GPT-4-Vision (GPT-4V) [16] as the main subject of the evaluation. GPT-4V is a multimodal language model capable of accepting both text and images as input. For comparison, in experiments without diagrams, we also evaluated using GPT-3.5 [18] and GPT-4 [17], which only accept text as input. These models were accessed through OpenAI’s API.Footnote 4 Regarding the hyperparameters of the LLMs, to prevent excessively long responses, the maximum token length was set to 350. Default values were used for all other hyperparameters.

The experiments were conducted in settings with and without Euler diagrams. In the setting without diagrams, in addition to ch1GPTsps4V, GPT-3.5 and GPT-4 were also evaluated. In the setting with diagrams, experiments were conducted with three different types of prompts:

  1. 1.

    The basic prompt is the same as the one used in the setting without diagrams.

  2. 2.

    The Extended prompt is a prompt that adds a minimal description and instructions about the diagrams to the basic prompt. For both the basic and Extended prompts, instructions are given to output only the answer (Figs. 4 and 5 show examples).

  3. 3.

    The CoT prompt is based on the Chain-of-Thought prompting technique [27], instructing the language model to first explain the diagrams and then provide an answer (Fig. 6 shows an example output). The explanation of the diagrams is instructed to be 200 words or less.

To indicate the (non-basic) prompts used, we append subscripts to the names of the models: “Ext” for the Extended prompt and “CoT” for the CoT prompt.

In the setting without diagrams, a single input (prompt) includes instructions regarding the problem and answer, along with a single problem from the dataset. In the setting with diagrams, the diagrams corresponding to the problems from the dataset are further added as image inputs. The output of the language model is non-deterministic, so the results may vary with re-experimentation, but in our preliminary experiments, the models showed almost similar tendencies.

Fig. 4.
figure 4

Example Extended prompt for the MC task (correct answer: 3)

Fig. 5.
figure 5

Example Extended prompt for the VC task (correct answer: entailment)

4.2 Tasks

We conducted experiments on the following two tasks.

Multiple Choice (MC) Task. In this task, in addition to the instructions, two premises and four hypotheses that are candidates for the conclusion (each of which is one of the codes A, E, I, or O), plus ‘none of them’ are presented, numbered from 1 to 5, and the language models are asked to answer with one of the numbers. Figure 4 shows an example Extended prompt for this task.

Validity Checking (VC) Task. In this task, two premises and one hypothesis that is a candidate for a conclusion are presented along with instructions about the problem. The models are asked to answer either entailment, contradiction, or neither. Figure 5 shows an example Extended prompt for this task.

4.3 Results and Analysis

The evaluation results for the Multiple Choice and Validity Checking tasks are shown in Tables 9 and 10, respectively.

Table 9. Accuracy (%) on the MC task (\(n = 194\)).
Table 10. Accuracy (%) on the VC task (\(n = 285\)). E = entailment, C = contradiction, N = neutral, Incong = Incongruent, Cong = Congruent, Conv = Conversion.

Results for the MC Task. The overall accuracy is highest in GPT-4 without diagrams (\(66.49\%\)), in comparison to which GPT-4V without diagrams has a lower accuracy (\(53.09\%\)). Therefore, to compare the model’s performances with and without diagrams, we use GPT-4V without diagrams as the baseline.

While GPT-4V with diagrams shows an increase in accuracy of about 5% compared to GPT-4V without diagrams, GPT-4V\(_{\text {Ext}}\) and GPT-4V\(_{\text {CoT}}\) show little improvement in accuracy. For Valid problems, the accuracy is higher with diagrams, reaching about 93% in GPT-4V\(_{\text {CoT}}\).

Fig. 6.
figure 6

Example output of GPT-4V\(_{\text {CoT}}\) for the MC problem with Euler diagrams shown in Fig. 4.

In the case of NoValid problems, providing diagrams alone increases the accuracy from 20% to 28%. This suggests that diagrams are effective on Conversion problems because, in the MC task, all NoValid problems are also Conversion problems. Figure 6 shows an example output of GPT-4V\(_{\text {CoT}}\) for a NoValid problem in the MC task with Euler diagrams. In this example, accurate paraphrasing is provided for the Euler diagrams corresponding to the premises in the first paragraph, while in the second paragraph, when synthesizing the two diagrams, the relationship between A and C within B is undetermined, leading to selecting ‘none of them’ as the correct answer. Note, however, that GPT-4V\(_{\text {Ext}}\) and GPT-4V\(_{\text {CoT}}\) show a decrease in accuracy for NoValid problems (about 16% and 14%, respectively). This could be attributed to their tendency to attempt to synthesize information from the two diagrams but fail to determine the position of A and C when ‘none of them’ is the correct answer. This leads to further errors in inference chains. Interestingly, when explanations are not required, there is an increase in selecting ‘none of them’ as the correct answer.

In the case of Incongruent problems, the effect of diagrams is hardly observed. Detailed results for each problem in the MC Task are provided in Appendix Table 11. Regarding figural effects, the average accuracy for the models without diagrams shows no difference in accuracy between EI1 and EI4 problems (both 77.78%), whereas, for the models with diagrams, the accuracy for EI1 problems (88.89%) is lower than that for EI4 problems (100%). This result is contrary to the human case, where without diagrams, the accuracy for EI4 problems is lower than that for EI1 problems, but providing diagrams eliminates the difference in accuracy between EI1 and EI4 problems.

Results for the VC Task. Similar to the results on the MC task, GPT-4 without diagrams achieved higher accuracy than GPT-4V without diagrams (\(77.54\%\) vs. \(66.67\%\)). We choose GPT-4V without diagrams as the baseline for comparing the models’ performances with and without diagrams.

Comparing GPT-4V without diagrams to models with diagrams (GPT-4V, GPT-4V\(_{\text {Ext}}\), GPT-4V\(_{\text {CoT}}\)), all models with diagrams surpass the accuracy of GPT-4V without diagrams, with GPT-4V\(_{\text {CoT}}\) reaching \(76.84\%\). Note that GPT-4V without diagrams exhibits considerably lower accuracy (\(12.63\%\)) for neutral problems, compared to entailment and contradiction problems. However, under conditions with diagrams, the accuracy of neutral problems increases by around 10% to 20%.

Regarding the problems involving reasoning biases, the accuracy of models with diagrams increases for Incongruent and Conversion problems (GPT-4V\(_{\text {CoT}}\) reaching approximately 74% and 37%, respectively). In addition, for Symbolic problems, the accuracy improves under conditions with diagrams, particularly with GPT-4V achieving a high accuracy of 88%.

5 Conclusion and Future Work

We have investigated how accurately current LLMs can perform syllogistic reasoning when provided with Euler diagrams as auxiliary input. Overall, the experimental results showed that using diagrams as auxiliary input is effective for LLMs, although the effect varies under different conditions. In the VC task, where a single specific conclusion is presented, neutral problems are particularly challenging; however, diagrams have been shown to slightly improve the models’ performance. In the MC task, selecting the “No valid conclusion” option (‘none of them’) is notably difficult, and it was observed that providing diagrams in a Chain-of-Thought setting decreased the accuracy. In both tasks, the diagrams improved the accuracy for Conversion problems, but the improvement in accuracy was not as high as that seen in humans. Regarding Incongruent problems, the inclusion of diagrams has been seen to slightly improve performance.

Future work will explore logical inferences beyond syllogisms, such as those in propositional logic augmented with Venn diagrams. Additionally, conducting more detailed qualitative analyses of the models’ outputs to compare the explainability of models provided with sentences and those with diagrams would be interesting. Our results suggest a tendency for the paraphrasing of diagrams to be accurate, while the models make mistakes in the subsequent reasoning process concerning diagrams. A more detailed analysis will shed light on aspects that cannot be adequately evaluated by overall accuracy alone.