In the development of our Critical Thinking Assessment for Literacy in Visualizations (CALVI), we extend previous definitions of visualization literacy and define the critical thinking aspect of visualization literacy to be the ability to read, interpret, and reason about erroneous or potentially misleading visualizations.
Subsequently, in the test construction phase, we created a design space to systematically build our item bank, and then conducted a preliminary study to evaluate and iterate on the items.
3.2 Test Construction: Visualization Content Design Space
To systematically generate the items in the bank, we constructed a design space consisting of combinations of possible ways a chart can become misleading (we refer to these as misleaders) and chart types. Below, we describe the process of distilling the set of misleaders and chart types used to construct the design space (also shown in Figure 3).
Misleaders. To compile an initial set of ways a visualization can mislead, we drew from two main prior works: the categorizations from McNutt et al. and Lo et al. [30, 32]. We reviewed each category from McNutt et al., extracted relevant categories based on two main criteria, namely visually detectable (we cannot test for misleaders that people cannot detect visually) and not cognitive biases (readers’ cognitive biases do not fit the definition of a misleader, as they are not part of the visualization construction process), and further grouped the extracted categories into higher-level and lower-level categories (see Figure 3). We repeated the same process with the categorizations from Lo et al.
Then, we merged the two sets by mapping the lower-level categories from Lo et al. to the higher-level categories from McNutt et al. (see Figure 3). During the merge, we mapped 17 lower-level categories to the Manipulation of Scales higher-level category, making it the largest category. We then split it into four subcategories: Inappropriate Order, Inappropriate Scale Range, Inappropriate Use of Scale Functions, and Unconventional Scale Directions, resulting in 14 misleaders (see Figure 3).
To systematically apply the misleaders to different chart types, the misleaders have to be generalizable across chart types; misleaders that cannot generalize cannot be applied to a variety of chart types to populate the design space. Additionally, the items need to be self-contained, so they cannot require domain-specific knowledge. Three high-level categories from McNutt et al. [32] were removed from the set of 14 based on these two criteria (see Figure 3). Namely, the within-the-bar-bias and misunderstanding area as quantity categories were removed because they do not generalize to chart types without bars and without area encodings, respectively. The assumptions of causality category was removed because it requires domain-specific knowledge: deciding whether a correlation in a visual representation reflects a causal relationship requires knowing the causal structure of the domain, and is not a property of the visualization itself. The result is a set of 11 misleaders, whose descriptions are shown in Table 1.
Chart Types. We started with the 12 chart types from VLAT, surveyed by Lee et al. [29] (see Figure 3). Because we wanted the combinations of chart types and misleaders to create misinformative visualizations that people might plausibly see in the wild, we removed chart types that are less likely to appear in reality (the realism criterion). Hence, treemap was removed: Lee et al. [29] ranked it in the bottom half of chart types appearing in data visualization authoring tools and news outlets. In addition, to run an experiment in which each item is seen by a reasonable number of people, we needed an item bank that is not too large, yet still diverse (the diversity criterion). Thus, we removed histogram due to its similarity with bar chart after applying the misleaders, and we merged bubble chart into scatterplot, as a bubble chart is essentially a scatterplot with an additional dimension (see Figure 3). The result is a set of 9 chart types.
Design Space Structure. To construct the skeleton of the design space, we generated a matrix with misleaders as rows and chart types as columns (see Figure 3). This matrix helped us explore how misleaders can be applied across chart types to systematically construct misleading visualizations; we refer to a cell in this matrix as an item type.
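To make this structure concrete, the design space can be pictured as a cross product of the two sets, with each surviving cell an item type. A minimal sketch in Python, using a hypothetical subset of the misleaders and chart types (the full sets are in Table 1 and Figure 3):

```python
from itertools import product

# Hypothetical subsets of the 11 misleaders and 9 chart types
# (the full sets appear in Table 1 and Figure 3).
misleaders = ["Inappropriate Scale Range", "Concealed Uncertainty", "Inappropriate Aggregation"]
chart_types = ["bar chart", "line chart", "pie chart"]

# Each (misleader, chart type) cell of the matrix is an "item type";
# cells failing the realism or diversity criteria are later removed.
item_types = list(product(misleaders, chart_types))
```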
Designing Visualizations. Next, we filled out the design space by applying misleaders to different chart types. We want to note one important consideration during the design process: visualizations alone are not necessarily misleading. To be misled means that the conclusion drawn from the visualization deviates from the correct conclusion, which requires a correct answer or conclusion to exist in the first place. Thus, a visualization can only be misleading with respect to a specific question or task that viewers need to answer or perform using it. We acknowledge that certain types of visualization are implicitly associated with specific tasks because they are designed to highlight certain aspects of the information in the data, so those visualizations can be misleading even without viewers being explicitly asked to perform a task. Therefore, a misleading visualization cannot be divorced from its visualization task, and this was crucial to keep in mind while designing the visualizations in the design space (and writing the question text afterwards).
While filling out the design space, we also designed visual modifications (alterations) that we believed would make some items easier or harder than others, and we systematically applied these alterations across chart types (see Figure 3). The alterations were our attempt to produce items with varying levels of easiness, a property of items ultimately measured by IRT (see section 5). We show another example alteration in Figure 4, which stemmed from applying the Manipulation of Scales - Inappropriate Scale Range misleader to bar charts. The base version simply truncates the y-axis, and the alteration of adding an axis break should make it easier to identify the truncated axis. This specific item type (i.e., bar chart with Manipulation of Scales - Inappropriate Scale Range) therefore has two variations. With the variations in each item type, a total of 128 candidate items were generated after applying misleaders across chart types, as shown in Figure 3.
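To see why the truncated y-axis alteration matters, note that a bar drawn above a truncated baseline has visual height proportional to the value minus the baseline rather than to the value itself. A quick sketch with made-up numbers (not taken from any actual item):

```python
# Hypothetical bar values drawn on an axis truncated at `baseline`.
a, b = 95.0, 100.0
baseline = 94.0

true_ratio = b / a                                 # 1.05: values differ by ~5%
apparent_ratio = (b - baseline) / (a - baseline)   # 6.0: drawn bars differ six-fold

print(f"true ratio: {true_ratio:.2f}, apparent ratio: {apparent_ratio:.2f}")
```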
Within the set of 128 candidate items, there were redundancies due to generating up to three variations within each item type. Thus, to arrive at a diverse bank of items, we reviewed and eliminated misleading visualizations in the design space following the same two criteria as before: realism and diversity (see Figure 3). By the realism criterion, if an item type is not likely to appear in a real-world setting, it is eliminated. For instance, for the Manipulation of Scales - Inappropriate Use of Scale Functions misleader, inconsistent grid lines with an arrow next to them were unlikely to appear in reality, so such items were removed (see Figure 3). By the diversity criterion, we removed item types that are redundant. One example is Manipulation of Scales - Unconventional Scale Directions for stacked bar chart: including this item type would not add variety, as it yields essentially the same design as applying the misleader to a regular bar chart, so we eliminated it. After removing item types based on the realism and diversity criteria, we were left with a total of 35 item types stemming from 11 misleaders and 9 chart types. Along with the variations within each item type, the resulting bank contained 52 items associated with erroneous and misleading visualizations (see Figure 3). An overview of our final design space is in Figure 5.
3.3 Test Construction: Writing Trick Items
Again, it is important to note that whether a visualization is misleading depends on the underlying visualization task one is asked to perform. Take the example of a line chart with an inverted y-axis: this visualization can be very misleading if one is asked to identify the trend of the line and does not notice the inverted axis. However, if the viewer is asked to retrieve the y value of the line at a specific x value, then it is not misleading, because they would simply identify the point of interest on the x-axis and look up its corresponding y value. When we wrote the question text for each visualization in the design space, we ensured that the visualization task associated with each question is one for which the visualization is actually misleading. For example, Concealed Uncertainty is most salient in prediction-making, so all items in this category ask test takers to make predictions; to test whether people can detect Inappropriate Aggregation, the items must ask them to aggregate values from the visualization, such as finding the average. Some misleaders have multiple relevant tasks: for Manipulation of Scales - Unconventional Scale Directions, it is appropriate to ask people to find correlations/trends in a line chart with an inverted y-axis, or to make comparisons between two regions in a choropleth map whose color scale is inverted. The items in our bank involve a total of six tasks: retrieve value, find extremum, find correlations/trends, make comparisons, make predictions, and aggregate values, all of which are visualization tasks rooted in the literature [1, 8, 39]. The first four tasks were taken from VLAT [29], and the latter two were added to support the Concealed Uncertainty and Inappropriate Aggregation misleaders. For further discussion of the relationship between tasks and misleaders, see section 9.5.
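The misleader-task pairings described above can be summarized as a simple mapping; the sketch below shows a subset in Python (names follow the misleaders and tasks discussed in this section):

```python
# Each misleader is paired only with task(s) for which it is actually
# misleading (subset shown; see section 9.5 for further discussion).
relevant_tasks = {
    "Concealed Uncertainty": ["make predictions"],
    "Inappropriate Aggregation": ["aggregate values"],
    "Manipulation of Scales - Unconventional Scale Directions": [
        "find correlations/trends",  # e.g., line chart with inverted y-axis
        "make comparisons",          # e.g., choropleth with inverted color scale
    ],
}
```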
To further increase the diversity of the items, we created a total of 52 different contexts (background stories), covering topics from the prevalence of a plant species to the market share of cell phone brands. Representative item examples for each of the 11 misleaders in our design space are shown in Figure 6. The contexts were intentionally made fictional to limit participants’ use of prior knowledge; thus, every item is self-contained, and participants should be able to select the best answer based solely on the question text, answer choices, and the associated visualization. To isolate the effect of each item’s specific misleader, we designed the answer choices such that there is at least one correct (best) answer and at least one wrong but seemingly correct answer resulting from the misleader (i.e., wrong-due-to-misleader). Any other incorrect answers are more obviously wrong and unrelated to the misleader (i.e., wrong-but-unrelated-to-misleader). The wrong-due-to-misleader answer(s) are needed to measure susceptibility to the misleader (see Figure 6).
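One way to picture the resulting answer-choice structure of an item is the sketch below (hypothetical field names, not our actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Choice:
    text: str
    role: str  # "correct", "wrong-due-to-misleader", or "wrong-but-unrelated-to-misleader"

@dataclass
class Item:
    question: str
    misleader: str
    chart_type: str
    task: str              # e.g., "make comparisons"
    choices: list[Choice]  # at least one "correct" and one "wrong-due-to-misleader"

# Selecting a "wrong-due-to-misleader" choice is evidence of susceptibility to
# the item's misleader; "wrong-but-unrelated" choices are more obviously wrong.
```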
For items whose visualizations do not present enough information for the test taker to choose a reasonable answer, such as the items in the Inappropriate Aggregation misleader category, we considered two ways of phrasing the correct answer: “Cannot be inferred” and “Inadequate information”. After trying both in pilots, we decided to keep the phrasing consistent throughout the test and used “Cannot be inferred / inadequate information” for these answers. Additionally, we added this option to items where it is not the correct answer, ensuring that the option is not correct every time it appears, so that the answer choices do not hint at the correct answer.
One difficulty while writing the items was keeping the question text and answer choices from becoming too long. The concern with long question text is that reading and parsing it might interfere with the goal of measuring the ability to identify the misleader (i.e., participants might answer incorrectly due to errors in reading the question text or answer choices rather than due to not noticing the misleader). One technique we used to shorten the text was labeling the points of interest on a chart, avoiding long descriptions to pinpoint those points.
3.5 Test Construction: Preliminary Study
The goal of the preliminary study was twofold: (1) to qualitatively identify sources of ambiguity and misunderstanding in the question text and the visualizations in the test, and (2) to use the preliminary data to help determine the sample size needed for the test tryout phase. Therefore, in addition to asking each participant to answer 30 items, we also asked them to complete a set of open-ended questions related to the items randomly assigned to them. The open-ended questions asked participants to explain why they selected their answers for the items they answered incorrectly (participants were not told of this selection logic). Thus, along with 3 attention check questions, each participant received 33 selected-response items and a subset of open-ended questions depending on their responses in the selected-response section. There was no time limit.
Participants. We recruited 30 participants from Prolific for the preliminary study; their average approval rate was 99.57%. All participants were located in the U.S. and spoke fluent English. Participants whose ages did not fall between 18 and 65, or who did not have normal or corrected-to-normal vision, were excluded from the study. We collected a balanced sample of 15 males and 15 females, with ages ranging from 19 to 64.
Procedure. First, we presented participants with a consent form and an information page describing the structure of the study. Participants were required to select an answer for the current item before moving on to the next one, and once they moved on, they could not return to previous items.
Method. Because one of the goals of the preliminary study was to uncover any confusion caused by the question text or visualizations, for each selected-response item we reviewed participants’ responses to the corresponding open-ended question to understand the reasoning behind their incorrect choices and whether those choices were due to the intended misleader or to problems with our question text or visualization.
Qualitative Results. The text responses revealed a few sources of ambiguity and inconsistency in the design of items. Two such items pair pie charts with Manipulation of Scales - Inappropriate Use of Scale Functions; in these items, the sizes of the pie slices are inconsistent with the percentages written on the slices. For both items, we initially designated “Cannot be inferred / inadequate information” as the correct answer, because the inconsistency suggests that the visualizations do not convey reliable information. However, some participants expressed that they noticed the conflict between the percentages and the slice sizes but decided to trust one over the other nonetheless. Another item, a line chart with Misleading Annotations, has a similar quality: the title of the chart disagrees with the trend of the line. We were curious about how people understand such visualizations when conflicting information is present, so we decided to implement an open-ended questions section in the test tryout phase and asked participants who received the corresponding trick item(s) in the selected-response section to justify their answers for these three items (shown in Figure 7).
Another (slightly ironic) ambiguity arose in the interpretation of stacked bar charts and stacked area charts. The key ambiguity is whether a chart is read using a position encoding or a length encoding. Under a position encoding, the quantity of each segment of the bar (or area) is the corresponding number on the y-axis at the top of the segment. Under a length encoding, the quantity of each segment is the difference between the top of the segment and its bottom (i.e., the top of the segment immediately below). We ourselves disagreed during item construction: two different authors had used the two interpretations when constructing stacked bar and area charts, and many participants were able to apply both interpretations and make correct choices based on them, reflecting the ambiguous nature of stacked charts. To resolve this inconsistency, we unified our interpretation around the length encoding, the more common interpretation, and we designed the answer choices so that the “correct” answer under the position-encoding interpretation does not appear among the choices of most items involving stacked bar and area charts. To further reduce the ambiguity, we added visual cues to make the use of length encoding more obvious, such as adding transparency to the stacked area charts and grid lines in the background.
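To make the two readings concrete, consider one stacked bar whose segment tops sit at hypothetical cumulative y-positions (values made up for illustration):

```python
# Cumulative y-positions of segment tops within one stacked bar (hypothetical).
tops = [20, 50, 90]

# Position-encoding reading: each segment's quantity is the y-value at its top.
position_reading = tops                      # [20, 50, 90]

# Length-encoding reading (the interpretation we unified on): each segment's
# quantity is its top minus the top of the segment immediately below.
length_reading = [t - b for t, b in zip(tops, [0] + tops[:-1])]  # [20, 30, 40]
```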
For the rest of the items, there were no major ambiguities. We made stylistic modifications to a small subset of visualizations to improve their presentation, such as adding strokes to state and country borders in choropleth maps and making the colors more distinct from each other in certain visualizations with color encodings. Reading the open-ended responses, we also noticed that the vast majority of participants who selected wrong-due-to-misleader answers explained their reasoning without recognizing the misleader.
Test Tryout Sample Size Determination. The data from the 30 participants were used to fit a preliminary model, which we then used to simulate 500 participants, a sample size that would likely be sufficient for the 2PL IRT model [19]. We fit our preregistered model (details in section 5.2) to the 500 simulated participants and checked that the model converged at this sample size, that the posterior predictive checks were reasonable, and that we could estimate the correlation between trick and normal items to a resolution of approximately ±0.1. Thus, we chose 500 as our sample size for test tryout.
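For readers unfamiliar with this style of simulation-based sample size planning, the core of the 2PL model is the item response function P(correct) = logistic(a_j(θ_i − b_j)). A minimal sketch of the simulation step, with assumed parameter distributions (our actual preregistered model is described in section 5.2):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 500, 52

# Hypothetical person abilities and 2PL item parameters (discrimination a,
# difficulty b); in our study these came from the preliminary model fit.
theta = rng.normal(0.0, 1.0, size=(n_people, 1))
a = rng.lognormal(0.0, 0.5, size=(1, n_items))
b = rng.normal(0.0, 1.0, size=(1, n_items))

# 2PL item response function: probability person i answers item j correctly.
p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Simulated 500 x 52 response matrix to which the preregistered model is refit.
responses = rng.binomial(1, p_correct)
```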