For participants, developing geospatial analyses involved constructing pipelines that applied many geospatial operators (in a particular order) to input layers to produce target outputs. Geospatial operators transform both the geometries and attributes of geospatial data, making it difficult to reason about their behavior. For example, the Dissolve operator merges the boundaries of geographic features possessing a shared attribute value and combines attributes of merged features using an aggregation function (e.g., sum) (Figure 6). Constructing analysis pipelines required participants to have deep knowledge of operators and their semantics as well as the ability to inspect and debug generated outputs.
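To make the Dissolve semantics concrete, the following minimal sketch models each feature's geometry as a set of grid cells. This is a simplifying assumption: real GIS operators work on vector boundaries, and the `dissolve` helper and the `district` and `population` field names are hypothetical rather than taken from any GIS library.

```python
from collections import defaultdict

def dissolve(features, by, agg_field, aggfunc=sum):
    """Merge features sharing the same `by` value (toy model of Dissolve).

    Each feature is a dict with a "cells" geometry (a set of (col, row)
    grid cells, an assumption of this sketch) plus attribute fields.
    """
    groups = defaultdict(list)
    for feature in features:
        groups[feature[by]].append(feature)

    dissolved = []
    for key, members in groups.items():
        # Geometry: the union of the member geometries.
        merged_cells = set().union(*(m["cells"] for m in members))
        # Attributes: combined with the aggregation function.
        merged_value = aggfunc(m[agg_field] for m in members)
        dissolved.append({by: key, "cells": merged_cells, agg_field: merged_value})
    return dissolved

# Hypothetical input layer: two tracts in district A, one in district B.
tracts = [
    {"district": "A", "population": 120, "cells": {(0, 0), (1, 0)}},
    {"district": "A", "population": 80, "cells": {(2, 0)}},
    {"district": "B", "population": 50, "cells": {(0, 1)}},
]
districts = dissolve(tracts, by="district", agg_field="population")
for d in districts:
    print(d["district"], d["population"], sorted(d["cells"]))
```

Both halves of the operator are visible here: the output has fewer features than the input, and each output attribute is an aggregate rather than a value copied from any single input feature.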
5.3.1 Identifying Geospatial Operators.
Participants struggled to identify the correct operators to transform input layers into target outputs (PE3, PE7, PE9, PS4, PJ4, PO2). Even an expert, PE7, noted that distinguishing the behaviors of different geospatial operators is challenging: “I can never remember the vector operations. There’s like Union and Merge. Combine! I can never remember exactly what they do. I know exactly what the output should look like in the end; I’m just trying to figure out the tool that gets me that output.” PE9 spent 16 minutes searching for a geopandas operator to filter a point layer to locations falling within a specific polygon in a separate layer. They experimented with programs using intersection and sjoin before identifying a solution using within, reflecting: “I feel like I spend a lot of time getting stuck on, like, very simple GIS. It’s things like Merge vs. Join, getting confused with which one you want. Or Spatial Join vs. a regular Join. Sometimes just the terminology can be confusing, and sometimes it’s not consistent between QGIS and Arc[GIS] and geopandas.” The number of operators in GIS software and geospatial analysis libraries exacerbates this challenge. For example, ArcGIS has over 200 operators in its Spatial Analyst toolbox, ranging from bitwise operators to kriging algorithms [31]. This is only one of its 41 toolboxes.
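PE9's task, keeping only the points that fall within a polygon from another layer, can be sketched without any GIS library. The example below uses a standard ray-casting test. The `point_in_polygon` and `filter_within` helpers and the sample coordinates are hypothetical, and the sketch ignores edge cases such as points lying exactly on a boundary and polygons with holes.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is `point` inside `polygon`?

    `polygon` is a list of (x, y) vertices without a repeated closing vertex.
    Points exactly on the boundary are not handled consistently (sketch only).
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_within(points, polygon):
    """Keep only the points that fall inside the polygon."""
    return [p for p in points if point_in_polygon(p, polygon)]

# Hypothetical study-area polygon and candidate point locations.
study_area = [(0, 0), (4, 0), (4, 3), (0, 3)]
sites = [(1, 1), (5, 1), (2, 2.5), (-1, 2)]
inside_sites = filter_within(sites, study_area)
print(inside_sites)
```

The sketch only illustrates the underlying predicate. It does not show how intersection, sjoin, or within behave in geopandas, which is exactly the distinction PE9 had to work out by experimentation.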
Alternative Expressions of Intent.
Although participants struggled to construct analysis pipelines, many could describe their intent in other ways (PE7, PE9, PS4, PJ3, PJ4, PJ7, PJ8). Some used natural language descriptions, either spoken aloud (PE7, PE9, PS4, PJ3, PJ4, PJ7) or written as comments (PE8, PS4, PJ1, PJ4). For example, PJ4 phrased their intent as a question: “How many homicides did each neighborhood have this year, and how did that compare to, like, last year or the last five years, or something like that, right?... So now I’m doing the puzzle in my head, like, how am I gonna get there?” They proceeded to write individual subgoals for each analysis step in comments in their Jupyter notebook (e.g., “Spatially join homicides to [neighbor]hoods”). Some participants interacted directly with features in a map view to express their intent (PS4, PJ7). PJ7 used their mouse to demonstrate in QGIS how they would compute buffers around each line feature in their stream dataset, then compute the area of overlap between these buffers and a raster deforestation dataset. This would yield the total area of illegal logging in their analysis region.
Code Foraging.
When participants could not identify the correct operator for an analysis context, they resorted to foraging for similar analysis examples on Google (PE3, PE7, PE9, PJ2, PJ4), on StackOverflow (PE9, PE10), in documentation (PE7, PE9, PS4, PJ4), in online tutorials (PE3, PE7, PE9, PS3, PS4, PJ3, PJ8), in colleagues’ computational notebooks and source code (PE1, PE4, PE5, PS3, PJ3), or in their own notebooks and source code (PE1, PE9, PE10, PS3, PJ3, PJ4, PJ8). PE9 demonstrated nearly all of these behaviors, visiting six online tutorials, six StackOverflow pages, and two pages of the geopandas documentation to determine the first two operators to use in their pipeline (Figure 7).
5.3.2 Understanding Geospatial Operator Semantics.
Even when participants could identify candidate operators, they struggled to understand operator behaviors (PE3, PE7, PE8, PE9, PJ4, PJ8, PO2). As PE7 and PE9 noted in Section 5.3.1, this is partly due to the ambiguous naming of geospatial operators. Moreover, operator semantics differ subtly across GISs and geospatial analysis libraries, meaning “you do need some sort of specificity for doing the actual [analysis]” (PS2) in a particular environment. For example, ArcGIS’s Merge combines vector layers of any geometric type (point, line, or polygon) into a single layer [32], while its QGIS equivalent, Merge Vector Layers, can only merge vector layers of the same geometric type [3]. The geopandas merge is inherited from pandas; it ignores geometry altogether and performs a join on shared attributes [39]. As this example illustrates, knowledge of geospatial operator behavior in one tool rarely transfers to another.
Participants used diverse strategies to understand operator semantics. We highlight two common techniques.
Output-Centered Hypothesis Testing.
To test hypotheses about candidate operators’ behaviors, participants ran operators, then manually inspected generated outputs (PE1, PE3, PE7, PE9, PS1, PJ4, PO2). For example, PE3 attempted to combine two single-band rasters into one multi-band raster in QGIS, hypothesizing that the Merge operator might be appropriate for the task. After running Merge, they inspected the output raster and found that it was still composed of a single band. They next examined pixel values of this raster, noticing they were identical to pixel values of just one of the input rasters. From this inspection, they inferred that Merge stitches together input rasters of differing geographic extents rather than combining raster bands.
When testing candidate operators, participants focused on small subsets of pixels or features and compared their values in input layers to their corresponding values in outputs. Sometimes, selection of pixels or features was random (PE7, PS2, PO2). More often, they selected parts of the output where unexpected behavior would produce obviously incorrect values (PE1, PE3, PE8, PS1, PJ2, PJ3). For example, PE1 computed a Normalized Difference Water Index raster and checked the pixel values of a lake in the generated output; if the algorithm succeeded, these values would be close to the maximum value of one.
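PE1's spot check can be sketched as follows. The example assumes the common NDWI formulation, (green - NIR) / (green + NIR), which the study does not specify. The reflectance values, the `ndwi` helper, the lake pixel positions, and the 0.5 threshold are all hypothetical.

```python
def ndwi(green, nir):
    """NDWI for one pixel, assuming the (green - NIR) / (green + NIR) form."""
    total = green + nir
    return 0.0 if total == 0 else (green - nir) / total

# Hypothetical 2x3 reflectance grids; the right-hand column is a lake.
green = [[0.10, 0.12, 0.30],
         [0.11, 0.10, 0.32]]
nir = [[0.40, 0.38, 0.02],
       [0.42, 0.41, 0.02]]
lake_pixels = [(0, 2), (1, 2)]  # (row, col) positions known to be water

index = [[ndwi(g, n) for g, n in zip(g_row, n_row)]
         for g_row, n_row in zip(green, nir)]

# Spot check: water pixels should score high, well above surrounding land.
lake_values = [index[r][c] for r, c in lake_pixels]
print([round(v, 3) for v in lake_values])
assert all(v > 0.5 for v in lake_values), "NDWI looks wrong over the lake"
```

Checking a location whose correct value is known in advance turns a visual inspection into a test that fails loudly when, for example, the two bands are swapped.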
Observing Feature Count Changes.
Many geospatial operations, such as those that filter, intersect, or aggregate geographies, produce output layers containing a different number of features than their inputs. Participants used changes in feature counts to assess operator behavior, with the magnitude and direction of change serving as proxies for correctness (PE9, PE10, PS1, PS4, PJ2, PJ3, PJ4, PJ8). For example, PS4 checked the feature count of the dataframe produced by an st_join operation: “This should only be 372 observations because each [Census] tract is unique, but instead test2 [the output dataframe] is 2790, which is implying that there is something wrong.”
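A feature-count check like PS4's can be written as an explicit comparison on the output of a join. The sketch below is in Python rather than PS4's environment, and the `check_unique_count` helper, the `tract_id` field, and the sample records are hypothetical.

```python
from collections import Counter

def check_unique_count(rows, key, expected):
    """Compare the output row count with the expected number of unique units.

    Returns (ok, duplicated), where `duplicated` maps each key value that
    appears more than once to its number of occurrences.
    """
    counts = Counter(row[key] for row in rows)
    duplicated = {k: n for k, n in counts.items() if n > 1}
    return len(rows) == expected, duplicated

# Hypothetical join output: tract "T2" matched three features, so it repeats.
joined = [
    {"tract_id": "T1", "value": 4},
    {"tract_id": "T2", "value": 7},
    {"tract_id": "T2", "value": 9},
    {"tract_id": "T2", "value": 1},
    {"tract_id": "T3", "value": 5},
]
ok, duplicated = check_unique_count(joined, key="tract_id", expected=3)
print(ok, duplicated)
```

Reporting which keys repeat, not just that the count is off, points toward the likely cause: a one-to-many match in the join.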
5.3.3 Visibility of Geometry in Programming Environments.
Participants relied on examining the geometry of their data to understand operator behavior and validate operator output. GISs center the geometry of geospatial data via a map view, a canvas that allows users to pan, zoom, and inspect features and pixels directly. Conversely, participants using programming environments had to write additional code to perform these interactions (PE8, PE10, PJ7, PJ8, PO2). For example, PO2 wrote code to pan and zoom static matplotlib figures to particular parcels in their OpenStreetMap dataset. This involved a repetitive process of guessing the coordinates of bounding boxes containing the parcels, updating a Python dictionary encoding these coordinates, re-executing their code in IPython, and re-rendering the matplotlib figures until they achieved the desired view.
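The trial-and-error loop PO2 described amounts to finding a bounding box for the features of interest. A minimal sketch of computing such a box directly from feature coordinates follows. `padded_bounds` and the parcel coordinates are hypothetical, and the resulting limits would still have to be passed to the plotting library (for example, matplotlib's `Axes.set_xlim` and `Axes.set_ylim`).

```python
def padded_bounds(coords, pad_fraction=0.1):
    """Return (xmin, ymin, xmax, ymax) around `coords`, expanded by a margin.

    `coords` is an iterable of (x, y) vertices from the features to show.
    """
    xs, ys = zip(*coords)
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    pad_x = (xmax - xmin) * pad_fraction
    pad_y = (ymax - ymin) * pad_fraction
    return (xmin - pad_x, ymin - pad_y, xmax + pad_x, ymax + pad_y)

# Hypothetical parcel outline in projected coordinates (metres).
parcel = [(100.0, 200.0), (140.0, 200.0), (140.0, 220.0), (100.0, 220.0)]
xmin, ymin, xmax, ymax = padded_bounds(parcel)
print(xmin, ymin, xmax, ymax)
```

Even with such a helper, each change of view requires re-running code and re-rendering a static figure, whereas a GIS map view supports the same interaction with a single mouse gesture.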
Programming environments made rendering and interacting with geospatial data challenging enough that, even when participants used them for analysis, they often moved their data to GISs to “see” and “layer” (PJ6) it interactively (PE9, PJ2, PJ4, PJ6). PJ2 explained that the immediate visibility of their data’s geometry in GISs outweighed the performance benefits of code: “I’m working in QGIS. I know that it’s slower than it would be to do it in PostGIS or maybe even geopandas, and so I’ve considered switching to that. But I’m still... new enough that I need to kind of ‘see’ to make sure my projections are right and stuff like that.” PJ4 performed their analysis using geopandas in Jupyter but explained they would visualize the results in QGIS: “Now I could try to visualize it here with matplotlib and geopandas, but I know those things are... not interactive and so I’m like, ‘I gotta take this to QGIS.’” These findings extend prior work highlighting visual exploration and cross-layer correlation as integral exploratory analysis techniques for geospatial data users [29, 64]. Participants wanted visibility into their data’s geometry not only to identify spatial patterns but also to validate the correctness of their analyses visually.