4.1.1 Ink Representation.
As shown in Figure
2, Inkeraction represents a handwritten document as a hierarchical graph with three levels of containment (shown with elbow connectors) and various marking relationships (shown with curved connectors). This representation captures the inherent structure of nested handwritten notes, as explored in prior work [
31,
34]. Unlike previous recognition work that treated nested structures as monolithic entities, hindering interactions with intra-structure components (e.g., Ye et al. [
78] recognized a list as a standalone object), Inkeraction preserves these structures, enabling fine-grained interaction with each lower-level object. For example, the list in Figure
2a is represented as a container comprising individual list items, each with a bullet and word, allowing us to interact with any of these objects. Theoretically, the hierarchical containment relationships can have an unlimited number of levels, but we found that three levels were sufficient to represent the objects within our scope.
In addition to the containment relationships, we add marking relationships like connecting and pointing to for inter-structure interactions.
To describe the recognized objects, the hierarchical containment and varied marking relationships, Inkeraction employs a dictionary of labels. The labeling system is similar to prior work [
31,
34], but refined based on feedback from our expert participants, developers, and designers. See Figure
3 for common labels in the dictionary.
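As a concrete illustration of the representation described above, the containment hierarchy and marking relationships could be modeled along these lines. This is a minimal Python sketch; the class and field names are our own, not Inkeraction's.

```python
from dataclasses import dataclass, field

@dataclass
class InkObject:
    """One node in the hierarchical ink graph (names are illustrative)."""
    label: str                                     # e.g. "word", "textline", "list"
    level: int                                     # 1 = lowest level, 3 = highest
    children: list = field(default_factory=list)   # containment (elbow connectors)
    markings: list = field(default_factory=list)   # (relation, target) pairs

def add_marking(src, relation, dst):
    """Record an inter-structure marking relationship, e.g. "points_to"."""
    src.markings.append((relation, dst))

# A list container holding a list item, which contains a bullet and a word,
# mirroring the example in Figure 2a.
bullet = InkObject("bullet", level=1)
word = InkObject("word", level=1)
item = InkObject("list_item", level=2, children=[bullet, word])
todo = InkObject("list", level=3, children=[item])

# A marking relationship across structures, e.g. an arrow drawn to another word.
arrow_target = InkObject("word", level=1)
add_marking(word, "points_to", arrow_target)
```

Because each lower-level object remains a distinct node, interactions can target the bullet, the word, the item, or the whole list.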
4.1.2 Recognition System.
To obtain the hierarchical relationship graphs, our recognition system employs a segmentation model and a classification model working in tandem. The segmentation model iteratively groups strokes representing meaningful objects at different levels within a handwritten document, ultimately creating a stacked hierarchy of Graph Neural Networks (GNNs); see Figure
4. At each level, a GNN receives a set of input objects and clusters them, generating output objects for the next level. The classification model then analyzes the embeddings generated by the GNNs at different levels, assigning labels to these objects, such as “word” for lower-level objects and “list” for higher-level ones. Finally, Inkeraction performs text recognition and character segmentation specifically for text objects.
Segmentation. At the initial level of the GNN hierarchy, each node represents a single stroke or a non-stroke object (e.g., an image). Each stroke undergoes normalization, smoothing, and processing through a stroke embedding model to extract an initial fixed-size representation. The stroke embedding model uses a stack of Transformer layers [
72] followed by a fully connected layer. Non-stroke objects have their embeddings retrieved from a constant dictionary based on their type. At each level, initial edge embeddings are calculated based on heuristic features between nodes, such as relative distance and size, similar to the approach described in [
78].
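To make the heuristic edge features concrete, the following sketch computes relative distance and size between two bounding boxes. The specific features and normalization here are illustrative assumptions, not the paper's exact choices.

```python
import math

def edge_features(box_a, box_b):
    """Heuristic edge features between two objects, given (x, y, w, h)
    bounding boxes. Feature choice is illustrative, not the paper's."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Relative distance between box centers, normalized by the source size.
    cxa, cya = xa + wa / 2, ya + ha / 2
    cxb, cyb = xb + wb / 2, yb + hb / 2
    scale = max(wa, ha, 1e-6)
    dist = math.hypot(cxb - cxa, cyb - cya) / scale
    # Relative size ratios between the two boxes.
    return [dist, wb / max(wa, 1e-6), hb / max(ha, 1e-6)]
```

Features like these are cheap to recompute whenever nodes merge, which matters because edge embeddings are recalculated throughout segmentation.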
The node embeddings are then iteratively updated. Each iteration begins by identifying a fixed number of neighbors for each node using the k-nearest neighbors algorithm, based on the Euclidean distance between nodes. Subsequently, each node's embedding is updated through a multi-head attention mechanism over the edge embeddings and the embeddings of its neighbors, followed by a residual update. Let $h_i$ denote the node embedding for node $i$, and $e_{ij}$ denote the edge embedding for the edge connecting node $i$ to its neighbor node $j$. Each node embedding $h_i$ is updated with the output of a multi-head attention mechanism, where $h_i$ serves as the query and the pairs $(e_{ij}, h_j)$ act as the keys and values. The attention layer is followed by a reduction sum and a fully connected layer, which calculates the updated node embedding by combining the mean of the attention outputs with the original node embedding (see Figure 4b). We repeat this update a fixed number of times.
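The per-node update can be sketched as follows, simplified to a single attention head in plain NumPy. The weight shapes, the concatenation of edge and neighbor embeddings into keys/values, and the residual form are our assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_node(h_i, h_nbrs, e_nbrs, W_q, W_k, W_v, W_o):
    """Single-head sketch of the attention-based node update.
    h_i: (d,) node embedding; h_nbrs, e_nbrs: (k, d) neighbor node and
    edge embeddings. Weight shapes and the single head are assumptions."""
    q = W_q @ h_i                                     # query from the node itself
    kv_in = np.concatenate([e_nbrs, h_nbrs], axis=1)  # keys/values from (e_ij, h_j) pairs
    keys = kv_in @ W_k.T                              # (k, d)
    vals = kv_in @ W_v.T                              # (k, d)
    attn = softmax(keys @ q / np.sqrt(len(q)))        # attention over the k neighbors
    ctx = attn @ vals                                 # attended context, (d,)
    return h_i + W_o @ ctx                            # fully connected + residual
```

A multi-head version would run several such projections in parallel and average their outputs before the residual, matching the mean-then-combine step described above.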
After updating the node embeddings, merge logits are calculated for each neighbor of each node in the GNN. The merge logits between source node $i$ and target node $j$ are likewise computed with a multi-head attention mechanism, where the queries, keys, and values are $h_i$, $e_{ij}$, and $h_j$, respectively. The attention outputs are then reduced to a two-dimensional logit tensor representing the probability of merging or not merging. If the "merge" logit is larger than the "no merge" logit, we conclude that node $i$ wants to merge with node $j$.
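A minimal stand-in for this merge decision is sketched below. Purely for brevity, a single linear layer over the concatenated triple replaces the multi-head attention reduction; the two-logit output and the comparison rule follow the description above.

```python
import numpy as np

def wants_to_merge(h_i, e_ij, h_j, W):
    """Sketch of the merge decision for source node i and target node j.
    A single linear layer over the concatenated (h_i, e_ij, h_j) triple
    stands in for the paper's attention reduction, purely for brevity.
    Returns True if the "merge" logit beats the "no merge" logit."""
    logits = W @ np.concatenate([h_i, e_ij, h_j])   # [no_merge, merge]
    return bool(logits[1] > logits[0])
```

Running this for every (node, neighbor) pair yields a directed "wants to merge" relation, which the next step intersects to find mutual agreement.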
Cliques of $n$ nodes that mutually desire merging are fed to a merger model, which consists of a multi-head attention layer followed by a reduction sum layer. The attention layer computes an embedding for each node by attending over the other node embeddings and the corresponding edge embeddings. The merged node embedding is computed as an average over all the attention heads and outputs (see Figure
4c). After updating and merging node embeddings, the edge embeddings are recomputed. If any nodes have been merged, their associated edge embeddings are recalculated based on the heuristic features.
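The grouping and merging steps can be sketched as follows. Two simplifications to note: we approximate cliques with connected components over mutual-proposal edges (these coincide for small groups), and the attention-based merger is replaced with a plain mean over member embeddings.

```python
import numpy as np

def mutual_merge_groups(proposes):
    """Group nodes whose merge proposals are mutual. proposes[i][j] is
    True if node i wants to merge with node j. The paper merges cliques;
    connected components over mutual edges are an approximation."""
    n = len(proposes)
    mutual = [[proposes[i][j] and proposes[j][i] for j in range(n)] for i in range(n)]
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:                      # flood-fill one component
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(v for v in range(n) if mutual[u][v] and v not in seen)
        groups.append(sorted(comp))
    return groups

def merge_embeddings(h, group):
    """Merged node embedding, simplified to the mean of the members
    (the paper averages multi-head attention outputs instead)."""
    return np.mean([h[i] for i in group], axis=0)
```

After merging, edge embeddings touching the merged nodes would be recomputed from the heuristic features, as the text describes.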
At each level, we iteratively repeat the node embedding update and merge process until no more cliques of nodes desire merging or a maximum number of iterations is reached. We then repeat the process with another set of model weights to create a higher-level GNN from the current node embeddings. For example, if we previously merged strokes into words, we would now merge words into lines of text. This iterative segmentation process enables us to hierarchically cluster strokes and non-stroke inputs into three GNNs: L1, L2, and L3.
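The overall control flow of this hierarchical process might look like the outline below. The model API is a placeholder (a toy stand-in is included so the outline runs); the convergence test and iteration cap follow the description above.

```python
def segment_document(objects, level_models, max_iters=5):
    """Control-flow outline of the hierarchical segmentation loop.
    level_models holds one set of model weights per level (e.g.
    strokes->words, words->lines); the model API is a placeholder."""
    levels = []
    for model in level_models:                    # builds L1, L2, L3 in turn
        for _ in range(max_iters):
            objects = model.update_embeddings(objects)
            merged = model.merge_mutual_cliques(objects)
            if merged == objects:                 # no clique wanted to merge
                break
            objects = merged
        levels.append(objects)                    # the nodes of this GNN level
    return levels

class ToyModel:
    """Stand-in model for illustration: merges adjacent pairs."""
    def update_embeddings(self, objs):
        return objs
    def merge_mutual_cliques(self, objs):
        pairs = [objs[i] + objs[i + 1] for i in range(0, len(objs) - 1, 2)]
        return pairs + ([objs[-1]] if len(objs) % 2 else [])
```

Swapping in a different weight set per level, rather than one shared model, lets each level specialize in its own grouping task (strokes into words versus words into lines).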
Classification. The classification algorithms process the segmentation results to categorize objects and relationships. Each node in the GNNs is assigned a label (e.g., “word” at L1, “textline” at L2, and “list” at L3). The classification model is a single fully-connected layer that maps a node embedding to the logits for the given number of classes.
The relationships between objects are computed in the same way as the merge logits, except that the relationship model produces logits for each type of relationship (e.g., “no relationship”, “underlines”, and “points_to”) between two objects.
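Since the classifier is a single fully connected layer, it can be sketched directly; the label set and parameters below are illustrative.

```python
import numpy as np

LABELS = ["word", "textline", "list"]   # illustrative subset of the label dictionary

def classify(h, W, b):
    """A single fully connected layer maps a node embedding h to class
    logits; argmax over the logits picks the label. W and b stand for
    the learned weights and bias."""
    logits = W @ h + b
    return LABELS[int(np.argmax(logits))]
```

The relationship model has the same shape, except its logits range over relationship types for a pair of objects rather than labels for a single node.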
Text Handling. For text objects, Inkeraction runs text recognition using an LSTM-based model [
9]. The recognized text is then segmented into characters using a Transformer-based model [
33], which allows us to calculate geometric properties of the text, such as character sizes and word baselines.
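As one example of the geometry that character segmentation enables, a word baseline could be estimated from per-character boxes as below. The median-of-bottom-edges heuristic is our illustrative stand-in, not the paper's method.

```python
def word_baseline(char_boxes):
    """Estimate a word baseline from per-character (x, y, w, h) boxes,
    with y growing downward. Uses the median of the character bottom
    edges, which is robust to descenders on a few characters."""
    bottoms = sorted(y + h for (x, y, w, h) in char_boxes)
    n = len(bottoms)
    mid = n // 2
    return bottoms[mid] if n % 2 else (bottoms[mid - 1] + bottoms[mid]) / 2
```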