4.1.1 Ink Representation.
As shown in Figure
2, Inkeraction represents a handwritten document as a hierarchical graph with three levels of containment (shown with elbow connectors) and various marking relationships (shown with curved connectors). This representation captures the inherent structure of nested handwritten notes, as explored in prior work [
31,
34]. Unlike previous recognition work that treated nested structures as monolithic entities, hindering interactions with intra-structure components (e.g., Ye et al. [
78] recognized a list as a standalone object), Inkeraction preserves these structures, enabling fine-grained interaction with each lower-level object. For example, the list in Figure
2a is represented as a container comprising individual list items, each with a bullet and word, allowing us to interact with any of these objects. Theoretically, the hierarchical containment relationships can have an unlimited number of levels, but we found that three levels were sufficient to represent the objects within our scope.
In addition to the containment relationships, we add marking relationships like connecting and pointing to for inter-structure interactions.
To describe the recognized objects, the hierarchical containment and varied marking relationships, Inkeraction employs a dictionary of labels. The labeling system is similar to prior work [
31,
34], but refined based on feedback from our expert participants, developers, and designers. See Figure
3 for common labels in the dictionary.
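As a concrete illustration of the representation described above, the containment hierarchy and marking relationships could be modeled along these lines. This is a minimal Python sketch; the class and field names are our own, not Inkeraction's.

```python
from dataclasses import dataclass, field

@dataclass
class InkObject:
    """One node in the hierarchical ink graph (names are illustrative)."""
    label: str                                     # e.g. "word", "textline", "list"
    level: int                                     # 1 = lowest level, 3 = highest
    children: list = field(default_factory=list)   # containment (elbow connectors)
    markings: list = field(default_factory=list)   # (relation, target) pairs

def add_marking(src, relation, dst):
    """Record an inter-structure marking relationship, e.g. "points_to"."""
    src.markings.append((relation, dst))

# A list container holding a list item, which contains a bullet and a word,
# mirroring the example in Figure 2a.
bullet = InkObject("bullet", level=1)
word = InkObject("word", level=1)
item = InkObject("list_item", level=2, children=[bullet, word])
todo = InkObject("list", level=3, children=[item])

# A marking relationship across structures, e.g. an arrow drawn to another word.
arrow_target = InkObject("word", level=1)
add_marking(word, "points_to", arrow_target)
```

Because each lower-level object remains a distinct node, interactions can target the bullet, the word, the item, or the whole list.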
4.1.2 Recognition System.
To obtain the hierarchical relationship graphs, our recognition system employs a segmentation model and a classification model working in tandem. The segmentation model iteratively groups strokes representing meaningful objects at different levels within a handwritten document, ultimately creating a stacked hierarchy of Graph Neural Networks (GNNs); see Figure
4. At each level, a GNN receives a set of input objects and clusters them, generating output objects for the next level. The classification model then analyzes the embeddings generated by the GNNs at different levels, assigning labels to these objects, such as “word” for lower-level objects and “list” for higher-level ones. Finally, Inkeraction performs text recognition and character segmentation specifically for text objects.
Segmentation. At the initial level of the GNN hierarchy, each node represents a single stroke or a non-stroke object (e.g., an image). Each stroke undergoes normalization, smoothing, and processing through a stroke embedding model to extract an initial fixed-size representation. The stroke embedding model uses a stack of Transformer layers [
72] followed by a fully connected layer. Non-stroke objects have their embeddings retrieved from a constant dictionary based on their type. At each level, initial edge embeddings are calculated based on heuristic features between nodes, such as relative distance and size, similar to the approach described in [
78].
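To make the heuristic edge features concrete, the following sketch computes relative distance and size between two bounding boxes. The specific features and normalization here are illustrative assumptions, not the paper's exact choices.

```python
import math

def edge_features(box_a, box_b):
    """Heuristic edge features between two objects, given (x, y, w, h)
    bounding boxes. Feature choice is illustrative, not the paper's."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Relative distance between box centers, normalized by the source size.
    cxa, cya = xa + wa / 2, ya + ha / 2
    cxb, cyb = xb + wb / 2, yb + hb / 2
    scale = max(wa, ha, 1e-6)
    dist = math.hypot(cxb - cxa, cyb - cya) / scale
    # Relative size ratios between the two boxes.
    return [dist, wb / max(wa, 1e-6), hb / max(ha, 1e-6)]
```

Features like these are cheap to recompute whenever nodes merge, which matters because edge embeddings are recalculated throughout segmentation.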
The node embeddings are then iteratively updated. Each iteration begins by identifying a fixed number of neighbors for each node using the k-nearest neighbors algorithm, based on the Euclidean distance between nodes. Subsequently, each node's embedding is updated through a multi-head attention mechanism over the edge embeddings and the embeddings of its neighbors, followed by a residual update. Let $h_i$ denote the node embedding for node $i$, and $e_{ij}$ denote the edge embedding for the edge connecting node $i$ to its neighbor node $j$. Each node embedding $h_i$ is updated with the output of a multi-head attention mechanism, where $h_i$ serves as the query and the pairs $(e_{ij}, h_j)$ act as the keys and values. The attention layer is followed by a reduction sum and a fully connected layer, which calculates the updated node embedding by combining the mean of the attention outputs with the original node embedding (see Figure 4b). We repeat this update a fixed number of times.
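The per-node update can be sketched as follows, simplified to a single attention head in plain NumPy. The weight shapes, the concatenation of edge and neighbor embeddings into keys/values, and the residual form are our assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_node(h_i, h_nbrs, e_nbrs, W_q, W_k, W_v, W_o):
    """Single-head sketch of the attention-based node update.
    h_i: (d,) node embedding; h_nbrs, e_nbrs: (k, d) neighbor node and
    edge embeddings. Weight shapes and the single head are assumptions."""
    q = W_q @ h_i                                     # query from the node itself
    kv_in = np.concatenate([e_nbrs, h_nbrs], axis=1)  # keys/values from (e_ij, h_j) pairs
    keys = kv_in @ W_k.T                              # (k, d)
    vals = kv_in @ W_v.T                              # (k, d)
    attn = softmax(keys @ q / np.sqrt(len(q)))        # attention over the k neighbors
    ctx = attn @ vals                                 # attended context, (d,)
    return h_i + W_o @ ctx                            # fully connected + residual
```

A multi-head version would run several such projections in parallel and average their outputs before the residual, matching the mean-then-combine step described above.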
After updating the node embeddings, merge logits are calculated for each neighbor of each node in the GNN. The merge logits between source node $i$ and target node $j$ are likewise computed with a multi-head attention mechanism, where the queries, keys, and values are $h_i$, $e_{ij}$, and $h_j$, respectively. The attention outputs are then reduced to a two-dimensional logit tensor representing the probability of merging or not merging. If the "merge" logit is larger than the "no merge" logit, we conclude that node $i$ wants to merge with node $j$.
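A minimal stand-in for this merge decision is sketched below. Purely for brevity, a single linear layer over the concatenated triple replaces the multi-head attention reduction; the two-logit output and the comparison rule follow the description above.

```python
import numpy as np

def wants_to_merge(h_i, e_ij, h_j, W):
    """Sketch of the merge decision for source node i and target node j.
    A single linear layer over the concatenated (h_i, e_ij, h_j) triple
    stands in for the paper's attention reduction, purely for brevity.
    Returns True if the "merge" logit beats the "no merge" logit."""
    logits = W @ np.concatenate([h_i, e_ij, h_j])   # [no_merge, merge]
    return bool(logits[1] > logits[0])
```

Running this for every (node, neighbor) pair yields a directed "wants to merge" relation, which the next step intersects to find mutual agreement.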
Cliques of $n$ nodes that mutually desire merging are fed to a merger model, which consists of a multi-head attention layer followed by a reduction sum layer. The attention layer computes an embedding for each node by attending over the other node embeddings and the corresponding edge embeddings. The merged node embedding is computed as an average over all the attention heads and outputs (see Figure
4c). After updating and merging node embeddings, the edge embeddings are recomputed. If any nodes have been merged, their associated edge embeddings are recalculated based on the heuristic features.
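The grouping and merging steps can be sketched as follows. Two simplifications to note: we approximate cliques with connected components over mutual-proposal edges (these coincide for small groups), and the attention-based merger is replaced with a plain mean over member embeddings.

```python
import numpy as np

def mutual_merge_groups(proposes):
    """Group nodes whose merge proposals are mutual. proposes[i][j] is
    True if node i wants to merge with node j. The paper merges cliques;
    connected components over mutual edges are an approximation."""
    n = len(proposes)
    mutual = [[proposes[i][j] and proposes[j][i] for j in range(n)] for i in range(n)]
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:                      # flood-fill one component
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(v for v in range(n) if mutual[u][v] and v not in seen)
        groups.append(sorted(comp))
    return groups

def merge_embeddings(h, group):
    """Merged node embedding, simplified to the mean of the members
    (the paper averages multi-head attention outputs instead)."""
    return np.mean([h[i] for i in group], axis=0)
```

After merging, edge embeddings touching the merged nodes would be recomputed from the heuristic features, as the text describes.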
At each level, we iteratively repeat the node embedding update and merge process until no more cliques of nodes desire merging or a maximum number of iterations is reached. We then repeat the process with another set of model weights to create a higher-level GNN from the current node embeddings. For example, if we previously merged strokes into words, we would now merge words into lines of text. This iterative segmentation process enables us to hierarchically cluster strokes and non-stroke inputs into three GNNs: L1, L2, and L3.
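The overall control flow of this hierarchical process might look like the outline below. The model API is a placeholder (a toy stand-in is included so the outline runs); the convergence test and iteration cap follow the description above.

```python
def segment_document(objects, level_models, max_iters=5):
    """Control-flow outline of the hierarchical segmentation loop.
    level_models holds one set of model weights per level (e.g.
    strokes->words, words->lines); the model API is a placeholder."""
    levels = []
    for model in level_models:                    # builds L1, L2, L3 in turn
        for _ in range(max_iters):
            objects = model.update_embeddings(objects)
            merged = model.merge_mutual_cliques(objects)
            if merged == objects:                 # no clique wanted to merge
                break
            objects = merged
        levels.append(objects)                    # the nodes of this GNN level
    return levels

class ToyModel:
    """Stand-in model for illustration: merges adjacent pairs."""
    def update_embeddings(self, objs):
        return objs
    def merge_mutual_cliques(self, objs):
        pairs = [objs[i] + objs[i + 1] for i in range(0, len(objs) - 1, 2)]
        return pairs + ([objs[-1]] if len(objs) % 2 else [])
```

Swapping in a different weight set per level, rather than one shared model, lets each level specialize in its own grouping task (strokes into words versus words into lines).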
Classification. The classification algorithms process the segmentation results to categorize objects and relationships. Each node in the GNNs is assigned a label (e.g., “word” at L1, “textline” at L2, and “list” at L3). The classification model is a single fully-connected layer that maps a node embedding to the logits for the given number of classes.
The relationships between objects are computed in the same way as the merge logits, except that the relationship model produces logits for each type of relationship (e.g., “no relationship”, “underlines”, and “points_to”) between two objects.
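Since the classifier is a single fully connected layer, it can be sketched directly; the label set and parameters below are illustrative.

```python
import numpy as np

LABELS = ["word", "textline", "list"]   # illustrative subset of the label dictionary

def classify(h, W, b):
    """A single fully connected layer maps a node embedding h to class
    logits; argmax over the logits picks the label. W and b stand for
    the learned weights and bias."""
    logits = W @ h + b
    return LABELS[int(np.argmax(logits))]
```

The relationship model has the same shape, except its logits range over relationship types for a pair of objects rather than labels for a single node.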
Text Handling. For text objects, Inkeraction runs text recognition using an LSTM-based model [
9]. The recognized text is then segmented into characters using a Transformer-based model [
33], which allows us to calculate geometric properties of the text, such as character sizes and word baselines.
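As one example of the geometry that character segmentation enables, a word baseline could be estimated from per-character boxes as below. The median-of-bottom-edges heuristic is our illustrative stand-in, not the paper's method.

```python
def word_baseline(char_boxes):
    """Estimate a word baseline from per-character (x, y, w, h) boxes,
    with y growing downward. Uses the median of the character bottom
    edges, which is robust to descenders on a few characters."""
    bottoms = sorted(y + h for (x, y, w, h) in char_boxes)
    n = len(bottoms)
    mid = n // 2
    return bottoms[mid] if n % 2 else (bottoms[mid - 1] + bottoms[mid]) / 2
```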