1 Introduction
Polygonal meshes play an essential role in computer graphics, favored for their simplicity, flexibility, and efficiency. They can represent surfaces of arbitrary topology with non-uniform polygons, and support a wide range of downstream processing and simulation. Additionally, meshes are ideal for rasterization and texture mapping, making them efficient for rendering. However, the benefits of meshes rely heavily on their quality. For example, meshes with non-manifold connectivity or too many elements may break operations that leverage local structure, or make processing prohibitively expensive. Consequently, developing automatic algorithms and tools for generating high-quality meshes is an ongoing research focus.
It is no surprise that recent advancements in deep learning have led to growing interest in learning-based mesh creation. Generating meshes as output, however, is a notoriously challenging task for machine learning algorithms, as meshes have a complex combination of continuous and discrete structure. Not only do mesh vertices and edges form a graph, but mesh faces add additional interconnected structure, and furthermore those faces ought to be arranged locally for manifold connectivity. Existing approaches range from implicit function isosurfacing [Gao et al. 2022; Mescheder et al. 2019; Shen et al. 2021; 2023], which offers easy optimization and a guarantee of validity at the expense of restricting to a limited family of meshes, to directly generating faces as an array of vertex triplets [Alliegro et al. 2023; Nash et al. 2020; Siddiqui et al. 2023], a discrete-first perspective which cannot be certain to respect the constraints of local structure. This work seeks a solution that offers the best of all worlds: the ease and utility that comes from working in a continuous parameterization, a guarantee to produce meshes with manifold structure by construction, and the generality to represent the full range of possible meshes.
We present SpaceMesh, a representation for meshes built on continuous embeddings well-suited for learning and optimization, which guarantees manifold output and supports complex polygonal connectivity. Our approach derives from the halfedge data structure [Weiler 1986], which inherently represents manifold, oriented polygonal meshes; the heart of our contribution is a continuous parameterization for halfedge mesh connectivity.
The main idea is to represent mesh connectivity by first constructing a set of edges and halfedges, and then constructing the so-called next relationship among those halfedges to implicitly define the faces of the mesh. We introduce a parameterization of edge adjacency and next relationships with low-dimensional, per-vertex embeddings. These embeddings, by construction, always produce a manifold halfedge mesh without additional constraints. Moreover, the per-vertex embedding is straightforward to predict as a neural network output and demonstrates fast convergence during optimization. The continuous property of our representation facilitates new architectures for mesh generation, and enables applications like mesh repair with learning.
We validate our representation against alternatives for representing graph adjacency and meshes, demonstrating significantly faster convergence, which is fundamentally important for learning tasks. Combined with a generative model for vertices, we showcase our representation by learning different surface discretizations for meshing. Additionally, our representation enables mesh repair via deep learning, simultaneously predicting both vertices and topology.
2 Related Work
Initial deep learning-based mesh generation techniques focused on vertex prediction while maintaining fixed connectivity, which are challenging to adapt for complex 3D objects [Chen et al. 2019; Groueix et al. 2018; Hanocka et al. 2020; Litany et al. 2018; Liu et al. 2021; Ranjan et al. 2018; Tanwar et al. 2020; Wang et al. 2018; Zhang et al. 2020; 2021]. Although local topology modifications are possible through subdivision [Liu et al. 2020a; Wang et al. 2018] or remeshing [Palfinger 2022], these methods still struggle to represent general, complex 3D objects. Recent methods utilize intermediary representations that are converted into meshes using techniques like Poisson reconstruction on point clouds [Kazhdan et al. 2006; Peng et al. 2021] or isosurfacing on implicit fields [Chen et al. 2022; Gao et al. 2022; Lin et al. 2023; Shen et al. 2021; 2023]. However, these conversion processes lack precise control over mesh connectivity.
2.1 Generating Meshes
Much recent work has specifically studied approaches for generating surface meshes in learning-based pipelines.
Volumetric 3D Reconstruction. See Point2Surf [Erler et al. 2020], POCO [Boulch and Marlet 2022], NKSR [Huang et al. 2023], BSPNet [Chen et al. 2020], etc. These methods focus on reconstructing the geometric shape, rather than the mesh structure; output connectivity is always a marching-cubes mesh (or a union-of-planes in BSPNet). Our approach instead focuses on fitting particular discrete mesh connectivity structures from data. Figures 8 and 9 include a few representative methods from this family, although they generally target significantly different goals. A parallel class of methods leverages Voronoi/Delaunay-based formulations [Maruani et al. 2023; 2024], but again these focus on fitting a surface's geometric shape, rather than the particular mesh connectivity.
Direct Mesh Learning. See IER [Liu et al. 2020b], PointTriNet [Sharp and Ovsjanikov 2020], Delaunay Surface Elements (DSE) [Rakotosaona et al. 2021], and DMesh [Son et al. 2024]. Like ours, these approaches aim to directly learn structured mesh connectivity. However, our approach offers a guarantee of manifoldness, and can encode general polygonal meshes. Additionally, we demonstrate the ability to encode concise artist/CAD-like tessellation via coupled learning of vertex positions and connectivity, rather than generating faces among a rough uniformly-sampled point set. Conversely, some of these methods scale to high-resolution outputs, compared to our small-to-medium meshes. Figure 2 includes comparisons to DMesh [Son et al. 2024] as a representative method from this family; see also additional results from DSE in Figure 14 in the same setting.
Sequence Modeling. See PolyGen [Nash et al. 2020], PolyDiff [Alliegro et al. 2023], MeshGPT [Siddiqui et al. 2023], and the concurrent MeshAnything [Chen et al. 2024]. These approaches use large-scale architectures to emit a mesh one face or vertex at a time. Unlike our method, they generally do not offer any guarantees of connectivity or local structure, and all but PolyGen produce triangle soup, connecting faces together only by generating vertices at coarsely-discretized categorical coordinates. However, by building on proven paradigms from language modeling, these models have been successfully trained at very large scale. Additionally, many of these approaches support only unconditional generation, and some are not publicly available. We include a gallery of qualitative comparisons in Figure 15.
2.2 Graph Learning
Our approach draws inspiration from graph learning representations, which have shown success for graphs including gene expression [Marbach et al. 2012], molecules [Kwon et al. 2020], stochastic processes [Backhoff-Veraguas et al. 2020], and social networks [Gehrke et al. 2003]. Building on the seminal work of Gromov [1987], Nickel and Kiela [2017] showed that hyperbolic embeddings have fundamental properties for representing graphs which Euclidean embeddings lack. Hyperbolic embeddings are closely connected to the geometry of spacetime, a relationship which has been well-studied in physics [Bombelli et al. 1987; Kronheimer and Penrose 1967; Meyer 1993]. In this paper, we leverage spacetime embeddings [Law and Stam 2020; Law and Lucas 2023] to put this perspective to work for generating meshes.
3 Representation
We propose a continuous representation for the space of manifold polygonal meshes, which requires no constraints and is suitable for optimization and learning.
3.1 Background
Manifold Surface Meshes. A surface mesh \(\mathcal {M}= (\mathcal {V},\mathcal {E},\mathcal {F})\) consists of vertices \(\mathcal {V}\), edges \(\mathcal {E}\), and faces \(\mathcal {F}\), where each vertex \(v\in \mathcal {V}\) has a position \(p_v\in \mathbb {R}^3\). In a general polygonal mesh, each face is a cyclic ordering of 3 or more vertices. Each edge is an unordered pair of vertices which appear consecutively in one or more faces.
We are especially concerned with generating meshes which are not just a soup of faces, but which have coherent and consistent neighborhood connectivity. As such, we consider manifold, oriented meshes. Manifold connectivity is a topological property which does not depend on the vertex positions: edge-manifoldness means each edge has exactly two incident faces, while vertex-manifoldness means the faces incident on the vertex form a single edge-connected component homeomorphic to a disk. In an oriented mesh, all neighboring faces have a consistent outward orientation as defined by a counter-clockwise ordering of their vertices.
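To make the edge-manifoldness and orientation conditions concrete, here is a minimal Python sketch of our own (an illustration, not code from the paper) that checks them for a closed mesh given as vertex-index face lists; vertex-manifoldness would additionally require a per-vertex connected-component test, omitted here.

```python
from collections import Counter

def is_edge_manifold_oriented(faces: list[list[int]]) -> bool:
    """Illustrative check, not the paper's code. For a closed oriented,
    edge-manifold mesh: every directed face-side (a, b) occurs exactly once,
    and its reversal (b, a) occurs exactly once in a neighboring face."""
    sides = Counter()
    for f in faces:
        for a, b in zip(f, f[1:] + [f[0]]):  # consecutive pairs, wrapping
            sides[(a, b)] += 1
    return all(c == 1 and sides.get((b, a), 0) == 1
               for (a, b), c in sides.items())
```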
Halfedge Meshes. There are many possible data structures for mesh connectivity; we will leverage halfedge meshes, which by construction encode manifold, oriented meshes with possibly polygonal faces, all using only a pair of references per element. As the name suggests, halfedge meshes are defined in terms of directed face-sides, called halfedges (see inset). Each halfedge stores two references: a \(\texttt{twin}\) halfedge, the oppositely-oriented halfedge along the same edge in a neighboring face, and a \(\texttt{next}\) halfedge, the subsequent halfedge within the same face.
The \(\texttt{twin}\) and \(\texttt{next}\) operators can be interpreted as a pair of permutations over the set of halfedges; this group-theoretic perspective is studied in combinatorics as a rotation system. A pair of permutations can be interpreted as a halfedge mesh as long as (a) neither operator maps any halfedge to itself, and (b) the \(\texttt{twin}\) operator is an involution, i.e. \(\texttt{twin}(\texttt{twin}(h)) = h\). The faces of the mesh are the orbits traversed by repeatedly following the \(\texttt{next}\) operator (see inset); we further require that these orbits all have a degree of at least three, to disallow two-sided faces. Our representation will construct a valid set of \(\texttt{twin}\) and \(\texttt{next}\) operators from a continuous embedding to define mesh connectivity.
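As an illustration of this encoding (hypothetical helper names, not the paper's implementation), the sketch below stores \(\texttt{next}\) as an integer array over halfedges, makes \(\texttt{twin}\) an involution by pairing consecutive indices, and recovers faces as orbits of \(\texttt{next}\).

```python
def twin(h: int) -> int:
    # Pair halfedges 2e and 2e+1 of edge e; an involution by construction.
    return h ^ 1

def extract_faces(next_of: list[int]) -> list[list[int]]:
    """Each orbit of `next` is one face; a valid mesh has orbits of length >= 3."""
    faces, visited = [], [False] * len(next_of)
    for h0 in range(len(next_of)):
        if visited[h0]:
            continue
        face, h = [], h0
        while not visited[h]:
            visited[h] = True
            face.append(h)
            h = next_of[h]   # follow `next` until the orbit closes
        faces.append(face)
    return faces
```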
3.2 Representing Edges
To begin, consider modeling a mesh simply as a graph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\); later we will extend this model to capture manifold mesh structure via halfedge connectivity (Section 3.3). The vertex set \(\mathcal {V}\) can be viewed as a particular kind of point cloud, and point cloud generation is a well-studied problem [Nichol et al. 2022; Zeng et al. 2022]. Likewise, continuous representations for generating undirected graph edges are a classic topic in graph representation learning [Law and Stam 2020; Nickel and Kiela 2017]. A basic approach is to associate an adjacency embedding \(x_v\in \mathbb {R}^k\) with each vertex, then define an edge between two vertices \(i, j\) if they are sufficiently close w.r.t. some distance function \(\mathsf {d}\):
\[ e_{ij} \in \mathcal {E} \iff \mathsf {d}(x_i, x_j) < \tau, \]
for some learned threshold \(\tau \in \mathbb {R}\). Representing the vertices and edges of a mesh then amounts to two vectors for each vertex \(v\): a 3D position \(p_v\in \mathbb {R}^3\) and an adjacency embedding \(x_v\in \mathbb {R}^k\).
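For concreteness, a minimal sketch of this edge rule (illustrative function names; any pairwise distance can be plugged in) might look like:

```python
import torch

def recover_edges(d: torch.Tensor, tau: float) -> torch.Tensor:
    """d: (V, V) pairwise distances under some d(., .); returns bool adjacency."""
    adj = d < tau                 # threshold against the learned tau
    adj.fill_diagonal_(False)     # no self-loops
    return adj

def euclidean_pairwise(x: torch.Tensor) -> torch.Tensor:
    # Baseline Euclidean distances d^eu, reported below to converge poorly.
    return torch.cdist(x, x)      # (V, V) distances between embeddings
```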
Spacetime Distance. We find that taking the adjacency features \(x\) as Euclidean vectors under pairwise Euclidean distance \(\mathsf {d}^\textrm {eu}(x_i, x_j) = ||x_i - x_j||_2\) is ineffective, with poor convergence in optimization and learning. There are many other possible choices of distance function for this embedding, but we find the recently proposed spacetime distance [Law and Lucas 2023] to be simple and highly effective. This distance function has deep interpretations in special relativity, defining pseudo-Riemannian structures. In our setting the spacetime distance \(\mathsf {d}^\textrm {st}\) is computationally straightforward, splitting the components of \(x = [x^\textrm {s}, x^\textrm {t}]\) into a subvector \(x^\textrm {s}\in \mathbb {R}^{k^\textrm {s}}\) of space coordinates and a subvector \(x^\textrm {t}\in \mathbb {R}^{k^\textrm {t}}\) of time coordinates:
\[ \mathsf {d}^\textrm {st}(x_i, x_j) = ||x^\textrm {s}_i - x^\textrm {s}_j||_2^2 - ||x^\textrm {t}_i - x^\textrm {t}_j||_2^2, \]
where [·, ·] denotes vector concatenation. Note that \(\mathsf {d}^\textrm {st}\) is not a distance metric, and may be negative; this is of no concern, as we simply need to threshold it by some \(\tau \in \mathbb {R}\) to recover edges, treating \(\tau\) as an additional optimized parameter. In Figure 4 we show that this significantly accelerates convergence; see Section 4 for details.
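A minimal sketch of the pairwise spacetime distance, splitting space and time coordinates as defined above (illustrative names; `k_space` stands in for \(k^\textrm {s}\)):

```python
import torch

def spacetime_pairwise(x: torch.Tensor, k_space: int) -> torch.Tensor:
    """x: (V, k) embeddings, first k_space columns are space coordinates,
    the remainder are time coordinates; returns the (V, V) matrix of d_st."""
    xs, xt = x[:, :k_space], x[:, k_space:]
    # Squared space distance minus squared time distance; may be negative.
    return torch.cdist(xs, xs) ** 2 - torch.cdist(xt, xt) ** 2

# Usage with the edge rule above: adj = recover_edges(spacetime_pairwise(x, 4), tau)
```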
Loss Function. At training time, we fit the adjacency embedding by supervising the distances under a cross entropy loss:
\[ \mathcal {L}_\textrm {adj} = \sum _{(i,j) \in \mathcal {E}_\textrm {gt}} -\log \sigma \big (\tau - \mathsf {d}(x_i, x_j)\big ) \;+\; \lambda \sum _{(i,j) \notin \mathcal {E}_\textrm {gt}} -\log \Big (1 - \sigma \big (\tau - \mathsf {d}(x_i, x_j)\big )\Big ), \]
where \(\sigma\) is the logistic function (i.e. a sigmoid), \(\mathcal {E}_\textrm {gt}\) denotes the set of edges in the ground truth mesh, and \(\lambda > 0\) is a regularization parameter balancing positive and negative matches.
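A hedged sketch of this loss, assuming \(\lambda\) weights the non-edge terms and ignoring diagonal self-pairs for brevity (the exact weighting and masking may differ):

```python
import torch
import torch.nn.functional as F

def adjacency_loss(d: torch.Tensor, adj_gt: torch.Tensor,
                   tau: torch.Tensor, lam: float) -> torch.Tensor:
    """d: (V, V) pairwise distances; adj_gt: (V, V) bool ground-truth edges.
    Illustrative sketch: BCE on the logit (tau - d), negatives weighted by lam."""
    logits = tau - d                                    # positive logit => edge
    weights = adj_gt.float() + (~adj_gt).float() * lam  # balance pos/neg terms
    return F.binary_cross_entropy_with_logits(
        logits, adj_gt.float(), weight=weights)
```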
3.3 Representing Faces
To recover faces and manifold connectivity from a graph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\), we further propose to parameterize halfedge connectivity for the mesh (Section 3.1). Given \(\mathcal {V}\) and \(\mathcal {E}\), we construct the halfedge set by splitting each edge \(e_{ij}\) between vertices \(i, j\) into two oppositely-directed halfedges \(h_{ij}, h_{ji}\). This pairing trivially implies the \(\texttt{twin}\) relationships as \(\texttt{twin}(h_{ij}) = h_{ji}\); we then only need to specify the \(\texttt{next}\) relationships to complete the halfedge mesh and define the face set.
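A small sketch of this construction (illustrative names, not the paper's code):

```python
def build_halfedges(edges: list[tuple[int, int]]):
    """Split each undirected edge (i, j) into twin halfedges h_ij, h_ji."""
    halfedges, twin_of = [], {}
    for i, j in edges:
        a, b = len(halfedges), len(halfedges) + 1
        halfedges += [(i, j), (j, i)]   # h_ij at index a, h_ji at index b
        twin_of[a], twin_of[b] = b, a   # twin(h_ij) = h_ji, an involution
    return halfedges, twin_of
```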
Neighborhood Orderings. The \(\texttt{next}\) operator defines a cyclic permutation with a single orbit on the halfedges outgoing from each vertex. Thus the task of assigning the \(\texttt{next}\) operator (and implicitly, the potentially-polygonal faces of the mesh) comes down to learning this permutation for each vertex.
Representing Neighborhood Orderings. For each vertex, we define a triplet of continuous permutation features \(y^{\textrm {root}}, y^{\textrm {prev}}, y^{\textrm {next}}\in \mathbb {R}^{k^{\textrm {p}}}\). These are used to determine the local cyclic ordering of incident edges. Precisely, in the local neighborhood of each vertex \(i \in \mathcal {V}\) with degree \(D\), for each pair of edges \(e_{ij}, e_{ik}\), we combine the features of vertices \(i\), \(j\), and \(k\) via a scalar-valued function \(F(y^{\textrm {root}}_i, y^{\textrm {prev}}_j, y^{\textrm {next}}_k)\) (see Section 4.3). Gathering these pairwise entries yields a nonnegative matrix \(\Phi ^i \in \mathbb {R}^{D \times D}\) in the local neighborhood of each vertex:
\[ \Phi ^i_{jk} = F(y^{\textrm {root}}_i, y^{\textrm {prev}}_j, y^{\textrm {next}}_k), \]
where each row corresponds to an incident edge. We then use Sinkhorn normalization [Sinkhorn 1964] to recover a doubly-stochastic matrix \(\bar{\Phi }^i\), representing a softened permutation matrix [Adams and Zemel 2011].
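A minimal sketch of the Sinkhorn step (the fixed iteration count is an illustrative choice, not the paper's setting):

```python
import torch

def sinkhorn(phi: torch.Tensor, n_iters: int = 20, eps: float = 1e-8):
    """phi: (D, D) nonnegative scores for one vertex neighborhood; alternately
    normalize rows and columns until approximately doubly stochastic."""
    for _ in range(n_iters):
        phi = phi / (phi.sum(dim=1, keepdim=True) + eps)  # rows sum to 1
        phi = phi / (phi.sum(dim=0, keepdim=True) + eps)  # cols sum to 1
    return phi  # the softened permutation matrix, bar(Phi)
```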
Loss Function. At training or optimization time, we simply supervise the matrices \(\bar{\Phi }\) directly with the ground truth permutation matrices using a binary cross-entropy loss:
\[ \mathcal {L}_\textrm {next} = \sum _{\texttt{next}(h_{ij}) = h_{jk} \,\in\, \mathcal {N}_\textrm {gt}} -\log \bar{\Phi }^j_{ik}, \]
where \(\mathcal {N}_\textrm {gt}\) is the set of all \(\texttt{next}\) relationships in the ground truth mesh such that \(\texttt{next}(h_{ij}) = h_{jk}\). Note that we do not need to supervise the remaining entries of \(\bar{\Phi }^i\), which is already Sinkhorn-normalized.
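For illustration, under the indexing above, the per-vertex term might be computed as follows (hypothetical helper; assumes the ground-truth ordering is given as permutation indices):

```python
import torch

def next_loss(phi_bar: torch.Tensor, perm_gt: torch.Tensor) -> torch.Tensor:
    """phi_bar: (D, D) Sinkhorn-normalized matrix at one vertex.
    perm_gt: (D,) ground-truth indices, perm_gt[row] = matched column."""
    rows = torch.arange(phi_bar.shape[0])
    # Only the matched entries need supervision; the rest are constrained
    # by the doubly-stochastic structure of phi_bar.
    return -(phi_bar[rows, perm_gt] + 1e-8).log().mean()
```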
Extracting Meshes. At inference time, to actually extract a mesh, for each vertex neighborhood we seek the lowest-cost matching under the pairwise cost matrix \(-\bar{\Phi }^i\), among only those matchings which form a single orbit. To compute this matching, we first compute the optimal unconstrained lowest-cost matching [Jonker and Volgenant 1988]; often this matching already forms a single orbit, but when it does not we fall back on a greedy algorithm which starts at an arbitrary entry and repeatedly takes the next lowest-cost entry without violating the single-orbit constraint. These neighborhood matchings then imply halfedge connectivity as
\[ \texttt{next}(h_{ij}) = h_{jk} \quad \Longleftrightarrow \quad \text {the matching at vertex } j \text { assigns } i \mapsto k. \]
This completes the halfedge mesh representation. Faces, potentially of any polygonal degree, can then be extracted as orbits of the \(\texttt{next}\) operator.
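A sketch of this extraction for one vertex neighborhood, using SciPy's assignment solver plus a greedy fallback (our illustration; details such as the fixed starting entry are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def is_single_orbit(perm: np.ndarray) -> bool:
    """Check whether the permutation is one cycle covering all entries."""
    h, count = 0, 0
    while True:
        h, count = perm[h], count + 1
        if h == 0:
            break
    return count == len(perm)

def neighborhood_permutation(phi_bar: np.ndarray) -> np.ndarray:
    """Lowest-cost matching under -phi_bar, constrained to a single orbit."""
    _, cols = linear_sum_assignment(-phi_bar)  # optimal unconstrained matching
    if is_single_orbit(cols):
        return cols
    # Greedy fallback: walk from entry 0, always taking the highest-scoring
    # unused column, never closing the cycle before the final step.
    D = phi_bar.shape[0]
    perm, used, row = np.full(D, -1), np.zeros(D, dtype=bool), 0
    for step in range(D):
        for col in np.argsort(-phi_bar[row]):
            if not used[col] and not (col == 0 and step < D - 1):
                perm[row], used[col], row = col, True, col
                break
    return perm
```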
6 Discussion
Scalability and Runtime. Our approach represents discrete connectivity via a fixed-size continuous embedding per-vertex. Concrete results about the size of such an embedding needed to represent all possible discrete structures remain an open problem in graph theory [Nickel et al. 2014; Nickel and Kiela 2017]. In practice, we find low-dimensional embeddings \(k < 10\) to be sufficient to represent every mesh in our experiments. Encoding a 10,000-vertex mesh via direct optimization, as shown in Figure 2, converges in 600 iterations (approximately 2 minutes) with \(k^{\textrm {p}} = 6\).
For learning, the bottleneck is memory usage in transformer blocks. We demonstrate generations up to 2,000 vertices in the auto-decoder setting; this is modest compared to high-resolution meshes, but it already captures many CAD and artist-created assets, and exceeds other recent direct mesh generation works (e.g., around 200 vertices in MeshGPT [Siddiqui et al. 2023]). Our generative model takes less than 2 seconds to generate a single mesh, which is notably faster than recent auto-regressive models like MeshGPT, which require 30-90 seconds. All inference and optimization times are measured on an NVIDIA A6000 GPU.
Limitations. Although our representation guarantees manifold connectivity, it may contain other errors such as self-intersections, spurious high-degree polygons, or significantly non-planar faces. The frequency of such errors depends on how the representation is generated or optimized: often they have little effect on the approximated surface (Figure 9), but in other cases they may significantly degrade the generated geometry, as shown in Figure 13. Note that such artifacts are not always erroneous; meshes designed by artists often intentionally include self-intersections. If desired, we could potentially mitigate self-intersections by penalizing them with regularizers during training.
Our implementation does not handle open surfaces; this could be addressed by predicting a flag for boundary edges, much like we predict a mask for padded vertices. Also, like other diffusion-based generative models, our large-scale learning experiments may produce nonsensical outputs for difficult or out-of-distribution inputs.
Future Work. Looking forward, we see many possibilities to build upon our representation for directly generating meshes in learning pipelines. In the short term, this could mean generating connectivity embeddings as well as vertex positions from a diffusion model, and in the longer term, one might even fit SpaceMesh generators in an unsupervised fashion using energy functions to remove the reliance on mesh datasets for supervision entirely.