1 Introduction
Polygonal meshes play an essential role in computer graphics, favored for their simplicity, flexibility, and efficiency. They can represent surfaces of arbitrary topology with non-uniform polygons, and support a wide range of downstream processing and simulation. Additionally, meshes are ideal for rasterization and texture mapping, making them efficient for rendering. However, the benefits of meshes rely heavily on their quality. For example, meshes with non-manifold connectivity or too many elements may break operations that leverage local structure, or make processing prohibitively expensive. Consequently, developing automatic algorithms and tools for generating high-quality meshes is an ongoing research focus.
It is no surprise that recent advancements in deep learning have led to growing interest in learning-based mesh creation. Generating meshes as output, however, is a notoriously challenging task for machine learning algorithms, as meshes have a complex combination of continuous and discrete structure. Not only do mesh vertices and edges form a graph, but mesh faces add additional interconnected structure, and furthermore those faces ought to be arranged locally for manifold connectivity. Existing approaches range from implicit function isosurfacing [Gao et al. 2022; Mescheder et al. 2019; Shen et al. 2021; 2023], which offers easy optimization and a guarantee of validity at the expense of restricting to a limited family of meshes, to directly generating faces as an array of vertex triplets [Alliegro et al. 2023; Nash et al. 2020; Siddiqui et al. 2023], a discrete-first perspective which cannot be certain to respect the constraints of local structure. This work seeks a solution that offers the best of all worlds: the ease and utility that comes from working in a continuous parameterization, a guarantee to produce meshes with manifold structure by construction, and the generality to represent the full range of possible meshes.
We present SpaceMesh, a representation for meshes built on continuous embeddings well-suited for learning and optimization, which guarantees manifold output and supports complex polygonal connectivity. Our approach derives from the halfedge data structure [Weiler 1986], which inherently represents manifold, oriented polygonal meshes; the heart of our contribution is a continuous parameterization for halfedge mesh connectivity.
The main idea is to represent mesh connectivity by first constructing a set of edges and halfedges, and then constructing the so-called next relationship among those halfedges to implicitly define the faces of the mesh. We introduce a parameterization of edge adjacency and next relationships with low-dimensional, per-vertex embeddings. These embeddings, by construction, always produce a manifold halfedge mesh without additional constraints. Moreover, the per-vertex embedding is straightforward to predict as a neural network output and demonstrates fast convergence during optimization. The continuous property of our representation facilitates new architectures for mesh generation, and enables applications like mesh repair with learning.
We validate our representation against alternatives for representing graph adjacency and meshes, demonstrating significantly faster convergence, which is fundamentally important for learning tasks. Combined with a generative model for vertices, we showcase our representation by learning different surface discretizations for meshing. Additionally, our representation enables mesh repair via deep learning, simultaneously predicting both vertices and topology.
2 Related Work
Initial deep learning-based mesh generation techniques focused on vertex prediction while maintaining fixed connectivity, which are challenging to adapt for complex 3D objects [Chen et al. 2019; Groueix et al. 2018; Hanocka et al. 2020; Litany et al. 2018; Liu et al. 2021; Ranjan et al. 2018; Tanwar et al. 2020; Wang et al. 2018; Zhang et al. 2020; 2021]. Although local topology modifications are possible through subdivision [Liu et al. 2020a; Wang et al. 2018] or remeshing [Palfinger 2022], these methods still struggle to represent general, complex 3D objects. Recent methods utilize intermediary representations that are converted into meshes using techniques like Poisson reconstruction on point clouds [Kazhdan et al. 2006; Peng et al. 2021] or isosurfacing on implicit fields [Chen et al. 2022; Gao et al. 2022; Lin et al. 2023; Shen et al. 2021; 2023]. However, these conversion processes lack precise control over mesh connectivity.
2.1 Generating Meshes
Much recent work has specifically studied approaches for generating surface meshes in learning-based pipelines.
Volumetric 3D Reconstruction. See Point2Surf [Erler et al. 2020], POCO [Boulch and Marlet 2022], NKSR [Huang et al. 2023], BSPNet [Chen et al. 2020], etc. These methods focus on reconstructing the geometric shape, rather than the mesh structure; output connectivity is always a marching-cubes mesh (or a union-of-planes in BSPNet). Our approach instead focuses on fitting particular discrete mesh connectivity structures from data. Figures 8 and 9 include a few representative methods from this family, although they generally target significantly different goals. A parallel class of methods leverages Voronoi/Delaunay-based formulations [Maruani et al. 2023; 2024], but again these focus on fitting a surface's geometric shape, rather than the particular mesh connectivity.
Direct Mesh Learning. See IER [Liu et al. 2020b], PointTriNet [Sharp and Ovsjanikov 2020], Delaunay Surface Elements (DSE) [Rakotosaona et al. 2021], and DMesh [Son et al. 2024]. Like ours, these approaches aim to directly learn structured mesh connectivity. However, our approach offers a guarantee of manifoldness, and can encode general polygonal meshes. Additionally, we demonstrate the ability to encode concise artist/CAD-like tessellation via coupled learning of vertex positions and connectivity, rather than generating faces among a rough uniformly-sampled point set. Conversely, some of these methods scale to high-resolution outputs, compared to our small-to-medium meshes. Figure 2 includes comparisons to DMesh [Son et al. 2024] as a representative method from this family; see also additional results from DSE in Figure 14 in the same setting.
Sequence Modeling. See PolyGen [Nash et al. 2020], PolyDiff [Alliegro et al. 2023], MeshGPT [Siddiqui et al. 2023], and the concurrent MeshAnything [Chen et al. 2024]. These approaches use large-scale architectures to emit a mesh one face or vertex at a time. Unlike our method, they generally do not offer any guarantees of connectivity or local structure, and all but PolyGen produce triangle soup, connecting faces together only by generating vertices at coarsely-discretized categorical coordinates. However, by building on proven paradigms from language modeling, these models have been successfully trained at very large scale. Additionally, many of these approaches support only unconditional generation, and some are not publicly available. We include a gallery of qualitative comparisons in Figure 15.
2.2 Graph Learning
Our approach draws inspiration from graph learning representations, which have shown success for graphs including gene expression [Marbach et al. 2012], molecules [Kwon et al. 2020], stochastic processes [Backhoff-Veraguas et al. 2020], and social networks [Gehrke et al. 2003]. Building on the seminal work of Gromov [1987], Nickel and Kiela [2017] showed that hyperbolic embeddings have fundamental properties for representing graphs which Euclidean embeddings lack. Hyperbolic embeddings are closely connected to the geometry of spacetime, a relationship which has been well-studied in physics [Bombelli et al. 1987; Kronheimer and Penrose 1967; Meyer 1993]. In this paper, we leverage spacetime embeddings [Law and Stam 2020; Law and Lucas 2023] to put this perspective to work for generating meshes.
3 Representation
We propose a continuous representation for the space of manifold polygonal meshes, which requires no constraints and is suitable for optimization and learning.
3.1 Background
Manifold Surface Meshes. A surface mesh \(\mathcal {M}= (\mathcal {V},\mathcal {E},\mathcal {F})\) consists of vertices \(\mathcal {V}\), edges \(\mathcal {E}\), and faces \(\mathcal {F}\), where each vertex \(v\in \mathcal {V}\) has a position \(p_v\in \mathbb {R}^3\). In a general polygonal mesh, each face is a cyclic ordering of 3 or more vertices. Each edge is an unordered pair of vertices which appear consecutively in one or more faces.
We are especially concerned with generating meshes which are not just a soup of faces, but which have coherent and consistent neighborhood connectivity. As such, we consider manifold, oriented meshes. Manifold connectivity is a topological property which does not depend on the vertex positions: edge-manifoldness means each edge has exactly two incident faces, while vertex-manifoldness means the faces incident on the vertex form a single edge-connected component homeomorphic to a disk. In an oriented mesh, all neighboring faces have a consistent outward orientation as defined by a counter-clockwise ordering of their vertices.
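To make the edge-manifoldness and orientation conditions concrete, here is a minimal Python sketch of our own (an illustration, not code from the paper) that checks them for a closed mesh given as vertex-index face lists; vertex-manifoldness would additionally require a per-vertex connected-component test, omitted here.

```python
from collections import Counter

def is_edge_manifold_oriented(faces: list[list[int]]) -> bool:
    """Illustrative check, not the paper's code. For a closed oriented,
    edge-manifold mesh: every directed face-side (a, b) occurs exactly once,
    and its reversal (b, a) occurs exactly once in a neighboring face."""
    sides = Counter()
    for f in faces:
        for a, b in zip(f, f[1:] + [f[0]]):  # consecutive pairs, wrapping
            sides[(a, b)] += 1
    return all(c == 1 and sides.get((b, a), 0) == 1
               for (a, b), c in sides.items())
```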
Halfedge Meshes. There are many possible data structures for mesh connectivity; we will leverage halfedge meshes, which by construction encode manifold, oriented meshes with possibly polygonal faces, all using only a pair of references per element. As the name suggests, halfedge meshes are defined in terms of directed face-sides, called halfedges (see inset). Each halfedge stores two references: a \(\texttt{twin}\) halfedge, the oppositely-oriented halfedge along the same edge in a neighboring face, and a \(\texttt{next}\) halfedge, the subsequent halfedge within the same face.
The \(\texttt{twin}\) and \(\texttt{next}\) operators can be interpreted as a pair of permutations over the set of halfedges; this group-theoretic perspective is studied in combinatorics as a rotation system. A pair of permutations can be interpreted as a halfedge mesh as long as (a) neither operator maps any halfedge to itself, and (b) the \(\texttt{twin}\) operator is an involution, i.e. \(\texttt{twin}(\texttt{twin}(h)) = h\). The faces of the mesh are the orbits traversed by repeatedly following the \(\texttt{next}\) operator (see inset); we further require that these orbits all have a degree of at least three, to disallow two-sided faces. Our representation will construct a valid set of \(\texttt{twin}\) and \(\texttt{next}\) operators from a continuous embedding to define mesh connectivity.
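As an illustration of this encoding (hypothetical helper names, not the paper's implementation), the sketch below stores \(\texttt{next}\) as an integer array over halfedges, makes \(\texttt{twin}\) an involution by pairing consecutive indices, and recovers faces as orbits of \(\texttt{next}\).

```python
def twin(h: int) -> int:
    # Pair halfedges 2e and 2e+1 of edge e; an involution by construction.
    return h ^ 1

def extract_faces(next_of: list[int]) -> list[list[int]]:
    """Each orbit of `next` is one face; a valid mesh has orbits of length >= 3."""
    faces, visited = [], [False] * len(next_of)
    for h0 in range(len(next_of)):
        if visited[h0]:
            continue
        face, h = [], h0
        while not visited[h]:
            visited[h] = True
            face.append(h)
            h = next_of[h]   # follow `next` until the orbit closes
        faces.append(face)
    return faces
```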
3.2 Representing Edges
To begin, consider modeling a mesh simply as a graph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\); later we will extend this model to capture manifold mesh structure via halfedge connectivity (Section 3.3). The vertex set \(\mathcal {V}\) can be viewed as a particular kind of point cloud, and point cloud generation is a well-studied problem [Nichol et al. 2022; Zeng et al. 2022]. Likewise, continuous representations for generating undirected graph edges are a classic topic in graph representation learning [Law and Stam 2020; Nickel and Kiela 2017]. A basic approach is to associate an adjacency embedding \(x_v\in \mathbb {R}^k\) with each vertex, then define an edge between two vertices \(i, j\) if they are sufficiently close w.r.t. some distance function \(\mathsf {d}\):
\[ e_{ij} \in \mathcal {E} \iff \mathsf {d}(x_i, x_j) < \tau, \]
for some learned threshold \(\tau \in \mathbb {R}\). Representing the vertices and edges of a mesh then amounts to two vectors for each vertex \(v\): a 3D position \(p_v\in \mathbb {R}^3\) and an adjacency embedding \(x_v\in \mathbb {R}^k\).
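For concreteness, a minimal sketch of this edge rule (illustrative function names; any pairwise distance can be plugged in) might look like:

```python
import torch

def recover_edges(d: torch.Tensor, tau: float) -> torch.Tensor:
    """d: (V, V) pairwise distances under some d(., .); returns bool adjacency."""
    adj = d < tau                 # threshold against the learned tau
    adj.fill_diagonal_(False)     # no self-loops
    return adj

def euclidean_pairwise(x: torch.Tensor) -> torch.Tensor:
    # Baseline Euclidean distances d^eu, reported below to converge poorly.
    return torch.cdist(x, x)      # (V, V) distances between embeddings
```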
Spacetime Distance. We find that taking the adjacency features \(x\) as Euclidean vectors under pairwise Euclidean distance \(\mathsf {d}^\textrm {eu}(x_i, x_j) = ||x_i - x_j||_2\) is ineffective, with poor convergence in optimization and learning. There are many other possible choices of distance function for this embedding, but we find the recently proposed spacetime distance [Law and Lucas 2023] to be simple and highly effective. This distance function has deep interpretations in special relativity, defining pseudo-Riemannian structures. In our setting the spacetime distance \(\mathsf {d}^\textrm {st}\) is computationally straightforward, splitting the components of \(x = [x^\textrm {s}, x^\textrm {t}]\) into a subvector \(x^\textrm {s}\in \mathbb {R}^{k^\textrm {s}}\) of space coordinates and a subvector \(x^\textrm {t}\in \mathbb {R}^{k^\textrm {t}}\) of time coordinates:
\[ \mathsf {d}^\textrm {st}(x_i, x_j) = ||x^\textrm {s}_i - x^\textrm {s}_j||_2^2 - ||x^\textrm {t}_i - x^\textrm {t}_j||_2^2, \]
where [·, ·] denotes vector concatenation. Note that \(\mathsf {d}^\textrm {st}\) is not a distance metric, and may be negative; this is of no concern, as we simply need to threshold it by some \(\tau \in \mathbb {R}\) to recover edges, treating \(\tau\) as an additional optimized parameter. In Figure 4 we show that this significantly accelerates convergence; see Section 4 for details.
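A minimal sketch of the pairwise spacetime distance, splitting space and time coordinates as defined above (illustrative names; `k_space` stands in for \(k^\textrm {s}\)):

```python
import torch

def spacetime_pairwise(x: torch.Tensor, k_space: int) -> torch.Tensor:
    """x: (V, k) embeddings, first k_space columns are space coordinates,
    the remainder are time coordinates; returns the (V, V) matrix of d_st."""
    xs, xt = x[:, :k_space], x[:, k_space:]
    # Squared space distance minus squared time distance; may be negative.
    return torch.cdist(xs, xs) ** 2 - torch.cdist(xt, xt) ** 2

# Usage with the edge rule above: adj = recover_edges(spacetime_pairwise(x, 4), tau)
```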
Loss Function. At training time, we fit the adjacency embedding by supervising the distances under a cross entropy loss:
\[ \mathcal {L}_\textrm {adj} = \sum _{(i,j) \in \mathcal {E}_\textrm {gt}} -\log \sigma \big (\tau - \mathsf {d}(x_i, x_j)\big ) \;+\; \lambda \sum _{(i,j) \notin \mathcal {E}_\textrm {gt}} -\log \Big (1 - \sigma \big (\tau - \mathsf {d}(x_i, x_j)\big )\Big ), \]
where \(\sigma\) is the logistic function (i.e. a sigmoid), \(\mathcal {E}_\textrm {gt}\) denotes the set of edges in the ground truth mesh, and \(\lambda > 0\) is a regularization parameter balancing positive and negative matches.
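A hedged sketch of this loss, assuming \(\lambda\) weights the non-edge terms and ignoring diagonal self-pairs for brevity (the exact weighting and masking may differ):

```python
import torch
import torch.nn.functional as F

def adjacency_loss(d: torch.Tensor, adj_gt: torch.Tensor,
                   tau: torch.Tensor, lam: float) -> torch.Tensor:
    """d: (V, V) pairwise distances; adj_gt: (V, V) bool ground-truth edges.
    Illustrative sketch: BCE on the logit (tau - d), negatives weighted by lam."""
    logits = tau - d                                    # positive logit => edge
    weights = adj_gt.float() + (~adj_gt).float() * lam  # balance pos/neg terms
    return F.binary_cross_entropy_with_logits(
        logits, adj_gt.float(), weight=weights)
```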
3.3 Representing Faces
To recover faces and manifold connectivity from a graph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\), we further propose to parameterize halfedge connectivity for the mesh (Section 3.1). Given \(\mathcal {V}\) and \(\mathcal {E}\), we construct the halfedge set by splitting each edge \(e_{ij}\) between vertices \(i, j\) into two oppositely-directed halfedges \(h_{ij}, h_{ji}\). This pairing trivially implies the \(\texttt{twin}\) relationships as \(\texttt{twin}(h_{ij}) = h_{ji}\); we then only need to specify the \(\texttt{next}\) relationships to complete the halfedge mesh and define the face set.
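A small sketch of this construction (illustrative names, not the paper's code):

```python
def build_halfedges(edges: list[tuple[int, int]]):
    """Split each undirected edge (i, j) into twin halfedges h_ij, h_ji."""
    halfedges, twin_of = [], {}
    for i, j in edges:
        a, b = len(halfedges), len(halfedges) + 1
        halfedges += [(i, j), (j, i)]   # h_ij at index a, h_ji at index b
        twin_of[a], twin_of[b] = b, a   # twin(h_ij) = h_ji, an involution
    return halfedges, twin_of
```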
Neighborhood Orderings. The \(\texttt{next}\) operator defines a cyclic permutation with a single orbit on the halfedges outgoing from each vertex. Thus the task of assigning the \(\texttt{next}\) operator (and implicitly, the potentially-polygonal faces of the mesh) comes down to learning this permutation for each vertex.
Representing Neighborhood Orderings. For each vertex, we define a triplet of continuous permutation features \(y^{\textrm {root}}, y^{\textrm {prev}}, y^{\textrm {next}}\in \mathbb {R}^{k^{\textrm {p}}}\). These are used to determine the local cyclic ordering of incident edges. Precisely, in the local neighborhood of each vertex \(i \in \mathcal {V}\) with degree \(D\), for each pair of edges \(e_{ij}, e_{ik}\), we combine the features of vertices \(i\), \(j\), and \(k\) via a scalar-valued function \(F(y^{\textrm {root}}_i, y^{\textrm {prev}}_j, y^{\textrm {next}}_k)\) (see Section 4.3). Gathering these pairwise entries yields a nonnegative matrix \(\Phi ^i \in \mathbb {R}^{D \times D}\) in the local neighborhood of each vertex:
\[ \Phi ^i_{jk} = F(y^{\textrm {root}}_i, y^{\textrm {prev}}_j, y^{\textrm {next}}_k), \]
where each row corresponds to an incident edge. We then use Sinkhorn normalization [Sinkhorn 1964] to recover a doubly-stochastic matrix \(\bar{\Phi }^i\), representing a softened permutation matrix [Adams and Zemel 2011].
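A minimal sketch of the Sinkhorn step (the fixed iteration count is an illustrative choice, not the paper's setting):

```python
import torch

def sinkhorn(phi: torch.Tensor, n_iters: int = 20, eps: float = 1e-8):
    """phi: (D, D) nonnegative scores for one vertex neighborhood; alternately
    normalize rows and columns until approximately doubly stochastic."""
    for _ in range(n_iters):
        phi = phi / (phi.sum(dim=1, keepdim=True) + eps)  # rows sum to 1
        phi = phi / (phi.sum(dim=0, keepdim=True) + eps)  # cols sum to 1
    return phi  # the softened permutation matrix, bar(Phi)
```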
Loss Function. At training or optimization time, we simply supervise the matrices \(\bar{\Phi }\) directly with the ground truth permutation matrices using a binary cross-entropy loss:
\[ \mathcal {L}_\textrm {next} = \sum _{\texttt{next}(h_{ij}) = h_{jk} \,\in\, \mathcal {N}_\textrm {gt}} -\log \bar{\Phi }^j_{ik}, \]
where \(\mathcal {N}_\textrm {gt}\) is the set of all \(\texttt{next}\) relationships in the ground truth mesh such that \(\texttt{next}(h_{ij}) = h_{jk}\). Note that we do not need to supervise the remaining entries of \(\bar{\Phi }^i\), which is already Sinkhorn-normalized.
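For illustration, under the indexing above, the per-vertex term might be computed as follows (hypothetical helper; assumes the ground-truth ordering is given as permutation indices):

```python
import torch

def next_loss(phi_bar: torch.Tensor, perm_gt: torch.Tensor) -> torch.Tensor:
    """phi_bar: (D, D) Sinkhorn-normalized matrix at one vertex.
    perm_gt: (D,) ground-truth indices, perm_gt[row] = matched column."""
    rows = torch.arange(phi_bar.shape[0])
    # Only the matched entries need supervision; the rest are constrained
    # by the doubly-stochastic structure of phi_bar.
    return -(phi_bar[rows, perm_gt] + 1e-8).log().mean()
```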
Extracting Meshes. At inference time, to actually extract a mesh, for each vertex neighborhood we seek the lowest-cost matching under the pairwise cost matrix \(-\bar{\Phi }^i\), among only those matchings which form a single orbit. To compute this matching, we first compute the optimal unconstrained lowest-cost matching [Jonker and Volgenant 1988]; often this matching already forms a single orbit, but when it does not we fall back on a greedy algorithm which starts at an arbitrary entry and repeatedly takes the next lowest-cost entry without violating the single-orbit constraint. These neighborhood matchings then imply halfedge connectivity as
\[ \texttt{next}(h_{ij}) = h_{jk} \quad \Longleftrightarrow \quad \text {the matching at vertex } j \text { assigns } i \mapsto k. \]
This completes the halfedge mesh representation. Faces, potentially of any polygonal degree, can then be extracted as orbits of the \(\texttt{next}\) operator.
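A sketch of this extraction for one vertex neighborhood, using SciPy's assignment solver plus a greedy fallback (our illustration; details such as the fixed starting entry are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def is_single_orbit(perm: np.ndarray) -> bool:
    """Check whether the permutation is one cycle covering all entries."""
    h, count = 0, 0
    while True:
        h, count = perm[h], count + 1
        if h == 0:
            break
    return count == len(perm)

def neighborhood_permutation(phi_bar: np.ndarray) -> np.ndarray:
    """Lowest-cost matching under -phi_bar, constrained to a single orbit."""
    _, cols = linear_sum_assignment(-phi_bar)  # optimal unconstrained matching
    if is_single_orbit(cols):
        return cols
    # Greedy fallback: walk from entry 0, always taking the highest-scoring
    # unused column, never closing the cycle before the final step.
    D = phi_bar.shape[0]
    perm, used, row = np.full(D, -1), np.zeros(D, dtype=bool), 0
    for step in range(D):
        for col in np.argsort(-phi_bar[row]):
            if not used[col] and not (col == 0 and step < D - 1):
                perm[row], used[col], row = col, True, col
                break
    return perm
```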
6 Discussion
Scalability and Runtime. Our approach represents discrete connectivity via a fixed-size continuous embedding per-vertex. Concrete results about the size of such an embedding needed to represent all possible discrete structures remain an open problem in graph theory [Nickel et al. 2014; Nickel and Kiela 2017]. In practice, we find low-dimensional embeddings \(k < 10\) to be sufficient to represent every mesh in our experiments. Encoding a 10,000-vertex mesh via direct optimization, as shown in Figure 2, converges in 600 iterations (approximately 2 minutes) with \(k^{\textrm {p}} = 6\).
For learning, the bottleneck is memory usage in transformer blocks. We demonstrate generations up to 2,000 vertices in the auto-decoder setting; this is modest compared to high-resolution meshes, but it already captures many CAD and artist-created assets, and exceeds other recent direct mesh generation works (e.g., around 200 vertices in MeshGPT [Siddiqui et al. 2023]). Our generative model takes less than 2 seconds to generate a single mesh, which is notably faster than recent auto-regressive models like MeshGPT, which require 30-90 seconds. All inference and optimization times are measured on an NVIDIA A6000 GPU.
Limitations. Although our representation guarantees manifold connectivity, it may contain other errors such as self-intersections, spurious high-degree polygons, or significantly non-planar faces. The frequency of such errors depends on how the representation is generated or optimized: often they have little effect on the approximated surface (Figure 9), but in other cases they may significantly degrade the generated geometry, as shown in Figure 13. Note that such artifacts are not always erroneous; meshes designed by artists often intentionally include self-intersections. If desired, we could potentially mitigate self-intersections by penalizing them with regularizers during training.
Our implementation does not handle open surfaces; this could be addressed by predicting a flag for boundary edges, much like we predict a mask for padded vertices. Also, like other diffusion-based generative models, our large-scale learning experiments may produce nonsensical outputs for difficult or out-of-distribution inputs.
Future Work. Looking forward, we see many possibilities to build upon our representation for directly generating meshes in learning pipelines. In the short term, this could mean generating connectivity embeddings as well as vertex positions from a diffusion model, and in the longer term, one might even fit SpaceMesh generators in an unsupervised fashion using energy functions to remove the reliance on mesh datasets for supervision entirely.