1 Introduction
There is increasing interest in experiencing apparel in 3D for virtual try-on applications and e-commerce, as well as an increasing demand for 3D clothing assets for games, virtual reality, and augmented reality applications. While there is an abundance of 2D images of fashion items online, and recent generative AI algorithms democratize the creative generation of such images, the creation of high-quality 3D clothing assets remains a significant challenge. In this work we explore how to transfer the appearance of clothing items from 2D images onto 3D assets, as shown in Figure 1.
Extracting the fabric material and prints from such imagery is a challenging task: the clothing items in the images exhibit strong distortion and shading variation due to wrinkling and the underlying body shape, in addition to general illumination variation and occlusions. To overcome these challenges, we propose a generative approach capable of extracting high-quality physically-based fabric materials and prints from a single input image and transferring them to 3D garment meshes of arbitrary shapes. The result may be rendered using Physically Based Rendering (PBR) to realistically reproduce the garments, for example, in a game engine under novel environment illumination and cloth deformation.
Existing methods for example-based 3D garment texturing primarily focus on direct texture synthesis onto 3D meshes using techniques such as 2D-to-3D texture mapping [Gao et al. 2024; Majithia et al. 2022; Mir et al. 2020] or multi-view depth-aware inpainting by distilling a pre-trained 2D generative model [Richardson et al. 2023; Yeh et al. 2024; Zeng 2023]. However, these approaches often lead to irregular and low-quality textures due to the inherent inaccuracies of 2D-to-3D registration and the stochastic nature of generative processes. Moreover, they struggle to faithfully represent texture details or disentangle garment distortions, resulting in significant degradation in texture continuity and quality.
In this work, we seek to overcome these limitations by drawing inspiration from the real-world garment creation process in the fashion industry [Korosteleva and Lee 2021; Liu et al. 2023]: most 3D garments are typically modeled from 2D sewing patterns with normalized and tileable texture maps. This allows us to approach the texturing process from a novel angle, where obtaining such texture maps enables more accurate and realistic garment rendering across various poses and environments. Interestingly, if we take the 3D mesh away from our task of texture transfer, there has been a long history of development in 2D exemplar-based texture map extraction and synthesis [Cazenavette et al. 2022; Diamanti et al. 2015; Efros and Freeman 2023; Efros and Leung 1999; Guarnera et al. 2017; Hao et al. 2023; Li et al. 2022; Lopes et al. 2024; Rodriguez-Pardo et al. 2023, 2019; Schröder et al. 2014; Tu et al. 2022; Wei et al. 2009; Wu et al. 2019; Yeh et al. 2022]. Nevertheless, there remains a significant gap in effectively correcting the geometric distortion or calibrating the appearance (e.g., lighting) of the fabric present in the input reference images.
How can we translate a clothing image into a normalized and tileable texture map? At first glance, solving this ill-posed inverse problem is challenging and may require developing sophisticated frameworks to model the explicit mapping. Instead, we investigate a feed-forward pathway that simulates the texture distortion and lighting conditions mapping a texture from its normalized form to its appearance on a 3D garment mesh. We then propose to train a denoising diffusion model [Ho et al. 2020; Rombach et al. 2022] on paired texture images (i.e., both distorted and normalized) to generate normalized and tileable texture images. Such an objective makes the training procedure fairly straightforward, which we see as a key strength. As a result, generating normalized texture images becomes a supervised distribution mapping problem of translating distorted texture patches back to a unified normalized space.
However, acquiring such paired training data from real clothing at scale is infeasible. To address this issue, we develop a large-scale synthetic dataset comprising over 100k textile color images, 3.8k material PBR texture maps, 7k prints (e.g., logos), and 22 raw 3D garment meshes. These PBR textures and prints are carefully applied to the raw 3D garment meshes and then rendered using PBR techniques under diverse lighting and environmental conditions, simulating real-world scenarios. For each fabric capture from the textured 3D garment, we render a corresponding image using the ground-truth PBR textures applied to a flat mesh under a controlled illumination condition, i.e., an orthogonal close-up view with a point light from above. The captured texture inputs, along with their ground-truth flat-mesh renders, are used to train our diffusion model. Figure 3 illustrates the pipeline of training data construction.
We name our method FabricDiffusion and systematically study its performance on both synthetic data and real-world scenarios. Despite being trained entirely on synthetic rendered examples, FabricDiffusion achieves zero-shot generalization to in-the-wild images with complex textures and prints. Furthermore, the outputs of FabricDiffusion seamlessly integrate with existing PBR material estimation pipelines [Sartor and Peers 2023], allowing for accurate relighting of the garment under different lighting conditions. In summary, FabricDiffusion represents a state-of-the-art approach capable of extracting undistorted texture maps from real-world clothing images to produce realistic 3D garments.
3 Method
We propose FabricDiffusion to extract normalized, tileable texture images and materials from a real-world clothing image, and then apply them to the target 3D garment. The overall framework is illustrated in Figure 2. We first introduce the problem statement in Section 3.1, followed by procedures for constructing synthetic training examples in Section 3.2. In Section 3.3, we detail our approach to texture map generation. Finally, we describe PBR material generation and garment rendering in Section 3.4.
3.1 Problem Statement
Given an input clothing image \(I\) and a captured texture region \(x\), which may exhibit various distortions and illumination changes due to occlusions and poses present in the input image, our goal is to learn a mapping function \(g\) that takes the captured patch \(x\) and outputs the corresponding normalized texture map \(\tilde{x}\), effectively correcting the distortions. The texture map \(\tilde{x}\) needs to retain the intrinsic properties of the original captured region, such as color, texture pattern, and material characteristics.
As mentioned in Section 1, we formulate the generation of normalized texture maps from a real-life clothing patch as a distribution mapping problem. Specifically, the mapping function \(g\) can be modeled by a generative process:
\[
\tilde{x} = G_\theta(x, \epsilon), \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}), \tag{1}
\]
where the generative model \(G_\theta\), parameterized by \(\theta\), takes the input patch \(x\) as a condition and samples from Gaussian noise \(\epsilon\) to generate the distortion-free texture map \(\tilde{x}\) in a canonical space. To train the generator \(G_\theta\), we must create a large number of paired training examples \((x, x_0)\) across various types of textures, where \(x\) is the input capture and \(x_0\) is the corresponding ground-truth normalized texture. After training, we expect the sampled output \(\tilde{x}\) to align with the distribution of normalized textures.
3.2 Synthetic Paired Training Data Construction
Collecting paired training examples from real clothing poses significant challenges. In contrast, we found that PBR textures, the fundamental unit for appearance modeling in 3D apparel creation, are much more accessible from public sources (see Section 4.1 for details on dataset collection). Given these observations, we propose to build synthetic environments for constructing distorted and flat rendered training pairs using the PBR material model [McAuley et al. 2012]. Figure 3 illustrates the overall pipeline.
3.2.1 Paired training examples construction.
For each material, we collect the ground-truth diffuse albedo (\(k_d \in \mathbb {R}^3\)), normal (\(k_n \in \mathbb {R}^3\)), roughness (\(k_r \in \mathbb {R}^2\)), and metallic (\(k_m \in \mathbb {R}^2\)) material maps. To create distorted rendered images that mimic real-world surface deformation and lighting, we map these material maps onto a raw garment mesh sampled from 22 common garment types. The PBR textures are tiled appropriately and illuminated using four environment maps with white lights to avoid color biases. During rendering, we capture frontal views of the garment and randomly crop patches from the rendered images to match the original fabric texture size.
Separately, we render the same texture material on a plane mesh to create flat rendered images as ground truths (image \(x_0\) in Figure 3). For illumination, we use a fixed point light above the surface center and a fixed orthogonal camera for rendering. This approach is highly beneficial as it provides supervision to align the distorted rendered images on the 3D garment to a canonical space of normalized, flat images with a unified lighting condition. A minimal sketch of this capture setup is given below.
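The following is a minimal sketch of the flat "ground-truth" capture setup described above, written for Blender (bpy). The choice of renderer and all numeric values (light strength, camera height, resolution) are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Flat capture: a plane carrying the PBR material, a fixed point light above its center,
# and a fixed orthographic camera looking straight down (material assignment omitted).
import bpy

def setup_flat_capture(resolution=1024):
    bpy.ops.wm.read_factory_settings(use_empty=True)

    # Flat plane that will carry the tiled PBR material.
    bpy.ops.mesh.primitive_plane_add(size=1.0, location=(0.0, 0.0, 0.0))

    # Fixed point light directly above the surface center (white light).
    bpy.ops.object.light_add(type='POINT', location=(0.0, 0.0, 1.0))
    bpy.context.object.data.energy = 50.0  # arbitrary strength for the example

    # Fixed orthographic camera pointing straight down for a distortion-free close-up view.
    bpy.ops.object.camera_add(location=(0.0, 0.0, 2.0), rotation=(0.0, 0.0, 0.0))
    camera = bpy.context.object
    camera.data.type = 'ORTHO'
    camera.data.ortho_scale = 1.0  # frame exactly the 1x1 plane
    bpy.context.scene.camera = camera

    bpy.context.scene.render.resolution_x = resolution
    bpy.context.scene.render.resolution_y = resolution

setup_flat_capture()
bpy.context.scene.render.filepath = "/tmp/flat_render_x0.png"
bpy.ops.render.render(write_still=True)
```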
In fact, our flat image rendering and capture approach may be reminiscent of the input format used in well-known SVBRDF material estimation methods [Sartor and Peers 2023; Zhou et al. 2023b, 2022; Zhou and Kalantari 2021], which require orthogonal close-up views of the materials and/or a flash image as input. As will be described in Section 3.4, the normalized textures output by our method can be effectively integrated with SVBRDF material estimation models to generate high-quality PBR material maps.
3.2.2 Paired prints (e.g., logos) construction.
In addition to general textures, we aim to transfer clothing details by creating warped and flat pairs of print images. We map the print to a random location on the garment mesh and blend it with a uniformly colored background texture, as sketched below. Unlike flat texture generation on a plane mesh, we use the original print image with a transparent background as the flat image.
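A minimal Pillow-based sketch of this compositing step follows; the file names, colors, and 512-pixel resolution are assumptions for the example, and the subsequent warped-view rendering is not shown.

```python
# Blend a print (with alpha) onto a uniformly colored background texture at a random location.
from PIL import Image
import random

def composite_print(print_path, bg_color=(180, 40, 40, 255), size=512):
    background = Image.new("RGBA", (size, size), bg_color)  # uniform fabric color
    print_img = Image.open(print_path).convert("RGBA")      # print/logo with transparency

    # Paste the print at a random location, respecting its alpha channel.
    max_x = max(size - print_img.width, 0)
    max_y = max(size - print_img.height, 0)
    offset = (random.randint(0, max_x), random.randint(0, max_y))
    background.alpha_composite(print_img, dest=offset)
    return background  # warped views are then rendered from this blended texture

blended = composite_print("logo.png")
blended.save("print_on_background.png")
```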
3.2.3 Scaling up training data with Pseudo-BRDF materials.
While texture material maps are easier to acquire than real clothing, we raise the question: do we really need a large number of real BRDF material maps for paired training data construction, and what if we cannot obtain enough data?
In this work, we are able to collect a BRDF dataset comprising 3.8k assets in total (see Section 4.1 for details), covering a broad spectrum of fabric materials. However, the texture patterns in this dataset exhibit limited diversity: it is not large enough to model the appearance of fabric textures encountered in real life, given the vast range of colors, patterns, and materials. To address this, we augment the dataset by gathering 100k textile color images featuring a wide array of patterns and designs, which are then used to generate pseudo-BRDF materials, as sketched below. Specifically, the color image serves as the albedo map, while the roughness map is assigned a uniform value \(\alpha\) sampled from the distribution \(\mathcal{N}(0.708, 0.193^2)\), with 0.708 and 0.193 representing the population mean and standard deviation of the mean roughness values of the real BRDF dataset, respectively. The metallic map is assigned a uniform value \(\max(\beta, 0)\), where \(\beta \sim \mathcal{U}(-0.05, 0.05)\), and the normal map is kept flat.
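The sketch below illustrates this pseudo-BRDF construction in NumPy. Only the sampling parameters come from the text; the map resolution, the clipping of roughness to [0, 1], and the (0.5, 0.5, 1.0) encoding of a flat tangent-space normal are common conventions assumed for the example.

```python
import numpy as np

def make_pseudo_brdf(color_image: np.ndarray, rng: np.random.Generator):
    """color_image: HxWx3 float array in [0, 1], used directly as the albedo map."""
    h, w, _ = color_image.shape

    albedo = color_image                                         # k_d: the textile color image
    roughness_value = np.clip(rng.normal(0.708, 0.193), 0.0, 1.0)  # clip is an assumption
    roughness = np.full((h, w), roughness_value)                 # k_r: uniform roughness
    metallic_value = max(rng.uniform(-0.05, 0.05), 0.0)
    metallic = np.full((h, w), metallic_value)                   # k_m: near-zero (non-metal)
    normal = np.tile([0.5, 0.5, 1.0], (h, w, 1))                 # k_n: flat normal map

    return {"albedo": albedo, "normal": normal,
            "roughness": roughness, "metallic": metallic}

rng = np.random.default_rng(0)
maps = make_pseudo_brdf(np.ones((512, 512, 3)) * 0.6, rng)
```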
We use a combination of real (3.8k) and pseudo-BRDF (100k) materials to create paired rendered images for training our texture generation model. During the construction of paired training examples, both real and pseudo-BRDF materials yield \(x\) and \(x_0\) (as illustrated in Figure 3), representing distorted and flat textures, respectively. Intuitively, the primary goal of our texture generator is to eliminate geometric distortions, and the rendered images generated from pseudo-BRDF materials serve this purpose effectively.
3.3 Normalized Texture Generation via FabricDiffusion
Given the paired training images, we build a denoising diffusion model to learn the distribution mapping from the input capture to the normalized texture map. Next, we detail our training objective, model architecture and training, and the designs for tileable texture generation and alpha-channel-enabled print generation.
3.3.1 Training objective of conditional diffusion model.
Diffusion models [Ho et al. 2020; Sohl-Dickstein et al. 2015] are trained to capture the distribution of training images through a sequential Markov chain that adds random noise to clean images and then denoises pure noise back into clean images. We leverage the Latent Diffusion Model (LDM) [Rombach et al. 2022], which improves the efficiency and quality of diffusion models by operating in the latent space of a pre-trained variational autoencoder [Kingma and Welling 2013] with encoder \(\mathcal{E}\) and decoder \(\mathcal{D}\). In our case, given the paired training data \((x, x_0)\), where \(x\) is the distorted patch and \(x_0\) is the normalized texture, the forward process is formulated by adding random Gaussian noise to the latent of image \(x_0\):
\[
x_t = \sqrt{\gamma(t)}\,\mathcal{E}(x_0) + \sqrt{1-\gamma(t)}\,\epsilon, \tag{2}
\]
where \(x_t\) is a noisy latent of the original clean input \(x_0\), \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\), \(t \in [0, 1]\), and \(\gamma(t)\) is a noise scheduler that monotonically descends from 1 to 0. Using the distorted image \(x\) as the condition, the reverse process aims to denoise Gaussian noise back into clean images by iteratively predicting the added noise at each reverse step. We minimize the following latent diffusion objective:
\[
\mathcal{L} = \mathbb{E}_{x_0,\, x,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta\big(x_t, t, \mathcal{E}(x)\big) \right\|_2^2\right], \tag{3}
\]
where \(\epsilon_\theta\) denotes the model parameterized by a neural network, \(x_t\) is the noisy latent at each timestep \(t\), and \(\mathcal{E}(x)\) is the condition.
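The following is a minimal sketch of one training step for this conditional latent diffusion objective, written with PyTorch and diffusers-style VAE/UNet/scheduler objects. The model classes, latent scaling, and channel layout are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae, scheduler, x, x0, empty_text_emb):
    """x: distorted capture (condition); x0: normalized ground-truth texture; both B x 3 x H x W."""
    with torch.no_grad():
        latents = vae.encode(x0).latent_dist.sample() * vae.config.scaling_factor  # E(x0)
        cond = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor      # E(x)

    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)                 # Equation 2

    # Concatenate the condition latent with the noisy latent along the channel axis;
    # the UNet's first convolution is assumed to have been widened accordingly (see 3.3.2).
    model_input = torch.cat([noisy_latents, cond], dim=1)

    # Text conditioning is removed; a fixed empty-prompt embedding stands in for the
    # cross-attention input that Stable Diffusion's UNet still expects.
    noise_pred = unet(model_input, timesteps, encoder_hidden_states=empty_text_emb).sample

    return F.mse_loss(noise_pred, noise)                                           # Equation 3
```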
Recalling Equation 1, the above formulation incorporates input-specific information (i.e., the captured patch \(x\)) into the training process for generating normalized textures. As will be shown in the experimental results in Section 4.2, this design is key to producing faithful texture maps and distinguishes our method from existing per-example, optimization-based texture extraction approaches [Lopes et al. 2024; Richardson et al. 2023].
3.3.2 Model architecture and training.
Any diffusion-based architecture for conditional image generation can realize Equation 3. Specifically, we use Stable Diffusion [Rombach et al. 2022], a popular open-source text-conditioned image generative model pre-trained on large-scale text and image pairs. To support image conditioning, we add extra input channels to the first convolutional layer, where the latent noise \(x_t\) is concatenated with the condition image latent \(\mathcal{E}(x)\). The model's initial weights come from the pre-trained Stable Diffusion v1.5, while the newly added channels are initialized to zero, which speeds up training and convergence. We remove text conditioning, focusing solely on using a single image as the prompt. This addresses the challenge of generating normalized texture maps, which text prompts struggle to describe accurately [Deschaintre et al. 2023].
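A sketch of the channel-widening step follows, using the diffusers library; the checkpoint identifier and channel counts are assumptions for the example.

```python
# Widen the first convolution of a pre-trained Stable Diffusion UNet so it accepts the
# concatenated [noisy latent, condition latent] input, with new channels zero-initialized.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

old_conv = unet.conv_in                        # originally 4 latent input channels
new_conv = torch.nn.Conv2d(
    in_channels=old_conv.in_channels * 2,      # 4 noisy-latent + 4 condition-latent channels
    out_channels=old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding)

with torch.no_grad():
    new_conv.weight.zero_()                                       # new channels start at zero
    new_conv.weight[:, :old_conv.in_channels] = old_conv.weight   # keep pre-trained weights
    new_conv.bias.copy_(old_conv.bias)

unet.conv_in = new_conv
```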
3.3.3 Circular padding for seamless texture generation.
To ensure the generated texture maps are tileable, we employ a simple yet effective circular padding strategy inspired by TileGen [Zhou et al. 2022]. Unlike TileGen, which uses a StyleGAN-like architecture [Karras et al. 2020] and needs to replace both regular and transposed (e.g., upsampling or downsampling) convolutions, we only apply circular padding to all regular convolutional layers, thanks to the flexibility of diffusion models.
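A minimal sketch of this modification follows: every regular Conv2d in the UNet is switched to circular padding so that the generated texture wraps seamlessly at its borders. Applying it uniformly to the whole UNet is a simplifying assumption for the example.

```python
import torch.nn as nn

def enable_circular_padding(model: nn.Module):
    for module in model.modules():
        # Transposed convolutions are a separate class and are left untouched.
        if isinstance(module, nn.Conv2d):
            module.padding_mode = "circular"

enable_circular_padding(unet)  # e.g., the widened UNet from the sketch in Section 3.3.2
```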
3.3.4 Transparent prints generation.
The vanilla Stable Diffusion model can only output RGB images and lacks the capability to generate layered or transparent images, which conflicts with our need for print transfer. Instead of redesigning the existing generative model [Zhang and Agrawala 2024], we propose a simple and effective recipe to post-process the generated RGB print images and compute an additional alpha channel. We hypothesize that the alpha map for prints can be approximated as binary, i.e., either fully transparent or fully opaque. Based on this assumption, we assign a new RGB value to each pixel \((i, j)\) of the generated texture \(\tilde{x}\) (Equation 1) and derive the corresponding alpha channel value from a threshold criterion: pixels whose initial value exceeds a certain threshold receive full opacity (an alpha value of 1), while the alpha values of the remaining pixels are scaled down, designating them as transparent background. As will be shown in Section 4.2 and Figure 5, our method can handle complex prints and logos and outputs RGBA print images that can be overlaid onto the fabric texture.
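The sketch below is an illustrative approximation of this post-processing step only; the exact equations are not reproduced here, so the choice of per-pixel "initial value", the threshold, and the rescaling rule are assumptions that merely follow the stated criterion.

```python
import numpy as np

def rgb_to_rgba_print(rgb: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """rgb: HxWx3 float array in [0, 1] produced by the texture generator."""
    # Per-pixel value used for the opacity decision (assumed: maximum over channels).
    value = rgb.max(axis=-1)

    alpha = np.where(value > threshold,
                     1.0,                 # fully opaque print pixels
                     value / threshold)   # scaled-down alpha, treated as transparent background
    return np.concatenate([rgb, alpha[..., None]], axis=-1)
```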
3.4 PBR Materials Generation and Garment Rendering
Our FabricDiffusion model generates a normalized texture map that is tileable, flat, and rendered under a unified lighting condition, ensuring compatibility with SVBRDF material estimation methods. The goal of this work is not to develop a new material estimation method but to demonstrate the compatibility of our approach with existing ones. MatFusion [Sartor and Peers 2023] is a state-of-the-art model trained on approximately 312k SVBRDF maps, most of which are non-fabric or non-clothing materials. We fine-tune this model on our dataset of real fabric BRDF materials. Specifically, we use our normalized textures as inputs, with the material maps (\(k_d\), \(k_n\), \(k_r\), \(k_m\)) as ground truths for fine-tuning.
The generated PBR material maps can then be tiled over the garment sewing pattern. The remaining question is how to determine the tiling scale. We consider two specific strategies: (1) Proportion-aware tiling. We use image segmentation to calculate the proportion of the captured region relative to the segmented clothing, and maintain a similar ratio when tiling the generated texture onto the sewing pattern (see the sketch after this paragraph). (2) User-guided tiling. We emphasize that a fully automatic tiling method may not be optimal, as user involvement is often necessary to resolve ambiguities and provide flexibility in the fashion industry.
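The following sketch illustrates one way proportion-aware tiling could be computed; all function and variable names are hypothetical, and the linear-size heuristic is an assumption rather than the paper's exact procedure.

```python
import numpy as np

def estimate_tile_repeats(patch_mask: np.ndarray,
                          garment_mask: np.ndarray,
                          pattern_width: float,
                          garment_width: float) -> float:
    """Masks are boolean HxW arrays from image segmentation; widths share one unit (e.g., cm)."""
    # Fraction of the visible garment covered by the captured patch, by linear size.
    patch_fraction = np.sqrt(patch_mask.sum() / garment_mask.sum())
    # One texture tile should span roughly the same fraction of the garment in 3D.
    tile_size_3d = patch_fraction * garment_width
    return pattern_width / tile_size_3d   # number of repeats across the sewing pattern

repeats = estimate_tile_repeats(np.ones((64, 64), bool), np.ones((512, 512), bool),
                                pattern_width=60.0, garment_width=55.0)
```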
5 Discussion, Limitation, and Conclusion
In this paper, we introduce FabricDiffusion, a new method for transferring fabric textures and prints from a single real-world clothing image onto 3D garments of arbitrary shapes. Our method, trained entirely on synthetic rendered images, is able to generate undistorted textures and prints from in-the-wild clothing images. While it demonstrates strong generalization to real photos and diverse texture patterns, it faces challenges with certain inputs, as shown in Figure 11. Specifically, FabricDiffusion may produce errors when reconstructing non-repetitive patterns and struggles to accurately capture fine details in complex prints or logos, especially since our focus is on prints with uniform backgrounds, moderate complexity, and moderate distortion. In the future, we plan to address these challenges by enhancing texture transfer for more complex scenarios and improving performance on difficult fabric categories, such as leather. Additionally, we plan to expand our method to handle a broader range of material maps, including transmittance, to further extend its applicability.