Chong Zeng · Yue Dong · Pieter Peers · Hongzhi Wu · Xin Tong
Project Page | arXiv | Paper | Model | Official Code
RenderFormer is a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects, without per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation in which a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays into the corresponding pixel values, guided by the triangle sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
- System: The code is tested on Linux, macOS, and Windows.
- Hardware: The code has been tested on both NVIDIA CUDA GPUs and Apple Metal GPUs. The minimum GPU memory requirement is 8 GB.
First, set up an environment with PyTorch 2.0+. CUDA users can optionally install Flash Attention from https://github.com/Dao-AILab/flash-attention.
The remaining dependencies can be installed with:
git clone https://github.com/microsoft/renderformer
cd renderformer
pip install -r requirements.txt
python3 -c "import imageio; imageio.plugins.freeimage.download()" # Needed for HDR image IO
| Model | Params | Link | Model ID |
|---|---|---|---|
| RenderFormer-V1-Base | 205M | Hugging Face | microsoft/renderformer-v1-base |
| RenderFormer-V1.1-Large | 483M | Hugging Face | microsoft/renderformer-v1.1-swin-large |
Note on the released models
We found a shader bug in the training data used for the original submission. We re-trained the models with the corrected shader and released the new models, so model performance and output may differ from the results reported in the paper.
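The Model ID column is what you pass to RenderFormerRenderingPipeline.from_pretrained (used in the batch rendering example later in this README) or to the --model_id flag of the inference scripts. For example:

```python
from renderformer import RenderFormerRenderingPipeline

# Load either released checkpoint by its Hugging Face Model ID (or a local path)
pipeline = RenderFormerRenderingPipeline.from_pretrained("microsoft/renderformer-v1-base")
```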
We provide example scene config JSON files in examples. To render a scene, first convert a scene config JSON file into our HDF5 scene format:
python3 scene_processor/convert_scene.py examples/cbox.json --output_h5_path tmp/cbox/cbox.h5
Then render the converted scene:

python3 infer.py --h5_file tmp/cbox/cbox.h5 --output_dir output/cbox/
You should now see output/cbox/cbox_view_0.exr and output/cbox/cbox_view_0.png in your output folder. The .exr file is the linear HDR output from RenderFormer, and the .png file is the LDR version of the rendered image. You can enable different tone mappers through --tone_mapper to achieve better visual results.
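For example, to re-render the Cornell box with the AgX tone mapper (one of the choices listed in the usage below):

```bash
python3 infer.py --h5_file tmp/cbox/cbox.h5 --output_dir output/cbox/ --tone_mapper agx
```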
The script will automatically fall back to PyTorch scaled dot product attention (SDPA) if Flash Attention is not found on the system. We also provide an environment variable ATTN_IMPL for you to choose which attention implementation to use:
# Use SDPA intentionally
ATTN_IMPL=sdpa python3 infer.py --h5_file tmp/cbox/cbox.h5 --output_dir output/cbox/
Please check the image render shell script for more examples.
--h5_file H5_FILE Path to the input H5 file
--model_id MODEL_ID Model ID on Hugging Face or local path
--precision {bf16,fp16,fp32}
Precision for inference (Default: fp16)
--resolution RESOLUTION
Resolution for inference (Default: 512)
--output_dir OUTPUT_DIR
Output directory (Default: same as input H5 file)
--tone_mapper {none,agx,filmic,pbr_neutral}
Tone mapper for inference (Default: none)
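A fuller invocation combining several of these flags (the specific values here are illustrative choices; the model ID comes from the table above):

```bash
python3 infer.py \
    --h5_file tmp/cbox/cbox.h5 \
    --model_id microsoft/renderformer-v1.1-swin-large \
    --precision bf16 \
    --resolution 512 \
    --output_dir output/cbox/ \
    --tone_mapper filmic
```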
You can perform batch rendering with RenderFormerRenderingPipeline by providing a batch of input scenes and rendering camera parameters.
Minimal example (without meaningful inputs, just for testing):
import torch
from renderformer import RenderFormerRenderingPipeline
pipeline = RenderFormerRenderingPipeline.from_pretrained("microsoft/renderformer-v1.1-swin-large")
device = torch.device('cuda')
pipeline.to(device)
BATCH_SIZE = 2
NUM_TRIANGLES = 1024
TEX_PATCH_SIZE = 32
NUM_VIEWS = 4
triangles = torch.randn((BATCH_SIZE, NUM_TRIANGLES, 3, 3), device=device)  # triangle vertex positions
texture = torch.randn((BATCH_SIZE, NUM_TRIANGLES, 13, TEX_PATCH_SIZE, TEX_PATCH_SIZE), device=device)  # per-triangle texture patches (reflectance attributes)
mask = torch.ones((BATCH_SIZE, NUM_TRIANGLES), dtype=torch.bool, device=device)  # valid-triangle mask
vn = torch.randn((BATCH_SIZE, NUM_TRIANGLES, 3, 3), device=device)  # vertex normals
c2w = torch.randn((BATCH_SIZE, NUM_VIEWS, 4, 4), device=device)  # camera-to-world matrices, one per view
fov = torch.randn((BATCH_SIZE, NUM_VIEWS, 1), device=device)  # field of view, one per view
rendered_imgs = pipeline(
triangles=triangles,
texture=texture,
mask=mask,
vn=vn,
c2w=c2w,
fov=fov,
resolution=512,
torch_dtype=torch.float16,
)
print("Inference completed. Rendered Linear HDR images shape:", rendered_imgs.shape)
# Inference completed. Rendered Linear HDR images shape: torch.Size([2, 4, 512, 512, 3])
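rendered_imgs is returned as a tensor of linear HDR values. If you want to save these outputs yourself rather than going through infer.py, one option is imageio with the freeimage plugin downloaded during setup; the snippet below is a sketch, not part of the official pipeline:

```python
import imageio
import numpy as np

# Save the first view of the first batch element as a linear HDR .exr file.
# (Assumes the freeimage plugin from the setup step is available for HDR IO.)
img = rendered_imgs[0, 0].float().cpu().numpy().astype(np.float32)
imageio.imwrite("batch_view_0.exr", img)
```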
Please check infer.py and rendering_pipeline.py for detailed usage.
We put example video input data on Hugging Face. You can download and unzip them with this script.
python3 batch_infer.py --h5_folder renderformer-video-data/submission-videos/cbox-roughness/ --output_dir output/videos/cbox-roughness
Please check the video render shell script for more examples.
--h5_folder H5_FOLDER
Path to the folder containing input H5 files
--model_id MODEL_ID Model ID on Hugging Face or local path
--precision {bf16,fp16,fp32}
Precision for inference
--resolution RESOLUTION
Resolution for inference
--batch_size BATCH_SIZE
Batch size for inference
--padding_length PADDING_LENGTH
Padding length for inference
--num_workers NUM_WORKERS
Number of workers for data loading
--output_dir OUTPUT_DIR
Output directory for rendered images (default: same as input folder)
--save_video Merge rendered images into a video at video.mp4.
--tone_mapper {none,agx,filmic,pbr_neutral}
Tone mapper for inference
RenderFormer uses a JSON-based scene description format that defines the geometry, materials, lighting, and camera setup for your scene. The scene configuration is defined using a hierarchical structure with the following key components:
- scene_name: A descriptive name for your scene
- version: The version of the scene description format (currently "1.0")
- objects: A dictionary of objects in the scene, including both geometry and lighting
- cameras: A list of camera configurations for rendering
Each object in the scene requires:
- mesh_path: Path to the .obj mesh file
- material: Material properties including:
  - diffuse: RGB diffuse color [r, g, b]
  - specular: RGB specular color [r, g, b] (we currently only support white specular, and diffuse + specular should be no larger than 1.0)
  - roughness: Surface roughness (0.01 to 1.0)
  - emissive: RGB emission color [r, g, b] (we currently only support white emission, and only on light source triangles)
  - smooth_shading: Whether to use smooth shading on this object
  - rand_tri_diffuse_seed: Optional seed for random triangle coloring; if not set, the diffuse color is used directly
  - random_diffuse_max: Maximum value for random diffuse color assignment (max diffuse color + specular color should be no larger than 1.0)
  - random_diffuse_type: Type of random diffuse color assignment, either per triangle or per shading group
- transform: Object transformation including:
  - translation: [x, y, z] position
  - rotation: [x, y, z] rotation in degrees
  - scale: [x, y, z] scale factors
  - normalize: Whether to normalize the object to the unit sphere
- remesh: Whether to remesh the object
- remesh_target_face_num: Target face count of the remeshed object
Each camera requires:
- position: [x, y, z] camera position
- look_at: [x, y, z] target point
- up: [x, y, z] up vector
- fov: Field of view in degrees
We recommend starting from examples/init-template.json and modifying it to your needs. For more complex examples, refer to the scene configurations in the examples directory. A minimal sketch of the format is also shown below.
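As an illustration only, here is a sketch of a scene config assembled from the fields described above; the field names come from this README, but verify the exact structure against examples/init-template.json:

```python
import json

# Hypothetical scene config sketch -- check the nesting against
# examples/init-template.json before use.
scene = {
    "scene_name": "my_scene",
    "version": "1.0",
    "objects": {
        "bunny": {
            "mesh_path": "path/to/bunny.obj",
            "material": {
                "diffuse": [0.8, 0.2, 0.2],
                "specular": [0.1, 0.1, 0.1],
                "roughness": 0.5,
                "smooth_shading": True,
            },
            "transform": {
                "translation": [0.0, 0.0, 0.0],
                "rotation": [0.0, 0.0, 0.0],
                "scale": [1.0, 1.0, 1.0],
                "normalize": True,
            },
        }
    },
    # Light-source objects (emissive triangles) are omitted for brevity;
    # see the examples directory for complete scenes.
    "cameras": [
        {
            "position": [0.0, -1.8, 0.0],  # distance to scene center within the training range [1.5, 2.0]
            "look_at": [0.0, 0.0, 0.0],
            "up": [0.0, 0.0, 1.0],
            "fov": 45.0,                   # degrees, training range [30, 60]
        }
    ],
}

with open("my_scene.json", "w") as f:
    json.dump(scene, f, indent=2)
```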
The HDF5 file contains the following fields:
- triangles: [N, 3, 3] array of triangle vertices
- texture: [N, 13, 32, 32] array of texture patches
- vn: [N, 3, 3] array of vertex normals
- c2w: [N, 4, 4] array of camera-to-world matrices
- fov: [N] array of field-of-view values
We use the same camera coordinate system as Blender (-Z = view direction, +Y = up, +X = right); be mindful of this when implementing your own HDF5 converter.
Please refer to scene_processor/to_h5.py for more details.
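If you do write your own converter, a minimal sketch might look like the following. The dataset names follow the field list above, but the dtypes, units, and the per-view layout of c2w/fov are assumptions to double-check against scene_processor/to_h5.py:

```python
import h5py
import numpy as np

num_tris, num_views = 1024, 2

with h5py.File("my_scene.h5", "w") as f:
    # Per-triangle data
    f.create_dataset("triangles", data=np.zeros((num_tris, 3, 3), dtype=np.float32))      # vertex positions
    f.create_dataset("texture", data=np.zeros((num_tris, 13, 32, 32), dtype=np.float32))  # texture patches
    f.create_dataset("vn", data=np.zeros((num_tris, 3, 3), dtype=np.float32))             # vertex normals
    # Per-view data (Blender convention: -Z view direction, +Y up, +X right)
    f.create_dataset("c2w", data=np.tile(np.eye(4, dtype=np.float32), (num_views, 1, 1)))  # camera-to-world
    f.create_dataset("fov", data=np.full((num_views,), 45.0, dtype=np.float32))            # field of view (assumed degrees)
```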
We provide a simple remeshing tool in scene_processor/remesh.py. You can use it to remesh your objects before putting them into the scene. We also provide fields in the scene config JSON file (remesh and remesh_target_face_num) that let you remesh an object during the scene conversion process.
python3 scene_processor/remesh.py --input path/to/your/high_res_mesh.obj --output remeshed_object.obj --target_face_num 1024
We provide a Blender Extension to simplify the process of setting up a scene for RenderFormer. Please refer to the Blender Extension for more details.
- Always start from examples/init-template.json.
- Please keep the scene within the training data range; extrapolation can work, but is not guaranteed:
  - Camera distance to scene center in [1.5, 2.0], fov in [30, 60] degrees
  - Scene bounding box in [-0.5, 0.5] in x, y, z
  - Light sources: up to 8 triangles (please use the triangle mesh at examples/templates/lighting/tri.obj), each with scale in [2.0, 2.5], distance to scene center in [2.1, 2.7], and summed emission values in [2500, 5000]
  - Total number of triangles: training data covers up to 4096 triangles, but extending to 8192 triangles during inference usually still works.
- All training objects are watertight and simplified with QSlim. Uniform triangle sizes are preferred. If your object does not work, try remeshing it with our provided script or other remeshing tools.
We borrowed some code from the following repositories. We thank the authors for their contributions.
In addition to the 3D model from Objaverse, we express our profound appreciation to the contributors of the 3D models that we used in the examples.
- Shader Ball: by Wenzel Jakob from Mitsuba Gallery
- Stanford Bunny & Lucy: from The Stanford 3D Scanning Repository
- Cornell Box: from Cornell Box Data, Cornell University Program of Computer Graphics
- Utah Teapot: from Utah Model Repository
- Veach MIS: From Eric Veach and Leonidas J. Guibas. 1995. Optimally combining sampling techniques for Monte Carlo rendering
- Spot: By Keenan Crane from Keenan's 3D Model Repository
- Klein Bottle: By Fausto Javier Da Rosa
- Constant Width: Original mesh from Small volume bodies of constant width. Derived mesh from Keenan's 3D Model Repository
- Jewelry: By elbenZ
- Banana, Easter Basket, Water Bottle, Bronco, Heart: By Microsoft
- Lowpoly Fox: By Vlad Zaichyk
- Lowpoly Crystals: By Mongze
- Bowling Pin: By SINOFWRATH
- Cube Cascade, Marching Cubes: By Tycho Magnetic Anomaly
- Dancing Crab: By Bohdan Lvov
- Magical Gyroscope: By reddification
- Capoeira Cube: By mortaleiros
- P.U.C. Security Bot: By Gouhadouken
The RenderFormer model and the majority of the code are licensed under the MIT License. The following submodules may have different licenses:
- renderformer-liger-kernel: Redistributed Liger Kernel for RenderFormer integration. It's derived from original Liger Kernel and licensed under the BSD 2-Clause "Simplified" License.
- simple-ocio: We use this tool to simplify OpenColorIO usage for tone-mapping. This package redistributes the complete Blender Color Management directory. The full license text is available at ocio-license.txt and the headers of each configuration file. The package itself is still licensed under the MIT License.
If you find this work helpful, please cite our paper:
@inproceedings{zeng2025renderformer,
title = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
author = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
booktitle = {ACM SIGGRAPH 2025 Conference Papers},
year = {2025}
}