If you are interested in doing a research project (“semester project”) or a master’s project at IVRL, you can do this through the Master Programs in Communication Systems or in Computer Science. Note that you must be enrolled at EPFL. This page lists available semester/master’s projects for the Spring 2025 semester.
For any other type of application (research assistantship, internship, etc.), please check this page.
Description:
Recent advances in neural rendering-based 3D scene reconstruction have demonstrated a strong capacity for representing visually plausible real-world scenes. However, most of these methods rely heavily on dense multi-view captures, which restricts their broader applicability.
In this project, we aim to exploit the strong priors of latent video diffusion models for synthesizing high-fidelity novel views of generic scenes from single or sparse input captures. We adopt radiance fields as the scene representation and explore the implicit 3D understanding and intra-frame attention correlations exhibited by video diffusion models as a replacement for the multi-view capture input used in prior works.
Type of work:
- MS Level: semester project/master project
- 65% research, 35% development
Prerequisite:
- Knowledge of deep learning frameworks (e.g., PyTorch or TensorFlow), image processing, and computer vision.
- Experience with 3D vision is required (e.g., courses taken, independent projects, etc.)
Supervisor:
- Dongqing Wang, [email protected]
Reference Literature:
- High-resolution image synthesis with latent diffusion models
- ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
- FSGS: Real-time few-shot view synthesis using gaussian splatting
Startup company Innoview has developed a software framework to create hidden watermarks printed on paper and to acquire and decode them by a smartphone. The acquisition by smartphone comprises many separate parametrizable parts. The project consists in improving some of the parts of the acquisition pipeline in order to optimize the recognition rate of the hidden watermarks (under Android).
Deliverables:
- Report and running prototype.
Prerequisites:
- Basic knowledge of image processing and computer vision
- Coding skills in Java (Android), C#, and/or Matlab
Level: BS or MS semester project
Supervisors:
Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09
Startup company Innoview has developed arrangements of lenslets that can be used to create document security features. The goal is to improve these security features and to optimize them by simulating the interaction of light with these 3D lenslet structures, using the Blender software.
Deliverables:
- Report and running prototype (Matlab). Blender lenslet simulations.
Prerequisites:
- knowledge of computer graphics, interaction of light with 3D mesh objects,
- basic knowledge of Blender,
- Coding skills in Matlab
Level: BS or MS semester project
Supervisors:
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09
Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44
This project aims to explore whether there is any semantic information encoded by off-the-shelf diffusion models that helps us, and other deep learning models, understand the content of an image or the relationship between images.
Diffusion models [1] have become the new paradigm for generative modeling in computer vision. Despite their success, they remain black boxes during generation. At each step, a diffusion model provides a direction, namely the score, towards the data distribution. As shown in recent work [2], the score can be decomposed into different meaningful components. The first research question is: does the score encode any semantic information about the generated image?
Moreover, there is evidence that the representations learned by diffusion models are helpful to discriminative models. For example, they can boost classification performance via knowledge distillation [3]. Furthermore, a diffusion model itself can be used as a robust classifier [4]. It follows that discriminative information can be extracted from diffusion models. The second question is then: what is this information about? Is it about object shape? Location? Texture? Or other kinds of information?
This is an exploratory project. We will try to interpret the black box inside diffusion models and dig into the semantic information they encode. Together, we will also brainstorm applications of diffusion models other than image generation. This project can be a good chance for you to develop interest and skills in scientific research.
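To make the first question concrete, one possible starting point is to extract the predicted noise (the score direction) of a pretrained, unconditional diffusion model and use it as a feature for a simple probe. The sketch below is only illustrative and not part of the project specification: the checkpoint name, the single timestep, and the pooling choice are assumptions.

```python
# Minimal probing sketch (assumptions: google/ddpm-cifar10-32 checkpoint, one
# arbitrary timestep, spatial average pooling as the feature). Illustrative only.
import torch
from diffusers import UNet2DModel, DDPMScheduler

model_id = "google/ddpm-cifar10-32"
unet = UNet2DModel.from_pretrained(model_id)
scheduler = DDPMScheduler.from_pretrained(model_id)
unet.eval()

@torch.no_grad()
def score_feature(image: torch.Tensor, t: int = 200) -> torch.Tensor:
    """Pooled 'score' feature for one image in [-1, 1], shape (1, 3, 32, 32)."""
    noise = torch.randn_like(image)
    timestep = torch.tensor([t])
    noisy = scheduler.add_noise(image, noise, timestep)  # forward diffusion to step t
    eps_pred = unet(noisy, timestep).sample              # predicted noise, i.e. the denoising direction
    return eps_pred.mean(dim=(2, 3)).flatten()           # spatially pooled feature vector

# Such features could then feed a linear probe (e.g., logistic regression)
# to test whether they carry class or other semantic information.
x = torch.rand(1, 3, 32, 32) * 2 - 1
print(score_feature(x).shape)  # torch.Size([3])
```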
References:
[1] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in neural information processing systems, 2020, 33: 6840-6851.
[2] Alldieck T, Kolotouros N, Sminchisescu C. Score Distillation Sampling with Learned Manifold Corrective[J]. arXiv preprint arXiv:2401.05293, 2024.
[3] Yang X, Wang X. Diffusion model as representation learner[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 18938-18949.
[4] Chen H, Dong Y, Shao S, et al. Your diffusion model is secretly a certifiably robust classifier[J]. arXiv preprint arXiv:2402.02316, 2024.
Deliverables: Deliverables should include code, well cleaned up and easily reproducible, as well as a written report, explaining the models, the steps taken for the project and the results.
Prerequisites: Python and PyTorch. Basic understanding of diffusion models.
Level: MS research project
Number of students: 1
Contact: Yitao Xu, [email protected]
Introduction
3D mesh generation plays a pivotal role in virtual reality, gaming, and digital content creation, but generating high-quality, detailed meshes remains a challenging task. Traditional methods often fail to capture fine-grained details or optimize computational efficiency, especially for complex, textured surfaces. This proposal seeks to enhance 3D mesh generation by incorporating frequency decomposition models, leveraging multi-resolution analysis to capture both broad structural features and intricate details.
Objective
The primary goal of this research is to develop a frequency-based decomposition model for 3D mesh generation, enabling precise control over the detail level of generated meshes. By decomposing spatial and frequency components, we aim to improve mesh quality, reduce processing times, and enhance texture and surface detail.
Methodology
- Frequency Decomposition: Apply discrete wavelet transforms (DWT) on the spatial and normal maps of 3D meshes, separating high-frequency components (surface details) from low-frequency components (broad structural shapes).
- Component-specific Optimization: Tailor the mesh generation model to optimize specific frequency components. For example, low-frequency structures can be prioritized for smooth topology, while high-frequency details can be preserved in texture-rich areas.
- Multi-level Reconstruction: Iteratively reconstruct the mesh from frequency components using an inverse wavelet transform (IDWT), allowing for customizable levels of detail depending on the desired quality.
- Evaluation and Benchmarking: Compare the proposed approach against existing methods on benchmarks, measuring structural consistency, texture fidelity, and computational efficiency.
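As a rough illustration of the decomposition and reconstruction steps (assuming mesh detail stored as a 2D displacement or normal map in UV space; the Haar wavelet, single decomposition level, and scaling factors below are arbitrary choices for this sketch, not the project's method):

```python
# Minimal sketch of the frequency-decomposition idea on a 2D detail map.
import numpy as np
import pywt

detail_map = np.random.rand(256, 256).astype(np.float32)  # placeholder displacement map

# DWT: split into a low-frequency approximation and high-frequency detail bands
cA, (cH, cV, cD) = pywt.dwt2(detail_map, "haar")

# Component-specific processing: attenuate high frequencies for a smoother base
# shape, or amplify them where texture detail should be preserved
smooth = pywt.idwt2((cA, (0.2 * cH, 0.2 * cV, 0.2 * cD)), "haar")
sharp = pywt.idwt2((cA, (1.5 * cH, 1.5 * cV, 1.5 * cD)), "haar")

print(detail_map.shape, smooth.shape, sharp.shape)  # all (256, 256)
```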
Expected Contributions
- A novel, frequency-based approach for enhancing 3D mesh quality.
- A multi-level decomposition and reconstruction framework that allows selective detail optimization.
- An efficient algorithm capable of handling complex surfaces without compromising mesh detail.
Prerequisites: Python and PyTorch. Basic understanding of diffusion models.
Level: MS research project
Number of students: 1
Contact: Yufan Ren, [email protected]
Introduction:
Images captured under low-light conditions often suffer from significant noise. Existing deep-learning-based denoising networks [1,2,3,4] typically require a large dataset of paired noisy-clean samples for effective training. However, collecting such paired data is both labor-intensive and time-consuming. This project aims to address this challenge by synthesizing paired noisy-clean data using generative models, such as diffusion models [5]. Noise in RAW images can be broadly classified into signal-dependent and signal-independent components. We plan to model these components separately and then combine them to simulate realistic noise for clean images.
Objective:
The primary goal of this project is to develop a robust method for noise synthesis that enables the generation of high-quality paired data. This synthesized data will be used to train denoising networks, allowing us to evaluate its impact on denoising performance.
Methodology:
- Collecting Dark Frames: To model the signal-independent noise component, we will capture dark frames in a controlled darkroom environment. These frames provide data on noise inherent to the sensor, such as thermal noise, read noise and banding pattern noise [6,7].
- Modeling Signal-Independent Noise: Using the collected dark frames, we will train a generative model (e.g., diffusion models) to learn the distribution of signal-independent noise.
- Simulating Signal-Dependent Noise: For the signal-dependent noise component, we will use the Poisson noise model, which effectively captures the particle-like nature of light. This approach is well-supported by existing research [6,8].
- Combining Noise Components: By merging the signal-dependent and signal-independent noise components, we will synthesize realistic noisy images from clean ones, enabling us to generate an unlimited number of paired noisy-clean samples.
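A minimal sketch of the combination step is shown below. The Gaussian read noise is only a placeholder for the learned signal-independent model trained on dark frames, and the photon count and noise level are arbitrary illustrative values.

```python
# Minimal sketch of noise synthesis on a linear RAW image in [0, 1].
import numpy as np

def synthesize_noisy(clean_raw: np.ndarray, photons_per_unit: float = 500.0,
                     read_noise_std: float = 0.01) -> np.ndarray:
    # Signal-dependent component: Poisson statistics of photon arrival
    shot = np.random.poisson(clean_raw * photons_per_unit) / photons_per_unit
    # Signal-independent component: placeholder Gaussian; the project would instead
    # sample this from a generative model trained on real dark frames
    read = np.random.normal(0.0, read_noise_std, size=clean_raw.shape)
    return np.clip(shot + read, 0.0, 1.0)

clean = np.random.rand(64, 64).astype(np.float32)
noisy = synthesize_noisy(clean)
print(noisy.shape, float(np.abs(noisy - clean).mean()))
```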
Type of work:
master semester project
65% research, 35% development
Prerequisite:
Proficiency in deep learning frameworks (e.g., PyTorch)
Familiarity with image processing and computer vision
(Optional) Prior knowledge of diffusion models is advantageous
Supervisor:
Liying Lu ([email protected])
Reference:
[1]. Abdelhamed, Abdelrahman, Stephen Lin, and Michael S. Brown. “A high-quality denoising dataset for smartphone cameras.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[2]. Anaya, Josue, and Adrian Barbu. “Renoir–a dataset for real low-light image noise reduction.” Journal of Visual Communication and Image Representation 51 (2018): 144-154.
[3]. Chen, Chen, et al. “Learning to see in the dark.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[4]. Flepp, Roman, et al. “Real-World Mobile Image Denoising Dataset with Efficient Baselines.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[5]. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.
[6]. Wei, Kaixuan, et al. “Physics-based noise modeling for extreme low-light photography.” IEEE Transactions on Pattern Analysis and Machine Intelligence 44.11 (2021): 8520-8537.
[7]. Costantini, Roberto, and Sabine Susstrunk. “Virtual sensor design.” Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications V. Vol. 5301. SPIE, 2004.
[8]. Zhang, Yi, et al. “Rethinking noise synthesis and modeling in raw denoising.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Introduction
With the advancement of Vision-Language Models (VLMs), generating coherent, contextually relevant narratives from images has become an exciting yet challenging frontier. Current models struggle to maintain narrative consistency, often introducing contradictory details or missing contextually vital elements when interpreting a sequence of images. This proposal seeks to enhance the storytelling capabilities of VLMs by introducing a self-consistency mechanism, aimed at reinforcing coherence, maintaining character continuity, and upholding narrative flow across multiple image inputs.
Objective
The main objective of this research is to develop a self-consistency framework for VLMs that enables improved narrative coherence in visual storytelling tasks. This mechanism will monitor and enforce consistency in story elements such as character traits, setting, actions, and progression, producing narratives that align closely with human expectations for logical and consistent storytelling.
Methodology
- Self-Consistency Module: Integrate a self-consistency module within the VLM architecture, which will cross-reference details across sequential images, ensuring that entities, actions, and story elements remain logically consistent. This module will evaluate consistency by tracking character attributes, scene elements, and temporal relationships, adjusting model outputs to rectify inconsistencies.
- Memory and Reference Mechanisms: Implement a memory-based mechanism to store narrative elements identified in each image, maintaining a “story memory” that captures the main characters, locations, and story arcs. This will allow the VLM to reference earlier parts of the story and avoid contradictions or omissions as it progresses.
- Training with Self-Supervision: Use self-supervised learning to fine-tune the model on datasets where story coherence is crucial. During training, the model will be penalized for introducing inconsistencies in narrative elements or disrupting logical story progression.
- Evaluation and Benchmarking: Develop a new visual storytelling benchmark focused on self-consistency, assessing narrative coherence, character consistency, and story progression. The model will be evaluated on metrics such as narrative accuracy, coherence, and alignment with human story interpretation.
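As a purely illustrative example of what such a story memory could track, the toy data structure below records entities and their attributes and flags contradictions across frames. All names and fields are hypothetical; the actual module would be integrated into the VLM architecture rather than implemented as standalone Python.

```python
# Toy illustration of the "story memory" idea: a running record of entities
# and attributes, with a simple contradiction check.
from dataclasses import dataclass, field

@dataclass
class StoryMemory:
    characters: dict = field(default_factory=dict)  # name -> {attribute: value}
    events: list = field(default_factory=list)      # ordered narrative events

    def update_character(self, name: str, attributes: dict) -> list:
        """Merge new attributes; return contradictions with earlier frames."""
        known = self.characters.setdefault(name, {})
        conflicts = [(k, known[k], v) for k, v in attributes.items()
                     if k in known and known[k] != v]
        known.update({k: v for k, v in attributes.items() if k not in known})
        return conflicts

memory = StoryMemory()
memory.update_character("dog", {"color": "brown"})
print(memory.update_character("dog", {"color": "white"}))  # [('color', 'brown', 'white')]
```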
Expected Contributions
- A novel self-consistency mechanism for VLMs to enhance coherence in multi-image storytelling tasks.
- A memory-based reference model that maintains continuity across scenes, characters, and settings.
- A new benchmark and evaluation framework for testing and measuring consistency in visual storytelling.
Prerequisites: Python and PyTorch. Basic understanding of diffusion models.
Level: MS research project
Number of students: 1
Contact: Yufan Ren, [email protected]
Description
Diffusion models have advanced the field of image generation, enabling one to create realistic and detailed images of almost any scene. However, these models depend more on rote learning of example scenes than on a true understanding of a scene and its geometry. As a consequence, generated images can feature incorrect perspectives and geometric features. By contrast, natural photographs obey specific geometric constraints. In particular, lines that are parallel in a scene converge on the photograph to a vanishing point, and all vanishing points derived from lines on parallel planes lie on the same vanishing line. When these principles are broken in generated images, the images can lack realism. In augmented or virtual reality systems, such violations can disrupt viewer immersion. Geometric accuracy is furthermore crucial for applications such as architectural visualization. Thus, improving perspective in generated images would not only enhance their aesthetic quality, but also expand the utility of generative models in professional domains. Addressing this challenge could push the boundaries of what generative models can achieve. On the other hand, current geometric artefacts could be analysed to distinguish generated images from real ones and detect deepfakes.
In this project, we aim to investigate the geometry of images, both real and generated. We will review geometry analysis methods, most notably vanishing point detection. This problem has been studied in the literature for a long time, both with geometric and algorithmic methods and with more recent learning-based tools. However, it remains to be seen which of these methods still apply when geometric correctness cannot be assumed in the first place. Furthermore, many generative models focus on generating faces, which are more difficult to analyse due to the absence of straight lines.
The developed tools will be used to assess and quantify the geometric inaccuracies produced by various diffusion models. Then, depending on the interests and early results, this project could fork into two possible topics. A first application would be to develop a deepfake detection tool based on geometry analysis. Beyond deepfake detection, one could also seek to improve diffusion model generation to ensure geometric correctness of the generated images.
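For reference, the basic computation underlying vanishing point estimation can be sketched in a few lines using homogeneous coordinates: the projections of two scene-parallel lines intersect at their vanishing point. The image coordinates below are made up for illustration.

```python
# Minimal sketch: vanishing point from two image segments that are parallel in the scene.
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersection(l1, l2):
    """Intersection of two homogeneous lines; for projections of parallel scene
    lines this is their vanishing point."""
    v = np.cross(l1, l2)
    return v[:2] / v[2] if abs(v[2]) > 1e-9 else None  # None: lines also parallel in the image

# Two segments along, e.g., the left and right edges of a road in an image
l1 = line_through((100, 600), (350, 300))
l2 = line_through((700, 600), (450, 300))
print(intersection(l1, l2))  # a single vanishing point, here (400, 240)
```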
Deliverables
The final report should contain a review, both experimental and theoretical, of existing vanishing point detection methods. It might be relevant to implement one or several of the older methods, for which code is not always available. The review should also cover the specificities of generated-image analysis detailed above. It should contain experiments assessing to what extent different diffusion models create geometric artefacts.
The report should also detail the proposed innovations on at least one of the following topics:
- Improvements made to vanishing point detection
- Deepfake detection using geometric analysis
- Improving the perspective of generated images.
Overall, the report should be well structured and present the experiments done during the project and the conclusions that can be drawn from them. New proposed methods, as well as reimplemented ones, should be explained in a reproducible manner, for example with pseudo-code. If any training is involved, training details should be comprehensively explained.
In addition to the report, a clean, well-documented code enabling the reproduction of experiments will be expected.
Prerequisites
- Strong skills in geometry
- Proficiency in writing clean code, ideally with Python and Pytorch
- Depending on the directions of this project, statistics and probability and/or a basic understanding of diffusion models (ideally both)
Type of work and number of students
Either one Master’s thesis student, or one or two (ideally two) MS research project students (semester projects)
Supervision
Quentin Bammey, [email protected]
Main references
As the first reference contains important details pertaining to this project, please read it before applying.
- Farid, Hany. “Perspective (in)consistency of paint by text.” arXiv preprint arXiv:2206.14617 (2022). https://arxiv.org/abs/2206.14617
- Desolneux, Agnes, Lionel Moisan, and Jean-Michel Morel. From gestalt theory to image analysis: a probabilistic approach. Vol. 34. Springer Science & Business Media, 2007. (Chapter 8 is of particular interest to this project, but the whole book will be relevant if focusing on deepfake detection)
- Almansa, Andrés, Agnes Desolneux, and Sébastien Vamech. “Vanishing point detection without any a priori information.” IEEE Transactions on Pattern Analysis and Machine Intelligence 25.4 (2003): 502-507.
- Upadhyay, Rishi, et al. “Enhancing diffusion models with 3d perspective geometry constraints.” ACM Transactions on Graphics (TOG) 42.6 (2023): 1-15.
- Santana-Cedrés, Daniel, et al. “Automatic correction of perspective and optical distortions.” Computer Vision and Image Understanding 161 (2017): 1-10.
- Tehrani, Mahdi Abbaspour, Aditi Majumder, and M. Gopi. “Correcting perceived perspective distortions using object specific planar transformations.” 2016 IEEE International Conference on Computational Photography (ICCP). IEEE, 2016.
- Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. “Adding conditional control to text-to-image diffusion models.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
- Tutorial on diffusion models: https://cvpr2022-tutorial-diffusion-models.github.io/
While many functions simply take an image-like signal (and parameters) as input and return a processed image, some of the newer methods may input or output multiple images (e.g., burst imaging, high dynamic range imaging, multi-sensor imaging), make use of masks to process regions in different manners, or handle metadata. Consequently, modern image processing pipelines can no longer be represented as a simple series of functions but rather as an acyclic graph of different methods.
Adding to this challenge, many of these functions are learning-based and have varied software requirements, such as different Python versions or libraries, which makes interoperability difficult.
The goal of this project is to develop image signal processing pipeline software that enables users to design or run a pipeline in a modular way, assembling it either from stock functions or from their own code. The pipeline should be representable in a way that is legible both visually (as an acyclic graph) and in a text-based format. Users should be able to add and share their own encapsulated functions into the pipeline system with as few technical changes as possible, possibly using Docker containers or the uv Python packaging system.
The pipeline should support the application of masks (computed or user-provided) in an integrated way, allowing different parts of the pipeline to process different regions based on the mask. It should include a scientific image viewer, such as vpv (https://github.com/kidanger/vpv), capable of displaying and comparing intermediate results.
If successful, this software could have applications beyond raw image processing, such as in image restoration or image forensics toolboxes.
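As a toy illustration of such a text-representable acyclic graph, the sketch below describes a pipeline as a mapping from node names to a function and its input edges, then executes it in topological order. The node names, stand-in functions, and the use of Python (rather than the final implementation language) are purely illustrative.

```python
# Tiny sketch of a text-representable processing graph executed in topological order.
import numpy as np
from graphlib import TopologicalSorter  # Python 3.9+

FUNCS = {
    "load":    lambda: np.random.rand(64, 64),                  # stand-in for a RAW loader
    "denoise": lambda img: img,                                 # stand-in for a learned method
    "mask":    lambda img: img > 0.5,                           # computed mask
    "tonemap": lambda img, m: np.where(m, img ** 0.45, img),    # mask-aware processing
}

# Text-based pipeline description (an acyclic graph): node -> (function, input nodes)
PIPELINE = {
    "raw":    ("load", []),
    "clean":  ("denoise", ["raw"]),
    "sky":    ("mask", ["clean"]),
    "output": ("tonemap", ["clean", "sky"]),
}

def run(pipeline):
    results = {}
    order = TopologicalSorter({k: set(v[1]) for k, v in pipeline.items()}).static_order()
    for node in order:
        func, inputs = pipeline[node]
        results[node] = FUNCS[func](*[results[i] for i in inputs])
    return results

print(run(PIPELINE)["output"].shape)  # (64, 64)
```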
Deliverables:
The students will deliver clean, maintainable, and well-documented code.
While most methods in the pipelines are expected to be written in Python, the developed software interfacing the code should be written in a cross-platform, robust and future-proof language such as Rust.
Developing parts of the pipeline to integrate into the software is not part of this project, but the students should demonstrate how to interface example methods.
Prerequisites:
- Proficiency in the language used to develop the interface (preferably but not necessarily Rust)
- Some proficiency in Python and deep-learning libraries is expected. While the students will not have to develop the software in Python itself, the developed software should be able to interface methods developed in Python or other languages.
- Experience writing clean, documented and maintainable code
- A basic understanding of image processing is desirable, for instance through following the Computational Photography course (CS413)
Startup company Innoview has developed arrangements of lenslets that can be used to create document security features. The goal is to improve these security features and to optimize them possibly by 3D printing of the lenslet arrangements at a large scale.
Deliverables: Report and running prototype (Matlab), Blender lenslet simulations, 3D mesh objects in the Wavefront .obj format, 3D printed mesh objects.
Prerequisites:
– knowledge of computer graphics, interaction of light with 3D mesh objects,
– basic knowledge of Blender,
– Coding skills in Matlab
Level: BS or MS semester project
Supervisors:
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27 09
Dr Romain Rossier, Innoview Sàrl, [email protected], tel: 078 664 36 44
Description:
Diffusion models generate images or data by iteratively denoising an initial noise sample, typically sampled from pure independent white Gaussian noise.
Recent works in our lab, such as Diffusion in Style [1], Signal Leak Bias [2], and Covariance Mismatch [3], as well as studies by others [4, 5], have shown that a more careful choice of the initial noise distribution can lead to generations that align better with desired outcomes.
Many academic works in the diffusion model literature are based on the HuggingFace 🤗 Diffusers library [6]. The goal of this project is to develop a modular method to initialize the generation process from specific noise distributions, and contribute it to the 🤗 Diffusers library via a Pull Request on their Github repo.
Background and details:
The HuggingFace 🤗 Diffusers library is primarily organized around three main components:
1. Pipelines [7]: These encapsulate the entire generation process. For example, the StableDiffusionPipeline integrates a denoising model (typically a U-Net [8, 9]) and a scheduler (see next point).
2. Schedulers [10] (e.g., DDPM [11, 12], DDIM [13, 14]): These define the step-by-step generation procedure, starting from the initial noise sample and using the predicted denoising directions.
3. In addition, Loaders [15] (e.g., LoRA [16, 17], Textual Inversion [18, 19], IP-Adapters [20, 21]) are modular components used to modify pipeline behavior. For instance, LoRA adjusts the denoising model weights, Textual Inversion adds personalization capabilities, and IP-Adapters introduce image conditioning. One way to allow users to start the generation process from a specified noise distribution (rather than the default white Gaussian noise) would be to implement an “Initial Noise Sampler” Loader. The proposed loader should accept arguments such as a method to define or compute the specific noise distribution (from some statistics), and a repository of precomputed statistics or a list of images for on-the-fly computation.
To illustrate the use of this loader, we will reimplement together the methods from Signal Leak Bias [2] and DDIM Inversion [22].
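For intuition, existing pipelines already accept user-provided initial latents via the `latents` argument of their `__call__` method; this is the mechanism such a loader could encapsulate and make systematic. The sketch below is only illustrative: the checkpoint name and the per-channel statistics are assumptions, not part of the proposed loader.

```python
# Minimal sketch: start generation from a non-default initial noise distribution
# by passing precomputed latents to an existing pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Example statistics of a target initial-noise distribution (e.g., estimated from
# a set of style images): a per-channel mean and standard deviation
mean = torch.tensor([0.2, 0.0, -0.1, 0.0]).view(1, 4, 1, 1)
std = torch.tensor([1.0, 0.9, 1.1, 1.0]).view(1, 4, 1, 1)

# Sample latents from N(mean, std^2) instead of the default standard Gaussian
latents = (torch.randn(1, 4, 64, 64) * std + mean).to("cuda", torch.float16)
image = pipe("a watercolor landscape", latents=latents).images[0]
image.save("sample.png")
```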
References:
[1] Everaert, Martin Nicolas, et al. “Diffusion in style.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Everaert, Martin Nicolas, et al. “Exploiting the signal-leak bias in diffusion models.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.
[3] Everaert, Martin Nicolas, et al. “Covariance Mismatch in Diffusion Models.” Infoscience preprint https://infoscience.epfl.ch/handle/20.500.14299/242173 . 2024.
[4] Zhang, Jeffrey, et al. “Preserving Image Properties Through Initializations in Diffusion Models.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024
[5] Wu, Tianxing, et al. “Freeinit: Bridging initialization gap in video diffusion models.” European Conference on Computer Vision. Springer, Cham, 2025.
[6] von Platen, P., et al. “Diffusers: State-of-the-art diffusion models.” URL: https://github.com/huggingface/diffusers
[7] https://huggingface.co/docs/diffusers/en/api/pipelines/overview
[8] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015
[9] https://huggingface.co/docs/diffusers/en/api/models/unet2d
[10] https://huggingface.co/docs/diffusers/en/api/schedulers/overview
[11] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.
[12] https://huggingface.co/docs/diffusers/en/api/schedulers/ddpm
[13] Song, Jiaming, Chenlin Meng, and Stefano Ermon. “Denoising Diffusion Implicit Models.” International Conference on Learning Representations, 2021.
[14] https://huggingface.co/docs/diffusers/en/api/schedulers/ddim
[15] https://huggingface.co/docs/diffusers/main/en/api/loaders
[16] Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” International Conference on Learning Representations, 2022.
[17] https://huggingface.co/docs/diffusers/en/api/loaders/lora
[18] Gal, Rinon, et al. “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion.” The Eleventh International Conference on Learning Representations, 2023.
[19] https://huggingface.co/docs/diffusers/en/api/loaders/textual_inversion
[20] Ye, Hu, et al. “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.” arXiv preprint arXiv:2308.06721 (2023).
[21] https://huggingface.co/docs/diffusers/en/api/loaders/ip_adapter
[22] https://huggingface.co/learn/diffusion-course/en/unit4/2
Deliverables: Deliverables should include documentation and code, well cleaned up and documented, as well as a written report, explaining the implementation and the steps taken for the project.
Prerequisites: Python, ideally familiarity with PyTorch. Experience with HuggingFace/Diffusers is a strong plus.
Level: Ideally BS research project (semester project), potentially MS research project (semester project)
Number of students: 1
Supervisor: Martin Nicolas Everaert (martin.everaert [at] epfl.ch)
Description
Recent work by Kaplan et al. (2020) on “Scaling Laws for Neural Language Models” [1] demonstrated striking power-law relationships between model size, dataset size, and performance for large language models. These insights highlight that model improvements follow consistent trends as we scale up compute and parameters.
Implicit Neural Representations (INRs), such as neural fields, are used to represent continuous signals (images, shapes, videos, etc.). Unlike traditional discrete sampling, INRs parameterize signals as neural networks that map coordinates (e.g., spatial or temporal) to signal values (e.g., color or density). Examples include Neural Radiance Fields (NeRF) [2] for 3D scenes and SIREN [3] for 2D images. However, the relationship between model size, training compute, signal complexity, and final reconstruction quality for INRs remains largely unexplored.
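To make the setting concrete, a minimal sketch of fitting a SIREN-style coordinate network to a single image and reporting parameter count and PSNR is given below. The width, depth, frequency w0, and training schedule are illustrative, and SIREN's specialized weight initialization is omitted for brevity; a scaling-law study would sweep these against signal complexity and compute budget.

```python
# Minimal sketch: fit a small sine-activated MLP to one image, report params and PSNR.
import torch
import torch.nn as nn

class Sine(nn.Module):
    def __init__(self, w0=30.0):
        super().__init__()
        self.w0 = w0
    def forward(self, x):
        return torch.sin(self.w0 * x)

def siren(width=64, depth=3):
    layers = [nn.Linear(2, width), Sine()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), Sine()]
    return nn.Sequential(*layers, nn.Linear(width, 3))

# Coordinates in [-1, 1]^2 and a random target "image" (replace with a real one)
H = W = 32
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
target = torch.rand(H * W, 3)

model = siren()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(500):
    loss = ((model(coords) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

psnr = -10 * torch.log10(((model(coords) - target) ** 2).mean())
print(f"params={sum(p.numel() for p in model.parameters())}, PSNR={psnr.item():.2f} dB")
```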
Type of Work
- MS Level: semester project/master project
- 65% research, 35% development
Goal
This project aims to investigate and characterize “scaling laws” for implicit neural representations. We will attempt to answer questions such as:
- How does reconstruction quality (e.g., PSNR, SSIM, or other error metrics) scale with network parameters for a given signal complexity?
- How does the signal’s intrinsic complexity (measured by entropy, Fourier spectrum, fractal dimension, etc.) affect the scaling curves?
- What is the optimal balance between model size and number of samples (compute budget) for training INRs on different classes of signals?
Prerequisites
- Proficiency in Python and experience with PyTorch.
- Familiarity with neural network architectures and basic principles of machine learning.
- Some background in signal processing (Fourier transforms, entropy measures, etc.) is beneficial.
Supervisor
Zhuoqian (Zack) Yang, [email protected]
References
[1] Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).
[2] Mildenhall, Ben, et al. “NeRF: Representing scenes as neural radiance fields for view synthesis.” arXiv preprint arXiv:2003.08934 (2020).
[3] Sitzmann, Vincent, et al. “Implicit neural representations with periodic activation functions.” Advances in neural information processing systems 33 (2020): 7462-7473.
Description
Film photography’s distinctive “look” is partly due to its ability to record and compress light information of high dynamic range, especially in the highlights, without clipping [1]. By preserving and compressing subtle gradations in highlight and shadow areas, film naturally reveals rich color nuances, which is a key contributor to its signature aesthetic.
Digital film emulation has become increasingly popular, but most applications (e.g., Dazz, Dehancer, VSCO) assume the availability of high-quality captures while in practice working off images captured by relatively limited consumer camera sensors. These images tend to have a low dynamic range and lose highlight and shadow detail that film retains, making it impossible for current emulators to reproduce nuanced tones via compression.
This project aims to explore an approach that recovers or generates a high dynamic range RAW-equivalent image from the limited RGB input. By doing so, we can feed a simulated sensor output with higher bit-depth and more accurate color response into the film simulation pipeline, ensuring that the final result retains the highlight compression and color nuances that define the “film look.”
Type of work:
- BS / MS Level: semester project/master project
- 65% development, 35% research
Goal
The goal of this semester project is to build a framework that recovers the lost highlight and shadow detail from standard RGB images, effectively synthesizing a RAW-like image, and then applies physically-inspired film simulation techniques on top of this enhanced data. We will investigate state-of-the-art RAW synthesis methods (e.g., diffusion-based [2] and U-Net-based [3-5]) and compare them with alternative approaches. The final framework should enable a more faithful reproduction of film’s high dynamic range properties, highlight compression, and color “feel.”
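As a toy illustration of the highlight-compression behaviour that recovered high-dynamic-range data would enable, the sketch below applies a smooth S-shaped characteristic curve to linear, RAW-like values. The curve form and its parameters are arbitrary stand-ins, not the physically-inspired film model targeted by the project.

```python
# Toy sketch of the highlight-compression step only.
import numpy as np

def film_curve(linear, mid_gray=0.18, latitude=2.5):
    """Map linear scene exposure (may exceed 1.0) to display values in [0, 1]."""
    log_exposure = np.log2(np.maximum(linear, 1e-6) / mid_gray)
    return 1.0 / (1.0 + np.exp(-log_exposure / latitude * 2.0))  # smooth shoulder and toe

hdr_like = np.array([0.01, 0.18, 1.0, 4.0, 16.0])  # recovered RAW-like values
print(np.round(film_curve(hdr_like), 3))            # highlights roll off instead of clipping
```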
Prerequisites
- Proficiency in Python and experience with PyTorch.
- Familiarity with digital imaging pipelines and RAW image formats.
- Interest in photography and knowledge of film characteristics.
Supervisor
Zhuoqian (Zack) Yang, [email protected]
References
[1] Attridge, G. G. “The characteristic curve.” The Journal of photographic science 39.2 (1991): 55-62.
[2] Reinders, Christoph, et al. “RAW-Diffusion: RGB-Guided Diffusion Models for High-Fidelity RAW Image Generation.” arXiv preprint arXiv:2411.13150 (2024).
[3] Brooks, Tim, et al. “Unprocessing images for learned raw denoising.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[4] Zamir, Syed Waqas, et al. “Cycleisp: Real image restoration via improved data synthesis.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
[5]. Kim, Woohyeok, et al. “ParamISP: Learned forward and inverse ISPs using camera parameters.” arXiv preprint arXiv:2312.13313 (2023).