
MemoVis: A GenAI-Powered Tool for Creating Companion Reference Images for 3D Design Feedback

Published: 10 November 2024

Abstract

Providing asynchronous feedback is a critical step in the 3D design workflow. A common approach to providing feedback is to pair textual comments with companion reference images, which help illustrate the gist of the text. Ideally, feedback providers should possess 3D and image editing skills to create reference images that can effectively describe what they have in mind. However, they often lack such skills, so they have to resort to sketches or online images that might not match the current 3D design well. To address this, we introduce MemoVis, a text editor interface that assists feedback providers in creating reference images with generative AI driven by the feedback comments. First, a novel real-time viewpoint suggestion feature, based on a vision-language foundation model, helps feedback providers anchor a comment with a camera viewpoint. Second, given a camera viewpoint, we introduce three types of image modifiers based on pre-trained 2D generative models to turn a text comment into an updated version of the 3D scene from that viewpoint. We conducted a within-subjects study with \(14\) feedback providers, demonstrating the effectiveness of MemoVis. The quality and explicitness of the companion images were evaluated by another eight participants with prior 3D design experience.

1 Introduction

Providing asynchronous feedback is a critical step in the 3D design workflow. Exchanging feedback allows all stakeholders, such as collaborators and clients, to review the design collaboratively, highlight issues, and propose improvements [27, 39, 92]. However, creating effective and actionable feedback is often challenging for 3D design. Feedback providers typically convey suggestions about the 3D design via text. This practice is similar to design review in many 2D domains, such as videos [71, 78] and documents [100]. However, browsing 3D models requires users to navigate information using a viewing camera with six Degrees of Freedom (6DoF). Typically, it is more tedious for users to convey where and how changes should be applied on a 3D canvas than to write feedback on 2D media. Additionally, describing certain types of design changes, such as material and texture, can be challenging without extensive 3D knowledge. These issues make it especially challenging for individuals with different skill levels to collaborate effectively, such as when a designer and a client need to exchange feedback.
To make feedback more instructive, attaching reference images to the textual feedback comments is a common approach for feedback providers to illustrate the text and externalize their critiques [18, 39]. The process of creating reference images can also inspire feedback providers to find alternative design problems and generate more insights [45, 48]. Ideally, feedback providers should possess basic 3D and image editing skills to create reference images that can effectively describe their thoughts. But 3D design is time-consuming, and some users might not have the proficiency to express their thoughts using 3D software [27]. As a result, they often resort to sketches or images found on the internet. For example, when designing the appearance of a 3D bedroom, it might be quicker for a client to search for bedroom images on websites such as Pinterest than to use 3D software like Blender to adjust the model's geometry, materials, and textures. However, searching for reference images on the internet can also be challenging. Finding images that precisely match the viewpoint and 3D structure of the current 3D model is time-consuming and often nearly impossible. This disparity can lead to misunderstandings as designers try to interpret the feedback. Moreover, online search engines often yield images in similar styles due to various biases in indexing algorithms, potentially biasing the feedback and influencing it in unintended ways [77].
Recent Generative Artificial Intelligence (GenAI) and Vision-Language Foundation Models (VLFMs) offer unique opportunities to address these challenges. Text-to-image generation workflows enabled by generative models have been deployed in commercial tools (e.g., Firefly [10] and Photoshop [11, 14]) and are being actively studied in Human-Computer Interaction, with applications in both 3D [66] and 2D [26, 89] design ideation. However, it remains unclear how these text-to-image GenAI models may support reference image creation in the 3D design review workflow. Editing tools like Photoshop [11, 14] can produce high-fidelity reference images, but the process is time-consuming and requires feedback providers to possess professional image editing skills. While it is also feasible for feedback providers to generate reference images using existing text-to-image GenAI tools (e.g., [10, 66, 75]), the synthesized images are usually not contextualized on the current design, making it hard for the designers to interpret the feedback. Our formative studies indicate that additional controls for 3D scene navigation and alignment of the generated output with the 3D scene structure are crucial for reviewers to create effective reference images.
To explore how text-to-image GenAI can be integrated into the 3D design review workflow, we introduce MemoVis, a novel browser-based text editor interface for feedback providers to easily create reference images for textual comments. Figure 1 presents an overview of the workflow. MemoVis enables novice users to write textual feedback, quickly identify relevant camera views of the 3D scene, and use text prompts to construct reference images. Importantly, these images are aligned with the chosen 3D view, enabling feedback providers to more effectively illustrate their intentions in their written comments.
Fig. 1.
Fig. 1. MemoVis is a browser-based text editor interface that assists feedback providers in creating companion reference images for 3D design. Feedback providers can (a) explore the 3D model in a 3D content viewer and (b) type the textual feedback comments in a rich-text editor interface; (c) a real-time viewpoint suggestion anchors the textual design comments with a camera viewpoint; (d) a textual prompt can be used to guide the AI-generated images. Three types of image modifiers enable feedback providers to efficiently compose companion visual reference images; (e) the visualized reference image reflects the gist of the textual design comments and can be used as part of a memo for professional designers to instantiate the feedback.
MemoVis realizes this by introducing a real-time viewpoint suggestion feature to help feedback providers anchor a textual comment with possible associated camera viewpoints. MemoVis also enables users to generate images using text prompts based on the depth map of the chosen camera viewpoint. To provide users with more control over the generation process, MemoVis offers three distinct modifier tools that complement text prompt input. The text \(+\) scribble modifier allows users to focus the generation on a specific object in the 3D scene. The grab’n go modifier assists in composing objects from the generated images into the current 3D view. Lastly, the text \(+\) paint modifier utilizes inpainting [84] to aid users in making minor adjustments and fine-tuning the generated output.
To understand the design considerations (DCs) of MemoVis, we conducted two formative studies by interviewing two professional designers and analyzing real-world online 3D design feedback. With three key considerations identified from the formative studies, we then prototyped MemoVis by leveraging recent pre-trained GenAI models [52, 98, 106, 108]. Through a within-subjects study with \(14\) participants, we demonstrated the feasibility and effectiveness of using MemoVis to support an easy workflow for visualizing 3D design feedback. A second survey study with eight participants with prior 3D design experience demonstrated the straightforwardness and explicitness of using reference images created by MemoVis to convey 3D design feedback.
In summary, our research around MemoVis explores a potential path and solution to integrate text-to-image GenAI into the 3D design review workflow. Our key contributions include:
Formative studies, exploring (i) the integration of GenAI into the companion image creation process for 3D design feedback and (ii) the characteristics of real-world 3D design feedback.
Design of MemoVis, a browser-based text editor interface with a viewpoint suggestion feature and three image modifiers that assist feedback providers in creating and visualizing 3D design feedback.
User studies, analyzing the user experience with MemoVis, as well as the usefulness and explicitness of the reference images created by MemoVis to convey 3D design feedback.

2 Related Work

This section discusses the key related work that informed the design of MemoVis. We first look into prior research that explored supporting tools for creating design feedback (Section 2.1). We then introduce multiple GenAI models and VLFMs related to MemoVis (Section 2.2) and discuss how they could be integrated into 2D and 3D design workflows (Section 2.3).

2.1 Creating Effective Design Feedback

Designers often receive feedback along the progress of their design from their collaborators, managers, clients, or even online communities [57]. Feedback allows designers to gather external perspectives to avoid mistakes, improve the design, and examine whether the design meets the objectives [39]. Feedback can also help designers foster new insights and creativity [72].
In nearly all creative design processes, asynchronous feedback serves as a critical medium to communicate ideas between designers and various stakeholders [92]. The modality of feedback can affect communication efficiency [64] and feedback interpretability [33]. While creating text-only feedback is an easy and widely adopted method, using visual references is often more effective. Prior works have explored the integration of reference images with textual feedback. Herring et al. [44] demonstrated the importance of using reference images during client-designer communications, creating a more effective way to enable designers to internalize client needs. Paragon [48] argued that visual examples could encourage feedback providers to create more specific, actionable, and novel critiques. This finding guides an interface design that allows feedback providers to browse examples for visual poster design using metadata. Robb et al. [83] showed a visual summarization system that could crowd-source a small set of representative images as feedback, which could then be consumed at a glance by designers.
Similar to 2D visual design, systems for creating feedback are also needed for 3D design. In sectors like the manufacturing industry, the capability to provide feedback and comments is becoming an essential feature in today's 3D Design For Manufacturability (DFM) tools. Much existing research and many 3D software packages provide features for feedback providers to create textual comments and draw annotations in a specific viewpoint, or on the 3D model directly. ModelCraft demonstrates how freehand annotations and edits can help in the ideation phase during the early 3D design stage [90]. Professional DFM software, e.g., Autodesk Viewer, allows adding textual comments for a specific viewpoint and markups on specific parts of 3D assets [16]. Browser-based tools such as TinkerCAD [93] also enable feedback providers with less professional 3D skills to add textual comments. As Virtual Reality (VR) headsets advance, recent research has also delved into feedback creation within the context of the VR-based 3D design review workflow [101].
While textual feedback is simple to create and might be useful in many cases, Bernawal et al. [18] showed that graphical feedback could significantly improve performance and reduce mental workload for design engineers compared to textual and no feedback in manufacturing industry settings. However, creating reference images is tedious and challenging. For certain designs with less-common perspectives, searching for reference images that match the specific viewpoints well is time-consuming and sometimes nearly impossible. While rapid sketching or using image editing tools might work, such a process is tedious and requires feedback providers to have professional image editing and 3D skills—an impractical expectation for many stakeholders such as managers and clients. MemoVis shows a novel approach that aims to leverage the power of recent GenAI to help feedback providers easily create companion reference images for 3D design feedback.

2.2 Generative AI (GenAI) and VLFMs

Language and vision serve as two primary channels for information [102]. Recent AI research has advanced the capabilities of visual and text-based foundation models, many of which have been successfully integrated into commercially available products. Since the introduction of the Generative Adversarial Network (GAN) [41] and Deep Dream [1, 91], many recent GenAI models, such as Stable Diffusion [84], Midjourney [4], and DALL\(\cdot\)E \(3\) [76], can understand textual prompts and generate images, using models pre-trained on billions of text-image pairs. Beyond text-to-image generation, several VLFMs attempt to bring natural language processing innovations into the field of computer vision. The Contrastive Language-Image Pre-training (CLIP) model demonstrates a “zero-shot” capability to use text for various image classification tasks without the need to directly optimize the model for a specific benchmark [3, 81]. The CLIP model also represents text and image embeddings in the same space, allowing for direct comparisons between the two modalities [3]. The Bootstrapping Language-Image Pre-training (BLIP) [61, 62] model is another example of a pre-trained VLFM that can perform a wide variety of multi-modal tasks, such as visual question answering and image captioning. Grounding DINO [66] shows how the transformer-based detector DINO can be integrated into grounded pre-training to detect arbitrary objects with human input. Using Grounding DINO [66] and the Segment Anything Model (SAM) [52] unlocks the opportunity to infer segmentation mask(s) based on text input. Similarly, other pre-trained models, e.g., Tag2Text [46] and RAM [107], can generate textual tags from input images.
While text-guided image synthesis is promising, generating images based on text alone may fail to satisfy users' needs due to a lack of additional “control.” ControlNet [106] shows the feasibility of adding such control to text-to-image diffusion models, demonstrating how large diffusion models, such as Stable Diffusion, can be augmented with additional conditional inputs, such as edges and depth. This opens possibilities for many follow-up works, e.g., Uni-ControlNet [108] and LooseControl [20], that demonstrate the integration of multimodal conditions. InstructEdit [98] and early work like EditGAN [63] show how users can perform fine-grained editing based on text instructions.
While innovating on GenAI and VLFM models is out of our scope, MemoVis contributes a novel interaction workflow that enables feedback providers to easily create reference images for 3D design feedback by leveraging the strengths of today's GenAI models.

2.3 GenAI-Powered 2D and 3D Design Workflow

Previous research has explored novel techniques for discovering the design space and controlling the image generation process through GAN-based GenAI [41], increasing the efficiency of various visual design workflows. For example, Zhang et al. [105] illustrate how a selection of image galleries may be generated using a novel sampling method, along with an interactive GAN exploration interface. Koyama et al. [55, 56] propose a Bayesian optimization-based approach that allows designers to search and discover the design space through a set of slider bars. Additionally, much prior research proposes novel interaction experiences that allow users to input additional control. For example, GANzilla [35] demonstrates how iterative scatter/gather techniques [79] allow users to discover editing directions—user-defined controls that steer generative models to create content with different characteristics. Follow-up research, GANravel [36], shows the techniques of global editing (by adding weights to example images) and local editing (with scribbled masks) for disentangling editing directions (i.e., achieving user-defined control while ensuring that unintended attributes remain unchanged in the target image). GANCollage [96] shows a StyleGAN-driven [49, 50] mood board that allows users to define possible controls through sticky notes.
Recent advances in text-to-image GenAI have enabled a wave of research integrating these technologies into 2D visual design workflows. For example, the “Generative fill” feature [12] introduced by Photoshop shows how text with simple scribbling can easily inpaint and outpaint the target image (e.g., adding and/or removing objects). Reframer [59] demonstrates a novel human-AI co-drawing interface, where the creator can use a textual prompt to assist with the sketching workflow. The study conducted by Ko et al. [53] demonstrates the versatile roles of text-to-image GenAI for visual artists from \(35\) distinct art domains in automating the art creation process, expanding ideas, and facilitating or arbitrating communication. In another study with \(20\) designers, Wang et al. [97] demonstrate the merits of AI-generated images for early-stage ideation. Recent works such as GenQuery [89] and DesignAID [26] demonstrate how GenAI could be useful for early-stage ideation in the 2D graphic design workflow. Similarly applied to the ideation stage, CreativeConnect [29] further shows how GenAI can support reference image recombination. Beyond visual design, BlendScape [82] demonstrates how text-to-image GenAI and emergent VLFMs can be used to customize the video conferencing environment by dynamically changing the background and speaker thumbnail layout.
As for 3D design, 3DALL-E [66] introduces a plugin for Fusion \(360\) that uses text-to-image GenAI for early-stage ideation in the 3D design workflow. LumiMood [74] shows an AI-driven Unity tool that can automatically adjust lighting and post-processing to create moods for 3D scenes. Lee et al. [60] focused on 3D GenAI with different input modalities and revealed that prompts can be useful for stimulating initial ideation, whereas multimodal input like sketches plays a crucial role in embodying design ideas. Blender Copilot [75] is a Blender plugin that enables designers to easily generate textures and materials by text. Vizcom [9] introduces an early-stage ideation tool for automotive designers that leverages ControlNet to convert designers' sketches into reference images. In terms of 3D design review, ShowMotion [25] demonstrates how potentially optimal views can be retrieved from a set of pre-recorded shots by selecting a specific 3D element in the scene. However, its closed-form search algorithm only considers the \(L2\)-distance of each shot to the selected target, neglecting the textual feedback comments.
Inspired by these works, MemoVis demonstrates how to integrate text-to-image GenAI models into the 3D design feedback creation workflow. Unlike early-stage ideation, modeling, or image editing tools, MemoVis is a review tool that aims to enable feedback providers who might not have professional 3D and image editing skills to easily create companion images for textual comments contextualized on the viewpoints of the initial 3D design. MemoVis's overarching tenet is to help feedback providers focus on text typing—the primary task while creating design feedback.

3 Formative Studies

We conducted two formative studies to understand the DCs. Specifically, we first interviewed professional 3D designers to understand current practices and the challenges of creating companion images for 3D design review (Section 3.1). We then analyzed design feedback from an online forum to understand unique 3D design characteristics that feedback writers want to convey (Section 3.2). This process resulted in three DCs (Section 3.3).

3.1 Formative Study 1: Preliminary Needfinding Study

The first study aims to understand the current professional review practices and pain-points in creating companion reference images. Professional 3D designers were recruited due to their extensive experience as both 3D designers and feedback providers.
Participants. We recruited FP1 and FP2 as the Formative study Participants (FP) from Nissan Design America. FP1 is a modeler and digital design lead with approximately \(20\) Years of Experience (YoE) in 3D design and \(12\) YoE in the professional 3D design review process. FP1 also has extensive freelance experience in 3D game design. FP2 is currently a senior designer specializing in 3D texture design. FP2 has approximately \(23\) YoE in 3D design and \(15\) YoE in 3D design review. While both participants have strong experience with professional 3D software for automotive design, such as Autodesk VRED, they also consider themselves experts in most generic 3D software, including Blender and the Adobe Substance collection, as well as Autodesk Maya and Alias. Although FP2 has tried Midjourney [4], neither participant has used GenAI in professional settings.
Methods. Participants first described their demographics, including past 3D design and design review experiences. We then conducted semi-structured interviews with each participant, focusing on two guiding questions: “what does the typical 3D design workflow look like” and “what are the potential pain points when creating reference images for design feedback.” For the second question, we further probed participants to discuss how GenAI tools could help with this task. The interviews were open-ended, and participants were encouraged to discuss their thoughts based on their professional experience. All interviews were conducted remotely and were then analyzed thematically using a deductive and inductive coding approach [21] (see Figure H1 in Appendix H for the generated codebook). The interviews on average took \(41\) min with each participant.
Findings. Overall, we identified three key findings.
Creating reference images starts with finding supporting 3D camera views. Both participants recognized the importance of finding supporting camera views in the 3D scene to aid written feedback. This task is typically done through manual exploration of the scene and taking screenshots. For example, FP1 described:
“Most of the time, we’ll simply take screenshots. We can then markup over the screenshots. Or sometimes the managers will actually take screenshots and markup what they want, or give examples of what they want and send back. […] I typically use Zoom to record my feedback, because in our regular design software, there is not a lot of functionality like this right now” (FP1).
FP1 further emphasized the importance of having these supporting 3D camera views when discussing with non-technical collaborators like managers or clients. For example: “sometimes, we will just hold an online meeting. They’ll talk about like from rear view, or from the top. And we will navigate the design and show them to better understand their feedback.”
Conveying changes requires additional work on the reference images. When it comes to creating an effective reference image for 3D design feedback, participants reported three main approaches: scene editing, using existing example images, or GenAI.
(1) Scene editing. Participants reported spending time to mock up changes directly on the collected screenshots. FP2 described the complexities of preparing reference images for material design: “for making reference images for interior and definitely for color materials, where they apply fabrics to the seats and the door panels and like a leather grain to the instrument panel, stuff like that, it can be a little bit involved. It’s not so quick. So I usually do it quickly in Photoshop. But ideally, you can also do it in the visualization software, like the VRED. I do think preparing these reference images takes the longest.”
(2) Using existing reference images. Some design review stakeholders might resort to using some existing photographic examples of what they want to convey. For example, FP1 mentioned that “when we get to the more advanced review with the stakeholders outside our studio, sometimes, they might simply show some other example images to demonstrate what they want. But we would generally work with them iteratively, just to make sure there is no miscommunication.”
(3) GenAI. Emerging GenAI tools are seen as a promising way to generate reference images. Both participants recognized the potential benefits for visual reference creations and creativity support that GenAI tools could provide in 3D design review workflow. A key benefit that FP1 emphasized is the low barrier of entry for non-technical users: “using texts to generate image seem to be flexible and simple for those who do not know image editing software.” FP2 emphasized how images generated using GenAI technologies could inspire new ideas: “GenAI is like Pinterest on steroids. You’re already using Pinterest, but with GenAI you can create even more interesting inspiration. I think you tend to see the same images once in a while on Pinterest. Because if people are picking the same type of images, and you’re kind of like in this echo chamber kind of thing. I remember the first few months when I started playing with Midjourney, my brain just got kind of warmed and hot. It was getting massaged! Because I’m seeing these crazy visuals that my brain isn’t used to, like these weird combinations of things. I think it was a very good way for ideation. It can be also very stimulating for feedback providers to think and create suggestions.”
Controls for image generation. Despite the potential of using GenAI to generate reference images, both participants mentioned that it can be frustrating to generate the right image using only text prompts: “it’s like a slot machine. I think designers kind of occasionally like happy surprises. You do 20 and maybe one is cool. But after a while, I think it gets a little frustrating and you really want more control over the output. I know there’s a lot of this control on that kind of stuff, for example helping you control a little bit more perspective and painting, and stuff like that, however it is still very hard to visualize the many feedback suggestions, for example, with a small change of a particular assets components” (FP2).

3.2 Formative Study 2: Analysis of Real-World 3D Design Feedback

While the first study allowed us to understand how reference images are used in the current 3D design review workflow, we still needed to examine what information reviewers typically encode in their feedback and how visual references are used to convey such information. While our semi-structured interviews offered insightful thoughts from professional designers, it is difficult to analyze real-world design feedback in an ecologically valid setting, as most design feedback in professional settings is not publicized. Additionally, feedback providers may include people beyond professional 3D designers, who might not have the prerequisite skills for using 3D and image editing software.
Data Curation. We collected 3D design feedback from Polycount [7], one of the largest free online communities for 3D professionals and hobbyists. Polycount allows its members to post 3D artwork and receive design feedback from the community. Although Polycount posts tend to be centered around 3D game design, the 3D assets discussed cover a wide variety of 3D design cases, such as character, object, and scene design. These assets are also common in other 3D design domains. Therefore, the outcomes of our analysis could also generalize to broader 3D applications. Next, we selected threads posted within one month starting from June 2023. Since our focus is to understand the characteristics of real-world design feedback, we only considered the section “3D Art Showcase & Critiques” [8]. Irrelevant topics such as announcements and discussions related to 3D software were excluded. Our data curation led to \(15\) discussion threads, including \(99\) posts from \(15\) creators and \(36\) feedback providers. Among the \(15\) threads, eight focus on the design of characters (e.g., human and gaming avatars), three focus on the design of objects, and four focus on scene design.
Analysis Methods. We used an inductive and deductive coding approach [21] to label each design feedback post thematically. We aimed to understand the primary focus of the feedback and how the feedback was externalized. Our codebook (see Figure H2 in Appendix H) was generated through five iterations.
Findings. Our analysis led to findings under five themes (Figure H2). We use OFP<Thread_ID>-<Feedback_Provider_ID> to annotate Online design Feedback Providers (OFP). For example, OFP1-2 indicates the second feedback provider in the first discussion thread.
Reference images are important to complement textual comments, but creators might need additional “imagination” to transfer the gist of the visual imagery. One common approach for creating visual references is to use internet-searched images. However, online images tend to be very different from the original design context, so feedback providers usually write additional text to help designers understand the gist of the reference images contextualized on the original design. For example, while suggesting changing the color tones of the designed character (Figure 2(a)), OFP6-6 used a searched figure (Figure 2(b)) as a reference image and suggested: “in terms of tone, maybe look here [refer to Figure 2(b)] too, just for its much softer tones across the surface.” Similarly, for scene design, OFP8-2 used Figure 2(d) and (e) to suggest feedback for a medieval dungeon design (Figure 2(c)): “the rack and some of the barrels and other assets look very new and doesn't have the same amount of wear or appear to have even been used. Need to think of the context and the aging of assets that would have similar levels of wear or damage […] [for the wall design] I'd probably suggest to use more variation/decals, with some areas of wet or slight moss, or cobwebs.” OFP8-1 similarly found another internet-searched image, shown in Figure 2(f), as an example to suggest the design of shadow and color tone: “right now the shadows are way too dark, almost completely black which makes it really hard to tell what is going on, and causes losing a lot of detail in some parts […] I would look into adding in some cool colored elements into your environment to help balance the hot orange lighting you have. Here is an example with less extreme lighting.” However, the reference images attached by feedback providers were not contextualized on the initial design (cf. Figure 2(b) vs. 2(a) and cf. Figure 2(d)–(f) vs. (c)), and could possibly lead to confusion for the creators.
Fig. 2.
Fig. 2. Example reference images from Polycount [7, 8]. (a) Initial design of discussion thread 6 and (b) feedback from OFP6-6; (c) initial design of discussion thread 8 and feedback from OFP8-1 (f) and OFP8-2 (d and e); (g) initial design of discussion thread 2; (h) initial design of discussion thread 13. Green and blue labels indicate the associated reference images are from creators and feedback providers, respectively.
The suggestions conveyed by the design feedback can concern either the revision of specific part(s) or the redesign of entire assets. While much feedback, such as Figure 2(d)–(f), focused on a major revision (or even redesign) of the entire environment, we found that much design feedback only emphasizes changing specific part(s) of the design assets while keeping the rest of the design unchanged. For example, for Figure 2(g), OFP2-1 suggested only changing the knives without commenting on the other parts of the asset: “the knife storage seems a bit unpractical. As they are knives, one would probably like to grab them by the handle? Maybe use a belt instead, with the blades going in, or some knife block as can be found in some kitchens.” Similarly, in the design of Figure 2(h), OFP13-3 suggested additional modeling while being satisfied with the rest of the design: “looks cool! With close up shots it would be nice if bolts/screws were modeled.” Although such intended changes often involve specific part(s) of the assets, some design feedback explicitly requests designers to “visualize” how the assets might be integrated into a different environment. This feedback helps inspire creators to better polish their artwork. For example, while providing feedback for polishing the tip and edge of the gun in Figure 2(h), OFP13-2 asked the creator to “just think about how this gun is used, and where wear and damage would be.”
Although some design feedback provides actionable solutions, other feedback offers potential exploratory directions. While some design feedback contains specific and actionable steps, many critiques only point out problems, without offering suggestions on how to address them. For example, for an ogre character design, OFP1-4 suggested: “I think it could use some better material separation. The cloth and the metal (maybe even the skin too) seem to have all the same kind of noise throughout.” Some design feedback indicates potential exploratory directions that might require the creators to explore by trial and error. For example, while designing a demon character, OFP3-2 wrote: “I think that the shape is too round and baroque. More angular form should work better.”

3.3 Summary of DCs

We highlighted the significance of using reference images in 3D design feedback, applicable to both the synchronous and asynchronous design review sessions discussed in Sections 3.1 and 3.2, respectively. Both studies showed that the process of creating good reference images still remains difficult. These challenges come from the complexities of 3D design, requiring feedback providers to be well-versed in 3D skills to describe changes efficiently. While emerging GenAI is deemed promising for synthesizing reference images for 3D design, we have seen from FP2's testimony that using such tools directly, without dedicated interface support, can hardly meet the needs of 3D design feedback visualization. To integrate text-to-image GenAI into the 3D design feedback creation workflow, we propose three key DCs:
DC1: Design controlled visualization for both local and global changes. Our second formative study showed the importance of having a controlled way to synthesize reference images for both local changes (i.e., only specific part(s) of the asset need to be changed while keeping the remaining parts consistent with the current design) and global changes (i.e., a major revision or even redesign of most parts of the artwork). As demonstrated in Figure 2 and discussed by FP2, although searching for images is common practice, finding an exact or sufficiently similar design online is difficult. To address this, feedback providers often have to write elaborate and detailed comments explaining the differences and areas of focus. While vanilla text-to-image GenAI methods offer powerful tools for image generation, the resulting images are usually not contextualized in the initial design. Therefore, it is critical to develop a novel interface that leverages the power of GenAI to generate in-context imagery at various scales, seamlessly integrated with the 3D design.
DC2: Provide ideation and creativity support for exploratory design feedback. We learned that feedback providers often would like to use the reference image as a source of inspiration, as it enables them to think about alternative ways to improve the 3D design. For example, some design feedback, like that of OFP13-2, requested the creators to imagine how the artwork would look after being integrated into a bigger environment. Therefore, reference images should also demonstrate how the 3D model might look when used in different scenarios, inspiring both creators and feedback providers.
DC3: Offer ways for feedback providers, including those without 3D and image editing skills, to accompany their comments with meaningful in-context visualizations. Feedback providers do not have to possess design skills, as shown in Formative Study 1. Thus, feedback providers may also include individuals without professional 3D or image editing skills, such as clients. Without these skills, one can spend tremendous effort on even the simplest tasks, such as selecting a good view by navigating the scene with an orbit camera, inserting a novel object, or changing the texture or color of an existing object [27]. Therefore, the interface design should provide simple ways for feedback providers to navigate the 3D scene and create reference images without 3D and image editing skills.

4 MemoVis

Overall, MemoVis includes a 3D explorer (Figure 1(a)), where feedback providers can explore different viewpoints of the 3D model, and a rich-text editor (Figure 1(b)), where the textual feedback is typed. Unlike image editing (e.g., Photoshop [11, 12]) and ideation tools (e.g., Vizcom [9]), MemoVis is a review tool for 3D design. The key tenet is to enable feedback providers to focus on text typing—the primary task for creating 3D design feedback—while using an AI-augmented rich-text editor to create companion images that illustrate the text. As a review tool, MemoVis aims to enable feedback providers to create reference images that efficiently convey the gist of the textual comments rather than images that are visually aesthetic. MemoVis's main features include a viewpoint suggestion system and three image modification tools.

4.1 Real-Time Viewpoint Suggestions

In MemoVis, users control an orbit viewing camera with 6DoF to render the 3D model in screen space, similar to mainstream 3D software. However, navigating and exploring a 3D scene with a mouse can be tedious for users lacking experience with 3D software. To enable feedback providers without 3D and image editing skills to more efficiently create in-context visualizations for the textual feedback (DC3), MemoVis automatically recommends viewpoints as the feedback provider continuously types feedback. Figure 1(c) shows an example where closer (and possibly better) viewpoints of the standing desk, shown as thumbnails, are recommended. The selected viewpoint is anchored with the textual comment and instantly reflected in the 3D explorer. Feedback providers can ignore the suggestions when the suggested views are less helpful.
To achieve this, we use the pre-trained CLIP model [3] to find viewpoints with the highest cosine similarity to the textual feedback. Indeed, CLIP is trained on \(\sim 400\) million text-image pairs with a contrastive loss [30]. The system benefits from the human biases in image acquisition [43, 95]. For example, despite the vagueness of a text such as “office desk,” the pre-trained CLIP model would score the desk from a front and top view higher than from a back and bottom view, as it is more common to take front-facing pictures of desks. Formally, we parameterize the orbit camera with six parameters: the 3D point that the camera is looking at \((t_{x},t_{y},t_{z})\) as well as its distance \(r\) to the camera, and the longitudinal and latitudinal rotations \(\alpha\), \(\beta\). Each viewpoint can thus be represented by the tuple \(\boldsymbol{v}=(\alpha,\beta,r,t_{x},t_{y},t_{z})\), with \(\alpha\in[0,\pi]\) and \(\beta\in[0,2\pi]\). Our goal is to search for \(\hat{\boldsymbol{v}}=\operatorname{argmax}_{\boldsymbol{v}\in\boldsymbol{V}}\cos\{f_{\text{text}}(\boldsymbol{t}),f_{\text{image}}(\boldsymbol{I}_{\boldsymbol{v}})\}\), where \(f_{\text{text}}(\cdot)\) and \(f_{\text{image}}(\cdot)\) represent the CLIP encodings for the text (\(\boldsymbol{t}\)) and the screen space image (\(\boldsymbol{I}_{\boldsymbol{v}}\)) associated with a specific viewpoint \(\boldsymbol{v}\). During pre-processing, we sample multiple possible viewpoints and encode their corresponding renderings via CLIP to create a database of viewpoints. At inference time, we encode the textual comment via CLIP and perform a nearest-neighbor query in the database, which can be done without breaking the interactive flow.
Pre-processing. We first compute the bounding box of the 3D model. We then discretize the \(x\)-, \(y\)-, and \(z\)-axis into five bins, leading to \(5^{3}=125\) sampled target positions \((t_{x},t_{y},t_{z})\). Similarly, we sample \(\alpha\) and \(\beta\) at \(30^{\circ}\) intervals into \(12\times 6=72\) possibilities. We sample \(r\) from \(\{0.5,1.0,1.5\}\) to create close, medium, and far views. For each 3D model, this pre-processing phase takes around 3–5 minutes and results in a matrix \(\boldsymbol{D}\in\mathbb{R}^{27K\times 500}\), comprising \(27\)K viewpoints, each encoded via CLIP into a \(500\)-dimensional feature vector.
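To make the pre-processing concrete, the following is a minimal sketch of how such a viewpoint database could be built with the open-source CLIP package. The offscreen `render(viewpoint)` helper, the ViT-B/32 checkpoint, and the exact binning code are illustrative assumptions rather than MemoVis's actual implementation.

```python
import itertools
import numpy as np
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def sample_viewpoints(bbox_min, bbox_max):
    """Discretize the orbit camera (alpha, beta, r, tx, ty, tz) into ~27K views."""
    targets = [np.linspace(lo, hi, 5) for lo, hi in zip(bbox_min, bbox_max)]
    alphas = np.deg2rad(np.arange(0, 180, 30))   # alpha in [0, pi), 30-degree bins
    betas = np.deg2rad(np.arange(0, 360, 30))    # beta in [0, 2*pi), 30-degree bins
    radii = [0.5, 1.0, 1.5]                      # close, medium, and far views
    return [(a, b, r, tx, ty, tz)
            for a, b, r in itertools.product(alphas, betas, radii)
            for tx, ty, tz in itertools.product(*targets)]

@torch.no_grad()
def build_view_database(viewpoints, render):
    """Encode one rendering per viewpoint into a row of the matrix D."""
    rows = []
    for v in viewpoints:
        image = preprocess(render(v)).unsqueeze(0).to(device)  # render() returns a PIL image
        feat = model.encode_image(image)
        rows.append((feat / feat.norm(dim=-1, keepdim=True)).cpu())
    return torch.cat(rows)  # shape: (num_views, feature_dim); rows are unit-normalized
```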
Real-time inference. As the feedback provider types, the textual comment (\(\boldsymbol{t}\)) is encoded, and the top-\(4\) nearest views under cosine similarity are retrieved. The feature runs every time the user stops typing for \(500\) ms and takes under a second to compute.
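Continuing the sketch above, the real-time query reduces to encoding the comment and taking the top-\(4\) rows of the database by cosine similarity (again an illustrative sketch, not the exact implementation):

```python
@torch.no_grad()
def suggest_viewpoints(comment, view_db, viewpoints, k=4):
    """Return the k viewpoints whose renderings best match the typed comment."""
    tokens = clip.tokenize([comment], truncate=True).to(device)
    text_feat = model.encode_text(tokens)
    text_feat = (text_feat / text_feat.norm(dim=-1, keepdim=True)).cpu()
    scores = view_db @ text_feat.T            # cosine similarity (all rows unit-normalized)
    top = torch.topk(scores.squeeze(1), k=k).indices
    return [viewpoints[i] for i in top]
```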
Figure 3 shows examples of suggested viewpoints of the pegboard in an office (Figure 3(a)–(e), as an interior design example), the headlight of a car (Figure 3(f)–(j), as an exterior design example), and the headband of a samurai boy (Figure 3(k)–(o), as a character design example). Although MemoVis provides real-time viewpoint suggestions, it still allows feedback providers to manually navigate the view. For instance, they can manually find the view before writing textual comments or adjust the view based on the suggestions.
Fig. 3.
Fig. 3. Examples of the suggested viewpoints based on the typed feedback comments (leftmost column). We show viewpoints with the top-\(\boldsymbol{4}\) highest CLIP similarity scores for an office 3D model (a–e), a car model (f–j), and a samurai boy model (k–o). (a), (f), and (k) show the bird-eye view of the initial 3D model, where red circles highlight the focus of the textual comments. The cosine similarity scores are shown at the bottom of each suggested viewpoint (b–e, g–j, l–o).

4.2 Creating Reference Images with Rapid Image Modifiers

Guided by DC2, which highlights that ideation and creativity support is crucial for creating design feedback, MemoVis uses recent text-to-image GenAI to create reference images. However, the generated reference images should match both the textual comments and the current 3D design. Critically, MemoVis must be able to generate images with local modifications of the scene if the feedback targets a specific part, or images with global edits when the feedback is a global redesign of the scene. This design goal addresses DC1, which emphasizes the controlled visualization of the textual feedback for both local and global changes. MemoVis introduces three image modifiers, which operate on rapid design layers. Feedback providers can use one or multiple modifiers (Figure 1(a)) to easily compose and create a reference image, on the associated rapid design layer, rendered from a selected viewpoint. We now describe our system design and demonstrate how each modifier interacts with its rapid design layer.
Modifier 1: Text \(+\) Scribble Modifier with Scribble Design Layer. We leverage ControlNet [106, 108] for two scenarios. In the simpler case, the feedback is just a global texture edit on the scene that does not suggest any geometry modification. In this case, we use a depth-conditioned ControlNet [106] to generate an image. The depth guidance ensures that the generated image is anchored in the current design, while the textual prompt generates an image that matches the feedback.
If the feedback suggests modifying the geometry of part of the scene (e.g., the review from OFP2-1 for the 3D design of Figure 2(g)), directly editing the depth image [32] is impractical for feedback providers without graphics knowledge. Instead, we leverage a depth- and scribble-conditioned ControlNet in a somewhat more involved strategy1 [106, 108]. For example, suppose the design feedback is to replace the computer display with a curved one in the scene depicted in Figure 4(a), whose depth map is shown in Figure 4(b). The feedback provider only has to scribble the rough shape of the new curved computer display to be added, as in Figure 4(c), and provide a text prompt. MemoVis starts by creating the input conditions for ControlNet. It extracts the scribbles from the initial design (Figure 4(a)) using Holistically-nested Edge Detection (HED) [103] and aggregates them with the manually drawn scribbles (Figure 4(e)). The depth map (Figure 4(d)), aggregated scribbles, and text prompt are then fed to ControlNet to generate an image (Figure 4(f)). Empirically, we set the strength of the scribble condition to \(0.7\) and of the depth condition to \(0.3\), emphasizing the higher importance of the newly added object(s).
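As one plausible realization, the depth- and scribble-conditioned generation with the stated strengths (\(0.7\)/\(0.3\)) could be expressed with the Hugging Face diffusers Multi-ControlNet API as sketched below. The checkpoint names, the prompt, and the two conditioning images (the aggregated scribble map and the masked depth map, both PIL images) are assumptions for illustration, not MemoVis's exact code.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# scribble_map: HED edges of the render merged with the user's strokes (cf. Figure 4(e))
# depth_map: depth render with the scribbled region reset (cf. Figure 4(d))
image = pipe(
    prompt="a curved computer display on an office desk",
    image=[scribble_map, depth_map],
    controlnet_conditioning_scale=[0.7, 0.3],  # scribble weighted higher than depth
    num_inference_steps=30,
).images[0]
```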
Fig. 4.
Fig. 4. Examples of creating reference images using the text \(+\) scribble modifier. (a) Initial design; (b) associated depth map; (c) manually drawn scribbles (black strokes) with the white strokes indicating the removed geometries; (d) the depth map with the scribbling area being reset; (e) an aggregated scribble from the initial image and the manually drawn scribbles, where the red bounding box shows the scribbling area by feedback providers; (f) synthesized image by ControlNet conditioned by scribble \(+\) depth, where the red bounding box shows the area that the feedback providers scribbled; (g) segmented mask generated by SAM; (h) initial design with the primitives describing existing computer display being removed; (i) final composed reference image; (j) final composed reference image without removing objects marked for removal by scribbling.
In addition to changing the geometry, the generated image might modify the texture of the current design, which is undesirable. To address this, we leverage automatic segmentation techniques to merge the generated object, i.e., the “curved computer display,” from the generated image \(\boldsymbol{I}_{syn}\) back into the original render \(\boldsymbol{I}_{init}\). To achieve this, we compute the bounding box of the user scribbles (red box in Figure 4(e)) and find the most salient object within the bounding box using SAM [52], leading to a segmentation mask \(\boldsymbol{I}_{seg}\) (Figure 4(g)). MemoVis then creates the final reference image shown in Figure 4(j) by composition: \(\boldsymbol{I}_{syn}\odot\boldsymbol{I}_{seg}+\boldsymbol{I}_{init}\odot(\mathbf{1}-\boldsymbol{I}_{seg})\), where \(\odot\) indicates broadcasting and element-wise multiplication.
This approach allows visualizing the newly added objects, but parts of the initial object, i.e., the current display, can remain visible, leading to unpleasant visual artifacts, circled in red in Figure 4(j). To address this, MemoVis detects the mesh primitives to be removed from the image bounding box and depth map, and re-renders \(\boldsymbol{I}^{\prime}_{init}\), which is the same image as \(\boldsymbol{I}_{init}\) but without the object to be removed. Replacing \(\boldsymbol{I}_{init}\) by \(\boldsymbol{I}^{\prime}_{init}\) in the composition leads to the final result, displayed to the user and visualized in Figure 4(i). Algorithm C1 in Appendix C recaps the algorithm.
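A minimal sketch of the box-prompted segmentation and the composition step, using the public segment-anything package; the checkpoint path, the box coordinates, and the image variables (`i_syn`, the synthesized image, and `i_init_clean`, the re-rendered \(\boldsymbol{I}^{\prime}_{init}\)) are placeholders:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

predictor.set_image(np.array(i_syn))                 # HxWx3 uint8 RGB array
masks, scores, _ = predictor.predict(
    box=np.array([x0, y0, x1, y1]),                  # bounding box of the user scribbles
    multimask_output=True,
)
i_seg = masks[np.argmax(scores)]                     # most salient object as a boolean mask

# Composition: I_syn * I_seg + I'_init * (1 - I_seg)
m = i_seg[..., None].astype(np.float32)
reference = (np.array(i_syn) * m + np.array(i_init_clean) * (1.0 - m)).astype(np.uint8)
```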
Modifier 2: Grab’n Go Modifier with GenAI Design Layer. The grab’n go modifier is an easy selection tool that allows the user to compose an object from the rendered image into a generated image. For instance, considering the car in Figure 5(a), we can generate an image of this car staged in various backgrounds (Figure 5(b) and 5(d)) using the depth-conditioned ControlNet [106]. However, the car might have undesirable texture variations. The feedback providers can simply draw red rectangles to select the car object and replace it with the exact car in the current design, thus staging it in the desired environment.
Fig. 5.
Fig. 5. Examples showing how the feedback providers can stage the 3D model into different scenes with the grab'n go modifier. (a) The initial 3D design of a car; (b, d) synthesized images generated by the scribble \(+\) text modifier with ControlNet conditioned on depth. The prompts “a Ferrari car driving on the highway” and “a Ferrari car driving on a desert” were used to synthesize (b) and (d), respectively. The red bounding boxes show the areas drawn by feedback providers; (c) final composed image bringing the initial design into the scene of (b); (e) final composed image bringing the initial design into the scene of (d).
The grab'n go modifier can also be used to do the reverse, i.e., composing objects from the generated images into the current design. For example, if the feedback provider wants to also include the generated keyboard in our previous example with the curved display in Figure 6(b), they can simply draw a crop on the GenAI design layer. To achieve this, similar to the first modifier, we give the bounding box drawn by the user as input to SAM [52], and select the highest-scored region as the output (Figure 6(c)). MemoVis then computes a final segmentation mask by taking the union of Figure 6(c) and Figure 4(g), which is used to create the final reference image (Figure 6(e)). Formally, the final segmentation mask after applying the grab'n go modifier \(i\) times can be computed as \(\boldsymbol{I}_{seg}=\boldsymbol{I}_{seg_{i-1}}\cup\boldsymbol{I}_{seg_{i}}\). Note how this is significantly simpler than professional image editing software, which commonly uses the Lasso tool [13]. As suggested by DC3, which emphasizes the needs of users without image editing skills, the interactions around the grab'n go modifier enable feedback providers without professional image editing skills to efficiently create companion images.
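The iterative composition can be sketched as a running union of the per-selection masks (illustrative only; `predict_box_mask` stands for the SAM box query from the previous sketch):

```python
import numpy as np

i_seg = None
for box in user_boxes:                       # one entry per rectangle drawn by the reviewer
    new_mask = predict_box_mask(box)         # boolean HxW mask for the selected object
    i_seg = new_mask if i_seg is None else np.logical_or(i_seg, new_mask)
# i_seg then replaces the single-object mask in the composition step shown earlier.
```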
Fig. 6.
Fig. 6. Examples of continuous composing. (a) Reference image of Figure 4(i); (b) feedback provider can draw a bounding box to indicate their intention to add the white keyboard design into the reference image; (c) segmented mask generated by SAM; (d) segmented mask by computing the union of (c) and Figure 4(g); (e) final reference image after including the white keyboard.
Modifier 3: Text \(+\) Paint Modifier with Painting Design Layer. MemoVis integrates Stable Diffusion Inpainting [84] as the text \(+\) paint modifier. The user paints a selection mask on the canvas, provides a text prompt, and the model generates an image. This can be used to add simple objects, remove objects, or rapidly fix glitches caused by the text \(+\) scribble and grab'n go modifiers. This tool complements the text \(+\) scribble modifier. Figure 7 shows an example of adding a wall clock with the text \(+\) paint modifier (Figure 7(a)–(c)). When trying to add a curved computer display, the text \(+\) paint modifier fails to convey the intention of the design feedback (see Figure 7(d)). We thus argue for the necessity of user scribbling to add more complicated objects, which would be hard to describe in detail with text.
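As an illustration, the text \(+\) paint modifier maps directly onto the standard Stable Diffusion inpainting pipeline; the sketch below uses the diffusers API, with the prompt, `current_image`, and `paint_mask` (white where the user painted, black elsewhere) as placeholder inputs rather than the system's exact code.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="a round wall clock on the wall above the desk",
    image=current_image,     # the reference image being edited (PIL)
    mask_image=paint_mask,   # the painted selection mask (PIL)
).images[0]
```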
Fig. 7.
Fig. 7. Examples of visualizing feedback related to adding new objects with the text \(+\) paint modifier. Figures (b) and (c) show a wall clock successfully added to the view. Figure (d) aims to add a curved computer display, which is more complicated in terms of shape, geometry, and orientation. Figure (e) demonstrates a failed attempt with the text \(+\) paint modifier.
MemoVis can also fix artifacts left by the text \(+\) scribble and grab'n go modifiers. These artifacts occasionally arise in Algorithm C1 when the residual area occupies more than \(70\%\) of the area of the corresponding meshes (i.e., \(r \gt r_{th}\)). For example, to change the water dispenser in an office to a storage drawer (Figure 8(a)), the feedback provider might leverage the text \(+\) scribble modifier to roughly draw the shape of a drawer (Figure 8(b)), leading to the drawer being extracted from the synthesized image (Figure 8(c)) and added to the initial design (Figure 8(d)). While Figure 8(d) may convey the idea of adding a storage drawer next to the standing desk, the remaining water dispenser is misleading and needs to be removed. The feedback provider can then use the text \(+\) paint modifier to easily remove the residual area at the top of the water dispenser, leading to a more faithful final reference image (Figure 8(e)).
Fig. 8.
Fig. 8. Text \(+\) paint modifier can be used to fix the glitches of the images created by text \(+\) scribble and grab’n go modifiers.

4.3 Interactions and Implementations

Figure 1 shows the design of MemoVis. With the text \(+\) scribble modifier, the feedback provider scribbles while holding down the mouse button. The feedback provider can use the left and right mouse buttons to indicate the geometry of new objects and the unwanted areas, respectively. With the grab'n go modifier, the feedback provider can easily draw a bounding box using either the left or the right mouse button. The left and right mouse buttons indicate keeping and removing the key object(s) inside the enclosed box of the synthesized image in the reference image, respectively. Finally, with the text \(+\) paint modifier, MemoVis enables feedback providers to click and drag with the left and right mouse buttons to specify the areas for adding (or changing) and removing objects of interest, respectively.
MemoVis was prototyped as a browser-based application to reduce the need for feedback providers to install large-scale standalone 3D software, as part of the engineering effort to address DC3. We used Babylon.js v\(6.0\) [5] to implement the 3D explorer for rendering the 3D model. A Flask server, deployed on a GPU-enabled cloud machine, was implemented to process the inference workloads. Further details of the pre-trained models we used can be found in Appendix D.
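As a rough sketch of this architecture, the browser front end can post the typed comment to a Flask endpoint that runs the viewpoint query on the GPU server; the route name, payload fields, and `load_view_database` helper are illustrative assumptions rather than the deployed implementation.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/suggest_viewpoints", methods=["POST"])
def suggest_viewpoints_endpoint():
    payload = request.get_json()
    view_db, viewpoints = load_view_database(payload["model_id"])   # cached per 3D model
    views = suggest_viewpoints(payload["comment"], view_db, viewpoints, k=4)
    return jsonify({"viewpoints": [[float(x) for x in v] for v in views]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```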

5 Evaluations

We conducted two user studies to evaluate MemoVis. The first study aims to understand the experience of feedback providers while visualizing textual comments using MemoVis, whereas the second study aims to explore how effectively the images created with MemoVis convey the gist of 3D design feedback when consumed by designers. We use PF# and PD# to index participants acting as Feedback providers (PF) and Designers (PD), respectively.

5.1 Study 1: Creating Reference Images

Our first study was structured as a within-subjects design. We aim to tackle two Research Questions (RQs):
(RQ1) How does the real-time viewpoint suggestion help feedback providers navigate the 3D scene while writing feedback?
(RQ2) How do the three types of image modifiers help feedback providers visualize the textual comments for 3D design?
Participants. PF1–PF14 (age: \(M=23.36\), \(SD=2.52\); seven males and seven females) were recruited as the feedback providers. While all participants had experience writing design feedback, most had limited experience working with 3D models. We believe that design feedback providers do not necessarily possess design skills. Among the recruited participants, only PF2, PF10, and PF14 were confident in their 3D software skills. Most participants did not have experience creating prompts or using GenAI, although PF1 and PF2 considered themselves experts in using LLMs, and PF13 was confident in his proficiency with text-to-image GenAI. Details of participants' demographics and the power analysis can be found in Appendix E.
Interface Conditions. Participants were invited to use two interface Conditions (C1–C2) to create and visualize textual comments:
(C1) Baseline. Participants were required to create reference image(s) using search and/or hand sketching. We aimed to mock up current design review practices based on the findings from Formative Study 1. Participants were not required to use existing GenAI-based image editing tools like Photoshop [14]: while FP2 acknowledged GenAI as a promising approach for generating reference images, its lack of control makes it impractical (Section 3.1); moreover, unlike FP2, who is a professional designer, feedback providers may not possess image editing skills. Participants were instructed to use their preferred search engine for finding well-matched images. PowerPoint could optionally be used for sketching and annotations.
(C2) MemoVis. Participants were invited to use MemoVis to create reference images along with textual comments.
Tasks. Each participant was instructed to complete three different design critique Tasks (T1–T3) with C1 and/or C2. To prevent learning effects, T1–T3 used different 3D models, created by different creators. For each design critique task, participants were instructed to create at least two design comments, with each comment accompanied by at least one reference image. We used T1 to help participants get familiar with C1 and C2; in this task, participants were instructed to critique a character model of a samurai boy. All data collected from T1 was excluded from all analyses. For T2 and T3, participants were asked to make the bedroom and the car more comfortable to live in and to drive, respectively. While T2 and T3 used different 3D models, the skills needed to create design feedback are the same. Details of the study tasks can be found in Figure F1 in Appendix F.
Procedures. Participants were invited to complete a questionnaire to report demographics and their 3D software, 3D design, and GenAI experiences. They were then introduced to MemoVis, and were given sufficient time to learn and get familiar with both C1 and C2 while completing T1. For those without prompt writing experience for GenAI, sufficient time was provided to go through examples on Lexica [6] and practice using FireFly [10]. Upon feeling comfortable with C1 and C2, participants were invited to complete T2 and T3. To prevent sequencing effects, we counterbalanced the order of the interface conditions. Specifically, PF1–PF7 were required to complete T2 with C2, followed by completing T3 with C1, whereas PF8–PF14 were required to complete T2 with C1, followed by completing T3 with C2. Since comparing feedback creation for T2 and T3 was out of our scope, we did not counterbalance the order of the tasks. After each task, participants were invited to rate their agreement with four Questions (Q1–Q4) on a \(5\)-point Likert scale, with a score \(\gt 3\) being considered a positive rating.
(Q1) Navigating the Viewpoint: “it was easy to locate the viewpoint and/or target objects while creating feedback.”
(Q2) Creating Reference Images: for the task completed by MemoVis (C2), we used the statement “it was easy to create reference images with the image modifiers for my textual comments.” For the task completed by baseline (C1), we used the statement “it was easy to create reference images with the method(s) I chose.”
(Q3) Explicitness of the Reference Images: “the reference images easily and explicitly conveyed the gist of my design comments.”
(Q4) Creativity Support: “the reference images helped me discover more potential design problems and new ideas.”
A semi-structured interview was also conducted focusing on participants’ rationales while evaluating Q1 to Q4. The study on average took \(57.34\) min (\(SD=7.72\) min).
Measures and Data Analysis. To address RQ1, we measured the navigating time for each textual comment, defined as the time that feedback providers spent navigating and exploring the 3D model. With the Shapiro–Wilk test [88], we verified the normal distribution of the measurements under each condition (\(p \gt .05\)). One-way ANOVA [40] (\(\alpha=.05\)) was therefore used for statistical significance analysis. Eta squared (\(\eta^{2}\)) was used to evaluate the effect size, with \(.01\), \(.06\), and \(.14\) being used as the empirical thresholds for small, medium, and large effect sizes [31]. We used thematic analysis [23] as the interpretative qualitative data analysis approach, along with a deductive and inductive coding approach [21], to analyze participants’ responses during the semi-structured interviews and to better understand participants’ thoughts and uncover the reasons behind the measurements and survey responses. We first read the transcripts independently and identified repeating ideas using initial codes derived from Q1 to Q4. Next, we inductively came up with new codes and iterated on them while sifting through the data. The final codebook can be found in Figure H3 in Appendix H.
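As an illustration of this analysis pipeline, the sketch below computes the Shapiro–Wilk check, the one-way ANOVA, and eta squared with SciPy and NumPy; the navigating-time arrays are placeholders rather than our study data.

import numpy as np
from scipy import stats

# Placeholder navigating-time measurements (seconds), one value per comment.
baseline_times = np.array([66.0, 71.2, 58.9, 80.3, 49.5, 62.7, 77.8])
memovis_times = np.array([44.1, 39.8, 52.6, 30.2, 47.9, 41.3, 55.0])

# Shapiro-Wilk normality check for each condition (we require p > .05).
print(stats.shapiro(baseline_times).pvalue, stats.shapiro(memovis_times).pvalue)

# One-way ANOVA across the two interface conditions.
f_stat, p_value = stats.f_oneway(baseline_times, memovis_times)

# Eta squared = SS_between / SS_total; .01, .06, and .14 mark small,
# medium, and large effects, respectively.
pooled = np.concatenate([baseline_times, memovis_times])
grand_mean = pooled.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                 for g in (baseline_times, memovis_times))
ss_total = ((pooled - grand_mean) ** 2).sum()
print(f"F={f_stat:.3f}, p={p_value:.3f}, eta_sq={ss_between / ss_total:.3f}")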
Results and Discussions. Most participants found our viewpoint suggestion features useful (RQ1) and the image modifiers easy to use to visualize textual comments (RQ2). Most participants also believed the reference images created with MemoVis could easily and explicitly convey the gist of 3D design feedback. Patterns of usage of each image modifier could be referred to Figure G1 in Appendix G.
How could the real-time view suggestions help feedback providers navigate inside the 3D explorer (RQ1)? Overall, \(13/14\) participants (all except PF14) leveraged the viewpoint suggestion features while visualizing the textual comments. More participants positively believed that it was easy to locate the viewpoints and target objects with MemoVis, compared to the baseline (\(13\) vs. \(4\), Figure 9(a)). Figure 9(b) shows a significant reduction (\(F_{1,24}=7.398\), \(p=.018\), \(\eta^{2}=.236\)) in navigating time while using MemoVis (\(M=44.06\)s, \(SD=18.94\)s) vs. the baseline (\(M=66.67\)s, \(SD=16.94\)s). Notably, the normality of the measurements was verified (\(p_{baseline}=0.54\), \(p_{MemoVis}=0.17\)). Our qualitative analysis suggested two benefits brought by the viewpoint suggestion feature.
Fig. 9.
Fig. 9. Survey responses and navigating time measurements from Study 1. (a) Participants’ responses of Q1–Q4; (b) the total navigating time of baseline and MemoVis. PF14 was excluded from (b), as the participant did not use the viewpoint suggestion feature.
Providing guides to find the viewpoint contextualized on the design comments. Most participants appreciated the help and guidance brought by the viewpoint suggestion feature. For example, “the view angle is good that it’s trying to give you a context” (PF1) and “it helped me a lot to find a nice view where I could create reference image” (PF2). PF11 also suggested the potential benefit of making faster decisions: “it helps me to make faster decisions. Sometimes I don’t know which views might be better. And it gives me options that I could choose.” After trying the baseline, PF11 emphasized: “[without viewpoint suggestion] I have to decide on how to look. I have to make bunch of decisions in the way. It is pretty cognitively demanding.” PF7, who did not have prior 3D experience, initially felt “confused about the directions of the viewpoint” while attempting to navigate the 3D scene using the 3D explorer. After using the viewpoints suggested by MemoVis, she felt “it’s a much better view, as the bed could be seen from a nice view angle.”
Locating target object(s) in a scene. Despite occasional failures and the need for minor adjustments, most participants appreciated being able to quickly locate target objects. For example, PF8 acknowledged: “I feel that around 85% of the time that the system could give me the right view that I expected, although I sometimes might still need to adjust like a zoom.” PF4 justified the reason for not strongly agreeing with Q1: “although it was helpful, like the system gave me a nice view suggestion. But I had to move it manually. Although that was a good starting point, I still need to make adjustments by myself.” During the interview, PF2 believed that “the view suggestions would be more useful for the bigger scene.” With past experience of designing 3D computer games, he further commented: “sometimes I work in video games. And video game maps could become really large. And there are multiple things. For example, there’s a very specific area that I find to go to, and to edit it. And then if I can type and say, for example, the boxes on the second floor of the map. And it instantly teleport me to there. Then, there is gonna save me a lot of time to find it in the hierarchy.” After PF14 critiqued a car design with MemoVis, he commented: “I think moving around with car was easier just because it’s a car instead of the room. But I believe for the bedroom it would be much helpful to have the view suggestion feature that kind of guide.”
How could MemoVis better support feedback providers in creating and visualizing textual comments for 3D design (RQ2)? Figure 9(a) shows that compared to the baseline, more participants rated positively in terms of reference image creation (\(6\) vs. \(1\) for Q2), explicitness of the images (\(10\) vs. \(4\) for Q3), and creativity support (\(13\) vs. \(3\) for Q4). Overall, feedback provider participants used the text \(+\) scribble modifier \(48\) times (\(42.11\%\)), the grab’n go modifier \(30\) times (\(26.32\%\)), and the text \(+\) paint modifier \(36\) times (\(31.57\%\)) (see Figure G1 in Appendix G for the detailed usage of the image modifiers). While completing the baseline task, despite the challenges of searching for well-matched internet images, participants used three main strategies to make the reference images more explicit. Specifically, four participants used annotations to highlight the areas that the textual feedback focused on; one participant (PF3) used two reference images to demonstrate a good and a bad design, expecting the designers to understand the gist by contrasting two extreme examples; and two participants (PF2, PF13) provided multiple reference images, hoping the designers could extract a different gist from each image.
Image explicitness. Participants appreciated the explicitness of the reference images created by MemoVis, which kept the context of the initial design. For example, “I like how it keeps the context around this picture. It could be much easier for people to understand my thoughts” (PF2), “it allows me to generate reference image in the same scenario, and not some random scenario that I’ve pulled from the internet” (PF9), and “this image references are pretty easy and explicitly. It just conveys my point. That’s what matters. Now it’s up to the person to make the decision to how have to make it better” (PF11). Some participants, like PF6, preferred the MemoVis-created reference images to searched and hand-sketched images. PF11 highlighted the easy and convenient workflow: “this process was pretty easy. The reference image is just like pop up to me! This is something I love. Like when I provide feedback while I was teaching, I didn’t explicitly tell the students like, your typography is really hard to read, you should change it to this font. But something like a bigger front, a different style, just like that.” In contrast, after creating feedback in the baseline task, PF13 complained: “you can find tones of image on the Google. They are very realistic. They are very decorative. But it’s just not related to my model, it’s not in the context […] when I use an internet image, there are more details, but there is even more confusing part. So many times, I just tweak my textual feedback, to minimize the possible confusions for the designers […] I think if I were doing by myself, I would just spend some more time and use Photoshop to edit the internet images.” PF14 also commented: “I typical have a specific image in mind. I think to come up with that design, [the MemoVis] is much much better. When I search for something, it is typically very generic. I never search something that is very specific like, a bedroom with blue walls or something. It’s easy if you’re coming in with a specific design.”
Unexpected inspirations and creativity support. Most participants recognized the benefits of MemoVis for inspiring new ideas, which is similar to FP2’s comments (Section 3.1). For example, “there are just so many possibilities. If you search through Google, your thoughts are limited by your experience. But this tool could give me so much unexpected surprises, which is a good thing and they are many times actually better than what I thought. I think it works quite well to help inspire more new ideas” (PF12). PF13 particularly enjoyed the mental experience of getting inspired iteratively while refining the textual prompts: “when I was typing like the description, it was just a text like a description. I don’t really have a solid image in my brain. But [MemoVis] helps me shape my idea, and helps me better think iteratively […] it’s like when you’re writing papers, instead of starting from the scratch, you have a draft, so it’s easier to discuss and revise based on it […].” Figure 10 demonstrates three examples with unexpected creativity that were appreciated by PF4. For example, upon seeing Figure 10(b), PF4 thought out loud that: “ I wasn’t expecting the plant to have like the little orange leaves, but I think it was a good idea for the final design.”
Fig. 10.
Fig. 10. Examples of inspirations and creativity support. (a), (c), and (e) show the selected viewpoints of the initial design; (b), (d), and (f) show the created visual references by PF4. The reported unexpected components are highlighted by red circle.
Simple and easy image creation workflow. Most participants appreciated the simple workflow for creating reference images with MemoVis, and felt it was “very easy to use” (PF14). This was also evident from the responses to Q2 reported in Figure 9. Participants highlighted the merits of the easy workflow facilitated by the convenience of all image modifiers. For example, “the system is very easy to use, although it might require a trial or two to become familiar with the modifiers. Once I understand, it is really handy and convenient to instantly create reference images that I want, which also match the design and my comments. All image modifiers are really useful, essential and indispensable!” (PF1). PF9 particularly appreciated the option of “select and extract” supported by the grab’n go modifier: “I think the workflow is pretty good. […] The image generation is not the only part. But I’m getting the option to select and extract [the new objects that I am trying to suggest], instead of just using some random image on Google [PF9 refers to the baseline interface condition].” Participants also valued the simplicity and effectiveness of composing text-to-image prompts. For example, even without prior GenAI experience, PF14 commented “it was really easy to write prompt. I love the flexibility to write prompts. It also helped me think.”
Scenarios when feedback providers fail to create explicit and satisfying companion reference images. We observed multiple scenarios where participants failed to create satisfying visual references that can explicitly convey the textual feedback. Our analysis unveiled three key reasons.
(1) Unsatisfactory generation. Text-to-image generation is still an emerging technology and therefore is not perfect [65]. MemoVis could fail completely to generate a reasonable shape (Figure 11(b)) or could generate a reasonable image that failed to match what the users wanted (Figure 11(d) and (e)). These setbacks may result in less explicit reference images, potentially causing misunderstandings for the designers and contributing to the negative ratings shown in Figure 9(a). For example, PF4 thought aloud while examining Figure 11(b): “it doesn’t look like a plant. I don’t think this is explicit enough for the designer to understand.” In these cases, we observed that participants tended to make several attempts or try a different view angle until they could generate the desirable image (Figure 11(c)). While PF4 positively rated the image explicitness, she disagreed that it was easy to create reference images. PF2 made a similar comment with respect to Figure 11(e): “the chair does not look like the one I wanted.”
Fig. 11.
Fig. 11. Examples of unsatisfactory reference images. (a) initial sketch from PF4 to describe an indoor plant. (b) MemoVis failed to generate the desirable image. (c) with a different camera view, PF4 successfully created the expected reference image. (d) initial sketch from PF2 to describe a new chair in the office. (e) MemoVis was able to generate a new chair but it didn’t match PF2’s expectation. (f)–(h) initial design, scribble, and final created reference image by PF7.
(2) Difficulty of writing prompts. A few participants emphasized the importance of prompts and the challenges in writing them. For example, “sometimes, I need several times of revision on the prompt to make the generated image better. […] So while my feedback takeaways could be visualized, it still needs several attempts” (PF12). A small number of participants also described the need to incorporate knowledge of the existing design: “if I make a prompt, it should have the knowledge of this existing design. For example, if I say like, keep the same blanket, then it should be same” (PF5). PF11 suggested that MemoVis should automatically create prompts based on the textual feedback: “I thought when I write the feedback, the prompt will be automatically generated so that it can bring me the reference image. But it’s more like I write the feedback and then I also create the prompt […] It was initially hard for me to distinguish the prompt and the feedback itself, that I need to tell the AI versus also tell the designer.” PF7, a non-native English speaker, found it challenging to phrase the prompt of “lap desk,” leading to the complete failure of the final reference image created by MemoVis (Figure 11(f)–(h)). This led him to rate the explicitness of the reference images (Q3) as strongly disagree.
(3) Lack of alternative design exploration. While most participants believed that the images created by MemoVis are more explicit and could provide inspirational support compared to the baseline, some participants commented on the necessity of being able to see multiple inference results similar to mainstream search engines. For example: “compared to the Google image, I feel there’s something very inspiring about seeing like 20 images all at once from like different creators” (PF6), and “when you search for an image on Google, it has the big long list of different images. That helps me to see different ideas. And because it’s Google, it’s pulling from a bunch of different websites. So I think that also helps me to think like, oh, this is what someone else thought. So yeah, I think if I’m thinking about it that way” (PF4).

5.2 Study 2: Assessing Reference Images

Our second study aims to tackle the key RQ: how the reference images created by MemoVis could convey the gist of the 3D design feedback, compared to the images created by the baseline condition (i.e., internet searched images and/or hand sketches)?
Participants. We recruited PD1–PD8 (age, \(M=23.75\), \(SD=2.55\), incl. four males and four females) as the designer participants, all with prior 3D design experience, from an institutional 3D design & eXtended Reality student society. No designer participants were involved in the formative studies or the first study. Details of the demographic backgrounds of the designer participants can be found in Appendix E.
Procedures. The second study was structured as a survey study, in which participants were invited to complete an online questionnaire. We first collected all reference images from the first study, comprising \(44\) and \(39\) design feedback entries created with C1 and C2, respectively. Each design feedback entry contains a textual comment and one (or multiple) reference image(s). We also captured the viewpoints of the initially designed 3D models for each reference image. Next, we randomly shuffled all design feedback entries, which were then divided into eight groups. All groups, except for the last one containing six entries, consist of \(11\) entries each. On average, each group comprises \(46.21\%\) (\(SD=9.54\%\)) design feedback created with MemoVis, and \(74.05\%\) (\(SD=9.21\%\)) design feedback contributed by different feedback provider participants from Study 1. We then assigned the eight resulting surveys to PD1–PD8 for evaluation. Participants were invited to rate each reference image on a \(5\)-point Likert scale, regarding how well the reference image conveys the gist of the textual comment. Participants were encouraged to provide textual justifications for each rating. Each questionnaire took approximately \(10\)–\(15\) min to complete.
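Purely for illustration, the sketch below shows one way to reproduce the shuffling and grouping described above; the feedback entries are placeholder tuples, and the fixed random seed is an assumption for reproducibility rather than part of our protocol.

import random

# Placeholder pool: 44 baseline (C1) entries + 39 MemoVis (C2) entries = 83 total.
feedback_pool = ([("baseline", i) for i in range(44)] +
                 [("memovis", i) for i in range(39)])

random.seed(42)  # assumed fixed seed, only so the split is reproducible here
random.shuffle(feedback_pool)

group_size, n_groups = 11, 8
# Seven groups of 11; the last group absorbs the remaining 6 entries.
groups = [feedback_pool[i * group_size:(i + 1) * group_size]
          for i in range(n_groups - 1)]
groups.append(feedback_pool[(n_groups - 1) * group_size:])

for pd_id, group in enumerate(groups, start=1):
    n_memovis = sum(1 for source, _ in group if source == "memovis")
    print(f"PD{pd_id}: {len(group)} entries, {n_memovis} created with MemoVis")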
Analysis. We analyzed the Likert-scale score for each reference image rated by one designer participant. Qualitatively, we also collected the textual comments that participants provided to justify their ratings, comprising \(33\) and \(31\) comments for design feedback created with the baseline and MemoVis, respectively. An inductive coding approach [21] was used to analyze participants’ qualitative responses.
Results and Discussions. Figure 12(a) shows the Likert rating for each feedback. We found that around \(66.67\%\) of the reference images created with MemoVis were rated positively, versus only around \(38.64\%\) of the images created with the baseline approach (Figure 12(b)).
Fig. 12.
Fig. 12. Survey responses of Study 2. (a) Participants’ response to each 3D design feedback, on a scale of \(1\) to \(5\), where \(1\) indicates strongly disagree and \(5\) indicates strongly agree; (b) cumulative analysis of the percentage of 3D design feedback for each Likert scale by the two interface conditions. Note that Figure 12(b) was generated based on the survey responses shown in Figure 12(a).
When evaluating reference images and feedback produced using MemoVis, participants liked the explicitness of the generated reference images, e.g., “clear and focused” (PD6), “the reference image perfectly captures all the elements mentioned in the design comments” (PD4), and “the image clearly shows what the text is trying to say” (PD1). Figure 14(a) shows an example of how PD1 believed that “it is easily to identify the new picture and the desired location” in the initial bedroom 3D model.
For the images produced in the baseline condition, comments from the designer participants point to a common drawback of mismatched contexts: the contextual difference between the reference images and the textual comments can cause confusion. For example, Figure 13(a) shows an example of the reference image created by PF10, where PD2 judged: “the structure between the bed and the floor can easily be mistaken for legs upon a cursory glance.” Figure 13(c) shows an example where both the viewpoint and the requested change in the reference image are dramatically different from the initial design. The designer participant did not feel fully confident in grasping the feedback: “the idea of a pet seat is clear, but it seems so different from the original image that it would be confusing” (PD1). We also found that some feedback providers attempted to reduce ambiguity by annotating on the initial design to highlight the changes (Figure 13(b)). This approach, however, is not as effective when the requested change is not explicit. In this case, the feedback provider was asking for a new material design on the car body. PD1 commented: “I’m not sure what the circles are trying to show.”
Fig. 13.
Fig. 13. Examples of 3D design feedback created by the baseline interfaces. The red traces indicate the annotations drawn by the feedback provider participant PF2.
Although most participants favored the reference images created by MemoVis, participants also highlighted a few setbacks of the MemoVis-generated images. We identified three key reasons from the designers’ perspective.
(1) Mismatched contexts. Similar to the baseline condition, occasionally, designer participants pointed out context mismatches, although these did not affect their understanding of the design feedback. Figure 14(b) shows an example of how the overall bedroom design was changed though the focus of the textual feedback is the bed, which might be caused by the failure of the grab’n go modifier. Despite this, PD5 noted: “even the background is a little different, the reference image still preserves the angle of the view, and the new environment setting.”
Fig. 14.
Fig. 14. Examples of 3D design feedback created by MemoVis. The participants strongly agreed, neither agree nor disagree, disagreed and strongly disagreed that the reference image from the 3D design feedback (a), (b), (c), and (d) can convey the gist of the textual comment, respectively.
(2) Missing details. While some reference images can reflect the suggested edits, designer participants believed that the missing details might cause misunderstanding. For example, regarding Figure 14(c), PD4 disagreed that the reference image can convey the textual comment because “while the display has been added as per the design comments, some panels have been removed. I don’t know if these removals align with the design comments.” Such confusion might result in incorrect translations of the design feedback into the final 3D model. While mismatched contexts were identified as common problems, designer participants did not highlight missing details for the reference images created by baselines.
(3) Complete failure. In very few cases, designer participants highlighted that the reference image can be entirely unsuccessful. For example, PD8 commented on the design feedback of Figure 14(d): “I don’t understand what is that white thing, doesn’t look like a curtain.” Such reference images might cause designers to misunderstand the gist of the design feedback, potentially necessitating further communication with the feedback providers.

6 Discussions

Having demonstrated the MemoVis system as an effective GenAI-powered tool for creating companion reference images for 3D design feedback, this section discusses the practical implications (Section 6.1) and future improvements (Section 6.2) informed by our explorations.

6.1 Practical Implications

Overall, our studies showed that MemoVis can assist feedback providers in efficiently creating companion reference images for 3D design feedback. The design of MemoVis validates the feasibility of using GenAI and VLFMs to support a simpler and more efficient workflow of asynchronous 3D design review. This section discusses key practical implications drawn from our findings.
Real-time viewpoint suggestions. To assist in creating reference images, MemoVis needs to first allow feedback providers to efficiently locate 3D camera viewpoints pertinent to the written textual comments. Our study showed that the real-time viewpoint suggestions can support this task by analyzing the written comment and suggesting semantically-relevant views in the 3D scene. The effectiveness of this feature was demonstrated by \(11\) feedback providers without proficient 3D software skills. Even participants with 3D skills, such as PF2, found this feature valuable when it comes to seeking relevant views on a potentially large-scale 3D model. While earlier studies suggested closed-form solutions for viewpoint selection based on area [80], silhouette [37, 94], and depth [22] attributes, along with their combinations [86], the relationships between the chosen view and textual semantics are often absent. MemoVis introduces a novel paradigm, enabling feedback providers to effortlessly identify relevant views for contextualizing the reference images they intend to create. Further, looking at the qualitative feedback from Section 5.1, a number of participants, exemplified by PF2, believed that viewpoint suggestion would be even more useful in larger-scale 3D models such as those used in video game design. While prior research, such as IsoCam [67], has explored the viability of employing a touch-based controller for navigating the camera in large projection setups, it is impractical to apply such complex hardware setups to the 3D design review workflow. Instead, MemoVis leverages VLFMs to infer what types of views the feedback providers might be interested in exploring, enabling feedback providers to spend less time maneuvering the viewing camera and more time on the main feedback writing task.
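To make this concrete, the sketch below illustrates the core idea of text-driven viewpoint scoring with an off-the-shelf CLIP model from the Hugging Face transformers library: candidate renders of the scene (assumed to be produced by the browser-side 3D explorer) are ranked by their similarity to the written comment. It is a simplified illustration, not the exact MemoVis pipeline.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def suggest_viewpoint(comment: str, candidate_renders: list) -> int:
    """Return the index of the candidate render that best matches the comment."""
    inputs = processor(text=[comment], images=candidate_renders,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds one similarity score per (render, comment) pair.
    scores = outputs.logits_per_image.squeeze(-1)
    return int(scores.argmax())

# Usage: renders would come from sampling camera poses around the 3D scene, e.g.,
# best_index = suggest_viewpoint("the reading lamp next to the bed", renders)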
Image modifiers. We highlighted the advantages of MemoVis’s rapid image creation workflow, with three types of image modifiers facilitating the creation of companion reference images while writing 3D design feedback. It is worth emphasizing that MemoVis is a review and feedback creation tool, not an image editing tool like the recent GenAI-powered Photoshop [11, 14] or an ideation tool like Vizcom [9]. Hence, advancing techniques to realize aesthetic and high-quality images are beyond our scope. Rather, our key focus is to prioritize the clarity of the reference images and their alignment with textual comments. Our results have shown the explicitness of the reference images created and the capability to maintain the context of the anchored viewpoint, confirmed by both feedback providers (Section 5.1) and designers (Section 5.2). Additionally, MemoVis’s approach can also help with creative thinking: the images generated with the modifiers can sometimes offer new ideas and inspirations for feedback writers. We also demonstrated that a larger number of participants found the reference image creation workflow using MemoVis easier compared to today’s methods involving image searching and/or sketching (Figure 9(a)). While the design of MemoVis is intended to encourage feedback providers to focus on the feedback typing task, we showed that it is still viable to request feedback providers to engage in simple rough sketching and painting directly on top of the 3D explorer. Our findings indicate that incorporating simple non-textual input modalities like sketching and painting will not significantly raise the interaction cost [24]. Instead, it gives feedback providers added control to ensure consistency of the reference image within the context of the initial design, as highlighted in our formative studies. In theory, this finding could be linked to Kohler’s recommendations [54] for generic user experience design: granting users autonomy through customization may enhance the sense of ownership, potentially improving the overall interaction experience. Nevertheless, dedicating time and resources to customization development could also elevate interaction costs, possibly diminishing the user experience [19, 24, 58, 73]. MemoVis presents an exemplary design that balances low interaction costs while affording feedback providers additional control in the creation of reference images.
Integration with the State-of-the-Art (SOTA) models. We demonstrated the feasibility of using the MemoVis system, powered by year-2023 pre-trained models, to assist feedback providers in efficiently creating reference images for 3D design feedback. With the continuous advancement of the SOTA performance of today’s VLFMs [28, 69], we believe our contribution toward designing a novel interaction workflow and experience for efficient 3D design review will continue to hold. We consider the underlying GenAI and VLFMs as the engineering primitives that drive the novel interaction experience, where feedback providers can efficiently create companion reference images for the feedback comments while focusing on text typing. Enhancements in the inference quality of recent text-to-image SOTA models could help generalize the practical applicability of MemoVis through potentially more photorealistic synthesized images [17], simpler and more intuitive prompts [42], as well as reduced inference latency [104]. For example, SOTA pipelines such as Promptist [42], which optimizes text-to-image GenAI prompts, could potentially reduce the failures when feedback providers write low-quality prompts. Other 3D-related GenAI SOTA pipelines, like InseRF [87], which enables text-driven 3D object insertion, might also be integrated to enhance features for feedback providers with prior 3D experience, enabling deeper exploration.
Integration with real-world workplace applications. The interaction designs of MemoVis can be integrated into larger workplace applications as lightweight plugins. Feedback providers could focus on their primary task, feedback typing, instead of editing images or 3D models (Section 4). Such MemoVis-style plugins would enable feedback providers to inspect 3D models, create and visualize textual feedback comments, and send them to designers just like memo notes. Such an interaction experience would be simple and fluid [34], facilitating smooth discussion and design collaboration between feedback providers and designers by encouraging feedback providers to focus on thinking and typing textual comments. For example, MemoVis could be integrated as an add-on for Gmail, allowing feedback providers to effortlessly create accompanying reference images while typing textual comments within an email. MemoVis could also be implemented as a plugin for iMessage, with which feedback providers could conveniently examine 3D models and create reference images for textual comments while engaging in text conversations with their designers, making the process as straightforward as creating a memoji [47]. Although MemoVis was contextualized in an asynchronous 3D design review workflow, such a plugin could also expedite the creation of reference images during synchronous conversations in Instant Messaging (IM) applications, effectively minimizing the duration of silence [51, 61]. Beyond supporting 3D design feedback, MemoVis can be generalized to broader multi-modal GenAI-based applications that require efficient visualization of text. For example, MemoVis could be integrated with existing collaborative writing tools like Notion; its capability of efficiently visualizing textual content shows promise in enhancing synchronous discussions without disrupting the flow of conversation.

6.2 Improving MemoVis

After exploring the practical implications of MemoVis, this section explores potential key directions for future improvements, drawing insights from our findings.
Boosting AI inferences with human feedback. In Study 1, we observed that feedback providers might need to adjust the viewpoint based on MemoVis’s suggestion and/or make multiple attempts to create reference images that convey the gist of the design feedback. One future direction is to understand how we could leverage the behavioral actions of feedback providers to boost future AI inferences. Similar ideas have been used successfully in many large language model applications, through designing prompts with few-shot learning [2] and integrating strategies of reinforcement learning from human feedback [109]. For example, instead of searching possible viewpoints solely based on the pre-trained CLIP model [3], MemoVis might incorporate past view preferences from the feedback providers to find the viewpoint with which the feedback providers most likely want to anchor the textual comments. Realizing this may require the design of an effective cost function, with respect to CLIP inferences and the preferences of feedback providers, that MemoVis could optimize.
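As one illustration of such a cost function, the sketch below blends normalized CLIP similarity with a simple preference term derived from previously accepted camera poses; the weighting scheme and the nearest-accepted-pose heuristic are assumptions for discussion, not a validated design.

import numpy as np

def viewpoint_cost_score(clip_similarity, candidate_poses, accepted_poses, alpha=0.7):
    """Score candidate viewpoints (higher is better).

    clip_similarity: (N,) CLIP similarity of each candidate render to the comment.
    candidate_poses: (N, 6) camera poses (position + orientation) of the candidates.
    accepted_poses:  (M, 6) poses the feedback provider accepted previously.
    alpha:           assumed weight trading off semantics vs. learned preference.
    """
    # Normalize CLIP scores to [0, 1] so the two terms are comparable.
    sim = (clip_similarity - clip_similarity.min()) / (np.ptp(clip_similarity) + 1e-8)
    if len(accepted_poses) == 0:
        return sim  # no interaction history yet: fall back to pure CLIP ranking
    # Preference term: proximity of each candidate to its nearest accepted pose.
    dists = np.linalg.norm(candidate_poses[:, None, :] - accepted_poses[None, :, :], axis=-1)
    preference = np.exp(-dists.min(axis=1))  # in (0, 1], larger = closer to past choices
    return alpha * sim + (1 - alpha) * preference

# Usage: suggest candidate_poses[np.argmax(viewpoint_cost_score(...))] to the user.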
Eliminating manual prompt writing toward a simple and fluid reference image creation experience. MemoVis requires feedback providers to create additional prompts for creating reference images, which can be redundant. Sometimes, prompts may require feedback providers to include additional context beyond the immediate objects of focus. Although most participants were satisfied with the experience of using MemoVis to create companion reference images for textual feedback, participants occasionally found it difficult (PF5) or tedious (PF12) to write additional prompts, or were confused about the difference between feedback comments and prompts (PF11). While prior research, e.g., [68], has demonstrated novel prompt optimization algorithms for vanilla text-to-image GenAI, we opted not to incorporate this feature into the current implementation due to the lack of validation on creating photorealistic images using conditioned text-to-image GenAI (Section 2.2). Additionally, granting feedback providers the flexibility to write prompts also enables them to iteratively enhance the prompts when the reference images are less than optimal. Although Liu et al. [65] have discussed key guidelines for crafting text-to-image prompts, future work might explore how to optimize text-to-image prompts for conditioned text-to-image GenAI, and how to integrate broader contexts from the written feedback, without the laborious trial-and-error process for creating prompts highlighted by PF10. While advancing techniques for prompt creation and exploration is beyond our scope, future work may explore the feasibility of integrating interactive prompt engineering techniques [38, 99] to assist feedback providers in streamlining the feedback creation process while allowing exploration of the nuances of prompt creation. Ultimately, we envision MemoVis being able to automatically generate the prompts for text-to-image GenAI without the awareness of feedback providers, leading to a simple and fluid interaction experience [34], where candidate reference images could be responsively created and updated as the feedback providers type the textual comments.
Integrating GenAI with searching. While our research on MemoVis shows how text-to-image GenAI can be a potential path to enhance the 3D design review workflow, Study 1 indicated that GenAI alone might be insufficient. Although most participants were satisfied with the experience of using MemoVis to create reference images for textual comments, a few participants (e.g., PF6) emphasized the opportunity to integrate, rather than replace, internet search and hand annotation approaches with MemoVis. While MemoVis is helpful when feedback providers do not have a specific design in mind, a few participants (e.g., PF4, PF11) emphasized the usefulness of online images when they have a specific design suggestion in mind. With these observations, a compelling research path naturally emerges: how could we weave today’s image searching approaches into MemoVis’s pipeline? This direction is similar to recent works like GenQuery [89], which showed how to integrate GenAI and search to help instantiate designers’ early-stage abstract ideas, and DesignAID [26], which emphasized the importance of “augmenting humans rather than replacing them” while attempting to use GenAI to explore visual design spaces. However, reviewing and providing feedback for 3D design is fundamentally different from the 2D graphic ideation workflow, due to the complexities of 3D models and the convoluted thinking process feedback providers go through while attempting to create reference images (Section 5.1). Future work might consider how to integrate existing search engines into MemoVis, opening new paradigms of efficiently co-visualizing textual feedback alongside human interactions, internet searches, and GenAI. For example, despite the imperfections and the potential risk of causing misunderstandings, our results indicate that a few feedback providers still prefer to use online images tailored to their specific design needs. While tools like Photoshop [11, 14] enable feedback providers to edit online images, such workflows are often not streamlined, and are time-consuming and inefficient for those lacking image editing skills. Although MemoVis empowers feedback providers to use image modifiers to create reference images primarily from text, future work may further explore how these image modifiers can be extended to integrate the context of searched image(s).

7 Limitations

Having demonstrated the promise of MemoVis, we also acknowledge multiple key limitations with respect to system design and evaluation.
Inference latency. MemoVis took around \(30\) seconds to generate a synthesized image. While MemoVis is asynchronous, allowing feedback providers to continue exploring the 3D model in the meantime, we found that most participants would still prefer a shorter wait time for a more streamlined feedback creation workflow. Although reducing latency is beyond our scope, some participants (e.g., PF6) considered it a critical setback, from a software engineering perspective, compared to internet searching. We speculate that future advances in SOTA GenAI and GPU parallel computing research will help overcome this limitation.
Participants. Our first formative study was conducted with only two participants, as it required individuals with extensive design experience. The evaluation studies were conducted with only \(14\) feedback providers and eight designer participants, due to the lengthy duration of Study 1 and limited resources for Study 2, which required participants to have prior 3D design experience. While our results were mainly qualitative, future work might further explore the usability of MemoVis with more participants from different backgrounds, to minimize the bias from the recruited participants. For instance, in our study tasks, the recruited feedback provider participants can only represent clients engaged in the workflow of design feedback creation. Future research may recruit feedback provider participants who can represent other types of feedback providers, such as collaborators and managers.
Evaluation conditions and tasks. First, our studies were only based on a bedroom and a car model (Appendix F). Future researchers might explore the generalizability of our system to a wider range of 3D models. As discussed in Section 6.1, and as speculated by PF2, the viewpoint suggestions could be “much helpful” for a larger scene; future work might therefore also investigate how MemoVis could help feedback providers navigate and create reference images for larger 3D environment models. Second, the baseline condition in Study 1 (Section 5.1) required participants to create reference images using searching and/or hand sketching. Although our formative studies suggest this simulates the prevailing practices of most participants in creating reference images for 3D design feedback, future research might compare MemoVis with a broader range of GenAI-powered baselines, such as creating reference images using vanilla text-to-image GenAI tools [10] and professional image editing software [11].
Evaluations in an ecologically valid 3D design review workflow. Despite the potential of integrating MemoVis as part of larger collaborative design workflow and workplace applications (Section 6.1), our current evaluation was based on a monolithic browser-based application in a controlled laboratory setting. Future research might deploy MemoVis and investigate the user experience in a realistic real-world 3D design review workflow. One future direction might investigate the feasibility and effectiveness of MemoVis, after being integrated and deployed into today’s mainstream workplace and IM applications such as Gmail and iMessage (Section 6.1).

8 Conclusion

We designed and evaluated MemoVis, a browser-based text editor interface that assists feedback providers in easily creating companion reference images for textual 3D design comments. MemoVis integrates several AI tools to enable a novel 3D review workflow, where users can quickly locate relevant design context in 3D and synthesize images to illustrate their ideas. A within-subjects study with \(14\) feedback providers demonstrated the effectiveness of MemoVis. The quality and explicitness of the companion images were evaluated by another eight participants with prior 3D design experience.

Acknowledgments

We thank the insightful feedback from the anonymous reviewers. We appreciate the discussions with fellow researchers from Adobe Research, including Mira Dontcheva, Anh Truong, Joy O. Kim, and Zongze Wu, as well as Rima Cao from UC San Diego. We thank Zeyu Jin for providing the text-to-voice GenAI pipeline to synthesize narrations in the companion videos.

Footnote

1
An inference example using depth- and scribble-conditioned ControlNet could be referred to Figure B1 in Appendix B.

Supplemental Material

SRT File - preview.srt
Subtitle file of the 30-second preview video.
MP4 File - preview.mp4
A 30-second preview video.
MP4 File - teaser.mp4
A 5-minute teaser video.
SRT File - teaser.srt
Subtitle file of the 5-minute teaser video.

References

[1]
Google. 2015. DeepDream - A Code Example for Visualizing Neural Networks. Retrieved August 8, 2023 from https://ai.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html
[2]
OpenAI. 2020. Language Models are Few-shot Learners. Retrieved January 9, 2023 from https://openai.com/research/language-models-are-few-shot-learners
[3]
OpenAI. 2021. CLIP: Connecting Text and Images. Retrieved December 17, 2023 from https://openai.com/research/clip
[4]
Midjourney. 2022. Retrieved August 8, 2023 from https://www.midjourney.com
[5]
Babylon.js. 2023. Retrieved December 17, 2023 from https://www.babylonjs.com
[6]
Lexica. 2023. Retrieved December 17, 2023 from https://lexica.art
[7]
Polycount. 2023. Retrieved December 17, 2023 from https://polycount.com
[8]
Polycount. 2023. 3D Arts Showcases and Critiques. Retrieved January 27, 2024 from https://polycount.com/categories/3d-art-showcase-critiques
[9]
Vizcom. 2023. The Next Generation of Product Visualization. Retrieved December 17, 2023 from https://www.vizcom.ai
[10]
Adobe. 2023. Adobe Firefly. Retrieved December 17, 2023 from https://www.adobe.com/products/firefly.html
[11]
Adobe. 2023. Adobe Photoshop. Retrieved December 17, 2023 from https://www.adobe.com/products/photoshop.html
[12]
Adobe. 2023. Generative Fill Feature from Adobe Photoshop. Retrieved December 17, 2023 from https://www.adobe.com/products/photoshop/generative-fill.html
[13]
Adobe. 2023. How to Use Lasso Tool in Adobe Photoshop. Retrieved December 17, 2023 from https://www.adobe.com/products/photoshop/lasso-tool.html
[14]
Adobe. 2024. Tap into the Power of AI Photo Editing. Retrieved August 4, 2024 from https://www.adobe.com/products/photoshop/ai.html
[15]
Stable Diffusion Art. 2023. How to Remove Undesirable Objects with AI Inpainting. Retrieved December 17, 2023 from https://stable-diffusion-art.com/how-to-remove-a-person-with-ai-inpainting/
[16]
Autodesk. 2023. Add Annotation - Autodesk Viewer Guide. Retrieved August 4, 2024 from https://help.autodesk.com/view/adskviewer/enu/?guid=ADSKVIEWER_Help_AddAnnotations_html
[17]
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. 2023. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv:2211.01324. Retrieved from https://arxiv.org/abs/2211.01324
[18]
Prashant Barnawal, Michael C. Dorneich, Matthew C. Frank, and Frank Peters. 2017. Evaluation of Design Feedback Modality in Design for Manufacturability. Journal of Mechanical Design 139, 9 (2017), 094503. DOI:
[19]
Dan Bennett, Oussama Metatla, Anne Roudaut, and Elisa D. Mekler. 2023. How Does HCI Understand Human Agency and Autonomy?. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). ACM, New York, NY, Article 375, 18 pages. DOI:
[20]
Shariq Farooq Bhat, Niloy J. Mitra, and Peter Wonka. 2023. LooseControl: Lifting controlnet for generalized depth conditioning. arXiv:2312.03079. Retrieved from https://doi.org/10.48550/arXiv.2312.03079
[21]
Andrea J. Bingham and Patricia Witkowsky. 2022. Deductive and inductive approaches to qualitative data analysis. In Analyzing and Interpreting Qualitative Data: After the Interview. C. Vanover, P. Mihas, and J. Saldaña (Eds.), SAGE Publications, 133–146.
[22]
Volker Blanz, Michael J. Tarr, and Heinrich H. Bülthoff. 1999. What Object Attributes Determine Canonical Views? Perception 28, 5 (1999), 575–599. DOI:
[23]
Virginia Braun and Victoria Clarke. 2012. Thematic Analysis: A Practical Guide. American Psychological Association. DOI:
[24]
Raluca Budiu. 2013. Interaction Cost. Retrieved January 16, 2023 from https://www.nngroup.com/articles/interaction-cost-definition
[25]
Nicolas Burtnyk, Azam Khan, George Fitzmaurice, and Gordon Kurtenbach. 2006. ShowMotion: Camera Motion based 3D Design Review. In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games (I3D ’06). ACM, New York, NY, 167–174. DOI:
[26]
Alice Cai, Steven R. Rick, Jennifer L. Heyman, Yanxia Zhang, Alexandre Filipowicz, Matthew Hong, Matt Klenk, and Thomas Malone. 2023. DesignAID: Using Generative AI and Semantic Diversity for Design Inspiration. In Proceedings of the ACM Collective Intelligence Conference (CI ’23). ACM, New York, NY, 1–11. DOI:
[27]
Computer Careers. 2023. Is 3D Modeling Hard? And Other Things You Need To Know. Retrieved January 26, 2024 from https://www.computercareers.org/is-3d-modeling-hard/
[28]
Xiang ’Anthony’ Chen, Jeff Burke, Ruofei Du, Matthew K. Hong, Jennifer Jacobs, Philippe Laban, Dingzeyu Li, Nanyun Peng, Karl D. D. Willis, Chien-Sheng Wu, and Bolei Zhou. 2023. Next steps for human-centered generative AI: A technical perspective. arXiv:2306.15774. Retrieved from https://doi.org/10.48550/arXiv.2306.15774
[29]
DaEun Choi, Sumin Hong, Jeongeon Park, John Joon Young Chung, and Juho Kim. 2024. CreativeConnect: Supporting Reference Recombination for Graphic Design Ideation with Generative AI. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM, New York, NY, Article 1055, 25 pages. DOI:
[30]
Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a Similarity Metric Discriminatively, with Application to Face Verification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, 539–546. DOI:
[31]
H. Cohen. 1988. Statistical Power Analysis for Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ.
[32]
Obumneme Stanley Dukor, S. Mahdi H. Miangoleh, Mahesh Kumar Krishna Reddy, Long Mai, and Yaği̇z Aksoy. 2022. Interactive Editing of Monocular Depth. In Proceedings of the ACM SIGGRAPH 2022 Posters (SIGGRAPH ’22). ACM, New York, NY, Article 52, 2 pages. DOI:
[33]
Matthew W. Easterday, Vincent Aleven, and Richard Scheines. 2007. ’Tis Better to Construct than to Receive? The Effects of Diagram Tools on Causal Reasoning. In Proceedings of the Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work. IOS Press, 93–100.
[34]
Niklas Elmqvist, Andrew Vande Moere, Hans-Christian Jetter, Daniel Cernea, Harald Reiterer, and TJ Jankun-Kelly. 2011. Fluid Interaction for Information Visualization. Information Visualization 10, 4 (2011), 327–340. DOI:
[35]
Noyan Evirgen and Xiang ’Anthony’ Chen. 2022. GANzilla: User-Driven Direction Discovery in Generative Adversarial Networks. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (UIST ’22). ACM, New York, NY, Article 75, 10 pages. DOI:
[36]
Noyan Evirgen and Xiang ’Anthony Chen. 2023. GANravel: User-Driven Direction Disentanglement in Generative Adversarial Networks. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). ACM, New York, NY, Article 19, 15 pages. DOI:
[37]
Jacob Feldman and Manish Singh. 2005. Information along Contours and Object Boundaries. Psychological Review 112, 1 (2005), 243.
[38]
Yingchaojie Feng, Xingbo Wang, Kam Kwai Wong, Sijia Wang, Yuhong Lu, Minfeng Zhu, Baicheng Wang, and Wei Chen. 2024. PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation. IEEE Transactions on Visualization and Computer Graphics 30, 1 (2024), 295–305. DOI:
[39]
Sarah Gibbons. 2016. Design Critiques: Encourage a Positive Culture to Improve Products. Retrieved August 2, 2023 from https://www.nngroup.com/articles/design-critiques
[40]
Ellen R. Girden. 1992. ANOVA: Repeated Measures. Vol. 84. Sage University Paper Series.
[41]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. arXiv:1406.2661. Retrieved from https://doi.org/10.48550/arXiv.1406.2661
[42]
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. 2023. Optimizing prompts for text-to-image generation. arXiv:2212.09611. Retrieved from https://doi.org/10.48550/arXiv.2212.09611
[43]
Simon Hentschel, Konstantin Kobs, and Andreas Hotho. 2022. CLIP Knows Image Aesthetics. Frontiers in Artificial Intelligence 5 (2022), 976235. DOI:
[44]
Scarlett R. Herring, Chia-Chen Chang, Jesse Krantzler, and Brian P. Bailey. 2009. Getting Inspired! Understanding How and Why Examples are Used in Creative Design Practice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’09). ACM, New York, NY, 87–96. DOI:
[45]
Josh Holinaty, Alec Jacobson, and Fanny Chevalier. 2021. Supporting Reference Imagery for Digital Drawing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2434–2442. DOI:
[46]
Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. 2023. Tag2Text: Guiding vision-language model via image tagging. arXiv:2303.05657. Retrieved from https://doi.org/10.48550/arXiv.2303.05657
[47]
Apple Inc. 2024. Use Memoji on your iPhone or iPad Pro. Retrieved January 16, 2023 from https://support.apple.com/en-us/111115
[48]
Hyeonsu B. Kang, Gabriel Amoako, Neil Sengupta, and Steven P. Dow. 2018. Paragon: An Online Gallery for Enhancing Design Feedback with Visual Examples. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, 1–13. DOI:
[49]
Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–4410.
[50]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of Stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8110–8119.
[51]
Chang Min Kim, Hyeon-Beom Yi, Ji-Won Nam, and Geehyuk Lee. 2017. Applying Real-Time Text on Instant Messaging for a Rapid and Enriched Conversation Experience. In Proceedings of the Conference on Designing Interactive Systems (DIS ’17). ACM, New York, NY, 625–629. DOI:
[52]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment anything. arXiv:2304.02643. Retrieved from https://doi.org/10.48550/arXiv.2304.02643
[53]
Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. 2023. Large-Scale Text-to-Image Generation Models for Visual Artists’ Creative Works. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI ’23). ACM, New York, NY, 919–933. DOI:
[54]
Tanner Kohler. 2022. Three Methods to Increase User Autonomy in UX Design. Retrieved January 16, 2023 from https://www.nngroup.com/articles/increase-user-autonomy
[55]
Yuki Koyama, Issei Sato, and Masataka Goto. 2020. Sequential Gallery for Interactive Visual Design Optimization. ACM Transactions on Graphics 39, 4 (Aug 2020), Article 88, 12 pages. DOI: https://doi.org/10.1145/3386569.3392444
[56]
Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. 2017. Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Transactions on Graphics 36, 4 (Jul 2017), Article 48, 11 pages. DOI:
[57]
Markus Krause, Tom Garncarz, JiaoJiao Song, Elizabeth M. Gerber, Brian P. Bailey, and Steven P. Dow. 2017. Critique Style Guide: Improving Crowdsourced Design Feedback with a Natural Language Model. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’17). ACM, New York, NY, 4627–4639. DOI:
[58]
Heidi Lam. 2008. A Framework of Interaction Costs in Information Visualization. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1149–1156. DOI:
[59]
Tomas Lawton, Francisco J. Ibarrola, Dan Ventura, and Kazjon Grace. 2023. Drawing with Reframer: Emergence and Control in Co-Creative AI. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI ’23). ACM, New York, NY, 264–277. DOI:
[60]
Seung Won Lee, Tae Hee Jo, Semin Jin, Jiin Choi, Kyungwon Yun, Sergio Bromberg, Seonghoon Ban, and Kyung Hoon Hyun. 2024. The Impact of Sketch-guided vs. Prompt-guided 3D Generative AIs on the Design Exploration Process. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM, New York, NY, Article 1057, 18 pages. DOI:
[61]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597. Retrieved from https://doi.org/10.48550/arXiv.2301.12597
[62]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv:2201.12086. Retrieved from https://doi.org/10.48550/arXiv.2201.12086
[63]
Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler. 2021. EditGAN: High-Precision Semantic Image Editing. In Advances in Neural Information Processing Systems. M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34, Curran Associates, Inc., 16331–16345. DOI:
[64]
Julie S. Linsey, Emily F. Clauss, Tolga Kurtoglu, Jeremy T. Murphy, Kristin L. Wood, and Arthur B. Markman. 2011. An Experimental Study of Group Idea Generation Techniques: Understanding the Roles of Idea Representation and Viewing Methods. Journal of Mechanical Design 133, 3 (2011), 031008. DOI:
[65]
Vivian Liu and Lydia B. Chilton. 2022. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’22). ACM, New York, NY, Article 384, 23 pages. DOI:
[66]
Vivian Liu, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 2023. 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows. In Proceedings of the 2023 ACM Designing Interactive Systems Conference (DIS ’23). ACM, New York, NY, 1955–1977. DOI:
[67]
Fabio Marton, Marcos Balsa Rodriguez, Fabio Bettio, Marco Agus, Alberto Jaspe Villanueva, and Enrico Gobbetti. 2014. IsoCam: Interactive Visual Exploration of Massive Cultural Heritage Models on Large Projection Setups. Journal on Computing and Cultural Heritage 7, 2 (June 2014), Article 12, 24 pages. DOI:
[68]
Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. 2024. Improving text-to-image consistency via automatic prompt optimization. arXiv:2403.17804. Retrieved from https://doi.org/10.48550/arXiv.2403.17804
[69]
Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. 2023. Levels of AGI: Operationalizing progress on the path to AGI. arXiv:2311.02462. Retrieved from https://doi.org/10.48550/arXiv.2311.02462
[70]
Michael D. Murray. 2023. Generative and AI Authored Artworks and Copyright Law. Hastings Communications and Entertainment Law Journal 45 (2023), 27.
[71]
Cuong Nguyen, Stephen DiVerdi, Aaron Hertzmann, and Feng Liu. 2017. CollaVR: Collaborative In-headset Review for VR Video. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST ’17). ACM, New York, NY, 267–277. DOI:
[72]
Bernard A. Nijstad and Wolfgang Stroebe. 2006. How the Group Affects the Mind: A Cognitive Model of Idea Generation in Groups. Personality and Social Psychology Review 10, 3 (2006), 186–213. DOI:
[73]
Don Norman. 2013. The Design of Everyday Things. Basic Books.
[74]
Jeongseok Oh, Seungju Kim, and Seungjun Kim. 2024. LumiMood: A Creativity Support Tool for Designing the Mood of a 3D Scene. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM, New York, NY, Article 174, 21 pages. DOI:
[75]
OpenAI. 2023. Blender Copilot (Blender GPT). Retrieved December 18, 2023 from https://blendermarket.com/products/blender-copilot-blendergpt
[76]
OpenAI. 2023. DALL.E 3. Retrieved December 17, 2023 from https://openai.com/dall-e-3
[77]
Jahna Otterbacher, Alessandro Checco, Gianluca Demartini, and Paul Clough. 2018. Investigating User Perception of Gender Bias in Image Search: The Role of Sexism. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). ACM, New York, NY, 933–936. DOI:
[78]
Amy Pavel, Dan B. Goldman, Björn Hartmann, and Maneesh Agrawala. 2016. VidCrit: Video-Based Asynchronous Video Review. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST ’16). ACM, New York, NY, 517–528. DOI:
[79]
Peter Pirolli, Patricia Schank, Marti Hearst, and Christine Diehl. 1996. Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’96). ACM, New York, NY, 213–220. DOI:
[80]
Dimitri Plemenos and Madjid Benayada. 1996. Intelligent Display in Scene Modelling. New Techniques to Automatically Compute Good Views. In Proceedings of the International Conference on GraphiCon ’96.
[81]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. arXiv:2103.00020. Retrieved from https://doi.org/10.48550/arXiv.2103.00020
[82]
Shwetha Rajaram, Nels Numan, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D. Wilson. 2024. BlendScape: Enabling unified and personalized video-conferencing environments through generative AI. arXiv:2403.13947. Retrieved from https://doi.org/10.48550/arXiv.2403.13947
[83]
David A. Robb, Stefano Padilla, Britta Kalkreuter, and Mike J. Chantler. 2015. Crowdsourced Feedback with Imagery Rather Than Text: Would Designers Use It?. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, 1355–1364. DOI:
[84]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752. Retrieved from https://doi.org/10.48550/arXiv.2112.10752
[85]
Pamela Samuelson. 2023. Generative AI Meets Copyright. Science 381, 6654 (2023), 158–161. DOI:
[86]
Adrian Secord, Jingwan Lu, Adam Finkelstein, Manish Singh, and Andrew Nealen. 2011. Perceptual Models of Viewpoint Preference. ACM Transactions on Graphics 30, 5 (Oct 2011), Article 109, 12 pages. DOI:
[87]
Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, and Federico Tombari. 2024. InseRF: Text-driven generative object insertion in neural 3D scenes. arXiv:2401.05335. Retrieved from https://doi.org/10.48550/arXiv.2401.05335
[88]
S. S. Shapiro and M. B. Wilk. 1965. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 3–4 (Dec 1965), 591–611. DOI:
[89]
Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, and Juho Kim. 2023. GenQuery: Supporting expressive visual search with generative models. arXiv:2310.01287. Retrieved from https://arxiv.org/abs/2310.01287
[90]
Hyunyoung Song, François Guimbretière, and Hod Lipson. 2009. The ModelCraft Framework: Capturing Freehand Annotations and Edits to Facilitate the 3D Model Design Process Using a Digital Pen. ACM Transactions on Computer-Human Interaction 16, 3 (Sep 2009), Article 14, 33 pages. DOI:
[91]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. arXiv:1409.4842. Retrieved from https://doi.org/10.48550/arXiv.1409.4842
[92]
Promatics Technologies. 2017. An Overview of Asynchronous Design Feedback and Its Benefits. Retrieved December 16, 2023 from https://medium.com/@promatics/22a6b97b33f0
[93]
Autodesk TinkerCAD. 2020. Annotate Tinkercad Designs with 3D Notes. Retrieved August 7, 2023 from https://www.tinkercad.com/blog/annotate-tinkercad-designs-with-3d-notes
[94]
Thales Vieira, Alex Bordignon, Adelailson Peixoto, Geovan Tavares, Hélio Lopes, Luiz Velho, and Thomas Lewiner. 2009. Learning Good Views through Intelligent Galleries. In Computer Graphics Forum, Vol. 28, Wiley Online Library, 717–726. DOI:
[95]
Henrik Voigt, Jan Hombeck, Monique Meuschke, Kai Lawonn, and Sina Zarrieß. 2023. Paparazzi: A deep dive into the capabilities of language and vision models for grounding viewpoint descriptions. arXiv:2302.10282. Retrieved from https://doi.org/10.48550/arXiv.2302.10282
[96]
Qian Wan and Zhicong Lu. 2023. GANCollage: A GAN-Driven Digital Mood Board to Facilitate Ideation in Creativity Support. In Proceedings of the 2023 ACM Designing Interactive Systems Conference (DIS ’23). ACM, New York, NY, 136–146. DOI:
[97]
Da Wang and Ji Han. 2023. Exploring the Impact of Generative Stimuli on the Creativity of Designers in Combinational Design. Proceedings of the Design Society 3 (2023), 1805–1814. DOI:
[98]
Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. 2023. InstructEdit: Improving automatic masks for diffusion-based image editing with user instructions. arXiv:2305.18047. Retrieved from https://doi.org/10.48550/arXiv.2305.18047
[99]
Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. 2024. PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM, New York, NY, Article 185, 21 pages. DOI:
[100]
Jeremy Warner, Amy Pavel, Tonya Nguyen, Maneesh Agrawala, and Björn Hartmann. 2023. SlideSpecs: Automatic and Interactive Presentation Feedback Collation. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI ’23). ACM, New York, NY, 695–709. DOI:
[101]
Josef Wolfartsberger. 2019. Analyzing the Potential of Virtual Reality for Engineering Design Review. Automation in Construction 104 (2019), 27–37. DOI:
[102]
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv:2303.04671. Retrieved from https://doi.org/10.48550/arXiv.2303.04671
[103]
Saining Xie and Zhuowen Tu. 2015. Holistically-Nested Edge Detection. In Proceedings of IEEE International Conference on Computer Vision. DOI:
[104]
Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, and Hongbo Zhang. 2023. Efficient quantization strategies for latent diffusion models. arXiv:2312.05431. Retrieved from https://doi.org/10.48550/arXiv.2312.05431
[105]
Enhao Zhang and Nikola Banovic. 2021. Method for Exploring Generative Adversarial Networks (GANs) via Automatically Generated Image Galleries. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’21). ACM, New York, NY, Article 76, 15 pages. DOI:
[106]
Lvmin Zhang and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. In IEEE International Conference on Computer Vision (ICCV), 3836–3847. DOI:
[107]
Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, and Lei Zhang. 2023. Recognize anything: A strong image tagging model. arXiv:2306.03514. Retrieved from https://doi.org/10.48550/arXiv.2306.03514
[108]
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K. Wong. 2023. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 11127–11150. DOI:
[109]
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. Fine-tuning language models from human preferences. arXiv:1909.08593v2. Retrieved from https://doi.org/10.48550/arXiv.1909.08593

A Ethical and Copyright Disclaimer

This work was approved by the Institutional Review Board. All Personally Identifiable Information has been removed. Prior to each user study, we walked through the approved informed consent form with all participants and obtained their consent for video, audio, and screen recordings. While monetary incentives were not provided, all participants were given the opportunity to try out state-of-the-art GenAI technologies at the time of writing and to learn more about our research. For research and demonstration purposes, we used several internet-searched images in Figures 2 and 14; the copyrights of these images belong to the original content creators. While the copyright status of images synthesized by GenAI remains an open research question [70, 85], we do not claim copyright over any GenAI-synthesized images. These images are used only for research and demonstration purposes.

B Examples of ControlNet with Depth and Scribble Conditions

The design of MemoVis leveraged the 2023 pre-trained ControlNet models conditioned on depth and scribble [106]. This section provides a supplementary example demonstrating how depth-conditioned ControlNet can better anchor the generated image to the original design. Figures B1(c) and B1(e) show how the synthesized images look when guided by the textual prompt “a red car driving on the freeway.” Figures B1(b) and B1(d) show the scribble inferred by the HED annotator [103] and the depth map of the 3D model, respectively.
Fig. B1.
Fig. B1. Examples of synthesized images created by ControlNet conditioned on scribble and depth. (a) The viewpoint of the initial 3D design; (b) the scribble inferred from (a); (c) the synthesized image yielded by ControlNet conditioned on scribble; (d) the depth map captured by the orbit camera at (a); (e) the synthesized image yielded by ControlNet conditioned on depth. For both conditions, we used the prompt “a red car driving on the freeway.”
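To make the conditioning setup concrete, the sketch below shows how the two conditioning inputs in Figure B1 could be produced: the scribble condition is inferred from the rendered viewpoint with the HED annotator [103], while the depth condition comes from the renderer’s depth buffer rather than from image-based depth estimation. This is a minimal sketch using the open-source controlnet_aux package, not MemoVis’s exact implementation; the file names and the render_depth_map helper are hypothetical.

# Minimal sketch: producing the scribble and depth conditions for Figure B1.
# `viewpoint.png` and `render_depth_map()` are hypothetical placeholders.
from PIL import Image
from controlnet_aux import HEDdetector

view = Image.open("viewpoint.png").convert("RGB")

# Scribble condition: holistically-nested edges inferred from the render [103].
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
scribble = hed(view, scribble=True)
scribble.save("condition_scribble.png")

# Depth condition: read from the 3D scene's depth buffer for the same camera.
# depth = render_depth_map(scene, camera)   # hypothetical helper
# depth.save("condition_depth.png")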

C Supplementary Algorithm Design

Algorithm C1 shows how residual pixels are removed when feedback providers add new objects with the text \(+\) scribble and grab’n go modifiers. Full system design details can be found in Section 4.2.
Algorithm C1 Approximate Initial Design Image without Hidden Objects
1: function SelectMeshPrimitives(\(\boldsymbol{I}_{seg}\), \(\boldsymbol{v}\), \(r_{th}=0.7\))
2:  \(cam=\) new \(OrbitCamera()\)
3:  \(cam.setViewAngle(\boldsymbol{v})\)
4:  \(scene.render()\)
5:  \(hits\leftarrow\) new \(Set()\)
6:  for sampled pixels \((r,c)\) in \(\boldsymbol{I}_{seg}\) do
7:   if \(\boldsymbol{I}_{seg}(r,c)==0\) then
8:    continue
9:   end if
10:   \(mesh\leftarrow scene.raycast(r,c,cam)\)
11:   if \(mesh!=null\) then
12:    \(hits.add(mesh)\)
13:   end if
14:  end for
15:  for \(mesh\) in \(hits\) do
16:   \(DepthTextureRenderer.renderList\leftarrow[mesh]\)
17:   \(DepthTextureRenderer.render()\)
18:   \(depth\leftarrow DepthTextureRenderer.getImage()\)
19:   \(\boldsymbol{I}_{mesh}\leftarrow(depth<depth.max())\)
20:   \(r\leftarrow sum({\boldsymbol{I}_{mesh}\cap\boldsymbol{I}_{seg}})/sum(\boldsymbol{I}_{mesh})\)
21:   if \(r\leq r_{th}\) then
22:    \(hits.remove(mesh)\)
23:   end if
24:  end for
25:  return \(hits\)
26: end function
27: function GetInitialImage(\(\boldsymbol{I}_{seg}\), \(\boldsymbol{v}\), \(r_{th}=0.5\))
28:  \(meshes\leftarrow SelectMeshPrimitives(\boldsymbol{I}_{seg},\boldsymbol{v},r_{th}=0.5)\)
29:  \(RGBTextureRenderer.renderList\leftarrow meshes.toList()\)
30:  \(RGBTextureRenderer.render()\)
31:  return \(RGBTextureRenderer.getImage()\)
32: end function
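As a complement to the pseudocode, the NumPy sketch below illustrates the per-mesh coverage test on lines 15–24 of Algorithm C1: each candidate mesh’s silhouette \(\boldsymbol{I}_{mesh}\) is recovered from a single-mesh depth render, and the mesh is kept only if a sufficient fraction of that silhouette overlaps the segmentation mask \(\boldsymbol{I}_{seg}\). The render_mesh_depth helper is a hypothetical stand-in for the DepthTextureRenderer used in the algorithm.

# Minimal sketch of the coverage test in Algorithm C1 (lines 15-24).
# `render_mesh_depth(mesh, camera)` is hypothetical and returns a float depth
# image in which background pixels take the maximum depth value.
import numpy as np

def filter_meshes(hits, I_seg, camera, render_mesh_depth, r_th=0.7):
    kept = []
    seg = I_seg > 0                          # boolean segmentation mask
    for mesh in hits:
        depth = render_mesh_depth(mesh, camera)
        I_mesh = depth < depth.max()         # silhouette of this mesh alone
        overlap = np.logical_and(I_mesh, seg).sum()
        r = overlap / max(I_mesh.sum(), 1)   # fraction of the mesh covered by the mask
        if r > r_th:                         # drop meshes mostly outside the mask
            kept.append(mesh)
    return kept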

D Pre-Trained Models

MemoVis was prototyped on a set of pre-trained models and deployed on a cloud server with four A10G GPUs. This section provides supplementary details of the VLFMs we used. Full design and implementation details can be found in Section 4.
CLIP. We used the pre-trained clip-ViT-B-32 model due to its inference performance and our available computing resources. Other pre-trained CLIP variants would likely lead to similar results.
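As an illustration, the sketch below shows how the clip-ViT-B-32 checkpoint can be loaded through the sentence-transformers package to score image–text similarity. This is only a usage sketch; MemoVis’s viewpoint-suggestion logic built on top of these embeddings is described in Section 4, and the file name and caption text here are placeholders.

# Minimal sketch: scoring image-text similarity with clip-ViT-B-32.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")
img_emb = model.encode(Image.open("viewpoint.png"))          # placeholder file name
txt_emb = model.encode("a cozy bedroom with warm lighting")  # placeholder comment text
print(util.cos_sim(img_emb, txt_emb))                        # higher = better match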
SAM. We used the sam_vit_h_4b8939.pth checkpoint. All hyper-parameters used during inference follow the default suggestions [52].
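The sketch below shows how this checkpoint is typically loaded with the segment-anything package using its default inference hyper-parameters; the input image and the click coordinates are placeholders, and a point prompt is only one of several ways SAM can be prompted.

# Minimal sketch: point-prompted segmentation with the SAM ViT-H checkpoint.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("viewpoint.png"), cv2.COLOR_BGR2RGB)  # placeholder file
predictor.set_image(image)

# A single foreground click at placeholder pixel coordinates.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)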
ControlNet. We used two different pre-trained ControlNet models. For depth-conditioned synthesis, we used an internal pre-trained model developed by our organization. Like the original ControlNet model, our internal model generates realistic images from depth maps; we opted for it because it was optimized to generate high-quality realistic images. While the open-source equivalent of depth-conditioned ControlNet (lllyasviel/sd-controlnet-depth) might work, the quality of the synthesized images might be degraded. For depth- and scribble-conditioned synthesis, we used the open-source pre-trained ControlNet under the depth (lllyasviel/sd-controlnet-depth) and scribble (inferred by the HED annotator [103]) conditions, based on the original Stable Diffusion (runwayml/stable-diffusion-v1-5). For all inference tasks, we used “realistic, high quality, high resolution, 8k, detailed” and “monochrome, worst quality, low quality, blur” as the positive and negative prompts, respectively. The number of inference steps was set to \(30\) to balance the quality of the synthesized images and inference latency.
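The sketch below shows one way the open-source depth- and scribble-conditioned configuration described above can be assembled with the diffusers library. The scribble checkpoint name (lllyasviel/sd-controlnet-scribble) is our assumption, since only the depth checkpoint is named explicitly above; the conditioning image file names are placeholders, and how MemoVis combines a feedback comment with the quality-related positive prompt is simplified here to plain concatenation.

# Minimal sketch: depth + scribble ControlNet on Stable Diffusion v1.5.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16),  # assumed checkpoint
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

depth_img = load_image("condition_depth.png")        # from the 3D scene's depth buffer (placeholder)
scribble_img = load_image("condition_scribble.png")  # from the HED annotator (placeholder)

image = pipe(
    "a red car driving on the freeway, realistic, high quality, high resolution, 8k, detailed",
    negative_prompt="monochrome, worst quality, low quality, blur",
    image=[depth_img, scribble_img],
    num_inference_steps=30,
).images[0]
image.save("synthesized.png")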
Inpainting. The kandinsky-community/kandinsky-2-2-decoder-inpaint model was used to support the inpainting feature. The prompt “background” was used to remove object(s) from the reference image [15].
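The sketch below shows how this inpainting model can be invoked through the diffusers library with the “background” prompt to erase a masked object; the image and mask file names are placeholders, and the mask is assumed to mark the object region to be repainted.

# Minimal sketch: object removal via Kandinsky 2.2 decoder inpainting.
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
).to("cuda")

image = load_image("reference.png")    # current reference image (placeholder)
mask = load_image("object_mask.png")   # marks the object(s) to remove (placeholder)

result = pipe(prompt="background", image=image, mask_image=mask).images[0]
result.save("object_removed.png")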

E Participants Recruitment for User Studies

Figure E1 shows the participants’ demographic backgrounds for Study 1 (Section 5.1). Participants were recruited through convenience sampling from various social media platforms and a university instant-messaging group that includes faculty, students, and alumni. Although all participants considered themselves experienced in providing design feedback, we also report the methods they had previously preferred for creating feedback. Figure E1 shows that all participants used text to communicate feedback. Participants might also use reference images searched online, drawn by hand, and/or created with image editing tools, referred to in Figure E1 as “online images,” “hand-drawn sketches,” and “low-fi mockups,” respectively. A power analysis for the sample size of \(14\) yields a power of \(80.06\%\), with the significance level (\(\alpha\)), number of groups, and effect size set to \(0.05\), \(2\), and \(0.4\), respectively. We used \(\eta_{p}^{2}\) to compute the effect size and chose a large effect size due to the lengthy duration of Study 1.
Fig. E1.
Fig. E1. Participants’ demographic backgrounds for Study 1 (Section 5.1). On a 5-point Likert scale, participants self-evaluated their skills in using 3D software and their prior experience with GenAI, interior design, and car design. The methods “text,” “online images,” “hand-drawn sketches,” and “low-fi mockups” refer to communicating feedback with typed text, online-searched reference images, hand-drawn sketches, and reference images created with image editing tools, respectively.
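For readers who wish to sanity-check a power figure like the 80.06% reported above, the sketch below approximates it with SciPy under additional assumptions of ours that are not stated above: a repeated-measures F-test with two measurements per participant and a correlation of 0.5 among repeated measures (a common default in power-analysis tools). The exact value therefore may not match the reported one.

# Rough sketch (our assumptions, not necessarily the exact procedure used above):
# power of a repeated-measures F-test with k = 2 conditions, n = 14 participants,
# Cohen's f = 0.4, alpha = 0.05, and an assumed correlation of 0.5.
from scipy import stats

f, n, k, rho, alpha = 0.4, 14, 2, 0.5, 0.05
lam = f**2 * n * k / (1 - rho)                 # noncentrality parameter
df1, df2 = k - 1, (n - 1) * (k - 1)
crit = stats.f.ppf(1 - alpha, df1, df2)        # critical F value
power = 1 - stats.ncf.cdf(crit, df1, df2, lam)
print(f"estimated power ~ {power:.2%}")        # close to 80% under these assumptions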
Figure E2 shows the participants’ demographic backgrounds for Study 2 (Section 5.2). On a scale of \(1\) to \(5\), participants were instructed to self-rate their prior experience with 3D design in general, as well as with interior and car design.
Fig. E2.
Fig. E2. Participants’ demographic backgrounds for Study 2 (Section 5.2). On a scale of 1 to 5, participants self-evaluated their prior experience with 3D design in general, as well as with interior and car design.

F Study Tasks

This section provides supplementary materials for the study tasks used in the final evaluations (Section 5). Figure F1 shows the three models that feedback providers reviewed while creating textual feedback comments with companion visual reference image(s). In T1, participants were required to provide feedback on a samurai boy design (Figure F1(a)); this task was used to help feedback provider participants become familiar with both interface conditions, and all data collected from T1 was excluded from the final analysis. In T2, feedback provider participants were instructed to enhance the bedroom design in Figure F1(b) so that the bedroom would be comfortable to live in. In T3, feedback provider participants were instructed to improve the car design in Figure F1(c) so that the car would be comfortable to drive (in terms of aesthetics, ergonomics, and functionality). Full study details can be found in Section 5.
Fig. F1.
Fig. F1. 3D models used in the user study, including a samurai boy (for task T1), a bedroom (for task T2), and a car (for task T3). Notably, T1 was only used for training.

G Usages of the Image Modifiers

As supplementary material for Section 5.1, Figure G1 visualizes how each participant used one (or more) of the image modifiers provided by MemoVis. For the baseline condition, we visualize the moments when feedback providers used search, drawing, and/or annotation to create reference images.
Fig. G1.
Fig. G1. Visualizations of how each participant interacted with one (or more) image modifiers, under the MemoVis (a) and baseline (b) interface conditions.

H Codebook and Themes from Qualitative Data Analysis

As part of the supplementary material, we attach the resulting codebooks from the qualitative analyses of Formative Study 1 (Figure H1), Formative Study 2 (Figure H2), and the final user studies with feedback provider participants (Figure H3) and designer participants (Figure H4). Note that multiple codes might be assigned to each observation (e.g., a participant’s quote, a piece of 3D design feedback, or a survey response from Study 2).
Fig. H1.
Fig. H1. Themes and codes for Formative Study 1. Notably, it is possible that multiple codes are assigned to one quote.
Fig. H2.
Fig. H2. Themes and codes for analyzing real-world 3D design feedback data from Formative Study 2. Notably, same feedback might be labelled by multiple codes.
Fig. H3.
Fig. H3. Themes and codes for Study 1. Notably, it is possible that multiple codes are assigned to one quote.
Fig. H4.
Fig. H4. Themes and codes for participants’ survey responses from Study 2. Notably, it is possible that multiple codes are assigned to one response.
