We conducted two user studies to evaluate MemoVis. The first study aims to understand the experience of feedback providers while visualizing textual comments using MemoVis, whereas the second study aims to explore how effectively the images created with MemoVis convey the gist of 3D design feedback when consumed by designers. We use PF# and PD# to index participants acting as Feedback providers (PF) and Designers (PD), respectively.
5.1 Study 1: Creating Reference Images
Our first study was structured as a within-subjects design. We aim to tackle two Research Questions (RQs):
— (RQ1) How could the real-time viewpoint suggestions help feedback providers navigate the 3D scene while writing feedback?
— (RQ2) How could the three types of image modifiers help feedback providers visualize textual comments for 3D design?
Participants. PF1–PF14 (age: \(M=23.36\), \(SD=2.52\); seven males and seven females) were recruited as feedback providers. While all participants had experience writing design feedback, most had limited experience working with 3D models; we believe that design feedback providers do not necessarily need to possess design skills. Among the recruited participants, only PF2, PF10, and PF14 were confident in their 3D software skills. Most participants had no experience creating prompts or using GenAI, although PF1 and PF2 considered themselves experts in using LLMs, and PF13 was confident in his proficiency with text-to-image GenAI. Details of participants’ demographics and the power analysis can be found in Appendix E.
Interface Conditions. Participants were invited to use two interface Conditions (C1–C2) to create and visualize textual comments:
— (C1) Baseline. Participants were required to create reference image(s) using search and/or hand sketching. This condition mocks up current design review practices based on the findings from Formative Study 1. Participants were not required to use existing GenAI-based image editing tools like Photoshop [14]: while FP2 acknowledged GenAI as a promising approach for generating reference images, its lack of control made it impractical (Section 3.1), and unlike FP2, who is a professional designer, feedback providers may not possess the skills to use image editing software. Participants were instructed to use their preferred search engine to find well-matched images. PowerPoint was optionally available for sketching and annotations.
— (C2) MemoVis. Participants were invited to use MemoVis to create reference images along with textual comments.
Tasks. Each participant was instructed to complete three design critique Tasks (T1–T3) with C1 and/or C2. To prevent learning effects, T1–T3 used different 3D models created by different creators. For each task, participants were instructed to create at least two design comments, each accompanied by at least one reference image. T1 was used to help participants get familiar with C1 and C2; participants were instructed to critique a character model of a samurai boy, and all data collected from T1 was excluded from analysis. For T2 and T3, participants were asked to make a bedroom more comfortable to live in and a car more comfortable to drive, respectively. While T2 and T3 used different 3D models, the skills needed to create design feedback are the same. Details of the study tasks can be found in Figure F1 in Appendix F.
Procedures. Participants first completed a questionnaire reporting their demographics and their experience with 3D software, 3D design, and GenAI. They were then introduced to MemoVis and given sufficient time to learn and get familiar with both C1 and C2 while completing T1. Those without prompt-writing experience for GenAI were given sufficient time to go through examples on Lexica [6] and practice with Firefly [10]. Once comfortable with C1 and C2, participants completed T2 and T3. To prevent sequencing effects, we counterbalanced the order of interface conditions: PF1–PF7 completed T2 with C2, followed by T3 with C1, whereas PF8–PF14 completed T2 with C1, followed by T3 with C2. Comparing feedback creation between T2 and T3 was out of our scope, so we did not counterbalance the order of the tasks. After each task, participants rated their agreement with four Questions (Q1–Q4) on a \(5\)-point Likert scale, with a score \(\gt 3\) considered a positive rating.
— (Q1) Navigating the Viewpoint: “it was easy to locate the viewpoint and/or target objects while creating feedback.”
— (Q2) Creating Reference Images: for the task completed with MemoVis (C2), we used the statement “it was easy to create reference images with the image modifiers for my textual comments.” For the task completed with the baseline (C1), we used the statement “it was easy to create reference images with the method(s) I chose.”
— (Q3) Explicitness of the Reference Images: “the reference images easily and explicitly conveyed the gist of my design comments.”
— (Q4) Creativity Support: “the reference images helped me discover more potential design problems and new ideas.”
A semi-structured interview was also conducted, focusing on participants’ rationales behind their ratings of Q1 to Q4. The study took \(57.34\) min on average (\(SD=7.72\) min).
Measures and Data Analysis. To address RQ1, we measured the navigating time for each textual comment, defined as the time feedback providers spent navigating and exploring the 3D model. With the Shapiro–Wilk test [88], we verified the normal distribution of the measurements under each condition (\(p \gt .05\)). A one-way ANOVA [40] (\(\alpha=.05\)) was therefore used for statistical significance analysis. Eta squared (\(\eta^{2}\)) was used to evaluate the effect size, with \(.01\), \(.06\), and \(.14\) as the empirical thresholds for small, medium, and large effect sizes [31]. We used thematic analysis [23] as the interpretative qualitative data analysis approach, along with deductive and inductive coding [21], to analyze participants’ responses during the semi-structured interviews, better understand their thoughts, and uncover the reasons behind the measurements and survey responses. We first read the transcripts independently and identified repeating ideas using initial codes derived from Q1 to Q4. Next, we inductively came up with new codes and iterated on them while sifting through the data. The final codebook can be found in Figure H3 in Appendix H.
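For concreteness, the quantitative part of this pipeline can be sketched in a few lines of Python using SciPy. The sketch below is illustrative: the two arrays are placeholder values, not the study’s actual measurements.

```python
# Minimal sketch of the quantitative pipeline described above (Shapiro-Wilk
# normality check, one-way ANOVA, and eta-squared effect size), using SciPy.
# The two arrays are illustrative placeholders, not the study's measurements.
import numpy as np
from scipy import stats

baseline_times = np.array([62.1, 70.4, 55.3, 81.0, 66.2, 48.9, 72.5,
                           60.8, 77.3, 58.6, 69.1, 74.0, 63.5])
memovis_times = np.array([41.2, 52.7, 30.5, 47.8, 39.4, 60.1, 25.9,
                          44.3, 55.0, 38.7, 49.6, 33.2, 46.5])

# Shapiro-Wilk: proceed with ANOVA only if both conditions look normal (p > .05).
for name, x in [("baseline", baseline_times), ("MemoVis", memovis_times)]:
    w, p = stats.shapiro(x)
    print(f"Shapiro-Wilk ({name}): W={w:.3f}, p={p:.3f}")

# One-way ANOVA (alpha = .05) comparing navigating time across conditions.
f_stat, p_val = stats.f_oneway(baseline_times, memovis_times)
df1 = 1                                             # number of groups - 1
df2 = len(baseline_times) + len(memovis_times) - 2  # N - number of groups
print(f"ANOVA: F({df1},{df2}) = {f_stat:.3f}, p = {p_val:.3f}")

# Eta squared = SS_between / SS_total, recoverable from F and the dfs;
# thresholds .01 / .06 / .14 mark small / medium / large effects.
eta_sq = (f_stat * df1) / (f_stat * df1 + df2)
print(f"eta^2 = {eta_sq:.3f}")
```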
Results and Discussions. Most participants found our viewpoint suggestion features useful (RQ1) and the image modifiers easy to use for visualizing textual comments (RQ2). Most participants also believed the reference images created with MemoVis could easily and explicitly convey the gist of 3D design feedback. Usage patterns of each image modifier can be found in Figure G1 in Appendix G.
— How could the real-time view suggestions help feedback providers navigate inside the 3D explorer (RQ1)? Overall, \(13/14\) participants (all except PF14) leveraged the viewpoint suggestion features while visualizing their textual comments. More participants agreed that it was easy to locate the viewpoints and target objects with MemoVis than with the baseline (\(13\) vs. \(4\), Figure 9(a)).
Figure 9(b) shows a significant reduction (\(F_{1,24}=7.398\), \(p=.018\), \(\eta^{2}=.236\)) in navigating time when using MemoVis (\(M=44.06\) s, \(SD=18.94\) s) vs. the baseline (\(M=66.67\) s, \(SD=16.94\) s). Notably, the normality of the measurements was verified (\(p_{baseline}=0.54\), \(p_{MemoVis}=0.17\)). Our qualitative analysis suggested two benefits brought by the viewpoint suggestion feature.
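As a quick consistency check, the reported effect size follows directly from the \(F\) statistic and its degrees of freedom via the standard one-way ANOVA identity:
\[
\eta^{2} \;=\; \frac{F \cdot df_{1}}{F \cdot df_{1} + df_{2}} \;=\; \frac{7.398 \times 1}{7.398 \times 1 + 24} \;\approx\; .236.
\]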
Providing guides to find the viewpoint contextualized on the design comments. Most participants appreciated the guidance brought by the viewpoint suggestion feature. For example, “the view angle is good that it’s trying to give you a context” (PF1) and “it helped me a lot to find a nice view where I could create reference image” (PF2). Participants also suggested that the feature could help feedback providers make faster decisions: “it helps me to make faster decisions. Sometimes I don’t know which views might be better. And it gives me options that I could choose” (PF11). After trying the baseline, PF11 emphasized: “[without viewpoint suggestion] I have to decide on how to look. I have to make bunch of decisions in the way. It is pretty cognitively demanding.” PF7, who did not have prior 3D experience, initially felt “confused about the directions of the viewpoint” while attempting to navigate the 3D scene using the 3D explorer. After using the viewpoints suggested by MemoVis, she felt “it’s a much better view, as the bed could be seen from a nice view angle.”
Locating target object(s) in a scene. Despite occasional failures and the need for minor adjustments, most participants appreciated being able to quickly locate target objects. For example, PF8 acknowledged: “I feel that around 85% of the time that the system could give me the right view that I expected, although I sometimes might still need to adjust like a zoom.” PF4 explained why she did not strongly agree with Q1: “although it was helpful, like the system gave me a nice view suggestion. But I had to move it manually. Although that was a good starting point, I still need to make adjustments by myself.” During the interview, PF2 believed that “the view suggestions would be more useful for the bigger scene.” Drawing on his past experience designing 3D computer games, he further commented: “sometimes I work in video games. And video game maps could become really large. And there are multiple things. For example, there’s a very specific area that I find to go to, and to edit it. And then if I can type and say, for example, the boxes on the second floor of the map. And it instantly teleport me to there. Then, there is gonna save me a lot of time to find it in the hierarchy.” After critiquing the car design with MemoVis, PF14 commented: “I think moving around with car was easier just because it’s a car instead of the room. But I believe for the bedroom it would be much helpful to have the view suggestion feature that kind of guide.”
— How could MemoVis better support feedback providers in creating and visualizing textual comments for 3D design (RQ2)? Figure 9(a) shows that, compared to the baseline, more participants rated positively in terms of reference image creation (\(6\) vs. \(1\) for Q2), explicitness of the images (\(10\) vs. \(4\) for Q3), and creativity support (\(13\) vs. \(3\) for Q4). Overall, feedback provider participants used the text \(+\) scribble modifier \(48\) times (\(42.11\%\)), the grab’n go modifier \(30\) times (\(26.32\%\)), and the text \(+\) paint modifier \(36\) times (\(31.57\%\)) (see Figure G1 in Appendix G for detailed usage of the image modifiers). While completing the baseline task, despite the challenges of searching for well-matched internet images, participants used three main strategies to make the reference images more explicit: four participants used annotations to highlight the areas that the textual feedback focused on; one participant (PF3) used two reference images to demonstrate a good and a bad design, expecting the designers to understand the gist by contrasting two extreme examples; and two participants (PF2, PF13) provided multiple reference images, hoping the designers could extract a different part of the gist from each image.
Image explicitness. Participants appreciated the explicitness of the reference images created with MemoVis, which kept the contexts of the initial design. For example, “I like how it keeps the context around this picture. It could be much easier for people to understand my thoughts” (PF2), “it allows me to generate reference image in the same scenario, and not some random scenario that I’ve pulled from the internet” (PF9), and “this image references are pretty easy and explicitly. It just conveys my point. That’s what matters. Now it’s up to the person to make the decision to how have to make it better” (PF11). Some participants, like PF6, preferred the MemoVis-created reference images to searched or hand-sketched images. PF11 highlighted the easy and convenient workflow: “this process was pretty easy. The reference image is just like pop up to me! This is something I love. Like when I provide feedback while I was teaching, I didn’t explicitly tell the students like, your typography is really hard to read, you should change it to this font. But something like a bigger front, a different style, just like that.” In contrast, after completing the baseline task, PF13 complained: “you can find tones of image on the Google. They are very realistic. They are very decorative. But it’s just not related to my model, it’s not in the context […] when I use an internet image, there are more details, but there is even more confusing part. So many times, I just tweak my textual feedback, to minimize the possible confusions for the designers […] I think if I were doing by myself, I would just spend some more time and use Photoshop to edit the internet images.” PF14 also commented: “I typical have a specific image in mind. I think to come up with that design, [the MemoVis] is much much better. When I search for something, it is typically very generic. I never search something that is very specific like, a bedroom with blue walls or something. It’s easy if you’re coming in with a specific design.”
Unexpected inspirations and creativity support. Most participants recognized the benefits of MemoVis for inspiring new ideas, echoing FP2’s comments (Section 3.1). For example, “there are just so many possibilities. If you search through Google, your thoughts are limited by your experience. But this tool could give me so much unexpected surprises, which is a good thing and they are many times actually better than what I thought. I think it works quite well to help inspire more new ideas” (PF12). PF13 particularly enjoyed the mental experience of getting inspired iteratively while refining the textual prompts: “when I was typing like the description, it was just a text like a description. I don’t really have a solid image in my brain. But [MemoVis] helps me shape my idea, and helps me better think iteratively […] it’s like when you’re writing papers, instead of starting from the scratch, you have a draft, so it’s easier to discuss and revise based on it […].” Figure 10 demonstrates three examples of unexpected creativity that were appreciated by PF4. For example, upon seeing Figure 10(b), PF4 thought out loud: “I wasn’t expecting the plant to have like the little orange leaves, but I think it was a good idea for the final design.”
Simple and easy image creation workflow. Most participants appreciated the simple workflow for creating reference images with MemoVis, and felt it was “very easy to use” (PF14). This was also evident in the responses to Q2 reported in Figure 9. Participants highlighted the easy workflow enabled by the convenience of all image modifiers. For example, “the system is very easy to use, although it might require a trial or two to become familiar with the modifiers. Once I understand, it is really handy and convenient to instantly create reference images that I want, which also match the design and my comments. All image modifiers are really useful, essential and indispensable!” (PF1). PF9 particularly appreciated the “select and extract” option supported by the grab’n go modifier: “I think the workflow is pretty good. […] The image generation is not the only part. But I’m getting the option to select and extract [the new objects that I am trying to suggest], instead of just using some random image on Google [PF9 refers to the baseline interface condition].” Participants also valued the simplicity and effectiveness of composing text-to-image prompts. For example, even without prior GenAI experience, PF14 commented: “it was really easy to write prompt. I love the flexibility to write prompts. It also helped me think.”
Scenarios where feedback providers failed to create explicit and satisfying companion reference images. We observed multiple scenarios where participants failed to create satisfying visual references that could explicitly convey the textual feedback. Our analysis unveiled three key reasons.
(1) Unsatisfactory generation. Text-to-image generation is still an emerging technology and is therefore not perfect [65]. MemoVis could fail completely to generate a reasonable shape (Figure 11(b)), or generate a reasonable image that nevertheless failed to match what the user wanted (Figure 11(d) and (e)). These setbacks may result in less explicit reference images, potentially causing misunderstandings for the designers and contributing to the negative ratings shown in Figure 9(a). For example, PF4 thought aloud while examining Figure 11(b): “it doesn’t look like a plant. I don’t think this is explicit enough for the designer to understand.” In such cases, we observed that participants tended to retry several times or from a different view angle until they could generate the desired image (Figure 11(c)). While PF4 positively rated the image explicitness, she disagreed that it was easy to create reference images. PF2 made a similar comment with respect to Figure 11(e): “the chair does not look like the one I wanted.”
(2) Difficulty of writing prompts. A few participants emphasized the importance of prompts and the challenges of writing them. For example, “sometimes, I need several times of revision on the prompt to make the generated image better. […] So while my feedback takeaways could be visualized, it still needs several attempts” (PF12). A small number of participants also described the need for the system to have knowledge of the existing design: “if I make a prompt, it should have the knowledge of this existing design. For example, if I say like, keep the same blanket, then it should be same” (PF5). PF11 suggested that MemoVis should automatically create prompts based on the textual feedback: “I thought when I write the feedback, the prompt will be automatically generated so that it can bring me the reference image. But it’s more like I write the feedback and then I also create the prompt […] It was initially hard for me to distinguish the prompt and the feedback itself, that I need to tell the AI versus also tell the designer.” PF7, a non-native English speaker, found it challenging to phrase the prompt “lap desk,” leading to the complete failure of the final reference image created by MemoVis (Figure 11(f)–(h)). As a result, PF7 rated the explicitness of the reference images (Q3) as strongly disagree.
(3) Lack of alternative design exploration. While most participants believed that the images created with MemoVis were more explicit and provided more inspirational support than the baseline, some commented on the need to see multiple inference results, similar to mainstream search engines. For example: “compared to the Google image, I feel there’s something very inspiring about seeing like 20 images all at once from like different creators” (PF6), and “when you search for an image on Google, it has the big long list of different images. That helps me to see different ideas. And because it’s Google, it’s pulling from a bunch of different websites. So I think that also helps me to think like, oh, this is what someone else thought. So yeah, I think if I’m thinking about it that way” (PF4).
5.2 Study 2: Assessing Reference Images
Our second study aims to tackle the key RQ: how well could the reference images created by MemoVis convey the gist of 3D design feedback, compared to the images created under the baseline condition (i.e., internet-searched images and/or hand sketches)?
Participants. We recruited PD1–PD8 (age: \(M=23.75\), \(SD=2.55\); four males and four females), all with prior 3D design experience, as designer participants from an institutional 3D design & eXtended Reality student society. No designer participants were involved in the formative studies or the first study. Details of the designer participants’ demographic backgrounds can be found in Appendix E.
Procedures. The second study was structured as a survey study, in which participants completed an online questionnaire. We first collected all reference images and textual comments from the first study, comprising \(44\) and \(39\) pieces of design feedback created with C1 and C2, respectively. Each piece of design feedback contains a textual comment and one or more reference images. We also captured the viewpoints of the initially designed 3D models for each reference image. Next, we randomly shuffled all design feedback and divided it into eight groups. Each group consisted of \(11\) pieces of design feedback, except the last, which contained six. On average, each group comprised \(46.21\%\) (\(SD=9.54\%\)) design feedback created with MemoVis, and \(74.05\%\) (\(SD=9.21\%\)) design feedback contributed by different feedback provider participants from Study 1. We then assigned the eight surveys to PD1 to PD8 for evaluation. Participants rated each reference image on a \(5\)-point Likert scale regarding how well it conveyed the gist of the textual comment, and were encouraged to provide textual comments justifying each rating. Each questionnaire took approximately \(10\)–\(15\) min to complete.
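The shuffle-and-partition step above is straightforward to implement; the sketch below shows one possible way, where the Feedback record and function name are hypothetical stand-ins for illustration, not the study’s actual tooling.

```python
# Minimal sketch of the survey-assembly procedure described above: shuffle
# all design feedback collected in Study 1 and partition it into eight groups
# (seven groups of 11 and one group of 6), one questionnaire per designer.
# The Feedback record and function name are hypothetical, for illustration.
import random
from dataclasses import dataclass, field

@dataclass
class Feedback:
    comment: str                                  # the textual design comment
    images: list = field(default_factory=list)    # companion reference image(s)
    condition: str = "C1"                         # "C1" (baseline) or "C2" (MemoVis)

def assemble_questionnaires(feedback, seed=0):
    """Randomly shuffle the 83 feedback items and split them into 8 groups."""
    rng = random.Random(seed)
    shuffled = list(feedback)
    rng.shuffle(shuffled)
    groups = [shuffled[i * 11:(i + 1) * 11] for i in range(7)]  # 7 groups of 11
    groups.append(shuffled[77:])                                # last group of 6
    return groups  # groups[k] becomes the questionnaire for PD(k+1)
```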
Analysis. We analyzed the Likert-scale score for each reference image, each rated by one designer participant. Qualitatively, we also collected the textual comments that participants provided to justify their ratings, including \(33\) and \(31\) textual comments for design feedback created with the baseline and MemoVis, respectively. An inductive coding approach [21] was used to analyze participants’ qualitative responses.
Results and Discussions. Figure 12(a) shows the Likert rating for each piece of feedback. We found that around \(66.67\%\) of the reference images created with MemoVis were rated positively, versus only around \(38.64\%\) of the images created with the baseline approach (Figure 12(b)).
When evaluating reference images and feedback produced using MemoVis, participants liked the explicitness of the generated reference images, e.g., “clear and focused” (PD6), “the reference image perfectly captures all the elements mentioned in the design comments” (PD4), and “the image clearly shows what the text is trying to say” (PD1). Figure 14(a) shows an example in which PD1 believed that “it is easily to identify the new picture and the desired location” in the initial bedroom 3D model.
For the images produced in the baseline condition, comments from the designer participants pointed to a common drawback of mismatched contexts: the contextual difference between the reference images and the textual comments can cause confusion. For example, Figure 13(a) shows a reference image created by PF10, about which PD2 judged: “the structure between the bed and the floor can easily be mistaken for legs upon a cursory glance.” Figure 13(c) shows an example in which both the viewpoint and the requested change in the reference image are dramatically different from the initial design. The designer participant did not feel fully confident in grasping the feedback: “the idea of a pet seat is clear, but it seems so different from the original image that it would be confusing” (PD1). We also found that some feedback providers attempted to reduce ambiguity by annotating the initial design to highlight the changes (Figure 13(b)). This approach, however, is less effective when the requested change is not explicit. In this case, the feedback provider was asking for a new material design on the car body, and PD1 commented: “I’m not sure what the circles are trying to show.”
Although most participants favored the reference images created with MemoVis, they also highlighted a few setbacks of the MemoVis-generated images. We identified three key reasons from the designers’ perspective.
(1) Mismatched contexts. Similar to the baseline condition, designer participants occasionally pointed out context mismatches, although these did not affect their understanding of the design feedback. Figure 14(b) shows an example in which the overall bedroom design was changed even though the focus of the textual feedback was the bed, possibly caused by a failure of the grab’n go modifier. Despite this, PD5 noted: “even the background is a little different, the reference image still preserves the angle of the view, and the new environment setting.”
(2) Missing details. While some reference images could reflect the suggested edits, designer participants believed that missing details might cause misunderstanding. For example, regarding Figure 14(c), PD4 disagreed that the reference image could convey the textual comment because “while the display has been added as per the design comments, some panels have been removed. I don’t know if these removals align with the design comments.” Such confusion might result in incorrect translations of the design feedback into the final 3D model. While mismatched contexts were identified as a common problem, designer participants did not highlight missing details for the reference images created with the baseline.
(3) Complete failure. In very few cases, designer participants highlighted that the reference image could be entirely unsuccessful. For example, PD8 commented on the design feedback of Figure 14(d): “I don’t understand what is that white thing, doesn’t look like a curtain.” Such reference images might cause designers to misunderstand the gist of the design feedback, potentially necessitating further communication with the feedback providers.