WISE

This repository is the official implementation of WISE.

💡 News

2025/06/03: We have updated our code again to provide clearer, simpler, and easier evaluation! 😊
2025/05/24: We have collected some feedback and updated our code. If you have any questions or comments, feel free to email us at niuyuwei04@gmail.com!
2025/03/11: We released our paper at https://arxiv.org/abs/2503.07265.
2025/03/10: We released the code and data.

🎩 Introduction

Text-to-Image (T2I) models can generate high-quality artistic creations and visual content. However, existing research and evaluation standards focus predominantly on image realism and shallow text-image alignment, and lack a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this gap, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts reveals significant limitations in their ability to integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models.

📖 WISE Eval

Prompt Generation: We meticulously crafted 1000 prompts across 25 sub-domains within Cultural Common Sense, Spatio-temporal Reasoning, and Natural Science.
Image Generation: Each prompt was fed to 20 different Text-to-Image (T2I) models (10 dedicated T2I models and 10 unified multimodal models) to generate corresponding images.
GPT-4o Evaluation: For each generated image, we employed GPT-4o-2024-05-13 (with specified instructions detailed in the paper) to independently assess and score each aspect (Consistency, Realism, and Aesthetic Quality) on a scale from 0 to 2. GPT-4o acts as a judge, providing objective and consistent scoring.
WiScore Calculation: Finally, we calculated the WiScore for each image based on the GPT-4o scores and the defined weights, providing a comprehensive assessment of the model's ability to generate world knowledge-informed images.

WiScore assesses Text-to-Image models using three key components:

Consistency: How accurately the image matches the prompt's content and relationships.
Realism: How believable and photorealistic the image appears.
Aesthetic Quality: How visually appealing and artistically well-composed the image is.

WiScore Calculation:

WiScore = (0.7 * Consistency + 0.2 * Realism + 0.1 * Aesthetic Quality) / 2
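The formula above can be written as a small Python function. This is a minimal sketch, not code from the repository: the function name `wiscore` and the range check are assumptions added for illustration. Each input is one of the 0 to 2 judge scores, and dividing by 2 normalizes the result to the range 0 to 1.

```python
def wiscore(consistency: float, realism: float, aesthetic_quality: float) -> float:
    """Normalized WiScore for one image; each input is a judge score in [0, 2]."""
    scores = {
        "consistency": consistency,
        "realism": realism,
        "aesthetic_quality": aesthetic_quality,
    }
    for name, value in scores.items():
        if not 0 <= value <= 2:
            raise ValueError(f"{name} must be between 0 and 2, got {value}")
    # Weights taken from the formula above.
    weighted = 0.7 * consistency + 0.2 * realism + 0.1 * aesthetic_quality
    # Divide by the maximum possible score (2) so the result lies in [0, 1].
    return weighted / 2


# An image that fully matches the prompt but is only moderately realistic and appealing.
print(round(wiscore(2, 1, 1), 2))
```

Because Consistency carries a weight of 0.7, an image that is beautiful but ignores the prompt's implied knowledge still scores low.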

The Overall WiScore is a weighted sum of six categories:

Overall WiScore = (0.4 * Cultural + 0.167 * Time + 0.133 * Space + 0.1 * Biology + 0.1 * Physics + 0.1 * Chemistry)
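As a worked example of the weighted sum, the sketch below recomputes an Overall WiScore from per-category scores. The names `CATEGORY_WEIGHTS` and `overall_wiscore` are hypothetical and not part of Calculate.py. The inputs are the FLUX.1-dev values from the leaderboard, which are rounded to two decimals, so the recomputed value is only approximate.

```python
# Category weights copied from the Overall WiScore formula; they sum to 1.
CATEGORY_WEIGHTS = {
    "Cultural": 0.4,
    "Time": 0.167,
    "Space": 0.133,
    "Biology": 0.1,
    "Physics": 0.1,
    "Chemistry": 0.1,
}


def overall_wiscore(category_scores: dict) -> float:
    """Weighted sum of the six per-category WiScores."""
    missing = set(CATEGORY_WEIGHTS) - set(category_scores)
    if missing:
        raise KeyError(f"missing category scores: {sorted(missing)}")
    return sum(weight * category_scores[name] for name, weight in CATEGORY_WEIGHTS.items())


# Per-category scores for FLUX.1-dev, taken from the leaderboard.
flux_dev = {
    "Cultural": 0.48,
    "Time": 0.58,
    "Space": 0.62,
    "Biology": 0.42,
    "Physics": 0.51,
    "Chemistry": 0.35,
}
print(round(overall_wiscore(flux_dev), 2))  # 0.5
```

The result agrees with the Overall value of 0.50 reported for FLUX.1-dev.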

Usage Guide

To evaluate using GPT-4o-2024-05-13, follow these steps:

1. Evaluate with GPT-4o-2024-05-13

First, set the IMAGE_DIR variable to the directory where your model's generated images are saved. The image names should be in the format 1-1000.png.

IMAGE_DIR="path/to/your_image_output_dir" # Directory where model-generated images are saved, e.g., 1-1000.png
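Before spending API calls, it may help to confirm that every image is present. The sketch below is not part of the repository, and it assumes that "1-1000.png" means files named 1.png through 1000.png, one per prompt number. The helper name `missing_images` is hypothetical.

```python
from pathlib import Path


def missing_images(image_dir: str, total: int = 1000) -> list:
    """Return the prompt numbers whose <n>.png file is absent from image_dir.

    Assumes images are named 1.png, 2.png, ..., <total>.png.
    """
    root = Path(image_dir)
    return [n for n in range(1, total + 1) if not (root / f"{n}.png").is_file()]


if __name__ == "__main__":
    missing = missing_images("path/to/your_image_output_dir")
    print(f"{len(missing)} of 1000 images missing; first few: {missing[:10]}")
```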

Then run the gpt_eval.py script once per category. Replace the empty string after --api_key with your actual API key.

python gpt_eval.py \
    --json_path data/cultural_common_sense.json \
    --output_dir ${IMAGE_DIR}/Results/cultural_common_sense \
    --image_dir ${IMAGE_DIR} \
    --api_key "" \
    --model "gpt-4o-2024-05-13" \
    --result_full ${IMAGE_DIR}/Results/cultural_common_sense_full_results.json \
    --result_scores ${IMAGE_DIR}/Results/cultural_common_sense_scores_results.jsonl \
    --max_workers 96

python gpt_eval.py \
    --json_path data/spatio-temporal_reasoning.json \
    --output_dir ${IMAGE_DIR}/Results/spatio-temporal_reasoning \
    --image_dir ${IMAGE_DIR} \
    --api_key "" \
    --model "gpt-4o-2024-05-13" \
    --result_full ${IMAGE_DIR}/Results/spatio-temporal_reasoning_results.json \
    --result_scores ${IMAGE_DIR}/Results/spatio-temporal_reasoning_results.jsonl \
    --max_workers 96

python gpt_eval.py \
    --json_path data/natural_science.json \
    --output_dir ${IMAGE_DIR}/Results/natural_science \
    --image_dir ${IMAGE_DIR} \
    --api_key "" \
    --model "gpt-4o-2024-05-13" \
    --result_full ${IMAGE_DIR}/Results/natural_science_full_results.json \
    --result_scores ${IMAGE_DIR}/Results/natural_science_scores_results.jsonl \
    --max_workers 96
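The three commands above differ only in the category name and result file names, so they can be generated from a small table. The following is a sketch of one possible wrapper and is not shipped with the repository. The names `build_command` and `run_all` are hypothetical, and the flags and file names are copied verbatim from the commands above. As written it only prints the commands as a dry run.

```python
import shlex
import subprocess

# (category stem, full-results file name, scores file name), as used in the commands above.
CATEGORIES = [
    ("cultural_common_sense",
     "cultural_common_sense_full_results.json",
     "cultural_common_sense_scores_results.jsonl"),
    ("spatio-temporal_reasoning",
     "spatio-temporal_reasoning_results.json",
     "spatio-temporal_reasoning_results.jsonl"),
    ("natural_science",
     "natural_science_full_results.json",
     "natural_science_scores_results.jsonl"),
]


def build_command(image_dir, api_key, stem, full_name, scores_name):
    """Build the gpt_eval.py argument list for one category."""
    results = f"{image_dir}/Results"
    return [
        "python", "gpt_eval.py",
        "--json_path", f"data/{stem}.json",
        "--output_dir", f"{results}/{stem}",
        "--image_dir", image_dir,
        "--api_key", api_key,
        "--model", "gpt-4o-2024-05-13",
        "--result_full", f"{results}/{full_name}",
        "--result_scores", f"{results}/{scores_name}",
        "--max_workers", "96",
    ]


def run_all(image_dir, api_key):
    """Run the three evaluations in sequence, stopping at the first failure."""
    for stem, full_name, scores_name in CATEGORIES:
        subprocess.run(build_command(image_dir, api_key, stem, full_name, scores_name), check=True)


# Dry run: print the commands without executing them.
for stem, full_name, scores_name in CATEGORIES:
    print(shlex.join(build_command("path/to/your_image_output_dir", "YOUR_API_KEY", stem, full_name, scores_name)))
```

To execute the evaluations instead of printing them, call `run_all(image_dir, api_key)` from the repository root with a real key.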

2. Calculate Scores

After running the evaluations, use Calculate.py to compute the scores.

python Calculate.py \
    "${IMAGE_DIR}/Results/cultural_common_sense_scores_results.jsonl" \
    "${IMAGE_DIR}/Results/natural_science_scores_results.jsonl" \
    "${IMAGE_DIR}/Results/spatio-temporal_reasoning_results.jsonl" \
    --category all

Important Notes!

GPT Version: Please ensure you use gpt-4o-2024-05-13 for evaluation.
Resuming from Breakpoints: gpt_eval.py supports resuming from breakpoints. If your evaluation hits an error midway, simply re-run the same command.
Categorized Score Calculation: Calculate.py supports calculating scores by category. You can change the --category parameter to specify which categories to calculate (e.g., --category culture or --category all).

🏆 Leaderboard

Normalized WiScore of different models

Dedicated T2I
Model	Cultural	Time	Space	Biology	Physics	Chemistry	Overall
FLUX.1-dev	0.48	0.58	0.62	0.42	0.51	0.35	0.50
FLUX.1-schnell	0.39	0.44	0.50	0.31	0.44	0.26	0.40
PixArt-Alpha	0.45	0.50	0.48	0.49	0.56	0.34	0.47
playground-v2.5	0.49	0.58	0.55	0.43	0.48	0.33	0.49
SD-v1-5	0.34	0.35	0.32	0.28	0.29	0.21	0.32
SD-2-1	0.30	0.38	0.35	0.33	0.34	0.21	0.32
SD-XL-base-0.9	0.43	0.48	0.47	0.44	0.45	0.27	0.43
SD-3-medium	0.42	0.44	0.48	0.39	0.47	0.29	0.42
SD-3.5-medium	0.43	0.50	0.52	0.41	0.53	0.33	0.45
SD-3.5-large	0.44	0.50	0.58	0.44	0.52	0.31	0.46
Unified MLLM
Model	Cultural	Time	Space	Biology	Physics	Chemistry	Overall
Liquid	0.38	0.42	0.53	0.36	0.47	0.30	0.41
Emu3	0.34	0.45	0.48	0.41	0.45	0.27	0.39
Harmon-1.5B	0.38	0.48	0.52	0.37	0.44	0.29	0.41
Janus-1.3B	0.16	0.26	0.35	0.28	0.30	0.14	0.23
JanusFlow-1.3B	0.13	0.26	0.28	0.20	0.19	0.11	0.18
Janus-Pro-1B	0.20	0.28	0.45	0.24	0.32	0.16	0.26
Janus-Pro-7B	0.30	0.37	0.49	0.36	0.42	0.26	0.35
Orthus-7B-base	0.07	0.10	0.12	0.15	0.15	0.10	0.10
Orthus-7B-instruct	0.23	0.31	0.38	0.28	0.31	0.20	0.27
show-o	0.28	0.36	0.40	0.23	0.33	0.22	0.30
show-o-512	0.28	0.40	0.48	0.30	0.46	0.30	0.35
vila-u-7b-256	0.26	0.33	0.37	0.35	0.39	0.23	0.31

Citation

@article{niu2025wise,
  title={WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation},
  author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
  journal={arXiv preprint arXiv:2503.07265},
  year={2025}
}

📧 Contact

If you have any questions, feel free to contact Yuwei Niu at niuyuwei04@gmail.com.

Recommendation

If you're interested in unified models, Purshow/Awesome-Unified-Multimodal is one of the most comprehensive resources for papers, code, and other materials related to unified multimodal models.
