WorldEdit: Towards Open-World Image Editing With a Knowledge-Informed Benchmark
Abstract
Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often struggle with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples guided by paraphrased instructions that align with real-world causal logic. We further provide WorldEdit-Test for evaluating existing models' performance in causal editing scenarios. With WorldEdit, we apply a two-stage training framework to fine-tune models such as Bagel, integrated with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.
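The causal verification reward is only named in the abstract, not specified on this page. The sketch below is a hypothetical illustration of how such a reward could gate the second training stage, assuming a VLM-based judge (`judge_causal_plausibility`, a placeholder stub here) that scores whether the edited image depicts the effect implied by the stated cause; the gating threshold and equal weights are assumptions, not the paper's settings.

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    instruction: str   # implicit, cause-style instruction
    source_path: str   # path to the original image
    edited_path: str   # path to the model's edited output

def judge_causal_plausibility(sample: EditSample) -> float:
    """Stub for a VLM-based judge scoring, in [0, 1], how plausibly the
    edited image depicts the effect implied by the instruction's cause.
    A real implementation would query a vision-language model; this
    placeholder always returns a neutral score."""
    return 0.5

def causal_verification_reward(sample: EditSample,
                               consistency: float,
                               threshold: float = 0.5) -> float:
    """Gated reward: edits that fail causal verification earn no credit;
    otherwise plausibility is blended with an edit-consistency score so
    the model is not rewarded for unrelated changes."""
    plausibility = judge_causal_plausibility(sample)
    if plausibility < threshold:
        return 0.0
    return 0.5 * plausibility + 0.5 * consistency

sample = EditSample(
    instruction="The candle has been burning all evening.",
    source_path="candle_before.jpg",
    edited_path="candle_after.jpg",
)
print(causal_verification_reward(sample, consistency=0.9))  # 0.7
```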
Overview
WorldEdit focuses on causal, world-driven editing scenarios in which the model must understand how real-world entities respond to implicit causes such as temperature, acidity, or external forces. Rather than specifying the final visual outcome directly, the instructions describe the underlying cause, and the model must infer and render the appropriate effect in image space.
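To make the cause-versus-effect distinction concrete, a hypothetical WorldEdit-style sample could look like the following; the field names and file paths are illustrative, not the released schema.

```python
# Illustrative sample: the instruction states a cause (heat), and the
# model must infer the effect (melting) rather than being told to
# "make the ice cream melt".
sample = {
    "instruction": "The ice cream cone has been sitting in the summer sun for an hour.",
    "implied_effect": "the ice cream is melting and dripping down the cone",
    "source_image": "images/ice_cream_before.jpg",
    "target_image": "images/ice_cream_after.jpg",
    "causal_category": "temperature",
}
```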
Data Construction
Visualization
Qualitative comparison across different causal categories. The figure shows representative examples from ten causal reasoning tasks. Each row corresponds to a causal scenario, with the source image on the left followed by results from different models.
Additional visualization results across different causal modes on the WorldEdit-Test set.
Leaderboard
| Rank | Model | Edit Consistency | Visual Quality | Instruct Following | Knowledge Plausibility | Avg▼ |
|---|---|---|---|---|---|---|
| 1 | Nano Banana 2 | 4.41 | 4.88 | 4.39 | 4.42 | 4.53 |
| 2 | GPT-4o | 4.33 | 4.94 | 4.07 | 4.12 | 4.36 |
| 3 | Nano Banana | 4.50 | 4.91 | 3.70 | 3.76 | 4.22 |
| 4 | SeedEdit-3.0 | 4.27 | 4.80 | 3.89 | 3.87 | 4.21 |
| 5 | WorldEdit | 4.08 | 4.54 | 3.85 | 3.83 | 4.07 |
| 6 | Bagel-Think | 2.89 | 4.33 | 3.11 | 3.08 | 3.35 |
| 7 | Flux-Kontext | 4.46 | 4.82 | 1.77 | 1.79 | 3.21 |
| 8 | Bagel | 2.44 | 4.20 | 2.22 | 2.19 | 2.76 |
| 9 | OmniGen | 2.63 | 4.15 | 1.67 | 1.61 | 2.52 |
| 10 | OmniGen2 | 2.16 | 4.54 | 1.67 | 1.68 | 2.51 |
| 11 | Emu2 | 1.59 | 4.61 | 1.76 | 1.85 | 2.46 |
| 12 | IP2P | 2.51 | 4.41 | 1.39 | 1.44 | 2.44 |
| 13 | MagicBrush | 1.59 | 4.18 | 1.36 | 1.42 | 2.14 |
| 14 | AnyEdit | 1.53 | 3.25 | 1.80 | 1.77 | 2.09 |
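The Avg column is consistent with an unweighted mean of the four per-dimension scores. The quick check below reproduces it for a few rows copied from the table, allowing for two-decimal rounding in the published values.

```python
import math

# Verify that Avg matches the unweighted mean of the four metric
# columns, using rows copied from the leaderboard above.
rows = {
    "Nano Banana 2": (4.41, 4.88, 4.39, 4.42, 4.53),
    "GPT-4o":        (4.33, 4.94, 4.07, 4.12, 4.36),
    "WorldEdit":     (4.08, 4.54, 3.85, 3.83, 4.07),
}
for model, (*metrics, avg) in rows.items():
    mean = sum(metrics) / len(metrics)
    # tolerance covers rounding to two decimals plus float slack
    assert math.isclose(mean, avg, abs_tol=0.006), (model, mean, avg)
    print(f"{model}: mean={mean:.3f} ~ Avg={avg}")
```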