OneIG-Benchmark
OneIG-Benchmark is a comprehensive evaluation benchmark for text-to-image models, organized around five dimensions: Alignment, Text Rendering, Reasoning, Style, and Diversity. The official Standard suite includes 5 sub-tasks and supports both EN (English) and ZH (Chinese) language modes.
AISBench has adapted OneIG-Benchmark. The ais_bench/configs/oneig_examples/ directory contains standalone configuration file examples for multi-dimensional quality evaluation of generated images on GPU. OneIG uses an eval-only mode and does not include image generation steps. Please generate images using the model under evaluation before running the assessment.
Dataset Overview
Background
OneIG-Benchmark is developed to comprehensively evaluate the generation quality of text-to-image models from multiple fine-grained dimensions. Official GitHub: https://github.com/OneIG-Bench/OneIG-Benchmark, Dataset: https://huggingface.co/datasets/OneIG-Bench/OneIG-Benchmark.
Key Features
| Feature | Description |
|---|---|
| 5-Dimension Evaluation | Covers alignment, text, reasoning, style, and diversity |
| Bilingual Support | EN (English) / ZH (Chinese) modes |
| LLM-as-Judge | Alignment and Text tasks use a multimodal LLM as judge |
| ML Model Evaluation | Reasoning, Style, and Diversity use specialized ML models |
| Eval-Only Mode | Evaluates pre-generated images only; no image generation step |
| Grid Splitting | Supports automatic splitting of grid-composited images into sub-images |
| Accuracy Alignment | Accuracy difference < 1% compared to official evaluation |
Architecture Overview
The end-to-end evaluation process consists of data preparation and evaluation phases:
    Data Preparation                                     Evaluation Phase
┌──────────────────────┐     ┌───────────────────────────────────────────┐
│ OneIG-Bench.csv      │     │ images/ directory (images under test)     │
│ (Original Dataset)   │     │     ↓                                     │
│     ↓                │     │ ┌─────────────┐  ┌─────────────┐          │
│ Prompt Extraction    │     │ │ Alignment   │  │ Text        │          │
│     ↓                │     │ │ (LLM-Judge) │  │ (LLM-Judge) │          │
│ T2I Model Generation │     │ └─────────────┘  └─────────────┘          │
│     ↓                │     │ ┌─────────────┐  ┌─────────────┐          │
│ images/ directory    │────▶│ │ Reasoning   │  │ Style       │          │
│ (Evaluation Target)  │     │ │ (LLM2CLIP)  │  │ (CSD+SE)    │          │
└──────────────────────┘     │ └─────────────┘  └─────────────┘          │
                             │ ┌─────────────┐                           │
                             │ │ Diversity   │  → results/ directory     │
                             │ │ (DreamSim)  │    (evaluation results)   │
                             │ └─────────────┘                           │
                             └───────────────────────────────────────────┘
AISBench Adaptation Architecture (four-layer separation):
ais_bench/
├── benchmark/                          # Framework Layer
│   ├── datasets/oneig.py               # Dataset Loader
│   ├── tasks/oneig/                    # Evaluation Task Package
│   │   ├── __init__.py                 # Module Entry
│   │   ├── oneig_eval.py               # Evaluation Task (OneIGEvalTask)
│   │   ├── oneig_eval_utils.py         # Utility Functions
│   │   ├── oneig_alignment_eval.py     # Alignment Evaluator
│   │   ├── oneig_text_eval.py          # Text Evaluator
│   │   ├── oneig_reasoning_eval.py     # Reasoning Evaluator
│   │   ├── oneig_style_eval.py         # Style Evaluator
│   │   └── oneig_diversity_eval.py     # Diversity Evaluator
│   └── summarizers/oneig.py            # Score Summarizer
├── configs/oneig_examples/             # User Example Configs
│   └── oneig_full_eval.py              # Full Evaluation Config
└── docs/
    ├── source_zh_cn/extended_benchmark/lmm_generate/oneig.md   # Chinese Doc
    └── source_en/extended_benchmark/lmm_generate/oneig.md      # English Doc
Dependencies and Environment
Base Environment
OneIG evaluation supports GPU only. Before starting, ensure AISBench is installed:
# Clone AISBench repository
git clone https://github.com/AISBench/benchmark.git
cd benchmark/
# Install dependencies
pip install -e ./ --use-pep517
OneIG Official Repository
OneIG evaluation depends on auxiliary data and reference embeddings from the official repository:
# Clone OneIG code from AISBench organization (with known bugs fixed)
git clone https://github.com/AISBench/OneIG-Benchmark.git
cd OneIG-Benchmark/
# Install dependencies
pip install -r requirements.txt
Model Weights and Resource Downloads
OneIG evaluation involves multiple model weights, categorized as follows:
1. HuggingFace Auto-Download (automatic on first run, no manual action required)
| Model | Used For | HuggingFace Path |
|---|---|---|
| Judge Model | Alignment / Text | |
| LLM2CLIP Clip | Reasoning | |
| LLM2CLIP Vision | Reasoning | |
| LLM2CLIP LLM | Reasoning | |
| SE Encoder | Style | |
2. DreamSim Weights (Diversity)
On first run, the dreamsim library automatically downloads weights to {ONEIG_ROOT}/models/. If GitHub is inaccessible, download manually:
- Download URL: https://github.com/ssundaram21/dreamsim/releases/download/v0.2.0-checkpoints/dreamsim_ensemble_checkpoint.zip
- Extract to the {ONEIG_ROOT}/models/ directory
- Should contain: dino_vitb16_pretrain.pth, open_clip_vitb16_pretrain.pth.tar, clip_vitb16_pretrain.pth.tar, ensemble_lora/
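After a manual extraction, a quick pre-flight check can confirm that the expected entries are in place. The sketch below only tests for the four names listed above under `{ONEIG_ROOT}/models/`; the helper name `missing_dreamsim_files` is hypothetical and is not part of AISBench or the dreamsim library.

```python
from pathlib import Path

# Entries the extracted DreamSim archive is expected to contain (from the list above).
EXPECTED_ENTRIES = [
    "dino_vitb16_pretrain.pth",
    "open_clip_vitb16_pretrain.pth.tar",
    "clip_vitb16_pretrain.pth.tar",
    "ensemble_lora",
]


def missing_dreamsim_files(oneig_root: str) -> list[str]:
    """Return the expected entries that are absent from {oneig_root}/models/."""
    models_dir = Path(oneig_root) / "models"
    return [name for name in EXPECTED_ENTRIES if not (models_dir / name).exists()]


# An empty list means every expected entry was found.
print(missing_dreamsim_files("/path/to/OneIG-Benchmark"))
```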
3. Manual Download Required (Style Task)
| File | Archive Path | Download URL |
|---|---|---|
| CSD Encoder | | |
| CLIP ViT-L-14 | | |
4. Data Files Distributed with OneIG Repository (auto-obtained via git clone)
| File | Path | Purpose |
|---|---|---|
| Question Dependencies | | Alignment task Q&A dependencies |
| Text Content Data | | Text task reference texts |
| Reference Answers | | Reasoning task reference answers |
| Style Labels | | Style task style labels |
| CSD Reference Embeddings | | Style task CSD reference vectors |
| SE Reference Embeddings | | Style task SE reference vectors |
Quick Start
Configuration
Edit the config file ais_bench/configs/oneig_examples/oneig_full_eval.py and modify the following key parameters:
# OneIG official project absolute path (clone required)
ONEIG_ROOT = "/path/to/OneIG-Benchmark"
# Language mode: EN (English) or ZH (Chinese)
MODE = "EN"
# Image root directory (where generated images are stored)
IMAGE_DIR = "/path/to/oneig/images"
# Model name list (name of the image generation model)
MODEL_NAMES = ["Qwen-Image"]
# Grid configuration list (corresponds to MODEL_NAMES, format: 'rows,cols')
IMAGE_GRIDS = ["2,2"]
# Task list to execute (freely combinable)
TASKS = ['alignment', 'text', 'reasoning', 'style', 'diversity']
Run Evaluation
# Full evaluation (5 sub-tasks)
ais_bench ais_bench/configs/oneig_examples/oneig_full_eval.py -m eval
Results
After evaluation, results are output to outputs/default/{timestamp}/:
outputs/default/{timestamp}/
├── configs/
│   └── {timestamp}.py                  # Evaluation config snapshot
├── logs/
│   └── eval/
│       └── oneig_eval/
│           ├── oneig_alignment.out     # Task logs
│           ├── oneig_text.out
│           ├── oneig_reasoning.out
│           ├── oneig_style.out
│           └── oneig_diversity.out
├── results/
│   └── oneig_eval/
│       ├── oneig_alignment.json        # Task results (with per-sample details)
│       ├── oneig_text.json
│       ├── oneig_reasoning.json
│       ├── oneig_style.json
│       └── oneig_diversity.json
└── summary/
    ├── summary_{timestamp}.csv         # Evaluation summary
    ├── summary_{timestamp}.md
    └── summary_{timestamp}.txt
Configuration and Output
Common Configuration Options
| Option | Purpose | Required |
|---|---|---|
| ONEIG_ROOT | OneIG official project absolute path | Yes |
| MODE | Language mode: EN (English) or ZH (Chinese) | Yes |
| IMAGE_DIR | Image root directory for evaluation | Yes |
| MODEL_NAMES | List of image generation model names | Yes |
| IMAGE_GRIDS | Grid configuration list, format 'rows,cols' | Yes |
| TASKS | Task list, options: alignment, text, reasoning, style, diversity | Yes |
| | Judge model path (Alignment/Text) | No |
| | Judge model random seed, default 42 | No |
| | DreamSim weight cache directory, default {ONEIG_ROOT}/models/ | No |
Preset Configurations
| Config Name | Description | Config File |
|---|---|---|
| oneig_full_eval | Full evaluation config with 5 sub-tasks, freely combinable | ais_bench/configs/oneig_examples/oneig_full_eval.py |
Result Path
Written per sub-task:
{work_dir}/results/oneig_eval/oneig_{task}.json
Where {task} is one of alignment, text, reasoning, style, diversity.
Output Format
Each sub-task JSON result file has the following structure (using Alignment as an example):
{
"accuracy": 88.44,
"details": [
{
"id": "000",
"class_item": "anime",
"score": 0.85,
"image_path": "/path/to/image.png",
"grid": "2x2",
"num_splits": 4,
"judge_details": [
{
"question_id": "Q1",
"question": "...",
"judge_prompt": "...",
"judge_outputs": [
{"grid_idx": 0, "raw_output": "Yes", "parsed_answer": "Yes", "score": 1.0}
],
"dependency": [0],
"filtered_scores": null
}
]
}
],
"style_scores": null
}
The details field contains different intermediate data per sub-task:
| Sub-task | Intermediate Data Field | Description |
|---|---|---|
| Alignment | judge_details | Per-split Judge Q&A details |
| Text | | Per-split OCR results and text metrics (ED/CR/WAC) |
| Reasoning | | Per-split similarity scores |
| Style | | Per-split CSD/SE similarity and style scores |
| Diversity | | DreamSim distance for each pair of splits |
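For post-processing, a result file can be read with the standard `json` module. The sketch below uses a trimmed, made-up payload that keeps only the `accuracy`, `details`, `class_item`, and `score` keys from the example structure, and averages the per-sample scores by `class_item`. The helper name `mean_score_by_class` is hypothetical and the numbers are illustrative, not real results.

```python
import json
from collections import defaultdict

# Trimmed stand-in for results/oneig_eval/oneig_alignment.json (illustrative values).
raw = """
{"accuracy": 80.0,
 "details": [
   {"id": "000", "class_item": "anime", "score": 0.75},
   {"id": "001", "class_item": "anime", "score": 0.85},
   {"id": "000", "class_item": "human", "score": 0.80}
 ]}
"""


def mean_score_by_class(result: dict) -> dict[str, float]:
    """Average the per-sample `score` field for each `class_item`."""
    buckets = defaultdict(list)
    for sample in result["details"]:
        buckets[sample["class_item"]].append(sample["score"])
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


result = json.loads(raw)
per_class = mean_score_by_class(result)
print(result["accuracy"], per_class)
```

To use it on a real run, replace `json.loads(raw)` with `json.load(open(path))` on one of the `oneig_{task}.json` files.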
Evaluation Metrics
Metric Overview
| Sub-task | Primary Metric | Auxiliary Metrics | Evaluation Method | Evaluation Model |
|---|---|---|---|---|
| Alignment | accuracy | - | LLM-as-Judge | Qwen3-VL-8B-Instruct |
| Text | accuracy | ED, CR, WAC | LLM-as-Judge + OCR | Qwen3-VL-8B-Instruct |
| Reasoning | accuracy | - | Feature Similarity | LLM2CLIP |
| Style | accuracy | - | Feature Similarity | CSD + SE Encoder |
| Diversity | accuracy | Per-class_item diversity scores | Perceptual Distance | DreamSim |
| Total | accuracy | - | Average of 5 tasks | - |
Sub-task Evaluation Logic
Alignment (LLM-as-Judge)
Goal: Evaluate the alignment between generated images and prompts.
Flow:
1. Split grid images into sub-images
2. For each sub-image, use the Judge model (Qwen3-VL-8B-Instruct) to answer Yes/No questions
3. "Yes" scores 1, "No" scores 0
4. Average all sub-image scores as the sample score
5. Multiply the mean of all sample scores by 100 to get accuracy
Key Parameters:
- judge_model_path: Judge model path
- judge_seed: Random seed (default 42, ensures reproducibility)
- num_gpus: Supports multi-GPU parallelism (recommended 4)
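The aggregation in the Alignment flow can be sketched in a few lines. This is a simplified illustration, not the AISBench implementation: it assumes one judge answer per sub-image and ignores the `dependency` and `filtered_scores` fields that appear in the result JSON. The function names are hypothetical.

```python
def sample_score(answers: list[str]) -> float:
    """Mean of per-sub-image scores: "Yes" counts as 1.0, anything else as 0.0."""
    scores = [1.0 if a.strip().lower() == "yes" else 0.0 for a in answers]
    return sum(scores) / len(scores)


def accuracy(samples: list[list[str]]) -> float:
    """Mean sample score, scaled to the 0-100 range."""
    return 100.0 * sum(sample_score(s) for s in samples) / len(samples)


# Two samples from 2x2 grids, one judge answer per sub-image.
samples = [["Yes", "Yes", "No", "Yes"], ["Yes", "No", "No", "Yes"]]
print(accuracy(samples))  # 62.5
```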
Text (LLM-as-Judge + OCR)
Goal: Evaluate the accuracy of text rendering in generated images.
Flow:
1. Split grid images into sub-images
2. Use the Judge model to perform OCR on each sub-image and extract the rendered text
3. Compare the extracted text with the reference text using three metrics:
   - ED (Edit Distance)
   - CR (Completion Rate)
   - WAC (Word Accuracy)
4. Combine the OCR metrics and the Judge score to get accuracy
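The exact ED, CR, and WAC definitions live in the OneIG code and are not reproduced here. As an illustration of the edit-distance idea only, the following is a standard Levenshtein implementation; whether OneIG normalizes case or whitespace before comparing, and how it combines the three metrics, is not covered by this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (free when characters match)
            ))
        prev = curr
    return prev[-1]


# OCR output compared with a reference string.
print(edit_distance("OPEN 24 HOUR", "OPEN 24 HOURS"))  # 1
print(edit_distance("kitten", "sitting"))              # 3
```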
Reasoning (LLM2CLIP)
Goal: Evaluate how well generated images reflect the answer expected for reasoning-type prompts.
Flow:
1. Split grid images into sub-images
2. Use LLM2CLIP to extract image features and reference answer text features
3. Compute cosine similarity between image and text features
4. Average all sub-image similarities as the sample score
5. Multiply the mean of all sample scores by 100 to get accuracy
Model Components:
- CLIP Processor: openai/clip-vit-large-patch14-336
- CLIP Model: microsoft/LLM2CLIP-Openai-L-14-336
- LLM Model: microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned
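The similarity step of the Reasoning flow reduces to cosine similarity between feature vectors. The sketch below uses tiny hand-written vectors in place of real LLM2CLIP features, so it only illustrates the arithmetic; the function names are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def reasoning_sample_score(image_feats: list[list[float]], text_feat: list[float]) -> float:
    """Average the image-to-text similarity over all sub-images of one sample."""
    sims = [cosine_similarity(f, text_feat) for f in image_feats]
    return sum(sims) / len(sims)


text_feat = [1.0, 0.0]                   # stand-in for the reference-answer feature
image_feats = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for two sub-image features
print(reasoning_sample_score(image_feats, text_feat))  # 0.5
```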
Style (CSD + SE Encoder)
Goal: Evaluate how closely generated images match the requested style.
Flow:
1. Split grid images into sub-images
2. Use the CSD (Contrastive Style Descriptors) encoder to extract style features
3. Use the SE (Style Encoder) model to extract style features
4. Compute cosine similarity of CSD and SE features with reference style embeddings
5. Average the two similarities as the sub-image style score
6. Average over sub-images, then over samples, and multiply by 100 to get accuracy
Style Categories (29 types): abstract_expressionism, art_nouveau, baroque, chinese_ink_painting, cubism, fauvism, impressionism, line_art, minimalism, pointillism, pop_art, rococo, ukiyo-e, clay, crayon, graffiti, lego, comic, pencil_sketch, stone_sculpture, watercolor, celluloid, chibi, cyberpunk, ghibli, impasto, pixar, pixel_art, 3d_rendering
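The two-level averaging in the Style flow can be illustrated as follows. The sketch takes precomputed CSD and SE similarities as inputs, so how each similarity is derived from the reference embeddings is outside its scope; the function names and numbers are hypothetical.

```python
def sub_image_style_score(csd_sim: float, se_sim: float) -> float:
    """Average the CSD and SE similarities for one sub-image."""
    return (csd_sim + se_sim) / 2


def style_accuracy(samples: list[list[tuple[float, float]]]) -> float:
    """samples holds, per sample, one (csd_sim, se_sim) pair per sub-image."""
    sample_scores = []
    for sub_images in samples:
        scores = [sub_image_style_score(csd, se) for csd, se in sub_images]
        sample_scores.append(sum(scores) / len(scores))
    return 100.0 * sum(sample_scores) / len(sample_scores)


# One sample with two sub-images, one sample with a single sub-image.
samples = [[(0.5, 0.25), (0.25, 0.5)], [(0.75, 0.25)]]
print(style_accuracy(samples))  # 43.75
```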
Diversity (DreamSim)
Goal: Evaluate the diversity among multiple images generated by the same model.
Flow:
1. Split grid images into sub-images
2. Use the DreamSim model to compute pairwise perceptual distances between all sub-images
3. Average all pairwise distances as the sample diversity score
4. Group by class_item (anime, human, object, text, reasoning) for fine-grained metrics
5. Multiply the mean of all sample scores by 100 to get accuracy
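The pairwise averaging in the Diversity flow can be sketched with `itertools.combinations`. The distance function here is a toy stand-in on scalar "features"; in the real pipeline the DreamSim model would supply the perceptual distance between two sub-images. The function names are hypothetical.

```python
from itertools import combinations


def mean_pairwise_distance(items, distance) -> float:
    """Average distance(a, b) over every unordered pair of items."""
    pairs = list(combinations(items, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def toy_distance(a: float, b: float) -> float:
    """Stand-in for a perceptual distance; NOT the DreamSim metric."""
    return abs(a - b)


sub_images = [0.0, 0.25, 0.5, 1.0]  # four sub-images from a 2x2 grid -> 6 pairs
print(mean_pairwise_distance(sub_images, toy_distance))
```

A grid with n sub-images yields n(n-1)/2 pairs, so a 2x2 grid contributes 6 distances and a 3x3 grid contributes 36.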
Score Aggregation (oneig_total)
oneig_total is the simple average of 5 sub-task accuracy values:
oneig_total = (alignment + text + reasoning + style + diversity) / 5
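As a worked check, the formula can be applied to the five accuracy values listed under Example Results in this document:

```python
# Sub-task accuracy values taken from the Example Results table.
scores = {
    "alignment": 88.44,
    "text": 80.79,
    "reasoning": 29.84,
    "style": 35.85,
    "diversity": 18.28,
}

oneig_total = sum(scores.values()) / len(scores)
print(round(oneig_total, 2))  # 50.64
```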
Additionally, the Diversity task outputs fine-grained metrics grouped by class_item:
| Metric | Description |
|---|---|
| oneig_diversity_anime | Diversity score for Anime category |
| oneig_diversity_human | Diversity score for Portrait category |
| oneig_diversity_object | Diversity score for General Object category |
| oneig_diversity_text | Diversity score for Text Rendering category |
| oneig_diversity_reasoning | Diversity score for Knowledge Reasoning category |
Example Results
dataset version metric mode oneig_eval
oneig_alignment a39421 accuracy gen 88.44
oneig_text a39421 accuracy gen 80.79
oneig_text a39421 ED gen 43.32
oneig_text a39421 CR gen 0.08
oneig_text a39421 WAC gen 0.52
oneig_reasoning a39421 accuracy gen 29.84
oneig_style a39421 accuracy gen 35.85
oneig_diversity a39421 accuracy gen 18.28
oneig_total - accuracy gen 50.64
oneig_diversity_anime - accuracy gen 9.00
oneig_diversity_human - accuracy gen 11.21
oneig_diversity_object - accuracy gen 13.27
oneig_diversity_text - accuracy gen 21.14
oneig_diversity_reasoning - accuracy gen 36.80
Data Format
Original Dataset Format
The OneIG original dataset is a CSV file (OneIG-Bench.csv), where each record contains:
{
"category": "Anime_Stylization",
"id": "000",
"prompt_en": "4boys, 5girls, multiple boys, multiple girls, ...",
"type": "T, P",
"prompt_length": "long",
"class": "None"
}
| Field | Description |
|---|---|
| category | Prompt category: Anime_Stylization, Portrait, General Object, Text Rendering, Knowledge Reasoning, Multilingualism |
| id | Unique ID, maintained independently per category |
| prompt_en | Text-to-image prompt |
| type | Type marker: T (Text), P (Portrait), NP (Non-Portrait) |
| prompt_length | Prompt length: short, middle, long |
| class | Style category (optional): fauvism, watercolor, None |
Image Directory Structure
Images for evaluation should be organized as follows:
IMAGE_DIR/
├── anime/                  # class_item directory
│   └── {model_name}/       # model name directory
│       ├── 000.png         # image file (first 3 chars of filename = sample_id)
│       ├── 001.png
│       └── ...
├── human/
│   └── {model_name}/
│       ├── 000.png
│       └── ...
├── object/
│   └── {model_name}/
│       └── ...
├── text/
│   └── {model_name}/
│       └── ...
└── reasoning/
    └── {model_name}/
        └── ...
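Before running an evaluation it can help to list what the layout above actually contains. The sketch below walks `IMAGE_DIR/{class_item}/{model_name}/` and keys each file by the first three characters of its name, following the sample_id rule in the tree. It assumes `.png` files only, and `collect_images` is a hypothetical helper, not an AISBench function.

```python
from pathlib import Path


def collect_images(image_dir: str, class_item: str, model_name: str) -> dict[str, Path]:
    """Map sample_id (first 3 characters of the filename) to the image path."""
    folder = Path(image_dir) / class_item / model_name
    return {p.name[:3]: p for p in sorted(folder.glob("*.png"))}
```

For example, `collect_images("/path/to/oneig/images", "anime", "Qwen-Image")` would return a dictionary keyed by IDs such as `"000"` and `"001"` if those files exist.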
class_item directories for each sub-task:
| Sub-task | EN Mode | ZH Mode (additional) |
|---|---|---|
| Alignment | anime, human, object | multilingualism |
| Text | text | - |
| Reasoning | reasoning | - |
| Style | anime | - |
| Diversity | anime, human, object, text, reasoning | multilingualism |
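The table above can be turned into a small lookup that tells you which class_item directories a given task selection needs. This sketch assumes that ZH mode uses the EN directories plus the ones in the "additional" column; the names `EN_CLASS_ITEMS`, `ZH_EXTRA`, and `required_class_items` are hypothetical.

```python
# class_item directories per sub-task, transcribed from the table above.
EN_CLASS_ITEMS = {
    "alignment": ["anime", "human", "object"],
    "text": ["text"],
    "reasoning": ["reasoning"],
    "style": ["anime"],
    "diversity": ["anime", "human", "object", "text", "reasoning"],
}
# Directories added in ZH mode (assumed to be on top of the EN ones).
ZH_EXTRA = {"alignment": ["multilingualism"], "diversity": ["multilingualism"]}


def required_class_items(tasks: list[str], mode: str = "EN") -> list[str]:
    """Return the de-duplicated class_item directories needed for the tasks."""
    needed: list[str] = []
    for task in tasks:
        extra = ZH_EXTRA.get(task, []) if mode == "ZH" else []
        for item in EN_CLASS_ITEMS[task] + extra:
            if item not in needed:
                needed.append(item)
    return needed


print(required_class_items(["alignment", "style"]))
# ['anime', 'human', 'object']
print(required_class_items(["alignment", "text"], mode="ZH"))
# ['anime', 'human', 'object', 'multilingualism', 'text']
```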
Grid Splitting
OneIG supports compositing multiple generated images into a grid for batch evaluation. The IMAGE_GRIDS config specifies the grid rows and columns:
| Grid Config | Meaning | Sub-images |
|---|---|---|
| 1,2 | 1 row, 2 columns | 2 |
| 2,2 | 2 rows, 2 columns | 4 |
| 1,4 | 1 row, 4 columns | 4 |
| 3,3 | 3 rows, 3 columns | 9 |
During evaluation, grid images are automatically split into sub-images, each evaluated independently and averaged.
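To make the splitting concrete, the sketch below parses a 'rows,cols' string and computes one crop box per sub-image in row-major order. It assumes equally sized cells and uses integer division, which may differ from how AISBench handles image sizes that do not divide evenly; `grid_boxes` is a hypothetical helper. The boxes use the (left, upper, right, lower) convention that Pillow's `Image.crop` accepts.

```python
def grid_boxes(width: int, height: int, grid: str) -> list[tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) crop boxes in row-major order."""
    rows, cols = (int(part) for part in grid.split(","))
    cell_w, cell_h = width // cols, height // rows
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


print(grid_boxes(1024, 1024, "2,2"))
# [(0, 0, 512, 512), (512, 0, 1024, 512), (0, 512, 512, 1024), (512, 512, 1024, 1024)]
```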
Example Code
Single Task Evaluation
Modify the TASKS list in the config file to include only the desired task:
# Evaluate Alignment only
TASKS = ['alignment']
Run:
ais_bench ais_bench/configs/oneig_examples/oneig_full_eval.py -m eval
Full Evaluation
# Evaluate all 5 sub-tasks
TASKS = ['alignment', 'text', 'reasoning', 'style', 'diversity']
Run:
ais_bench ais_bench/configs/oneig_examples/oneig_full_eval.py -m eval
Chinese Mode Evaluation
ONEIG_ROOT = "/path/to/OneIG-Benchmark"
MODE = "ZH" # Switch to Chinese mode
IMAGE_DIR = "/path/to/oneig/images_zh" # Images generated from Chinese prompts
MODEL_NAMES = ["Qwen-Image"]
IMAGE_GRIDS = ["2,2"]
TASKS = ['alignment', 'text', 'reasoning', 'style', 'diversity']
Multi-Model Comparison
MODEL_NAMES = ["model_a", "model_b"]
IMAGE_GRIDS = ["2,2", "2,2"] # Must match MODEL_NAMES length
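Because the two lists are matched by position, a length mismatch is easy to introduce when adding a model. A small sanity check such as the one below can be run before launching the evaluation; `pair_models_with_grids` is a hypothetical helper and is not part of the AISBench config API.

```python
MODEL_NAMES = ["model_a", "model_b"]
IMAGE_GRIDS = ["2,2", "1,4"]


def pair_models_with_grids(names: list[str], grids: list[str]) -> dict[str, tuple[int, int]]:
    """Pair each model with its (rows, cols) grid, failing fast on a mismatch."""
    if len(names) != len(grids):
        raise ValueError(
            f"MODEL_NAMES has {len(names)} entries but IMAGE_GRIDS has {len(grids)}"
        )
    return {
        name: tuple(int(part) for part in grid.split(","))
        for name, grid in zip(names, grids)
    }


print(pair_models_with_grids(MODEL_NAMES, IMAGE_GRIDS))
# {'model_a': (2, 2), 'model_b': (1, 4)}
```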