Improve model card: Add prominent links and evaluation instructions (#1)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- gui
- agent
- gui-grounding
- reinforcement-learning
---

# InfiGUI-G1-7B

**[Paper](https://arxiv.org/abs/2508.05731)** | **[Project Page](https://osatlas.github.io/)** | **[Code](https://github.com/InfiXAI/InfiGUI-G1)**

This repository contains the InfiGUI-G1-7B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.

The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned with our proposed **Adaptive Exploration Policy Optimization (AEPO)** framework. AEPO is a reinforcement learning method designed to strengthen the model's **semantic alignment** for GUI grounding. It overcomes the exploration bottlenecks of standard RLVR methods by combining a multi-answer generation strategy with a theoretically grounded adaptive reward function, enabling more effective and efficient learning for complex GUI interactions.
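
As a rough intuition for how the multi-answer strategy and the adaptive reward fit together, consider the toy sketch below. The helper names and the decay shape are illustrative only, not the paper's formulation; see the paper for the actual AEPO reward and its derivation.

```python
def multi_answer_rewards(candidate_points, hits_target, alpha=0.5):
    """Toy illustration of a multi-answer, rank-decayed reward.

    NOT the exact AEPO formula: it only conveys the idea that every
    sampled answer earns credit if it lands on the target element,
    with credit decaying by generation rank, so the policy keeps
    exploring alternatives while learning to rank correct answers first.
    """
    return [
        1.0 / (1.0 + alpha * rank) if hits_target(point) else 0.0
        for rank, point in enumerate(candidate_points)
    ]
```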

From the `visualize_points` helper in the usage example:

```python
# Draw circle
circle_radius = 20
draw.ellipse([original_x - circle_radius, original_y - circle_radius,
              original_x + circle_radius, original_y + circle_radius],
             fill=(255, 0, 0))

# Draw label
```

And from `main()` in the same script, where the model inputs are prepared:

```python
# Prepare model inputs
instruction = "shuffle play the current playlist"
system_prompt = 'You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags.'
prompt = f'''The screen's resolution is {new_width}x{new_height}.
Locate the UI element(s) for "{instruction}", output the coordinates using JSON format: [{{"point_2d": [x, y]}}, ...]'''
```

The script is driven by a standard `if __name__ == "__main__": main()` entry point.
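
The reply contains a `<think>` reasoning block followed by the JSON list requested in the prompt. A small helper like the following can extract the points (an illustrative sketch; `parse_points` is not part of the released script and assumes the response follows the requested format):

```python
import json
import re

def parse_points(response: str) -> list:
    """Extract [x, y] points from a reply like
    '<think>...</think>[{"point_2d": [x, y]}, ...]'."""
    # Drop the <think>...</think> reasoning block, keeping only the answer
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Grab the JSON list of point dicts
    match = re.search(r"\[.*\]", answer, flags=re.DOTALL)
    if match is None:
        return []
    return [item["point_2d"] for item in json.loads(match.group(0))]
```

The returned coordinates live in the resized `{new_width}x{new_height}` space and must be scaled back to the original screenshot resolution before clicking or drawing, which is what the `visualize_points` helper above does when it maps points to `original_x`/`original_y`.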

## Results

Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.

On the widely-used ScreenSpot-V2 benchmark:

<div align="center">
<img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
</div>

## Evaluation

This section provides instructions for reproducing the evaluation results reported in our paper.

### 1. Getting Started

Clone the repository and navigate to the project directory:

```bash
git clone https://github.com/InfiXAI/InfiGUI-G1.git
cd InfiGUI-G1
```

### 2. Environment Setup

The evaluation pipeline is built on the [vLLM](https://github.com/vllm-project/vllm) library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:

- **Python**: `3.10.12`
- **PyTorch**: `2.6.0`
- **Transformers**: `4.50.1`
- **vLLM**: `0.8.2`
- **CUDA**: `12.6`

The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
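
If you prefer to pin the environment directly, the matching packages can be installed with pip (a sketch assuming a CUDA 12.6 machine with compatible drivers; adapt to your setup):

```bash
# Pin the versions used for the paper's results; vLLM 0.8.2 pulls in torch 2.6.0
pip install torch==2.6.0 transformers==4.50.1 vllm==0.8.2
```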

### 3. Model Download

Download the InfiGUI-G1 models from the Hugging Face Hub into the `./models` directory:

```bash
# Create a directory for models
mkdir -p ./models

# Download InfiGUI-G1-3B
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B

# Download InfiGUI-G1-7B
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B
```

### 4. Dataset Download and Preparation

Download the required evaluation benchmarks into the `./data` directory:

```bash
# Create a directory for datasets
mkdir -p ./data

# Download benchmarks
huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
```

After downloading, some datasets require unzipping compressed image files:

```bash
# Unzip images for ScreenSpot-v2
unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/

# Unzip images for MMBench-GUI
unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
```

### 5. Running the Evaluation

To run the evaluation, use the `eval/eval.py` script, specifying the path to the model, the benchmark name, and the tensor parallel size. For example, to evaluate the `InfiGUI-G1-3B` model on the `screenspot-pro` benchmark using 4 GPUs:

```bash
python eval/eval.py \
    ./models/InfiGUI-G1-3B \
    --benchmark screenspot-pro \
    --tensor-parallel 4
```

- **`model_path`**: The first positional argument; the path to the downloaded model directory (e.g., `./models/InfiGUI-G1-3B`).
- **`--benchmark`**: The benchmark to evaluate. Available options include `screenspot-pro`, `screenspot-v2`, `ui-vision`, `mmbench-gui`, and `i2e-bench`.
- **`--tensor-parallel`**: The tensor parallelism size, which should typically match the number of available GPUs.

Evaluation results, including detailed logs and performance metrics, are saved to the `./output/{model_name}/{benchmark}/` directory. A convenience loop for sweeping every model and benchmark is sketched below.
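
To sweep every released model over every benchmark in one pass, a plain shell loop over the documented options works (a sketch; adjust the model list, benchmarks, and GPU count to your setup):

```bash
# Evaluate both models on all five benchmarks sequentially
for model in InfiGUI-G1-3B InfiGUI-G1-7B; do
  for bench in screenspot-pro screenspot-v2 ui-vision mmbench-gui i2e-bench; do
    python eval/eval.py "./models/${model}" --benchmark "${bench}" --tensor-parallel 4
  done
done
```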

## Citation Information

If you find this work useful, we would be grateful if you would consider citing the following papers:

```bibtex
...
  year={2025}
}
```

## Acknowledgements

We would like to express our gratitude for the following open-source projects: [VERL](https://github.com/volcengine/verl), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), and [vLLM](https://github.com/vllm-project/vllm).