minpeter committed
Commit 4d092fe · verified · 1 Parent(s): b495ceb

diff for compatibility

Files changed (2)
  1. README.md +12 -100
  2. preprocessor_config.json +2 -2
README.md CHANGED
@@ -7,110 +7,22 @@ base_model:
 pipeline_tag: image-text-to-text
 ---

- ## R1-Onevision
-
- [\[📂 GitHub\]](https://github.com/Fancy-MLLM/R1-Onevision)[\[📝 Report\]](https://yangyi-vai.notion.site/r1-onevision?pvs=4)
- [\[🤗 HF Dataset\]](https://huggingface.co/datasets/Fancy-MLLM/R1-onevision) [\[🤗 Reasoning Benchmark\]](https://huggingface.co/datasets/Fancy-MLLM/R1-OneVision-Bench) [\[🤗 HF Demo\]](https://huggingface.co/spaces/Fancy-MLLM/R1-OneVision)
-
- ## Model Overview
-
- This is a multimodal large language model fine-tuned from Qwen2.5-VL on the **R1-Onevision** dataset. The model enhances vision-language understanding and reasoning capabilities, making it suitable for various tasks such as visual reasoning, image understanding. With its robust ability to perform multimodal reasoning, R1-Onevision emerges as a powerful AI assistant capable of addressing a wide range of problem-solving challenges across different domains.
-
- ## Training Configuration and Curve
- - Framework: The training process uses the open-source **LLama-Factory** library, with **Qwen2.5-VL-Instruct** as the base model. This model comes in three variants: 3B, 7B, and 32B.
- - Parameters: For efficiency, we use a resolution of 518 for image inputs to save GPU memory. The training follows a full model SFT (Supervised Fine-Tuning) approach with a learning rate of 1e-5, trained for one epoch.
-
- The training configuration is as follows:
- ```python
- image_resolution: 518
- cutoff_len: 8192
- per_device_train_batch_size: 1
- gradient_accumulation_steps: 16
- learning_rate: 1.0e-5
-
- num_train_epochs: 1.0
- lr_scheduler_type: cosine
- warmup_ratio: 0.05
- bf16: true
- flash_attn: fa2
- ```
-
- Training loss curve:
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65af78bb3e82498d4c65ed2a/8BNyo-v68aFvab2kXxtt1.png"/>
-
- ## Usage
-
- You can load the model using the Hugging Face `transformers` library:
-
- ```python
- from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
- import torch
- from qwen_vl_utils import process_vision_info
-
- MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"
- processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-     MODEL_ID,
-     trust_remote_code=True,
-     torch_dtype=torch.bfloat16
- ).to("cuda").eval()
-
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "image": "<your image path>"},
-             {"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
-         ],
-     }
- ]
-
- # Preparation for inference
- text = processor.apply_chat_template(
-     messages, tokenize=False, add_generation_prompt=True
- )
- image_inputs, video_inputs = process_vision_info(messages)
- inputs = processor(
-     text=[text],
-     images=image_inputs,
-     videos=video_inputs,
-     padding=True,
-     return_tensors="pt",
- )
- inputs = inputs.to(model.device)
-
- generated_ids = model.generate(**inputs, max_new_tokens=4096)
- generated_ids_trimmed = [
-     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- output_text = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )
- print(output_text)
- ```
-
- ## Ongoing Work
- 1. **Rule-Based Reinforcement Learning (RL)**
-
- We are actively exploring the integration of rule-based systems into reinforcement learning to enhance the agent's decision-making process. This approach combines domain-specific rules with the learning process, aiming to improve the efficiency and safety of learning in complex environments.
-
- 2. **Training with General Data and Multimodal Reasoning CoT**
-
- Our ongoing work includes expanding the training datasets by incorporating more general data alongside multimodal reasoning Chain-of-Thought (CoT) data. This will enable the model to benefit from a broader range of information, enhancing its ability to handle diverse reasoning tasks across various domains.
-
- 3. **Incorporating Chinese Multimodal Reasoning CoT Data**
-
- We are also focused on integrating Chinese multimodal reasoning CoT data into the training process. By adding this language-specific dataset, we aim to improve the model’s capability to perform reasoning tasks in Chinese, expanding its multilingual and multimodal reasoning proficiency.
-
- 4. **Release of the 3B Model**
-
-
- We are working on the release of a smaller, more efficient 3B model, which is designed to provide a balance between performance and resource efficiency. This model aims to deliver strong multimodal reasoning capabilities while being more accessible and optimized for environments with limited computational resources, offering a more compact alternative to the current 7B model.
-
- # Institution
- - Zhejiang University
-
- ## Model Contact
-
-
-
+ <!-- header start -->
+ <p align="center">
+   <img src="https://huggingface.co/datasets/FriendliAI/documentation-images/resolve/main/model-card-assets/friendliai.png" width="100%" alt="FriendliAI Logo">
+ </p>
+ <!-- header end -->
+
+
+ # Fancy-MLLM/R1-Onevision-7B
+
+ * Model creator: [Fancy-MLLM](https://huggingface.co/Fancy-MLLM)
+ * Original model: [R1-Onevision-7B](https://huggingface.co/Fancy-MLLM/R1-Onevision-7B)
+
+ ## Differences
+
+ * Fixed the incorrectly set image_processor_type in preprocessor_config.json.
+
+ ## License
+
+ Refer to the license of the original model card.

preprocessor_config.json CHANGED
@@ -8,7 +8,7 @@
   0.4578275,
   0.40821073
 ],
- "image_processor_type": "Qwen2_5_VLImageProcessor",
+ "image_processor_type": "Qwen2VLImageProcessor",
 "image_std": [
   0.26862954,
   0.26130258,
@@ -26,4 +26,4 @@
   "shortest_edge": 3136
 },
 "temporal_patch_size": 2
- }
+ }
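
For reference, a minimal sketch of how the corrected `image_processor_type` can be verified once this commit is applied. It assumes the Hugging Face `transformers` library is installed; `MODEL_ID` is a placeholder to be replaced with this repository's actual model id.

```python
from transformers import AutoProcessor

MODEL_ID = "<this-repo-id>"  # placeholder: substitute this repository's model id

# With "image_processor_type" set to "Qwen2VLImageProcessor" in
# preprocessor_config.json, AutoProcessor can resolve the image processor class
# shipped with transformers; the commit describes the previous value as
# incorrectly set.
processor = AutoProcessor.from_pretrained(MODEL_ID)
print(type(processor.image_processor).__name__)  # expected: a Qwen2-VL image processor
```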