yanboding committed · verified
Commit 31f0d58 · 1 Parent(s): c5ebfbf

Update README.md

Files changed (1):
  1. README.md +93 -29
README.md CHANGED
@@ -7,25 +7,37 @@ license: apache-2.0

 <h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>

 > Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.

- <!--
- [Yanbo Ding](https://github.com/DINGYANB),
- [Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
- [Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
- [Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
- [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
  [Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
- -->

- 🔗 [Project Page](https://dingyanb.github.io/MTVCtafter/) |
- 📄 [ArXiv](https://arxiv.org/abs/2505.10238) |
- 💻 [Code](https://github.com/DINGYANB/MTVCrafter) |
- 🤗 [Hugging Face Model](https://huggingface.co/yanboding/MTVCrafter)

 </div>

 ## 🔍 Abstract

 Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
@@ -69,7 +81,8 @@ we use learnable unconditional tokens for motion classifier-free guidance.
 We recommend using a clean Python environment (Python 3.10+).

 ```bash
- clone this repository && cd MTVCrafter-main

 # Create virtual environment
 conda create -n mtvcrafter python=3.11
@@ -79,48 +92,99 @@ conda activate mtvcrafter
 pip install -r requirements.txt
 ```

 ## 🚀 Usage

- To animate a human image with a given 3D motion sequence,
- you first need to obtain the SMPL motion sequences from the driven video:

 ```bash
 python process_nlf.py "your_video_directory"
 ```

- Then, you can use the following command to animate the image guided by 4D motion tokens:

 ```bash
- python infer.py --ref_image_path "ref_images/hunam.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output"
 ```

- - `--ref_image_path`: Path to the reference character image.
- - `--motion_data_path`: Path to the motion sequence (.pkl format).
- - `--output_path`: Where to save the generated animation results.

- For our 4DMoT, you can run the following command to train the model on your dataset:

 ```bash
 accelerate launch train_vqvae.py
 ```

 ## 📄 Citation

 If you find our work useful, please consider citing:

 ```bibtex
- @misc{ding2025mtvcrafter4dmotiontokenization,
-   title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
-   author={Yanbo Ding and Xirui Hu and Zhizhi Guo and Chi Zhang and Yali Wang},
-   year={2025},
-   eprint={2505.10238},
-   archivePrefix={arXiv},
-   primaryClass={cs.CV},
-   url={https://arxiv.org/abs/2505.10238},
 }
 ```

 ## 📬 Contact

 For questions or collaboration, feel free to reach out via GitHub Issues
- or email me at 📧 [email protected].
 
 <h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>

+ <meta name="google-site-verification" content="-XQC-POJtlDPD3i2KSOxbFkSBde_Uq9obAIh_4mxTkM" />
+
+ <div align="center">
+
+ <h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>
+
 > Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.

+ [Yanbo Ding](https://scholar.google.com/citations?user=r_ty-f0AAAAJ&hl=zh-CN),
+ [Xirui Hu](https://scholar.google.com/citations?user=-C7R25QAAAAJ&hl=zh-CN&oi=ao),
+ [Zhizhi Guo](https://dblp.org/pid/179/1036.html),
  [Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)

+ [![arXiv](https://img.shields.io/badge/📖%20Paper-2505.10238-b31b1b.svg)](https://www.arxiv.org/abs/2505.10238)
+ [![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/yanboding/MTVCrafter)
+ [![ModelScope](https://img.shields.io/badge/🤖%20ModelScope-Models-blue)](https://www.modelscope.cn/models/AI-ModelScope/MTVCrafter)
+ [![Project Page1](https://img.shields.io/badge/🌐%20Page-CogVideoX-brightgreen)](https://dingyanb.github.io/MTVCtafter/)
+ [![Project Page2](https://img.shields.io/badge/🌐%20Page-Wan2.1-orange)](https://dingyanb.github.io/MTVCrafter-/)

 </div>

+ ## 📌 ToDo List
+
+ - [x] Release **global dataset statistics** (mean / std)
+ - [x] Release **4D MoT** model
+ - [x] Release **MV-DiT-7B** (based on *CogVideoX-T2V-5B*)
+ - [x] Release **MV-DiT-17B** (based on *Wan-2.1-I2V-14B*)
+ - [ ] Release a Hugging Face Demo Space
+
 ## 🔍 Abstract

 Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
 
 We recommend using a clean Python environment (Python 3.10+).

 ```bash
+ git clone https://github.com/DINGYANB/MTVCrafter.git
+ cd MTVCrafter

 # Create virtual environment
 conda create -n mtvcrafter python=3.11

 pip install -r requirements.txt
 ```
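
As a quick sanity check of the environment before downloading the large checkpoints, you can confirm that the GPU stack is visible; this assumes PyTorch is pulled in via `requirements.txt`, which is not spelled out here.

```bash
# Optional: verify that PyTorch sees the GPU (assumes torch is installed via requirements.txt)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```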

+ Download the following model weights (a download sketch follows this list):
+
+ 1. **NLF-Pose Estimator**
+    Download [`nlf_l_multi.torchscript`](https://github.com/isarandi/nlf/releases) from the NLF release page.
+
+ 2. **MV-DiT Backbone Models**
+    - **CogVideoX**: Download the [CogVideoX-5B checkpoint](https://huggingface.co/THUDM/CogVideoX-5b).
+    - **Wan2.1**: Download the [Wan2.1-Fun-14B checkpoint](https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-14B-InP) and place it under the `wan2.1/` folder.
+
+ 3. **MTVCrafter Checkpoints**
+    Download the MV-DiT and 4DMoT checkpoints from [MTVCrafter on Hugging Face](https://huggingface.co/yanboding/MTVCrafter).
+
+ 4. *(Optional but recommended)*
+    Download the enhanced LoRA for better performance with Wan2.1_I2V_14B:
+    [`Wan2.1_I2V_14B_FusionX_LoRA.safetensors`](https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/FusionX_LoRa/Wan2.1_I2V_14B_FusionX_LoRA.safetensors)
+    Place it under the `wan2.1/` folder.
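
One possible way to fetch the Hugging Face checkpoints listed above is `huggingface-cli`, as sketched below; the `--local-dir` targets are illustrative rather than a layout required by the repo, and the NLF TorchScript file still has to be taken from its GitHub releases page.

```bash
# Illustrative download commands; the local directory names are assumptions, not required paths
pip install -U "huggingface_hub[cli]"
huggingface-cli download THUDM/CogVideoX-5b --local-dir CogVideoX-5b
huggingface-cli download alibaba-pai/Wan2.1-Fun-V1.1-14B-InP --local-dir wan2.1
huggingface-cli download yanboding/MTVCrafter --local-dir MTVCrafter_ckpts
# nlf_l_multi.torchscript is downloaded manually from https://github.com/isarandi/nlf/releases
```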
+
+ ---
+
 ## 🚀 Usage

+ To animate a human image with a given 3D motion sequence,
+ you first need to prepare SMPL motion-video pairs. You can either:
+
+ - Use the provided sample data: `data/sampled_data.pkl`, or
+ - Extract SMPL motion sequences from your own driving video using:

 ```bash
 python process_nlf.py "your_video_directory"
 ```

+ This will generate a motion-video `.pkl` file under `"your_video_directory"`.
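
If you want to sanity-check a motion file before running inference, a minimal sketch is below; it only assumes the file is a standard pickle (the exact keys stored inside are not documented here), and uses the provided sample as an example.

```bash
# Peek inside the motion-video .pkl before passing it to the inference scripts
python - <<'PY'
import pickle

with open("data/sampled_data.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))  # which fields the motion file actually contains
PY
```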
+
+ ---
+
+ #### ▶️ Inference of MV-DiT-7B
+
+ ```bash
+ python infer_7b.py \
+     --ref_image_path "ref_images/human.png" \
+     --motion_data_path "data/sampled_data.pkl" \
+     --output_path "inference_output"
+ ```

+ #### ▶️ Inference of MV-DiT-17B (with text control)
+
 ```bash
+ python infer_17b.py \
+     --ref_image_path "ref_images/woman.png" \
+     --motion_data_path "data/sampled_data.pkl" \
+     --output_path "inference_output" \
+     --prompt "The woman is dancing on the beach, waves, sunset."
 ```

+ **Arguments:**
+
+ - `--ref_image_path`: Path to the reference character image.
+ - `--motion_data_path`: Path to the SMPL motion sequence (`.pkl` format).
+ - `--output_path`: Directory to save the generated video.
+ - `--prompt` (optional): Text prompt describing the scene or style.
+
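
As a usage example, the loop below animates every reference image in `ref_images/` with the same motion clip; it only reuses the flags documented above, while the loop itself and the per-image output folders are illustrative.

```bash
# Hypothetical batch run over several reference characters with one motion sequence
for img in ref_images/*.png; do
    name="$(basename "$img" .png)"
    python infer_7b.py \
        --ref_image_path "$img" \
        --motion_data_path "data/sampled_data.pkl" \
        --output_path "inference_output/$name"
done
```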
+ ---
+
+ ### 🏋️‍♂️ Training 4DMoT
+
+ To train the 4DMoT tokenizer on your own dataset:

 ```bash
 accelerate launch train_vqvae.py
 ```
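
`accelerate launch` picks up its settings from a one-time `accelerate config`; for multi-GPU training, something like the following would be typical, with the process count and precision here being illustrative values rather than settings from this repo.

```bash
# One-time interactive setup of the Accelerate launch configuration
accelerate config

# Example multi-GPU launch; adjust --num_processes to the number of available GPUs
accelerate launch --num_processes 4 --mixed_precision bf16 train_vqvae.py
```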

+ ---
+
+ ## 💙 Acknowledgement
+ MTVCrafter is largely built upon
+ [CogVideoX](https://github.com/THUDM/CogVideo) and
+ [Wan2.1-Fun](https://github.com/aigc-apps/VideoX-Fun).
+ We sincerely acknowledge these open-source codes and models.
+ We also appreciate the valuable insights from the researchers at the Institute of Artificial Intelligence (TeleAI), China Telecom, and the Shenzhen Institute of Advanced Technology.
+
 ## 📄 Citation

 If you find our work useful, please consider citing:

 ```bibtex
+ @article{ding2025mtvcrafter,
+   title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
+   author={Ding, Yanbo and Hu, Xirui and Guo, Zhizhi and Zhang, Chi and Wang, Yali},
+   journal={arXiv preprint arXiv:2505.10238},
+   year={2025}
 }
 ```

 ## 📬 Contact

 For questions or collaboration, feel free to reach out via GitHub Issues
+ or email me at 📧 [email protected].