The model was trained from scratch for 10,000 steps on a dataset of 1 million high-quality images covering both general and human-centric content. Training was performed at 1328 resolution using BFloat16 precision, with a batch size of 64, a learning rate of 2e-5, and a text dropout ratio of 0.10.
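For reference, these hyperparameters map onto a configuration like the sketch below. The key names are illustrative assumptions rather than the repository's actual training flags; the values are the ones stated above.

```python
# Illustrative mapping of the training setup described above; the key names
# are assumptions, not VideoX-Fun's actual flag names.
train_config = {
    "train_steps": 10_000,       # trained from scratch for 10,000 steps
    "dataset_size": 1_000_000,   # 1M high-quality general + human-centric images
    "resolution": 1328,
    "mixed_precision": "bf16",   # BFloat16 precision
    "batch_size": 64,
    "learning_rate": 2e-5,
    "text_dropout_ratio": 0.10,  # prompts dropped 10% of the time
}
```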
It supports multiple control conditions, including Canny, HED, Depth, Pose, and MLSD, and can be used like a standard ControlNet.
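For example, a Canny control image can be prepared with OpenCV before being passed to the model. This is a minimal sketch; the thresholds are typical defaults, not values prescribed by this document.

```python
import cv2
import numpy as np
from PIL import Image

# Load the reference image and extract Canny edges as the control condition.
image = cv2.imread("reference.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)  # typical low/high hysteresis thresholds

# Replicate the single edge channel to 3 channels, since control images are
# commonly fed to pipelines as RGB.
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
control_image.save("canny_control.png")
```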
You can adjust control_context_scale for stronger control and better detail preservation; the optimal range is from 0.65 to 0.80. For better stability, we highly recommend using a detailed prompt.
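As an illustration, an inference call might look like the sketch below. The import path, pipeline class, and most argument names are assumptions made for this example; only control_context_scale and its recommended 0.65 to 0.80 range come from this document, so consult the VideoX-Fun examples for the actual API.

```python
# Hypothetical usage sketch: the import path, class name, and most arguments
# are assumptions. Only `control_context_scale` and its 0.65-0.80 range are
# taken from this document.
from videox_fun.pipelines import ControlPipeline  # hypothetical import

pipe = ControlPipeline.from_pretrained("models/Diffusion_Transformer")
image = pipe(
    # A detailed prompt is recommended for stability.
    prompt="A young woman in a red coat walking through a snowy park at "
           "dusk, soft rim lighting, shallow depth of field, 35mm photo",
    control_image="canny_control.png",  # e.g. the Canny map prepared earlier
    control_context_scale=0.72,         # stronger control; optimal range 0.65-0.80
).images[0]
image.save("output.png")
```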
TODO
Train on more data and for more steps.
Support inpaint mode.
Results
[Image galleries: paired control/output examples for Pose (two sets), Canny, HED, and Depth.]
Inference
Go to the VideoX-Fun repository for more details.
Please clone the VideoX-Fun repository and create the required directories:
# Clone the code
git clone https://github.com/aigc-apps/VideoX-Fun.git
# Enter VideoX-Fun's directory
cd VideoX-Fun
# Create model directories
mkdir -p models/Diffusion_Transformer
mkdir -p models/Personalized_Model
Then download the weights into models/Diffusion_Transformer and models/Personalized_Model.
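If the weights are hosted on the Hugging Face Hub, one way to fetch them is with huggingface_hub. The repo IDs below are placeholders, since this section does not name the actual repositories.

```python
from huggingface_hub import snapshot_download

# Placeholder repo IDs: substitute the actual weight repositories listed in
# the VideoX-Fun README.
snapshot_download(
    repo_id="your-org/diffusion-transformer-weights",
    local_dir="models/Diffusion_Transformer",
)
snapshot_download(
    repo_id="your-org/control-personalized-weights",
    local_dir="models/Personalized_Model",
)
```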