# LTX-2 Audio-to-Video Pipeline with Video Conditioning
A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with video conditioning support.
## Features
- Audio-conditioned video generation (lip-sync)
- Video conditioning for motion/pose guidance
- Configurable conditioning strength and start frame
- Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)
## Installation

```bash
pip install diffusers transformers torch torchaudio av
```
## Usage

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load the pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Optional: load a LoRA (e.g., face-swap)
# pipe.load_lora_weights(
#     "Alissonerdx/BFS-Best-Face-Swap-Video",
#     weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors",
# )
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                      # frame 0 appearance
    video="reference_motion.mp4",     # video for motion conditioning
    video_conditioning_strength=1.0,  # how strongly to follow the motion (0.0-1.0)
    video_conditioning_frame_idx=1,   # start video conditioning at frame 1
    audio="audio.wav",                # audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
```
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | `PIL.Image` | `None` | Input image for frame 0 conditioning |
| `video` | `str` / `List` / `Tensor` | `None` | Reference video for motion conditioning (see the loading sketch below) |
| `video_conditioning_strength` | `float` | `1.0` | Strength of video conditioning (0.0-1.0) |
| `video_conditioning_frame_idx` | `int` | `1` | Frame index where video conditioning starts |
| `audio` | `str` / `Tensor` | `None` | Audio input for lip-sync |
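The table lists `List` as an accepted type for `video`. Here is a hedged sketch for decoding a clip into frames with PyAV (already in the install command), assuming the pipeline accepts a list of `PIL.Image` frames; that interpretation of `List` is not confirmed by this card.

```python
import av

def load_video_frames(path: str) -> list:
    """Decode a video file into a list of PIL images using PyAV."""
    frames = []
    with av.open(path) as container:
        for frame in container.decode(video=0):
            frames.append(frame.to_image())  # av.VideoFrame -> PIL.Image
    return frames

# Assumption: passing `video` as a List means a list of PIL.Image frames
motion_frames = load_video_frames("reference_motion.mp4")
```

Pass `motion_frames` as the `video` argument in place of the file path.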
## Video Conditioning Frame Index

- `0`: Video conditioning replaces all frames (see the sketch below)
- `1` (default): Frame 0 = image, frames 1+ = video motion
- `N`: Frames 0 to N-1 = image/noise, frames N+ = video conditioning
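For example, setting `video_conditioning_frame_idx=0` makes the reference video drive every frame rather than anchoring frame 0 to the input image. A minimal sketch reusing `pipe` from the usage example above; omitting `image` in this mode is an assumption, since the card does not state whether it is still required.

```python
# Full motion transfer: the reference video conditions all frames.
# Assumption: `image` can be omitted when video_conditioning_frame_idx=0
video, audio = pipe(
    video="reference_motion.mp4",
    video_conditioning_frame_idx=0,
    audio="audio.wav",
    prompt="a person speaking naturally, smooth animation",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
```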
## Distilled Model (8-step)
For faster generation with the distilled model:
```python
pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
```
## License
Apache 2.0