LTX-2 Audio-to-Video Pipeline with Video Conditioning

A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with video conditioning: a reference video guides motion and pose, while an input image fixes appearance and an audio track drives lip-sync.

Features

  • Audio-conditioned video generation (lip-sync)
  • Video conditioning for motion/pose guidance
  • Configurable conditioning strength and start frame
  • Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)

Installation

pip install diffusers transformers torch torchaudio av

Usage

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Optional: Load a LoRA (e.g., face-swap)
# pipe.load_lora_weights("Alissonerdx/BFS-Best-Face-Swap-Video", 
#                        weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors")
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                          # Frame 0 appearance
    video="reference_motion.mp4",         # Video for motion conditioning
    video_conditioning_strength=1.0,      # How strongly to follow motion (0-1)
    video_conditioning_frame_idx=1,       # Start video conditioning at frame 1
    audio="audio.wav",                    # Audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
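
To save the results, the frames can be written with diffusers' export_to_video helper and the waveform with torchaudio. A minimal sketch, assuming the pipeline returns the video as a list of PIL frames and the audio as a waveform tensor; the sample rate below is a placeholder, not a confirmed value:

from diffusers.utils import export_to_video
import torchaudio

# Write the frames to disk; fps should match the frame_rate used above.
export_to_video(video, "output.mp4", fps=24)

# Assumption: `audio` is a waveform tensor. Replace 44100 with the sample
# rate actually used by the pipeline's audio components.
waveform = audio.cpu().float()
if waveform.ndim == 1:
    waveform = waveform.unsqueeze(0)  # torchaudio expects (channels, samples)
torchaudio.save("output.wav", waveform, sample_rate=44100)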

Parameters

Parameter                      Type                 Default  Description
image                          PIL.Image            None     Input image for frame 0 conditioning
video                          str / List / Tensor  None     Reference video for motion conditioning
video_conditioning_strength    float                1.0      Strength of video conditioning (0.0-1.0)
video_conditioning_frame_idx   int                  1        Frame index where video conditioning starts
audio                          str / Tensor         None     Audio input for lip-sync
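
Per the table, video and audio also accept preloaded data instead of file paths. A hedged sketch, assuming a list of PIL frames and a raw waveform tensor are accepted directly:

import torchaudio
from diffusers.utils import load_video

frames = load_video("reference_motion.mp4")   # list of PIL images
waveform, sr = torchaudio.load("audio.wav")   # (channels, samples) tensor
# If sr differs from the model's expected rate, resample first, e.g. with
# torchaudio.functional.resample(waveform, sr, target_sr).

video_out, audio_out = pipe(
    image=image,
    video=frames,
    audio=waveform,
    prompt="a person speaking naturally, smooth animation",
    return_dict=False,
)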

Video Conditioning Frame Index

  • 0: Video conditioning replaces all frames
  • 1 (default): Frame 0 = image, frames 1+ = video motion
  • N: Frames 0 to N-1 = image/noise, frames N+ = video conditioning
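
For example, pushing the start index later keeps the opening frames anchored to the image before the reference motion takes over. A hedged sketch of such a call; note that, depending on the VAE's temporal compression, the index may need to align with latent-frame boundaries:

# Frames 0-7 stay image/noise-driven; video conditioning starts at frame 8.
video_out, audio_out = pipe(
    image=image,
    video="reference_motion.mp4",
    video_conditioning_frame_idx=8,
    video_conditioning_strength=0.8,   # slightly softer motion guidance
    audio="audio.wav",
    prompt="a person speaking naturally, smooth animation",
    num_frames=121,
    return_dict=False,
)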

Distilled Model (8-step)

For faster generation with the distilled model:

pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
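
Note that guidance_scale=1.0 effectively disables classifier-free guidance (diffusers skips the unconditional pass when the scale is not above 1), which is the usual setting for distilled checkpoints and avoids CFG's doubled forward pass. The sigma schedule above supplies one value per inference step.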

License

Apache 2.0
