LTX-2 Audio-to-Video Pipeline with Video Conditioning

A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with video conditioning: a reference video guides motion and pose, while an input image fixes appearance and an audio track drives lip-sync.

Features

  • Audio-conditioned video generation (lip-sync)
  • Video conditioning for motion/pose guidance
  • Configurable conditioning strength and start frame
  • Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)

Installation

pip install diffusers transformers torch torchaudio av

Usage

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Optional: Load a LoRA (e.g., face-swap)
# pipe.load_lora_weights("Alissonerdx/BFS-Best-Face-Swap-Video", 
#                        weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors")
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                          # Frame 0 appearance
    video="reference_motion.mp4",         # Video for motion conditioning
    video_conditioning_strength=1.0,      # How strongly to follow motion (0-1)
    video_conditioning_frame_idx=1,       # Start video conditioning at frame 1
    audio="audio.wav",                    # Audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
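
To save the results, the frames can be written with diffusers' export_to_video helper and the waveform with torchaudio. A minimal sketch, assuming the pipeline returns the video as a list of PIL frames and the audio as a waveform tensor; the sample rate below is a placeholder, not a confirmed value:

from diffusers.utils import export_to_video
import torchaudio

# Write the frames to disk; fps should match the frame_rate used above.
export_to_video(video, "output.mp4", fps=24)

# Assumption: `audio` is a waveform tensor. Replace 44100 with the sample
# rate actually used by the pipeline's audio components.
waveform = audio.cpu().float()
if waveform.ndim == 1:
    waveform = waveform.unsqueeze(0)  # torchaudio expects (channels, samples)
torchaudio.save("output.wav", waveform, sample_rate=44100)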

Parameters

Parameter                      Type                 Default  Description
image                          PIL.Image            None     Input image for frame 0 conditioning
video                          str / List / Tensor  None     Reference video for motion conditioning
video_conditioning_strength    float                1.0      Strength of video conditioning (0.0-1.0)
video_conditioning_frame_idx   int                  1        Frame index where video conditioning starts
audio                          str / Tensor         None     Audio input for lip-sync
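
Per the table, video and audio also accept preloaded data instead of file paths. A hedged sketch, assuming a list of PIL frames and a raw waveform tensor are accepted directly:

import torchaudio
from diffusers.utils import load_video

frames = load_video("reference_motion.mp4")   # list of PIL images
waveform, sr = torchaudio.load("audio.wav")   # (channels, samples) tensor
# If sr differs from the model's expected rate, resample first, e.g. with
# torchaudio.functional.resample(waveform, sr, target_sr).

video_out, audio_out = pipe(
    image=image,
    video=frames,
    audio=waveform,
    prompt="a person speaking naturally, smooth animation",
    return_dict=False,
)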

Video Conditioning Frame Index

  • 0: Video conditioning replaces all frames
  • 1 (default): Frame 0 = image, frames 1+ = video motion
  • N: Frames 0 to N-1 = image/noise, frames N+ = video conditioning
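
For example, pushing the start index later keeps the opening frames anchored to the image before the reference motion takes over. A hedged sketch of such a call; note that, depending on the VAE's temporal compression, the index may need to align with latent-frame boundaries:

# Frames 0-7 stay image/noise-driven; video conditioning starts at frame 8.
video_out, audio_out = pipe(
    image=image,
    video="reference_motion.mp4",
    video_conditioning_frame_idx=8,
    video_conditioning_strength=0.8,   # slightly softer motion guidance
    audio="audio.wav",
    prompt="a person speaking naturally, smooth animation",
    num_frames=121,
    return_dict=False,
)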

Distilled Model (8-step)

For faster generation with the distilled model:

pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
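
Note that guidance_scale=1.0 effectively disables classifier-free guidance (diffusers skips the unconditional pass when the scale is not above 1), which is the usual setting for distilled checkpoints and avoids CFG's doubled forward pass. The sigma schedule above supplies one value per inference step.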

License

Apache 2.0
