What a trip. Just walked through @burtenshaw and @evalstate's tutorial on adding Hugging Face Skills to your Claude Code agent so you can fine-tune LLMs by chatting with AI.
These are the kinds of innovations that will help everyone benefit from the power of artificial intelligence. Well done, gentlemen, and thank you for sharing.
One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. It performs vision (image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN) - the demo is live. 🗣️🔥
The strangerzonehf [HF] Community / Organization Page, which I maintain, has reached the Top 10 Developer Pages ranking at 6th place, with a 3.4% contribution share over the August 2024 to August 2025 cycle. It is also the only South Asian / Indian page on the list. I could not be more proud to be doing things for the community. ❤️🤗
We recently discussed how Tensor Parallelism slices matrices to reduce latency within a single node. But what happens when you need to scale beyond that, where the bandwidth drops?
That is where Pipeline Parallelism (PP) takes over.
Instead of slicing the operation, PP slices the model depth. It turns your GPU cluster into an assembly line: GPU 0 handles layers 1-12, GPU 1 handles 13-24, and so on.
The hardware challenge here isn't the interconnect speed—it is the "Pipeline Bubble." In a naive setup, expensive H100s sit idle for most of the cycle waiting for data to flow through the chain.
My latest guide breaks down the scheduling strategies used to minimize this idle silicon time.
In this deep dive, we cover:
The Hardware Mechanics: Vertical Slicing
Unlike TP, which requires "chatty" All-Reduce operations, PP relies on lightweight Point-to-Point (Send/Recv) communication. This makes it the only viable strategy for crossing node boundaries over Ethernet or InfiniBand.
Fighting the Bubble: 1F1B vs. GPipe
We analyze the scheduling algorithms that keep the GPUs fed:
- GPipe: The "flush and fill" approach. Simple, but memory-intensive.
- 1F1B (One-Forward-One-Backward): The industry standard. By interleaving forward and backward passes, we aggressively free up memory and reduce the bubble size.

The Math of Efficiency
The "Bubble" is a mathematical inevitability. We look at the efficiency formula M / (M + N − 1), where M is the number of microbatches and N the number of pipeline stages, to understand why you need massive global batch sizes to make PP worth the effort.
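To make that formula concrete, here is a quick back-of-the-envelope check (my own illustrative numbers, not taken from the article):

```python
# Illustrative only: pipeline efficiency as a function of microbatch count M
# for a fixed number of stages N, using efficiency = M / (M + N - 1).
def pipeline_efficiency(m_microbatches: int, n_stages: int) -> float:
    return m_microbatches / (m_microbatches + n_stages - 1)

for m in (4, 16, 64, 256):
    eff = pipeline_efficiency(m, n_stages=8)
    print(f"N=8 stages, M={m:>3} microbatches -> efficiency ~ {eff:.1%}")

# N=8, M=4   -> ~36%: more than half the cycle is bubble
# N=8, M=256 -> ~97%: a large global batch amortizes the bubble away
```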
The article includes a conceptual PyTorch implementation of the 1F1B state machine to illustrate exactly how the data is handed off between stages.
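For a taste of the idea before you read the full piece, here is a minimal, self-contained sketch of a 1F1B schedule generator (my own simplification, not the article's code): each stage runs a warm-up of forward passes, then alternates one forward and one backward, then drains the remaining backwards.

```python
# Minimal 1F1B schedule sketch (illustrative). It only emits the order of
# forward/backward steps per stage; real code would wrap each step with
# Send/Recv of activations and gradients to the neighboring stages.
def one_f_one_b_schedule(stage: int, num_stages: int, num_microbatches: int):
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    schedule, fwd, bwd = [], 0, 0

    # Warm-up: forwards only, so downstream stages have work in flight.
    for _ in range(warmup):
        schedule.append(("F", fwd)); fwd += 1
    # Steady state: one forward immediately followed by one backward,
    # which lets activations for the finished microbatch be freed early.
    for _ in range(steady):
        schedule.append(("F", fwd)); fwd += 1
        schedule.append(("B", bwd)); bwd += 1
    # Cool-down: drain the backwards left over from the warm-up forwards.
    while bwd < num_microbatches:
        schedule.append(("B", bwd)); bwd += 1
    return schedule

for s in range(4):
    print(f"stage {s}: {one_f_one_b_schedule(s, num_stages=4, num_microbatches=6)}")
```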
😐 I keep seeing takes on LinkedIn from American business influencers melting down about Silicon Valley startup "dependence" on open-source Chinese models.
🤔 Can anyone describe a credible scenario where these models could be leveraged by the Chinese government to endanger American security interests, or am I right to believe that this is just Red Scare nonsense?
Introducing the Super-OCRs Demo, a comparison of state-of-the-art multimodal OCR VLMs, including HunyuanOCR, DeepSeekOCR, Dots, and Nanonets in one space for performing OCR, rendering LaTeX and Markdown, and visual grounding (layout). Find the related Spaces and models below.🤗🔥
When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes, you need to slice the matrices themselves.
My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP). We look at how to shard individual operations across devices to make a cluster function as one massive accelerator.
This isn't high-level theory—it is a look at the bare metal implementation.
Here is what is covered in the deep dive:
The Strategies: Column vs. Row Parallelism
We analyze how to split weight matrices (W) and inputs (X).
- Column-Linear: Splits weights by columns. Requires an All-Gather to reconstruct the output.
- Row-Linear: Splits weights by rows. Requires an All-Reduce to sum partial results.

The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a Column-Parallel layer and a Row-Parallel layer, we can skip synchronization entirely during the activation phase. This cuts communication events by 50% per block.
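As a sanity check on that sandwich trick, here is a small single-process sketch (my own, not from the guide) showing why no sync is needed between the two layers: GeLU is applied element-wise to each column shard independently, and the row-parallel second layer only needs one summation at the very end.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 8)        # activations: (batch, hidden)
W1 = torch.randn(8, 16)      # first linear, split by COLUMNS
W2 = torch.randn(16, 8)      # second linear, split by ROWS

# Reference: the unsharded MLP block
ref = F.gelu(X @ W1) @ W2

# "2-GPU" simulation: shard W1 column-wise and W2 row-wise
W1_a, W1_b = W1.chunk(2, dim=1)   # each rank holds half the columns of W1
W2_a, W2_b = W2.chunk(2, dim=0)   # each rank holds half the rows of W2

# Each rank computes its shard end-to-end with NO communication:
# GeLU acts element-wise, so it is valid on the column shard alone.
partial_a = F.gelu(X @ W1_a) @ W2_a
partial_b = F.gelu(X @ W1_b) @ W2_b

# The only sync point is the final sum (an All-Reduce in the real setup).
out = partial_a + partial_b
print(torch.allclose(out, ref, atol=1e-5))  # True
```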
The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path. The CUDA cores effectively stall while waiting for the ring-reduce to finish.
- Intra-Node: Works well because NVLink provides enough bandwidth to hide this latency.
- Inter-Node: Fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs required by TP.

The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.
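Here is a rough sketch of where that all_reduce lands in a multi-process run (not the article's code): a hand-rolled row-parallel linear whose forward pass blocks on the collective. It assumes the process group is initialized by a launcher such as torchrun.

```python
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard, w_shard):
    # x_shard: (batch, in_features // world_size) -- this rank's slice of the
    # activations; w_shard holds the matching rows of the full weight matrix.
    partial = x_shard @ w_shard          # local matmul, no communication
    # Critical path: every rank stalls here until the ring all-reduce has
    # summed the partial results across the tensor-parallel group.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=2 tp_sketch.py
    dist.init_process_group("gloo")      # use "nccl" on GPUs
    torch.manual_seed(dist.get_rank())   # toy per-rank shard contents
    x = torch.randn(2, 4)                # this rank's activation slice
    w = torch.randn(4, 8)                # this rank's rows of the weight
    y = row_parallel_linear(x, w)
    print(f"rank {dist.get_rank()}: output {tuple(y.shape)}")
    dist.destroy_process_group()
```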
Introducing the advanced sketch-board editor "Nano-Banana-Pro-Sketch-Board" powered by the Gemini 2.5 Flash Image and Gemini 3 Pro Preview Image models through the Gemini API. This version includes more features than the Nano-Banana-AIO app for drawing and prompt-based concept transformation of freestyle sketches. 🔥🍌
Note: The Nano-Banana-Pro-Sketch-Board demo requires a Gemini API key for the editing process. Your API key is cleared when the app is reloaded or closed; it stays private and is never exposed anywhere. Also, the Gemini 3 Pro Preview Image model may require a paid API key from a Google Cloud project with billing enabled.
To know more about it, visit the app info section or the respective Model Garden page!
Try the demo of NVIDIA Nemotron Parse v1.1, NVIDIA's latest VLM for understanding document semantics and extracting text and table elements with spatial grounding. It performs comprehensive text understanding and document structure analysis, and can provide bounding boxes with coordinates.
Try the all-new trending Qwen-Image-Edit-2509 (Multi-Image-Edits) specialized adapter demos, including Cloth-Design-Fuse, Texture Edit, Guided-Objects-Patching, and more — all in a single Hugging Face Space. The demo link is provided below. 🤗🔥
Made a demo Space for multimodal understanding with Qwen3-VL, covering tasks including point annotation, detection, captioning, guided text inference, and more. Find the demo link below. 🤗↗️
Running large language models efficiently takes more than raw GPU power. My latest guide breaks down the essential math to determine whether your LLM workload is compute-bound or memory-bound.
We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.
In this guide, you will learn how to:
- Calculate your GPU's operational intensity (Ops:Byte ratio)
- Determine your model's arithmetic intensity
- Identify whether your workload is memory-bound or compute-bound
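For a feel of the arithmetic, here is a toy version of the calculation. The hardware numbers below are rough placeholders I picked for illustration, not official RTX PRO 6000 Blackwell specs; the guide walks through the real figures.

```python
# Back-of-the-envelope roofline check for single-stream decoding.
# NOTE: peak_flops and mem_bw are illustrative placeholders, not real specs.
peak_flops = 500e12      # assumed FP16 throughput, FLOP/s
mem_bw     = 1.5e12      # assumed memory bandwidth, bytes/s
ops_to_byte = peak_flops / mem_bw
print(f"GPU ops:byte ratio ~ {ops_to_byte:.0f} FLOP per byte")

# Decoding one token at batch size 1: every weight is read once (~2 bytes in
# FP16) and used in ~2 FLOPs (multiply + add), regardless of model size.
params = 32e9                            # e.g. a 32B-parameter model
flops_per_token = 2 * params
bytes_per_token = 2 * params             # FP16 weights
arith_intensity = flops_per_token / bytes_per_token
print(f"decode arithmetic intensity ~ {arith_intensity:.0f} FLOP per byte")

# intensity (~1) << ops:byte ratio (hundreds) -> decoding is memory-bound;
# you would need a much larger batch to become compute-bound.
```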
Made a small write-up and experimental fine-tuning guide for MetaCLIP 2 for image classification on downstream tasks. The blog, titled Fine Tuning MetaCLIP 2 for Image Classification on Downstream Tasks, demonstrates step-by-step fine-tuning on CIFAR-10 and is flexible enough to adapt to other datasets. For more details, check out the linked blog below. 🤗↗️
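If you just want the shape of the recipe before reading the blog, a minimal sketch looks roughly like this. The checkpoint name is a placeholder (swap in the actual MetaCLIP 2 checkpoint named in the blog), and it assumes the checkpoint loads through the standard CLIP vision classes in transformers; the blog has the full, tested version.

```python
import torch
from torch import nn
from datasets import load_dataset
from transformers import AutoImageProcessor, CLIPVisionModel

ckpt = "facebook/metaclip-2-xyz"  # placeholder -- use the checkpoint from the blog
processor = AutoImageProcessor.from_pretrained(ckpt)
backbone = CLIPVisionModel.from_pretrained(ckpt)
head = nn.Linear(backbone.config.hidden_size, 10)   # 10 CIFAR-10 classes

ds = load_dataset("cifar10", split="train[:2000]")  # small slice for a quick run
optim = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-5)

backbone.train()
for i in range(0, len(ds), 16):
    batch = ds[i : i + 16]
    inputs = processor(images=batch["img"], return_tensors="pt")
    labels = torch.tensor(batch["label"])
    # Pooled image embedding -> linear classifier over CIFAR-10 labels
    pooled = backbone(**inputs).pooler_output
    loss = nn.functional.cross_entropy(head(pooled), labels)
    loss.backward()
    optim.step()
    optim.zero_grad()
```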
Try the all-new trending Qwen-Image-Edit specialized adapter demos, including Photo-to-Anime, Light Restoration, Multi-Angle Edits, Relighting, and more — all in a single Hugging Face Space. Below is the demo link. 🤗🌠
Struggling with NVIDIA drivers on Ubuntu 24.04? Can't use your GPUs with CUDA installed, or only half of them work? Black screen after startup or nvidia-smi fails?
The nokaslr boot option might be the cause—and the solution. Find out why disabling KASLR can fix these GPU issues until a permanent driver update is available.
🚀 AutoXLA - Accelerating Large Models on TPU

AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster.

With quantization and Splash Attention kernels, AutoXLA achieves up to 4× speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads. Whether you're experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless.

⚠️ Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues.

🔗 GitHub Repository: https://github.com/Locutusque/AutoXLA
A few months ago, I built a quick POC in Hugging Face that used a fine-tuned variant of OpenAI's OSS-20B model that I trained to convert the text from pre-reform Russian-language documents into modern Russian orthography.
⚡️ This morning, I launched novoyaz.io.
This is a production app, with a frontend I built in about two hours with Lovable, that uses that same fine-tuned model for transliteration but now has a bunch of extra features that make using it even easier (like taking and uploading pictures with your on-device camera, for example 😅).
👉 If you're a researcher, or know a researcher, for whom this app will improve their day-to-day workflows, please get in touch with me.
Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM right/left views and their reverses, as well as general kontext-specified object removal. Below is the list of demos and adapters. 🔥🤗