My notification

nithin12342's Collections

updated about 16 hours ago

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Paper • 2601.15369 • Published Jan 21 • 21
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Paper • 2601.15892 • Published Jan 22 • 53
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Paper • 2601.16208 • Published Jan 22 • 55
NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

Paper • 2601.11004 • Published Jan 16 • 30
Behavior Knowledge Merge in Reinforced Agentic Models

Paper • 2601.13572 • Published Jan 20 • 27
microsoft/VibeVoice-ASR

Automatic Speech Recognition • 9B • Updated Jan 27 • 605k • 919
zai-org/GLM-4.7-Flash

Text Generation • 31B • Updated Jan 29 • 1.65M • 1.62k
LongCat-Flash-Thinking-2601 Technical Report

Paper • 2601.16725 • Published Jan 23 • 178
iFSQ: Improving FSQ for Image Generation with 1 Line of Code

Paper • 2601.17124 • Published Jan 23 • 33
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

Paper • 2601.17058 • Published Jan 22 • 190
Less is More: Optimizing Function Calling for LLM Execution on Edge Devices

Paper • 2411.15399 • Published Nov 23, 2024 • 1
nvidia/personaplex-7b-v1

Audio-to-Audio • Updated 20 days ago • 262k • 2.31k
Qwen/Qwen3-ASR-0.6B

Automatic Speech Recognition • 0.9B • Updated Jan 30 • 432k • 248
Qwen3-ASR Technical Report

Paper • 2601.21337 • Published Jan 29 • 36
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Paper • 2601.19895 • Published Jan 27 • 25
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Paper • 2601.22153 • Published Jan 29 • 74
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Paper • 2601.20354 • Published Jan 28 • 112
Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Paper • 2601.21406 • Published Jan 29 • 5
Revisiting Parameter Server in LLM Post-Training

Paper • 2601.19362 • Published Jan 27 • 8
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

Paper • 2601.21420 • Published Jan 29 • 42
SERA: Soft-Verified Efficient Repository Agents

Paper • 2601.20789 • Published Jan 28 • 13
moonshotai/Kimi-K2.5

Image-Text-to-Text • 1.1T • Updated 23 days ago • 3.57M • 2.32k
DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Paper • 2601.22904 • Published Jan 30 • 15
Phr00t/LTX2-Rapid-Merges

Image-Text-to-Video • Updated Feb 12 • 337
ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought

Paper • 2601.23184 • Published Jan 30 • 36
FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Paper • 2602.02092 • Published Feb 2 • 18
PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Paper • 2602.02493 • Published Feb 2 • 46
TTCS: Test-Time Curriculum Synthesis for Self-Evolving

Paper • 2601.22628 • Published Jan 30 • 35
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Paper • 2602.02488 • Published Feb 2 • 35
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Paper • 2602.02185 • Published Feb 2 • 117
Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization

Paper • 2601.21358 • Published Jan 29 • 7
Balancing Understanding and Generation in Discrete Diffusion Models

Paper • 2602.01362 • Published Feb 1 • 17
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Paper • 2602.03796 • Published Feb 3 • 64
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Paper • 2602.01785 • Published Feb 2 • 96
LIVE: Long-horizon Interactive Video World Modeling

Paper • 2602.03747 • Published Feb 3 • 12
Qwen/Qwen3-Coder-Next

Text Generation • 80B • Updated Feb 3 • 1.21M • 1.15k
Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

Paper • 2602.03510 • Published Feb 3 • 27
RISE-Video: Can Video Generators Decode Implicit World Rules?

Paper • 2602.05986 • Published Feb 5 • 26
FASA: Frequency-aware Sparse Attention

Paper • 2602.03152 • Published Feb 3 • 152
DFlash: Block Diffusion for Flash Speculative Decoding

Paper • 2602.06036 • Published Feb 5 • 44
GEBench: Benchmarking Image Generation Models as GUI Environments

Paper • 2602.09007 • Published Feb 9 • 39
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Paper • 2602.08236 • Published Feb 9 • 9
AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

Paper • 2602.06540 • Published Feb 6 • 21
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Paper • 2602.04649 • Published Feb 4 • 12
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Paper • 2602.05400 • Published Feb 5 • 349
AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Paper • 2602.05027 • Published Feb 4 • 61
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Paper • 2602.06291 • Published Feb 6 • 23
Towards Autonomous Mathematics Research

Paper • 2602.10177 • Published Feb 10 • 36
Free(): Learning to Forget in Malloc-Only Reasoning Models

Paper • 2602.08030 • Published Feb 8 • 6
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Paper • 2602.07775 • Published Feb 8 • 8
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Paper • 2602.08711 • Published Feb 9 • 28
Qute: Towards Quantum-Native Database

Paper • 2602.14699 • Published Feb 16 • 13
Qwen/Qwen3.5-397B-A17B

Image-Text-to-Text • 403B • Updated 7 days ago • 1.8M • 1.37k
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Paper • 2602.16742 • Published Feb 18 • 12
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Paper • 2602.20160 • Published 27 days ago • 10
From Perception to Action: An Interactive Benchmark for Vision Reasoning

Paper • 2602.21015 • Published 26 days ago • 23
SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Paper • 2602.21818 • Published 25 days ago • 56
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Paper • 2602.24286 • Published 23 days ago • 97
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Paper • 2603.00141 • Published 26 days ago • 138
RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published 20 days ago • 60
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Paper • 2602.23866 • Published 23 days ago • 88
Qwen3-Coder-Next Technical Report

Paper • 2603.00729 • Published 22 days ago • 58
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Paper • 2603.04291 • Published 18 days ago • 13
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Paper • 2603.04791 • Published 17 days ago • 16
InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Paper • 2603.03646 • Published 18 days ago • 8
Utonia: Toward One Encoder for All Point Clouds

Paper • 2603.03283 • Published 19 days ago • 183
DreamWorld: Unified World Modeling in Video Generation

Paper • 2603.00466 • Published 22 days ago • 16
On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published 17 days ago • 6
fal/virtual-tryoff-lora

Image-to-Image • Updated 16 days ago • 1.01k • 29
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Paper • 2603.06569 • Published 16 days ago • 114
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

Paper • 2603.05888 • Published 16 days ago • 2
Scale Space Diffusion

Paper • 2603.08709 • Published 13 days ago • 15
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Paper • 2603.09877 • Published 12 days ago • 47
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Paper • 2603.09652 • Published 12 days ago • 15
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Paper • 2603.12255 • Published 10 days ago • 90
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Paper • 2603.12245 • Published 10 days ago • 18
How Far Can Unsupervised RLVR Scale LLM Training?

Paper • 2603.08660 • Published 13 days ago • 56
WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Paper • 2603.11593 • Published 10 days ago • 25
Video-Based Reward Modeling for Computer-Use Agents

Paper • 2603.10178 • Published 12 days ago • 42
CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Paper • 2603.10757 • Published 11 days ago • 13
Mixture-of-Depths Attention

Paper • 2603.15619 • Published 6 days ago • 73
Multimodal OCR: Parse Anything from Documents

Paper • 2603.13032 • Published 9 days ago • 31
Towards a Neural Debugger for Python

Paper • 2603.09951 • Published 12 days ago • 5
Demystifing Video Reasoning

Paper • 2603.16870 • Published 5 days ago • 346
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Paper • 2603.12180 • Published 10 days ago • 63
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Paper • 2603.13398 • Published 11 days ago • 141
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Paper • 2603.15132 • Published 6 days ago • 33
Attention Residuals

Paper • 2603.15031 • Published 6 days ago • 134
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Paper • 2603.19232 • Published 3 days ago • 29
FASTER: Rethinking Real-Time Flow VLAs

Paper • 2603.19199 • Published 3 days ago • 47
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Paper • 2603.16859 • Published 5 days ago • 240
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Paper • 2603.14465 • Published 7 days ago • 22
LoST: Level of Semantics Tokenization for 3D Shapes

Paper • 2603.17995 • Published 4 days ago • 22
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Paper • 2603.16864 • Published 5 days ago • 15
Nemotron 3 Nano WebGPU

Running • 32 • A compact reasoning-capable model running in your browser.
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Paper • 2603.18524 • Published 3 days ago • 48
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Paper • 2603.15653 • Published 15 days ago • 9
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Paper • 2603.18004 • Published 4 days ago • 12
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Paper • 2603.19228 • Published 3 days ago • 60
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Paper • 2603.18002 • Published 4 days ago • 6
Matryoshka Gaussian Splatting

Paper • 2603.19234 • Published 3 days ago • 6
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Paper • 2603.16139 • Published 5 days ago • 30
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Paper • 2603.19227 • Published 3 days ago • 36
