- DataDecide: How to Predict Best Pretraining Data with Small Experiments
  Paper • 2504.11393 • Published • 18
- Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
  Paper • 2504.04152 • Published • 1
- BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
  Paper • 2508.10975 • Published • 60
- Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
  Paper • 2412.02595 • Published • 5
Collections including paper arxiv:2508.10975

- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
  Paper • 2410.09732 • Published • 54
- How to Synthesize Text Data without Model Collapse?
  Paper • 2412.14689 • Published • 52
- WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
  Paper • 2501.18511 • Published • 20
- Synthetic Data RL: Task Definition Is All You Need
  Paper • 2505.17063 • Published • 10

- Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
  Paper • 2509.05739 • Published • 2
- Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
  Paper • 2509.03059 • Published • 24
- Universal Deep Research: Bring Your Own Model and Strategy
  Paper • 2509.00244 • Published • 13
- <think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
  Paper • 2509.08358 • Published • 13

- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
  Paper • 2506.19290 • Published • 52
- Data Efficacy for Language Model Training
  Paper • 2506.21545 • Published • 11
- Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
  Paper • 2507.04009 • Published • 51
- RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs
  Paper • 2507.03253 • Published • 18

- FLAME: Factuality-Aware Alignment for Large Language Models
  Paper • 2405.01525 • Published • 28
- DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
  Paper • 2405.14333 • Published • 43
- Transformers Can Do Arithmetic with the Right Embeddings
  Paper • 2405.17399 • Published • 54
- EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
  Paper • 2405.18991 • Published • 12