I wanted to call attention to ArliAI's success in applying my recent modifications to refusal ablation to a MoE model. Nice work, @OwenArli !

ArliAI/GLM-4.5-Air-Derestricted

Ablation on a MoE model is no small thing; I expect that preserving norms/magnitudes during the intervention respects expert routing better than naive refusal ablation does.
(I would have tagged their org earlier, but tagging via "@" seemed to be broken.)
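For context, here is a toy sketch of the general idea on a single weight matrix, written in plain PyTorch. It only illustrates the norm-preserving step: how the refusal direction is extracted and where the intervention is applied are omitted, and this is not the code used for the release.

```python
import torch

def ablate_direction_norm_preserving(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Toy sketch: project a refusal direction r out of each output row of W,
    then rescale rows back to their original norms so downstream magnitudes
    (e.g. what an MoE router sees) stay roughly intact."""
    r = r / r.norm()
    orig_norms = W.norm(dim=1, keepdim=True)
    W_ablated = W - (W @ r).unsqueeze(1) * r.unsqueeze(0)      # remove component along r
    new_norms = W_ablated.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return W_ablated * (orig_norms / new_norms)                # restore per-row norms
```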
Implemented a proof-of-concept sampler in pure PyTorch and transformers.

Max P is a dynamic token filter that applies Winsorization to cap the probabilities of top tokens. Specifically, a base probability in the range [0, 1] caps each individual token's probability, and the sampler then redistributes the excess mass proportionally.
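A minimal sketch of the filter in plain PyTorch (not the exact proof-of-concept code): the parameter name `base_p` is mine, and "redistributes proportionally" is read here as spreading the removed mass over the uncapped tokens in proportion to their current probabilities.

```python
import torch

def max_p_filter(logits: torch.Tensor, base_p: float = 0.1) -> torch.Tensor:
    """Cap token probabilities at base_p and redistribute the excess mass."""
    probs = torch.softmax(logits, dim=-1)
    capped = torch.clamp(probs, max=base_p)                      # Winsorize: cap top tokens
    excess = (probs - capped).sum(dim=-1, keepdim=True)          # mass removed by the cap
    uncapped = (probs < base_p).float()
    uncapped_mass = (probs * uncapped).sum(dim=-1, keepdim=True).clamp_min(1e-12)
    # hand the excess back to uncapped tokens, proportional to their probability
    redistributed = capped + excess * (probs * uncapped) / uncapped_mass
    return redistributed / redistributed.sum(dim=-1, keepdim=True)  # renormalize for safety
```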
AutoXLA - Accelerating Large Models on TPU

AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster.

With quantization and Splash Attention kernels, AutoXLA achieves up to 4× speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads. Whether you're experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless.

Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues.

GitHub Repository: https://github.com/Locutusque/AutoXLA
Instead of an architectural upgrade, each major model drop nowadays perfects one localized innovation. What Kimi brought into the spotlight this time is quantization-aware training (QAT). I wrote an article explaining what it is and why it matters for reasoning models.
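If QAT is new to you, the core trick is easy to show in a few lines. The sketch below is a generic fake-quantization step with a straight-through estimator, not Kimi's actual recipe: the forward pass sees quantized weights while the optimizer keeps updating the full-precision copy.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Generic QAT illustration: quantize weights in the forward pass,
    but let gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # forward uses w_q, backward behaves as identity w.r.t. w
    return w + (w_q - w).detach()
```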
Why did this problem become hot again? Because many of us assumed it had already been solved by long-context models, and that turns out not to be true.
Here we were misled by benchmarks. Most long-context benchmarks are built around QA scenarios, i.e. finding a needle in a haystack. But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and it simply cannot allocate enough attention to that challenge.