Thanks for your effort

#1
by exlaw - opened

Thanks for the effort. I should note, though, that mlx_lm is still performing text generation with the standard AR approach. It only appears to work right now because our method remains compatible with AR generation, but the parallel acceleration isn't actually being used yet. We will likely need deeper integration in the future to unlock the real advantages.

Yes, I understand (I thought my first attempt was working until I realized it was still AR). The README is somewhat out of date and still points to mlx-lm because I've been meaning to PR mlx-lm (I also didn't realize the repos were public yet, I was still testing them); I've edited it now. I added support for window decoding here: https://github.com/ZimengXiong/WeDLM-MLX, especially https://github.com/ZimengXiong/WeDLM-MLX/blob/main/wedlm_mlx/wedlm_generate.py. It's not perfect: we're still relying on MLX's JIT compilation, since the growing cache is hard to wrap in `@mx.compile`, and in my manual attempts forcing `@mx.compile` was actually slower because it meant hand-writing attention instead of using MLX's optimized kernels. Very much still a WIP, and I've been trying to follow the paper.
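To illustrate the shape-specialization issue I mean (a minimal standalone sketch with made-up shapes, not the repo code): `mx.compile` caches the compiled graph per input shape, so a cache that grows every decode step forces a retrace/recompile on each call.

```python
# Minimal sketch of why a growing KV cache is awkward under mx.compile:
# compiled functions are specialized to input shapes, so each longer cache
# triggers a fresh trace/compile instead of reusing the cached program.
import mlx.core as mx

@mx.compile
def attend(q, k, v):
    # Plain scaled dot-product attention; shapes are fixed at compile time.
    scale = 1.0 / (q.shape[-1] ** 0.5)
    scores = (q @ k.transpose(0, 1, 3, 2)) * scale
    return mx.softmax(scores, axis=-1) @ v

B, H, D = 1, 8, 64
q = mx.random.normal((B, H, 1, D))
for t in range(1, 4):
    # Cache length grows each step, so `attend` recompiles on every call.
    k = mx.random.normal((B, H, t, D))
    v = mx.random.normal((B, H, t, D))
    mx.eval(attend(q, k, v))
```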

The second run seems much faster because compilation has already happened, which I'm guessing is that JIT compilation at work. The main blocker is still that KV caching is hard to port over. I'm only seeing around a 1.7-2x speedup under optimal conditions (e.g. on GSM8K) and after warming up with the prompt, which makes it not useful at all yet.
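For reference, a tiny standalone timing sketch (toy function, not WeDLM) of that cold-vs-warm behavior: the first call to a compiled function pays the trace/compile cost, later calls reuse the cached program, which is what the prompt warmup is absorbing.

```python
# Rough timing sketch (assumed toy workload): the first call to a compiled
# function includes trace/compile time; the second call reuses the cache.
import time
import mlx.core as mx

@mx.compile
def step(x, w):
    return mx.tanh(x @ w)

x = mx.random.normal((1, 4096))
w = mx.random.normal((4096, 4096))

for label in ("cold (compiles)", "warm (cached)"):
    start = time.perf_counter()
    mx.eval(step(x, w))
    print(f"{label}: {time.perf_counter() - start:.4f}s")
```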
