You wanted an assessment, so here it is :D

Okay, so it's not a beta model anymore. I haven't tested the CoT variant yet, so I can't compare; I will if I get to that one.

This one feels like proper Cydonia again. Ngl, your previous 24B Cy/Magi-donia versions were really hit-and-miss this year; this one is a lot more solid. It's still using L7 instruct, so I don't have a group-chat metric for it, but it passed my usual testing set:

Formalized tests:

  • Menu-driven navigation (basically a poor man's ad hoc function calling, meant to gauge logic and guideline-following without enforcing grammar-constrained responses; see the sketch after this list). Passed, unlike previous versions.
  • Chat-session summary (summarizing a chatlog in 2-3 paragraphs, used for long-term memory; a second sketch also follows the list). Passed, like most models. More compliant than most with rules like "use X paragraphs" or "use 3rd person".
  • Analyse a chat session with the user and determine goals for the next ones (basically, find topics to bring up later). Passed, and accurately.
  • Web search and result compilation (read the chatlog, find a topic the model ain't too sure about, write coherent Google queries, and compile the search results into a coherent whole). Passed with flying colors, unlike past models.
  • Integrate the content of system messages into the chat without being too obvious about it. Decent at it, not great, but it didn't try to hijack an unrelated conversation on "faulty" inserts.
  • Decent answers on basic Q&A and instruction-following tasks (but that's Mistral for you in general).
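
For the curious, the menu test is roughly shaped like the sketch below. This is a minimal illustration of the idea, not my actual test set: the action names, user message, and grading rule are all hypothetical placeholders.

```python
# Minimal sketch of the menu-navigation probe. Everything here
# (actions, user message, grading rule) is a hypothetical
# placeholder, not the real test data.
MENU_PROMPT = """You can take exactly one of these actions:
1. SEARCH_WEB - look something up online
2. SAVE_MEMORY - store a fact about the user
3. REPLY - answer directly
Reply with the action name on the first line, then your reasoning.

User: When is the next update of your front-end coming out?"""

VALID_ACTIONS = {"SEARCH_WEB", "SAVE_MEMORY", "REPLY"}

def grade(completion: str) -> bool:
    """Pass if the first line leads with a valid action name.
    No grammar is enforced at generation time; unforced
    compliance is the whole point of the test."""
    lines = completion.strip().splitlines()
    if not lines or not lines[0].split():
        return False
    return lines[0].split()[0].strip(".:*") in VALID_ACTIONS
```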

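The summary test is similar in spirit. Here's a sketch of the structural checks (again, the prompt wording is illustrative, and the 3rd-person check is a deliberately crude placeholder):

```python
# Sketch of the chat-session summary probe. Prompt wording and
# pass checks are illustrative placeholders, not the exact ones I use.
SUMMARY_PROMPT = (
    "Summarize the chat log below in exactly 2 paragraphs, written in "
    "the 3rd person. Keep only facts worth remembering long-term.\n\n"
    "{chatlog}"
)

def grade(completion: str) -> bool:
    """Check the structural rules only ("use X paragraphs", "use 3rd
    person"); factual accuracy is judged by hand."""
    paragraphs = [p for p in completion.split("\n\n") if p.strip()]
    # Deliberately crude first-person detector; good enough as a gate.
    uses_first_person = f" {completion} ".count(" I ") > 0
    return len(paragraphs) == 2 and not uses_first_person
```
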
"Feels" testing:

  • Very decent understanding of relatively complex situations
  • Sufficiently uncensored in all the tested scenarios (it might refuse something in an out-of-the-blue first message pair, but it will comply in long-form chat).
  • Pretty good at impersonating different characters (didn't test as many as I wanted to, been busy)
  • Really good at picking up relevant information from a whole long-ass prompt (1K sys prompt + 3K recall + 20K of message pairs) to build a relevant response. It ain't as good as a CoT model for that part, but it beats (most) normal models of that size in that area.
  • Usual Mistral formatting copy-pasta: you need to switch sampling methods every few messages to keep things from getting too repetitive format-wise. But I don't expect you to fix that typical Mistral behavior.
  • A bit too eager to use lists where they're not necessary. Not a big deal, and not at Qwen levels either, but notable.

I haven't had time to test it in longer-form scenarios yet, but it's definitely an upgrade so far, imho. I can't really comment on "slop" writing; I know it's a big focus for you guys, but it's really the least of my worries. Structural repetition is a lot more annoying to me personally (and you can't eradicate slop anyway, it just gets replaced by another repetition; it serves a linguistic purpose for the model, but that's a whole different topic that finetuners refuse to acknowledge).

Overall, it's a good middle ground between Pinecone (brains) and PaintedFantasy (RP), with a different, more direct writing style.

Anyway, good job! Happy holidays, new year, xmas, and all that!

Cheers.

Edit: For reference, tested at Q6_K quant with 20-24K context length. Various sampling methods: the formalized tests mostly ran on 0.85 temp + 0.05 minP + DRY as a general base, with secondary runs at a deterministic 0 everywhere to gauge the baseline. The "feels" part used a lot of commonplace sampling setups for Mistral models, nothing crazy, and never XTC. Backend was KoboldCpp (text-completion mode), with my own (still private) front-end driving the tests.
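
If anyone wants to replicate that base setup, the request shape is roughly the sketch below. Field names follow KoboldCpp's /api/v1/generate JSON payload as far as I know it (double-check against your KoboldCpp version); the prompt and the DRY values are illustrative, since I didn't give my exact DRY numbers above.

```python
# Rough reproduction of the formalized-test base settings against a
# local KoboldCpp instance. Field names follow the /api/v1/generate
# JSON payload; the prompt and DRY values are illustrative guesses.
import requests

payload = {
    "prompt": "[INST] Your test prompt here [/INST]",
    "max_context_length": 20480,   # tested in the 20-24K range
    "max_length": 512,
    "temperature": 0.85,           # formalized-test base temp
    "min_p": 0.05,
    "dry_multiplier": 0.8,         # DRY enabled; exact strength not stated above
    "dry_base": 1.75,
    "dry_allowed_length": 2,
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```

The deterministic baseline runs are the same payload with 0 everywhere.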
