Shisa V1 7B V2.1

This release is a bit of a meme model to celebrate the two-year anniversary of Shisa 7B V1, but I was genuinely curious to see how much the original Mistral 7B v0.1-based model could be improved with our latest V2.1 training. How much of our improvement comes from better post-training vs better base models?

Beyond that curiosity, there is also some practical utility, as our shisa-v1 tokenizer remains one of the most efficient tokenizers for Japanese text. (We've since abandoned tokenizer extension, as the amount of continued pre-training required to recover performance and, crucially, to resolve token leakage is not a good trade-off for us.)
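
As a rough illustration of that efficiency claim, the snippet below is a minimal sketch using Hugging Face transformers: it simply counts how many tokens each tokenizer needs for the same Japanese text. The sample sentence is an arbitrary choice, and the comparison against the base Mistral 7B v0.1 tokenizer assumes the usual Hugging Face repo ids; it is not part of any benchmark.

```python
from transformers import AutoTokenizer

# Illustrative only: compare Japanese token counts between the extended
# shisa-v1 tokenizer (shipped with this model) and the original Mistral 7B
# v0.1 tokenizer. Fewer tokens means more effective context and cheaper inference.
text = "火星にはかつて液体の水が存在していた可能性があります。"

for name in ("shisa-ai/shisa-v1-7b-v2.1", "mistralai/Mistral-7B-v0.1"):
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    print(f"{name}: {n_tokens} tokens")
```

The exact counts will vary with the text, but the gap on typical Japanese prose is where the practical win comes from.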

This model was mid-trained on roughly 1B additional mixed EN/JA tokens, using a machine-translated version of the Tulu 3 SFT mixture and a number of other openly licensed corpora (mainly as an attempt to further improve the tokenizer extension), and then our new Shisa V2.1 post-training was applied on top.

The end result is a significant improvement over the original model and shows that our improved data and training techniques can make older models better, but it still can't match the performance of applying the same data and techniques on top of modern base models.

Below are the MultiEval V2.1 scores (a custom test battery we use to judge Shisa V2.1 model quality). While they can't be directly compared to other published results (especially the GPT-5.1-rated LLM judge results), they should be representative of the general JA/EN performance you can expect:

| Eval | Shisa V1 7B V2.1 | Shisa 7B V1 | Shisa Gamma 7B V1 | Shisa V2 8B | Shisa V2.1 8B |
|---|---|---|---|---|---|
| JA AVG | 41.5 | 26.2 | 38.0 | 58.7 | 67.8 |
| EN AVG | 28.5 | 29.4 | 21.1 | 55.1 | 57.8 |
| Shaberi v2.1 | 5.223 | 3.743 | 5.511 | 6.427 | 7.353 |
| ELYZA 100 | 5.590 | 4.050 | 5.723 | 7.300 | 7.660 |
| JA MT-Bench | 4.892 | 3.326 | 4.989 | 5.975 | 7.783 |
| Rakuda | 6.075 | 3.123 | 5.947 | 6.463 | 7.150 |
| Tengu | 4.335 | 4.473 | 5.386 | 5.970 | 6.817 |
| M-IFEval (JA) | 0.343 | 0.151 | 0.256 | 0.477 | 0.471 |
| shisa-jp-ifeval | 0.133 | 0.093 | 0.107 | 0.293 | 0.347 |
| shisa-rp-bench | 2.159 | 1.547 | 3.225 | 4.739 | 4.792 |
| shisa-tl-bench | 4.825 | 0.664 | 0.575 | 7.617 | 8.917 |
| kiseki-eval | 2.359 | 2.439 | 2.447 | 3.348 | 3.580 |
| chotto-eval | 0.200 | 0.018 | 0.036 | 0.145 | 0.455 |
| MixEval Easy | 0.521 | 0.639 | 0.503 | 0.821 | 0.802 |
| MixEval Hard | 0.340 | 0.361 | 0.217 | 0.555 | 0.607 |
| LiveBench | 10.8 | 14.2 | 11.8 | 32.1 | 45.7 |
| GPQA Diamond | 0.086 | 0.121 | 0.086 | 0.379 | 0.328 |
| IFEval | 0.335 | 0.340 | 0.274 | 0.828 | 0.791 |
| IFBench | 0.167 | 0.170 | 0.143 | 0.330 | 0.259 |
| HumanEval+ | 0.439 | 0.287 | 0.134 | 0.622 | 0.805 |

MultiEval V2.1

Our primary MultiEval V2.1 suite is a mixed battery of 10 Japanese and 7 English/general evaluations designed to give a broad picture of overall model performance across a variety of common general language tasks.

Japanese

  • Shaberi v2.1 - Our public fork of LightBlue's Shaberi suite, extended with reasoning-model support, updated judges, output viewing, and errata fixes; despite some known issues, it remains our primary functional benchmark for quickly evaluating general Japanese LLM performance. All Shaberi scores in V2.1 are judged by GPT-5.1 (gpt-5.1-2025-11-13).
    • ELYZA Tasks 100 - A set of 100 complex Japanese instructions and tasks graded on a 5‑point rubric, targeting realistic instruction-following and generation quality.
    • Japanese MT-Bench - A high quality Japanese adaptation of MT-Bench with eight categories of conversational and writing outputs, evaluated by an LLM judge on a 1–10 scale to capture stylistic and qualitative differences.
    • Rakuda - An adaptation of Rakuda, a set of open-ended Japanese questions covering Japan-focused factual knowledge, judged on the same LLM-rated scale as the other Shaberi components.
    • Tengu - A heterogeneous grab-bag of Japanese tasks (reasoning, QA, and writing) that is a useful secondary stress test for general capability.
  • M-IFEval (JA slice) - Our public fork of LightBlue's multilingual IFEval, which fixes a number of errata; in the main Japanese composite we currently use only the Japanese subset, exposed as M-IFja, with rule-based instruction-compliance scoring. We report the loose score.
  • shisa-jp-ifeval - Shisa.AI's own Japanese-specific IFEval variant, carefully replacing English-centric constraints (spelling, capitalization, etc.) with verifiable Japanese constraints (mora counting, script choice, honorifics, etc.) and rule-based scoring; a minimal example of this kind of check is sketched after this list.
  • shisa-jp-rp-bench - A Japanese roleplay/persona benchmark based on Aratako’s Japanese-RP-Bench, using multi-turn conversations and pairwise LLM judging (Gemini 2.0 Flash) with a Bradley-Terry model to produce stable RP rankings.
  • shisa-jp-tl-bench - English↔Japanese translation shootout: the target model's translations are compared pairwise against a frozen base set and judged by a dedicated LLM judge (Gemini 2.5 Flash), then aggregated with a Bradley-Terry logistic model into a 0–10-style score (a toy Bradley-Terry fit is also sketched after this list).
  • kiseki-eval - A private Shisa.AI translation eval focused on subtle aspects of Japanese such as tone, gendering, and terms of endearment; translations are judged by Gemini 2.5 Pro using an Ultrafeedback‑style 1–5 rubric.
  • chotto-eval - Another Shisa.AI internal eval: a cross-lingual, multi-turn interaction set mimicking real conversational flows. We do pairwise LLM-vs-LLM comparison against a fixed, strong internal chotto.chat baseline model (which has roughly a 50% win/loss rate against Claude Opus 4.1 and Gemini 2.5 Flash), using Gemini 2.5 Pro as the evaluator.
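
Mora counting is one of the verifiable Japanese constraints mentioned above for shisa-jp-ifeval. The snippet below is a minimal, hypothetical rule-based check in that spirit; it is not taken from the benchmark's actual implementation, and the helper names are our own.

```python
import re

SMALL_YOON = set("ゃゅょャュョ")  # glides that attach to the preceding kana

def count_morae(kana: str) -> int:
    """Count morae in a hiragana/katakana string.

    Each kana (including っ, ん, and the long-vowel mark ー) counts as one
    mora, except the small ya/yu/yo glides, which merge with the kana before
    them. Non-kana characters are ignored.
    """
    kana = re.sub(r"[^ぁ-んァ-ヶー]", "", kana)
    return sum(1 for ch in kana if ch not in SMALL_YOON)

def check_mora_count(response: str, expected: int) -> bool:
    """Hypothetical verifier: does the response contain exactly `expected` morae?"""
    return count_morae(response) == expected

# しゃしん -> しゃ/し/ん = 3 morae; とうきょう -> と/う/きょ/う = 4 morae
assert count_morae("しゃしん") == 3
assert count_morae("とうきょう") == 4
```

Constraints like this can be verified deterministically, which is what lets shisa-jp-ifeval use rule-based scoring rather than an LLM judge.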

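For the two pairwise-judged benchmarks above (shisa-jp-rp-bench and shisa-jp-tl-bench), the sketch below shows one common way to turn pairwise win/loss judgments into Bradley-Terry strengths, using the classic MM (Zermelo) update. The model names and counts are made up, and this is not the benchmarks' actual aggregation code; per the descriptions above, the real pipeline further maps the fitted strengths onto a 0–10-style scale.

```python
from collections import defaultdict

# Hypothetical pairwise judge results: (model_a, model_b, wins_of_a, wins_of_b).
results = [
    ("model-x", "baseline", 7, 3),
    ("model-y", "baseline", 4, 6),
    ("model-x", "model-y", 6, 4),
]

def bradley_terry(results, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update."""
    models = sorted({m for a, b, *_ in results for m in (a, b)})
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # comparisons per unordered pair
    for a, b, wa, wb in results:
        wins[a] += wa
        wins[b] += wb
        games[frozenset((a, b))] += wa + wb

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = sum(
                games[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models
                if o != m and games[frozenset((m, o))] > 0
            )
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(new.values())
        # Normalize so strengths average to 1.0 (the scale is otherwise arbitrary).
        strength = {m: v * len(models) / total for m, v in new.items()}
    return strength

print(bradley_terry(results))
```
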
English/General

  • MixEval Easy - A fast, mixed English reasoning benchmark (the mixeval_easy task in Lighteval/Inspect) combining free-form and multiple-choice questions, with a 0.96 correlation to 2024 Chatbot Arena rankings; scored both by the task's exact metrics and by LLM judges (Flow-Judge flowaicom/Flow-Judge-v0.1 and a GPT judge, by default gpt-4.1-mini-2025-04-14) via the HF Lighteval runner.
  • MixEval Hard - A harder subset of MixEval (mixeval_hard) designed to better separate strong models, run through the same Lighteval/Inspect pipeline and Flow‑Judge + GPT‑judge scoring as MixEval Easy.
  • LiveBench - Our public fork of LiveBench, a contamination‑aware, continually updated English benchmark covering coding, math, reasoning, language, data analysis, and instruction following, judged with LiveBench's own rule-based and ground-truth scoring (no LLM judge). We use the latest public dataset LiveBench-2024-11-25. Our fork supports concurrent runs, GPT-5.1 reasoning semantics, and other fixes.
  • GPQA Diamond - PhD‑level multiple‑choice science QA from the Diamond split of GPQA (Lighteval gpqa:diamond task using Idavidrein/gpqa); we score with Inspect’s multiple‑choice choice metric and an additional robustness pass that recovers bare letter answers, so this remains a pure reference‑based metric (no LLM judge).
  • Google IFEval (EN) - An English-language instruction-following benchmark from Google Research (the ifeval task in Lighteval/Inspect over google/IFEval), scored with the original rule-based check_following functions; we report the loose prompt-level accuracy (illustrated in the sketch after this list), with no LLM judge involved.
  • IFBench - Our public fork of AI2's IFBench, an IFEval-inspired (but less saturated) instruction-following suite; we report loose prompt-level accuracy using IFBench's own verification functions (no LLM judge). Our fork fixes some evaluation bugs and adds a response-generation script for parallel execution against OpenAI-compatible endpoints.
  • HumanEval+ - Using our public fork of EvalPlus, which adds direct OpenAI/Gemini support as well as parallel generation support, we run HumanEval+ and report the plus-pass@1 score (reference/test-based judgement).
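
Several of the instruction-following evals above (M-IFEval, IFEval, IFBench) report a "loose" prompt-level accuracy. Roughly, loose scoring re-runs the rule-based verifiers on a few lightly normalized variants of the response and counts a prompt as passed if any variant satisfies all of its constraints. The snippet below is a simplified illustration with made-up constraints, not the benchmarks' actual check_following code (whose exact normalizations differ in detail).

```python
from typing import Callable, Iterable

Verifier = Callable[[str], bool]

def loose_variants(response: str) -> list[str]:
    """Lightly normalized response variants tried by loose scoring (simplified)."""
    lines = response.splitlines()
    variants = [response, response.replace("*", "")]  # strip markdown emphasis
    if len(lines) > 1:
        variants.append("\n".join(lines[1:]))   # drop a chatty leading line
        variants.append("\n".join(lines[:-1]))  # drop a trailing sign-off line
    return variants

def prompt_passes_loose(response: str, verifiers: Iterable[Verifier]) -> bool:
    """A prompt passes loosely if any variant satisfies all of its constraints."""
    checks = list(verifiers)
    return any(all(v(var) for v in checks) for var in loose_variants(response))

# Made-up constraints for one prompt: at least 50 words, and no commas allowed.
constraints = [lambda r: len(r.split()) >= 50, lambda r: "," not in r]
response = "Sure, here is my answer.\n" + " ".join(["word"] * 60)
print(prompt_passes_loose(response, constraints))  # True once the comma-bearing first line is dropped
```

Loose prompt-level accuracy is then simply the fraction of prompts that pass this check.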

Acknowledgements

This model was trained on an 8xMI300X node on the AMD Developer Cloud, supported by AMD.

A big shoutout to Shisa V1 co-trainer Jon Durbin (this model card is being written in the third person by the other Shisa V1 creator, Leonard Lin 😂).
