tensoropera/Fox-1-1.6B

@@ -1,18 +1,25 @@
 ---
 license: apache-2.0
 language:
-- en
 pipeline_tag: text-generation
 ---
 ## Model Card for Fox-1-1.6B
 > [!IMPORTANT]
-> This model is a base pretrained model which requires further finetuning for most use cases. We will release the instruction-tuned version soon.
-Fox-1 is a decoder-only transformer-based small language model (SLM) with 1.6B total parameters developed by [TensorOpera AI](https://tensoropera.ai/). The model was trained with a 3-stage data curriculum on 3 trillion tokens of text and code data in 8K sequence length. Fox-1 uses grouped query attention (GQA) with 4 KV heads and 16 attention heads and has a deeper architecture than other SLMs.
-For the full details of this model please read our [release blog post](https://blog.tensoropera.ai/tensoropera-unveils-fox-foundation-model-a-pioneering-open-source-slm-leading-the-way-against-tech-giants).
 ## Benchmarks
@@ -28,4 +35,4 @@ score of the 6 benchmarks. The model was evaluated on a machine with 8*H100 GPUs
 | HellaSwag     | 62.82%     | 61.55%        | 71.60%   | 70.46%          | 65.23%       |
 | TruthfulQA    | 38.66%     | 39.37%        | 33.05%   | 38.77%          | 36.98%       |
 | Winogrande    | 60.62%     | 65.51%        | 65.51%   | 65.27%          | 61.64%       |
-| Average       | 47.13%     | 46.81%        | 46.36%   | 45.92%          | 38.28%       |

 ---
 license: apache-2.0
 language:
+  - en
 pipeline_tag: text-generation
 ---
 ## Model Card for Fox-1-1.6B
 > [!IMPORTANT]
+> This model is a base pretrained model which requires further finetuning for most use cases.
+> For a more interactive experience, we
+> recommend [tensoropera/Fox-1-1.6B-Instruct-v0.1](https://huggingface.co/tensoropera/Fox-1-1.6B-Instruct-v0.1), the
+> instruction-tuned version of Fox-1.
+Fox-1 is a decoder-only transformer-based small language model (SLM) with 1.6B total parameters developed
+by [TensorOpera AI](https://tensoropera.ai/). The model was trained with a 3-stage data curriculum on 3 trillion
+tokens of text and code data with an 8K sequence length. Fox-1 uses Grouped Query Attention (GQA) with 4 key-value heads and
+16 attention heads for faster inference.
+For the full details of this model please read
+our [release blog post](https://blog.tensoropera.ai/tensoropera-unveils-fox-foundation-model-a-pioneering-open-source-slm-leading-the-way-against-tech-giants).
 ## Benchmarks
 | HellaSwag     | 62.82%     | 61.55%        | 71.60%   | 70.46%          | 65.23%       |
 | TruthfulQA    | 38.66%     | 39.37%        | 33.05%   | 38.77%          | 36.98%       |
 | Winogrande    | 60.62%     | 65.51%        | 65.51%   | 65.27%          | 61.64%       |
+| Average       | 47.13%     | 46.81%        | 46.36%   | 45.92%          | 38.28%       |
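
The model card says Fox-1 uses Grouped Query Attention with 16 attention heads sharing 4 key-value heads "for faster inference". The sketch below illustrates what that ratio implies: which KV head each query head would read from, and how much smaller the KV cache is than with one KV head per attention head. It is a minimal illustration, not the Fox-1 implementation. The contiguous grouping rule, the function names, and the `head_dim` and `num_layers` values are assumptions made for the example. Only the head counts and the 8K sequence length come from the card.

```python
# Head counts and sequence length as stated in the model card.
NUM_ATTENTION_HEADS = 16
NUM_KV_HEADS = 4
SEQ_LEN = 8192

# Hypothetical placeholders (NOT from the model card); the ratio below
# does not depend on them.
HEAD_DIM = 128
NUM_LAYERS = 32


def kv_head_for_query_head(query_head: int,
                           num_heads: int = NUM_ATTENTION_HEADS,
                           num_kv_heads: int = NUM_KV_HEADS) -> int:
    """Return the KV head a query head shares.

    Assumes contiguous grouping (heads 0-3 share KV head 0, and so on),
    which is a common convention but is not confirmed by the model card.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    if not 0 <= query_head < num_heads:
        raise ValueError("query_head out of range")
    group_size = num_heads // num_kv_heads
    return query_head // group_size


def kv_cache_bytes(num_kv_heads: int, head_dim: int, num_layers: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for one sequence (factor 2 = K and V)."""
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_value


# Query head -> KV head assignment under the grouping assumption.
mapping = [kv_head_for_query_head(h) for h in range(NUM_ATTENTION_HEADS)]

# Cache size with one KV head per attention head versus 4 shared KV heads.
full_cache = kv_cache_bytes(NUM_ATTENTION_HEADS, HEAD_DIM, NUM_LAYERS, SEQ_LEN)
gqa_cache = kv_cache_bytes(NUM_KV_HEADS, HEAD_DIM, NUM_LAYERS, SEQ_LEN)
ratio = full_cache / gqa_cache

print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(ratio)    # 4.0
```

Whatever the real head dimension and layer count are, the cache shrinks by the ratio of attention heads to KV heads, here 16 / 4 = 4. A smaller KV cache is one reason GQA tends to speed up decoding.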