Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits

This is an 8-bit quantized version of inceptionai/Llama-3.1-Sherkala-8B-Chat, optimized for efficient inference with reduced memory requirements while maintaining performance quality.

Model Details

Model Description

  • Developed by: InceptionAI (Quantized by FilledVaccum)
  • Base Model: inceptionai/Llama-3.1-Sherkala-8B-Chat
  • Model type: Causal Language Model
  • Architecture: LlamaForCausalLM
  • Language(s): Kazakh, English, Russian, and Turkish (inherited from the base model)
  • License: Llama 3.1 License
  • Quantization: 8-bit using BitsAndBytes

Key Features

  • Memory Efficient: ~50% reduction in memory usage compared to the original FP16 model
  • Maintained Performance: Optimized quantization preserving model quality
  • Fast Inference: Accelerated generation with reduced computational overhead
  • Chat Optimized: Fine-tuned for conversational AI applications

Model Architecture

  • Parameters: ~8 billion (weights quantized to 8-bit)
  • Hidden Size: 4,096
  • Attention Heads: 32
  • Key-Value Heads: 8 (Grouped Query Attention)
  • Layers: 32
  • Context Length: 8,192 tokens (extended with RoPE scaling)
  • Vocabulary Size: 159,766 tokens
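
The values above can be cross-checked against the checkpoint's configuration. A minimal sketch (the repository name matches the Quick Start below; the attribute names follow the standard LlamaConfig fields):

from transformers import AutoConfig

# Load only the configuration (no weights) and inspect the architecture fields.
config = AutoConfig.from_pretrained(
    "FilledVaccum/Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits",
    trust_remote_code=True
)

print(config.hidden_size)              # expected: 4096
print(config.num_attention_heads)      # expected: 32
print(config.num_key_value_heads)      # expected: 8 (grouped query attention)
print(config.num_hidden_layers)        # expected: 32
print(config.max_position_embeddings)  # context length
print(config.vocab_size)               # expected: 159766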

Quantization Details

This model uses BitsAndBytes 8-bit quantization with the following configuration (a loading sketch follows the list):

  • Quantization Method: BitsAndBytes
  • Precision: 8-bit integers
  • Outlier Threshold: 6.0 (LLM.int8() outlier detection)
  • Compute Dtype: FP16 retained for computations that need full precision
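
For reference, this is roughly how the original FP16 checkpoint can be loaded in 8-bit with an equivalent configuration. This is a minimal sketch rather than the exact command used to produce this repository; the parameter names follow the transformers/bitsandbytes API:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization settings mirroring the list above.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,        # store weights as 8-bit integers
    llm_int8_threshold=6.0    # outlier detection threshold
)

# Quantize the base FP16 model on the fly at load time.
model = AutoModelForCausalLM.from_pretrained(
    "inceptionai/Llama-3.1-Sherkala-8B-Chat",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,   # FP16 compute dtype where full precision is needed
    device_map="auto"
)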

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "your-username/Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Chat template
def format_chat(messages):
    return tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )

# Example conversation
messages = [
    {"role": "user", "content": "Hello! Can you help me with a coding problem?"}
]

# Generate response
input_text = format_chat(messages)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
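
Multi-Turn Conversations

To continue a conversation, append the model's reply and the next user turn to messages, then generate again. A short sketch reusing the objects defined in the Quick Start (the follow-up question is illustrative):

# Extend the conversation with the previous reply and a follow-up question.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Great, how do I reverse a list in Python?"})

inputs = tokenizer(format_chat(messages), return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

follow_up = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(follow_up)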

Memory Requirements

Precision            Memory Usage   Relative Size
FP16 (Original)      ~16 GB         100%
8-bit (This model)   ~8 GB          50%
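
Actual consumption also depends on context length and batch size. One way to check, assuming the model loaded in the Quick Start above (get_memory_footprint is a standard transformers method):

# Report the parameter memory footprint and the peak GPU allocation so far.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

if torch.cuda.is_available():
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")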

Performance Considerations

  • Inference Speed: Broadly comparable to FP16 on modern GPUs, though the 8-bit outlier decomposition can add some overhead
  • Quality: Minimal degradation from quantization
  • Hardware: Optimized for NVIDIA GPUs with Tensor Cores
  • Batch Size: Supports larger batch sizes thanks to the reduced memory footprint (a batched-generation sketch follows this list)
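
Below is a minimal batched-generation sketch that reuses the tokenizer, format_chat, and model from the Quick Start; the prompts are illustrative. Left padding keeps the generated tokens aligned at the end of each sequence:

# Decoder-only models should be padded on the left for batched generation.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    format_chat([{"role": "user", "content": "Summarize what 8-bit quantization does."}]),
    format_chat([{"role": "user", "content": "Write a short haiku about autumn."}]),
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

# Strip the prompt tokens (same length for every row thanks to left padding).
for i, output in enumerate(outputs):
    reply = tokenizer.decode(output[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(f"Response {i}: {reply}")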

Training Details

This model inherits the training methodology from the base Llama-3.1-Sherkala-8B-Chat model. The quantization process preserves the original model's capabilities while optimizing for deployment efficiency.

Evaluation

The quantized model maintains performance characteristics similar to the base model across various benchmarks; 8-bit quantization typically introduces minimal quality loss (under roughly 2% degradation on most tasks).

Intended Uses

Primary Use Cases

  • Conversational AI: Chat applications and virtual assistants
  • Code Generation: Programming assistance and code completion
  • Content Creation: Writing assistance and creative text generation
  • Educational Tools: Tutoring and explanation systems
  • Research: Academic and commercial NLP research

Out-of-Scope Uses

  • High-stakes decision making without human oversight
  • Generating harmful, biased, or misleading content
  • Medical, legal, or financial advice without professional validation
  • Real-time safety-critical applications

Limitations and Biases

  • Inherits limitations from the base Llama 3.1 model
  • Quantization may introduce minor numerical precision effects
  • Performance may vary across different hardware configurations
  • May exhibit biases present in training data

Technical Specifications

System Requirements

  • Minimum GPU Memory: 8 GB VRAM
  • Recommended GPU Memory: 12+ GB VRAM
  • CUDA Compatibility: CUDA 11.0+
  • Python Version: 3.8+
  • Dependencies: transformers>=4.57.0, torch>=2.0.0, bitsandbytes>=0.41.0
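
A quick environment check, as a minimal sketch (package names as listed above):

import importlib.metadata as metadata
import torch

# Confirm CUDA is visible and report the installed versions of the key dependencies.
print(f"CUDA available: {torch.cuda.is_available()} (CUDA {torch.version.cuda})")
for pkg in ("transformers", "torch", "bitsandbytes"):
    print(f"{pkg}: {metadata.version(pkg)}")

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1024**3:.1f} GB free of {total / 1024**3:.1f} GB")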

Supported Frameworks

  • Transformers (Hugging Face)
  • PyTorch
  • ONNX (with conversion)
  • TensorRT (with optimization)

Citation

If you use this model, please cite both the original Sherkala model and acknowledge the quantization:

@misc{sherkala-quantized-8bit,
  title={Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits},
  author={InceptionAI},
  year={2024},
  note={8-bit quantized version of Llama-3.1-Sherkala-8B-Chat}
}

License

This model is released under the same license as the base model. Please refer to the Llama 3.1 License for detailed terms and conditions.

Contact

For questions about this quantized version or technical support, please open an issue in the model repository or contact the development team.


This model card was generated on October 4, 2025. For the most up-to-date information, please refer to the original base model documentation.
