Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits
This is an 8-bit quantized version of inceptionai/Llama-3.1-Sherkala-8B-Chat, optimized for efficient inference: it roughly halves memory requirements while largely preserving output quality.
Model Details
Model Description
- Developed by: InceptionAI (Quantized by FilledVaccum)
- Base Model: inceptionai/Llama-3.1-Sherkala-8B-Chat
- Model type: Causal Language Model
- Architecture: LlamaForCausalLM
- Language(s): Kazakh, English, Russian, and Turkish (inherited from the base model)
- License: Llama 3.1 License
- Quantization: 8-bit using BitsAndBytes
Key Features
- Memory Efficient: roughly half the memory footprint of the original FP16 weights
- Maintained Performance: 8-bit weight quantization typically preserves model quality with only minor degradation
- Fast Inference: lower memory traffic during generation on supported GPUs
- Chat Optimized: inherits the conversational fine-tuning of the base model
Model Architecture
- Parameters: 8 billion parameters (quantized to 8-bit)
- Hidden Size: 4,096
- Attention Heads: 32
- Key-Value Heads: 8 (Grouped Query Attention)
- Layers: 32
- Context Length: 8,192 tokens (extended with RoPE scaling)
- Vocabulary Size: 159,766 tokens
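The figures above can be cross-checked against the repository's configuration without downloading the weights. A minimal sketch (the repository name is taken from this card; the printed values should match the list above):

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the architecture.
config = AutoConfig.from_pretrained(
    "FilledVaccum/Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits"
)

print(config.hidden_size)              # hidden size, e.g. 4096
print(config.num_attention_heads)      # attention heads, e.g. 32
print(config.num_key_value_heads)      # KV heads for grouped-query attention, e.g. 8
print(config.num_hidden_layers)        # transformer layers, e.g. 32
print(config.vocab_size)               # vocabulary size
print(config.max_position_embeddings)  # maximum context length
```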
Quantization Details
This model uses BitsAndBytes 8-bit quantization with the following configuration:
- Quantization Method: BitsAndBytes
- Precision: 8-bit integers
- Threshold: 6.0 for outlier detection
- Compute Dtype: FP16 is retained for outlier channels and other non-quantized computations (the LLM.int8() scheme)
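If you prefer to quantize the original FP16 checkpoint on the fly rather than download the pre-quantized weights, a minimal sketch of an equivalent BitsAndBytes configuration is shown below. The parameter values mirror the list above; the exact settings used to produce this repository are an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit (LLM.int8()) configuration mirroring the settings described above.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,       # 8-bit integer weights
    llm_int8_threshold=6.0,  # outlier detection threshold
)

# Quantize the base FP16 checkpoint at load time.
model = AutoModelForCausalLM.from_pretrained(
    "inceptionai/Llama-3.1-Sherkala-8B-Chat",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,  # compute dtype for non-quantized ops
    device_map="auto",
)
```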
Usage
Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "FilledVaccum/Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat template
def format_chat(messages):
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

# Example conversation
messages = [
    {"role": "user", "content": "Hello! Can you help me with a coding problem?"}
]

# Generate response
input_text = format_chat(messages)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Memory Requirements
| Precision | Memory Usage | Relative Size |
|---|---|---|
| FP16 (Original) | ~16 GB | 100% |
| 8-bit (This model) | ~8 GB | 50% |
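These figures are approximate and cover weights only, not activations or the KV cache. After loading the model as in the Quick Start above, the actual weight footprint can be checked with Transformers' get_memory_footprint helper:

```python
# `model` is the instance loaded in the Quick Start example above.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model weight footprint: {footprint_gb:.1f} GB")
```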
Performance Considerations
- Inference Speed: roughly comparable to FP16 on modern GPUs, though LLM.int8() outlier handling can add some overhead
- Quality: minimal degradation is typical for 8-bit weight quantization
- Hardware: best suited to NVIDIA GPUs with int8 Tensor Core support
- Batch Size: the smaller weight footprint leaves headroom for larger batch sizes (see the batched-generation sketch below)
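As an illustration of the batch-size point, here is a hedged sketch of left-padded batched generation, continuing from the Quick Start example (prompts and generation settings are arbitrary; for brevity it skips the chat template shown above):

```python
# Continuing from the Quick Start example: `model` and `tokenizer` are already loaded.
# Left padding keeps each prompt adjacent to its generated tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Summarize the benefits of 8-bit quantization.",
    "Write a short haiku about autumn.",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**batch, max_new_tokens=128, do_sample=False)

# Strip the (padded) prompt tokens before decoding each response.
for prompt_ids, output_ids in zip(batch.input_ids, outputs):
    print(tokenizer.decode(output_ids[prompt_ids.shape[0]:], skip_special_tokens=True))
```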
Training Details
This model inherits the training methodology from the base Llama-3.1-Sherkala-8B-Chat model. The quantization process aims to preserve the original model's capabilities while reducing deployment cost.
Evaluation
The quantized model is expected to track the base model closely across benchmarks; 8-bit weight quantization typically shows minimal quality loss (often under 2% degradation on most tasks).
Intended Uses
Primary Use Cases
- Conversational AI: Chat applications and virtual assistants
- Code Generation: Programming assistance and code completion
- Content Creation: Writing assistance and creative text generation
- Educational Tools: Tutoring and explanation systems
- Research: Academic and commercial NLP research
Out-of-Scope Uses
- High-stakes decision making without human oversight
- Generating harmful, biased, or misleading content
- Medical, legal, or financial advice without professional validation
- Real-time safety-critical applications
Limitations and Biases
- Inherits limitations from the base Llama 3.1 model
- Quantization may introduce minor numerical precision effects
- Performance may vary across different hardware configurations
- May exhibit biases present in training data
Technical Specifications
System Requirements
- Minimum GPU Memory: 8 GB VRAM
- Recommended GPU Memory: 12+ GB VRAM
- CUDA Compatibility: CUDA 11.0+
- Python Version: 3.8+
- Dependencies: transformers>=4.57.0, torch>=2.0.0, bitsandbytes>=0.41.0
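A quick Python check of whether the local environment meets these requirements (the 8 GB threshold mirrors the minimum listed above):

```python
import torch

# bitsandbytes 8-bit inference requires a CUDA-capable GPU.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required."

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 8:
    print("Warning: less than the ~8 GB minimum listed above.")
```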
Supported Frameworks
- Transformers (Hugging Face)
- PyTorch
- ONNX (with conversion)
- TensorRT (with optimization)
Citation
If you use this model, please cite both the original Sherkala model and acknowledge the quantization:
```bibtex
@misc{sherkala-quantized-8bit,
  title  = {Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits},
  author = {InceptionAI},
  year   = {2024},
  note   = {8-bit quantized version of Llama-3.1-Sherkala-8B-Chat}
}
```
License
This model is released under the same license as the base model. Please refer to the Llama 3.1 License for detailed terms and conditions.
Contact
For questions about this quantized version or technical support, please open an issue in the model repository or contact the development team.
This model card was generated on October 4, 2025. For the most up-to-date information, please refer to the original base model documentation.