Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits
This is an 8-bit quantized version of inceptionai/Llama-3.1-Sherkala-8B-Chat, optimized for efficient inference: it roughly halves memory requirements while largely preserving output quality.
Model Details
Model Description
- Developed by: InceptionAI (Quantized by FilledVaccum)
- Base Model: inceptionai/Llama-3.1-Sherkala-8B-Chat
- Model type: Causal Language Model
- Architecture: LlamaForCausalLM
- Language(s): Kazakh, English, Russian, and Turkish (inherited from the base model)
- License: Llama 3.1 License
- Quantization: 8-bit using BitsAndBytes
Key Features
- Memory Efficient: roughly half the memory footprint of the original FP16 weights
- Maintained Performance: 8-bit weight quantization typically preserves model quality with only minor degradation
- Fast Inference: lower memory traffic during generation on supported GPUs
- Chat Optimized: inherits the conversational fine-tuning of the base model
Model Architecture
- Parameters: 8 billion parameters (quantized to 8-bit)
- Hidden Size: 4,096
- Attention Heads: 32
- Key-Value Heads: 8 (Grouped Query Attention)
- Layers: 32
- Context Length: 8,192 tokens (extended with RoPE scaling)
- Vocabulary Size: 159,766 tokens
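The figures above can be cross-checked against the repository's configuration without downloading the weights. A minimal sketch (the repository name is taken from this card; the printed values should match the list above):

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the architecture.
config = AutoConfig.from_pretrained(
    "FilledVaccum/Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits"
)

print(config.hidden_size)              # hidden size, e.g. 4096
print(config.num_attention_heads)      # attention heads, e.g. 32
print(config.num_key_value_heads)      # KV heads for grouped-query attention, e.g. 8
print(config.num_hidden_layers)        # transformer layers, e.g. 32
print(config.vocab_size)               # vocabulary size
print(config.max_position_embeddings)  # maximum context length
```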
Quantization Details
This model uses BitsAndBytes 8-bit quantization with the following configuration:
- Quantization Method: BitsAndBytes
- Precision: 8-bit integers
- Threshold: 6.0 for outlier detection
- Compute Dtype: FP16 is retained for outlier channels and other non-quantized computations (the LLM.int8() scheme)
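If you prefer to quantize the original FP16 checkpoint on the fly rather than download the pre-quantized weights, a minimal sketch of an equivalent BitsAndBytes configuration is shown below. The parameter values mirror the list above; the exact settings used to produce this repository are an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit (LLM.int8()) configuration mirroring the settings described above.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,       # 8-bit integer weights
    llm_int8_threshold=6.0,  # outlier detection threshold
)

# Quantize the base FP16 checkpoint at load time.
model = AutoModelForCausalLM.from_pretrained(
    "inceptionai/Llama-3.1-Sherkala-8B-Chat",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,  # compute dtype for non-quantized ops
    device_map="auto",
)
```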
Usage
Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "FilledVaccum/Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat template
def format_chat(messages):
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

# Example conversation
messages = [
    {"role": "user", "content": "Hello! Can you help me with a coding problem?"}
]

# Generate response
input_text = format_chat(messages)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Memory Requirements
| Precision | Memory Usage | Relative Size |
|---|---|---|
| FP16 (Original) | ~16 GB | 100% |
| 8-bit (This model) | ~8 GB | 50% |
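These figures are approximate and cover weights only, not activations or the KV cache. After loading the model as in the Quick Start above, the actual weight footprint can be checked with Transformers' get_memory_footprint helper:

```python
# `model` is the instance loaded in the Quick Start example above.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model weight footprint: {footprint_gb:.1f} GB")
```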
Performance Considerations
- Inference Speed: roughly comparable to FP16 on modern GPUs, though LLM.int8() outlier handling can add some overhead
- Quality: minimal degradation is typical for 8-bit weight quantization
- Hardware: best suited to NVIDIA GPUs with int8 Tensor Core support
- Batch Size: the smaller weight footprint leaves headroom for larger batch sizes (see the batched-generation sketch below)
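As an illustration of the batch-size point, here is a hedged sketch of left-padded batched generation, continuing from the Quick Start example (prompts and generation settings are arbitrary; for brevity it skips the chat template shown above):

```python
# Continuing from the Quick Start example: `model` and `tokenizer` are already loaded.
# Left padding keeps each prompt adjacent to its generated tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Summarize the benefits of 8-bit quantization.",
    "Write a short haiku about autumn.",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**batch, max_new_tokens=128, do_sample=False)

# Strip the (padded) prompt tokens before decoding each response.
for prompt_ids, output_ids in zip(batch.input_ids, outputs):
    print(tokenizer.decode(output_ids[prompt_ids.shape[0]:], skip_special_tokens=True))
```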
Training Details
This model inherits the training methodology from the base Llama-3.1-Sherkala-8B-Chat model. The quantization process aims to preserve the original model's capabilities while reducing deployment cost.
Evaluation
The quantized model is expected to track the base model closely across benchmarks; 8-bit weight quantization typically shows minimal quality loss (often under 2% degradation on most tasks).
Intended Uses
Primary Use Cases
- Conversational AI: Chat applications and virtual assistants
- Code Generation: Programming assistance and code completion
- Content Creation: Writing assistance and creative text generation
- Educational Tools: Tutoring and explanation systems
- Research: Academic and commercial NLP research
Out-of-Scope Uses
- High-stakes decision making without human oversight
- Generating harmful, biased, or misleading content
- Medical, legal, or financial advice without professional validation
- Real-time safety-critical applications
Limitations and Biases
- Inherits limitations from the base Llama 3.1 model
- Quantization may introduce minor numerical precision effects
- Performance may vary across different hardware configurations
- May exhibit biases present in training data
Technical Specifications
System Requirements
- Minimum GPU Memory: 8 GB VRAM
- Recommended GPU Memory: 12+ GB VRAM
- CUDA Compatibility: CUDA 11.0+
- Python Version: 3.8+
- Dependencies: transformers>=4.57.0, torch>=2.0.0, bitsandbytes>=0.41.0
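A quick Python check of whether the local environment meets these requirements (the 8 GB threshold mirrors the minimum listed above):

```python
import torch

# bitsandbytes 8-bit inference requires a CUDA-capable GPU.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required."

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 8:
    print("Warning: less than the ~8 GB minimum listed above.")
```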
Supported Frameworks
- Transformers (Hugging Face)
- PyTorch
- ONNX (with conversion)
- TensorRT (with optimization)
Citation
If you use this model, please cite both the original Sherkala model and acknowledge the quantization:
```bibtex
@misc{sherkala-quantized-8bit,
  title  = {Llama-3.1-Sherkala-8B-Chat-Quantized-8-Bits},
  author = {InceptionAI},
  year   = {2024},
  note   = {8-bit quantized version of Llama-3.1-Sherkala-8B-Chat}
}
```
License
This model is released under the same license as the base model. Please refer to the Llama 3.1 License for detailed terms and conditions.
Contact
For questions about this quantized version or technical support, please open an issue in the model repository or contact the development team.
This model card was generated on October 4, 2025. For the most up-to-date information, please refer to the original base model documentation.