Add the code changes I made and the steps needed to reproduce inference on GPU using the transformers tutorial
README.md (CHANGED)
@@ -83,10 +83,32 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/BLOOMChat-176B-v1

Specifically, we tested BLOOM inference via the command line in this repository.

Running command:

```
python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
```
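
For reference, the same model and generation settings can also be run directly from transformers (as in the `from_pretrained` line shown in the diff context above). The sketch below is not part of the original README: it assumes transformers 4.27.0, enough GPU memory for the bf16 weights (the repo was tested on 4x 80GB A100s), and the `<human>:`/`<bot>:` prompt tags from the BLOOMChat model card, which you should verify against the card.

```python
# Hedged sketch: direct transformers inference with the same generate_kwargs
# as the CLI command above. Assumes the model fits across the visible GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # let accelerate shard the weights across GPUs
    torch_dtype=torch.bfloat16,
)

prompt = "<human>: What is machine learning?\n<bot>:"  # assumed BLOOMChat chat format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=False,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.9,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the bf16 weights do not fit, `load_in_8bit=True` (the equivalent of `--dtype int8` above) is the usual fallback, at the cost of requiring bitsandbytes.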

NOTE: Things that we had to modify in order for BLOOMChat to work:
- Install transformers version 4.27.0
  - `pip install transformers==4.27.0`
- Change the model name from `bigscience/bloom` to `sambanovasystems/BLOOMChat-176B-v1`
- Modify `inference_server/models/hf_accelerate.py`
  - We tested this repo on 4 80GB A100 GPUs and would otherwise run into memory issues

Modifications for `inference_server/models/hf_accelerate.py`:

```python
from accelerate.utils.modeling import get_max_memory
...
class HFAccelerateModel(Model):
    def __init__(self, args: Namespace) -> None:
        ...
        original_max_memory_dict = get_max_memory()

        # Cap each device at 85% of the memory accelerate detects.
        reduce_max_memory_dict = {device_key: int(original_max_memory_dict[device_key] * 0.85) for device_key in original_max_memory_dict}

        kwargs["max_memory"] = reduce_max_memory_dict
```
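
For context, a hedged sketch (not code from this repo) of how a capped `max_memory` dict is typically consumed: with `device_map="auto"`, transformers forwards `max_memory` to accelerate, which plans the per-device split from those budgets, so taking only 85% of each device leaves headroom for activations and the generation cache. The 0.85 factor, model name, and int8 setting are taken from the patch and command above; everything else is illustrative.

```python
# Hedged sketch; assumes accelerate and bitsandbytes are installed.
# The exact plumbing inside inference_server/models/hf_accelerate.py may differ.
from accelerate.utils.modeling import get_max_memory
from transformers import AutoModelForCausalLM

# Same 85% cap as the patch above; keys are GPU indices plus "cpu", values are bytes.
max_memory = {device: int(limit * 0.85) for device, limit in get_max_memory().items()}

model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/BLOOMChat-176B-v1",
    device_map="auto",      # accelerate decides layer placement...
    max_memory=max_memory,  # ...but never plans past each device's capped budget
    load_in_8bit=True,      # matches --dtype int8 in the CLI command above
)
```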