Add the code changes I made and the steps needed to reproduce inference on GPU using the transformers tutorial
README.md (CHANGED)
@@ -83,10 +83,32 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/BLOOMChat-176B-v1

Specifically, we tested BLOOM inference via the command line in this repository.

Running command:

```
python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
```
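
For reference, the same model and generation settings can also be run directly from transformers (as in the `from_pretrained` line shown in the diff context above). The sketch below is not part of the original README: it assumes transformers 4.27.0, enough GPU memory for the bf16 weights (the repo was tested on 4x 80GB A100s), and the `<human>:`/`<bot>:` prompt tags from the BLOOMChat model card, which you should verify against the card.

```python
# Hedged sketch: direct transformers inference with the same generate_kwargs
# as the CLI command above. Assumes the model fits across the visible GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # let accelerate shard the weights across GPUs
    torch_dtype=torch.bfloat16,
)

prompt = "<human>: What is machine learning?\n<bot>:"  # assumed BLOOMChat chat format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=False,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.9,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the bf16 weights do not fit, `load_in_8bit=True` (the equivalent of `--dtype int8` above) is the usual fallback, at the cost of requiring bitsandbytes.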

NOTE: Things that we had to modify in order for BLOOMChat to work:
- Install transformers version 4.27.0
  - `pip install transformers==4.27.0`
- Change the model name from `bigscience/bloom` to `sambanovasystems/BLOOMChat-176B-v1`
- Modify `inference_server/models/hf_accelerate.py`
  - We tested this repo on 4 80GB A100 GPUs and would otherwise run into memory issues

Modifications for `inference_server/models/hf_accelerate.py`:

```python
from accelerate.utils.modeling import get_max_memory
...
class HFAccelerateModel(Model):
    def __init__(self, args: Namespace) -> None:
        ...
        original_max_memory_dict = get_max_memory()

        # Cap each device at 85% of the memory accelerate detects.
        reduce_max_memory_dict = {device_key: int(original_max_memory_dict[device_key] * 0.85) for device_key in original_max_memory_dict}

        kwargs["max_memory"] = reduce_max_memory_dict
```
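
For context, a hedged sketch (not code from this repo) of how a capped `max_memory` dict is typically consumed: with `device_map="auto"`, transformers forwards `max_memory` to accelerate, which plans the per-device split from those budgets, so taking only 85% of each device leaves headroom for activations and the generation cache. The 0.85 factor, model name, and int8 setting are taken from the patch and command above; everything else is illustrative.

```python
# Hedged sketch; assumes accelerate and bitsandbytes are installed.
# The exact plumbing inside inference_server/models/hf_accelerate.py may differ.
from accelerate.utils.modeling import get_max_memory
from transformers import AutoModelForCausalLM

# Same 85% cap as the patch above; keys are GPU indices plus "cpu", values are bytes.
max_memory = {device: int(limit * 0.85) for device, limit in get_max_memory().items()}

model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/BLOOMChat-176B-v1",
    device_map="auto",      # accelerate decides layer placement...
    max_memory=max_memory,  # ...but never plans past each device's capped budget
    load_in_8bit=True,      # matches --dtype int8 in the CLI command above
)
```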