---
language: en
license: mit
tags:
- BERT
- HIPAA
- PHI
- LLM
- sensitive data
- classification
- healthcare
- mHealth Application
- cybersecurity
- database
- column name classifier
- data field classifier
- transformers
- huggingface
model-index:
- name: LLM BERT Model for HIPAA-Sensitive Database Fields Classification
results: []
---
# LLM BERT Model for HIPAA-Sensitive Database Fields Classification
This repository hosts a fine-tuned BERT-base model that classifies database column names as either **PHI HIPAA-sensitive** (e.g., `birthDate`, `ssn`, `address`) or **non-sensitive** (e.g., `color`, `food`, `country`).
Use this model for:
- Masking PHI data fields before sharing a database, to avoid HIPAA violations
- Preprocessing before data anonymization
- Identifying sensitive patient data fields in a dataset before training an AI model
- Enhancing security in healthcare and mHealth applications
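The masking use case can be sketched roughly as follows. This is a minimal illustration and not part of the released model: `mask_sensitive_columns`, `MASK`, and the `is_sensitive` callback are hypothetical names, and the keyword-based stand-in classifier exists only so the snippet runs without downloading the model. In practice, `is_sensitive` would wrap the model's prediction for a column name.

```python
from typing import Callable, Dict, List

MASK = "***"  # placeholder value written into sensitive columns (assumption)


def mask_sensitive_columns(
    rows: List[Dict[str, object]],
    is_sensitive: Callable[[str], bool],
) -> List[Dict[str, object]]:
    """Return copies of `rows` with values in sensitive columns replaced by MASK."""
    # Classify each distinct column name once, then reuse the decision per row.
    columns = {name for row in rows for name in row}
    flagged = {name for name in columns if is_sensitive(name)}
    return [
        {name: (MASK if name in flagged else value) for name, value in row.items()}
        for row in rows
    ]


def demo_is_sensitive(column_name: str) -> bool:
    # Stand-in for the model, for demonstration only (assumption).
    return column_name.lower() in {"birthdate", "ssn", "address"}


rows = [{"birthDate": "1990-01-01", "color": "blue", "ssn": "123-45-6789"}]
masked = mask_sensitive_columns(rows, demo_is_sensitive)
print(masked)
```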
---
## Model Info
- **Base Model**: `bert-base-uncased`
- **Task**: Binary classification (PHI HIPAA Sensitive vs Non-sensitive)
- **Trained On**: GAN-generated synthetic and real-world column name examples
- **Framework**: Hugging Face Transformers
- **Model URL**: [https://huggingface.co/barek2k2/bert_hipaa_sensitive_db_schema](https://huggingface.co/barek2k2/bert_hipaa_sensitive_db_schema)
---
## Usage Example (End-to-End)
### 1. Install Requirements
```bash
pip install torch transformers
```
### 2. Example
```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = BertForSequenceClassification.from_pretrained("barek2k2/bert_hipaa_sensitive_db_schema")
tokenizer = BertTokenizer.from_pretrained("barek2k2/bert_hipaa_sensitive_db_schema")
model.eval()  # inference mode (disables dropout)

# Example column names
texts = ["birthDate", "country", "jwtToken", "color"]

# Tokenize the column names as one padded batch
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

# Display results (class index 1 = sensitive, 0 = non-sensitive)
for text, pred in zip(texts, predictions):
    label = "Sensitive" if pred.item() == 1 else "Non-sensitive"
    print(f"{text}: {label}")
```
### 3. Output
```text
birthDate: Sensitive
country: Non-sensitive
jwtToken: Sensitive
color: Non-sensitive
```
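The example above reports only the arg-max label. When screening for PHI, it may be useful to see how confident the model is and to lower the decision threshold so that fewer sensitive columns are missed. The sketch below assumes, as in the example, that class index 1 means "sensitive"; the helper names, the thresholds, and the logit values are illustrative assumptions, not outputs of the model. With PyTorch available, `torch.softmax(outputs.logits, dim=1)` computes the same probabilities.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(
    logits: Sequence[float], threshold: float = 0.5
) -> Tuple[str, float]:
    """Map [non_sensitive_logit, sensitive_logit] to a label and P(sensitive)."""
    p_sensitive = softmax(logits)[1]
    label = "Sensitive" if p_sensitive >= threshold else "Non-sensitive"
    return label, p_sensitive


# Made-up logits for illustration; real values would come from
# outputs.logits.tolist() in the example above.
example_logits = {"birthDate": [-2.0, 3.0], "color": [1.5, -1.0], "notes": [0.2, 0.0]}
for name, logits in example_logits.items():
    # A threshold below 0.5 trades precision for recall (assumed value).
    label, p = label_with_confidence(logits, threshold=0.3)
    print(f"{name}: {label} (p_sensitive={p:.3f})")
```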
In the healthcare industry, safeguarding sensitive patient data is of utmost importance, particularly when developing and maintaining software systems that involve database sharing. The Health Insurance Portability and Accountability Act (HIPAA) mandates strict regulations to ensure the privacy and security of Protected Health Information (PHI). Healthcare organizations must comply with these regulations to prevent unauthorized access, breaches, and potential legal consequences. However, ensuring HIPAA compliance becomes a complex challenge when databases are shared among multiple teams for debugging, development, and testing purposes.
This research proposes a novel approach that uses a BERT-based LLM to identify sensitive columns in a database schema, in order to avoid HIPAA violations involving PHI.
#### Disclaimer
This model was fine-tuned on a synthetic dataset (~5K) and is provided for research and educational purposes only. Always verify compliance before using it in production environments.
---
## Model Performance Analysis
**Table 1: Changing hyperparameters and results**
| Step | Learning Rate | Batch Size | Epoch | Weight Decay | Precision | Recall | F1 Score | Accuracy |
|--------|---------------|------------|-------|---------------|-----------|--------|----------|----------|
| 1 | 0 | 16 | 1 | 0.001 | 0.0000 | 0.0000 | 0.0000 | 36.78% |
| 2 | 1e-1 | 16 | 1 | 0.001 | 0.6321 | 1.0000 | 0.7746 | 63.21% |
| 3 | 1e-1 | 32 | 1 | 0.001 | 0.6321 | 1.0000 | 0.7746 | 63.21% |
| 4 | 1e-1 | 32 | 2 | 0.001 | 0.6321 | 1.0000 | 0.7746 | 63.21% |
| 5 | 1e-1 | 32 | 3 | 0.001 | 0.6321 | 1.0000 | 0.7746 | 63.21% |
| 6 | 1e-1 | 32 | 3 | 0.01 | 0.6321 | 1.0000 | 0.7746 | 63.21% |
| 7 | 2e-1 | 32 | 4 | 0.01 | 0.6321 | 1.0000 | 0.7746 | 63.21% |
| 8 | 3e-4 | 32 | 4 | 0.01 | 0.6331 | 0.9982 | 0.7748 | 63.32% |
| 9 | 2e-4 | 32 | 4 | 0.01 | 0.9908 | 0.9730 | 0.9818 | 97.72% |
| 10 | 1e-5 | 32 | 4 | 0.01 | 0.9964 | 0.9928 | 0.9946 | 99.31% |
| 11 | 1e-5 | 32 | 5 | 0.01 | 0.9964 | 0.9928 | 0.9946 | 99.31% |
| **12** | **1e-5** | **16** | **5** | **0.01** | **1.0000**| **0.9964** | **0.9982** | **99.72%** |
| 13 | 1e-5 | 16 | 5 | 0.1 | 1.0000 | 0.9946 | 0.9973 | 99.65% |
| 14 | 1e-5 | 32 | 5 | 0.1 | 1.0000 | 0.9946 | 0.9973 | 99.65% |
| 15 | 1e-5 | 32 | 5 | 1.0 | 0.9964 | 0.9946 | 0.9946 | 99.54% |
| 16 | 1e-6 | 32 | 5 | 1.0 | 0.8342 | 0.9153 | 0.8729 | 83.15% |
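
The F1 Score column in Table 1 is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. The short sketch below recomputes it for a few rows of the table from their reported precision and recall; the helper name `f1_score` is ours, and returning 0 when both inputs are 0 is a common convention that we assume here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (step, precision, recall) taken from Table 1
table_rows = [(1, 0.0, 0.0), (2, 0.6321, 1.0), (9, 0.9908, 0.9730), (12, 1.0, 0.9964)]
for step, p, r in table_rows:
    print(f"step {step}: F1 = {f1_score(p, r):.4f}")
```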
### Limitations
One of the main limitations of this work is the use of
a synthetic dataset instead of real-world data to fine-tune
and train the AI models. Although the dataset was carefully
checked for accuracy, it may not fully reflect the complexity
and diversity of actual healthcare records.
## Author
**MD Abdul Barek**
PhD student & GRA @ Intelligent Systems and Robotics
- University of West Florida, Florida, USA
- Email: [email protected]
- Email: [email protected]
- [Hugging Face Profile](https://huggingface.co/barek2k2)
**Advisor:**
Dr. Hakki Erhan Sevil
Associate Professor
Intelligent Systems and Robotics,
University of West Florida
Email: [email protected]
**Supervisors:**
Dr. Guillermo Francia III
Director, Research and Innovation,
Center for Cybersecurity,
University of West Florida
Email: [email protected]
Dr. Hossain Shahriar
Associate Director and Professor, Center for Cybersecurity,
University of West Florida
Email: [email protected]
Dr. Sheikh Iqbal Ahamed
Wehr Professor and Founding Chair, Department of Computer Science,
Marquette University
Email: [email protected]