Update README.md

---
tags:
- sentence-transformers
- presentation-templates
- information-retrieval
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- cyberagent/crello
language:
- en
---

# Field-adaptive-bi-encoder

## Model Details

### Model Description

A fine-tuned SentenceTransformers bi-encoder model for semantic similarity and information retrieval. This model is specifically trained for finding relevant presentation templates based on user queries, descriptions, and metadata (industries, categories, tags), as part of the Field-Adaptive Dense Retrieval framework for structured documents.

**Developed by:** Mudasir Syed (mudasir13cs)

**License:** Apache 2.0

**Finetuned from model:** sentence-transformers/all-MiniLM-L6-v2

**Paper:** [Field-Adaptive Dense Retrieval of Structured Documents](https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12352544)

### Model Sources
- **Repository:** https://github.com/mudasir13cs/hybrid-search
- **Paper:** https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12352544
- **Base Model:** https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

## Uses

### Direct Use
This model is designed for semantic search and information retrieval tasks, specifically for finding relevant presentation templates based on natural language queries. It implements field-adaptive dense retrieval for structured documents.
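
Because the templates are structured documents, each record's fields are typically flattened into a single text before encoding. The card does not specify the exact formatting, so the sketch below assumes a hypothetical record schema (`title`, `description`, `industries`, `categories`, `tags`) and a simple field-tagged concatenation, mirroring the metadata listed above:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical template record; the field names are illustrative, not the repo's actual schema
template = {
    "title": "Quarterly Business Review",
    "description": "Clean slides for reporting quarterly results",
    "industries": ["finance", "consulting"],
    "categories": ["business"],
    "tags": ["report", "quarterly", "KPI"],
}

def to_field_text(record: dict) -> str:
    """Flatten a structured record into one field-tagged string for the bi-encoder."""
    return " ".join(
        f"{field}: {' '.join(value) if isinstance(value, list) else value}"
        for field, value in record.items()
    )

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")
doc_embedding = model.encode(to_field_text(template))
print(doc_embedding.shape)  # (384,)
```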

### Downstream Use
- Presentation template recommendation systems
- Content discovery platforms
- Semantic search engines
- Information retrieval systems
- Field-adaptive dense retrieval applications
- Structured document search and ranking

### Out-of-Scope Use
- Text generation
- Question answering
- Machine translation
- Any task not related to semantic similarity or document retrieval

## Bias, Risks, and Limitations
- The model is trained on presentation template data and may not generalize well to other domains
- Performance may vary with the quality and diversity of the training data
- The model inherits biases present in the base model and training data
- Model outputs are optimized for the presentation template domain

## How to Get Started with the Model

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned bi-encoder from the Hugging Face Hub
model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

# Example queries (illustrative; replace with your own)
queries = [
    "business strategy slides",
    "pitch deck for a tech startup"
]
embeddings = model.encode(queries)

# Cosine similarity between the first two query embeddings
cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {cosine_scores.item():.4f}")

# For retrieval tasks
documents = [
    "Professional business strategy presentation template",
    "Modern marketing presentation for tech startups",
    "Financial report template for quarterly reviews"
]

# Encode queries and documents
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# Rank documents by cosine similarity to each query
similarities = util.cos_sim(query_embeddings, doc_embeddings)
print(f"Top matches: {similarities}")
```
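
For top-k retrieval over a larger corpus, SentenceTransformers' built-in `util.semantic_search` returns ranked hits per query; a short example (the corpus and query text are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

corpus = [
    "Professional business strategy presentation template",
    "Modern marketing presentation for tech startups",
    "Financial report template for quarterly reviews",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("pitch deck for a new startup", convert_to_tensor=True)

# Each hit is a dict with 'corpus_id' and 'score', ranked by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```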

## Training Details

### Training Data
- **Dataset:** Presentation template dataset with descriptions and queries
- **Size:** Custom dataset of presentation templates with metadata
- **Source:** Curated presentation template collection from structured documents
- **Domain:** Presentation templates with field-adaptive metadata

### Training Procedure
- **Architecture:** SentenceTransformer (all-MiniLM-L6-v2) with contrastive learning
- **Base Model:** sentence-transformers/all-MiniLM-L6-v2
- **Loss Function:** Triplet loss with hard negative mining / Multiple Negatives Ranking Loss
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3

### Training Hyperparameters
- **Training regime:** Supervised learning with contrastive loss
- **Hardware:** GPU (NVIDIA)
- **Training time:** ~2 hours
- **Max Sequence Length:** 512 tokens
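
The exact training script is not included in the card; below is a minimal sketch consistent with the listed setup (Multiple Negatives Ranking Loss, AdamW at 2e-5, batch size 16, 3 epochs) using the SentenceTransformers `fit` API. The query-template pairs are illustrative, not the actual training data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model.max_seq_length = 512

# (query, relevant template description) pairs; in-batch negatives supply the contrast
train_examples = [
    InputExample(texts=["quarterly finance review deck", "Financial report template for quarterly reviews"]),
    InputExample(texts=["startup marketing slides", "Modern marketing presentation for tech startups"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,  # illustrative; the card does not state a warmup schedule
    optimizer_params={"lr": 2e-5},
)
```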

## Evaluation

### Testing Data, Factors & Metrics
- **Testing Data:** Validation split from the presentation template dataset
- **Factors:** Query-description similarity, template relevance, field-adaptive retrieval performance
- **Metrics:**
  - MAP@K (Mean Average Precision at K)
  - MRR@K (Mean Reciprocal Rank at K)
  - NDCG@K (Normalized Discounted Cumulative Gain at K)
  - Cosine similarity scores
  - Recall@K
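
These metrics can be computed with SentenceTransformers' built-in `InformationRetrievalEvaluator`, which reports MAP@K, MRR@K, NDCG@K, and Recall@K; a minimal sketch with illustrative IDs and relevance judgments (not the model's actual validation set):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

# Illustrative validation data: ids and judgments are made up for the example
queries = {"q1": "quarterly finance review deck"}
corpus = {
    "d1": "Financial report template for quarterly reviews",
    "d2": "Modern marketing presentation for tech startups",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="template-val")
score = evaluator(model)  # logs MAP@K, MRR@K, NDCG@K, Recall@K; returns the main score
print(score)
```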

### Results
- **MAP@10:** ~0.85
- **MRR@10:** ~0.90
- **NDCG@10:** ~0.88
- **Performance:** Optimized for presentation template retrieval in structured document search
- **Domain:** Strong results on field-adaptive dense retrieval tasks

## Environmental Impact
- **Hardware Type:** NVIDIA GPU
- **Hours used:** ~2 hours
- **Cloud Provider:** Local/Cloud
- **Carbon Emitted:** Minimal (efficient fine-tuning)

## Technical Specifications

### Model Architecture and Objective
- **Base Architecture:** Transformer-based bi-encoder (all-MiniLM-L6-v2)
- **Objective:** Learn semantic representations for field-adaptive dense retrieval
- **Input:** Text sequences (queries, descriptions, and metadata)
- **Output:** 384-dimensional dense embeddings
- **Pooling:** Mean pooling strategy
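
A quick way to confirm the embedding size and pooling configuration (a sketch; printing the model lists its Transformer and Pooling modules):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

# 384 for MiniLM-L6-based models
print(model.get_sentence_embedding_dimension())

# The module list shows the Transformer backbone and the Pooling layer
print(model)
```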

### Compute Infrastructure
- **Hardware:** NVIDIA GPU

## Citation

**Paper:**
```bibtex
@article{field_adaptive_dense_retrieval,
  title={Field-Adaptive Dense Retrieval of Structured Documents},
  author={Mudasir Syed},
  journal={DBPIA},
  year={2024},
  url={https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12352544}
}
```

**Model:**
```bibtex
@misc{field_adaptive_bi_encoder,
  title={Field-adaptive Bi-encoder for Presentation Template Search},
  author={Mudasir Syed},
  year={2024},
  howpublished={Hugging Face},
  url={https://huggingface.co/mudasir13cs/Field-adaptive-bi-encoder}
}
```

## Model Card Authors

Mudasir Syed (mudasir13cs)

## Model Card Contact
- **GitHub:** https://github.com/mudasir13cs
- **Hugging Face:** https://huggingface.co/mudasir13cs
- **LinkedIn:** https://pk.linkedin.com/in/mudasir-sayed

## Framework versions
- SentenceTransformers: 2.2.2+
- Transformers: 4.35.0+
- PyTorch: 2.0.0+
|