mudasir13cs committed · Commit 49618b2 · verified · 1 Parent(s): 282e0a8

Update README.md

Files changed (1): README.md +69 -22
README.md CHANGED
@@ -8,6 +8,11 @@ tags:
  - sentence-transformers
  - presentation-templates
  - information-retrieval
  ---

  # Field-adaptive-bi-encoder
@@ -15,7 +20,7 @@ tags:
  ## Model Details

  ### Model Description
- A fine-tuned SentenceTransformers bi-encoder model for semantic similarity and information retrieval. This model is specifically trained for finding relevant presentation templates based on user queries, descriptions, and metadata (industries, categories, tags).

  **Developed by:** Mudasir Syed (mudasir13cs)
@@ -25,35 +30,41 @@ A fine-tuned SentenceTransformers bi-encoder model for semantic similarity and i

  **License:** Apache 2.0

- **Finetuned from model:** Microsoft/MiniLM-L12-H384-uncased

  ### Model Sources
- **Repository:** https://github.com/mudasir13cs/hybrid-search

  ## Uses

  ### Direct Use
- This model is designed for semantic search and information retrieval tasks, specifically for finding relevant presentation templates based on natural language queries.

  ### Downstream Use
  - Presentation template recommendation systems
  - Content discovery platforms
  - Semantic search engines
  - Information retrieval systems

  ### Out-of-Scope Use
  - Text generation
  - Question answering
  - Machine translation
- - Any task not related to semantic similarity

  ## Bias, Risks, and Limitations
  - The model is trained on presentation template data and may not generalize well to other domains
  - Performance may vary based on the quality and diversity of training data
  - The model inherits biases present in the base model and training data

  ## How to Get Started with the Model
-
  ```python
  from sentence_transformers import SentenceTransformer
  import torch
@@ -69,6 +80,21 @@ embeddings = model.encode(queries)
  from sentence_transformers import util
  cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
  print(f"Similarity: {cosine_scores.item():.4f}")
  ```

  ## Training Details
@@ -76,49 +102,57 @@ print(f"Similarity: {cosine_scores.item():.4f}")
  ### Training Data
  - **Dataset:** Presentation template dataset with descriptions and queries
  - **Size:** Custom dataset of presentation templates with metadata
- - **Source:** Curated presentation template collection

  ### Training Procedure
- - **Architecture:** SentenceTransformer with triplet loss
- - **Loss Function:** Triplet loss with hard negative mining
  - **Optimizer:** AdamW
  - **Learning Rate:** 2e-5
  - **Batch Size:** 16
  - **Epochs:** 3

  ### Training Hyperparameters
- - **Training regime:** Supervised learning with triplet loss
  - **Hardware:** GPU (NVIDIA)
  - **Training time:** ~2 hours

  ## Evaluation

  ### Testing Data, Factors & Metrics
  - **Testing Data:** Validation split from presentation template dataset
- - **Factors:** Query-description similarity, template relevance
  - **Metrics:**
    - MAP@K (Mean Average Precision at K)
    - MRR@K (Mean Reciprocal Rank at K)
    - Cosine similarity scores

  ### Results
  - **MAP@10:** ~0.85
  - **MRR@10:** ~0.90
- - **Performance:** Optimized for presentation template retrieval

  ## Environmental Impact
  - **Hardware Type:** NVIDIA GPU
  - **Hours used:** ~2 hours
  - **Cloud Provider:** Local/Cloud
- - **Carbon Emitted:** Minimal (local training)

  ## Technical Specifications

  ### Model Architecture and Objective
- - **Architecture:** Transformer-based bi-encoder
- - **Objective:** Learn semantic representations for similarity search
- - **Input:** Text sequences (queries and descriptions)
- - **Output:** 384-dimensional embeddings

  ### Compute Infrastructure
  - **Hardware:** NVIDIA GPU
@@ -126,12 +160,24 @@ print(f"Similarity: {cosine_scores.item():.4f}")
126
 
127
  ## Citation
128
 
129
- **BibTeX:**
 
 
 
 
 
 
 
 
 
 
 
130
  ```bibtex
131
- @misc{field-adaptive-bi-encoder,
132
  title={Field-adaptive Bi-encoder for Presentation Template Search},
133
  author={Mudasir Syed},
134
  year={2024},
 
135
  url={https://huggingface.co/mudasir13cs/Field-adaptive-bi-encoder}
136
  }
137
  ```
@@ -145,8 +191,9 @@ Mudasir Syed (mudasir13cs)
  ## Model Card Contact
  - **GitHub:** https://github.com/mudasir13cs
  - **Hugging Face:** https://huggingface.co/mudasir13cs

  ## Framework versions
- - SentenceTransformers: 2.2.2
- - Transformers: 4.35.0
- - PyTorch: 2.0.0
 
  - sentence-transformers
  - presentation-templates
  - information-retrieval
+ base_model: sentence-transformers/all-MiniLM-L6-v2
+ datasets:
+ - cyberagent/crello
+ language:
+ - en
  ---

  # Field-adaptive-bi-encoder
 
  ## Model Details

  ### Model Description
+ A fine-tuned SentenceTransformers bi-encoder for semantic similarity and information retrieval. It is trained specifically to find relevant presentation templates from user queries, descriptions, and metadata (industries, categories, tags), as part of the Field-Adaptive Dense Retrieval framework for structured documents.

  **Developed by:** Mudasir Syed (mudasir13cs)
 
 
  **License:** Apache 2.0

+ **Finetuned from model:** sentence-transformers/all-MiniLM-L6-v2
+
+ **Paper:** [Field-Adaptive Dense Retrieval of Structured Documents](https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12352544)

  ### Model Sources
+ - **Repository:** https://github.com/mudasir13cs/hybrid-search
+ - **Paper:** https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12352544
+ - **Base Model:** https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  ## Uses

  ### Direct Use
+ This model is designed for semantic search and information retrieval tasks, specifically for finding relevant presentation templates based on natural language queries. It implements field-adaptive dense retrieval for structured documents.
 
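The card does not spell out how individual template fields are combined at query time; that is covered in the linked paper. Purely as an illustration of passing a structured template record through the bi-encoder, the sketch below flattens a hypothetical record (title, description, industries, categories, tags, i.e. the same kinds of fields named above) into one string before encoding. The record, field order, and separator are assumptions for the example, not the paper's field-adaptive weighting scheme.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

# Hypothetical structured template record; field names mirror the metadata listed in this card.
template = {
    "title": "Quarterly financial review deck",
    "description": "Clean slides for presenting quarterly results to stakeholders",
    "industries": ["Finance"],
    "categories": ["Report"],
    "tags": ["quarterly", "KPI", "budget"],
}

# Naive field flattening (illustration only; not the paper's field-adaptive method).
doc_text = " | ".join([
    template["title"],
    template["description"],
    "industries: " + ", ".join(template["industries"]),
    "categories: " + ", ".join(template["categories"]),
    "tags: " + ", ".join(template["tags"]),
])

query_emb = model.encode("slides for a quarterly finance update")
doc_emb = model.encode(doc_text)
print(f"Query-template similarity: {util.cos_sim(query_emb, doc_emb).item():.4f}")
```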
  ### Downstream Use
  - Presentation template recommendation systems
  - Content discovery platforms
  - Semantic search engines
  - Information retrieval systems
+ - Field-adaptive dense retrieval applications
+ - Structured document search and ranking

  ### Out-of-Scope Use
  - Text generation
  - Question answering
  - Machine translation
+ - Any task not related to semantic similarity or document retrieval

  ## Bias, Risks, and Limitations
  - The model is trained on presentation template data and may not generalize well to other domains
  - Performance may vary based on the quality and diversity of training data
  - The model inherits biases present in the base model and training data
+ - Model outputs are optimized for the presentation template domain

  ## How to Get Started with the Model
  ```python
  from sentence_transformers import SentenceTransformer
  import torch
 
  from sentence_transformers import util
  cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
  print(f"Similarity: {cosine_scores.item():.4f}")
+
+ # For retrieval tasks
+ documents = [
+     "Professional business strategy presentation template",
+     "Modern marketing presentation for tech startups",
+     "Financial report template for quarterly reviews",
+ ]
+
+ # Encode queries and documents
+ query_embeddings = model.encode(queries)
+ doc_embeddings = model.encode(documents)
+
+ # Find most similar documents
+ similarities = util.cos_sim(query_embeddings, doc_embeddings)
+ print(f"Top matches: {similarities}")
  ```
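The snippet above prints the raw query-document similarity matrix. As a follow-up sketch (reusing the model id and example documents from this card, with an assumed example query), here is one way to turn those scores into a ranked top-k list using `torch.topk`:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

queries = ["presentation for a startup marketing plan"]  # assumed example query
documents = [
    "Professional business strategy presentation template",
    "Modern marketing presentation for tech startups",
    "Financial report template for quarterly reviews",
]

# Score every document against every query, then rank with torch.topk.
similarities = util.cos_sim(model.encode(queries), model.encode(documents))
top_k = torch.topk(similarities, k=2, dim=1)
for q_idx, query in enumerate(queries):
    print(f"Query: {query}")
    for score, d_idx in zip(top_k.values[q_idx], top_k.indices[q_idx]):
        print(f"  {score.item():.4f}  {documents[int(d_idx)]}")
```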

  ## Training Details
 
  ### Training Data
  - **Dataset:** Presentation template dataset with descriptions and queries
  - **Size:** Custom dataset of presentation templates with metadata
+ - **Source:** Curated presentation template collection from structured documents
+ - **Domain:** Presentation templates with field-adaptive metadata
 
  ### Training Procedure
+ - **Architecture:** SentenceTransformer (all-MiniLM-L6-v2) with contrastive learning
+ - **Base Model:** sentence-transformers/all-MiniLM-L6-v2
+ - **Loss Function:** Triplet loss with hard negative mining / Multiple Negatives Ranking Loss
  - **Optimizer:** AdamW
  - **Learning Rate:** 2e-5
  - **Batch Size:** 16
  - **Epochs:** 3
 
  ### Training Hyperparameters
+ - **Training regime:** Supervised learning with contrastive loss
  - **Hardware:** GPU (NVIDIA)
  - **Training time:** ~2 hours
+ - **Max Sequence Length:** 512 tokens
 
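The training script itself is not included in this card. The sketch below only illustrates the setup listed above (batch size 16, learning rate 2e-5, 3 epochs, AdamW as the default optimizer of `SentenceTransformer.fit`) with `MultipleNegativesRankingLoss`, one of the two loss options named in the Training Procedure; the (query, positive description) pairs and the warmup value are invented for the example.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, positive template description) pairs standing in for the real dataset.
train_examples = [
    InputExample(texts=["quarterly finance review deck", "Financial report template for quarterly reviews"]),
    InputExample(texts=["startup marketing pitch", "Modern marketing presentation for tech startups"]),
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

# fit() uses AdamW by default; lr and epochs follow the hyperparameters listed above.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,  # illustrative value, not taken from the card
)
```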
  ## Evaluation

  ### Testing Data, Factors & Metrics
  - **Testing Data:** Validation split from presentation template dataset
+ - **Factors:** Query-description similarity, template relevance, field-adaptive retrieval performance
  - **Metrics:**
    - MAP@K (Mean Average Precision at K)
    - MRR@K (Mean Reciprocal Rank at K)
+   - NDCG@K (Normalized Discounted Cumulative Gain at K)
    - Cosine similarity scores
+   - Recall@K
 
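For reference, here is a small self-contained sketch of how MRR@K and MAP@K can be computed from binary relevance labels of ranked results. The toy labels are invented and this is not the card's actual evaluation code; MAP here normalizes by the number of relevant items found in the top K, a common convention.

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def map_at_k(ranked_relevance, k=10):
    """Mean Average Precision: mean precision at each relevant rank within the top K."""
    total = 0.0
    for rels in ranked_relevance:
        hits, precision_sum = 0, 0.0
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                hits += 1
                precision_sum += hits / rank
        total += precision_sum / max(hits, 1)
    return total / len(ranked_relevance)


# Toy relevance labels (1 = relevant) for the top-ranked templates of two queries.
ranked = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(f"MRR@10 = {mrr_at_k(ranked):.2f}, MAP@10 = {map_at_k(ranked):.2f}")
```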
  ### Results
  - **MAP@10:** ~0.85
  - **MRR@10:** ~0.90
+ - **NDCG@10:** ~0.88
+ - **Performance:** Optimized for presentation template retrieval in structured document search
+ - **Domain:** High performance on field-adaptive dense retrieval tasks
 
  ## Environmental Impact
  - **Hardware Type:** NVIDIA GPU
  - **Hours used:** ~2 hours
  - **Cloud Provider:** Local/Cloud
+ - **Carbon Emitted:** Minimal (efficient fine-tuning)
 
  ## Technical Specifications

  ### Model Architecture and Objective
+ - **Base Architecture:** Transformer-based bi-encoder (all-MiniLM-L6-v2)
+ - **Objective:** Learn semantic representations for field-adaptive dense retrieval
+ - **Input:** Text sequences (queries, descriptions, and metadata)
+ - **Output:** 384-dimensional dense embeddings
+ - **Pooling:** Mean pooling strategy
 
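These properties can be verified on the loaded model with standard sentence-transformers attributes; the expected values assume the card's description of a 384-dimensional, mean-pooled MiniLM encoder.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mudasir13cs/Field-adaptive-bi-encoder")

print(model.get_sentence_embedding_dimension())  # embedding size; 384 per this card
print(model.max_seq_length)                      # maximum input length in tokens
print(model)                                     # module list, including the pooling layer
```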
  ### Compute Infrastructure
  - **Hardware:** NVIDIA GPU
 
  ## Citation

+ **Paper:**
+ ```bibtex
+ @article{field_adaptive_dense_retrieval,
+   title={Field-Adaptive Dense Retrieval of Structured Documents},
+   author={Mudasir Syed},
+   journal={DBPIA},
+   year={2024},
+   url={https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE12352544}
+ }
+ ```
+
+ **Model:**
  ```bibtex
+ @misc{field_adaptive_bi_encoder,
    title={Field-adaptive Bi-encoder for Presentation Template Search},
    author={Mudasir Syed},
    year={2024},
+   howpublished={Hugging Face},
    url={https://huggingface.co/mudasir13cs/Field-adaptive-bi-encoder}
  }
  ```
 
  ## Model Card Contact
  - **GitHub:** https://github.com/mudasir13cs
  - **Hugging Face:** https://huggingface.co/mudasir13cs
+ - **LinkedIn:** https://pk.linkedin.com/in/mudasir-sayed

  ## Framework versions
+ - SentenceTransformers: 2.2.2+
+ - Transformers: 4.35.0+
+ - PyTorch: 2.0.0+
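A quick runtime check (not part of the original card) to confirm the installed versions meet these minimums:

```python
import sentence_transformers, transformers, torch

print("sentence-transformers", sentence_transformers.__version__)
print("transformers", transformers.__version__)
print("torch", torch.__version__)
```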