Upload data_summary_card.md

#2
by bsnelling - opened
Files changed (1)
  1. data_summary_card.md +149 -0
data_summary_card.md ADDED
@@ -0,0 +1,149 @@
+
+
+
+
+ # Data Summary for microsoft_udop-large
+
+
+
+ This data summary was created by Microsoft on behalf of the developer and may contain mistakes
+
+
+
+ ## 1. General information
+
+ **1.0.1 Version of the Summary:** 1.0
+
+
+
+ **1.0.2 Last update:** 2-Dec-2025
+
+
+
+ ## 1.1 Model Developer Identification
+
+ **1.1.1 Model Developer name and contact details:** Microsoft Corporation at One Microsoft Way, Redmond, WA 98052. Tel: 425-882-8080
+
+
+
+ ## 1.2 Model Identification
+
+ **1.2.1 Versioned model name(s):** udop-large, udop-large-512, udop-large-512-300k
+
+
+
+ **1.2.2 Model release date:** 26-Feb-2024
+
+
+
+ ## 1.3 Overall training data size and characteristics
+
+ ### 1.3.1 Size of dataset and characteristics
+
+ **1.3.1.A Text training data size:** 1 billion to 10 trillion tokens
+
+
+
+ **1.3.1.B Text training data content:** OCR text extracted from scanned documents, including diverse document types such as invoices, forms, letters, receipts, academic papers, and web-like pages with figures, tables, and varied layouts
+
+
+
+ **1.3.1.C Image training data size:** 1 million to 10 billion tokens
+
+
+
+ **1.3.1.D Image training data content:** Scanned document images, including diverse document types such as invoices, forms, letters, receipts, academic papers, and web-like pages with figures, tables, and varied layouts
+
+
+
+ **1.3.1.E Audio training data size:** Not applicable. Audio is not part of the training data
+
+
+
+ **1.3.1.F Audio training data content:** Not applicable
+
+
+
+ **1.3.1.G Video training data size:** Not applicable. Video data is not part of the training data
+
+
+
+ **1.3.1.H Video training data content:** Not applicable
+
+
+
+ **1.3.1.I Other training data size:** Not applicable
+
+
+
+ **1.3.1.J Other training data content:** Not applicable
+
+
+
+ **1.3.2 Latest date of data acquisition/collection for model training:** This information cannot be provided due to unavailability of the underlying data (e.g., loss, corruption, or other access limitations)
+
+
+
+ **1.3.3 Is data collection ongoing to update the model with new data collection after deployment?** No
+
+
+
+ **1.3.4 Date the training dataset was first used to train the model:**
+
+
+
+ **1.3.5 Rationale or purpose of data selection:** Large-scale scanned document images with OCR text and layout provide diverse real-world document structures to learn unified vision-text-layout representations. Incorporating multiple supervised datasets across classification, layout analysis, information extraction, question answering, and NLI supports the model’s intended use for universal document processing tasks and improves performance across varied domains
+
+
+
+ ## 2. List of data sources
+
+ ### 2.1 Publicly available datasets
+
+ **2.1.1 Have you used publicly available datasets to train the model?** Yes
+
+
+
+ ## 2.2 Private non-publicly available datasets obtained from third parties
+
+ ### 2.2.1 Datasets commercially licensed by rights holders or their representatives
+
+ **2.2.1.A Have you concluded transactional commercial licensing agreement(s) with rights holder(s) or with their representatives?** Not applicable
+
+
+
+ ### 2.2.2 Private datasets obtained from other third-parties
+
+ **2.2.2.A Have you obtained private datasets from third parties that are not licensed as described in Section 2.2.1, such as data obtained from providers of private databases, or data intermediaries?** No
+
+
+
+ ## 2.3 Personal Information
+
+ **2.3.1 Was personal data used to train the model?** Microsoft follows all relevant laws and regulations pertaining to personal information
+
+
+
+ ## 2.4 Synthetic data
+
+ **2.4.1 Was any synthetic AI-generated data used to train the model?** No
+
+
+
+ ## 3. Data processing aspects
+
+ ### 3.1 Respect of reservation of rights from text and data mining exception or limitation
+
+ **3.1.1 Does this dataset include any data protected by copyright, trademark, or patent?** Microsoft follows all required regulations and laws for processing data protected by copyright, trademark, or patent
+
+
+
+ ## 3.2 Other information
+
+ **3.2.1 Does the dataset include information about consumer groups without revealing individual consumer identities?** Microsoft follows all required regulations and laws for protecting consumer identities
+
+
+
+ **3.2.2 Was the dataset cleaned or modified before model training?** Yes
+
+