Clara
kyle-gion and tgeffner committed
Commit 3f80e8d · verified · 0 Parent(s)

initial commit

Co-authored-by: tgeffner <tgeffner@users.noreply.huggingface.co>
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,35 @@
+ NVIDIA License
+
+ 1. Definitions
+
+ “Licensor” means any person or entity that distributes its Work.
+ “Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.
+ The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
+ Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.
+
+ 2. License Grant
+
+ 2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.
+
+ 3. Limitations
+
+ 3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.
+
+ 3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.
+
+ 3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for research or evaluation purposes only.
+
+ 3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.
+
+ 3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.
+
+ 3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.
+
+ 4. Disclaimer of Warranty.
+
+ THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.
+
+ 5. Limitation of Liability.
+
+ EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
README.md ADDED
@@ -0,0 +1,140 @@
+ ---
+ license: other
+ license_name: nsclv1
+ license_link: LICENSE
+ ---
+ # Proteina Model Card <br>
+
+ The code for using the Proteina model checkpoints is available in the [official GitHub repository](https://github.com/NVIDIA-Digital-Bio/proteina).
+
+ # Overview
+
+ ## Description: <br>
+ Proteina is a state-of-the-art generative model of protein structures that generates digital representations of protein backbone structures. It is trained with a flow matching objective and sampled iteratively starting from random noise, using either deterministic or stochastic sampling. It enables a protein designer to generate digital representations of new protein structures unconditionally, with fold class guidance, or conditioned on motif structures. Fold class guidance is implemented through a classifier-free guidance scheme.
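The classifier-free guidance idea mentioned above can be sketched generically: the model evaluates the learned velocity field once with and once without the fold class condition, then extrapolates between the two. This is a minimal NumPy illustration, not Proteina's actual implementation; the velocity arrays and the guidance weight `w` are placeholders.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward the conditional one with guidance weight w.
    w = 0 -> unconditional, w = 1 -> conditional, w > 1 -> amplified."""
    return v_uncond + w * (v_cond - v_uncond)

# Toy velocities for 3 residues in 3-D (placeholder values).
v_uncond = np.zeros((3, 3))
v_cond = np.ones((3, 3))
v_guided = cfg_velocity(v_cond, v_uncond, w=2.0)
```

With `w > 1`, the guided velocity overshoots the conditional prediction, trading sample diversity for stronger adherence to the fold class condition.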
+
+ This model is ready for non-commercial use and research and development.<br>
+
+ ### License/Terms of Use: <br>
+ Proteina is released under an NVIDIA license for non-commercial or research purposes only; please see [LICENSE](LICENSE).
+
+ ### Deployment Geography:
+ Global
+
+ ### Use Case: <br>
+ Proteina can be used by protein designers interested in generating novel protein backbone structures.
+
+ ### Release Date: <br>
+ February 28, 2025 <br>
+
+ ## Reference(s):
+ The associated paper, *"Proteina: Scaling Flow-based Protein Structure Generative Models"*, can be found at https://openreview.net/forum?id=TVQLu34bdw. <br>
+
+ ## Model Architecture: <br>
+ **Architecture Type:** Flow model <br>
+ **Network Architecture:** Transformer
+
+ We use a new non-equivariant transformer architecture with pair bias in the attention layers and optional triangle multiplicative layers for refining the pair representation. The architecture operates on the protein backbone’s three-dimensional carbon-alpha coordinates, which are iteratively updated during the generation process. The model parametrizes the flow that maps the noise distribution to the generated distribution. <br>
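The iterative generation process described above can be illustrated with a minimal deterministic Euler integrator for a learned velocity field. This is a generic flow-matching sketch under assumed conventions (time running from 0 to 1, standard Gaussian noise at t=0), not Proteina's sampler; `velocity_fn` stands in for the trained transformer.

```python
import numpy as np

def euler_sample(velocity_fn, n_residues, n_steps=100, seed=0):
    """Deterministic flow-matching sampling: start from Gaussian noise
    over C-alpha coordinates and integrate the velocity field to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_residues, 3))  # (residues, xyz)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + velocity_fn(x, t) * dt  # Euler step
    return x

# Toy velocity field that pulls all coordinates toward the origin.
coords = euler_sample(lambda x, t: -x, n_residues=50)
```

Stochastic sampling would add a noise term to each update step; the noise scale and time step schedule listed under "Input" below control exactly these knobs.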
+
+ ## Input:<br>
+ **Input Type(s):**
+
+ - Text (time step schedule, noise schedule, sampling mode, motif coordinates) <br>
+
+ - Number (number of residues, noise scale, time step size, seed, noise schedule exponent, guidance weight, autoguidance weight) <br>
+
+ - Binary (use of self-conditioning, use of fold conditioning) <br>
+
+ **Input Format(s):**
+
+ - Text: Strings (time step schedule, noise schedule, sampling mode), PDB file (motif coordinates) <br>
+
+ - Number: Integers (number of residues, seed), floats (noise scale, time step size, noise schedule exponent, guidance weight, autoguidance weight) <br>
+
+ - Binary: Booleans <br>
+
+ **Input Parameters:**
+
+ - Text: 1D or text file (PDB file)
+
+ - Number: 1D
+
+ - Binary: 1D
+
+ **Other Properties Related to Input:** All inputs are handled and specified in the config YAML files; see the README.
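Purely as an illustration of how the inputs above map onto a config file, a sampling configuration might look like the sketch below. The key names are hypothetical and do not come from the Proteina repository; consult the actual config YAML files referenced in the README for the real schema.

```yaml
# Hypothetical sampling config (illustrative key names, not Proteina's schema).
sampling:
  num_residues: 200         # Number: integer
  time_step_size: 0.0025    # Number: float
  noise_scale: 1.0          # Number: float
  noise_schedule: linear    # Text: noise schedule
  mode: deterministic       # Text: sampling mode
  seed: 42                  # Number: integer
  guidance_weight: 2.0      # Number: float
  self_conditioning: true   # Binary: boolean
  fold_conditioning: false  # Binary: boolean
```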
+
+ ## Output: <br>
+ **Output Type(s):** Text (generated protein backbone coordinates) <br>
+ **Output Format:** Text: PDB file (generated protein backbone coordinates) <br>
+
+ ## Software Integration: <br>
+ **Runtime Engine(s):** PyTorch <br>
+
+ **Supported Hardware Microarchitecture Compatibility:** <br>
+ NVIDIA Ampere (tested on A100) <br>
+
+ **Supported Operating System(s):** <br>
+ Linux <br>
+
+ ## Model Version(s):
+ We release eight model checkpoints:
+ - Proteina v1.1 (trained on D_FS, with ~200M transformer and ~15M triangle layer parameters)
+ - Proteina v1.2 (trained on D_FS, with ~200M transformer parameters, no triangle layers)
+ - Proteina v1.3 (trained on D_FS, with ~60M transformer parameters, no triangle layers)
+ - Proteina v1.4 (trained on D_21M, with ~400M transformer and ~15M triangle layer parameters)
+ - Proteina v1.5 (v1.1, fine-tuned with LoRA on a PDB subset)
+ - Proteina v1.6 (v1.2, fine-tuned for long protein generation)
+ - Proteina v1.7 (trained on D_FS for motif scaffolding, with ~60M transformer and ~12M triangle layer parameters)
+ - Proteina v1.8 (a "weak" checkpoint of Proteina v1.4 from early in training, after 10k steps; used as the guidance model in the autoguidance experiments)
+
+ # Training and Evaluation Datasets:
+
+ For additional information regarding the datasets, please see the paper at https://openreview.net/forum?id=TVQLu34bdw.
+
+ ## Training Datasets:
+
+ AlphaFold Protein Structure Database (AFDB)
+ - Link: https://alphafold.ebi.ac.uk/
+ - Data Collection Method by dataset: Synthetic (AlphaFold predictions)
+ - Labeling Method by dataset: N/A (no labels)
+ - Properties: The AlphaFold Protein Structure Database (AFDB) contains approx. 214M synthetic three-dimensional protein structures predicted by AlphaFold2, along with their corresponding sequences. We trained Proteina on two filtered subsets of the AFDB, one comprising 588,318 structures, the other comprising 20,874,485 structures.
+
+ Protein Data Bank (PDB)
+ - Link: https://www.rcsb.org/
+ - Data Collection Method by dataset: Automatic/Sensors/Human (experimental protein structure determination)
+ - Labeling Method by dataset: N/A (no labels)
+ - Properties: The Protein Data Bank (PDB) contains approx. 200K experimentally determined three-dimensional structures of large biological molecules, such as proteins and nucleic acids, along with auxiliary information such as the protein sequences. In one experiment, we used LoRA to fine-tune Proteina on a filtered subset of the PDB comprising 90,423 proteins.
+
+ The Encyclopedia of Domains (TED) structural domain assignments for the AlphaFold Database
+ - Link: https://zenodo.org/records/13908086
+ - Data Collection Method by dataset: Synthetic
+ - Labeling Method by dataset: Automated
+ - Properties: TED provides the CATH fold class labels for the majority of the structures in the AFDB. We use all available labels for our AFDB-based training sets, excluding the homologous superfamily level labels (for the 588,318-sized training set, 99.9% of the structures are labeled; for the 20,874,485-sized training set, 69.7% of the structures are labeled).
+
+ ## Evaluation Datasets:
+ AlphaFold Protein Structure Database (AFDB)
+ - Link: https://alphafold.ebi.ac.uk/
+ - Data Collection Method by dataset: Synthetic (AlphaFold predictions)
+ - Labeling Method by dataset: N/A (no labels)
+ - Properties: The AFDB subset of 588,318 structures that was used to train Proteina was also used as a reference set in evaluations.
+
+ Protein Data Bank (PDB)
+ - Link: https://www.rcsb.org/
+ - Data Collection Method by dataset: Automatic/Sensors/Human (experimental protein structure determination)
+ - Labeling Method by dataset: N/A (no labels)
+ - Properties: In different evaluations we used either the entire PDB or a subset of size 15,357 as the reference set.
+
+ ## Evaluation Results
+ Extensive benchmarks and evaluations can be found in the associated paper, https://openreview.net/forum?id=TVQLu34bdw.
+
+ ## Inference:
+ **Engine:** PyTorch <br>
+ **Test Hardware:** A100 <br>
+
+ ## Ethical Considerations:
+ Users are responsible for ensuring the physical properties of model-generated molecules and proteins are appropriately evaluated and comply with applicable safety regulations and ethical standards.
+
+ NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
proteina_v1.1_DFS_200M_tri.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:982b68117ac3b16aa09c8b9a3b872d6e9987200fe45280c87d1dcc2a018f8d5a
+ size 2496977815
proteina_v1.2_DFS_200M_notri.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b5f4c1457fac0e7f54b909245b36b95ddd6bfd314ddd51973a202a3a5a2b6ce9
+ size 2293906151
proteina_v1.5_DFS_200M_tri_PDB_LoRA.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:556284730d35ae039cbb4d4fe3f0d592c6ebd582ac05d9bfff4b7bdd6d1d9bd2
+ size 922311690
proteina_v1.6_DFS_200M_notri_long_chain_generation.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6170e3d9d17c8785af0fd4e0be7c57c0b72827e274afb9994ee08b7de37f9396
+ size 2293906599
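The checkpoint entries above are Git LFS pointer files: small text stubs recording the SHA-256 digest and byte size of the real file, which `git lfs pull` (or the Hugging Face Hub tooling) replaces with the actual checkpoint on download. A minimal parser for this pointer format, using the v1.2 pointer contents shown above as data:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file (spec v1) into a dict.
    Each line is 'key value'; 'oid' is 'sha256:<hex>', 'size' is bytes."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size_bytes": int(fields["size"]),
    }

# Pointer contents of proteina_v1.2_DFS_200M_notri.ckpt, as listed above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:b5f4c1457fac0e7f54b909245b36b95ddd6bfd314ddd51973a202a3a5a2b6ce9
size 2293906151
"""
info = parse_lfs_pointer(pointer)
print(info["size_bytes"] / 1e9)  # roughly 2.29 GB
```

This is why a plain `git clone` without LFS yields only few-hundred-byte `.ckpt` files: the stubs must be resolved against LFS storage to obtain the multi-gigabyte checkpoints.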