saanikat committed on
Commit · 3877882
1 Parent(s): 0c93e0f
restructuring
- README.md +39 -17
- bedbase_schema/bedbase_schema_design.yaml +47 -0
- label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl +0 -0
- model_bedbase.pth → bedbase_schema/model_bedbase.pth +0 -0
- vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl +0 -0
- encode_schema/encode_schema_design.yaml +59 -0
- label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl +0 -0
- model_encode.pth → encode_schema/model_encode.pth +0 -0
- vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl +0 -0
- fairtracks_schema/fairtracks_schema_design.yaml +48 -0
- label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl +0 -0
- model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth +0 -0
- vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl +0 -0
README.md
CHANGED
@@ -1,24 +1,46 @@
 ### Model Description
 
-This repository
-Both of these models are used by the `attribute-standardizer` for standardizing the metadata based on user choice.
+This repository hosts three pre-trained models designed for standardizing the attributes of genomic region metadata. The three pre-trained models are `ENCODE`, `FAIRTRACKS` and `BEDBASE`. These models, along with their associated files and schema designs, are used for standardization by `BEDMS` (BED Metadata Standardizer). To learn more about BEDMS, visit: https://github.com/databio/bedms
 
-###
-1. [model_encode.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_encode.pth) : This has the ENCODE metadata trained model.
-2. [model_fairtracks.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_fairtracks.pth) : This has the FAIRTRACKS BLUEPRINT metadata trained model.
-3. [vectorizer_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_encode.pkl) : This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding which is used an an input to the model when the user selects ENCODE schema.
-4. [vectorizer_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_fairtracks.pkl): This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding which is used an an input to the model when the user selects FAIRTRACKS schema.
-5. [label_encoder_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_encode.pkl): This is a pickle file which contains the unqiue label values derived from the training data. The model classifies the output into these labels for ENCODE schema.
-6. [label_encoder_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_fairtracks.pkl): This is a pickle file which contains the unqiue label values derived from the training data. The model classifies the output into these labels for FAIRTRACKS schema.
+### Directory structure
 
-### Usage
-To load this model:
 ```
-
-
-
-
+/attribute-standardizer-model6
+    /bedbase_schema
+        - bedbase_schema_design.yaml # BEDBASE schema
+        - label_encoder_bedbase.pkl # Unique label values derived from training data; the model classifies its output into these labels for the BEDBASE schema
+        - model_bedbase.pth # BEDBASE schema trained model
+        - vectorizer_bedbase.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+    /encode_schema
+        - encode_schema_design.yaml # ENCODE schema
+        - label_encoder_encode.pkl # Unique label values derived from training data; the model classifies its output into these labels for the ENCODE schema
+        - model_encode.pth # ENCODE schema trained model
+        - vectorizer_encode.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+    /fairtracks_schema
+        - fairtracks_schema_design.yaml # FAIRTRACKS schema
+        - label_encoder_fairtracks.pkl # Unique label values derived from training data; the model classifies its output into these labels for the FAIRTRACKS schema
+        - model_fairtracks.pth # FAIRTRACKS schema trained model
+        - vectorizer_fairtracks.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
 ```
-To use this model, refer to the GitHub repository of `bedmess`:
 
-
+### Usage
+
+To use this model, refer to the GitHub repository of `bedms`:
+
+[BEDMS](https://github.com/databio/bedms)
+
+### Contribution
+
+To add a schema model:
+1. You should first train the new model using [BEDMS](https://github.com/databio/bedms).
+2. Create a new directory within this repository with the name of the new schema (for example, "new_schema").
+3. Maintain the directory structure like this:
+
+```
+/attribute-standardizer-model6
+    /new_schema
+        - new_schema_design.yaml
+        - label_encoder_new_schema.pkl
+        - model_new_schema.pth
+        - vectorizer_new_schema.pkl
+```
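The per-schema layout introduced by this README change can be checked mechanically. The sketch below is only an illustration: `schema_artifacts` and `missing_artifacts` are hypothetical helper names, not part of BEDMS, and the file-name pattern is assumed from the three existing directories (`bedbase_schema`, `encode_schema`, `fairtracks_schema`). The `new_schema` example in the Contribution section names its files slightly differently, so adjust the pattern if that convention is intended.

```python
from pathlib import Path


def schema_artifacts(schema: str) -> list[str]:
    """Return the repo-relative paths expected for one schema.

    Assumption: names follow the pattern of the existing directories,
    e.g. schema="encode" -> "encode_schema/model_encode.pth".
    """
    folder = f"{schema}_schema"
    return [
        f"{folder}/{schema}_schema_design.yaml",  # schema design
        f"{folder}/label_encoder_{schema}.pkl",   # label values from training data
        f"{folder}/model_{schema}.pth",           # trained model
        f"{folder}/vectorizer_{schema}.pkl",      # Bag of Words vectorizer
    ]


def missing_artifacts(repo_root: Path, schema: str) -> list[str]:
    """List the expected files that are not present under repo_root."""
    return [p for p in schema_artifacts(schema) if not (repo_root / p).is_file()]
```

Calling `missing_artifacts(Path("."), "encode")` from a local checkout of the repository would return an empty list if the `encode_schema` directory is complete.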
bedbase_schema/bedbase_schema_design.yaml
ADDED
@@ -0,0 +1,47 @@
+description: Attribute Standardizer Output Schema in alignment with BEDBASE schema
+
+properties:
+  sample_name: #Not predicted by the model.
+    type: string
+    description: "Name of the sample"
+  genome:
+    type: string
+    description: "Type of Genome Assemblies (eg. GRCh38)"
+  species_name:
+    type: string
+    description: "Name of species. e.g. Homo sapiens.", alias="organism"
+  species_id:
+    type: string
+    description: "Species identifier, resolvable by identifiers.org (eg. taxonomy:9606)"
+  genotype:
+    type: string
+    description: "Genotype of the sample"
+  phenotype:
+    type: string
+    description: "Phenotype of the sample"
+  cell_type:
+    type: string
+    description: "Cell type, population of cells that can be grown indefinitely in the lab, used for research, drug testing, and studying biological processes"
+  cell_line:
+    type: string
+    description: "A cultured, immortalized cell population derived from a single cell type, used for experimental research or therapeutic purposes."
+  tissue:
+    type: string
+    description: "Tissue type"
+  library_source:
+    type: string
+    description: "Library source (e.g. genomic, transcriptomic)"
+  assay:
+    type: string
+    description: "Experimental protocol (e.g. ChIP-seq)", alias="exp_protocol"
+  antibody:
+    type: string
+    description: "Antibody used in the assay"
+  target:
+    type: string
+    description: "Target of the assay (e.g. H3K4me3)"
+  treatment:
+    type: string
+    description: "Treatment of the sample (e.g. drug treatment)"
+required:
+  - None
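The keys under `properties` in a schema design file are the attribute names that a model's output is aligned to. As a rough sketch of how such a file could be used once parsed, the snippet below separates predicted attribute names that appear in the schema from those that do not. `BEDBASE_SCHEMA` is a hand-written excerpt of the file above, written as the dictionary a YAML parser would produce, and `split_by_schema` is a hypothetical helper, not a BEDMS function.

```python
# Hypothetical excerpt: a few property names copied from the BEDBASE schema
# design file, in the dictionary form a YAML parser would return.
BEDBASE_SCHEMA = {
    "properties": {
        "sample_name": {"type": "string"},
        "genome": {"type": "string"},
        "cell_line": {"type": "string"},
        "assay": {"type": "string"},
    }
}


def split_by_schema(schema: dict, predicted: dict) -> tuple[dict, dict]:
    """Split {input column: predicted attribute} by schema membership.

    Returns (known, unknown): predictions whose attribute is listed under
    schema["properties"], and predictions whose attribute is not.
    """
    allowed = set(schema["properties"])
    known = {col: attr for col, attr in predicted.items() if attr in allowed}
    unknown = {col: attr for col, attr in predicted.items() if attr not in allowed}
    return known, unknown
```

Keeping the unknown predictions in a separate dictionary, rather than discarding them, makes it easier to spot a mismatch between a label encoder and its schema file.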
label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl
RENAMED
File without changes
model_bedbase.pth → bedbase_schema/model_bedbase.pth
RENAMED
File without changes
vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl
RENAMED
File without changes
encode_schema/encode_schema_design.yaml
ADDED
@@ -0,0 +1,59 @@
+description: Attribute Standardizer Output Schema in alignment with ENCODE schema
+
+properties:
+  File accession:
+    type: string
+    description: "File ID/Accession"
+  File Type :
+    type: string
+    description: "File Types (eg. bed, bigWig)"
+  File format type:
+    type: string
+    description: "File Convention/Format (eg. narrowPeak)"
+  Output type:
+    type: string
+    description: "The output type provides additional information about the expected contents in the file."
+  File assembly:
+    type: string
+    description: "Type of Genome Assemblies (eg. GRCh38)"
+  Assay:
+    type: string
+    description: "The name of the assay performed"
+  Biosample term name:
+    type: string
+    description: "The human readable ontology name used to describe the biosample."
+  Biosample type:
+    type: string
+    description: "A categorization of biosamples into major groups(eg.induced pluripotent stem cell, stem cell)."
+  Biosample Organism:
+    type: string
+    description: "The species of the biosample."
+  Biosample treatments:
+    type: string
+    description: "The name of the chemical or biological agent applied to a biosample in order to elicit a response."
+  Biosample genetic modifications methods:
+    type: string
+    description: "Experimental Techniques used for genetic modification (eg. CRISPR)"
+  Biosample genetic modifications categories:
+    type: string
+    description: "Type of genetic modification (eg. insertion)"
+  Experiment Target:
+    type: string
+    description: "Experimental Targets (eg. H3K27ac-human)"
+  Library made from:
+    type: string
+    description: "Types of libraries created from biological samples."
+  Experiment date released:
+    type: string
+    description: "Date of the experiment release"
+  Project:
+    type: string
+    description: "The project under which the experiment was performed( eg. ENCODE)"
+  Lab:
+    type: string
+    description: "Lab where the processing took place."
+  File Download URL:
+    type: string
+    description: "File Download URL"
+required:
+  - None
label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl
RENAMED
File without changes
model_encode.pth → encode_schema/model_encode.pth
RENAMED
File without changes
vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl
RENAMED
File without changes
fairtracks_schema/fairtracks_schema_design.yaml
ADDED
@@ -0,0 +1,48 @@
+description: Attribute Standardizer Output Schema in alignment with FAIRtracks schema
+
+properties:
+  global_id:
+    type: string
+    description: "Global sample identifier, resolvable by identifiers.org"
+  local_id:
+    type: string
+    description: "Submitter-local identifier for sample (eg. S00B1LH1)"
+  species_id:
+    type: string
+    description: "Species identifier, resolvable by identifiers.org (eg. taxonomy:9606)"
+  species_name:
+    type: string
+    description: "Species name (eg. Homo sapiens)"
+  donor_age:
+    type: string
+    description: "Sample donor age ranges (eg. 50-60)"
+  donor_ethnicity:
+    type: string
+    description: "Ethnicity of the donor (eg. Northern European)"
+  donor_health_status:
+    type: string
+    description: "Health of the donor during sample collection eg. Normal, Chronic Lymphocytic Leukemia"
+  donor_id:
+    type: string
+    description: "Donor identifier eg.182CLL"
+  donor_sex:
+    type: string
+    description: "Sex of the donor eg. Male, Female, Unknown"
+  biospecimen_class_term_id:
+    type: string
+    description: "URL of ontology term used for classification of the sample"
+  biospecimen_class_term_label:
+    type: string
+    description: "Structural unit for the classification for the sample eg. Cell, Organism Part"
+  sample_type_term_id:
+    type: string
+    description: "URL of the sample term"
+  sample_type_term_label:
+    type: string
+    description: "Main classification of the sample eg. venous blood"
+  phenotype_term_id:
+    type: string
+    description: "Identifier for the phenotype"
+  phenotype_term_label:
+    type: string
+    description: "Main phenotype related to the sample eg. Acute Myeloid Leukemia"
label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl
RENAMED
File without changes
model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth
RENAMED
File without changes
vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl
RENAMED
File without changes