saanikat committed on
Commit
3877882
·
1 Parent(s): 0c93e0f

restructuring

README.md CHANGED
@@ -1,24 +1,46 @@
  ### Model Description
 
- This repository has two models - `model_encode.pth` and `model_fairtracks.pth`.
- Both of these models are used by the `attribute-standardizer` for standardizing the metadata based on user choice.
 
- ### Files Description
- 1. [model_encode.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_encode.pth) : This has the ENCODE metadata trained model.
- 2. [model_fairtracks.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_fairtracks.pth) : This has the FAIRTRACKS BLUEPRINT metadata trained model.
- 3. [vectorizer_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_encode.pkl) : This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding which is used as an input to the model when the user selects ENCODE schema.
- 4. [vectorizer_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_fairtracks.pkl): This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding which is used as an input to the model when the user selects FAIRTRACKS schema.
- 5. [label_encoder_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_encode.pkl): This is a pickle file which contains the unique label values derived from the training data. The model classifies the output into these labels for ENCODE schema.
- 6. [label_encoder_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_fairtracks.pkl): This is a pickle file which contains the unique label values derived from the training data. The model classifies the output into these labels for FAIRTRACKS schema.
 
- ### Usage
- To load this model:
  ```
- from huggingface_hub import hf_hub_download
-
- model_fairtracks = hf_hub_download(repo_id="databio/attribute-standardizer-model6", filename="model_fairtracks.pth")
- model_encode = hf_hub_download(repo_id="databio/attribute-standardizer-model6", filename="model_encode.pth")
  ```
- To use this model, refer to the GitHub repository of `bedmess`:
 
- [BEDMess](https://github.com/databio/bedmess)
  ### Model Description
 
+ This repository hosts three pre-trained models designed to standardize metadata attributes for genomic regions: `ENCODE`, `FAIRTRACKS` and `BEDBASE`. These models, along with their associated files and schema designs, are used for standardization by `BEDMS` (BED Metadata Standardizer). To learn more about BEDMS, visit: https://github.com/databio/bedms
 
+ ### Directory structure
 
  ```
+ /attribute-standardizer-model6
+     /bedbase_schema
+         - bedbase_schema_design.yaml # BEDBASE schema
+         - label_encoder_bedbase.pkl # Unique label values derived from training data; the model classifies the output into these labels for BEDBASE schema
+         - model_bedbase.pth # BEDBASE schema trained model
+         - vectorizer_bedbase.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+     /encode_schema
+         - encode_schema_design.yaml # ENCODE schema
+         - label_encoder_encode.pkl # Unique label values derived from training data; the model classifies the output into these labels for ENCODE schema
+         - model_encode.pth # ENCODE schema trained model
+         - vectorizer_encode.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+     /fairtracks_schema
+         - fairtracks_schema_design.yaml # FAIRTRACKS schema
+         - label_encoder_fairtracks.pkl # Unique label values derived from training data; the model classifies the output into these labels for FAIRTRACKS schema
+         - model_fairtracks.pth # FAIRTRACKS schema trained model
+         - vectorizer_fairtracks.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
  ```
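Because the files now live in per-schema subdirectories, a download call has to include the directory in `filename`. The following is a minimal sketch, assuming the `hf_hub_download` call from the previous README still applies to the new layout. `schema_file_paths` and `download_schema_files` are hypothetical helper names introduced here for illustration and are not part of BEDMS.

```python
from pathlib import PurePosixPath

# Repository name as given in the README directory listing.
REPO_ID = "databio/attribute-standardizer-model6"
# Schema names taken from the three directories in the listing.
SCHEMAS = ("bedbase", "encode", "fairtracks")


def schema_file_paths(schema: str) -> dict[str, str]:
    """Return the repo-relative paths of the four files for one schema."""
    if schema not in SCHEMAS:
        raise ValueError(f"unknown schema: {schema!r}")
    folder = PurePosixPath(f"{schema}_schema")
    return {
        "design": str(folder / f"{schema}_schema_design.yaml"),
        "label_encoder": str(folder / f"label_encoder_{schema}.pkl"),
        "model": str(folder / f"model_{schema}.pth"),
        "vectorizer": str(folder / f"vectorizer_{schema}.pkl"),
    }


def download_schema_files(schema: str) -> dict[str, str]:
    """Download all files for one schema (hypothetical helper, needs network).

    Returns a mapping from file role to the local cache path.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return {
        role: hf_hub_download(repo_id=REPO_ID, filename=path)
        for role, path in schema_file_paths(schema).items()
    }


if __name__ == "__main__":
    # Print the paths only; no network access is needed for this part.
    for role, path in schema_file_paths("encode").items():
        print(f"{role}: {path}")
```

Keeping the path construction separate from the download makes it possible to check the layout offline before fetching anything.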
 
+ ### Usage
+
+ To use these models, refer to the `bedms` GitHub repository:
+
+ [BEDMS](https://github.com/databio/bedms)
+
+ ### Contribution
+
+ To add a schema model:
+ 1. Train the new model using [BEDMS](https://github.com/databio/bedms).
+ 2. Create a new directory within this repository with the name of the new schema (for example, `new_schema`).
+ 3. Follow this directory structure:
+
+ ```
+ /attribute-standardizer-model6
+     /new_schema
+         - new_schema_design.yaml
+         - label_encoder_new_schema.pkl
+         - model_new_schema.pth
+         - vectorizer_new_schema.pkl
+ ```
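The contribution layout above can be checked locally before uploading. This is a rough sketch that follows the `new_schema` template literally, so every file name is derived from the directory name. Note that the existing directories use the shorter schema name in their `.pkl` and `.pth` files (for example `model_encode.pth` inside `encode_schema`), so this check would not match them as written. `expected_files` and `missing_files` are hypothetical names, not part of BEDMS.

```python
import tempfile
from pathlib import Path


def expected_files(schema_dir_name: str) -> list[str]:
    """File names the contribution template expects inside a schema directory."""
    return [
        f"{schema_dir_name}_design.yaml",
        f"label_encoder_{schema_dir_name}.pkl",
        f"model_{schema_dir_name}.pth",
        f"vectorizer_{schema_dir_name}.pkl",
    ]


def missing_files(schema_dir: Path) -> list[str]:
    """Return the expected file names that are absent from schema_dir."""
    return [
        name
        for name in expected_files(schema_dir.name)
        if not (schema_dir / name).is_file()
    ]


if __name__ == "__main__":
    # Build a throwaway directory holding only the model file, then report gaps.
    with tempfile.TemporaryDirectory() as tmp:
        schema_dir = Path(tmp) / "new_schema"
        schema_dir.mkdir()
        (schema_dir / "model_new_schema.pth").touch()
        print("missing:", missing_files(schema_dir))
```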
bedbase_schema/bedbase_schema_design.yaml ADDED
@@ -0,0 +1,47 @@
+ description: Attribute Standardizer Output Schema in alignment with BEDBASE schema
+
+ properties:
+   sample_name: #Not predicted by the model.
+     type: string
+     description: "Name of the sample"
+   genome:
+     type: string
+     description: "Type of Genome Assemblies (eg. GRCh38)"
+   species_name:
+     type: string
+     description: "Name of species. e.g. Homo sapiens.", alias="organism"
+   species_id:
+     type: string
+     description: "Species identifier, resolvable by identifiers.org (eg. taxonomy:9606)"
+   genotype:
+     type: string
+     description: "Genotype of the sample"
+   phenotype:
+     type: string
+     description: "Phenotype of the sample"
+   cell_type:
+     type: string
+     description: "Cell type, population of cells that can be grown indefinitely in the lab, used for research, drug testing, and studying biological processes"
+   cell_line:
+     type: string
+     description: "A cultured, immortalized cell population derived from a single cell type, used for experimental research or therapeutic purposes."
+   tissue:
+     type: string
+     description: "Tissue type"
+   library_source:
+     type: string
+     description: "Library source (e.g. genomic, transcriptomic)"
+   assay:
+     type: string
+     description: "Experimental protocol (e.g. ChIP-seq)", alias="exp_protocol"
+   antibody:
+     type: string
+     description: "Antibody used in the assay"
+   target:
+     type: string
+     description: "Target of the assay (e.g. H3K4me3)"
+   treatment:
+     type: string
+     description: "Treatment of the sample (e.g. drug treatment)"
+ required:
+   - None
label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl RENAMED
File without changes
model_bedbase.pth → bedbase_schema/model_bedbase.pth RENAMED
File without changes
vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl RENAMED
File without changes
encode_schema/encode_schema_design.yaml ADDED
@@ -0,0 +1,59 @@
+ description: Attribute Standardizer Output Schema in alignment with ENCODE schema
+
+ properties:
+   File accession:
+     type: string
+     description: "File ID/Accession"
+   File Type :
+     type: string
+     description: "File Types (eg. bed, bigWig)"
+   File format type:
+     type: string
+     description: "File Convention/Format (eg. narrowPeak)"
+   Output type:
+     type: string
+     description: "The output type provides additional information about the expected contents in the file."
+   File assembly:
+     type: string
+     description: "Type of Genome Assemblies (eg. GRCh38)"
+   Assay:
+     type: string
+     description: "The name of the assay performed"
+   Biosample term name:
+     type: string
+     description: "The human readable ontology name used to describe the biosample."
+   Biosample type:
+     type: string
+     description: "A categorization of biosamples into major groups(eg.induced pluripotent stem cell, stem cell)."
+   Biosample Organism:
+     type: string
+     description: "The species of the biosample."
+   Biosample treatments:
+     type: string
+     description: "The name of the chemical or biological agent applied to a biosample in order to elicit a response."
+   Biosample genetic modifications methods:
+     type: string
+     description: "Experimental Techniques used for genetic modification (eg. CRISPR)"
+   Biosample genetic modifications categories:
+     type: string
+     description: "Type of genetic modification (eg. insertion)"
+   Experiment Target:
+     type: string
+     description: "Experimental Targets (eg. H3K27ac-human)"
+   Library made from:
+     type: string
+     description: "Types of libraries created from biological samples."
+   Experiment date released:
+     type: string
+     description: "Date of the experiment release"
+   Project:
+     type: string
+     description: "The project under which the experiment was performed( eg. ENCODE)"
+   Lab:
+     type: string
+     description: "Lab where the processing took place."
+   File Download URL:
+     type: string
+     description: "File Download URL"
+ required:
+   - None
label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl RENAMED
File without changes
model_encode.pth → encode_schema/model_encode.pth RENAMED
File without changes
vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl RENAMED
File without changes
fairtracks_schema/fairtracks_schema_design.yaml ADDED
@@ -0,0 +1,48 @@
+ description: Attribute Standardizer Output Schema in alignment with FAIRtracks schema
+
+ properties:
+   global_id:
+     type: string
+     description: "Global sample identifier, resolvable by identifiers.org"
+   local_id:
+     type: string
+     description: "Submitter-local identifier for sample (eg. S00B1LH1)"
+   species_id:
+     type: string
+     description: "Species identifier, resolvable by identifiers.org (eg. taxonomy:9606)"
+   species_name:
+     type: string
+     description: "Species name (eg. Homo sapiens)"
+   donor_age:
+     type: string
+     description: "Sample donor age ranges (eg. 50-60)"
+   donor_ethnicity:
+     type: string
+     description: "Ethnicity of the donor (eg. Northern European)"
+   donor_health_status:
+     type: string
+     description: "Health of the donor during sample collection eg. Normal, Chronic Lymphocytic Leukemia"
+   donor_id:
+     type: string
+     description: "Donor identifier eg.182CLL"
+   donor_sex:
+     type: string
+     description: "Sex of the donor eg. Male, Female, Unknown"
+   biospecimen_class_term_id:
+     type: string
+     description: "URL of ontology term used for classification of the sample"
+   biospecimen_class_term_label:
+     type: string
+     description: "Structural unit for the classification for the sample eg. Cell, Organism Part"
+   sample_type_term_id:
+     type: string
+     description: "URL of the sample term"
+   sample_type_term_label:
+     type: string
+     description: "Main classification of the sample eg. venous blood"
+   phenotype_term_id:
+     type: string
+     description: "Identifier for the phenotype"
+   phenotype_term_label:
+     type: string
+     description: "Main phenotype related to the sample eg. Acute Myeloid Leukemia"
label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl RENAMED
File without changes
model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth RENAMED
File without changes
vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl RENAMED
File without changes
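Code that still references the old flat file names needs the new directory prefix after this commit. The renames listed above all follow one pattern, which can be sketched as below. The regular expression is an assumption inferred from the nine renamed files, and `relocated_path` is a hypothetical helper, not something shipped with this repository or with BEDMS.

```python
import re

# Pattern assumed from the renamed files in this commit: <kind>_<schema>.<ext>
_FLAT_NAME = re.compile(r"^(label_encoder|model|vectorizer)_([a-z0-9]+)\.(pkl|pth)$")


def relocated_path(old_filename: str) -> str:
    """Map an old top-level file name to its path after the restructuring."""
    match = _FLAT_NAME.match(old_filename)
    if match is None:
        raise ValueError(f"not a recognised model file name: {old_filename!r}")
    schema = match.group(2)
    # The file name itself is unchanged; only the directory is new.
    return f"{schema}_schema/{old_filename}"


if __name__ == "__main__":
    for name in ("model_encode.pth", "vectorizer_bedbase.pkl"):
        print(name, "->", relocated_path(name))
```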