root committed on
Commit 1e6a1f0 · 1 Parent(s): 9a73cb0

uploading data folder

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. fuson_plm/data/README.md +91 -0
  2. fuson_plm/data/__init__.py +0 -0
  3. fuson_plm/data/__pycache__/__init__.cpython-310.pyc +0 -0
  4. fuson_plm/data/__pycache__/clean.cpython-310.pyc +0 -0
  5. fuson_plm/data/__pycache__/cluster.cpython-310.pyc +0 -0
  6. fuson_plm/data/__pycache__/config.cpython-310.pyc +0 -0
  7. fuson_plm/data/__pycache__/split_vis.cpython-310.pyc +0 -0
  8. fuson_plm/data/blast/README.md +113 -0
  9. fuson_plm/data/blast/__pycache__/blast_fusions.cpython-310.pyc +0 -0
  10. fuson_plm/data/blast/__pycache__/plot.cpython-310.pyc +0 -0
  11. fuson_plm/data/blast/blast_fusions.py +838 -0
  12. fuson_plm/data/blast/blast_outputs/best_htg_alignments_swissprot_seqs.pkl +3 -0
  13. fuson_plm/data/blast/blast_outputs/ht_uniprot_query.txt +3 -0
  14. fuson_plm/data/blast/blast_outputs/swissprot_blast_output_analyzed.pkl +3 -0
  15. fuson_plm/data/blast/blast_outputs/swissprot_blast_stats.csv +3 -0
  16. fuson_plm/data/blast/blast_outputs/swissprot_no_match.csv +3 -0
  17. fuson_plm/data/blast/blast_outputs/swissprot_no_match.txt +3 -0
  18. fuson_plm/data/blast/blast_outputs/swissprot_top_alignments.csv +3 -0
  19. fuson_plm/data/blast/extract_blast_seqs.py +62 -0
  20. fuson_plm/data/blast/figures/identities_hist.png +0 -0
  21. fuson_plm/data/blast/fusion_blast_log.txt +3 -0
  22. fuson_plm/data/blast/fuson_ht_db.csv +3 -0
  23. fuson_plm/data/blast/plot.py +75 -0
  24. fuson_plm/data/clean.py +594 -0
  25. fuson_plm/data/cluster.py +50 -0
  26. fuson_plm/data/clustering/input.fasta +3 -0
  27. fuson_plm/data/clustering/mmseqs_full_results.csv +3 -0
  28. fuson_plm/data/clustering_log.txt +3 -0
  29. fuson_plm/data/config.py +34 -0
  30. fuson_plm/data/data_cleaning_log.txt +3 -0
  31. fuson_plm/data/fuson_db.csv +3 -0
  32. fuson_plm/data/head_tail_data/ensembl_ht_idmap.txt +3 -0
  33. fuson_plm/data/head_tail_data/gene_to_ensembl_dict.pkl +3 -0
  34. fuson_plm/data/head_tail_data/genename_ht_idmap.txt +3 -0
  35. fuson_plm/data/head_tail_data/htgenes_uniprotids.csv +3 -0
  36. fuson_plm/data/head_tail_data/isoform_fasta_id_output_formatted.fasta +3 -0
  37. fuson_plm/data/head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt +3 -0
  38. fuson_plm/data/head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt +3 -0
  39. fuson_plm/data/raw_data/FOdb_SD5.csv +3 -0
  40. fuson_plm/data/raw_data/FOdb_all.csv +3 -0
  41. fuson_plm/data/raw_data/FOdb_puncta.csv +3 -0
  42. fuson_plm/data/raw_data/FusionPDB.txt +3 -0
  43. fuson_plm/data/raw_data/FusionPDB_cleaned.csv +3 -0
  44. fuson_plm/data/split.py +120 -0
  45. fuson_plm/data/split_vis.py +333 -0
  46. fuson_plm/data/splits/combined_plot.png +0 -0
  47. fuson_plm/data/splits/test_cluster_split.csv +3 -0
  48. fuson_plm/data/splits/test_df.csv +3 -0
  49. fuson_plm/data/splits/train_cluster_split.csv +3 -0
  50. fuson_plm/data/splits/train_df.csv +3 -0
fuson_plm/data/README.md ADDED
@@ -0,0 +1,91 @@
+ ## Training Data Curation and Processing
+
+ The `data` folder and its subfolders hold all raw and processed data used to assemble FusOn-DB, as well as all processing scripts. Additional benchmarking datasets can be found in the `benchmarking` folder.
+
+ ### From raw data to train/val/test splits and head/tail data
+ This section outlines the pipeline for converting the raw FusionPDB and FOdb datasets into the train/val/test splits used in FusOn-pLM. The process includes data cleaning, clustering, and splitting. During cleaning, we also extracted data about the heads and tails of each fusion oncoprotein.
+
+ ```
+ data/
+ ├── clustering/
+ │   ├── input.fasta
+ │   └── mmseqs_full_results.csv
+ ├── head_tail_data/
+ │   └── uniprot_idmap_inputs/
+ ├── raw_data/
+ │   ├── FOdb_all.csv
+ │   ├── FOdb_puncta.csv
+ │   ├── FOdb_SD5.csv
+ │   ├── FusionPDB_cleaned.csv
+ │   ├── FusionPDB.txt
+ │   └── gene_to_ensembl_dict.pkl
+ ├── splits/
+ │   ├── combined_plot.png
+ │   ├── train_df.csv
+ │   ├── train_cluster_split.csv
+ │   ├── val_df.csv
+ │   ├── val_cluster_split.csv
+ │   ├── test_df.csv
+ │   └── test_cluster_split.csv
+ ├── clean.py
+ ├── cluster.py
+ ├── config.py
+ ├── split.py
+ ├── split_vis.py
+ ├── data_cleaning_log.txt
+ ├── clustering_log.txt
+ ├── splitting_log.txt
+ └── fuson_db.csv
+ ```
+ - **`clean.py`**: script for cleaning the datasets in `raw_data`. Print statements in this code produce `data_cleaning_log.txt`.
+ - **`cluster.py`**: script for clustering the processed data in `fuson_db.csv`. Print statements in this code produce `clustering_log.txt`.
+ - **`config.py`**: configs for the cleaning, clustering, and splitting scripts (a minimal sketch follows this list).
+ - **`split.py`**: script for splitting the data, post-clustering. Print statements in this code produce `splitting_log.txt`.
+ - **`split_vis.py`**: script with code for the plots in `splits/combined_plot.png`, which describe the content of the train, validation, and test splits (length distribution, Shannon entropy, amino acid frequencies, and cluster sizes).
+
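+ The sketch below illustrates the clustering-related settings in `config.py`. Only `CLUSTER.PATH_TO_MMSEQS` is named elsewhere in this README; the other attribute names are hypothetical, though their values match the MMseqs2 command and length cutoff described below.
+
+ ```python
+ # Hypothetical sketch of the CLUSTER settings in config.py.
+ # Only PATH_TO_MMSEQS is referenced in this README; the other names are illustrative.
+ class CLUSTER:
+     PATH_TO_MMSEQS = "/*/FusOn-pLM/fuson_plm/mmseqs2/bin/mmseqs"  # your MMseqs2 installation
+     MIN_SEQ_ID = 0.3    # passed as --min-seq-id 0.3
+     COVERAGE = 0.8      # passed as -c 0.8
+     COV_MODE = 0        # passed as --cov-mode 0
+     MAX_SEQ_LEN = 2000  # only sequences of length <= 2000 are clustered
+ ```
+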
+ #### Usage
+ To repeat our cleaning, clustering, and splitting process, proceed as follows.
+ 1. Install MMseqs2 at `/*/FusOn-pLM/fuson_plm/mmseqs2` according to these instructions: https://github.com/soedinglab/MMseqs2. Make sure that in `config.py`, `CLUSTER.PATH_TO_MMSEQS` points to your mmseqs installation.
+ 2. Run the cleaning script:
+ ```bash
+ python clean.py
+ ```
+
+ This script will create the following files:
+ - **`fuson_db.csv`**: FusOn-DB, our full database of 44,414 fusion oncoproteins.
+ - **`raw_data/FusionPDB_cleaned.csv`**: a processed version of the FusionPDB database with the following columns: `aa_seq`, `n_fusiongenes`, `fusiongenes`, `cancers`, `primary_sources`, `secondary_source`.
+ - **`head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt`** and **`head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt`**: all unique Ensembl IDs and gene symbols for all unique head/tail proteins corresponding to any fusion oncoprotein in FusOn-DB. These were submitted to the UniProt ID-mapping tool to create **`head_tail_data/ensembl_ht_idmap.txt`** and **`head_tail_data/genename_ht_idmap.txt`**, respectively.
+ - **`head_tail_data/uniprot_idmap_inputs/gene_to_ensembl_dict.pkl`**: a dictionary mapping each unique gene symbol to a comma-separated list of its associated Ensembl IDs, according to FusionPDB.
+ - **`head_tail_data/uniprot_idmap_inputs/htgenes_uniprotids.csv`**: a file with each unique gene symbol (`Gene`), a comma-separated list of all associated UniProt IDs (`UniProtID`), and a concatenated string of 1s and 0s indicating whether each ID in the `UniProtID` column is reviewed (`Reviewed`).
+   - For example, a `Reviewed` value of "100" means the first ID in the `UniProtID` column of the same row is reviewed (1) and the second and third are not (0); the snippet below shows one way to decode this.
+
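+ A minimal sketch of decoding the `Reviewed` flags with pandas (column names as documented above; reading `Reviewed` as a string preserves leading zeros):
+
+ ```python
+ import pandas as pd
+
+ ht = pd.read_csv("head_tail_data/htgenes_uniprotids.csv", dtype={"Reviewed": str})
+ row = ht.iloc[0]
+ ids = row["UniProtID"].split(",")   # one UniProt ID per position
+ flags = list(row["Reviewed"])       # '1' = reviewed, '0' = unreviewed
+ reviewed_ids = [uid for uid, flag in zip(ids, flags) if flag == "1"]
+ print(row["Gene"], reviewed_ids)
+ ```
+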
+ 3. Run the clustering script:
+ ```bash
+ python cluster.py
+ ```
+
+ The command this script passes to the clustering software is:
+ ```bash
+ mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.8 --cov-mode 0
+ ```
+
+ This script will cluster all sequences of length 2000 or shorter (see `config.py`) and create the following files:
+ - **`clustering/input.fasta`**: the input file used by MMseqs2 to cluster the fusion oncoprotein sequences. Headers are our assigned sequence IDs (found in the `seq_id` column of `fuson_db.csv`).
+ - **`clustering/mmseqs_full_results.csv`**: clustering results (a summary sketch follows this list). Columns:
+   - `representative seq_id`: the seq_id of the sequence representing this cluster
+   - `member seq_id`: the seq_id of a member of the cluster
+   - `representative seq`: the amino acid sequence of the cluster representative (`representative seq_id`)
+   - `member seq`: the amino acid sequence of the cluster member
+
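+ A minimal sketch of summarizing the clustering output, assuming the column names listed above:
+
+ ```python
+ import pandas as pd
+
+ clusters = pd.read_csv("clustering/mmseqs_full_results.csv")
+ sizes = clusters.groupby("representative seq_id")["member seq_id"].nunique()
+ print(f"{sizes.size} clusters; largest has {sizes.max()} members; "
+       f"{(sizes == 1).sum()} singletons")
+ ```
+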
+ 4. Run the splitting script:
+ ```bash
+ python split.py
+ ```
+
+ This script will create the following files:
+ - **`splits/train_cluster_split.csv`**, **`splits/val_cluster_split.csv`**, **`splits/test_cluster_split.csv`**: the subsets of `clustering/mmseqs_full_results.csv` that have been partitioned into the train, validation, and test sets, respectively.
+ - **`splits/train_df.csv`**, **`splits/val_df.csv`**, **`splits/test_df.csv`**: the train, validation, and test splits used to train FusOn-pLM. Columns: `sequence`, `member length`.
+ - **`splits/combined_plot.png`**: plot displaying the composition of the train, validation, and test splits.
+
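+ Because whole clusters are assigned to a single split, no cluster representative should appear in more than one of these files. A quick sanity check, assuming the split files above:
+
+ ```python
+ import pandas as pd
+
+ reps = [set(pd.read_csv(f"splits/{name}_cluster_split.csv")["representative seq_id"])
+         for name in ("train", "val", "test")]
+ assert not (reps[0] & reps[1]) and not (reps[0] & reps[2]) and not (reps[1] & reps[2])
+ print("train/val/test clusters are disjoint")
+ ```
+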
+ ### BLAST
+ We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the `blast` folder for more details.
fuson_plm/data/__init__.py ADDED
File without changes
fuson_plm/data/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (148 Bytes).
 
fuson_plm/data/__pycache__/clean.cpython-310.pyc ADDED
Binary file (15.1 kB).
 
fuson_plm/data/__pycache__/cluster.cpython-310.pyc ADDED
Binary file (6.02 kB).
 
fuson_plm/data/__pycache__/config.cpython-310.pyc ADDED
Binary file (1.11 kB).
 
fuson_plm/data/__pycache__/split_vis.cpython-310.pyc ADDED
Binary file (10.1 kB).
 
fuson_plm/data/blast/README.md ADDED
@@ -0,0 +1,113 @@
+ We ran local BLAST to get the best alignment of each fusion oncoprotein sequence to a protein in SwissProt.
+
+ ### Downloading BLAST Executables and Database
+ First, we downloaded the BLAST executables by entering the following in a terminal (if you don't have a Linux system, find the correct download for your system at https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST):
+ ```
+ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
+ tar -zxvf ncbi-blast-2.16.0+-x64-linux.tar.gz
+ rm ncbi-blast-2.16.0+-x64-linux.tar.gz
+
+ cd ncbi-blast-2.16.0+
+ mkdir swissprot
+ cd swissprot
+ perl ../bin/update_blastdb.pl --passive --decompress swissprot
+
+ chmod +x "blast/ncbi-blast-2.16.0+/bin/blastp"
+ sudo chmod -R 755 FusOn-pLM/fuson_plm/data/blast/ncbi-blast-2.16.0+
+ ```
+
+ ### Running BLAST
+ The directory is structured as follows:
+ ```
+ data/
+ └── blast/
+     ├── blast_outputs/
+     │   ├── swissprot_blast_output_analyzed.pkl
+     │   ├── swissprot_blast_stats.csv
+     │   ├── swissprot_no_match.csv
+     │   ├── swissprot_no_match.txt
+     │   ├── swissprot_top_alignments.csv
+     │   ├── best_htg_alignments_swissprot_seqs.pkl
+     │   └── ht_uniprot_query.txt
+     ├── figures/
+     │   └── identities_hist.png
+     ├── blast_fusions.py
+     ├── extract_blast_seqs.py
+     ├── plot.py
+     ├── fusion_blast_log.txt
+     └── fuson_ht_db.csv
+ ```
+
+ - **`blast_fusions.py`**: script that will prepare FusOn-DB for BLAST, run BLAST against SwissProt (given you've installed the BLAST software properly), extract top alignments, calculate statistics on the BLAST results, and make results plots. Print statements in this script create the log file `fusion_blast_log.txt`.
+ - **`extract_blast_seqs.py`**: script that will extract sequences of all the head/tail proteins that formed the best alignment during BLAST, directly from the SwissProt BLAST database. Creates the file `blast_outputs/best_htg_alignments_swissprot_seqs.pkl`.
+ - **`plot.py`**: script to make the plot found at `figures/identities_hist.png`. This plot displays the maximum % identity of each fusion oncoprotein sequence with a SwissProt sequence, based on BLAST. This plot is also automatically created by `blast_fusions.py`.
+ - **`fuson_ht_db.csv`**: database that merges FusOn-DB (`/*/FusOn-pLM/fuson_plm/data/fuson_db.csv`) with `/*/FusOn-pLM/fuson_plm/data/head_tail_data/htgenes_uniprotids.csv`, which simplifies the process of analyzing BLAST results. In FusOn-DB, certain amino acid sequences are associated with multiple fusion oncoproteins, whose names are comma-separated in the `fusiongenes` column. In `fuson_ht_db.csv`, the `fusiongenes` column is exploded such that each row has only one fusion gene (a sketch of this step follows this list). Therefore, this database has more rows than FusOn-DB, and some duplicate sequences.
+
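+ A minimal sketch of the explode step used to build `fuson_ht_db.csv`, mirroring `make_fuson_ht_db` in `blast_fusions.py` (run from the `blast` folder):
+
+ ```python
+ import pandas as pd
+
+ fuson_db = pd.read_csv("../fuson_db.csv")
+ ht = fuson_db.copy()
+ ht["fusiongenes"] = ht["fusiongenes"].str.split(",")
+ ht = ht.explode("fusiongenes")  # one fusion gene per row
+ ht["hgene"] = ht["fusiongenes"].str.split("::", expand=True)[0]
+ ht["tgene"] = ht["fusiongenes"].str.split("::", expand=True)[1]
+ ```
+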
+ To run the BLAST search and analysis, we recommend using nohup, as the process will take a long time.
+
+ ```bash
+ nohup python blast_fusions.py > blastrun.out 2> blastrun.err &
+ ```
+
+ ### Understanding the output files
+
+ Here, we will break down each file in the `blast/blast_outputs` directory.
+
+ - **`best_htg_alignments_swissprot_seqs.pkl`**: a dictionary where the keys are UniProt IDs, "."-concatenated to their isoform (e.g. "Q8NFI3.1"), and the values are the amino acid sequences corresponding to those isoforms. The sequences were pulled directly from the SwissProt BLAST database.
+ - **`ht_uniprot_query.txt`**: a list of all head and tail proteins producing top SwissProt alignments, in the format described above (e.g. "Q8NFI3.1"). Used to query the SwissProt database and create the `best_htg_alignments_swissprot_seqs.pkl` file.
+ - **`swissprot_blast_output_analyzed.pkl`**: dictionary that summarizes key BLAST results for each fusion protein. The keys are seq_ids, each corresponding to a fusion oncoprotein sequence in FusOn-DB. The values are dictionaries holding BLAST results for that seq_id. Each UniProt ID corresponding to a known head or tail (stored in `fuson_ht_db.csv`) is checked for an alignment. If there is no alignment, the value is None (e.g. `swissprot_blast_output_analyzed['seq18']['F8WED0']` is `None`). If there is an alignment, we store the Isoform, Score, Expect, Query_Aligned, Subject_Aligned, H_or_T (whether this ID is for the head or tail protein), Best (whether this is the best, i.e. highest-scoring, alignment to this fusion oncoprotein), Identities, Positives, Gaps, Query_Start, Query_End, Sbjct_Start, and Sbjct_End. If the best alignment is not a known head or tail, this alignment is also stored. Below is the example dictionary for seq18; a short loading example follows it.
+
+ ```python
+ swissprot_blast_output_analyzed['seq18'] =
+ {
+     "F8WED0": None,
+     "Q9Y2X3": {
+         "Isoform": 1,
+         "Score": 452.0,
+         "Expect": "6e-148",
+         "Query_Aligned": "AGTGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
+         "Subject_Aligned": "AHAGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
+         "H_or_T": "Tail",
+         "Best": False,
+         "Identities": "228/230 (99%)",
+         "Positives": "228/230 (99%)",
+         "Gaps": "0/230 (0%)",
+         "Query_Start": 754,
+         "Query_End": 983,
+         "Sbjct_Start": 300,
+         "Sbjct_End": 529,
+     },
+     "L0R804": None,
+     "A0A096LP60": None,
+     "A0A096LNZ0": None,
+     "H7BZ72": None,
+     "A0A096LP25": None,
+     "Q9BUD9": None,
+     "B7ZLC4": None,
+     "Q2M2I8": {
+         "Isoform": 3,
+         "Score": 1558.0,
+         "Expect": "0.0",
+         "Query_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTG",
+         "Subject_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTA",
+         "H_or_T": "Head",
+         "Best": True,
+         "Identities": "756/757 (99%)",
+         "Positives": "756/757 (99%)",
+         "Gaps": "0/757 (0%)",
+         "Query_Start": 1,
+         "Query_End": 757,
+         "Sbjct_Start": 1,
+         "Sbjct_End": 757,
+     },
+     "E9PG46": None,
+ }
+ ```
+
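+ A minimal sketch of loading this pickle and pulling a sequence's best alignment (assumes the consolidated pickle is a single dictionary keyed by seq_id, as described above):
+
+ ```python
+ import pickle
+
+ with open("blast_outputs/swissprot_blast_output_analyzed.pkl", "rb") as f:
+     results = pickle.load(f)
+
+ aligns = results["seq18"]
+ best = next((a for a in aligns.values() if a is not None and a.get("Best")), None)
+ if best is not None:
+     print(best["H_or_T"], best["Score"], best["Identities"])
+ ```
+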
+ - **`swissprot_blast_stats.csv`**: a database summarizing the BLAST scores across all fusion oncoproteins. Columns are: `seq_id`, `hgAlignments`, `tgAlignments`, `totalAlignments`, `best_hgScore`, `best_tgScore`, `best_Score`, `h_or_t_alignment`, `h_and_t_alignment`.
+   - `h_or_t_alignment` is True if either the head or tail has an alignment returned by BLAST. `h_and_t_alignment` is True if both the head and tail have an alignment returned by BLAST.
+ - **`swissprot_no_match.txt`**: names of the BLAST output files that said "No hits found".
+ - **`swissprot_no_match.csv`**: more information on the fusion oncoproteins indicated in `swissprot_no_match.txt`.
+ - **`swissprot_top_alignments.csv`**: a database summarizing the most important information acquired by BLAST across all fusion oncoproteins. Columns are: `seq_id`, `top_hg_UniProtID`, `top_hg_UniProt_isoform`, `top_hg_UniProt_fus_indices`, `top_tg_UniProtID`, `top_tg_UniProt_isoform`, `top_tg_UniProt_fus_indices`, `top_UniProtID`, `top_UniProt_isoform`, `top_UniProt_fus_indices`, `top_UniProt_nIdentities`, `top_UniProt_nPositives`, `aa_seq_len`.
+   - All indices (e.g. `top_hg_UniProt_fus_indices`) are 1-indexed.
+   - This database can be used to estimate breakpoints using the `top_hg_UniProt_fus_indices` and `top_tg_UniProt_fus_indices` columns. For example, if `top_hg_UniProt_fus_indices` is "1,300" and `top_tg_UniProt_fus_indices` is "301,546", then residues 1-300 of the fusion protein aligned with the head protein indicated in `top_hg_UniProtID` and `top_hg_UniProt_isoform`, and residues 301-546 of the fusion protein aligned with the tail protein indicated in `top_tg_UniProtID` and `top_tg_UniProt_isoform`. The breakpoint is between residues 300 and 301 (see the sketch after this list).
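+
+ A minimal sketch of that breakpoint estimate, assuming the `swissprot_top_alignments.csv` columns listed above:
+
+ ```python
+ import pandas as pd
+
+ top = pd.read_csv("blast_outputs/swissprot_top_alignments.csv")
+ row = top.dropna(subset=["top_hg_UniProt_fus_indices", "top_tg_UniProt_fus_indices"]).iloc[0]
+ hg_start, hg_end = map(int, row["top_hg_UniProt_fus_indices"].split(","))
+ tg_start, tg_end = map(int, row["top_tg_UniProt_fus_indices"].split(","))
+ # Head aligns to residues hg_start..hg_end and tail to tg_start..tg_end (1-indexed),
+ # so the estimated breakpoint falls between hg_end and tg_start.
+ print(f"{row['seq_id']}: breakpoint between residues {hg_end} and {tg_start}")
+ ```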
fuson_plm/data/blast/__pycache__/blast_fusions.cpython-310.pyc ADDED
Binary file (25.1 kB).
 
fuson_plm/data/blast/__pycache__/plot.cpython-310.pyc ADDED
Binary file (9.08 kB).
 
fuson_plm/data/blast/blast_fusions.py ADDED
@@ -0,0 +1,838 @@
+ ### Prepare to BLAST all of our sequences against UniProt
+ import pandas as pd
+ import os
+ import subprocess
+ import time
+ import re
+ import pickle
+ import numpy as np
+
+ from fuson_plm.utils.logging import log_update, open_logfile
+ from fuson_plm.utils.embedding import redump_pickle_dictionary
+ from fuson_plm.data.blast.plot import group_difference_plot, group_swiss_and_ht_plot, group_box_plot, group_pos_id_plot
+
+ def prepare_blast_inputs():
+     log_update("\nPreparing BLAST Inputs. Logging every 1000 sequences... ")
+     # make directory for input and output
+     os.makedirs("blast_inputs", exist_ok=True)
+
+     # read the fuson database
+     fuson_db = pd.read_csv('../fuson_db.csv')
+
+     # make dictionary mapping sequences to seq_ids (for naming input files)
+     fuson_db_dict = dict(zip(fuson_db['aa_seq'], fuson_db['seq_id']))
+
+     # convert the database into fasta format
+     new_fa_files_created = 0
+     old_fa_files_found = 0
+     total_seqs_processed = 0
+     for i, (seq, seqid) in enumerate(fuson_db_dict.items()):
+         total_seqs_processed += 1
+         # if the path already exists, skip
+         if os.path.exists(f"blast_inputs/{seqid}.fa"):
+             old_fa_files_found += 1
+         else:
+             new_fa_files_created += 1
+             with open(f"blast_inputs/{seqid}.txt", 'w') as f:
+                 fasta_lines = '>' + seqid + '\n' + seq
+                 f.write(fasta_lines)
+             # rename it to .fa
+             os.rename(f"blast_inputs/{seqid}.txt", f"blast_inputs/{seqid}.fa")
+
+         if i % 1000 == 0:
+             log_update(f"\t\t{i}\t{seqid}:{seq}")
+
+     log_update("\tFinished preparing BLAST Inputs (results in blast_inputs folder)")
+     log_update(f"\t\tSequences processed: {total_seqs_processed}/{len(fuson_db)} seqs in FusOn-DB\n\t\tFasta files found: {old_fa_files_found}\n\t\tNew fasta files created: {new_fa_files_created}")
+
+ def run_blast(blast_inputs_dir, database="swissprot", n=1, interval=2000):
+     """
+     Run BLAST on all files in blast_inputs_dir.
+     n and interval select a batch: files [interval*(n-1), interval*n) are processed.
+     """
+     # Must change the PATH variable to include the BLAST executables
+     os.environ['PATH'] += ":./ncbi-blast-2.16.0+/bin"
+     os.environ['BLASTDB'] = f"ncbi-blast-2.16.0+/{database}"
+
+     # make directory for outputs
+     os.makedirs("blast_outputs", exist_ok=True)
+     os.makedirs(f"blast_outputs/{database}", exist_ok=True)
+     already_blasted = os.listdir(f"blast_outputs/{database}")
+     blast_input_files = os.listdir(blast_inputs_dir)
+     # Sort the list using a custom key to extract the numeric part
+     blast_input_files = sorted(blast_input_files, key=lambda x: int(re.search(r'\d+', x).group()))
+
+     # print how many we've already blasted
+     log_update(f"Running BLAST.\n\t{len(blast_input_files)} input files\n\t{len(already_blasted)} already blasted\n")
+
+     tot_seqs_processed = 0
+     total_blast_time = 0
+
+     start_i = interval*(n-1)
+     end_i = interval*n
+     if end_i > len(blast_input_files): end_i = len(blast_input_files)
+     for i, blast_input_file in enumerate(blast_input_files[start_i:end_i]):
+         tot_seqs_processed += 1
+         # blast_input_file is of the format seqid.fa
+         seqid = blast_input_file.split('.fa')[0]
+         input_path = f"blast_inputs/{blast_input_file}"
+         output_path = f"blast_outputs/{database}/{seqid}_{database}_results.out"
+
+         if os.path.exists(output_path):
+             log_update(f"\t{i+1}.\tAlready blasted {seqid}")
+             continue
+
+         # Construct the command as a list of arguments
+         command = [
+             "ncbi-blast-2.16.0+/bin/blastp",
+             "-db", database,
+             "-query", input_path,
+             "-out", output_path
+         ]
+
+         # Run the command, and time it
+         blast_start_time = time.time()
+         result = subprocess.run(command, capture_output=True, text=True)
+         blast_end_time = time.time()
+         blast_seq_time = blast_end_time - blast_start_time
+         total_blast_time += blast_seq_time
+
+         # Check if there was an error
+         if result.returncode != 0:
+             log_update(f"\t{i+1}.\tError running BLAST for {seqid}: {result.stderr} ({blast_seq_time:.2f}s)")
+         else:
+             log_update(f"\t{i+1}.\tBLAST search completed for {seqid} ({blast_seq_time:.2f}s)")
+
+     log_update(f"\tFinished processing {tot_seqs_processed} sequences ({total_blast_time:.2f}s)")
+
+ def remove_incomplete_blasts(database="swissprot"):
+     incomplete_list = []
+     for fname in os.listdir(f"blast_outputs/{database}"):
+         complete = False
+         with open(f"blast_outputs/{database}/{fname}", "r") as f:
+             lines = f.readlines()
+             if len(lines) > 1 and "Window for multiple hits:" in lines[-1]:
+                 complete = True
+         if not complete:
+             incomplete_list.append(fname)
+
+     log_update(f"\t{len(incomplete_list)} BLAST files are incomplete (due to BLAST errors). Deleting them. Rerun these")
+     # remove all these files
+     for fname in incomplete_list:
+         os.remove(f"blast_outputs/{database}/{fname}")
+
+ def find_nomatch_blasts(fuson_ht_db, database="swissprot"):
+     no_match_list = []
+     for fname in os.listdir(f"blast_outputs/{database}"):
+         match = True
+         with open(f"blast_outputs/{database}/{fname}", "r") as f:
+             lines = f.readlines()
+             if len(lines) > 1 and "No hits found" in lines[28]:  # it'll say no hits found if there are no hits
+                 match = False
+         if not match:
+             no_match_list.append(fname)
+
+     log_update(f"\t{len(no_match_list)} sequence IDs had no match in the BLAST database {database}")
+     # write no match list to a file in blast_outputs
+     with open(f"blast_outputs/{database}_no_match.txt", "w") as f:
+         for i, fname in enumerate(no_match_list):
+             if i != len(no_match_list)-1:
+                 f.write(f"{fname}\n")
+             else:
+                 f.write(f"{fname}")
+
+     # write a subset of fuson_ht_db containing these sequences as well
+     no_match_ids = [x.split('_')[0] for x in no_match_list]
+     subset = fuson_ht_db.loc[
+         fuson_ht_db['seq_id'].isin(no_match_ids)
+     ].reset_index(drop=True)
+     subset.to_csv(f"blast_outputs/{database}_no_match.csv", index=False)
+
+     return no_match_ids
+
+ def make_fuson_ht_db(path_to_fuson_db="../fuson_db.csv", path_to_unimap="../head_tail_data/htgenes_uniprotids.csv", savepath="fuson_ht_db.csv"):
+     """
+     Make a version of the fuson_db that has all the heads and tails for each of the genes. Will make it easier to analyze blast results
+     """
+     if os.path.exists(savepath):
+         df = pd.read_csv(savepath)
+         return df
+
+     # read both of the databases
+     fuson_db = pd.read_csv(path_to_fuson_db)
+     ht_db = pd.read_csv(path_to_unimap)
+
+     # Make it such that each row of fuson_db just has ONE head and ONE tail
+     fuson_ht_db = fuson_db.copy(deep=True)
+     fuson_ht_db['fusiongenes'] = fuson_ht_db['fusiongenes'].apply(lambda x: x.split(','))
+     fuson_ht_db = fuson_ht_db.explode('fusiongenes')
+     fuson_ht_db['hgene'] = fuson_ht_db['fusiongenes'].str.split('::', expand=True)[0]
+     fuson_ht_db['tgene'] = fuson_ht_db['fusiongenes'].str.split('::', expand=True)[1]
+
+     # Merge on head, then merge on tail
+     fuson_ht_db = pd.merge(  # merge on head
+         fuson_ht_db,
+         ht_db.rename(columns={
+             'Gene': 'hgene',
+             'UniProtID': 'hgUniProt',
+             'Reviewed': 'hgUniProtReviewed'
+         }),
+         on='hgene',
+         how='left'
+     )
+     fuson_ht_db = pd.merge(  # merge on tail
+         fuson_ht_db,
+         ht_db.rename(columns={
+             'Gene': 'tgene',
+             'UniProtID': 'tgUniProt',
+             'Reviewed': 'tgUniProtReviewed'
+         }),
+         on='tgene',
+         how='left'
+     )
+
+     # Make sure we haven't lost anything
+     tot_og_seqids = len(fuson_db['seq_id'].unique())
+     tot_final_seqids = len(fuson_ht_db['seq_id'].unique())
+     log_update(f"\tTotal sequence IDs in combined database = {tot_final_seqids}. Matches expected: {tot_final_seqids==tot_og_seqids}")
+     # Each fusion should have the same number of ROWS as it has comma-separated fusion genes (commas+1)
+     fuson_db['n_commas'] = fuson_db['fusiongenes'].str.count(',') + 1
+     seqid_rows_map = dict(zip(fuson_db['seq_id'], fuson_db['n_commas']))
+     vc = fuson_ht_db['seq_id'].value_counts().reset_index()  # note: assumes pandas<2.0 column names ('index', 'seq_id')
+     vc['expected_count'] = vc['index'].map(seqid_rows_map)
+     log_update(f"\tEach seq_id has the expected number of head-tail combos: {(vc['expected_count']==vc['seq_id']).all()}")
+
+     log_update(f"\tPreview of combined database:")
+     prev = fuson_ht_db.head(10).copy()  # copy to avoid SettingWithCopyWarning
+     prev['aa_seq'] = prev['aa_seq'].apply(lambda x: x[0:10]+'...')
+     log_update(prev.to_string(index=False))
+     fuson_ht_db.to_csv(savepath, index=False)
+     return fuson_ht_db
+
+ def format_dict(d, indent=0):
+     """
+     Recursively formats a dictionary for display purposes.
+
+     Args:
+         d (dict): The dictionary to format.
+         indent (int): The current level of indentation.
+
+     Returns:
+         str: A formatted string representing the dictionary.
+     """
+     formatted_str = ""
+     # Iterate through each key-value pair in the dictionary
+     for key, value in d.items():
+         # Create the current indentation
+         current_indent = " " * (indent * 4)
+         # Add the key
+         formatted_str += f"{current_indent}{repr(key)}: "
+
+         # Check the type of the value
+         if isinstance(value, dict):
+             # If dictionary, call format_dict recursively
+             formatted_str += "{\n" + format_dict(value, indent + 1) + current_indent + "},\n"
+         elif isinstance(value, list):
+             # If list, convert it to a formatted string
+             formatted_str += f"[{', '.join(repr(item) for item in value)}],\n"
+         elif isinstance(value, str):
+             # If string, enclose in quotes
+             formatted_str += f"'{value}',\n"
+         elif value is None:
+             # If None, display as 'None'
+             formatted_str += "None,\n"
+         else:
+             formatted_str += f"{repr(value)},\n"
+
+     return formatted_str
+
+ def parse_blast_output(file_path, head_ids, tail_ids):
+     """
+     Args:
+     - file_path: /path/to/blast/output
+     - head_ids: list of all UniProt IDs for the head protein
+     - tail_ids: list of all UniProt IDs for the tail protein
+     """
+     target_ids = list(set(head_ids + tail_ids))  # make a list to make some functions easier
+     with open(file_path, 'r') as file:
+         best_data = {tid: None for tid in target_ids}  # stores the best alignment for each ID we care about
+         current_data = {tid: {} for tid in target_ids}  # stores the current data for each ID we care about (most recent alignment we read)
+         best_score = {tid: -float('inf') for tid in target_ids}  # stores the best score for each ID we care about
+         capture = {tid: False for tid in target_ids}  # whether we are currently processing this ID
+         replace_best = {tid: False for tid in target_ids}  # whether we should replace the best_data with the current_data for this ID
+         isoform_dict = {tid: None for tid in target_ids}  # dictionary of isoforms for each target ID
+
+         # variables that will only be used for getting the best alignment
+         alignment_count = 0
+         cur_id = None
+         on_best_alignment = False
+
+         # Iterate through lines
+         for line in file:
+             line = line.strip()
+             # if NEW ID (not necessarily a new alignment! there can be multiple alignments under one >)
+             if line.startswith('>'):
+                 found_tid_in_header = False  # assume we have not found a target ID we are looking for
+                 alignment_count += 1
+                 if alignment_count == 1:  # we're on the best alignment, because the best alignment is listed first
+                     on_best_alignment = True
+                 else:
+                     on_best_alignment = False
+
+                 ## We may have just finished processing an ID. Check for the one that currently has capture set to True
+                 just_captured = None
+                 total_captured = 0
+                 for k, v in capture.items():
+                     if v:
+                         total_captured += 1
+                         just_captured = k
+                 # we should never be capturing more than one thing at a time. make sure of this
+                 assert total_captured < 2
+                 if just_captured is not None:
+                     if replace_best[just_captured]:  # if we just finished an alignment for the just_captured ID, and it's the best one, put it in
+                         best_data[just_captured] = current_data[just_captured].copy()
+                         replace_best[just_captured] = False  # we just did the replacement, so reset it
+
+                 # Check if the line contains any of the target IDs.
+                 # This means EITHER [UniProtID] or [UniProtID.Isoform] or [UniProtID-Isoform] is in the line
+                 for tid in target_ids:
+                     pattern = fr">{tid}([.-]\d+)? "  # for ID P02671, would match ">P02671 ", ">P02671.2 " and ">P02671-2 "
+                     if re.search(pattern, line):  # if this ID matches
+                         isoform_dict[tid] = None  # set it to None, update it if we need to
+                         if "." in line:  # look for isoform denoted by . if there is one, otherwise it'll stay as None
+                             isoform = int(line.split(".")[1].split(" ")[0])
+                             isoform_dict[tid] = isoform
+                             #print(f"\t\tID = {tid} (is a head or tail), isoform={isoform}")
+                         elif "-" in line:  # look for isoform denoted by - if there is one, otherwise it'll stay as None
+                             isoform = int(line.split("-")[1].split(" ")[0])
+                             isoform_dict[tid] = isoform
+                             #print(f"\t\tID = {tid} (is a head or tail), isoform={isoform}")
+                         capture[tid] = True
+                         current_data[tid] = {'header': line}
+                         found_tid_in_header = True  # we've found the tid that's in this line, so no need to check the others
+                     else:
+                         capture[tid] = False
+
+                 if on_best_alignment:  # if this is the best alignment
+                     if not(found_tid_in_header):  # if none of our target IDs is the best alignment
+                         cur_id_full = line.split('>')[1].split(' ')[0]
+                         cur_id, isoform = cur_id_full, None
+                         isoform_dict[cur_id] = None  # change this if we need to
+                         if "." in cur_id_full:  # if there's a dot, it's an isoform.
+                             cur_id = cur_id_full.split(".")[0]
+                             isoform = int(cur_id_full.split(".")[1])
+                             isoform_dict[cur_id] = isoform
+                             #log_update(f"\t\tID = {cur_id} (best alignment, not a head or tail), isoform={isoform}")
+                             #log_update(f"\t\t\tFull line: {line}")  # so we can see the gene name. does it make sense?
+                         elif "-" in cur_id_full:  # if there's a -, it's an isoform.
+                             cur_id = cur_id_full.split("-")[0]
+                             isoform = int(cur_id_full.split("-")[1])
+                             isoform_dict[cur_id] = isoform
+                             #log_update(f"\t\tID = {cur_id} (best alignment, not a head or tail), isoform={isoform}")
+                             #log_update(f"\t\t\tFull line: {line}")  # so we can see the gene name. does it make sense?
+                         # add this id to all the dictionaries
+                         best_data[cur_id] = None
+                         current_data[cur_id] = {}
+                         best_score[cur_id] = -float('inf')
+                         capture[cur_id] = False
+                         replace_best[cur_id] = False
+
+             for tid in target_ids:
+                 if capture[tid]:  # if we're currently on an alignment for a tid we care about
+                     if 'Score =' in line:
+                         if replace_best[tid]:  # if we're replacing the best alignment with this one, within the same ID, do it
+                             best_data[tid] = current_data[tid].copy()
+                             # now reset the variable!
+                             replace_best[tid] = False
+
+                         score_value = float(line.split()[2])  # Assuming "Score = 1053 bits (2723)" format
+                         current_data[tid] = {}  # Reset current_data for this ID
+                         current_data[tid]['Isoform'] = isoform_dict[tid]
+                         current_data[tid]['Score'] = score_value
+                         current_data[tid]['Expect'] = line.split('Expect =')[1].split(', Method')[0].strip()
+                         current_data[tid]['Query_Aligned'] = []
+                         current_data[tid]['Subject_Aligned'] = []
+                         # Set the ID as a head or tail, or neither (neither shouldn't happen here though)
+                         if tid in head_ids:
+                             current_data[tid]['H_or_T'] = 'Head'
+                             if tid in tail_ids:
+                                 current_data[tid]['H_or_T'] = 'Head,Tail'
+                         elif tid in tail_ids:
+                             current_data[tid]['H_or_T'] = 'Tail'
+                         else:
+                             current_data[tid]['H_or_T'] = np.nan
+
+                         current_data[tid]['Best'] = True if on_best_alignment else False
+                         if score_value > best_score[tid]:  # if this is the best score we have for an alignment of this protein
+                             best_score[tid] = score_value
+                             replace_best[tid] = True
+                         else:
+                             replace_best[tid] = False
+
+                     if 'Identities =' in line:
+                         idents = line.split(', ')
+                         current_data[tid]['Identities'] = idents[0].split('=')[1].strip()
+                         current_data[tid]['Positives'] = idents[1].split('=')[1].strip()
+                         current_data[tid]['Gaps'] = idents[2].split('=')[1].strip()
+                     if line.startswith('Query'):
+                         parts = line.split()
+                         if 'Query_Start' not in current_data[tid]:
+                             current_data[tid]['Query_Start'] = int(parts[1])
+                         current_data[tid]['Query_End'] = int(parts[3])
+                         current_data[tid]['Query_Aligned'].append(parts[2])
+                     if line.startswith('Sbjct'):
+                         parts = line.split()
+                         if 'Sbjct_Start' not in current_data[tid]:
+                             current_data[tid]['Sbjct_Start'] = int(parts[1])
+                         current_data[tid]['Sbjct_End'] = int(parts[3])
+                         current_data[tid]['Subject_Aligned'].append(parts[2])
+
+             # if we're on the best alignment and it's not one of our target_ids, still process it the same way
+             if on_best_alignment:
+                 if not(found_tid_in_header):
+                     if 'Score =' in line:
+                         if replace_best[cur_id]:  # if we're replacing the best alignment with this one, within the same ID, do it
+                             best_data[cur_id] = current_data[cur_id].copy()
+                             # now reset the variable!
+                             replace_best[cur_id] = False
+
+                         score_value = float(line.split()[2])  # Assuming "Score = 1053 bits (2723)" format
+                         current_data[cur_id] = {}  # Reset current_data for this ID
+                         current_data[cur_id]['Isoform'] = isoform_dict[cur_id]
+                         current_data[cur_id]['Score'] = score_value
+                         current_data[cur_id]['Expect'] = line.split('Expect =')[1].split(', Method')[0].strip()
+                         current_data[cur_id]['Query_Aligned'] = []
+                         current_data[cur_id]['Subject_Aligned'] = []
+                         # Set the ID as a head or tail, or neither
+                         if cur_id in head_ids:
+                             current_data[cur_id]['H_or_T'] = 'Head'
+                             if cur_id in tail_ids:
+                                 current_data[cur_id]['H_or_T'] = 'Head,Tail'
+                         elif cur_id in tail_ids:
+                             current_data[cur_id]['H_or_T'] = 'Tail'
+                         else:
+                             current_data[cur_id]['H_or_T'] = np.nan
+
+                         current_data[cur_id]['Best'] = True
+                         if score_value > best_score[cur_id]:  # if this is the best score we have for an alignment of this protein
+                             best_score[cur_id] = score_value
+                             replace_best[cur_id] = True
+                         else:
+                             replace_best[cur_id] = False
+
+                     if 'Identities =' in line:
+                         idents = line.split(', ')
+                         current_data[cur_id]['Identities'] = idents[0].split('=')[1].strip()
+                         current_data[cur_id]['Positives'] = idents[1].split('=')[1].strip()
+                         current_data[cur_id]['Gaps'] = idents[2].split('=')[1].strip()
+                     if line.startswith('Query'):
+                         parts = line.split()
+                         if 'Query_Start' not in current_data[cur_id]:
+                             current_data[cur_id]['Query_Start'] = int(parts[1])
+                         current_data[cur_id]['Query_End'] = int(parts[3])
+                         current_data[cur_id]['Query_Aligned'].append(parts[2])
+                     if line.startswith('Sbjct'):
+                         parts = line.split()
+                         if 'Sbjct_Start' not in current_data[cur_id]:
+                             current_data[cur_id]['Sbjct_Start'] = int(parts[1])
+                         current_data[cur_id]['Sbjct_End'] = int(parts[3])
+                         current_data[cur_id]['Subject_Aligned'].append(parts[2])
+
+         # add cur_id to target_ids if it's not None
+         if not(cur_id is None):
+             target_ids += [cur_id]
+
+         # Check at the end of the file if the last scores are the best
+         for tid in target_ids:
+             if replace_best[tid]:
+                 best_data[tid] = current_data[tid].copy()
+
+     # Combine sequences into single strings for the best data for each ID
+     for tid in target_ids:
+         #print(tid)
+         if best_data[tid]:
+             #print(f"there is a best alignment for {tid}")
+             #print(f"best: {best_data[tid]}")
+             #print(f"current: {current_data[tid]}")
+             best_data[tid]['Query_Aligned'] = ''.join(best_data[tid]['Query_Aligned'])
+             best_data[tid]['Subject_Aligned'] = ''.join(best_data[tid]['Subject_Aligned'])
+
+     return best_data
+
+ def parse_all_blast_results(fuson_ht_db, database="swissprot"):
+     """
+     Analyze the BLAST outputs for each fusion protein against UniProt.
+     Use the fuson_ht_db to look for the heads and tails that we expect. If they can't be found, ... ?
+     """
+     output_file = f"blast_outputs/{database}_blast_output_analyzed.pkl"
+     all_seq_ids = fuson_ht_db['seq_id'].unique().tolist()
+     all_seq_ids = sorted(all_seq_ids, key=lambda x: int(re.search(r'\d+', x).group()))  # sort by the number: seq1, seq2, ...
+
+     prior_results = {}
+     if os.path.exists(output_file):
+         with open(output_file, "rb") as f:
+             prior_results = pickle.load(f)
+
+     # Iterate through seq_ids
+     total_parse_time = 0
+     tot_seqs_processed = 0
+     for seq_id in all_seq_ids:
+         try:
+             tot_seqs_processed += 1
+             # If we've already processed it, skip
+             if seq_id in prior_results:
+                 log_update(f"\tAlready processed {seq_id} blast results. Continuing")
+                 continue
+
+             file_path = f"blast_outputs/{database}/{seq_id}_{database}_results.out"
+
+             aa_seq = fuson_ht_db.loc[
+                 fuson_ht_db['seq_id']==seq_id
+             ]['aa_seq'].tolist()[0]
+
+             # Remember, fuson_ht_db has all the IDs for ALL the different head and tail gene identifiers.
+             fusion_genes = fuson_ht_db.loc[
+                 fuson_ht_db['seq_id']==seq_id
+             ]['fusiongenes'].tolist()
+
+             ##### Process heads
+             head_ids = fuson_ht_db.loc[
+                 fuson_ht_db['seq_id']==seq_id
+             ]['hgUniProt'].dropna().tolist()
+             head_reviewed, head_reviewed_dict = "", {}
+             if len(head_ids) > 0:  # if we found head IDs, we can process them and figure out if they're reviewed
+                 head_ids = ",".join(head_ids).split(",")
+                 head_reviewed = fuson_ht_db.loc[
+                     fuson_ht_db['seq_id']==seq_id
+                 ]['hgUniProtReviewed'].dropna().tolist()
+                 head_reviewed = list("".join(head_reviewed))
+
+                 head_reviewed_dict = dict(zip(head_ids, head_reviewed))
+                 head_ids = list(head_reviewed_dict.keys())  # there may be some duplicates, so separate them out again
+                 head_reviewed = list(head_reviewed_dict.values())
+
+             head_genes = fuson_ht_db.loc[
+                 fuson_ht_db['seq_id']==seq_id
+             ]['hgene'].unique().tolist()
+
+             ##### Process tails - same logic
+             tail_ids = fuson_ht_db.loc[
+                 fuson_ht_db['seq_id']==seq_id
+             ]['tgUniProt'].dropna().tolist()
+             tail_reviewed, tail_reviewed_dict = "", {}
+             if len(tail_ids) > 0:  # if we found tail IDs, we can process them and figure out if they're reviewed
+                 tail_ids = ",".join(tail_ids).split(",")
+                 tail_reviewed = fuson_ht_db.loc[
+                     fuson_ht_db['seq_id']==seq_id
+                 ]['tgUniProtReviewed'].dropna().tolist()
+                 tail_reviewed = list("".join(tail_reviewed))
+
+                 tail_reviewed_dict = dict(zip(tail_ids, tail_reviewed))
+                 tail_ids = list(tail_reviewed_dict.keys())  # there may be some duplicates, so separate them out again
+                 tail_reviewed = list(tail_reviewed_dict.values())
+
+             tail_genes = fuson_ht_db.loc[
+                 fuson_ht_db['seq_id']==seq_id
+             ]['tgene'].unique().tolist()
+
+             ###### Log what we just found
+             log_update(f"\tEvaluating {seq_id}, fusion genes = {fusion_genes}, len = {len(aa_seq)}...\n\t\tfile_path={file_path}")
+             #log_update(f"\n\t\thead genes={head_genes}\n\t\thead_ids={head_ids}\n\t\ttail genes={tail_genes}\n\t\ttail_ids={tail_ids}")
+
+             ### Do the analysis and time it
+             parse_start_time = time.time()  # time it
+             blast_data = parse_blast_output(file_path, head_ids, tail_ids)
+             parse_end_time = time.time()
+             parse_seq_time = parse_end_time - parse_start_time
+             total_parse_time += parse_seq_time
+             log_update(f"\t\tBLAST output analysis completed for {seq_id} ({parse_seq_time:.2f}s)")
+
+             # Give a preview of results. Logging the whole dict would be too much, so let's just see what we found
+             #log_update(format_dict(blast_data, indent=3))
+             n_og_reviewed_head_ids = len([x for x in head_reviewed if x=='1'])
+             found_head_ids = [x for x in list(blast_data.keys()) if (blast_data[x] is not None) and (blast_data[x].get('H_or_T', None) in ['Head','Head,Tail'])]
+             n_found_reviewed_head_ids = len([x for x in found_head_ids if head_reviewed_dict[x]=='1'])
+
+             n_og_reviewed_tail_ids = len([x for x in tail_reviewed if x=='1'])
+             found_tail_ids = [x for x in list(blast_data.keys()) if (blast_data[x] is not None) and (blast_data[x].get('H_or_T', None) in ['Tail','Head,Tail'])]
+             n_found_reviewed_tail_ids = len([x for x in found_tail_ids if tail_reviewed_dict[x]=='1'])
+
+             #log_update(f"\t\t{len(found_head_ids)}/{len(head_ids)} head protein UniProt IDs ({n_found_reviewed_head_ids}/{n_og_reviewed_head_ids} REVIEWED heads) had alignments")
+             #log_update(f"\t\t{len(found_tail_ids)}/{len(tail_ids)} tail protein UniProt IDs ({n_found_reviewed_tail_ids}/{n_og_reviewed_tail_ids} REVIEWED tails) had alignments")
+
+             # write results to pickle file
+             to_pickle_dict = {seq_id: blast_data}
+             with open(output_file, 'ab+') as f:
+                 pickle.dump(to_pickle_dict, f)
+
+         except:
+             log_update(f"{seq_id} failed")
+             # redump the pickle even if we hit an error, so that we can fix the error and continue processing results
+             redump_pickle_dictionary(output_file)
+
+     # Log total time
+     log_update(f"\tFinished processing {tot_seqs_processed} sequences ({total_parse_time:.2f}s)")
+
+     # redump the pickle
+     redump_pickle_dictionary(output_file)
+
+ def analyze_blast_results(fuson_ht_db, database="swissprot"):
+     blast_results_path = f"blast_outputs/{database}_blast_output_analyzed.pkl"
+     stats_df_savepath = f"blast_outputs/{database}_blast_stats.csv"
+     top_alignments_df_savepath = f"blast_outputs/{database}_top_alignments.csv"
+
+     stats_df, top_alignments_df = None, None
+     if os.path.exists(stats_df_savepath) and os.path.exists(top_alignments_df_savepath):
+         stats_df = pd.read_csv(stats_df_savepath)
+         top_alignments_df = pd.read_csv(top_alignments_df_savepath, dtype={'top_hg_UniProt_isoform': 'str',
+                                                                            'top_tg_UniProt_isoform': 'str',
+                                                                            'top_UniProt_isoform': 'str'})
+
+     else:
+         with open(blast_results_path, "rb") as f:
+             results = pickle.load(f)
+
+         # analyze the results
+         # first, basic stats. How many of them have at least one head or tail alignment?
+         seqid_stats = {}
+         top_alignments_dict = {}
+         for seq_id in list(results.keys()):
+             seqid_stats[seq_id] = {
+                 'hgAlignments': 0,
+                 'tgAlignments': 0,
+                 'totalAlignments': 0,
+                 'best_hgScore': 0,
+                 'best_tgScore': 0,
+                 'best_Score': 0
+             }
+             top_alignments_dict[seq_id] = {
+                 'top_hg_UniProtID': None,
+                 'top_hg_UniProt_isoform': None,
+                 'top_hg_UniProt_fus_indices': None,
+                 'top_tg_UniProtID': None,
+                 'top_tg_UniProt_isoform': None,
+                 'top_tg_UniProt_fus_indices': None,
+                 'top_UniProtID': None,
+                 'top_UniProt_isoform': None,
+                 'top_UniProt_fus_indices': None
+             }
+             for uniprot, d in results[seq_id].items():
+                 if not(d is None):
+                     isoform = d['Isoform']
+                     # set up the indices string
+                     query_start = d['Query_Start']
+                     if (query_start is None) or (type(query_start)==float and np.isnan(query_start)):
+                         query_start = ''
+                     else:
+                         query_start = int(query_start)
+                     query_end = d['Query_End']
+                     if (query_end is None) or (type(query_end)==float and np.isnan(query_end)):
+                         query_end = ''
+                     else:
+                         query_end = int(query_end)
+                     fus_indices = f"{query_start},{query_end}".strip(",")
+
+                     if d['H_or_T'] in ['Head', 'Head,Tail']:
+                         seqid_stats[seq_id]['hgAlignments'] += 1
+                         if d['Score'] > seqid_stats[seq_id]['best_hgScore']:
+                             seqid_stats[seq_id]['best_hgScore'] = d['Score']
+                             if type(uniprot)==float or uniprot is None:
+                                 top_alignments_dict[seq_id]['top_hg_UniProtID'] = ''
+                             else:
+                                 top_alignments_dict[seq_id]['top_hg_UniProtID'] = uniprot
+                             if (type(isoform)==float and np.isnan(isoform)) or isoform is None:
+                                 top_alignments_dict[seq_id]['top_hg_UniProt_isoform'] = ''
+                             else:
+                                 top_alignments_dict[seq_id]['top_hg_UniProt_isoform'] = str(int(isoform))
+
+                             top_alignments_dict[seq_id]['top_hg_UniProt_fus_indices'] = fus_indices
+
+                     if d['H_or_T'] in ['Tail', 'Head,Tail']:
+                         seqid_stats[seq_id]['tgAlignments'] += 1
+                         if d['Score'] > seqid_stats[seq_id]['best_tgScore']:
+                             seqid_stats[seq_id]['best_tgScore'] = d['Score']
+                             if type(uniprot)==float or uniprot is None:
+                                 top_alignments_dict[seq_id]['top_tg_UniProtID'] = ''
+                             else:
+                                 top_alignments_dict[seq_id]['top_tg_UniProtID'] = uniprot
+                             if (type(isoform)==float and np.isnan(isoform)) or isoform is None:
+                                 top_alignments_dict[seq_id]['top_tg_UniProt_isoform'] = ''
+                             else:
+                                 top_alignments_dict[seq_id]['top_tg_UniProt_isoform'] = str(int(isoform))
+
+                             top_alignments_dict[seq_id]['top_tg_UniProt_fus_indices'] = fus_indices
+                     # increment total no matter what type of alignment it is
+                     seqid_stats[seq_id]['totalAlignments'] += 1
+                     #if d['Score'] > seqid_stats[seq_id]['best_Score']:
+                     if d['Best']==True:  # should be indicated if this is the best!
+                         seqid_stats[seq_id]['best_Score'] = d['Score']
+                         if type(uniprot)==float or uniprot is None:
+                             top_alignments_dict[seq_id]['top_UniProtID'] = ''
+                         else:
+                             top_alignments_dict[seq_id]['top_UniProtID'] = uniprot
+                         if (type(isoform)==float and np.isnan(isoform)) or isoform is None:
+                             top_alignments_dict[seq_id]['top_UniProt_isoform'] = ''
+                         else:
+                             top_alignments_dict[seq_id]['top_UniProt_isoform'] = str(int(isoform))
+
+                         top_alignments_dict[seq_id]['top_UniProt_fus_indices'] = fus_indices
+                         # now get positives and identities
+                         if 'Identities' not in d: print(seq_id, uniprot, d.keys())
+                         identities = d['Identities']
+                         identities = int(identities.split('/')[0])
+                         positives = d['Positives']
+                         positives = int(positives.split('/')[0])
+                         top_alignments_dict[seq_id]['top_UniProt_nIdentities'] = identities
+                         top_alignments_dict[seq_id]['top_UniProt_nPositives'] = positives
+
+         stats_df = pd.DataFrame.from_dict(seqid_stats, orient='index').reset_index().rename(columns={'index': 'seq_id'})
+         stats_df['h_or_t_alignment'] = stats_df.apply(lambda row: True if (row['hgAlignments']>0 or row['tgAlignments']>0) else False, axis=1)
+         stats_df['h_and_t_alignment'] = stats_df.apply(lambda row: True if (row['hgAlignments']>0 and row['tgAlignments']>0) else False, axis=1)
+         stats_df.to_csv(stats_df_savepath, index=False)
+
+         top_alignments_df = pd.DataFrame.from_dict(top_alignments_dict, orient='index').reset_index().rename(columns={'index': 'seq_id'})
+         # add in the sequence length so we can get percentages
+         fusion_id_seq_dict = dict(zip(fuson_ht_db['seq_id'], fuson_ht_db['aa_seq']))
+         assert len(fusion_id_seq_dict) == len(fuson_ht_db['seq_id'].unique()) == len(fuson_ht_db['aa_seq'].unique())
+         top_alignments_df['aa_seq_len'] = top_alignments_df['seq_id'].map(fusion_id_seq_dict).str.len()
+
+         top_alignments_df.to_csv(top_alignments_df_savepath, index=False)
+     # also, find which ones have no match at all
+     no_match_list1 = find_nomatch_blasts(fuson_ht_db, database=database)
+
+     log_update(stats_df.head(10).to_string())
+     # how many have at least one head or tail?
+     log_update(f"Total sequences: {len(stats_df)}")
+     log_update(f"Sequences with >=1 head alignment: {len(stats_df.loc[stats_df['hgAlignments']>0])}")
+     log_update(f"Sequences with >=1 tail alignment: {len(stats_df.loc[stats_df['tgAlignments']>0])}")
+     log_update(f"Sequences with >=1 head OR tail alignment: {len(stats_df.loc[stats_df['h_or_t_alignment']])}")
+     log_update(f"Sequences with >=1 head AND tail alignment: {len(stats_df.loc[stats_df['h_and_t_alignment']])}")
+     log_update(f"Sequences with ANY alignment: {len(stats_df.loc[stats_df['totalAlignments']>0])}")
+
+     top_alignments_df = top_alignments_df.replace({None: ''})
+     log_update(f"Preview of top alignments for {database} search:\n{top_alignments_df.head(10).to_string(index=False)}")
+     top_alignments_df['hiso'] = top_alignments_df['top_hg_UniProtID']+'-'+top_alignments_df['top_hg_UniProt_isoform']
+     top_alignments_df['tiso'] = top_alignments_df['top_tg_UniProtID']+'-'+top_alignments_df['top_tg_UniProt_isoform']
+     top_alignments_df['biso'] = top_alignments_df['top_UniProtID']+'-'+top_alignments_df['top_UniProt_isoform']
+     top_hgs = set([x.strip('-') for x in top_alignments_df['hiso'].tolist()])  # if things don't have isoforms they'll just end in -
+     top_tgs = set([x.strip('-') for x in top_alignments_df['tiso'].tolist()])
+     top_bgs = set([x.strip('-') for x in top_alignments_df['biso'].tolist()])
+     top_gs = top_hgs | top_tgs | top_bgs
+     log_update(f"\nTotal unique head proteins (including isoform) producing top head alignments: {len(top_hgs)}")
+     log_update(f"\nTotal unique tail proteins (including isoform) producing top tail alignments: {len(top_tgs)}")
+     log_update(f"\nTotal unique proteins (including isoform) - head, tail, or neither - producing top alignments: {len(top_gs)}")
+
+     return stats_df, top_alignments_df
+
731
+ def compare_database_blasts(fuson_ht_db, swissprot_blast_stats, fusion_hts_blast_stats, make_new_plots=True):
732
+ # let's start by just returning a list of IDs that were
733
+ # cols = seq_id hgAlignments tgAlignments totalAlignments best_hgScore best_tgScore best_Score h_or_t_alignment h_and_t_alignment
734
+
735
+ # distinguish the columns
736
+ og_cols = list(swissprot_blast_stats.columns)[1::]
737
+ for c in og_cols:
738
+ if c!='seq_id':
739
+ swissprot_blast_stats = swissprot_blast_stats.rename(columns={c: f"swiss_{c}"})
740
+ for c in og_cols:
741
+ if c!='seq_id':
742
+ fusion_hts_blast_stats = fusion_hts_blast_stats.rename(columns={c: f"hts_{c}"})
743
+
744
+ # merge
745
+ merged = pd.merge(swissprot_blast_stats,
746
+ fusion_hts_blast_stats,
747
+ on='seq_id',
748
+ how='outer')
749
+ diff_cols = og_cols[0:-2]
750
+ differences = pd.DataFrame(columns=diff_cols)
751
+ log_update(f"Making volcano plots of the differences between fusion head-tail BLAST and swissprot BLAST in the following columns:\n\t{','.join(diff_cols)}")
752
+ for c in diff_cols:
753
+ differences[c] = merged[f"hts_{c}"] - merged[f"swiss_{c}"]
754
+
755
+ # make some box plots of differences
756
+ # Generate volcano plots for each column
757
+ if make_new_plots:
758
+ os.makedirs("figures",exist_ok=True)
759
+ os.makedirs("figures/database_comparison",exist_ok=True)
760
+ os.makedirs("figures/database_comparison/differences",exist_ok=True)
761
+ os.makedirs("figures/database_comparison/values",exist_ok=True)
762
+ os.makedirs("figures/database_comparison/box",exist_ok=True)
763
+
764
+ group_difference_plot(differences)
765
+ group_swiss_and_ht_plot(merged.drop(columns=['seq_id']), diff_cols)
766
+ group_box_plot(merged.drop(columns=['seq_id']), diff_cols)
767
+
768
+ def fasta_to_dataframe(fasta_file):
769
+ # Read the file into a DataFrame with a single column
770
+ df = pd.read_fwf(fasta_file, header=None, colspecs=[(0, None)], names=['content'])
771
+
772
+ # Select even and odd lines using pandas slicing
773
+ ids = df.iloc[::2].reset_index(drop=True) # Even-indexed lines (IDs)
774
+ sequences = df.iloc[1::2].reset_index(drop=True) # Odd-indexed lines (sequences)
775
+
776
+ # Combine into a new DataFrame
777
+ fasta_df = pd.DataFrame({'ID': ids['content'], 'Sequence': sequences['content']})
778
+ fasta_df['ID'] = fasta_df['ID'].str.split('>',expand=True)[1]
779
+ fasta_df['Sequence'] = fasta_df['Sequence'].str.strip().str.strip('\n')
780
+
781
+ # print a preview of this
782
+ temp = fasta_df.head(10)
783
+ temp['Sequence'] = temp['Sequence'].apply(lambda x: x[0:10]+'...')
784
+ log_update(f"Preview of head/tail fasta sequences in a dataframe:\n{temp.to_string(index=False)}")
785
+
786
+ return fasta_df
787
+
788
+ def get_ht_uniprot_query(swissprot_top_alignments_df):
789
+ '''
790
+ Use swissprot_top_alignments_df to curate all the unique UniProt IDs (ID.Isoform) that created top head and tail alignments
791
+ '''
792
+ swissprot_top_alignments_df['top_hg_full'] = swissprot_top_alignments_df['top_hg_UniProtID']+'.'+swissprot_top_alignments_df['top_hg_UniProt_isoform']
793
+ swissprot_top_alignments_df['top_tg_full'] = swissprot_top_alignments_df['top_tg_UniProtID']+'.'+swissprot_top_alignments_df['top_tg_UniProt_isoform']
794
+
795
+ unique_heads = swissprot_top_alignments_df.loc[
796
+ swissprot_top_alignments_df['top_hg_UniProtID'].notna()
797
+ ]['top_hg_full'].unique().tolist()
798
+
799
+ unique_tails = swissprot_top_alignments_df.loc[
800
+ swissprot_top_alignments_df['top_tg_UniProtID'].notna()
801
+ ]['top_tg_full'].unique().tolist()
802
+
803
+ unique_ht = set(unique_heads).union(set(unique_tails))
804
+ unique_ht = list(unique_ht)
805
+ unique_ht = [x for x in unique_ht if len(x)>1] # not just "."
806
+
807
+ with open("blast_outputs/ht_uniprot_query.txt", "w") as f:
808
+ for i, ht in enumerate(unique_ht):
809
+ if i!= len(unique_ht)-1:
810
+ f.write(f"{ht}\n")
811
+ else:
812
+ f.write(f"{ht}")
813
+
814
+ def main():
815
+ # Later, add the argparse thing back in here and change where the log is and what happens depending on wht the user decides
816
+ # May need to separate blast prep from actual blast for the manuscript, but worry about this later
817
+ with open_logfile(f"fusion_blast_log.txt"):
818
+ # Start by preparing BLAST inputs
819
+ prepare_blast_inputs()
820
+
821
+ # Then run BLAST
822
+ run_blast("blast_inputs",database="swissprot")
823
+
824
+ ###### Analyze BLAST results
825
+ # Make database with head and tail info for each fusion, so we know what to expect
826
+ fuson_ht_db = make_fuson_ht_db(savepath="fuson_ht_db.csv")
827
+
828
+ #parse_all_blast_results(fuson_ht_db, database="swissprot")
829
+ swissprot_blast_stats, swissprot_top_alignments_df = analyze_blast_results(fuson_ht_db,database="swissprot")
830
+
831
+ swissprot_top_alignments_df = pd.read_csv("blast_outputs/swissprot_top_alignments.csv")
832
+ get_ht_uniprot_query(swissprot_top_alignments_df)
833
+ os.makedirs("figures/top_blast_visuals",exist_ok=True)
834
+ group_pos_id_plot(swissprot_top_alignments_df)
835
+
836
+ if __name__ == '__main__':
837
+ main()
838
+
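Note that `fasta_to_dataframe` above assumes every FASTA record occupies exactly two lines (one header, one unwrapped sequence line). For FASTA files with wrapped sequences, a Biopython-based parse is safer; a minimal sketch with a hypothetical helper name (Biopython is already a dependency of this repo via `Bio.SeqIO` in `cluster.py`):

```python
import pandas as pd
from Bio import SeqIO

def fasta_to_dataframe_robust(fasta_file):
    # SeqIO.parse handles multi-line (wrapped) sequences transparently
    records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse(fasta_file, "fasta")]
    return pd.DataFrame(records, columns=["ID", "Sequence"])
```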
fuson_plm/data/blast/blast_outputs/best_htg_alignments_swissprot_seqs.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d65107b19f8119ac3d37e2269857070be4736facdeebcdb6a7ddbcc339a5d7dc
+ size 6855252
fuson_plm/data/blast/blast_outputs/ht_uniprot_query.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d06e54181e6ef2ecd6b1bb09cb6e202f3bfcd2638e9991a9959398ded385f8a6
+ size 83285
fuson_plm/data/blast/blast_outputs/swissprot_blast_output_analyzed.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c94fdc967e9e3a84ccf84eff110b36fad3bf3966a8069fe16c1c5f74202ba4cf
+ size 96067168
fuson_plm/data/blast/blast_outputs/swissprot_blast_stats.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7b4adfd714310a6c5df1490318dbb61ffa7cea6e50ca4d2c0454dd3d1747fa6e
+ size 1915092
fuson_plm/data/blast/blast_outputs/swissprot_no_match.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2ed60f354663bc37deaca18480cf7864588d782dad37997bb7917b5498e933c
+ size 43049
fuson_plm/data/blast/blast_outputs/swissprot_no_match.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4aab65af73b8ef76f9df2866a1dc4d93819ebcaafff229a5133f517df02386fd
+ size 2680
fuson_plm/data/blast/blast_outputs/swissprot_top_alignments.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e97b1341d9a878ff55f4e0940d4b9556c26a2b93ff5a52220cebb6d97150d6f
+ size 3293203
fuson_plm/data/blast/extract_blast_seqs.py ADDED
@@ -0,0 +1,62 @@
+ # Quick script to just get sequences out of the SwissProt BLAST database
+ import subprocess
+ import os
+ import pandas as pd
+ import pickle
+
+ def get_sequences_from_blastdb(database_path, entries):
+     """
+     Retrieves sequences for a list of entries from a BLAST database.
+
+     Parameters:
+     - database_path (str): Path to the BLAST database (without file extension).
+     - entries (list): List of entry IDs to query.
+
+     Returns:
+     - dict: A dictionary with entry IDs as keys and sequences as values.
+     """
+     sequences = {}
+     cwd = os.getcwd()
+     os.chdir("ncbi-blast-2.16.0+/swissprot")
+     for entry in entries:
+         try:
+             # Run blastdbcmd command to retrieve the sequence for each entry
+             result = subprocess.run(
+                 ["blastdbcmd", "-db", database_path, "-entry", entry],
+                 capture_output=True, text=True, check=True
+             )
+
+             # Store the output in the dictionary (entry ID as key, sequence as value)
+             # make sure the ID is what we think it is
+             result = result.stdout.strip()
+             entry_id = result.split(' ',1)[0].split('>')[1]
+             assert entry_id == entry
+             seq = result.split('\n',1)[1]
+             seq = seq.replace('\n','').strip()
+             sequences[entry] = seq
+
+         except subprocess.CalledProcessError as e:
+             print(f"Error retrieving entry {entry}: {e}")
+             sequences[entry] = None # Store None if there's an error for this entry
+
+     os.chdir(cwd) # restore the working directory so later relative paths still resolve
+     return sequences
+
+ def main():
+     # Query the SwissProt database for the sequences of all the head and tail genes that produced the top alignments
+     htgs = pd.read_csv("blast_outputs/ht_uniprot_query.txt",header=None)
+     htgs = list(htgs[0])
+
+     database_path = "swissprot" # Path to the BLAST database without extension
+     entries = htgs
+
+     sequences_dict = get_sequences_from_blastdb(database_path, entries)
+     with open("blast_outputs/best_htg_alignments_swissprot_seqs.pkl", "wb") as f:
+         pickle.dump(sequences_dict, f)
+
+     # Now look at the file you just wrote
+     with open("blast_outputs/best_htg_alignments_swissprot_seqs.pkl", "rb") as f:
+         d = pickle.load(f)
+
+ if __name__ == '__main__':
+     main()
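The loop above launches one `blastdbcmd` subprocess per entry. `blastdbcmd` also accepts a file of IDs via its `-entry_batch` flag, so the whole query list can be fetched in a single call; a minimal sketch with a hypothetical helper name, under the same working-directory assumptions as the script above:

```python
import subprocess

def get_sequences_batch(database_path, entry_file):
    # One blastdbcmd call for all entries listed in entry_file
    result = subprocess.run(
        ["blastdbcmd", "-db", database_path, "-entry_batch", entry_file],
        capture_output=True, text=True, check=True
    )
    sequences = {}
    for block in result.stdout.split(">")[1:]:  # each FASTA record
        header, _, body = block.partition("\n")
        sequences[header.split(" ", 1)[0]] = body.replace("\n", "")
    return sequences

# e.g., reusing the query list written by blast_fusions.py:
# seqs = get_sequences_batch("swissprot", "blast_outputs/ht_uniprot_query.txt")
```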
fuson_plm/data/blast/figures/identities_hist.png ADDED
fuson_plm/data/blast/fusion_blast_log.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c22070ae219a5f7f2bf11c2a87527fbad55d2387464a119849102ea80f84174c
+ size 9721
fuson_plm/data/blast/fuson_ht_db.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a92b06df9a78f4a969b691bc545d11923ff07145e02adacfb25fba879573a885
+ size 45861419
fuson_plm/data/blast/plot.py ADDED
@@ -0,0 +1,75 @@
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from fuson_plm.utils.visualizing import set_font
+
+ # Maps stat column names to friendlier plot labels
+ pos_id_label_dict = {
+     'top_UniProt_nIdentities': 'Identities',
+     'top_UniProt_nPositives': 'Positives' # just makes it easier to label these on plots
+ }
+
+ def plot_pos_or_id_pcnt_hist(data, column_name, save_path=None, ax=None):
+     """
+     column_name is 'top_UniProt_nIdentities' or 'top_UniProt_nPositives'
+     """
+     set_font()
+
+     created_fig = ax is None
+     if created_fig:
+         fig, ax = plt.subplots(figsize=(10, 7))
+
+     # Make the sample data
+     data = data[['aa_seq_len', column_name]].dropna() # only keep those with alignments
+     data[column_name] = data[column_name]*100 # scale so the ratio below is a percentage
+     data[f"{column_name} Percent Coverage"] = data[column_name] / data['aa_seq_len']
+
+     # Calculate the mean and median of the percent coverage
+     mean_coverage = data[f"{column_name} Percent Coverage"].mean()
+     median_coverage = data[f"{column_name} Percent Coverage"].median()
+
+     # Plot histogram for percent coverage
+     ax.hist(data[f"{column_name} Percent Coverage"], bins=50, edgecolor='grey', alpha=0.8, color='mediumpurple')
+
+     # Add vertical line for the mean
+     ax.axvline(mean_coverage, color='black', linestyle='--', linewidth=2)
+
+     # Add vertical line for the median
+     ax.axvline(median_coverage, color='black', linestyle='-', linewidth=2)
+
+     # Add text label for the mean line
+     ax.text(mean_coverage, ax.get_ylim()[1] * 0.9, f'Mean: {mean_coverage:.1f}%', color='black',
+             ha='center', va='top', fontsize=40, backgroundcolor='white')
+
+     # Add text label for the median line
+     ax.text(median_coverage, ax.get_ylim()[1] * 0.8, f'Median: {median_coverage:.1f}%', color='black',
+             ha='center', va='top', fontsize=40, backgroundcolor='white')
+
+     # Labels and title
+     plt.xticks(fontsize=24)
+     plt.yticks(fontsize=24)
+     ax.set_xlabel(f"Max % {pos_id_label_dict[column_name]}", fontsize=40)
+     ax.set_ylabel("Count", fontsize=40)
+     #ax.set_title(f"{pos_id_label_dict[column_name]} Percent Coverage (n={len(data):,})", fontsize=40)
+
+     plt.tight_layout()
+
+     # Save the plot
+     if save_path is not None:
+         plt.savefig(save_path, dpi=300)
+
+     # Show the plot only if the figure was created here (no ax was provided)
+     if created_fig:
+         plt.show()
+
+ def group_pos_id_plot(data):
+     set_font()
+     plot_pos_or_id_pcnt_hist(data, 'top_UniProt_nIdentities', save_path="figures/identities_hist.png", ax=None)
+
+ def main():
+     swissprot_top_alignments_df = pd.read_csv("blast_outputs/swissprot_top_alignments.csv")
+     plot_pos_or_id_pcnt_hist(swissprot_top_alignments_df,
+                              'top_UniProt_nIdentities', save_path="figures/identities_hist.png", ax=None)
+
+ if __name__ == '__main__':
+     main()
fuson_plm/data/clean.py ADDED
@@ -0,0 +1,594 @@
+ ## Imports
+ import pandas as pd
+ import numpy as np
+ import os
+ import sys
+ import pickle
+ from fuson_plm.utils.constants import TCGA_CODES, FODB_CODES, VALID_AAS, DELIMITERS
+ from fuson_plm.utils.logging import open_logfile, log_update
+ from fuson_plm.utils.data_cleaning import clean_rows_and_cols, check_columns_for_listlike, check_item_for_listlike, find_delimiters, find_invalid_chars
+ from fuson_plm.data.config import CLEAN
+
+ def clean_fusionpdb(fusionpdb: pd.DataFrame, tcga_codes, delimiters, valid_aas) -> pd.DataFrame:
+     """
+     Return a cleaned version of the raw FusionPDB database, downloaded from the FusionPDB website "Level 1" link
+
+     Args:
+         fusionpdb (pd.DataFrame): The raw FusionPDB database
+         tcga_codes: mapping of TCGA acronyms to full cancer names
+         delimiters: delimiters to check for
+         valid_aas: set of valid amino acid characters
+
+     Returns:
+         pd.DataFrame: A cleaned version of the raw FusionPDB database with no duplicate sequences.
+
+         Columns:
+         - `aa_seq`: amino acid sequence of fusion oncoprotein. each is unique.
+         - `n_fusiongenes`: total number of fusion genes with this amino acid sequence.
+         - `fusiongenes`: comma-separated list of fusion genes (hgene::tgene) for this sequence. e.g., "MINK1::SPNS3,UBE2G1::SPNS3"
+         - `cancers`: comma-separated list of cancer types for this sequence. e.g., "breast invasive carcinoma,stomach adenocarcinoma"
+         - `primary_source`: source FusionPDB pulled the data from
+         - `secondary_source`: FusionPDB
+     """
+     # Process and clean FusionPDB database
+     log_update("Cleaning FusionPDB raw data")
+
+     # FusionPDB is downloaded with no column labels. Fill in column labels here.
+     log_update(f"\tfilling in column names...")
+     fusionpdb = fusionpdb.rename(columns={
+         0: 'ORF_type',
+         1: 'hgene_ens',
+         2: 'tgene_ens',
+         3: '', # no data in this column
+         4: 'primary_source', # database FusionPDB pulled from
+         5: 'cancer',
+         6: 'database_id',
+         7: 'hgene',
+         8: 'hgene_chr',
+         9: 'hgene_bp',
+         10: 'hgene_strand',
+         11: 'tgene',
+         12: 'tgene_chr',
+         13: 'tgene_bp',
+         14: 'tgene_strand',
+         15: 'bp_dna_transcript',
+         16: 'dna_transcript',
+         17: 'aa_seq_len',
+         18: 'aa_seq',
+         19: 'predicted_start_dna_transcript',
+         20: 'predicted_end_dna_transcript'
+     })
+
+     # Clean rows and columns
+     fusionpdb = clean_rows_and_cols(fusionpdb)
+
+     # Check for list-like qualities in the columns we plan to keep
+     cols_of_interest = ['hgene','tgene','cancer','aa_seq','primary_source']
+     listlike_dict = check_columns_for_listlike(fusionpdb, cols_of_interest, delimiters)
+
+     # Add a new column for fusiongene, which combines hgene::tgene. e.g., EWS::FLI1
+     log_update("\tadding a column for fusiongene = hgene::tgene")
+     fusionpdb['fusiongene'] = (fusionpdb['hgene'] + '::' + fusionpdb['tgene']).astype(str)
+
+     # Make 'cancer' column type string to ease downstream processing
+     log_update("\tcleaning the cancer column...")
+     # turn '.' and nan entries into empty string
+     fusionpdb = fusionpdb.replace('.',np.nan)
+     fusionpdb['cancer'] = fusionpdb['cancer'].astype(str).replace('nan','')
+     log_update("\t\tconverting cancer acronyms into full cancer names...")
+     fusionpdb['cancer'] = fusionpdb['cancer'].apply(lambda x: tcga_codes[x].lower() if x in tcga_codes else x.lower())
+     log_update("\t\tconverting all lists into comma-separated...")
+     fusionpdb['cancer'] = fusionpdb['cancer'].str.replace(';',',')
+     fusionpdb['cancer'] = fusionpdb['cancer'].str.replace(', ', ',')
+     fusionpdb['cancer'] = fusionpdb['cancer'].str.strip()
+     fusionpdb['cancer'] = fusionpdb['cancer'].str.strip(',')
+     log_update(f"\t\tchecking for delimiters in the cleaned column...")
+     check_columns_for_listlike(fusionpdb, ['cancer'], delimiters)
+
+     # Now that we've dealt with listlike instances, make dictionary of hgene and tgene to their ensembl strings
+     log_update("\tcreating dictionary of head and tail genes mapped to Ensembl IDs, to be used later for acquiring UniProtAcc for head and tail genes (needed for BLAST analysis)")
+     hgene_to_ensembl_dict = fusionpdb.groupby('hgene').agg(
+         {
+             'hgene_ens': lambda x: ','.join(set(x))
+         }
+     ).reset_index()
+     hgene_to_ensembl_dict = dict(zip(hgene_to_ensembl_dict['hgene'],hgene_to_ensembl_dict['hgene_ens']))
+     tgene_to_ensembl_dict = fusionpdb.groupby('tgene').agg(
+         {
+             'tgene_ens': lambda x: ','.join(set(x))
+         }
+     ).reset_index()
+     tgene_to_ensembl_dict = dict(zip(tgene_to_ensembl_dict['tgene'],tgene_to_ensembl_dict['tgene_ens']))
+     # now, we might have some of the same heads and tails being mapped to different things
+     all_keys = set(hgene_to_ensembl_dict.keys()).union(set(tgene_to_ensembl_dict.keys()))
+     gene_to_ensembl_dict = {}
+     for k in all_keys:
+         ens = hgene_to_ensembl_dict.get(k,'') + ',' + tgene_to_ensembl_dict.get(k,'')
+         ens = ','.join(set(list(ens.strip(',').split(','))))
+         gene_to_ensembl_dict[k] = ens
+     os.makedirs("head_tail_data",exist_ok=True)
+     with open(f"head_tail_data/gene_to_ensembl_dict.pkl", "wb") as f:
+         pickle.dump(gene_to_ensembl_dict, f)
+     total_unique_ens_ids = list(gene_to_ensembl_dict.values())
+     total_unique_ens_ids = set(",".join(total_unique_ens_ids).split(","))
+     log_update(f"\t\tTotal unique head/tail genes: {len(gene_to_ensembl_dict)}\n\t\tTotal unique ensembl ids: {len(total_unique_ens_ids)}")
+
+     # To deal with duplicate sequences, group FusionPDB by sequence and concatenate fusion gene names, cancer types, and primary source
+     log_update(f"\tchecking FusionPDB for duplicate protein sequences...\n\t\toriginal size: {len(fusionpdb)}")
+     duplicates = fusionpdb[fusionpdb.duplicated('aa_seq')]['aa_seq'].unique().tolist()
+     n_fgenes_with_duplicates = len(fusionpdb[fusionpdb['aa_seq'].isin(duplicates)]['fusiongene'].unique())
+     n_rows_with_duplicates = len(fusionpdb[fusionpdb['aa_seq'].isin(duplicates)])
+     log_update(f"\t\t{len(duplicates)} duplicated sequences, corresponding to {n_rows_with_duplicates} rows and {n_fgenes_with_duplicates} distinct fusiongenes")
+     log_update(f"\tgrouping FusionPDB by amino acid sequence...")
+     # Merge step
+     fusionpdb = pd.merge(
+         fusionpdb.groupby('aa_seq').agg({
+             'fusiongene': lambda x: x.nunique()}).reset_index().rename(columns={'fusiongene':'n_fusiongenes'}),
+         fusionpdb.groupby('aa_seq').agg({
+             'fusiongene': lambda x: ','.join(x),
+             'cancer': lambda x: ','.join(x),
+             'primary_source': lambda x: ','.join(x)}).reset_index().rename(columns={'fusiongene':'fusiongenes', 'cancer': 'cancers', 'primary_source':'primary_sources'}),
+         on='aa_seq'
+     )
+     # Turn each aggregated column into sorted, comma-separated list
+     fusionpdb['fusiongenes'] = fusionpdb['fusiongenes'].apply(lambda x: (',').join(sorted(set(x.split(','))))).str.strip(',')
+     fusionpdb['cancers'] = fusionpdb['cancers'].apply(lambda x: (',').join(sorted(set(x.split(','))))).str.strip(',')
+     fusionpdb['primary_sources'] = fusionpdb['primary_sources'].apply(lambda x: (',').join(sorted(set(x.split(','))))).str.strip(',')
+
+     # Count and display sequences with >1 fusion gene
+     duplicates = fusionpdb.loc[fusionpdb['n_fusiongenes']>1]['aa_seq'].tolist()
+     log_update(f"\t\treorganized database contains {len(duplicates)} proteins with >1 fusion gene")
+     log_update(f"\t\treorganized database contains {len(fusionpdb)} unique oncofusion sequences")
+
+     # Find invalid amino acids for each sequence and log_update the results
+     fusionpdb['invalid_chars'] = fusionpdb['aa_seq'].apply(lambda x: find_invalid_chars(x, valid_aas))
+     all_invalid_chars = set().union(*fusionpdb['invalid_chars'])
+     log_update(f"\tchecking for invalid characters...\n\t\tset of all invalid characters discovered within FusionPDB: {all_invalid_chars}")
+
+     # Filter out any sequences with invalid amino acids
+     fusionpdb = fusionpdb[fusionpdb['invalid_chars'].str.len()==0].reset_index(drop=True).drop(columns=['invalid_chars'])
+     log_update(f"\tremoving invalid characters...\n\t\tremaining sequences with valid AAs only: {len(fusionpdb)}")
+
+     # Add a column for secondary source - FusionPDB.
+     fusionpdb['secondary_source'] = ['FusionPDB']*len(fusionpdb)
+
+     # Final checks of database cleanliness
+     log_update(f"\tperforming final checks on cleaned FusionPDB...")
+     duplicates = len(fusionpdb.loc[fusionpdb['aa_seq'].duplicated()]['aa_seq'].tolist())
+     log_update(f"\t\t{duplicates} duplicate sequences")
+     invalids=0
+     for x in all_invalid_chars:
+         invalids += len(fusionpdb.loc[fusionpdb['aa_seq'].str.contains(x)])
+     log_update(f"\t\t{invalids} proteins containing invalid characters")
+     all_unique_seqs = len(fusionpdb)==len(fusionpdb['aa_seq'].unique())
+     log_update(f"\t\tevery row contains a unique oncofusion sequence: {all_unique_seqs}")
+
+     return fusionpdb
+
+ def clean_fodb(fodb: pd.DataFrame, fodb_codes, delimiters, valid_aas) -> pd.DataFrame:
+     """
+     Cleans the FOdb database
+
+     Args:
+         fodb (pd.DataFrame): raw FOdb.
+         fodb_codes: mapping of FOdb cancer acronyms to full cancer names
+         delimiters: delimiters to check for
+         valid_aas: set of valid amino acid characters
+
+     Returns:
+         pd.DataFrame: a cleaned version of FOdb with no duplicate sequences.
+
+         Columns:
+         - `aa_seq`: amino acid sequence of fusion oncoprotein. each is unique.
+         - `n_fusiongenes`: total number of fusion genes with this amino acid sequence.
+         - `fusiongenes`: comma-separated list of fusion genes (hgene::tgene) for this sequence. e.g., "MINK1::SPNS3,UBE2G1::SPNS3"
+         - `cancers`: comma-separated list of cancer types for this sequence. e.g., "breast invasive carcinoma,stomach adenocarcinoma"
+         - `primary_source`: source FOdb pulled the data from
+         - `secondary_source`: FOdb
+     """
+     log_update("Cleaning FOdb raw data")
+
+     fodb['FO_Name'] = fodb['FO_Name'].apply(lambda x: x.split("_")[0]+"::"+x.split("_")[1])
+     fodb = fodb.rename(columns={'Sequence_Source': 'primary_source', 'FO_Name': 'fusiongene', 'AA_Sequence': 'aa_seq'})
+
+     # Clean rows and columns
+     fodb = clean_rows_and_cols(fodb)
+
+     # HEY1::NCOA2 has a "-" on the end by mistake. Replace this with '' for benchmarking purposes
+     special_seq = "MKRAHPEYSSSDSELDETIEVEKESADENGNLSSALGSMSPTTSSQILARKRRRGIIEKRRRDRINNSLSELRRLVPSAFEKQGSAKLEKAEILQMTVDHLKMLHTAGGKAFNNPRPGQLGRLLPNQNLPLDITLQSPTGAGPFPPIRNSSPYSVIPQPGMMGNQGMIGNQGNLGNSSTGMIGNSASRPTMPSGEWAPQSSAVRVTCAATTSAMNRPVQGGMIRNPAASIPMRPSSQPGQRQTLQSQVMNIGPSELEMNMGGPQYSQQQAPPNQTAPWPESILPIDQASFASQNRQPFGSSPDDLLCPHPAAESPSDEGALLDQLYLALRNFDGLEEIDRALGIPELVSQSQAVDPEQFSSQDSNIMLEQKAPVFPQQYASQAQMAQGSYSPMQDPNFHTMGQRPSYATLRMQPRPGLRPTGLVQNQPNQLRLQLQHRLQAQQNRQPLMNQISNVSNVNLTLRPGVPTQAPINAQMLAQRQREILNQHLRQRQMHQQQQVQQRTLMMRGQGLNMTPSMVAPSGIPATMSNPRIPQANAQQFPFPPNYGISQQPDPGFTGATTPQSPLMSPRMAHTQSPMMQQSQANPAYQAPSDINGWAQGNMGGNSMFSQQSPPHFGQQANTSMYSNNMNINVSMATNTGGMSSMNQMTGQISMTSVTSVPTSGLSSMGPEQVNDPALRGGNLFPNQLPGMDMIKQEGDTTRKYC-"
+     special_seq_name = "HEY1::NCOA2"
+     fodb.loc[
+         (fodb['fusiongene']==special_seq_name) &
+         (fodb['aa_seq']==special_seq), 'aa_seq'
+     ] = special_seq.replace('-','')
+
+     # filter out anything remaining with invalid characters
+     fodb['invalid_chars'] = fodb['aa_seq'].apply(lambda x: find_invalid_chars(x, valid_aas))
+     all_invalid_chars = set().union(*fodb['invalid_chars'])
+     log_update(f"\tchecking for invalid characters...\n\t\tset of all invalid characters discovered within FOdb: {all_invalid_chars}")
+
+     fodb = fodb[fodb['invalid_chars'].str.len()==0].reset_index(drop=True).drop(columns=['invalid_chars'])
+     log_update(f"\tremoving invalid characters...\n\t\tremaining sequences with valid AAs only: {len(fodb)}")
+
+     # aggregate the cancer data - if there's a 1 in the column, add it to the list of affected cancers
+     # acronym -> cancer conversions based on Supplementary Table 3 of the FOdb paper (Tripathi et al. 2023)
+     log_update(f"\taggregating cancer data from {len(fodb.columns)-4} individual cancer columns into one...")
+     log_update(f"\t\tchanging cancer names from acronyms to full")
+     cancers = list(fodb.columns)[4::]
+     fodb['cancers'] = ['']*len(fodb)
+     for cancer in cancers:
+         mapped_cancer = fodb_codes[cancer].lower() if cancer in fodb_codes else cancer
+         fodb['cancers'] = fodb.apply(
+             lambda row: row['cancers'] + f'{mapped_cancer},' if row[cancer] == 1 else row['cancers'],
+             axis=1
+         )
+     fodb['cancers'] = fodb['cancers'].str.strip(',').replace('nan','')
+     fodb = fodb.drop(columns=['Patient_Count']+cancers)
+
+     # Check for list-like qualities in the columns we plan to keep
+     cols_of_interest = ['primary_source','fusiongene','aa_seq','cancers']
+     listlike_dict = check_columns_for_listlike(fodb, cols_of_interest, delimiters)
+
+     # To deal with duplicate sequences, group FOdb by sequence and concatenate fusion gene names, cancer types, and primary source
+     log_update(f"\tchecking FOdb for duplicate protein sequences...\n\t\toriginal size: {len(fodb)}")
+     duplicates = fodb[fodb.duplicated('aa_seq')]['aa_seq'].unique().tolist()
+     n_fgenes_with_duplicates = len(fodb[fodb['aa_seq'].isin(duplicates)]['fusiongene'].unique())
+     n_rows_with_duplicates = len(fodb[fodb['aa_seq'].isin(duplicates)])
+     log_update(f"\t\t{len(duplicates)} duplicated sequences, corresponding to {n_rows_with_duplicates} rows and {n_fgenes_with_duplicates} distinct fusiongenes")
+     log_update(f"\tgrouping FOdb by amino acid sequence...")
+     # Merge step
+     fodb = pd.merge(
+         fodb.groupby('aa_seq').agg({
+             'fusiongene': lambda x: x.nunique()}).reset_index().rename(columns={'fusiongene':'n_fusiongenes'}),
+         fodb.groupby('aa_seq').agg({
+             'fusiongene': lambda x: ','.join(x),
+             'cancers': lambda x: ','.join(x),
+             'primary_source': lambda x: ','.join(x)}).reset_index().rename(columns={'fusiongene':'fusiongenes', 'primary_source':'primary_sources'}),
+         on='aa_seq'
+     )
+     # Turn each aggregated column into sorted, comma-separated list
+     fodb['fusiongenes'] = fodb['fusiongenes'].apply(lambda x: (',').join(sorted(set(x.split(','))))).str.strip(',')
+     fodb['cancers'] = fodb['cancers'].apply(lambda x: (',').join(sorted(set(x.split(','))))).str.strip(',')
+     fodb['primary_sources'] = fodb['primary_sources'].apply(lambda x: (',').join(sorted(set(x.split(','))))).str.strip(',')
+
+     # Count and display sequences with >1 fusion gene
+     duplicates = fodb.loc[fodb['n_fusiongenes']>1]['aa_seq'].tolist()
+     log_update(f"\t\treorganized database contains {len(duplicates)} proteins with >1 fusion gene")
+     log_update(f"\t\treorganized database contains {len(fodb)} unique oncofusion sequences")
+
+     # Add secondary source column because FOdb is the secondary source here.
+     fodb['secondary_source'] = ['FOdb']*len(fodb)
+
+     # Final checks of database cleanliness
+     log_update(f"\tperforming final checks on cleaned FOdb...")
+     duplicates = len(fodb.loc[fodb['aa_seq'].duplicated()]['aa_seq'].tolist())
+     log_update(f"\t\t{duplicates} duplicate sequences")
+     invalids=0
+     for x in all_invalid_chars:
+         invalids += len(fodb.loc[fodb['aa_seq'].str.contains(x)])
+     log_update(f"\t\t{invalids} proteins containing invalid characters")
+     all_unique_seqs = len(fodb)==len(fodb['aa_seq'].unique())
+     log_update(f"\t\tevery row contains a unique oncofusion sequence: {all_unique_seqs}")
+
+     return fodb
+
+ def create_fuson_db(fusionpdb: pd.DataFrame, fodb: pd.DataFrame) -> pd.DataFrame:
+     """
+     Merges cleaned FusionPDB and FOdb to create fuson_db (the full set of fusion sequences for training/benchmarking FusOn-pLM)
+
+     Args:
+         fusionpdb (pd.DataFrame): cleaned FusionPDB database
+         fodb (pd.DataFrame): cleaned FOdb database
+     """
+     log_update("Creating the merged database...")
+
+     log_update("\tconcatenating cleaned FusionPDB and cleaned FOdb...")
+     fuson_db = pd.concat(
+         [
+             fusionpdb.rename(columns={'secondary_source':'secondary_sources'}),
+             fodb.rename(columns={'secondary_source':'secondary_sources'})
+         ]
+     )
+
+     # Handle duplicate amino acid sequences
+     log_update(f"\tchecking merged database for duplicate protein sequences...\n\t\toriginal size: {len(fuson_db)}")
+     duplicates = fuson_db[fuson_db.duplicated('aa_seq')]['aa_seq'].unique().tolist()
+     n_fgenes_with_duplicates = len(fuson_db[fuson_db['aa_seq'].isin(duplicates)]['fusiongenes'].unique())
+     n_rows_with_duplicates = len(fuson_db[fuson_db['aa_seq'].isin(duplicates)])
+     log_update(f"\t\t{len(duplicates)} duplicated sequences, corresponding to {n_rows_with_duplicates} rows and {n_fgenes_with_duplicates} distinct fusiongenes")
+     log_update(f"\tgrouping database by amino acid sequence...")
+
+     fuson_db = fuson_db.groupby('aa_seq').agg(
+         {
+             'fusiongenes': lambda x: ','.join(x),
+             'cancers': lambda x: ','.join(x),
+             'primary_sources': lambda x: ','.join(x),
+             'secondary_sources': lambda x: ','.join(x)
+         }
+     ).reset_index()
+     duplicates = fuson_db.loc[fuson_db['fusiongenes'].str.count(',')>0]['aa_seq'].tolist()
+     log_update(f"\t\treorganized database contains {len(duplicates)} proteins with >1 fusion gene")
+     log_update(f"\t\treorganized database contains {len(fuson_db)} unique oncofusion sequences")
+
+     # Turn each aggregated column into a set of only the unique entries
+     for column in fuson_db.columns[1::]:
+         fuson_db[column] = fuson_db[column].apply(lambda x: (',').join(sorted(set(
+             [y for y in x.split(',') if len(y)>0]))))
+
+     # Add a column for length
+     log_update(f"\tadding a column for length...")
+     fuson_db['length'] = fuson_db['aa_seq'].apply(lambda x: len(x))
+
+     # Sort by fusiongenes, then length
+     log_update(f"\tsorting by fusion gene name, then length...")
+     fuson_db = fuson_db.sort_values(by=['fusiongenes','length'],ascending=[True,True]).reset_index(drop=True)
+
+     # Add a seq_id column: seq1, seq2, ..., seqn
+     log_update(f"\tadding sequence ids: seq1, seq2, ..., seqn")
+     fuson_db['seq_id'] = ['seq'+str(i+1) for i in range(len(fuson_db))]
+
+     # Final checks of database cleanliness
+     log_update(f"\tperforming final checks on fuson_db...")
+     duplicates = len(fuson_db.loc[fuson_db['aa_seq'].duplicated()]['aa_seq'].tolist())
+     log_update(f"\t\t{duplicates} duplicate sequences")
+     all_unique_seqs = len(fuson_db)==len(fuson_db['aa_seq'].unique())
+     log_update(f"\t\tevery row contains a unique oncofusion sequence: {all_unique_seqs}")
+
+     return fuson_db
+
+ def head_tail_mappings(fuson_db):
+     log_update("\nGenes and Ensembl IDs corresponding to the head and tail proteins have been mapped on UniProt. Now, combining these results.")
+
+     # Read the ensembl map, gene name map, and dictionary from gene --> ensembl ids
+     ensembl_map = pd.read_csv("head_tail_data/ensembl_ht_idmap.txt",sep="\t")
+     name_map = pd.read_csv("head_tail_data/genename_ht_idmap.txt",sep="\t")
+     with open("head_tail_data/gene_to_ensembl_dict.pkl", "rb") as f:
+         gene_ens_dict = pickle.load(f)
+
+     log_update(f"\tCheck: ensembl map and gene name map have same columns: {set(ensembl_map.columns)==set(name_map.columns)}")
+     log_update(f"\t\tColumns = {list(ensembl_map.columns)}")
+
+     # Prepare to merge
+     log_update(f"\tMerging the ensembl map and gene name map:")
+     ensembl_map = ensembl_map.rename(columns={'From': 'ensembl_id'}) # mapped from ensembl ids
+     name_map = name_map.rename(columns={'From': 'htgene'}) # mapped from head or tail genes
+     name_map['ensembl_id'] = name_map['htgene'].map(gene_ens_dict) # add ensembl id column based on head and tail genes
+     name_map['ensembl_id'] = name_map['ensembl_id'].apply(lambda x: x.split(',') if type(x)==str else x) # split into a list if there are multiple matches
+     log_update(f"\t\tLength of gene-based map before exploding ensembl_id column: {len(name_map)}")
+     name_map = name_map.explode('ensembl_id') # explode so each ensembl id is its own line
+     log_update(f"\t\tLength of gene-based map after exploding ensembl_id column: {len(name_map)}")
+     log_update(f"\t\tLength of ensembl-based map: {len(ensembl_map)}")
+     unimap = pd.merge(name_map[['htgene','ensembl_id','Entry','Reviewed']],
+                       ensembl_map[['ensembl_id','Entry','Reviewed']],
+                       on=['ensembl_id','Entry','Reviewed'],
+                       how='outer'
+                       )
+     unimap['Reviewed'] = unimap['Reviewed'].apply(lambda x: '1' if x=='reviewed' else '0' if x=='unreviewed' else 'N') # N for nan
+     log_update(f"\t\tLength of merge: {len(unimap)}. Merge preview:")
+     log_update(unimap.head())
+     unimap = unimap.drop_duplicates(['htgene','Entry','Reviewed']).reset_index(drop=True)
+     log_update(f"\t\tLength of merge after dropping rows where only ensembl_id changed: {len(unimap)}. Merge preview:")
+     log_update(unimap.head())
+     unimap = unimap.groupby('htgene').agg(
+         {
+             'Entry': lambda x: ','.join(x),
+             'Reviewed': lambda x: ''.join(x)
+         }
+     ).reset_index()
+     unimap = unimap.rename(columns={
+         'htgene': 'Gene',
+         'Entry': 'UniProtID',
+     })
+     log_update(f"\t\tLength of merge after grouping by gene name: {len(unimap)}. Merge preview:")
+     log_update(unimap.head())
+
+     # what are the proteins whose head or tail genes are in this list?
+     log_update(f"\tChecking which fusion proteins have unmappable heads and/or tails:")
+     temp = fuson_db.copy(deep=True)
+     temp['fusiongenes'] = temp['fusiongenes'].apply(lambda x: x.split(','))
+     temp = temp.explode('fusiongenes')
+     temp['hgene'] = temp['fusiongenes'].str.split('::',expand=True)[0]
+     temp['tgene'] = temp['fusiongenes'].str.split('::',expand=True)[1]
+
+     # See which gene IDs weren't covered
+     log_update(f"\tChecking which gene IDs were not mapped by either method")
+     all_geneids = temp['hgene'].tolist() + temp['tgene'].tolist()
+     all_geneids = list(set(all_geneids))
+     all_mapped_genes = unimap['Gene'].unique().tolist()
+     unmapped_geneids = set(all_geneids) - set(all_mapped_genes)
+     log_update(f"\t\t{len(all_mapped_genes)}/{len(all_geneids)} were mapped\n\t\t{len(unmapped_geneids)}/{len(all_geneids)} were unmapped")
+     log_update(f"\t\tUnmapped geneids: {','.join(unmapped_geneids)}")
+
+     # Find the ok ones and print
+     ok_seqs = temp.loc[
+         (temp['hgene'].isin(all_mapped_genes)) | # head gene was found, OR
+         (temp['tgene'].isin(all_mapped_genes)) # tail gene was found
+     ]['seq_id'].unique().tolist()
+     ok_seqsh = temp.loc[
+         (temp['hgene'].isin(all_mapped_genes)) # head gene was found
+     ]['seq_id'].unique().tolist()
+     ok_seqst = temp.loc[
+         (temp['tgene'].isin(all_mapped_genes)) # tail gene was found
+     ]['seq_id'].unique().tolist()
+     ok_seqsboth = temp.loc[
+         (temp['hgene'].isin(all_mapped_genes)) & # head gene was found, AND
+         (temp['tgene'].isin(all_mapped_genes)) # tail gene was found
+     ]['seq_id'].unique().tolist()
+
+     log_update(f"\tTotal fusion sequence ids: {len(temp['seq_id'].unique())}")
+     log_update(f"\tFusion sequences with at least 1 mapped constituent:\
+     \n\t\tMapped head: {len(ok_seqsh)}\
+     \n\t\tMapped tail: {len(ok_seqst)}\
+     \n\t\tMapped head or tail: {len(ok_seqs)}\
+     \n\t\tMapped head AND tail: {len(ok_seqsboth)}")
+
+     # Now look at the bad side
+     atleast_1_lost = temp.loc[
+         ((temp['hgene'].isin(unmapped_geneids)) & ~(temp['seq_id'].isin(ok_seqsh))) | # head not found in row, AND head not found for seq_id - OR
+         ((temp['tgene'].isin(unmapped_geneids)) & ~(temp['seq_id'].isin(ok_seqst))) # tail not found in row, AND tail not found for seq_id
+     ]['seq_id'].unique().tolist()
+     atleast_1_losth = temp.loc[
+         (temp['hgene'].isin(unmapped_geneids)) & # head not found in this row AND
+         ~(temp['seq_id'].isin(ok_seqsh)) # head not found for this seq id
+     ]['seq_id'].unique().tolist()
+     atleast_1_lostt = temp.loc[
+         (temp['tgene'].isin(unmapped_geneids)) & # tail not found in this row AND
+         ~(temp['seq_id'].isin(ok_seqst)) # tail not found for this seq id
+     ]['seq_id'].unique().tolist()
+     both_lost = temp.loc[
+         ((temp['hgene'].isin(unmapped_geneids)) & ~(temp['seq_id'].isin(ok_seqsh))) & # there's no head, and this seq id has no head - AND
+         ((temp['tgene'].isin(unmapped_geneids)) & ~(temp['seq_id'].isin(ok_seqst))) # there's no tail, and this seq id has no tail
+     ]['seq_id'].unique().tolist()
+     log_update(f"\tFusion sequences with at least 1 unmapped constituent:")
+     log_update(f"\t\tUnmapped head: {len(atleast_1_losth)}\
+     \n\t\tUnmapped tail: {len(atleast_1_lostt)}\
+     \n\t\tUnmapped head or tail: {len(atleast_1_lost)}\
+     \n\t\tUnmapped head AND tail: {len(both_lost)}")
+     log_update(f"\tseq_ids with at least 1 unmapped part: {atleast_1_lost}")
+
+     assert len(ok_seqsboth) + len(atleast_1_lost) == len(temp['seq_id'].unique())
+     log_update(f"\tFusions with H&T covered plus Fusions with H|T lost = total = {len(ok_seqsboth)} + {len(atleast_1_lost)} = {len(ok_seqsboth)+len(atleast_1_lost)} = {len(temp['seq_id'].unique())}")
+
+     ### Save the unimap
+     unimap.to_csv('head_tail_data/htgenes_uniprotids.csv',index=False)
+
+ def assemble_uniprot_query(path_to_gene_ens_dict="head_tail_data/gene_to_ensembl_dict.pkl",path_to_fuson_db="fuson_db.csv"):
+     """
+     To analyze the BLAST results effectively, we must know which UniProt accessions we *expect* to see for each fusion oncoprotein.
+     We will try to map each FO to its head and tail accessions by searching UniProt ID map by gene name and Ensembl ID.
+
+     This method will create two input lists for UniProt:
+     - head_tail_genes.txt: list of all unique head and tail gene names
+     - head_tail_ens.txt: list of all unique head and tail Ensembl IDs
+     """
+     log_update("\nMaking inputs for UniProt ID map, to find accessions for head and tail genes")
+     if not(os.path.exists(path_to_gene_ens_dict)):
+         raise Exception(f"File {path_to_gene_ens_dict} does not exist")
+
+     with open(path_to_gene_ens_dict, "rb") as f:
+         gene_ens_dict = pickle.load(f)
+
+     all_htgenes_temp = list(gene_ens_dict.keys())
+     all_ens = list(gene_ens_dict.values())
+     all_ens = list(set(",".join(all_ens).split(",")))
+     log_update(f"\tTotal unique head and tail genes, only accounting for FusionPDB: {len(all_htgenes_temp)}")
+
+     # need to add other htgenes from FOdb (via fuson_db)
+     fuson_db = pd.read_csv(path_to_fuson_db)
+     fuson_db['fusiongenes'] = fuson_db['fusiongenes'].apply(lambda x: x.split(','))
+     fuson_db = fuson_db.explode('fusiongenes')
+     fuson_db['hgene'] = fuson_db['fusiongenes'].str.split('::',expand=True)[0]
+     fuson_db['tgene'] = fuson_db['fusiongenes'].str.split('::',expand=True)[1]
+     fuson_htgenes = fuson_db['hgene'].tolist() + fuson_db['tgene'].tolist()
+     fuson_htgenes = set(fuson_htgenes)
+     all_htgenes = set(all_htgenes_temp).union(set(fuson_htgenes))
+     all_htgenes = list(set(all_htgenes))
+
+     log_update(f"\tTotal unique head and tail genes after adding FOdb: {len(all_htgenes)}")
+     log_update(f"\tTotal unique ensembl IDs: {len(all_ens)}")
+     # go through each and write a file
+     input_dir = "head_tail_data/uniprot_idmap_inputs"
+     os.makedirs(input_dir,exist_ok=True)
+
+     if os.path.exists(f"{input_dir}/head_tail_genes.txt"):
+         log_update("\nAlready assembled UniProt ID mapping input for head and tail genes. Continuing")
+     else:
+         with open(f"{input_dir}/head_tail_genes.txt", "w") as f:
+             for i, gene in enumerate(all_htgenes):
+                 if i!=len(all_htgenes)-1:
+                     f.write(f"{gene}\n")
+                 else:
+                     f.write(f"{gene}")
+
+     if os.path.exists(f"{input_dir}/head_tail_ens.txt"):
+         log_update("\nAlready assembled UniProt ID mapping input for head and tail ensembl IDs. Continuing")
+     else:
+         with open(f"{input_dir}/head_tail_ens.txt", "w") as f:
+             for i, ens in enumerate(all_ens):
+                 if i!=len(all_ens)-1:
+                     f.write(f"{ens}\n")
+                 else:
+                     f.write(f"{ens}")
+
+ def main():
+     # Define global variables from config.CLEAN
+     FODB_PATH = CLEAN.FODB_PATH
+     FODB_PUNCTA_PATH = CLEAN.FODB_PUNCTA_PATH
+     FUSIONPDB_PATH = CLEAN.FUSIONPDB_PATH
+     LOG_PATH = "data_cleaning_log.txt"
+     SAVE_CLEANED_FODB = False
+
+     # Prepare the log file
+     with open_logfile(LOG_PATH):
+         log_update("Loaded data-cleaning configurations from config.py")
+         CLEAN.print_config(indent='\t')
+
+         log_update("Reading FusionPDB...")
+         fusionpdb = pd.read_csv(FUSIONPDB_PATH,sep='\t',header=None)
+         fusionpdb = clean_fusionpdb(fusionpdb, TCGA_CODES, DELIMITERS, VALID_AAS)
+
+         log_update("Saving FusionPDB to FusionPDB_cleaned.csv...")
+         fusionpdb.to_csv('raw_data/FusionPDB_cleaned.csv', index=False)
+
+         # Clean FOdb, optionally save
+         log_update("Reading FOdb...")
+         fodb = pd.read_csv(FODB_PATH)
+         fodb = clean_fodb(fodb, FODB_CODES, DELIMITERS, VALID_AAS)
+
+         if SAVE_CLEANED_FODB:
+             log_update("Saving FOdb to FOdb_cleaned.csv...")
+             fodb.to_csv('FOdb_cleaned.csv', index=False)
+
+         # Merge FusionPDB and FOdb to fuson_db
+         fuson_db = create_fuson_db(fusionpdb, fodb)
+
+         # Mark benchmarking sequences
+         # FOdb puncta benchmark
+         log_update("Adding benchmarking sequences to fuson_db...")
+         fodb_puncta = pd.read_csv(FODB_PUNCTA_PATH)
+
+         # handle the mistake sequence - take the "-" off the end
+         special_seq = "MKRAHPEYSSSDSELDETIEVEKESADENGNLSSALGSMSPTTSSQILARKRRRGIIEKRRRDRINNSLSELRRLVPSAFEKQGSAKLEKAEILQMTVDHLKMLHTAGGKAFNNPRPGQLGRLLPNQNLPLDITLQSPTGAGPFPPIRNSSPYSVIPQPGMMGNQGMIGNQGNLGNSSTGMIGNSASRPTMPSGEWAPQSSAVRVTCAATTSAMNRPVQGGMIRNPAASIPMRPSSQPGQRQTLQSQVMNIGPSELEMNMGGPQYSQQQAPPNQTAPWPESILPIDQASFASQNRQPFGSSPDDLLCPHPAAESPSDEGALLDQLYLALRNFDGLEEIDRALGIPELVSQSQAVDPEQFSSQDSNIMLEQKAPVFPQQYASQAQMAQGSYSPMQDPNFHTMGQRPSYATLRMQPRPGLRPTGLVQNQPNQLRLQLQHRLQAQQNRQPLMNQISNVSNVNLTLRPGVPTQAPINAQMLAQRQREILNQHLRQRQMHQQQQVQQRTLMMRGQGLNMTPSMVAPSGIPATMSNPRIPQANAQQFPFPPNYGISQQPDPGFTGATTPQSPLMSPRMAHTQSPMMQQSQANPAYQAPSDINGWAQGNMGGNSMFSQQSPPHFGQQANTSMYSNNMNINVSMATNTGGMSSMNQMTGQISMTSVTSVPTSGLSSMGPEQVNDPALRGGNLFPNQLPGMDMIKQEGDTTRKYC-"
+         special_seq_name = "HEY1_NCOA2"
+         fodb_puncta.loc[
+             (fodb_puncta['FO_Name']==special_seq_name) &
+             (fodb_puncta['AAseq']==special_seq), 'AAseq'
+         ] = special_seq.replace('-','')
+
+         fodb_puncta_sequences = fodb_puncta['AAseq'].unique().tolist()
+         benchmark_sequences = dict(zip(fodb_puncta_sequences, ['Puncta']*len(fodb_puncta_sequences)))
+         log_update(f"\tRead FOdb puncta data and isolated {len(benchmark_sequences)} sequences for puncta benchmark")
+         # Biological discovery benchmark
+         benchmark_sequences2 = fuson_db.loc[
+             (fuson_db['fusiongenes'].str.contains('EWSR1::FLI1')) |
+             (fuson_db['fusiongenes'].str.contains('PAX3::FOXO1')) |
+             (fuson_db['fusiongenes'].str.contains('BCR::ABL1')) |
+             (fuson_db['fusiongenes'].str.contains('EML4::ALK'))
+         ]['aa_seq'].unique().tolist()
+         benchmark_sequences2 = dict(zip(benchmark_sequences2, ['Biological Discovery']*len(benchmark_sequences2)))
+         log_update(f"\tIsolated all EWSR1::FLI1, PAX3::FOXO1, BCR::ABL1, and EML4::ALK sequences ({len(benchmark_sequences2)} total) for biological benchmarks...")
+
+         for k, v in benchmark_sequences2.items():
+             if k in benchmark_sequences:
+                 benchmark_sequences[k] = benchmark_sequences[k] + ',' + v
+             else:
+                 benchmark_sequences[k] = v
+
+         log_update(f"\tTotal unique benchmark sequences: {len(benchmark_sequences)}")
+         # Add benchmark column
+         log_update("\tAdding benchmark column...")
+         fuson_db['benchmark'] = fuson_db['aa_seq'].apply(lambda x: benchmark_sequences[x] if x in benchmark_sequences else np.nan)
+
+         # Save fuson_db
+         log_update("\nWriting final database to fuson_db.csv...")
+         fuson_db.to_csv('fuson_db.csv', index=False)
+         log_update("Cleaning complete.")
+
+         # Assemble head tail queries for UniProt
+         assemble_uniprot_query()
+
+         # Do the head tail mappings
+         head_tail_mappings(fuson_db)
+
+ if __name__ == '__main__':
+     main()
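The duplicate-handling strategy in `clean_fusionpdb`, `clean_fodb`, and `create_fuson_db` is the same groupby/aggregate pattern throughout: collapse rows that share an `aa_seq` and join their annotations into sorted, comma-separated lists. A toy illustration (made-up sequences and gene names):

```python
import pandas as pd

df = pd.DataFrame({
    "aa_seq":     ["MKV", "MKV", "MPL"],
    "fusiongene": ["EWSR1::FLI1", "FUS::ERG", "PAX3::FOXO1"],
})
# one row per unique sequence, with a sorted, comma-separated gene list
grouped = df.groupby("aa_seq").agg(
    {"fusiongene": lambda x: ",".join(sorted(set(x)))}
).reset_index()
# aa_seq 'MKV' now carries 'EWSR1::FLI1,FUS::ERG'
```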
fuson_plm/data/cluster.py ADDED
@@ -0,0 +1,50 @@
+ import pandas as pd
+ import numpy as np
+ import os
+ import subprocess
+ import sys
+ from Bio import SeqIO
+ import shutil
+ from fuson_plm.utils.logging import open_logfile, log_update
+ from fuson_plm.data.config import CLUSTER
+ # NOTE (assumption): make_fasta, run_mmseqs_clustering, analyze_clustering_result, and
+ # cluster_summary are helper functions defined elsewhere in the package; they are used
+ # below but not imported in this file as committed.
+
+ def main():
+     # Read all the input args
+     LOG_PATH = "clustering_log.txt"
+     INPUT_PATH = CLUSTER.INPUT_PATH
+     MIN_SEQ_ID = CLUSTER.MIN_SEQ_ID
+     C = CLUSTER.C
+     COV_MODE = CLUSTER.COV_MODE
+     PATH_TO_MMSEQS = CLUSTER.PATH_TO_MMSEQS
+     MAX_SEQ_LENGTH = CLUSTER.MAX_SEQ_LENGTH
+
+     with open_logfile(LOG_PATH):
+         log_update("Input params from config.py:")
+         CLUSTER.print_config(indent='\t')
+         # Make a subfolder for clustering results, and direct MMseqs2 outputs here
+         if not(os.path.exists("clustering")):
+             os.mkdir("clustering")
+         output_dir = "clustering/raw_output"
+
+         # Make fasta of input file
+         sequences = pd.read_csv(INPUT_PATH)
+         log_update(f"\nPreparing input data...\n\tInitial dataset size: {len(sequences)} sequences")
+
+         sequences = sequences.loc[sequences['aa_seq'].str.len() <= MAX_SEQ_LENGTH].reset_index(drop=True)
+         log_update(f"\tApplied length cutoff of {MAX_SEQ_LENGTH} AAs. New dataset size: {len(sequences)} sequences")
+
+         sequences = dict(zip(sequences['seq_id'],sequences['aa_seq']))
+         fasta_path = make_fasta(sequences, "clustering/input.fasta")
+         log_update(f"\tMade fasta of input sequences, saved at {fasta_path}")
+
+         run_mmseqs_clustering(fasta_path, output_dir, min_seq_id=MIN_SEQ_ID, c=C, cov_mode=COV_MODE, path_to_mmseqs=PATH_TO_MMSEQS)
+
+         # Brief read to preview results
+         clusters = analyze_clustering_result('clustering/input.fasta', 'clustering/raw_output/mmseqs_cluster.tsv')
+         # Save clusters
+         clusters.to_csv('clustering/mmseqs_full_results.csv',index=False)
+         log_update("Processed and combined mmseqs output. Wrote comprehensive results to clustering/mmseqs_full_results.csv")
+         cluster_summary(clusters)
+
+ if __name__ == "__main__":
+     main()
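For reference, with the `CLUSTER` parameters from `config.py`, the MMseqs2 invocation wrapped by `run_mmseqs_clustering` should be roughly equivalent to the `easy-cluster` workflow below. This is a sketch; the actual wrapper and its output paths may differ:

```python
import subprocess

# Assumed equivalent of run_mmseqs_clustering with config.CLUSTER values
subprocess.run([
    "../mmseqs", "easy-cluster",
    "clustering/input.fasta",        # input fasta written by make_fasta
    "clustering/raw_output/mmseqs",  # output prefix
    "clustering/tmp",                # scratch directory
    "--min-seq-id", "0.3",           # CLUSTER.MIN_SEQ_ID
    "-c", "0.8",                     # CLUSTER.C
    "--cov-mode", "0",               # CLUSTER.COV_MODE (bidirectional coverage)
], check=True)
```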
fuson_plm/data/clustering/input.fasta ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76ad3e500fd5e51220d210c2fdef65d761a9f9e8b7962c94bcc79b093408f7b7
+ size 27788610
fuson_plm/data/clustering/mmseqs_full_results.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e503a6511f93265a964fd105200a05fa957d9fc2e0edee37dbb3f0b0f55486e
+ size 55967813
fuson_plm/data/clustering_log.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73371a8475ebcbef54f15b9c48caa32da3b2ebd6ffac224677c8208792fef41d
+ size 2931
fuson_plm/data/config.py ADDED
@@ -0,0 +1,34 @@
+ # config.py
+ from fuson_plm.utils.logging import CustomParams
+
+ CLEAN = CustomParams(
+     ### Changing these parameters is not recommended
+     FODB_PATH = '../data/raw_data/FOdb_all.csv', # path to raw FOdb database
+     FODB_PUNCTA_PATH = '../data/raw_data/FOdb_puncta.csv', # path to raw FOdb puncta experimental data
+     FUSIONPDB_PATH = '../data/raw_data/FusionPDB.txt', # path to raw FusionPDB Level 1 .txt download
+ )
+
+ # Clustering Parameters
+ CLUSTER = CustomParams(
+     MAX_SEQ_LENGTH = 2000, # INCLUSIVE max length (amino acids) of a sequence for training, validation, or testing
+
+     # MMseqs2 parameters: see GitHub or the MMseqs2 wiki for guidance
+     MIN_SEQ_ID = 0.3, # % identity
+     C = 0.8, # % sequence length overlap
+     COV_MODE = 0, # cov-mode: 0 = bidirectional, 1 = target coverage, 2 = query coverage, 3 = target-in-query length coverage.
+     # File paths
+     INPUT_PATH = '../data/fuson_db.csv',
+     PATH_TO_MMSEQS = '../mmseqs' # path to where you installed MMseqs2
+ )
+
+ # Splitting Parameters
+ # We randomly split clusters in two rounds to arrive at a Train, Validation, and Test set.
+ # Round 1) All clusters -> Train (final) and Other (temp). Round 2) Other (temp) clusters -> Val (final) and Test (final)
+ SPLIT = CustomParams(
+     FUSON_DB_PATH = '../data/fuson_db.csv',
+     CLUSTER_OUTPUT_PATH = '../data/clustering/mmseqs_full_results.csv',
+     RANDOM_STATE_1 = 2, # random_state_1 = state for splitting all data into train & other
+     TEST_SIZE_1 = 0.18, # test size for the data -> train/other split. e.g. 0.20 means 80% of clusters in train, 20% in other
+     RANDOM_STATE_2 = 6, # random_state_2 = state for splitting other from ^ into val and test
+     TEST_SIZE_2 = 0.44 # test size for the other -> val/test split. e.g. 0.44 means 56% of "other" clusters in val, 44% in test
+ )
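Taken together, the two-round `SPLIT` fractions imply the approximate overall proportions below (these are fractions of clusters, not of sequences, since clusters vary in size):

```python
TEST_SIZE_1, TEST_SIZE_2 = 0.18, 0.44

train = 1 - TEST_SIZE_1                  # round 1: 82% of clusters -> train
val   = TEST_SIZE_1 * (1 - TEST_SIZE_2)  # round 2: 56% of "other" -> val
test  = TEST_SIZE_1 * TEST_SIZE_2        # round 2: 44% of "other" -> test
print(f"train={train:.2%}, val={val:.2%}, test={test:.2%}")
# train=82.00%, val=10.08%, test=7.92%
```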
fuson_plm/data/data_cleaning_log.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3210df509a0c50e4ec07f9df396ff166abf85212342e203eb9d9dac1115eca71
+ size 10381
fuson_plm/data/fuson_db.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5c841ff582a45ba2427ed504d965ba01ab49e6cb2f3dacf2e4e3cbedad255d3
+ size 37076062
fuson_plm/data/head_tail_data/ensembl_ht_idmap.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4ae973ad3b86408e684efc0af63249af50cc8c6e1bce73465550dc0a9c2bc839
+ size 28978535
fuson_plm/data/head_tail_data/gene_to_ensembl_dict.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46f49a7d49c00e80aa4426da6245558d7d36cf21525620d3f1d5339c1772df40
+ size 547183
fuson_plm/data/head_tail_data/genename_ht_idmap.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7de358687c013017ca30c67b613d861794430f8c8e040478710e56535801c92
+ size 54844814
fuson_plm/data/head_tail_data/htgenes_uniprotids.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f1ff59b6e41f58585f0ce588d2232bfe5605219b047f478c6ce74ecd2715a1d
+ size 889031
fuson_plm/data/head_tail_data/isoform_fasta_id_output_formatted.fasta ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3a43d06330e000c1f81d905fb3bba4e911f5530f6d34d1476f206bea4a420ddd
+ size 41148731
fuson_plm/data/head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a85c21273e9cfdb0baa5e22f8a033413c68299d71ba7a8deecdf93d057d280ac
+ size 442383
fuson_plm/data/head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b9ce1957875bdf05670d6c7412fa082896837714573197c6d92c5777cb24746
+ size 67736
fuson_plm/data/raw_data/FOdb_SD5.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4b4fae3b8ab9661500ac34dc5e96069f76118f79ceac8e101d5361fe5e46d4b4
+ size 19345
fuson_plm/data/raw_data/FOdb_all.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:acd531078289d83d42cdd8c0031b682612d396d92eea2e1e8b1871044424fdb0
+ size 3876082
fuson_plm/data/raw_data/FOdb_puncta.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ece2df3daef17ef73676ec35074fe9534038160a82b8125411ab4f7fefed54b
+ size 237498
fuson_plm/data/raw_data/FusionPDB.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4040f181272d77ca5ab72fd85d00c6c36d7edd43613cebb44357f582eac7f3db
+ size 531417333
fuson_plm/data/raw_data/FusionPDB_cleaned.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:351ef6f4f93e40859b5cef19ab5ac0729c6eeda8ac732fbe4bed7b68e1c5c7d2
+ size 34245297
fuson_plm/data/split.py ADDED
@@ -0,0 +1,120 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ import pandas as pd
+ import os
+ import pickle
+ from fuson_plm.data.config import SPLIT
+ from fuson_plm.utils.logging import log_update, open_logfile
+ from fuson_plm.utils.splitting import split_clusters, check_split_validity
+ from fuson_plm.utils.visualizing import set_font, visualize_splits
+
+ def get_benchmark_data(fuson_db_path, clusters):
+     """
+     Identify the benchmark sequences and the clusters containing them, so these
+     clusters can be reserved for the test set.
+     """
+     # Read the fusion database
+     fuson_db = pd.read_csv(fuson_db_path)
+
+     # Get original benchmark sequences, and benchmark sequences that were clustered
+     original_benchmark_sequences = fuson_db.loc[fuson_db['benchmark'].notna()]
+     benchmark_sequences = fuson_db.loc[
+         (fuson_db['benchmark'].notna()) &                        # it's a benchmark sequence
+         (fuson_db['aa_seq'].isin(list(clusters['member seq'])))  # it was clustered (it's under the length limit specified for clustering)
+     ]['aa_seq'].to_list()
+
+     # Get the sequence IDs of all benchmark sequences
+     benchmark_seq_ids = fuson_db.loc[fuson_db['benchmark'].notna()]['seq_id']
+
+     # Use benchmark_seq_ids to find which clusters contain benchmark sequences
+     benchmark_cluster_reps = clusters.loc[clusters['member seq_id'].isin(benchmark_seq_ids)]['representative seq_id'].unique().tolist()
+     log_update(f"\t{len(benchmark_sequences)}/{len(original_benchmark_sequences)} benchmarking sequences (only those shorter than config.CLUSTERING['max_seq_length']) were grouped into {len(benchmark_cluster_reps)} clusters. These will be reserved for the test set.")
+
+     return benchmark_cluster_reps, benchmark_sequences
+
+ def get_training_dfs(train, val, test):
+     log_update('\nMaking dataframes for ESM finetuning...')
+
+     # Drop the cluster-related columns we don't need
+     train = train.drop(columns=['representative seq_id', 'member seq_id', 'representative seq']).rename(columns={'member seq': 'sequence'})
+     val = val.drop(columns=['representative seq_id', 'member seq_id', 'representative seq']).rename(columns={'member seq': 'sequence'})
+     test = test.drop(columns=['representative seq_id', 'member seq_id', 'representative seq']).rename(columns={'member seq': 'sequence'})
+
+     return train, val, test
+
+ def main():
+     """
+     Split the clustered sequences into train, validation, and test sets, then
+     save both the cluster splits and the dataframes used for training.
+     """
+     # Read all the input files
+     LOG_PATH = "splitting_log.txt"
+     FUSON_DB_PATH = SPLIT.FUSON_DB_PATH
+     CLUSTER_OUTPUT_PATH = SPLIT.CLUSTER_OUTPUT_PATH
+     RANDOM_STATE_1 = SPLIT.RANDOM_STATE_1
+     TEST_SIZE_1 = SPLIT.TEST_SIZE_1
+     RANDOM_STATE_2 = SPLIT.RANDOM_STATE_2
+     TEST_SIZE_2 = SPLIT.TEST_SIZE_2
+
+     # Set font
+     set_font()
+
+     # Prepare the log file
+     with open_logfile(LOG_PATH):
+         log_update("Loaded data-splitting configurations from config.py")
+         SPLIT.print_config(indent='\t')
+
+         # Prepare directory to save results
+         os.makedirs("splits", exist_ok=True)
+
+         # Read the clusters and get a list of the representative IDs for splitting
+         clusters = pd.read_csv(CLUSTER_OUTPUT_PATH)
+         reps = clusters['representative seq_id'].unique().tolist()
+         log_update(f"\nPreparing clusters...\n\tCollected {len(reps)} clusters for splitting")
+
+         # Get the benchmark cluster representatives and sequences
+         benchmark_cluster_reps, benchmark_sequences = get_benchmark_data(FUSON_DB_PATH, clusters)
+
+         # Make the splits and extract the results
+         splits = split_clusters(reps, benchmark_cluster_reps=benchmark_cluster_reps,
+                                 random_state_1=RANDOM_STATE_1, random_state_2=RANDOM_STATE_2,
+                                 test_size_1=TEST_SIZE_1, test_size_2=TEST_SIZE_2)
+         X_train = splits['X_train']
+         X_val = splits['X_val']
+         X_test = splits['X_test']
+
+         # Make slices of the clusters dataframe for train, val, and test
+         train_clusters = clusters.loc[clusters['representative seq_id'].isin(X_train)].reset_index(drop=True)
+         val_clusters = clusters.loc[clusters['representative seq_id'].isin(X_val)].reset_index(drop=True)
+         test_clusters = clusters.loc[clusters['representative seq_id'].isin(X_test)].reset_index(drop=True)
+
+         # Check validity
+         check_split_validity(train_clusters, val_clusters, test_clusters, benchmark_sequences=benchmark_sequences)
+
+         # Print min and max sequence lengths
+         min_train_seqlen = min(train_clusters['member seq'].str.len())
+         max_train_seqlen = max(train_clusters['member seq'].str.len())
+         min_val_seqlen = min(val_clusters['member seq'].str.len())
+         max_val_seqlen = max(val_clusters['member seq'].str.len())
+         min_test_seqlen = min(test_clusters['member seq'].str.len())
+         max_test_seqlen = max(test_clusters['member seq'].str.len())
+         log_update(f"\nLength breakdown summary...\n\tTrain: min seq length = {min_train_seqlen}, max seq length = {max_train_seqlen}")
+         log_update(f"\tVal: min seq length = {min_val_seqlen}, max seq length = {max_val_seqlen}")
+         log_update(f"\tTest: min seq length = {min_test_seqlen}, max seq length = {max_test_seqlen}")
+
+         # Make plots to visualize the splits
+         visualize_splits(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps)
+
+         # cols = representative seq_id, member seq_id, representative seq, member seq
+         train_clusters.to_csv("../data/splits/train_cluster_split.csv", index=False)
+         val_clusters.to_csv("../data/splits/val_cluster_split.csv", index=False)
+         test_clusters.to_csv("../data/splits/test_cluster_split.csv", index=False)
+         log_update('\nSaved cluster splits to splits/train_cluster_split.csv, splits/val_cluster_split.csv, splits/test_cluster_split.csv')
+         cols = ','.join(list(train_clusters.columns))
+         log_update(f'\tColumns: {cols}')
+
+         # Make train_df, val_df, and test_df: the data that will be input to the training script
+         train_df, val_df, test_df = get_training_dfs(train_clusters, val_clusters, test_clusters)
+         train_df.to_csv("../data/splits/train_df.csv", index=False)
+         val_df.to_csv("../data/splits/val_df.csv", index=False)
+         test_df.to_csv("../data/splits/test_df.csv", index=False)
+         log_update('\nSaved training dataframes to splits/train_df.csv, splits/val_df.csv, splits/test_df.csv')
+         cols = ','.join(list(train_df.columns))
+         log_update(f'\tColumns: {cols}')
+
+ if __name__ == "__main__":
+     main()
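For intuition, here is a minimal sketch of the cluster-level splitting that `split.py` performs. The column names mirror those logged above; the sequences and the final assignment are hypothetical stand-ins for what `split_clusters` computes:

```python
import pandas as pd

# Toy stand-in for the mmseqs cluster table read from SPLIT.CLUSTER_OUTPUT_PATH
# (hypothetical sequences; the real table is clustering/mmseqs_full_results.csv).
clusters = pd.DataFrame({
    "representative seq_id": ["seq1", "seq1", "seq2", "seq3"],
    "member seq_id":         ["seq1", "seq4", "seq2", "seq3"],
    "representative seq":    ["MKVL", "MKVL", "MAAG", "MPRS"],
    "member seq":            ["MKVL", "MKVI", "MAAG", "MPRS"],
})

# Splits are made over cluster representatives, not individual sequences, so
# homologous members (seq1 and seq4 above) can never straddle a split boundary.
reps = clusters["representative seq_id"].unique().tolist()
train_reps, test_reps = reps[:2], reps[2:]  # stand-in for split_clusters' output
train_clusters = clusters.loc[clusters["representative seq_id"].isin(train_reps)]
test_clusters = clusters.loc[clusters["representative seq_id"].isin(test_reps)]
```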
fuson_plm/data/split_vis.py ADDED
@@ -0,0 +1,333 @@
+ import matplotlib.pyplot as plt
+ import numpy as np
+ from scipy.stats import entropy
+ from sklearn.manifold import TSNE
+ import pickle
+ import pandas as pd
+ import os
+ from fuson_plm.utils.logging import log_update
+ from fuson_plm.utils.visualizing import set_font
+
+ def calculate_aa_composition(sequences):
+     composition = {}
+     total_length = sum([len(seq) for seq in sequences])
+
+     for seq in sequences:
+         for aa in seq:
+             if aa in composition:
+                 composition[aa] += 1
+             else:
+                 composition[aa] = 1
+
+     # Convert counts to relative frequencies
+     for aa in composition:
+         composition[aa] /= total_length
+
+     return composition
+
+ def calculate_shannon_entropy(sequence):
+     """
+     Calculate the Shannon entropy for a given sequence.
+
+     Args:
+         sequence (str): A sequence of characters (e.g., amino acids or nucleotides).
+
+     Returns:
+         float: Shannon entropy value.
+     """
+     bases = set(sequence)
+     counts = [sequence.count(base) for base in bases]
+     return entropy(counts, base=2)
+
+ def visualize_splits_hist(train_lengths, val_lengths, test_lengths, colormap, savepath='../data/splits/length_distributions.png', axes=None):
+     log_update('\nMaking histogram of length distributions')
+     # Create a figure and axes with 1 row and 3 columns
+     if axes is None:
+         fig, axes = plt.subplots(1, 3, figsize=(18, 6))
+
+     # Unpack the axis labels
+     xlabel, ylabel = ['Sequence Length (AA)', 'Frequency']
+
+     # Plot the first histogram
+     axes[0].hist(train_lengths, bins=20, edgecolor='k', color=colormap['train'])
+     axes[0].set_xlabel(xlabel, fontsize=24)
+     axes[0].set_ylabel(ylabel, fontsize=24)
+     axes[0].set_title(f'Train Set Length Distribution (n={len(train_lengths)})', fontsize=24)
+     axes[0].grid(True)
+     axes[0].set_axisbelow(True)
+     axes[0].tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
+     axes[0].tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
+
+     # Plot the second histogram
+     axes[1].hist(val_lengths, bins=20, edgecolor='k', color=colormap['val'])
+     axes[1].set_xlabel(xlabel, fontsize=24)
+     axes[1].set_ylabel(ylabel, fontsize=24)
+     axes[1].set_title(f'Validation Set Length Distribution (n={len(val_lengths)})', fontsize=24)
+     axes[1].grid(True)
+     axes[1].set_axisbelow(True)
+     axes[1].tick_params(axis='x', labelsize=24)
+     axes[1].tick_params(axis='y', labelsize=24)
+
+     # Plot the third histogram
+     axes[2].hist(test_lengths, bins=20, edgecolor='k', color=colormap['test'])
+     axes[2].set_xlabel(xlabel, fontsize=24)
+     axes[2].set_ylabel(ylabel, fontsize=24)
+     axes[2].set_title(f'Test Set Length Distribution (n={len(test_lengths)})', fontsize=24)
+     axes[2].grid(True)
+     axes[2].set_axisbelow(True)
+     axes[2].tick_params(axis='x', labelsize=24)
+     axes[2].tick_params(axis='y', labelsize=24)
+
+     # Adjust layout and save the figure
+     if savepath is not None:
+         plt.tight_layout()
+         plt.savefig(savepath)
+
+ def visualize_splits_scatter(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, colormap, savepath='../data/splits/scatterplot.png', ax=None):
+     log_update("\nMaking scatterplot with distribution of cluster sizes across train, test, and val")
+     # Make grouped versions of these DataFrames for size analysis
+     train_clustersgb = train_clusters.groupby('representative seq_id')['member seq_id'].count().reset_index().rename(columns={'member seq_id': 'member count'})
+     val_clustersgb = val_clusters.groupby('representative seq_id')['member seq_id'].count().reset_index().rename(columns={'member seq_id': 'member count'})
+     test_clustersgb = test_clusters.groupby('representative seq_id')['member seq_id'].count().reset_index().rename(columns={'member seq_id': 'member count'})
+
+     # Isolate benchmark-containing clusters so their contribution can be plotted separately
+     total_test_proteins = sum(test_clustersgb['member count'])
+     test_clustersgb['benchmark cluster'] = test_clustersgb['representative seq_id'].isin(benchmark_cluster_reps)
+     benchmark_clustersgb = test_clustersgb.loc[test_clustersgb['benchmark cluster']].reset_index(drop=True)
+     test_clustersgb = test_clustersgb.loc[~test_clustersgb['benchmark cluster']].reset_index(drop=True)
+
+     # Convert them to value counts
+     train_clustersgb = train_clustersgb['member count'].value_counts().reset_index().rename(columns={'index': 'cluster size (n_members)', 'member count': 'n_clusters'})
+     val_clustersgb = val_clustersgb['member count'].value_counts().reset_index().rename(columns={'index': 'cluster size (n_members)', 'member count': 'n_clusters'})
+     test_clustersgb = test_clustersgb['member count'].value_counts().reset_index().rename(columns={'index': 'cluster size (n_members)', 'member count': 'n_clusters'})
+     benchmark_clustersgb = benchmark_clustersgb['member count'].value_counts().reset_index().rename(columns={'index': 'cluster size (n_members)', 'member count': 'n_clusters'})
+
+     # Get the percentage of each dataset that's made of each cluster size
+     train_clustersgb['n_proteins'] = train_clustersgb['cluster size (n_members)'] * train_clustersgb['n_clusters']  # proteins per cluster * n clusters = n proteins
+     train_clustersgb['percent_proteins'] = train_clustersgb['n_proteins'] / sum(train_clustersgb['n_proteins'])
+     val_clustersgb['n_proteins'] = val_clustersgb['cluster size (n_members)'] * val_clustersgb['n_clusters']
+     val_clustersgb['percent_proteins'] = val_clustersgb['n_proteins'] / sum(val_clustersgb['n_proteins'])
+     test_clustersgb['n_proteins'] = test_clustersgb['cluster size (n_members)'] * test_clustersgb['n_clusters']
+     test_clustersgb['percent_proteins'] = test_clustersgb['n_proteins'] / total_test_proteins
+     benchmark_clustersgb['n_proteins'] = benchmark_clustersgb['cluster size (n_members)'] * benchmark_clustersgb['n_clusters']
+     benchmark_clustersgb['percent_proteins'] = benchmark_clustersgb['n_proteins'] / total_test_proteins
+
+     # Specially mark the benchmark clusters because these can't be reallocated
+     if ax is None:
+         fig, ax = plt.subplots(figsize=(18, 6))
+
+     ax.plot(train_clustersgb['cluster size (n_members)'], train_clustersgb['percent_proteins'], linestyle='None', marker='.', color=colormap['train'], label='train')
+     ax.plot(val_clustersgb['cluster size (n_members)'], val_clustersgb['percent_proteins'], linestyle='None', marker='.', color=colormap['val'], label='val')
+     ax.plot(test_clustersgb['cluster size (n_members)'], test_clustersgb['percent_proteins'], linestyle='None', marker='.', color=colormap['test'], label='test')
+     ax.plot(benchmark_clustersgb['cluster size (n_members)'], benchmark_clustersgb['percent_proteins'],
+             marker='o',
+             linestyle='None',
+             markerfacecolor=colormap['test'],  # fill same as test
+             markeredgecolor='black',           # outline black
+             markeredgewidth=1.5,
+             label='benchmark'
+             )
+     ax.set_ylabel('Percentage of Proteins in Dataset', fontsize=24)
+     ax.set_xlabel('Cluster Size', fontsize=24)
+     ax.tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
+     ax.tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
+
+     ax.legend(fontsize=24, markerscale=4)
+
+     # Save the figure
+     if savepath is not None:
+         plt.tight_layout()
+         plt.savefig(savepath)
+         log_update(f"\tSaved figure to {savepath}")
+
+ def get_avg_embeddings_for_tsne(train_sequences, val_sequences, test_sequences, embedding_path='fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl'):
+     embeddings = {}
+
+     try:
+         with open(embedding_path, 'rb') as f:
+             embeddings = pickle.load(f)
+
+         train_embeddings = [v for k, v in embeddings.items() if k in train_sequences]
+         val_embeddings = [v for k, v in embeddings.items() if k in val_sequences]
+         test_embeddings = [v for k, v in embeddings.items() if k in test_sequences]
+
+         return train_embeddings, val_embeddings, test_embeddings
+     except Exception as e:
+         log_update(f"Could not open embeddings at {embedding_path}: {e}")
+         raise
+
+ def visualize_splits_tsne(train_sequences, val_sequences, test_sequences, colormap, esm_type="esm2_t33_650M_UR50D", embedding_path="fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl", savepath='../data/splits/tsne_plot.png', ax=None):
+     """
+     Generate a t-SNE plot of embeddings for the train, validation, and test sets.
+     """
+     log_update('\nMaking t-SNE plot of train, val, and test embeddings')
+     # Combine the embeddings into one array
+     train_embeddings, val_embeddings, test_embeddings = get_avg_embeddings_for_tsne(train_sequences, val_sequences, test_sequences, embedding_path=embedding_path)
+     embeddings = np.concatenate([train_embeddings, val_embeddings, test_embeddings])
+
+     # Labels for the embeddings
+     labels = ['train'] * len(train_embeddings) + ['val'] * len(val_embeddings) + ['test'] * len(test_embeddings)
+
+     # Perform t-SNE
+     tsne = TSNE(n_components=2, random_state=42)
+     tsne_results = tsne.fit_transform(embeddings)
+
+     # Convert t-SNE results into a DataFrame
+     tsne_df = pd.DataFrame(data=tsne_results, columns=['TSNE_1', 'TSNE_2'])
+     tsne_df['label'] = labels
+
+     # Plotting
+     if ax is None:
+         fig, ax = plt.subplots(figsize=(10, 8))
+     else:
+         fig = ax.get_figure()
+
+     # Scatter plot for each set
+     for label, color in colormap.items():
+         subset = tsne_df[tsne_df['label'] == label].reset_index(drop=True)
+         ax.scatter(subset['TSNE_1'], subset['TSNE_2'], c=color, label=label.capitalize(), alpha=0.6)
+
+     ax.set_title(f't-SNE of {esm_type} Embeddings')
+     ax.set_xlabel('t-SNE Dimension 1')
+     ax.set_ylabel('t-SNE Dimension 2')
+     ax.legend(fontsize=24, markerscale=2)
+     ax.grid(True)
+
+     # Save the figure if savepath is provided
+     if savepath:
+         plt.tight_layout()
+         fig.savefig(savepath)
+
+ def visualize_splits_shannon_entropy(train_sequences, val_sequences, test_sequences, colormap, savepath='../data/splits/shannon_entropy_plot.png', axes=None):
+     """
+     Generate Shannon entropy plots for the train, validation, and test sets.
+     """
+     log_update('\nMaking histogram of Shannon entropy distributions')
+     train_entropy = [calculate_shannon_entropy(seq) for seq in train_sequences]
+     val_entropy = [calculate_shannon_entropy(seq) for seq in val_sequences]
+     test_entropy = [calculate_shannon_entropy(seq) for seq in test_sequences]
+
+     if axes is None:
+         fig, axes = plt.subplots(1, 3, figsize=(18, 6))
+
+     axes[0].hist(train_entropy, bins=20, edgecolor='k', color=colormap['train'])
+     axes[0].set_title(f'Train Set (n={len(train_entropy)})', fontsize=24)
+     axes[0].set_xlabel('Shannon Entropy', fontsize=24)
+     axes[0].set_ylabel('Frequency', fontsize=24)
+     axes[0].grid(True)
+     axes[0].set_axisbelow(True)
+     axes[0].tick_params(axis='x', labelsize=24)
+     axes[0].tick_params(axis='y', labelsize=24)
+
+     axes[1].hist(val_entropy, bins=20, edgecolor='k', color=colormap['val'])
+     axes[1].set_title(f'Validation Set (n={len(val_entropy)})', fontsize=24)
+     axes[1].set_xlabel('Shannon Entropy', fontsize=24)
+     axes[1].grid(True)
+     axes[1].set_axisbelow(True)
+     axes[1].tick_params(axis='x', labelsize=24)
+     axes[1].tick_params(axis='y', labelsize=24)
+
+     axes[2].hist(test_entropy, bins=20, edgecolor='k', color=colormap['test'])
+     axes[2].set_title(f'Test Set (n={len(test_entropy)})', fontsize=24)
+     axes[2].set_xlabel('Shannon Entropy', fontsize=24)
+     axes[2].grid(True)
+     axes[2].set_axisbelow(True)
+     axes[2].tick_params(axis='x', labelsize=24)
+     axes[2].tick_params(axis='y', labelsize=24)
+
+     if savepath is not None:
+         plt.tight_layout()
+         plt.savefig(savepath)
+
+ def visualize_splits_aa_composition(train_sequences, val_sequences, test_sequences, colormap, savepath='../data/splits/aa_comp.png', ax=None):
+     log_update('\nMaking bar plot of AA composition across each set')
+     train_comp = calculate_aa_composition(train_sequences)
+     val_comp = calculate_aa_composition(val_sequences)
+     test_comp = calculate_aa_composition(test_sequences)
+
+     # Create DataFrame
+     comp_df = pd.DataFrame([train_comp, val_comp, test_comp], index=['train', 'val', 'test']).T
+     colors = [colormap[col] for col in comp_df.columns]
+
+     # Plotting
+     if ax is None:
+         fig, ax = plt.subplots(figsize=(12, 6))
+     else:
+         fig = ax.get_figure()
+
+     comp_df.plot(kind='bar', color=colors, ax=ax)
+     ax.set_title('Amino Acid Composition Across Datasets', fontsize=24)
+     ax.set_xlabel('Amino Acid', fontsize=24)
+     ax.set_ylabel('Relative Frequency', fontsize=24)
+     ax.tick_params(axis='x', labelsize=24)  # Customize x-axis tick label size
+     ax.tick_params(axis='y', labelsize=24)  # Customize y-axis tick label size
+     ax.legend(fontsize=16, markerscale=2)
+
+     if savepath is not None:
+         fig.savefig(savepath)
+
+ def visualize_splits(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, train_color='#0072B2', val_color='#009E73', test_color='#E69F00', esm_embeddings_path=None, onehot_embeddings_path=None):
+     colormap = {
+         'train': train_color,
+         'val': val_color,
+         'test': test_color
+     }
+     # Add columns for plotting
+     train_clusters['member length'] = train_clusters['member seq'].str.len()
+     val_clusters['member length'] = val_clusters['member seq'].str.len()
+     test_clusters['member length'] = test_clusters['member seq'].str.len()
+
+     # Prepare lengths and seqs for plotting
+     train_lengths = train_clusters['member length'].tolist()
+     val_lengths = val_clusters['member length'].tolist()
+     test_lengths = test_clusters['member length'].tolist()
+     train_sequences = train_clusters['member seq'].tolist()
+     val_sequences = val_clusters['member seq'].tolist()
+     test_sequences = test_clusters['member seq'].tolist()
+
+     # Create a combined figure with 3 rows and 3 columns
+     fig_combined, axs = plt.subplots(3, 3, figsize=(24, 18))
+
+     # Make the visualization plots for saving TOGETHER
+     visualize_splits_hist(train_lengths, val_lengths, test_lengths, colormap, savepath=None, axes=axs[0])
+     visualize_splits_shannon_entropy(train_sequences, val_sequences, test_sequences, colormap, savepath=None, axes=axs[1])
+     visualize_splits_scatter(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, colormap, savepath=None, ax=axs[2, 0])
+     visualize_splits_aa_composition(train_sequences, val_sequences, test_sequences, colormap, savepath=None, ax=axs[2, 1])
+     if esm_embeddings_path is not None and os.path.exists(esm_embeddings_path):
+         visualize_splits_tsne(train_sequences, val_sequences, test_sequences, colormap, savepath=None, ax=axs[2, 2])
+     else:
+         # Leave the last subplot blank
+         axs[2, 2].axis('off')
+
+     plt.tight_layout()
+     fig_combined.savefig('../data/splits/combined_plot.png')
+
+     # Make the visualization plots again for saving separately
+     visualize_splits_hist(train_lengths, val_lengths, test_lengths, colormap)
+     visualize_splits_scatter(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps, colormap)
+     visualize_splits_aa_composition(train_sequences, val_sequences, test_sequences, colormap)
+     visualize_splits_shannon_entropy(train_sequences, val_sequences, test_sequences, colormap)
+     if esm_embeddings_path is not None and os.path.exists(esm_embeddings_path):
+         visualize_splits_tsne(train_sequences, val_sequences, test_sequences, colormap)
+
+ def main():
+     set_font()
+     train_clusters = pd.read_csv('splits/train_cluster_split.csv')
+     val_clusters = pd.read_csv('splits/val_cluster_split.csv')
+     test_clusters = pd.read_csv('splits/test_cluster_split.csv')
+
+     clusters = pd.concat([train_clusters, val_clusters, test_clusters])
+
+     fuson_db = pd.read_csv('fuson_db.csv')
+     # Get the sequence IDs of all benchmark sequences
+     benchmark_seq_ids = fuson_db.loc[fuson_db['benchmark'].notna()]['seq_id']
+     # Use benchmark_seq_ids to find which clusters contain benchmark sequences
+     benchmark_cluster_reps = clusters.loc[clusters['member seq_id'].isin(benchmark_seq_ids)]['representative seq_id'].unique().tolist()
+
+     visualize_splits(train_clusters, val_clusters, test_clusters, benchmark_cluster_reps,
+                      esm_embeddings_path='fuson_db_embeddings/fuson_db_esm2_t33_650M_UR50D_avg_embeddings.pkl', onehot_embeddings_path=None)
+
+ if __name__ == "__main__":
+     main()
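As a quick check on `calculate_shannon_entropy` above: `scipy.stats.entropy` normalizes the raw character counts to probabilities before computing the entropy, so the function matches the hand calculation:

```python
from scipy.stats import entropy

# For the toy sequence "AAB", counts over {A, B} are [2, 1].
# entropy normalizes to p = [2/3, 1/3], giving
# H = -(2/3)*log2(2/3) - (1/3)*log2(1/3) ≈ 0.918 bits
print(entropy([2, 1], base=2))  # ~0.918
```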
fuson_plm/data/splits/combined_plot.png ADDED
fuson_plm/data/splits/test_cluster_split.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ed561338b86264b420b12e37e92c8c434b3119081a9a02a7688c1934343ee5fb
+ size 5628545
fuson_plm/data/splits/test_df.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9590919f2c09430dc6f8b617b8a738f7e174fb04fdae1f35ceaa0351ea05612f
+ size 32236663
fuson_plm/data/splits/train_cluster_split.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1be258917ebd87f7cf71223e6fa340e7b6228464c42ae0b630c24efea8d2bd14
+ size 44850849
fuson_plm/data/splits/train_df.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ddf563734014c7d4fec944c639431d3423d2ab79e1e6e9e800c955c24438c8eb
+ size 257270565
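Note that the split CSVs above are committed as Git LFS pointer files (the `version`/`oid`/`size` stanzas shown), not the data itself. A small sketch for guarding against accidentally parsing a pointer as CSV (path taken from the listing above):

```python
import pandas as pd

path = "fuson_plm/data/splits/train_df.csv"
with open(path) as f:
    first_line = f.readline()

# LFS pointers begin with this version line; the real CSV begins with a header row.
if first_line.startswith("version https://git-lfs.github.com/spec/v1"):
    raise RuntimeError(f"{path} is a Git LFS pointer; run `git lfs pull` to fetch the data")

train_df = pd.read_csv(path)
```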