get_state_embs <cls> token missing in token dictionary
Hello,
I am trying to run the in_silico_perturbation.ipynb notebook.
The dataset is from https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/cell_classification/disease_classification.
get_state_embs raised this error:
AssertionError: <cls> token missing in token dictionary
I checked token_dictionary_gc30M.pkl in geneformer/gene_dictionaries_30m and it showed:
Sample keys: ['', '
The input_ids in human_dcm_hcm_nf.dataset look like this:
{'input_ids': [17173, 9837, 4456, 2080, 9311, 10856, 8764, 6073, 4262, 16585, 12769, 1375, 14751, 13169, 15935, 9730, 8681
It seems that the token codes for the genes are different.
I added "emb_mode='cell', cell_emb_style='mean_pool', gene_emb_style='mean_pool'" to EmbExtractor
embex = EmbExtractor(model_type="CellClassifier", # if using previously fine-tuned cell classifier model
num_classes=3,
filter_data=filter_data_dict,
max_ncells=1000,
emb_layer=0,
summary_stat="exact_mean",
forward_batch_size=256,
nproc=16,
emb_mode='cell', cell_emb_style='mean_pool', gene_emb_style='mean_pool'
)
After that, isp.perturb_data ran to completion.
But when I ran "get_stats", it showed another error:
Traceback (most recent call last):
File "try2.py", line 15, in <module>
ispstats.get_stats("/human_dcm_hcm_nf_output_2", ...)
File "/geneformer/lib/python3.11/site-packages/geneformer/in_silico_perturber_stats.py", line 1022, in get_stats
"Ensembl_ID": [
File "/geneformer/lib/python3.11/site-packages/geneformer/in_silico_perturber_stats.py", line 1027, in <listcomp>
else self.gene_token_id_dict[genes]
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 20280
I think it is caused by a mismatch between the Ensembl gene IDs and the numeric token codes.
Could you please tell me what I should do to solve this problem?
Best regards,
Susie
Thanks for your question! In all functions, the default dictionary is currently the 95M dictionary, so please pass the 30M dictionary as an argument if you are using the 30M model/data.
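If it helps to confirm which dictionary a dataset was tokenized with, here is a minimal sketch using only the standard library. The helper names and the toy dictionaries are made up for illustration and are not part of Geneformer. The real check would load token_dictionary_gc30M.pkl with `load_token_dictionary`. The argument name mentioned in the final comment is an assumption, so please check the signatures in your installed version.

```python
import pickle


def load_token_dictionary(path):
    """Load a pickled {gene_or_special_token: token_id} mapping."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe_token_dictionary(token_dict):
    """Summarize the properties that distinguish one dictionary from another."""
    return {
        "size": len(token_dict),
        "has_cls": "<cls>" in token_dict,
        "max_token_id": max(token_dict.values()),
    }


def unknown_token_ids(input_ids, token_dict):
    """Return the token IDs of a tokenized cell that the dictionary cannot decode."""
    known = set(token_dict.values())
    return sorted(set(input_ids) - known)


# Toy stand-ins with made-up token IDs; the real dictionaries are much larger.
toy_without_cls = {"<pad>": 0, "<mask>": 1, "ENSG00000000003": 2, "ENSG00000000005": 3}
toy_with_cls = {"<pad>": 0, "<mask>": 1, "<cls>": 2, "<eos>": 3, "ENSG00000000003": 4}

summary_without_cls = describe_token_dictionary(toy_without_cls)
summary_with_cls = describe_token_dictionary(toy_with_cls)

# A token ID outside the dictionary is what would surface as a KeyError on lookup.
missing = unknown_token_ids([2, 3, 20280], toy_without_cls)

print(summary_without_cls)
print(summary_with_cls)
print(missing)

# Assumption: the Geneformer classes accept the dictionary path through an
# argument such as token_dictionary_file; verify this against your version.
```

If the dictionary that matches your data has no `<cls>` entry, or the dataset contains token IDs the default dictionary cannot decode, that would be consistent with both errors above and with the dictionary needing to be passed explicitly at each step (embedding extraction, perturbation, and stats).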