{"cells":[{"cell_type":"markdown","metadata":{},"source":["#### The MSA search process directly reuses ColabFold\n","\n","\n","Refer to https://github.com/sokrypton/ColabFold?tab=readme-ov-file"]},{"cell_type":"code","execution_count":null,"metadata":{"vscode":{"languageId":"shellscript"}},"outputs":[],"source":["# Install miniforge\n","# https://github.com/conda-forge/miniforge#download\n","conda create -n colabfold_search python=3.10\n","conda activate colabfold_search\n","cd ~/ && mkdir -p workdir && cd workdir"]},{"cell_type":"code","execution_count":null,"metadata":{"vscode":{"languageId":"shellscript"}},"outputs":[],"source":["git clone https://github.com/soedinglab/MMseqs2.git\n","cd MMseqs2\n","git checkout 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1 # As in the ColabFold README\n","mkdir build\n","cd build\n","cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..\n","make\n","make install\n","export PATH=$(pwd)/bin/:$PATH\n","cd ../../ # Back to workdir"]},{"cell_type":"code","execution_count":null,"metadata":{"vscode":{"languageId":"shellscript"}},"outputs":[],"source":["git clone https://github.com/sokrypton/ColabFold\n","cd ColabFold\n","git checkout 9bd39fd8be88ca71c039be9a8000bb0db92f80a2\n","# Download and prepare the databases\n","# Note: only the UniRef30 version uniref30_2103 contains taxonomy information.\n","# You can download it from https://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2103_taxonomy.tar.gz\n","bash setup_databases.sh\n","# To add taxonomy IDs using the uniref30_2103 database,\n","# refer to https://github.com/sokrypton/ColabFold/issues/216\n","# and add the following call after line 93 of colabfold/mmseqs/search.py:\n","# run_mmseqs(mmseqs, [\"convertalis\", base.joinpath(\"qdb\"), dbbase.joinpath(f\"{uniref_db}{dbSuffix1}\"),\n","# base.joinpath(\"res_exp_realign_filter\"), base.joinpath(\"uniref_tax.m8\"), \"--format-output\",\n","# \"query,target,taxid,taxname,taxlineage,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,cigar\",\n","# \"--db-load-mode\", str(db_load_mode), \"--threads\", str(threads)])\n","\n","pip3 install .\n","pip install colabfold[alphafold]==1.5.5\n","cd ../ # Back to workdir, so the ColabFold/ paths below resolve"]},{"cell_type":"code","execution_count":null,"metadata":{"vscode":{"languageId":"shellscript"}},"outputs":[],"source":["export PYTHONPATH=$PYTHONPATH:ColabFold\n","mkdir -p logs\n","mkdir -p results\n","python3 ColabFold/colabfold/mmseqs/search.py --mmseqs=mmseqs {input_file} {db_dir} {output_dir} --db1 uniref30_2103_db --db2 pdb70_220313_db --db3 colabfold_envdb_202108_db --use-templates 1 --db-load-mode 2 --threads 32\n","# If you use scripts/msa/data/pdb_seqs/pdb_seq.fasta as {input_file},\n","# you will get the following results, as in scripts/msa/data/mmcif_msa_initial:\n","# 0.a3m 1.a3m 2.a3m 3.a3m pdb70_220313_db.m8 uniref_tax.m8"]},{"cell_type":"markdown","metadata":{},"source":["#### The above MSA pipeline is how we generate the MSA data for training.\n","- It depends on specific versions of ColabFold and MMseqs2.\n","- It requires modifying the ColabFold code and using a specific version of UniRef30 to generate MSAs with taxonomy information.\n","- The overall approach is to search for an MSA with taxonomy information only once per unique sequence, then pair the MSAs by the species information in each one.\n","\n"]}],"metadata":{"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":2}