AlphaFold on Notos
This document provides the implementation of the pipeline of AlphaFold V2.0 on the notos server at the Computational Science Research Center, SDSU. The notos server is managed by Dr. Christopher Paolini (Email).
There are two CPU sockets on notos. Since, notos is not a cluster with compute nodes there is no batch scheduler. Scripts/ commands can be run on multiple GNU Screen session. More instructions are given at the end of this document. Follow Running AlphaFold on GNU Screen session.
More information about AlphaFold can be found here.
First time setup
The following steps are required in order to run AlphaFold:
- Install Docker.
- Install NVIDIA Container Toolkit for GPU support.
- Setup running Docker as a non-root user.
- Docker and NVIDIA Container Toolkit have been installed on notos and are up-to date as of 6/9/2022. As each new user follows the steps below to run alphafold they must contact Dr. Christopher Paolini (Email) for docker access.
- Download genetic databases and model parameters.
- The genetic databases and model parameters have been downloaded and can be found on the notos server at
/mnt/beegfs/alphafold/databases
- The genetic databases and model parameters have been downloaded and can be found on the notos server at
-
Check that AlphaFold will be able to use a GPU by running:
docker run --rm --gpus all nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi
The output of this command should show a list of your GPUs.
Genetic databases - Downloaded on Notos
This step requires aria2c
which is already installed and is available on the notos server.
AlphaFold needs multiple genetic (sequence) databases to run:
- BFD,
- MGnify,
- PDB70,
- PDB (structures in the mmCIF format),
- PDB seqres – only for AlphaFold-Multimer,
- Uniclust30,
- UniProt – only for AlphaFold-Multimer,
- UniRef90.
A script scripts/download_all_data.sh
is provided on the official alphafold readme where you can download and set up all of the databases. This script did not work as intended on the notos server. A workaround is to manually download all the databases using the scripts available at scripts/
on the Alphafold github respository
The $DOWNLOAD_DIR
on notos for alphafold is /mnt/beegfs/alphafold/databases
-
Default:
bash scripts/download_alphafold_params.sh <DOWNLOAD_DIR> bash scripts/download_bfd.sh <DOWNLOAD_DIR> bash scripts/download_mgnify.sh <DOWNLOAD_DIR> bash scripts/download_pdb70.sh <DOWNLOAD_DIR> bash scripts/download_pdb_mmcif.sh <DOWNLOAD_DIR> bash scripts/download_pdb_seqres.sh <DOWNLOAD_DIR> bash scripts/download_uniclust30.sh <DOWNLOAD_DIR> bash scripts/download_uniprot.sh <DOWNLOAD_DIR> bash scripts/download_uniref90.sh <DOWNLOAD_DIR>
-
With
reduced_dbs
:bash scripts/download_small_bfd.sh <DOWNLOAD_DIR>
instead of
download_bfd.sh
, will download a reduced version of the databases to be used with thereduced_dbs
database preset.
Note: The download directory <DOWNLOAD_DIR>
should not be a subdirectory in the AlphaFold repository directory. If it is, the Docker build will be slow as the large databases will be copied during the image creation.
Note: The total download size for the full databases is around 415 GB and the total size when unzipped is 2.2 TB.
Once the script has finished, you should have the following directory structure:
$DOWNLOAD_DIR/ # Total: ~ 2.2 TB (download: 438 GB)
bfd/ # ~ 1.7 TB (download: 271.6 GB)
# 6 files.
mgnify/ # ~ 64 GB (download: 32.9 GB)
mgy_clusters_2018_12.fa
params/ # ~ 3.5 GB (download: 3.5 GB)
# 5 CASP14 models,
# 5 pTM models,
# 5 AlphaFold-Multimer models,
# LICENSE,
# = 16 files.
pdb70/ # ~ 56 GB (download: 19.5 GB)
# 9 files.
pdb_mmcif/ # ~ 206 GB (download: 46 GB)
mmcif_files/
# About 180,000 .cif files.
obsolete.dat
pdb_seqres/ # ~ 0.2 GB (download: 0.2 GB)
pdb_seqres.txt
small_bfd/ # ~ 17 GB (download: 9.6 GB)
bfd-first_non_consensus_sequences.fasta
uniclust30/ # ~ 86 GB (download: 24.9 GB)
uniclust30_2018_08/
# 13 files.
uniprot/ # ~ 98.3 GB (download: 49 GB)
uniprot.fasta
uniref90/ # ~ 58 GB (download: 29.7 GB)
uniref90.fasta
The genetic databases and model parameters have been downloaded and can be found on the notos server at
```bash
/mnt/beegfs/alphafold/databases
```
Setting up Alphafold for the first time
-
Clone this repository and
cd
into it.git clone https://github.com/deepmind/alphafold.git
-
Build the Docker image (Need to be done only once):
docker build -f docker/Dockerfile -t alphafold .
If you encounter the following error:
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease' is not signed.
Add the following line to the
docker/Dockerfile
:RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
If you encounter the following in Step 3/19:
Step 3/9: ARG CUDA symlink /proc/mounts /mnt/beegfs/notos/docker/aufs/mnt/<long code> -inti/etc/mtab: device or resource is busy
This means that all the disk space for docker is exhausted. Contact Dr. Christopher Paolini (Email) to add more disk space for docker.
-
Install the
run_docker.py
dependencies. Note: You may optionally wish to create a Python Virtual Environment to prevent conflicts with your system’s Python environment.pip3 install -r docker/requirements.txt
-
Open
docker/run_docker.py
and change the output directoryoutput_dir
to your choice of directory where you have sufficient permissions to write into it. -
Run
run_docker.py
pointing to a FASTA file containing the protein sequence(s) for which you wish to predict the structure. If you are predicting the structure of a protein that is already in PDB and you wish to avoid using it as a template, thenmax_template_date
must be set to be before the release date of the structure. You must also provide the path to the directory containing the downloaded databases. For example, for the T1050 CASP14 target:python3 docker/run_docker.py \ --fasta_paths=T1050.fasta \ --max_template_date=2020-05-14 \ --data_dir=$DOWNLOAD_DIR
The
$DOWNLOAD_DIR
on notos for alphafold is/mnt/beegfs/alphafold/databases
If you encounter the following error:
'Jackhmmer failed\nstderr:\n%s\n' % stderr.decode('utf-8')) RuntimeError: Jackhmmer failed stderr: Fatal exception (source file esl_msafile_stockholm.c, line 1263): stockholm msa wrote failed system error: No space left on device
Contact Dr. Christopher Paolini (Email) to allocated more space in the docker:
If you encounter the following error:
TypeError: Descriptors cannot not be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0. If you cannot immediately regenerate your protos, some other possible workarounds are: 1. Downgrade the protobuf package to 3.20.x or lower. 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
Do
pip3 install --upgrade protobuf==3.20.0
And add the following line in
docker/run_docker.py
in the environmnet definition as shown below:'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'python',
container = client.containers.run( image=FLAGS.docker_image_name, command=command_args, device_requests=device_requests, remove=True, detach=True, mounts=mounts, user=FLAGS.docker_user, environment={ 'NVIDIA_VISIBLE_DEVICES': FLAGS.gpu_devices, # The following flags allow us to make predictions on proteins that # would typically be too long to fit into GPU memory. 'TF_FORCE_UNIFIED_MEMORY': '1', 'XLA_PYTHON_CLIENT_MEM_FRACTION': '4.0', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'python', })
If you encounter the following error:
RuntimeError: HHblits failed stdout: stderr: 04:13:23.681 ERROR: Could find neither hhm_db nor a3m_db!
It is due to the change in permissions. Do,
sudo find /mnt/beegfs/alphafold/databases -type d -exec chmod 755 {} \; sudo find /mnt/beegfs/alphafold/databases -type f -exec chmod 644 {} \;
If you don’t have access to change permissions contact Dr. Christopher Paolini (Email).
Running AlphaFold
-
You can control which AlphaFold model to run by adding the
--model_preset=
flag.-
monomer: This is the original model used at CASP14 with no ensembling.
-
monomer_casp14: This is the original model used at CASP14 with
num_ensemble=8
, matching our CASP14 configuration. This is largely provided for reproducibility as it is 8x more computationally expensive for limited accuracy gain (+0.1 average GDT gain on CASP14 domains). -
monomer_ptm: This is the original CASP14 model fine tuned with the pTM head, providing a pairwise confidence measure. It is slightly less accurate than the normal monomer model.
-
multimer: This is the AlphaFold-Multimer model. To use this model, provide a multi-sequence FASTA file. In addition, the UniProt database should have been downloaded.
-
-
You can control MSA speed/quality tradeoff by adding
--db_preset=reduced_dbs
or--db_preset=full_dbs
to the run command. The following presets are provided.-
reduced_dbs: This preset is optimized for speed and lower hardware requirements. It runs with a reduced version of the BFD database. It requires 8 CPU cores (vCPUs), 8 GB of RAM, and 600 GB of disk space.
-
full_dbs: This runs with all genetic databases used at CASP14.
Running the command above with the
monomer
model preset and thereduced_dbs
data preset would look like this:python3 docker/run_docker.py \ --fasta_paths=T1050.fasta \ --max_template_date=2020-05-14 \ --model_preset=monomer \ --db_preset=reduced_dbs \ --data_dir=$DOWNLOAD_DIR
The
$DOWNLOAD_DIR
on notos for alphafold is/mnt/beegfs/alphafold/databases
-
Running AlphaFold-Multimer
-
All steps are the same as when running the monomer system, but you will have to
- provide an input fasta with multiple sequences,
- set
--model_preset=multimer
,
An example that folds a protein complex
multimer.fasta
:python3 docker/run_docker.py \ --fasta_paths=multimer.fasta \ --max_template_date=2020-05-14 \ --model_preset=multimer \ --data_dir=$DOWNLOAD_DIR
The
$DOWNLOAD_DIR
on notos for alphafold is/mnt/beegfs/alphafold/databases
By default the multimer system will run 5 seeds per model (25 total predictions) for a small drop in accuracy you may wish to run a single seed per model. This can be done via the --num_multimer_predictions_per_model
flag, e.g. set it to --num_multimer_predictions_per_model=1
to run a single seed per model.
Examples
Below are examples on how to use AlphaFold in different scenarios.
Folding a monomer
Say we have a monomer with the sequence <SEQUENCE>
. The input fasta should be:
>sequence_name
<SEQUENCE>
Then run the following command:
python3 docker/run_docker.py \
--fasta_paths=monomer.fasta \
--max_template_date=2021-11-01 \
--model_preset=monomer \
--data_dir=$DOWNLOAD_DIR
The $DOWNLOAD_DIR
on notos for alphafold is /mnt/beegfs/alphafold/databases
Folding a homomer
Say we have a homomer with 3 copies of the same sequence <SEQUENCE>
. The input fasta should be:
>sequence_1
<SEQUENCE>
>sequence_2
<SEQUENCE>
>sequence_3
<SEQUENCE>
Then run the following command:
python3 docker/run_docker.py \
--fasta_paths=homomer.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR
The $DOWNLOAD_DIR
on notos for alphafold is /mnt/beegfs/alphafold/databases
Folding a heteromer
Say we have an A2B3 heteromer, i.e. with 2 copies of <SEQUENCE A>
and 3 copies of <SEQUENCE B>
. The input fasta should be:
>sequence_1
<SEQUENCE A>
>sequence_2
<SEQUENCE A>
>sequence_3
<SEQUENCE B>
>sequence_4
<SEQUENCE B>
>sequence_5
<SEQUENCE B>
Then run the following command:
python3 docker/run_docker.py \
--fasta_paths=heteromer.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR
The $DOWNLOAD_DIR
on notos for alphafold is /mnt/beegfs/alphafold/databases
Folding multiple monomers one after another
Say we have a two monomers, monomer1.fasta
and monomer2.fasta
.
Both can be folded sequentially by using the following command:
python3 docker/run_docker.py \
--fasta_paths=monomer1.fasta,monomer2.fasta \
--max_template_date=2021-11-01 \
--model_preset=monomer \
--data_dir=$DOWNLOAD_DIR
The $DOWNLOAD_DIR
on notos for alphafold is /mnt/beegfs/alphafold/databases
Folding multiple multimers one after another
Say we have a two multimers, multimer1.fasta
and multimer2.fasta
.
Both can be folded sequentially by using the following command:
python3 docker/run_docker.py \
--fasta_paths=multimer1.fasta,multimer2.fasta \
--max_template_date=2021-11-01 \
--model_preset=multimer \
--data_dir=$DOWNLOAD_DIR
The $DOWNLOAD_DIR
on notos for alphafold is /mnt/beegfs/alphafold/databases
AlphaFold output
The outputs will be saved in a subdirectory of the directory provided via the --output_dir
flag of run_docker.py
(defaults to /tmp/alphafold/
). The outputs include the computed MSAs, unrelaxed structures, relaxed structures, ranked structures, raw model outputs, prediction metadata, and section timings. The --output_dir
directory will have the following structure:
<target_name>/
features.pkl
ranked_{0,1,2,3,4}.pdb
ranking_debug.json
relaxed_model_{1,2,3,4,5}.pdb
result_model_{1,2,3,4,5}.pkl
timings.json
unrelaxed_model_{1,2,3,4,5}.pdb
msas/
bfd_uniclust_hits.a3m
mgnify_hits.sto
uniref90_hits.sto
The contents of each output file are as follows:
features.pkl
– Apickle
file containing the input feature NumPy arrays used by the models to produce the structures.unrelaxed_model_*.pdb
– A PDB format text file containing the predicted structure, exactly as outputted by the model.relaxed_model_*.pdb
– A PDB format text file containing the predicted structure, after performing an Amber relaxation procedure on the unrelaxed structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for details).ranked_*.pdb
– A PDB format text file containing the relaxed predicted structures, after reordering by model confidence. Hereranked_0.pdb
should contain the prediction with the highest confidence, andranked_4.pdb
the prediction with the lowest confidence. To rank model confidence, we use predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6 for details).ranking_debug.json
– A JSON format text file containing the pLDDT values used to perform the model ranking, and a mapping back to the original model names.timings.json
– A JSON format text file containing the times taken to run each section of the AlphaFold pipeline.msas/
- A directory containing the files describing the various genetic tool hits that were used to construct the input MSA.-
result_model_*.pkl
– Apickle
file containing a nested dictionary of the various NumPy arrays directly produced by the model. In addition to the output of the structure module, this includes auxiliary outputs such as:- Distograms (
distogram/logits
contains a NumPy array of shape [N_res, N_res, N_bins] anddistogram/bin_edges
contains the definition of the bins). - Per-residue pLDDT scores (
plddt
contains a NumPy array of shape [N_res] with the range of possible values from0
to100
, where100
means most confident). This can serve to identify sequence regions predicted with high confidence or as an overall per-target confidence score when averaged across residues. - Present only if using pTM models: predicted TM-score (
ptm
field contains a scalar). As a predictor of a global superposition metric, this score is designed to also assess whether the model is confident in the overall domain packing. - Present only if using pTM models: predicted pairwise aligned errors (
predicted_aligned_error
contains a NumPy array of shape [N_res, N_res] with the range of possible values from0
tomax_predicted_aligned_error
, where0
means most confident). This can serve for a visualisation of domain packing confidence within the structure.
- Distograms (
The pLDDT confidence measure is stored in the B-factor field of the output PDB files (although unlike a B-factor, higher pLDDT is better, so care must be taken when using for tasks such as molecular replacement).
Running Multiple files on Alphafold
- To start a screen session, type
screen
in your console:
$ screen
-
To get a list of commands
Ctrl+a ?
-
To create a named session
$ screen -S session_name
Named sessions are useful when you run multiple screen sessions.
-
To detach from a Linux screen session, press
Ctrl+a d
- To reattach to a linux screen
$ screen -r
- To get the list of current running session
$ screen -ls
More on GNU screen sessions can be found here
References
-
Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, κ.ά. ‘Highly accurate protein structure prediction with AlphaFold’. Nature 596, τχ. 7873 (2021): 583–89. https://doi.org/10.1038/s41586-021-03819-2.
-
Evans, Richard, Michael O\textquoterightNeill, Alexander Pritzel, Natasha Antropova, Andrew Senior, Tim Green, Augustin Žídek, κ.ά. ‘Protein complex prediction with AlphaFold-Multimer’. bioRxiv, 2021. https://doi.org/10.1101/2021.10.04.463034.