Common Datasets
- chrisreidy@arizona.edu
- sarawillis@arizona.edu
- sohampal@arizona.edu
Overview
We host several large community datasets.  It is beneficial to you and us.  For you, it saves all that time downloading and filling up your storage allocation. And for us it reduces the occurrence of the same data in many places. We do not currently update them on any particular cadence. You can request updates if you feel those would be useful to the community.
These datasets and databases are available on the compute nodes in the filesystem called /contrib/datasets
in read-only mode. You can view them there.
AlphaFold 2
AlphaFold is an AI system developed by the Google DeepMind project to predict the 3D structure of a protein from its amino acid sequence. AlphaFold needs multiple datasets to run, the combined size of which is around 2.62 TB. It takes a long time to download them. We save you that effort by locating these datasets at /contrib/datasets/alphafold
.
We also host containers at /contrib/singularity/alphafold
that you can use with the provided datasets to predict protein structures with AlphaFold. You can access the container by loading the alphafold
module from an interactive session or a batch submission script. When you load the alphafold
module, it defines the following additional environment variables that you can use to easily access the container and the datasets:
ALPHAFOLD_DIR
which points to/contrib/singularity/alphafold
ALPHAFOLD_DATADIR
which points to/contrib/datasets/alphafold
The following batch submission script shows how you can use the container with the provided datasets.
#!/bin/bash #SBATCH --job-name=alphafold-run #SBATCH --time=04:00:00 #SBATCH --gres=gpu:1 #SBATCH --cpus-per-task=8 #SBATCH --ntasks-per-node=1 #SBATCH --partition=standard #SBATCH --account=<pi-account> # Uncomment the following two lines to make predictions on proteins that would be too long to fit into GPU memory. # export APPTAINERENV_TF_FORCE_UNIFIED_MEMORY=1 # export APPTAINERENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 module load alphafold apptainer run \ -B .:/etc \ -B $ALPHAFOLD_DATADIR:/data \ --pwd /app/alphafold $ALPHAFOLD_DIR/alphafold-2.3.0.sif \ --data_dir=/data \ --nv \ --use_gpu_relax \ --uniref90_database_path=/data/uniref90/uniref90.fasta \ --uniref30_database_path=/data/uniref30/UniRef30_2021_03 \ --mgnify_database_path=/data/mgnify/mgy_clusters_2022_05.fa \ --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \ --pdb70_database_path=/data/pdb70/pdb70 \ --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \ --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \ --model_preset=monomer \ --max_template_date=2023-08-02 \ --db_preset=full_dbs \ --output_dir=results \ --fasta_paths=input.fasta
Llama 2
The Meta Large Language Model project has several datasets  that we make available.
NCBI Blast
The Blast databases are provided in support of the Blast module.