Jumper J. et al., Highly accurate protein structure prediction with AlphaFold. Nature, doi: 10.1038/s41586-021-03819-2 (2021).
Steinegger M., Mirdita M. and Söding J., Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, doi: 10.1038/s41592-019-0437-4 (2019).
Steinegger M. and Söding J., Clustering huge protein sequence sets in linear time. Nature Communications, doi: 10.1038/s41467-018-04964-5 (2018).
BFD was created by clustering 2.5 billion protein sequences from UniProt/TrEMBL+Swiss-Prot, Metaclust, and the Soil Reference Catalog and Marine Eukaryotic Reference Catalog, both assembled by Plass.
We clustered sequences that could be aligned to a longer sequence over at least 90% of their residues and with a sequence identity of at least 30%, using Linclust/MMseqs2 (--cov-mode 1 --min-seq-id 0.3).
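The clustering criterion above can be illustrated with a small predicate. This is only a sketch of the stated thresholds, not the Linclust/MMseqs2 implementation: the function name, its parameters, and the choice to treat both thresholds as inclusive are assumptions made for illustration.

```python
def passes_clustering_criterion(
    aligned_residues: int,
    seq_length: int,
    seq_identity: float,
    min_coverage: float = 0.9,   # 90% of the shorter sequence's residues
    min_seq_id: float = 0.3,     # 30% sequence identity
) -> bool:
    """Return True if a sequence could join the cluster of a longer sequence.

    Hypothetical helper: coverage is computed over the sequence being
    clustered (not the longer representative), and both thresholds are
    assumed to be inclusive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    coverage = aligned_residues / seq_length
    return coverage >= min_coverage and seq_identity >= min_seq_id
```

For example, a 100-residue sequence with 95 aligned residues at 45% identity would pass, while the same sequence with only 80 aligned residues would not.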
We removed all clusters with fewer than three sequences and converted the result into an HH-suite3 database using the Uniclust pipeline.
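The size filter can be sketched as follows. The input format is an assumption: an iterable of (representative, member) pairs in which each representative is also listed as a member of its own cluster. The function name and the `min_size` parameter are hypothetical and not part of the Uniclust pipeline.

```python
from collections import defaultdict


def drop_small_clusters(pairs, min_size=3):
    """Group (representative, member) pairs and keep clusters of >= min_size.

    Assumes every representative appears as a member of its own cluster,
    so the member count equals the cluster size.
    """
    clusters = defaultdict(list)
    for representative, member in pairs:
        clusters[representative].append(member)
    return {
        rep: members
        for rep, members in clusters.items()
        if len(members) >= min_size
    }
```

With the default `min_size=3`, singleton clusters and clusters of two sequences are discarded, matching the "fewer than three sequences" rule.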
>RifCSPlowO2_12_1023861.scaffolds.fasta_scaffold367679_1 # 24 # 428 # -1 # ID=367679_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.435
After downloading, please check that the contents of the downloaded archive match the following MD5 hashes:
We recommend downloading the BFD with aria2c.
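Once a download has finished, the MD5 check can be done with any standard tool. A minimal Python sketch is shown here; the helper names are hypothetical, and the file path and expected hash must come from the published list, as none are assumed here. The file is read in chunks so that very large archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.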
All files are available under a Creative Commons Attribution 4.0 International License.