BFD

BFD Downloads

BFD is available from the following two mirrors. The archives were created at different times, so their checksums and byte sizes differ between the mirrors. The hashes of the individual files are listed below.

Mirror Google Cloud

MD5 Hash: 6a634dc6eb105c2e9b4cba7bbae93412
Byte Size: 291649557441

Mirror GWDG

MD5 Hash: 4b53fc6ca77c78fbc433948fb47e08c6
Byte Size: 291649557551

Reference

Jumper et al., Highly accurate protein structure prediction with AlphaFold. Nature, doi: 10.1038/s41586-021-03819-2, 2021.

Steinegger M., Mirdita M. and Söding J., Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, doi: 10.1038/s41592-019-0437-4, 2019.

Steinegger M. and Söding J., Clustering huge protein sequence sets in linear time. Nature Communications, doi: 10.1038/s41467-018-04964-5, 2018.

Bioinformatic Methods

BFD was created by clustering 2.5 billion protein sequences from Uniprot/TrEMBL+Swissprot, Metaclust, and the Soil Reference Catalog and Marine Eukaryotic Reference Catalog assembled by Plass.

We clustered sequences that could be aligned to a longer sequence with at least 90% of their residues and a sequence identity of at least 30%, using Linclust/MMseqs2 with --cov-mode 1 --min-seq-id 0.3.
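The clustering criterion can be illustrated with a small predicate. This is only a sketch of the two thresholds described above, not Linclust's implementation; the function name and its arguments are hypothetical, and the 0.9 coverage default is taken from the 90% figure in the text.

```python
def meets_clustering_criterion(aligned_residues, member_length, seq_identity,
                               min_coverage=0.9, min_seq_id=0.3):
    """Return True if a sequence satisfies both thresholds against a longer sequence.

    aligned_residues: residues of the shorter sequence covered by the alignment
    member_length:    total length of the shorter sequence
    seq_identity:     alignment sequence identity as a fraction in [0, 1]
    """
    if member_length <= 0:
        raise ValueError("member_length must be positive")
    # Coverage is measured on the shorter sequence only.
    coverage = aligned_residues / member_length
    return coverage >= min_coverage and seq_identity >= min_seq_id
```

For example, a 100-residue sequence with 95 aligned residues at 35% identity passes, while the same sequence with only 80 aligned residues does not.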

We removed all clusters with fewer than three sequences and turned the database into an HH-suite3 database using the Uniclust pipeline.
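The size filter can be sketched as follows. The input layout, an iterable of (representative, member) pairs, is an assumption made for illustration, and the function name is hypothetical; the actual pipeline operates on MMseqs2 databases rather than Python objects.

```python
from collections import defaultdict


def filter_small_clusters(assignments, min_size=3):
    """Group members by representative and keep clusters with at least min_size members.

    assignments: iterable of (representative_id, member_id) pairs (assumed layout).
    """
    clusters = defaultdict(list)
    for representative, member in assignments:
        clusters[representative].append(member)
    # Drop every cluster with fewer than min_size sequences.
    return {rep: members for rep, members in clusters.items()
            if len(members) >= min_size}
```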

File Format

Each sequence entry in the database carries an identifier from one of several sources: Uniprot, JGI, NCBI or OM-RGC. JGI data can be accessed by using the first part of the FASTA identifier as the organism field of the following URL: https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=RifCSPlowO2_12

 >RifCSPlowO2_12_1023861.scaffolds.fasta_scaffold367679_1 # 24 # 428 # -1 # ID=367679_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.435
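As a rough sketch, the organism name can be extracted from such a header and turned into the download URL. The splitting rule is an assumption inferred from the single example above (the organism is taken to be everything before the trailing `_<digits>.scaffolds.fasta` part); other JGI identifiers may follow a different pattern. The function names are hypothetical.

```python
import re
from urllib.parse import urlencode

JGI_DOWNLOAD_URL = "https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf"


def jgi_organism_from_header(header):
    """Extract the organism part of a JGI-style FASTA header (assumed pattern)."""
    # Keep only the identifier: drop surrounding whitespace, '>' and the description.
    identifier = header.strip().lstrip(">").split()[0]
    match = re.match(r"(.+?)_\d+\.scaffolds\.fasta", identifier)
    if match is None:
        raise ValueError(f"unrecognized JGI identifier: {identifier}")
    return match.group(1)


def jgi_download_url(header):
    """Build the JGI download URL for the organism named in the header."""
    organism = jgi_organism_from_header(header)
    return f"{JGI_DOWNLOAD_URL}?{urlencode({'organism': organism})}"
```

Applied to the example header, `jgi_download_url` reproduces the URL given above.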

Consistency Check

After downloading, please check that the contents of the downloaded archive match the following MD5 hashes:

bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata: 2dc0f09adabbcf1965ed578e0b2ab07e
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex: 476941cf4a964d96fb3b68a82fe734d1
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata: 4bb63ac9c3a3dd088cf654df1f548d53
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex: 26d48869efdb50d036e2fb9056a0ae9d
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata: 9bd2da8a8adbcc30801f0221d0dc1987
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex: 799f308b20627088129847709f1abed6
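
The check can be automated with a short script. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the six files were extracted into a single directory. Files are read in chunks, since the data files are far too large to load into memory.

```python
import hashlib
from pathlib import Path

PREFIX = "bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"

# Expected MD5 hashes, copied from the list above.
EXPECTED_MD5 = {
    f"{PREFIX}_a3m.ffdata": "2dc0f09adabbcf1965ed578e0b2ab07e",
    f"{PREFIX}_a3m.ffindex": "476941cf4a964d96fb3b68a82fe734d1",
    f"{PREFIX}_cs219.ffdata": "4bb63ac9c3a3dd088cf654df1f548d53",
    f"{PREFIX}_cs219.ffindex": "26d48869efdb50d036e2fb9056a0ae9d",
    f"{PREFIX}_hhm.ffdata": "9bd2da8a8adbcc30801f0221d0dc1987",
    f"{PREFIX}_hhm.ffindex": "799f308b20627088129847709f1abed6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory, expected=EXPECTED_MD5):
    """Return a dict mapping each file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, expected_hash in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected_hash:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_directory("/path/to/bfd")` (the path is a placeholder) returns one status per file; hashing the full database takes a while because every byte has to be read.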

We recommend downloading the BFD with aria2c.
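
A download with aria2c could be scripted roughly as below. This is a sketch under assumptions: the URL in the usage example is a placeholder that must be replaced with the actual mirror link, the function names are hypothetical, and the connection count is an arbitrary choice. The options used (`--continue`, `--max-connection-per-server`, `--split`, `--dir`) are standard aria2c options.

```python
import subprocess


def build_aria2c_command(url, target_dir=".", connections=8):
    """Build an aria2c command line that resumes partial downloads."""
    return [
        "aria2c",
        "--continue=true",                            # resume an interrupted download
        f"--max-connection-per-server={connections}",  # parallel connections per server
        f"--split={connections}",                      # number of segments per file
        f"--dir={target_dir}",                         # directory to store the archive
        url,
    ]


def download(url, target_dir=".", connections=8):
    """Run aria2c; raises CalledProcessError if the download fails."""
    subprocess.run(build_aria2c_command(url, target_dir, connections), check=True)
```

For example, `download("https://example.org/bfd.tar.gz", target_dir="/data/bfd")` would fetch the placeholder URL into `/data/bfd`. Resuming matters here because the archive is roughly 290 GB.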

License

All files are available under a Creative Commons Attribution 4.0 International License.