BFD Downloads

Bioinformatic Methods

BFD was created by clustering 2.5 billion protein sequences from UniProt/TrEMBL + Swiss-Prot, Metaclust, and the Soil Reference Catalog and Marine Eukaryotic Reference Catalog assembled by Plass.

We clustered sequences that could be aligned to a longer sequence with at least 90% of their residues and a sequence identity of at least 30%, using Linclust/MMseqs2 with --cov-mode 1 --min-seq-id 0.3.
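
As a rough illustration of this criterion, the Python sketch below checks whether one pairwise alignment would satisfy both thresholds. The function name and its arguments are hypothetical and not part of MMseqs2. The default thresholds are taken from the description above. Treating both as inclusive minimums is an assumption.

```python
def meets_clustering_criterion(aligned_residues, member_length, seq_identity,
                               min_coverage=0.9, min_seq_id=0.3):
    """Return True if a member sequence could join a longer sequence's cluster.

    aligned_residues: number of the member's residues covered by the alignment
    member_length:    total length of the member (shorter) sequence
    seq_identity:     fractional sequence identity of the alignment (0.0 to 1.0)
    """
    if member_length <= 0:
        raise ValueError("member_length must be positive")
    # Coverage is measured on the member sequence, not on the longer one.
    coverage = aligned_residues / member_length
    return coverage >= min_coverage and seq_identity >= min_seq_id
```

For example, a 100-residue sequence with 95 aligned residues at 35% identity passes, while the same sequence with only 80 aligned residues does not.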

We removed all clusters with fewer than three sequences and turned the database into an HH-suite3 database using the Uniclust pipeline.
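
The size filter can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code. It assumes cluster assignments are available as (representative, member) pairs in which each representative is also listed as a member of its own cluster. The function name is hypothetical.

```python
from collections import defaultdict


def filter_small_clusters(pairs, min_size=3):
    """Group (representative, member) pairs and drop clusters below min_size."""
    clusters = defaultdict(list)
    for representative, member in pairs:
        clusters[representative].append(member)
    # Keep only clusters with at least min_size sequences.
    return {rep: members for rep, members in clusters.items()
            if len(members) >= min_size}
```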

File Format

Each sequence entry in the database has an identifier from one of several sources: UniProt, JGI, NCBI or OM-RGC. The JGI data can be accessed by entering the first part of the FASTA identifier in the organism field of the following URL.
 >RifCSPlowO2_12_1023861.scaffolds.fasta_scaffold367679_1 # 24 # 428 # -1 # ID=367679_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.435
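
A header like the one above can be split into its fields with a short Python sketch. It assumes the " # "-separated layout seen in this example (identifier, start, end, strand, then semicolon-separated key=value attributes), which resembles gene-predictor output. Headers from the other sources may differ. The function name is hypothetical, and the sketch does not try to decide which portion of the identifier counts as its "first part".

```python
def parse_header(header):
    """Split a ' # '-delimited FASTA header into identifier, coordinates and attributes."""
    fields = header.strip().lstrip(">").split(" # ")
    if len(fields) != 5:
        raise ValueError("unexpected header layout: %r" % header)
    identifier, start, end, strand, attributes = fields
    # Attributes look like 'ID=367679_1;partial=00;...'; values are kept as strings.
    attrs = dict(item.split("=", 1) for item in attributes.split(";") if item)
    return {
        "identifier": identifier,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
        "attributes": attrs,
    }
```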

Consistency Check

After downloading, please check that the archive matches the following values:

MD5 Hash
Byte Size
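
One way to compute both values locally is the Python sketch below, which uses only the standard library. The function name is hypothetical. The file is read in chunks so that a very large archive does not have to fit in memory.

```python
import hashlib
import os


def md5_and_size(path, chunk_size=1024 * 1024):
    """Return (MD5 hex digest, size in bytes) for the file at path."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest(), os.path.getsize(path)
```

Compare the returned pair with the MD5 hash and byte size published for the archive. A mismatch in either value suggests an incomplete or corrupted download.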

We recommend downloading the BFD with aria2c.


All files are available under a Creative Commons Attribution 4.0 International License.