DBD: Transcription factor prediction database

Release 2.0
www.transcriptionfactor.org
About DBD

Regulation of gene expression influences almost all biological processes in an organism; sequence-specific DNA-binding transcription factors are critical to this control. For most genomes, the repertoire of transcription factors is only partially known. Hitherto, transcription factor identification has been largely based either on genome annotation pipelines that use pairwise sequence comparisons, which detect only those factors similar to known genes, or on functional classification schemes that amalgamate many types of proteins into the category of 'transcription factor'. Using a novel transcription factor identification method, the DBD transcription factor database fills this gap, providing genome-wide transcription factor predictions for organisms from across the tree of life.

The most recent article describing DBD is:

DBD - taxonomically broad transcription factor predictions: new content and functionality

Derek Wilson, Varodom Charoensawan, Sarah K. Kummerfeld, Sarah A. Teichmann
Nucleic Acids Research 2008 36(Database issue):D88-D92; doi:10.1093/nar/gkm964

The original article describing DBD is:

DBD: a transcription factor prediction database

Sarah K. Kummerfeld, Sarah A. Teichmann
Nucleic Acids Research 2006 34(Database issue):D74-D81

The prediction method behind DBD identifies sequence-specific DNA-binding transcription factors through homology using hidden Markov models (HMMs) of domains. The collection of HMMs is taken from two existing databases (Pfam and SUPERFAMILY), and is limited to models that exclusively detect transcription factors that specifically recognise DNA sequences. It does not include basal transcription factors or chromatin-associated proteins, for instance. Based on comparison with experimentally verified annotation, the prediction procedure is roughly 97% accurate. Between one quarter and one half of our genome-wide predicted transcription factors are novel, representing previously uncharacterised proteins.

DBD consists of predicted transcription factor repertoires for 930 completely sequenced genomes, their domain assignments and the hand-curated list of DNA-binding domain HMMs. Users can browse, search or download the predictions by genome, domain family or sequence identifier, view families of transcription factors based on domain architecture, and receive predictions for a protein sequence.

Prediction Method

First, we manually inspected all SCOP and Pfam families to identify those that consist exclusively of sequence-specific DNA-binding domains. From this annotation, we selected the hidden Markov models that represent these families from the SUPERFAMILY and Pfam databases. To make a prediction for a given protein, we search its amino acid sequence against the HMM libraries and designate the protein a TF if it has a significant match to a model that we annotated as a sequence-specific DNA-binding domain.
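The decision rule in the last step can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the DBD pipeline itself: the `DomainHit` type, the `is_predicted_tf` function, the model identifiers and the E-value cutoff are all hypothetical, since the text does not state which significance threshold is used.

```python
from dataclasses import dataclass
from typing import Iterable, Set

# Assumed cutoff for a "significant" match; the text does not state one.
E_VALUE_CUTOFF = 1e-3


@dataclass(frozen=True)
class DomainHit:
    """One HMM match reported for a protein sequence."""
    model_id: str    # identifier of the matching Pfam or SUPERFAMILY model
    e_value: float   # significance of the match (smaller is better)


def is_predicted_tf(hits: Iterable[DomainHit],
                    dbd_models: Set[str],
                    cutoff: float = E_VALUE_CUTOFF) -> bool:
    """Return True if any significant hit is to a curated DNA-binding-domain model."""
    return any(h.model_id in dbd_models and h.e_value <= cutoff for h in hits)


# Placeholder identifiers, not real Pfam or SUPERFAMILY accessions.
curated_dbd_models = {"DBD_MODEL_A", "DBD_MODEL_B"}

protein_hits = [
    DomainHit("OTHER_MODEL_X", 1e-30),  # significant, but not a curated DBD model
    DomainHit("DBD_MODEL_A", 1e-12),    # significant match to a curated DBD model
]
print(is_predicted_tf(protein_hits, curated_dbd_models))  # True
```

Note that a strong match to a model outside the curated list never triggers a prediction; only the hand-curated set of DNA-binding domain models does.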

Benchmarks

To evaluate the accuracy of the prediction process, we carried out a series of tests on groups of sequences that had been experimentally annotated as transcription factors.

The aim of the first test was to assess the accuracy of the underlying approach, transcription factor identification via the presence of a DNA-binding domain, without adding the complexity of domain prediction. The sequence set was taken from the PDB and included only proteins of known structure with curated domain composition from SCOP. By including only proteins with known domain composition, we eliminated any potential error introduced by incorrect assignments from the hidden Markov models. We used the Gene Ontology (GO) annotation of the PDB proteins as a standard list of known TFs (Table 1 lists the functional classes that we treated as representing transcription factors). When we examined PDB proteins containing a DNA-binding domain, we found that more than 99% (394) are classified by GO as sequence-specific TFs. The remaining 1% (3) are classified by GO as "nucleic acid binding" but have not been allocated to a sub-category. This test illustrates the validity of both the underlying approach (prediction based on structural domains) and our hand curation of the SCOP domains.

Accession     Description
GO:0003700    transcription factor activity
GO:0003702    RNA polymerase II transcription factor activity
GO:0003709    RNA polymerase III transcription factor activity
GO:0016563    transcriptional activator activity
GO:0016564    transcriptional repressor activity
Table 1: Sequence-specific DNA-binding transcription factor GO categories. These five categories are from the molecular function ontology and were selected because they include only sequence-specific DNA-binding transcription factors.

The second test aimed to evaluate the prediction method as a whole, including the domain assignment step using SUPERFAMILY and Pfam. The sequence set was taken from the UniProt database, the most comprehensive catalogue of proteins available, which includes more than 1.5 million sequences. As a standard for comparison, we used the experimentally verified GO annotation (that is, we excluded homology-based annotation). We searched the Pfam and SUPERFAMILY hidden Markov models against the UniProt sequence set and derived a set of putative transcription factors.

To calculate the accuracy of our method, we counted the predicted TFs for which GO supports our prediction, that is, for which GO annotates the protein as a sequence-specific DNA-binding TF (the categories included are shown in Table 1). This benchmark establishes our accuracy to be 97% (Table 2). This means that we expect 3 out of 100 of our predictions to be incorrect, that is, the proteins identified are not transcription factors. Conversely, by counting the proteins that GO annotates as transcription factors but that we fail to predict, we calculated the coverage of our method to be 65%. This suggests that we miss around one third of transcription factors. Closer inspection of these proteins showed that many are not actually sequence-specific DNA-binding TFs, but are instead involved in some other expression-related process (for example, basal TFs). The miss rate of one third should therefore be considered an upper bound.

We expect to miss some TFs because we rely on HMM domain assignments, which are known to give incomplete coverage (depending on the genome, between 30% and 60% of amino acids lack a domain assignment). Closer inspection of the 358 known TFs that we categorised as carrying out some other (non-expression-related) function indicates that limitations in homology detection are the likely cause: more than 60% of this set have no domain assignments at all and XXX have an unassigned region large enough to be occupied by a DBD.

                      Our TF list, as annotated by GO    GO-annotated TFs, as annotated by us
Annotated as TF       786                                786
Expression related    18                                 72
Other function        9                                  358
Unclassified          975                                76
Table 2: UniProt benchmark. To evaluate the prediction method, and in particular to assess the impact of using SUPERFAMILY and Pfam to predict DNA-binding domains in sequences without known structures, we compared our predictions with the experimentally derived GO annotation of the UniProt database. The first column of numbers gives the GO annotation of the proteins in our predicted TF set. 97% of predictions are corroborated by GO, giving a false positive rate of 3%. Manual inspection (and a literature search) of the false positives suggests that at least half are in fact experimentally verified sequence-specific DNA-binding proteins. Many of the remaining half have little annotation, but what annotation there is supports the suggestion that these proteins are transcription factors. The final column shows our annotation of all the proteins that GO annotates as transcription factors. It shows that we identify 65% of known transcription factors or, conversely, that we miss about one third. Manual inspection suggests that some of the missed proteins may in fact be basal factors and therefore should not be included in our set. Perhaps the most interesting finding from this analysis is the large number (975) of TFs we found that GO annotates as unclassified. It should also be noted that in total we identified 37736 transcription factors from the 1.5 million sequences in UniProt, but only the 1788 mentioned in the GO annotation are included in the table above.
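The 97% and 65% figures can be reproduced from the counts in Table 2. The sketch below is one reading of the arithmetic rather than a statement of how the authors computed it: it assumes that proteins in the "Unclassified" row are left out of the denominator, an assumption made here because it reproduces the quoted percentages. The dictionary keys and the function name are illustrative.

```python
# Counts as read from Table 2.
our_predictions_by_go = {   # first column: GO annotation of our predicted TFs
    "tf": 786, "expression_related": 18, "other_function": 9, "unclassified": 975,
}
go_tfs_by_us = {            # final column: our annotation of GO-annotated TFs
    "tf": 786, "expression_related": 72, "other_function": 358, "unclassified": 76,
}


def fraction_confirmed(column):
    """Share of classified proteins labelled as TF (unclassified ones are excluded)."""
    classified = (column["tf"]
                  + column["expression_related"]
                  + column["other_function"])
    return column["tf"] / classified


accuracy = fraction_confirmed(our_predictions_by_go)   # 786 / 813
coverage = fraction_confirmed(go_tfs_by_us)            # 786 / 1216

print(f"accuracy: {accuracy:.1%}")   # accuracy: 96.7%
print(f"coverage: {coverage:.1%}")   # coverage: 64.6%

# The first column sums to the 1788 proteins mentioned in the caption.
print(sum(our_predictions_by_go.values()))  # 1788
```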

The final group of tests used to evaluate the prediction method involved comparison with curated lists of transcription factors from Saccharomyces cerevisiae. We predicted 169 transcription factors: 125 of these are known, 5 seem to be false positives and 39 are novel, previously uncharacterised proteins. Conversely, of the 160 known transcription factors, we correctly predicted 78% (125). Of the remaining 22% (35), half had no domain assignments.
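A comparison of this kind amounts to set operations between the predicted list and the curated list. The following sketch shows the bookkeeping with synthetic placeholder identifiers chosen only to reproduce the counts quoted above; they are not real yeast gene names, and the function name is hypothetical.

```python
def compare_to_curated(predicted, known):
    """Summarise the overlap between predicted TFs and a curated list of known TFs."""
    confirmed = predicted & known
    return {
        "confirmed": len(confirmed),              # predicted and known
        "not_in_curated": len(predicted - known), # novel candidates or false positives
        "missed": len(known - predicted),         # known TFs we failed to predict
        "recovered_fraction": len(confirmed) / len(known),
    }


# Synthetic identifiers sized to match the yeast benchmark:
# 169 predictions, 160 known TFs, 125 in common.
shared = {f"shared_{i}" for i in range(125)}
predicted = shared | {f"pred_only_{i}" for i in range(44)}   # 5 false positives + 39 novel
known = shared | {f"known_only_{i}" for i in range(35)}

summary = compare_to_curated(predicted, known)
print(summary["confirmed"], summary["not_in_curated"], summary["missed"])  # 125 44 35
print(f"{summary['recovered_fraction']:.0%}")  # 78%
```

Set comparison alone cannot separate the 5 apparent false positives from the 39 novel predictions; that distinction requires manual inspection of the individual proteins.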

In summary, we have developed a method for predicting sequence-specific DNA-binding transcription factors. Based on an evaluation using a large set of annotated protein sequences, we find that it is highly accurate (97% correct) and has good coverage (65% identification rate).