Regulation of gene expression influences almost all biological processes
in an organism; sequence-specific DNA-binding transcription factors are
critical to this control. For most genomes, the repertoire of
transcription factors is only partially known. Hitherto, transcription
factor identification has been largely based on genome annotation
pipelines that use pairwise sequence comparisons, which detect only those
factors similar to known genes, or on functional classification schemes
that amalgamate many types of proteins into the category of 'transcription
factor'. Using a novel transcription factor identification method, the
DBD transcription factor database fills this void, providing genome-wide
transcription factor predictions for organisms from across the tree of life.
DBD: Transcription factor prediction database
The most recent article describing DBD is:
and, the original article describing DBD is:
The prediction method behind DBD identifies sequence-specific DNA-binding
transcription factors through homology using hidden Markov models (HMMs) of
domains. The collection of HMMs is taken from two existing databases (Pfam
and SUPERFAMILY), and is limited to models that exclusively detect
transcription factors that specifically recognise DNA sequences. It does
not include basal transcription factors or chromatin-associated proteins,
for instance. Based on comparison with experimentally verified annotation,
the prediction procedure is roughly 97% accurate. Between one quarter and
one half of our genome-wide predicted transcription factors are novel,
representing previously uncharacterised proteins.
The DBD consists of predicted transcription factor repertoires for
930 completely sequenced genomes, their domain assignments and the
hand-curated list of DNA-binding domain HMMs. Users can browse, search or
download the predictions by genome, domain family or sequence identifier,
view families of transcription factors based on domain architecture and
receive predictions for a protein sequence.
Prediction Method
First, we manually inspected all SCOP and Pfam families to
identify those that consist exclusively of sequence-specific DNA-binding
domains. From this annotation, we selected the hidden Markov models that
represent these families from the SUPERFAMILY and Pfam databases. To
make a prediction for a given protein, we search its amino acid sequence
against the HMM libraries and designate the protein a TF if it has
a significant match to a model we annotated as a sequence-specific
DNA-binding domain.
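The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DBD implementation: the model identifiers in the curated set are illustrative placeholders, the DomainHit record and the predict_tfs function are hypothetical names, and the E-value cut-off is an assumed value because the text does not say which significance threshold is used.

```python
from dataclasses import dataclass

# Illustrative placeholders standing in for the hand-curated list of
# sequence-specific DNA-binding domain HMMs (not the real DBD list).
CURATED_DBD_MODELS = {"PF00010", "PF00096", "SF46689"}

# Assumed significance cut-off; the source does not state the value used.
E_VALUE_THRESHOLD = 1e-3


@dataclass(frozen=True)
class DomainHit:
    """One HMM match against a protein (hypothetical record layout)."""
    protein_id: str
    model_id: str
    e_value: float


def predict_tfs(hits, curated=CURATED_DBD_MODELS, threshold=E_VALUE_THRESHOLD):
    """Return ids of proteins with a significant hit to a curated DBD model."""
    return {
        h.protein_id
        for h in hits
        if h.model_id in curated and h.e_value <= threshold
    }


hits = [
    DomainHit("protA", "PF00010", 1e-20),  # curated model, significant
    DomainHit("protB", "PF00069", 1e-50),  # significant, but not a curated model
    DomainHit("protC", "PF00096", 0.5),    # curated model, not significant
]
print(sorted(predict_tfs(hits)))  # ['protA']
```

Only protA is called a transcription factor here: a hit has to be both significant and to a model on the curated list, which is what keeps basal factors and other DNA-associated proteins out of the prediction set.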
Benchmarks
To evaluate the accuracy of the prediction process,
we carried out a series of tests on groups of sequences that had been experimentally
annotated as transcription factors.
The aim of the first test was to assess the accuracy of the underlying approach,
transcription factor identification via the presence of a DNA-binding domain, without
adding the complexity of domain prediction.
The sequence set was from the PDB, including only
proteins of known structure with curated domain composition from SCOP.
By including only proteins with known domain composition,
we eliminated any potential error introduced by incorrect assignments from the hidden Markov models.
We used the Gene Ontology (GO) annotation of the PDB proteins as a standard
list of known TFs
(Table 1 lists the GO categories we treated as representing transcription factors).
When we examined PDB proteins containing a DNA binding domain,
we found that more than 99% (394) are classified by GO as sequence specific TFs.
The remaining 1% (3) are classified by GO as "nucleic acid binding" but have not been allocated to a sub-category.
This test illustrates the validity of both the underlying approach, prediction based on structural domains,
and our hand curation of the SCOP domains.
Accession     Description
GO:0003700    transcription factor activity
GO:0003702    RNA polymerase II transcription factor activity
GO:0003709    RNA polymerase III transcription factor activity
GO:0016563    transcriptional activator activity
GO:0016564    transcriptional repressor activity
Table 1: Sequence specific DNA binding transcription factor GO categories.
These five categories are from the molecular function ontology and have been
selected because they include only sequence specific DNA binding transcription factors.
The second test aimed to evaluate the prediction method as a whole, including the domain assignment step
using SUPERFAMILY and Pfam.
The sequence set used was from the UniProt database,
the most comprehensive catalogue of proteins available, including more than 1.5 million sequences.
As a standard for comparison, we used the experimentally verified GO annotation
(that is, we excluded homology-based annotation).
We searched the Pfam and SUPERFAMILY hidden Markov models against the UniProt
sequence set and derived a set of putative transcription factors.
To calculate the accuracy of our method, we counted the number
of predicted TFs for which GO supports our prediction, that is, GO annotates the protein
as a sequence-specific DNA-binding TF (the categories included are shown in Table 1).
This benchmark establishes our accuracy to be 97% (Table 2).
This means that we expect 3 out of 100 of our predictions to be incorrect,
that is, the proteins identified are not transcription factors.
Conversely, we calculated the coverage of our method to be 65% by counting the number of
proteins that GO annotates as a transcription factor but that we fail to predict.
This suggests that we miss around one third of transcription factors.
Closer inspection of these proteins showed that many are not actually
sequence-specific DNA-binding TFs, but rather are involved in some other
expression-related process (for example, basal TFs).
This means that the miss rate of one third should be considered an upper bound.
We expect to miss some TFs because we rely on HMM domain assignments which
are known to give incomplete coverage (depending on the genome, between 30% and 60%
of amino acids lack a domain assignment).
Closer inspection of the 358 known TFs that we categorised as carrying out some
other (non-expression related) function indicates that limitations in the
homology detection are likely to be to blame;
more than 60% of this set have no domain assignments at all
and XXX have an unassigned region large enough to be occupied by a DBD.
                      Our TF list           GO annotated TFs
                      as annotated by GO    as annotated by us
Annotated as TF       786                   786
Expression related    18                    72
Other function        9                     358
Unclassified          975                   76
Table 2: UniProt benchmark
To evaluate the prediction method, and in particular to assess the impact
of using SUPERFAMILY and Pfam for predicting DNA-binding domains on sequences
without known structures, we compared our predictions with the experimentally
derived GO annotation of the UniProt database.
The first column of numbers indicates the GO annotation of proteins in our predicted TF set.
97% of predictions are corroborated by GO, giving a false positive rate of 3%.
Manual inspection (and literature search) of the false positives suggests that at least
half are in fact experimentally verified sequence specific DNA binding proteins.
Many of the remaining half have little annotation, but any
provided is supportive of the suggestion that these proteins
are transcription factors.
The final column shows our annotation of all the proteins GO annotates as transcription factors.
This shows that we identify 65% of known transcription factors or, conversely, that we
miss about one third.
Manual inspection suggests that some of the missed proteins may in fact be
basal factors and therefore should not be included in our set.
Perhaps the most interesting finding from this analysis is the large number (975) of TFs
we found that were annotated by GO as unclassified.
It should also be noted that in total we identified 37736 transcription factors from the
1.5 million sequences in UniProt, but only the 1788 mentioned in the GO annotation have
been included in the table above.
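The 97% and 65% figures can be recovered from the counts in Table 2 with a short calculation. The sketch below rests on one assumption that the text does not state explicitly: proteins in the "Unclassified" row are left out of both denominators. That choice is an inference made here because it reproduces the reported percentages; the variable and function names are hypothetical.

```python
# Counts copied from Table 2.
predicted_by_go = {  # our predicted TFs, as annotated by GO
    "tf": 786, "expression_related": 18, "other_function": 9, "unclassified": 975,
}
go_tfs_by_us = {  # GO-annotated TFs, as annotated by us
    "tf": 786, "expression_related": 72, "other_function": 358, "unclassified": 76,
}


def fraction_tf(column):
    """Share of classified entries labelled TF (assumes 'unclassified' is excluded)."""
    classified = (
        column["tf"] + column["expression_related"] + column["other_function"]
    )
    return column["tf"] / classified


accuracy = fraction_tf(predicted_by_go)  # 786 / 813
coverage = fraction_tf(go_tfs_by_us)     # 786 / 1216

print(f"accuracy: {accuracy:.1%}")  # accuracy: 96.7%
print(f"coverage: {coverage:.1%}")  # coverage: 64.6%

# The first column sums to the 1788 predictions mentioned in the GO annotation.
print(sum(predicted_by_go.values()))  # 1788
```

Under this reading, the false positive rate is the remaining (18 + 9) / 813, about 3%, and the miss rate is (72 + 358) / 1216, about one third.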
The final group of tests used to evaluate the prediction method involved
comparison with curated lists of transcription factors from Saccharomyces cerevisiae.
We predicted 169 transcription factors: 125 of these are known, 5 seem to be false positives
and 39 are novel, previously uncharacterised proteins.
Conversely, of the 160 known transcription factors, we correctly predicted 78% (125).
Of the remaining 22% (35), half had no domain assignments.
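A comparison of this kind comes down to set operations on protein identifiers. The following is a minimal sketch using hypothetical identifiers (geneA to geneF) and a hypothetical compare_to_curated function; it does not use the actual yeast lists. Note that set arithmetic alone cannot separate false positives from novel factors: that distinction requires manual curation of the predictions absent from the curated list.

```python
def compare_to_curated(predicted, known):
    """Split predictions against a curated list and report coverage."""
    recovered = predicted & known  # known TFs that were predicted
    missed = known - predicted     # known TFs that were not predicted
    unlisted = predicted - known   # novel candidates or false positives
    coverage = len(recovered) / len(known)
    return recovered, missed, unlisted, coverage


# Toy data with hypothetical identifiers.
predicted = {"geneA", "geneB", "geneC", "geneD"}
known = {"geneA", "geneB", "geneE", "geneF"}

recovered, missed, unlisted, coverage = compare_to_curated(predicted, known)
print(sorted(recovered), sorted(missed), sorted(unlisted), coverage)
# ['geneA', 'geneB'] ['geneE', 'geneF'] ['geneC', 'geneD'] 0.5

# With the counts reported for S. cerevisiae: 125 of 160 known TFs recovered.
print(f"{125 / 160:.0%}")  # 78%
```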
In summary, we have developed a method for predicting sequence specific DNA binding
transcription factors.
Based on an evaluation using a large set of annotated protein sequences,
we find that it is extremely accurate (97% correct) and has good coverage
(65% identification rate).