Following the release of the completed human
genome sequence in April 2003, the scientific
community intensified its efforts to mine the data
for clues about how the body works in health and in
disease. A basic requirement for this understanding of
human biology is the ability to identify and
characterize sequence-based functional elements
through experimentation and computational analysis.
In September 2003, the National Human Genome Research
Institute (NHGRI) introduced the ENCODE (ENCyclopedia Of
DNA Elements) project to facilitate the identification and analysis
of the complete set of functional elements in the
human genome sequence. During the initial pilot and
technology development phases of the project, 44
regions—approximately 1% of the human
genome—were targeted for analysis using a
variety of experimental and computational methods with
the aim of assembling a comprehensive encyclopedia of
the functional elements in these regions, showing
their identity and precise location. The pilot project
established protocols for scaling up to full-genome
coverage and produced a wealth of data, elucidating
elements such as protein-coding genes, transcription
units, protein binding sites, conserved DNA elements,
features of chromatin assembly and modification, and
single nucleotide polymorphisms.
During the pilot phase, UCSC collected, processed,
and released more than 500 ENCODE data sets
representing a broad range of experimental methods and
diverse tissues and cell lines. In addition to the two
designated ENCODE cell lines, HeLa cervical carcinoma
and GM06990 lymphoblastoid, more than 40 cell types
are represented. A substantial proportion of the data
is the product of chromatin immunoprecipitation
(ChIP-chip) experiments used to determine binding
sites for transcription factors: eight groups
have produced ChIP-chip data from four microarray
platforms, investigating more than two dozen
transcription factors and histone modifications.
Several experimental groups have provided time course
data and varied cell treatments. Other notable
experimental data include localization of RNA
transcription starts, identification of regions of
DNaseI hypersensitivity, and temporal profiling of DNA
replication.
Accompanying the ENCODE experimental data, UCSC also
hosts the ENCODE high-quality gene set, provided by
the Gencode project, and a variety of computationally
derived annotations, including gene predictions from
the ENCODE Gene Annotation Assessment Project (EGASP),
pseudogene annotations from four projects, and RNA
secondary structure predictions from two contributors.
The comparative genomics tracks include multiple
alignments of 28 vertebrate species in the ENCODE
regions, produced with three sequence alignment methods
and scored with four different conservation algorithms. The
Genome Browser provides a full set of genome-wide
comparative genomics tracks that complement the
ENCODE tracks, including a genome-wide multiple
alignment covering nearly 30 vertebrate species.
You can find more information about the ENCODE pilot
phase at UCSC in the news
archives.