(This is a copy of NHGRI's official
ENCODE Data Release Policy (2003-2007))
Data Release Principle and Standard
The NHGRI is committed to the principle of rapid data release to the
scientific community. This principle was initially implemented during
the Human Genome Project and has been recognized as leading to one of
the most effective ways of promoting the use of the human genome
sequence to advance scientific knowledge. At a meeting in
Ft. Lauderdale co-sponsored by the Wellcome Trust and NHGRI in January
2003, the concept of rapid data release by genomic sequence data
producers was reaffirmed, and the attendees strongly recommended
applying the practice to other types of data produced by "community
resource projects". The attendees recognized, however, that different
issues, particularly with respect to data validation, would be
involved in the development of appropriate release practices for
different types of data. Since they also recognized that sustaining
the practice of rapid, prepublication data release by community
resources requires that the interests of all involved - including the
data producers, data users, and funding agencies - be addressed, they
emphasized the need to develop a tripartite system of
responsibility. A report summarizing
the meeting at Ft. Lauderdale is also available.
The NHGRI has identified the Encyclopedia of DNA Elements (ENCODE)
Project, designed to comprehensively identify functional elements in
the human genome sequence, as a community resource project. ENCODE
has begun as a pilot effort to test and compare methods for the
exhaustive identification and validation of functional sequence
elements in a limited portion (~1%, or 30 Mb) of the human genome. In
practice, the ENCODE data release policy will be affected by two
important considerations: (1) several different data types will be
generated, as a variety of experimental approaches will be taken in
the Project to identify functional sequence elements, and (2) the
criteria for validation for each data type, which will vary, need to
be taken into account in developing appropriate data release standards
for each data type.
At the outset of the project, the ENCODE Consortium considers it
relevant to distinguish between data verification and data
validation. 'Data verification' is understood to refer to assessing
the reproducibility of an experiment, while 'data validation' is
understood to refer to confirmation by other, independent methods. As
outlined below, the Consortium believes that early deposit of data in
public databases is important, and that this should happen as soon as
the data are verified, even if they have not yet been validated. For
each data type, the Consortium is attempting to identify the minimal
verification standard necessary for public release. The
Consortium members will also identify additional levels of validation
that will be applied in subsequent analyses of the data or with
additional experimentation where appropriate. When possible,
estimates of the false positive and false negative rates for the
particular experimental approach will be included in the data releases
as a measure of data validation. The data will be deposited to public
databases, such as GenBank or ENCODE Consortium databases, and the
data will be available for all to use without restriction (see
Appendix A).
ENCODE Publication Policy / Intellectual Property Considerations
As recommended at the Ft. Lauderdale meeting for a community resource
project, the ENCODE Consortium has published an initial manuscript, a
so-called "marker paper", describing the goals of the project, its
data release practices, and the publication policies that it intends
to follow.
As noted, the main goal of the ENCODE pilot project is to compare the
ability of a set of research methods to identify comprehensively all
sequence-based functional elements in genomic DNA. Thus, the final
product of the Consortium, which it intends to publish in a
peer-reviewed journal, is planned to be an overall analysis of the
different methods tested by the Consortium members, an annotated
version of the full set of selected ENCODE target sequences, with all
of the functional elements identified by the Project, and a
recommendation for how to expand the ENCODE project to annotate the
entire human genome. The Consortium expects to submit this manuscript
or manuscripts for publication within six months of the end of the
pilot project. In addition to group publication(s), all of the
individual research groups in the ENCODE Consortium are free to
publish the results of their own efforts in independent publications
at any time. In these individual papers, Consortium participants will
not be restricted to describing the methods developed for the project,
but can and should expand into describing biological insights that
arise from their analyses. To facilitate comparison of data between
different groups involved in ENCODE, all publications by Consortium
members should, when possible, include data on a common reference set
of reagents agreed upon by the Consortium, e.g., a common cell line or
a common antibody, as applicable.
Users of Consortium data, whether members of the Consortium or not,
should be aware of the publication status of the data they use and
treat them accordingly.
For example, all investigators, including other Consortium members,
should obtain the consent of the data producers before using
unpublished data in their individual publications. Consortium members
will not have privileged access to data from other members of the
Consortium. Rather, all data shared by the Consortium members will be
obtained from the data that have been released to public databases.
Investigators outside of the ENCODE Consortium are free to use the
ENCODE Consortium data, either en masse or specific subsets, but are
asked to follow the guidelines developed at the Ft. Lauderdale
meeting. Specifically, data users should cite the source of the data
(referencing the initial ENCODE marker paper) and should acknowledge
the data producers from the ENCODE Consortium. In addition, the data
users are asked to recognize the interests of the data producers to
publish reports on the generation and analysis of their data. The
ENCODE data are released to public databases as pre-publication data
and remain unpublished until they appear in peer-reviewed
publications. Outside investigators who perform an in-depth analysis
of data from the ENCODE Consortium and are interested in publishing a
report before the data producers do so should discuss their results
with the data producer(s) and are encouraged to establish
collaborations. However, the ENCODE Consortium members are not
required to collaborate with any outside investigators. All
investigators, through their roles as journal and grant reviewers,
should enforce a high standard of respect for the scientific
contribution of the data producers. This discussion of the ENCODE
data release policy has been primarily directed at issues concerning
the use of ENCODE data in scientific publications. The intent of the
policy is to accelerate the use of the data by the scientific
community. To facilitate this goal, the data producers agree not to
restrict the use of the data by others while the data users are
encouraged to act in a manner that is consistent with this
unrestricted access policy. The associated issue of intellectual
property as it pertains to the ENCODE data is addressed in Appendix B.
Appendix A: Data Release Standard for the First Level of Verification
The Data Sharing/Release working group has recommended that the ENCODE
Consortium establish a well-articulated description of a first-level
verification standard for each data type produced by Consortium
members: ENCODE labs should release, to an appropriate public
database, data obtained in experiments when this standard has been
met. In most cases, it is anticipated that additional efforts for
further verification and validation of the data will be carried out,
but these should not delay the initial release of data. The working
group acknowledges that releasing preliminary data may not be the
first choice of the data producers. However, on the assumption that
such data can be useful to the scientific community, NHGRI has adopted
the policy for the ENCODE Project to make such data available in a
timely manner. This policy is consistent with the Institute's
commitment to rapid data release to the scientific community.
All of the data generated by the ENCODE project will be linked to the
human genome sequence. Data from the ENCODE Project that can be
directly displayed on the human genome sequence will be stored and
delivered by the University of California, Santa Cruz (UCSC) Genome
Browser; other Project data will be stored and delivered by the
appropriate databases to be coordinated by the NHGRI Genome Technology
Branch. All ENCODE data must be accompanied by information on how
the experiment was performed and how the raw data were analyzed to
generate the conclusions (i.e., sequence elements) to be displayed.
As data are deposited into public databases, individual tracks will be
created to display these data on the UCSC Browser. Where applicable,
the primary data underlying any sequence elements will be linked
directly to the browser track. Participating labs are encouraged to
submit their data rapidly even if they conflict with data from other
groups. As additional data validations are performed, the
investigators can modify the submitted data or even withdraw the data
if further tests call into question the validity of the released data.
All data will be accompanied by prominent caveats notifying users of
the level of verification of the data and that frequent data releases
and updates will be forthcoming as further validation and analyses are
performed.
Appendix B: ENCODE Intellectual Property Issues
Since the inception of the Human Genome Project, NHGRI policy has
encouraged the rapid release and ready accessibility of genomic data
to the broad research community. A related issue of availability
pertains to any intellectual property rights that might be sought by
data generators, and the effect that the exercise of such rights has
on access to the data.
The Bayh-Dole Act of 1980 provides a statutory mandate to NIH grantees
and contractors to seek patent protection, when appropriate, on
inventions made using government funds and to license those inventions
with the goal of promoting their utilization, commercialization and
public accessibility. While the NHGRI has, in accordance with that
law, encouraged grantees to seek patent protection for genomic
technologies that have been developed with grant funds, the Institute
has been concerned about patent claims, and the exercise of such
claims, in the case of large-scale genomic data sets because of the Institute's
belief that broad accessibility to the data is of paramount
importance, and that such data are generally pre-competitive, i.e., a
considerable amount of work would need to be performed beyond the
initial data production to demonstrate utility. For genomic sequence
data, for example, NHGRI indicated its opinion that raw data, in the
absence of additional experimental biological information, lack
demonstrated specific utility and therefore are inappropriate
materials for patent filing. The grantees participating in the NHGRI
large-scale sequencing program have been monitored for whether they
filed patent claims and, to date, none have. In the case of the
HapMap Project, the participants (including the NHGRI grantees) agreed
not to file for patents on the bulk data from the Project. However,
there was a complication because the raw data produced by the Project
(SNPs and individual genotypes) had to be processed to generate the
Project's ultimate output (haplotypes). In considering the issue of
data release, HapMap participants were concerned about the possibility
that researchers outside of the Project could add some of their own
data to the raw Project data, develop haplotypes prior to the
Project's ability to do so, file patent claims based on the combined
data, and then potentially restrict access by others to the HapMap
data (a so-called parasitic patent). To deal with this concern, a
click-wrap license was imposed on the individual genotype data; to
gain access to the data, researchers are required to agree not to
restrict the access of others to the data and not to share the data
with anyone who has not agreed to the click-wrap license.
In some respects, the cases of genomic sequence data and haplotype
data were relatively easy to deal with because the data themselves do
not have "utility" (in the patent law sense of the term). As a
result, grantees did not express concern about the NHGRI policies on
data release. In the case of the ENCODE Project, however, the
applicability of this argument is not as obvious. The ENCODE
Consortium will include both members funded by NHGRI ENCODE grants and
those funded by other sources. The purpose of the ENCODE Project is
to generate data that identify or define genomic DNA sequence elements
that have biological function, and therefore might be considered to
have utility and be able to be patented. Therefore, the use of patents
in ways that might restrict access to large amounts or broad
categories of data, e.g., all transcription factor binding sites, is
an issue that needs to be addressed.
NHGRI's primary interest is to ensure the widespread availability of
all information and any inventions that are generated during the
ENCODE Project. NHGRI, therefore, encourages all ENCODE data
producers to consider placing all information generated from their
project-related efforts in the public domain and to address the NIH
guidelines on the sharing of research tools (http://www.ott.nih.gov/policy/rt_guide_final.html).
In the cases in which the Consortium members elect to exercise their
intellectual property rights, NHGRI encourages consideration of
maximal use of non-exclusive licensing of patents to allow for broad
access and stimulate the development of multiple products. As a
criterion for joining the ENCODE Consortium, investigators have agreed
to abide by the Project's data release policy.
NHGRI also encourages users of the ENCODE data to act responsibly and
share the effort involved in maintaining unrestricted access to the
data. Thus, for example, if a data user were to incorporate ENCODE
data into an invention, the subsequent license should not restrict the
access of others to the ENCODE data. For this purpose, the term "data
users" is meant to include both researchers who are members of the
ENCODE Consortium and researchers who are not.
The ENCODE pilot phase, during which time data corresponding to only
1% of the human genome will be produced, will provide NHGRI with an
opportunity to observe data producer and data user practices with
respect to intellectual property and the ENCODE Project. NHGRI
grantees are reminded that the grantee institution is required to
disclose each subject invention to the Federal Agency providing
research funds within two months after the inventor discloses it in
writing to grantee institution personnel responsible for patent
matters. NHGRI will monitor grantee activity in this area to learn
whether or not attempts are being made to patent large amounts of
information derived from the ENCODE Project. If, in the future,
circumstances arise that convince NIH that additional measures are
needed to achieve the goal of widespread access to the results of the
Project, the Institute reserves the right to consider a determination
of exceptional circumstance to restrict or eliminate the right of
parties, under future grants, to elect to retain title. Similarly,
NHGRI will monitor the activity of data users to attempt to determine
whether access to the ENCODE data is being encumbered by any
restrictive licenses. If the policy of reliance on data user
responsibility to maintain unrestricted data access is not effective,
the NHGRI will consider adopting a click-wrap license similar to that
used by the HapMap Project to protect the ENCODE data and to ensure
unrestricted access to and use of these data.