To document your data means to give precise information about your data, for example:

  • What is the title of the dataset?
  • What is the persistent identifier for the dataset?
  • What is the dataset about? What kind of data does it contain?
  • How are the data in the dataset organised in files and folders?
  • What method(s) were used to collect or generate the data?
  • When and where were the data collected or generated?
  • Who owns the data? Who is/are the author(s) of the dataset?
  • Are there restrictions on access to the data?
  • What is the license for reuse of the data?

These questions and more should be answered when documenting your data. The documentation is entered when archiving your dataset, but you would be wise to think about, and take notes of, the information needed as you collect or generate your data and as your PhD project proceeds.

The motivation for documenting your data is to make sure that those who find your dataset understand what information it holds, and that they interpret it correctly. Leave as little room as possible for outsiders to misunderstand your data. And do this for your own sake as well: as the years go by, you will probably not remember all the conditions and background of your data. So if you want to be able to return to your data at some point in the future, make sure you do a good job documenting them.

There are mainly two elements to documenting your data: structured metadata and a Readme file.


Metadata
Metadata is information about your data and makes your data findable in discovery services. Metadata will also enhance the value and reusability of your data, so good metadata quality is important. When archiving your data for long-term preservation (and sharing, if possible), we advise you to use a good-quality, and preferably certified, data archive. Such archives make use of metadata schemas, in which some fields are typically required while others are recommended if applicable. A metadata schema will guide you on what should be included as metadata for your dataset.

By following a standard, the schema ensures that information is entered in clearly defined fields: what information should be entered where, and what syntax is to be used. How should an author's name be entered? As “Surname, Given name”, or with surname and given name in separate fields? This matters so that search engines do not mix up surnames and given names. Date information is another example. Which date information is entered where? The date of data collection, the time period covered by the data, the date the dataset was published or updated … And which syntax is used for the date? YYYY-MM-DD or something else? Such issues are taken care of by metadata standards and the schemas used by data repositories, which ensure that search engines handle the metadata correctly. Following metadata standards thus allows metadata to be harvested and combined across sources, bringing together a large amount of data on a topic. Without metadata standards, this would not be possible.
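The name and date conventions above can be sketched in code. This is a minimal illustration, not any real metadata standard: the field names (creator, familyName, dateCollected, datePublished) are assumptions made up for this example. It shows the two practices the paragraph describes: keeping surname and given name in separate fields, and writing dates in the YYYY-MM-DD (ISO 8601) syntax, which can be checked mechanically.

```python
from datetime import date

# A hypothetical metadata record; field names are illustrative only,
# not taken from any specific metadata standard.
record = {
    "creator": {"familyName": "Curie", "givenName": "Marie"},  # separate name fields
    "dateCollected": "1898-07-01",   # ISO 8601 syntax: YYYY-MM-DD
    "datePublished": "1902-12-15",
}

def validate_dates(rec):
    """Check that every date field parses as an ISO 8601 calendar date."""
    for key in ("dateCollected", "datePublished"):
        date.fromisoformat(rec[key])  # raises ValueError on bad syntax
    return True

print(validate_dates(record))  # True: both dates are well-formed
```

A date written as "01/07/1898" would fail this check, which is exactly the kind of ambiguity (day first or month first?) that a fixed syntax avoids.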

There is a rich variety of metadata standards, developed to fill different needs, and those needs can vary a lot. Data are heterogeneous material, ranging from interview recordings, to counts and measurements of phenomena in nature or in astrophysical space, to computer-generated data from simulations. Obviously, what metadata to record will vary accordingly. Therefore, different data repositories may specialise in archiving subject-specific data by applying a metadata standard especially suited to their needs. Your PhD project may be such that there are dedicated data archives using metadata standards that suit your dataset and your subject field perfectly. Your supervisor or peers may be the people to discuss this with, or you can ask at your library.

A list of metadata standards can be found at the Digital Curation Centre (DCC). DCC also gives an introduction to some examples of metadata standards.

The various metadata elements may be grouped into different categories:

Citation metadata: These are metadata that give the basic who, what, where, and when of the data, and are the metadata needed to cite your data. This should always include:

  • Author / owner of the data
  • Title
  • Publication date
  • Persistent identifier
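The four citation elements above are what a data citation is assembled from. As a sketch (the helper function, the citation format, and the DOI are all hypothetical, loosely following the common "Author (Year). Title. Identifier" pattern):

```python
# Hypothetical helper: assemble a data citation from the four core
# citation metadata fields listed above.
def format_citation(author, title, year, identifier):
    return f"{author} ({year}). {title} [Data set]. {identifier}"

citation = format_citation(
    author="Curie, Marie",                          # author / owner
    title="Radioactivity measurements",             # title
    year=1902,                                      # publication date
    identifier="https://doi.org/10.xxxx/example",   # persistent identifier (placeholder DOI)
)
print(citation)
```

If any of the four fields is missing from the metadata, a complete citation like this cannot be built, which is why they should always be included.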

Descriptive metadata: These are metadata that describe your dataset, and they are the main elements that make your dataset findable through search engines. Author and publication date belong to both the citation and the descriptive categories, while other elements serve a purely descriptive purpose:

  • Author
  • Title
  • Keywords
  • Abstract
  • Date
  • Geographic information

Legal metadata: This too is an important category. If, for example, a dataset lacks clear license information, how will you know what you are allowed to do with it? Contacting the author or owner to ask may be cumbersome. Legal metadata includes elements such as:

  • Author
  • Owner
  • Copyright holder (if applicable)
  • License information
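A repository schema typically enforces such required fields automatically, but the check itself is simple. A minimal sketch, assuming made-up field names, of flagging legal metadata gaps that would leave a reuser unsure what they are allowed to do:

```python
# Hypothetical required legal fields; real schemas define their own.
REQUIRED_LEGAL_FIELDS = ("author", "owner", "license")

def missing_legal_fields(record):
    """Return the legal metadata fields that are absent or empty."""
    return [f for f in REQUIRED_LEGAL_FIELDS if not record.get(f)]

record = {"author": "Curie, Marie", "owner": "Institut du Radium", "license": ""}
print(missing_legal_fields(record))  # ['license']: no reuse terms given
```

An empty license field is flagged just like a missing one: either way, the reuser is left guessing.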

Examples of archived datasets and their metadata

Lessons learned:

  • When documenting your dataset, keep in mind the information needs of a person who does not know your dataset at all.
  • Too much metadata in a dataset will never be a problem. Too little metadata may easily reduce the quality and reusability of the dataset.
  • Structured metadata are developed for machine readable information about the dataset.
  • Different data repositories may use different metadata standards, according to the kind of data they are designed for.
  • Metadata standards enable search engines to do precise searches across multiple repositories.
  • Metadata and the human readable Readme file complement each other, to make up the documentation of a dataset.

Food for thought
Have you considered where the data from your PhD project should be archived? Have you assessed the metadata schema used in the various data repositories you may choose to use?

Last modified: Tuesday, 6 December 2022, 3:57 PM