Do Yourself and Others a Favour
Digital data are the foundation of all scientific research. The growing amount and complexity of scientific data – in terms of data types, file formats, file sizes and quantity of files – make it challenging to keep data organized and documented so that they remain available for verifying research and for new analyses that lead to new insights.
The DFG Code of Conduct “Guidelines for Safeguarding Good Research Practice”, applicable since August 2019, emphasizes the importance of documenting research. It encourages scientists to make research data publicly available and to archive data for the long term (generally ten years). The Rules of Good Scientific Practice of the Max Planck Society likewise require that primary data be stored for at least ten years. And under Horizon Europe, beneficiaries must manage the research data generated in their projects.
Discipline Specific Data Management
In addition to institutional requirements, there are also subject-specific needs, which are in many cases communicated via scientific societies. The DFG is collecting subject-specific recommendations on the handling of research data. This collection is a good point of reference for discipline-specific requirements for research data.
Why Data Management?
Data management is good scientific practice, ensuring that research can be verified and reproduced. Moreover, data can be re-used – on their own or combined with other data – including in ways not anticipated by initial data creators. This can also prevent duplication of research efforts.
Data management is increasingly required by funding agencies and publishers. Funding agencies expect results of publicly funded research to reach a broad audience. Publishers insist that data underlying scientific papers are made available. Last but not least, it is in your interest, for two reasons:
- You can get credit for sharing your data. It creates additional exposure for your research and may lead to new collaborations.
- It enables you to keep track of your data and facilitates collaboration in research teams. Without effort spent on organizing and documenting your data, can you be sure you will find and understand them five years from now? Will your collaborators understand your data? You might be the first re-user of your own data. It may be necessary to re-analyse raw data at the request of, e.g., your supervisor or journal reviewers, or even after publication if your interpretations and conclusions are questioned.
Data management does not imply that all data have to be shared immediately. Data publication can be delayed until after the publication of research findings; sensitive data might be shared with a limited audience and only under certain conditions, or not at all.
What are Research Data?
There is no straightforward definition applicable to all scientific fields – the broadest one would be “anything needed to underpin scientific research”, e.g. to validate and reproduce research findings. While physical samples may also be considered “data”, we focus on digital (or digitized) objects.
Depending partly on the scientific field and the research questions, data can be quantitative or qualitative, in various formats, and generated with a range of methods (e.g. field observations, experiments, model output, interviews/polls).
- Raw data are the initial primary results.
- Processed data are the product of subsequent steps, e.g. data cleaning, quality control, data aggregation, statistical analysis.
- Tools used during such steps – code, scripts, scientific software – are also important to reproduce and re-use your research.
- Preserving, documenting, and – to the extent possible – sharing these tools contributes to research reproducibility and is thus good scientific practice.
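To make such processing steps traceable, a script can both perform the cleaning and record what it did. The following is a minimal sketch, not a prescribed workflow – the file names, column layout, and the single quality-control rule (dropping rows with missing values) are invented for illustration. It writes the processed data alongside a small JSON log that links the processed file back to the raw data.

```python
import csv
import json
from datetime import datetime, timezone

# Hypothetical file names; a real project would use its own paths.
RAW_FILE = "raw_measurements.csv"
CLEAN_FILE = "clean_measurements.csv"
LOG_FILE = "processing_log.json"

def clean(rows):
    """Drop rows with any missing value -- one simple quality-control step."""
    return [r for r in rows if all(v.strip() for v in r.values())]

def process(raw_path=RAW_FILE, clean_path=CLEAN_FILE, log_path=LOG_FILE):
    # Read the raw data.
    with open(raw_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = reader.fieldnames

    kept = clean(rows)

    # Write the processed data.
    with open(clean_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)

    # Record what was done, so the processed file can be traced
    # back to the raw data and the step can be reproduced.
    log = {
        "raw_file": raw_path,
        "clean_file": clean_path,
        "rows_in": len(rows),
        "rows_out": len(kept),
        "step": "dropped rows with missing values",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "w") as f:
        json.dump(log, f, indent=2)
    return log
```

Keeping such a script (and its log) together with raw and processed data means that anyone – including your future self – can see exactly how the processed dataset was derived.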
What are Metadata?
“Data about data” are equally relevant for the discovery and understanding of scientific datasets. Depending on the characteristics of the dataset and targeted audience, various levels of detail may be adequate. Metadata can cover the following aspects:
- bibliographic information (title, author, …)
- contextual information (subject, geographical location, temporal coverage, keywords, …)
- administrative aspects (date of creation, file type, access rights, …)
- technical metadata (file properties and file sizes, …)
Metadata can be embedded in the file itself, kept in a separate README file, or presented on a landing page for the dataset – the preferred option for discovery via search tools.
- Dublin Core is a widely used generic metadata standard.
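As a small illustration of what a Dublin Core description can look like, the sketch below builds a record with Python's standard library. The dataset and all field values are invented; a real record would typically include further elements and follow the conventions of the chosen repository.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace; register it so serialized
# tags use the conventional "dc:" prefix.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# Hypothetical dataset description -- all values are placeholders.
fields = {
    "title": "Soil moisture measurements, test site A, 2023",
    "creator": "Jane Doe",
    "subject": "soil science",
    "date": "2023-11-30",
    "format": "text/csv",
    "rights": "CC BY 4.0",
}

# Build one dc:* element per metadata field.
record = ET.Element("metadata")
for name, value in fields.items():
    elem = ET.SubElement(record, f"{{{DC}}}{name}")
    elem.text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because Dublin Core is generic and widely supported, even such a minimal record makes a dataset far easier to discover and cite than an undocumented file.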