File Handling

How do you keep track of all the files generated during a scientific project? How do you deal with changes that occur during data processing and analysis, or while editing a text document (e.g. a report)? When several people are involved in a project, it is highly recommended to define practices before the project starts and to observe them throughout the research process. File names should be descriptive, providing useful cues to the content of a file independent of its location on a computer. A well-thought-out folder structure also helps you find your project contents again. Finally, keeping track of different versions of your files facilitates recovery in cases of data loss due to unintended modifications.

File Nomenclature

Here are some recommendations on how to name your files:  

Good practice:

  • Include relevant information in file names. This may include project, investigator (initials), experiment, location, date, parameter, status of data (raw, processed etc.), file version, or anything else deemed relevant in a given context.
  • Use YYYYMMDD or YYYY_MM_DD for dates (e.g. 20160802 or 2016_08_02) and put the date at the beginning or end of file names to facilitate chronological sorting.
  • Use leading zeros for other numbers as needed (001, 002, … 011, … 114 rather than 1, 2, … 11, … 114), depending on the expected number of files, so that files sort properly.
  • Use dashes (file-name.xxx) or underscores (file_name.xxx) to separate parts of a filename.
  • Create a README file explaining file nomenclature (including the meaning of acronyms or abbreviations), file organization and versioning. Store this file at the top level of the folder structure for easy access.
  • Decide to use either upper case or lower case letters; do not mix them. 

It is better not to:

  • use special characters such as ~ ! / \ @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " | as they may have specific meanings in certain operating environments. Also avoid ä ö ü ß å ø ñ or similar.
  • use dots other than before the file extension.
  • use spaces in file names – they might end up being truncated during file transfer (with the file extension also lost and files potentially becoming unreadable).
  • make file names overly long (a maximum of 32 characters is suggested) – this may involve compromises regarding the recommendation above to include all relevant information.
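
The recommendations in both lists above can be checked automatically. The following Python sketch is one possible reading of them, not a definitive rule set. The function name check_file_name is hypothetical, the set of allowed characters (ASCII letters, digits, dashes and underscores) is an assumption, and so is the decision to apply the suggested 32-character limit to the name without its extension, since the text does not say whether the extension counts.

```python
import re

# Suggested maximum from the list above; applying it to the stem only is an assumption.
MAX_STEM_LENGTH = 32


def check_file_name(name: str) -> list[str]:
    """Return the problems found in a file name (an empty list means none)."""
    stem, dot, _extension = name.rpartition(".")
    if not dot:  # no extension present: the whole name is the stem
        stem = name

    problems = []
    if " " in name:
        problems.append("contains spaces")
    if "." in stem:
        problems.append("contains dots other than before the extension")
    # Anything besides ASCII letters, digits, dash and underscore is flagged;
    # spaces and dots are already reported separately above.
    if re.search(r"[^A-Za-z0-9_. -]", name):
        problems.append("contains special or non-ASCII characters")
    if stem != stem.lower() and stem != stem.upper():
        problems.append("mixes upper and lower case")
    if len(stem) > MAX_STEM_LENGTH:
        problems.append("is longer than the suggested maximum")
    return problems
```

Returning a list of messages rather than a single true/false value makes it easier to tell a colleague what exactly should be renamed.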

Good example:

20140618_exp08_co2data_raw_v01.csv
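
A name like this can also be assembled by a small helper so that every project member produces the same pattern. The sketch below is only an illustration: the function build_file_name and its parameters are hypothetical, and the order of the parts (date first, version last) is taken from the example above rather than prescribed.

```python
from datetime import date


def build_file_name(day: date, parts: list[str], version: int, extension: str) -> str:
    """Join date, descriptive parts and a zero-padded version with underscores."""
    fields = [day.strftime("%Y%m%d"), *parts, f"v{version:02d}"]
    # Lower case throughout, following the "do not mix cases" recommendation.
    return "_".join(fields).lower() + "." + extension.lower()


print(build_file_name(date(2014, 6, 18), ["exp08", "co2data", "raw"], 1, "csv"))
# prints 20140618_exp08_co2data_raw_v01.csv
```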

File Organization Structure

In general, for the naming of folders, the same rules apply as for files. Depending on the structure of your data and documentation, you have to decide on how to arrange your files and folders. This may also require judgment and compromises: a very deep folder structure (subsubsubsubsubfolders) can be as inconvenient as one folder with 100 or more individual files.

If project members need different levels of access to files, the folder structure can reflect this, for example by grouping files with the same access restrictions in the same folder.
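
The trade-off between deep nesting and crowded folders described above can be monitored with a short script. This is a rough sketch under stated assumptions: the function find_structure_issues is hypothetical, the default of 100 files per folder is the figure mentioned in the text, and the default depth limit of 4 is an arbitrary placeholder that each project would need to set for itself.

```python
from pathlib import Path


def find_structure_issues(root: Path, max_depth: int = 4, max_files: int = 100) -> list[str]:
    """List folders below root that are nested too deeply or hold too many files."""
    issues = []
    folders = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for folder in folders:
        # Depth is counted in folder levels below root (root itself is 0).
        depth = len(folder.relative_to(root).parts)
        file_count = sum(1 for item in folder.iterdir() if item.is_file())
        if depth > max_depth:
            issues.append(f"{folder}: nested {depth} levels deep")
        if file_count >= max_files:
            issues.append(f"{folder}: holds {file_count} files")
    return issues
```

Calling find_structure_issues(Path("my_project")) on a project folder (the folder name here is a placeholder) returns an empty list when neither limit is exceeded.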

File Versioning

Documents may evolve over time, and several people may be involved in sequential changes. For data (of any kind, e.g. numeric or images) and text documents alike, file versioning serves two purposes:

  • You can revert to earlier versions if needed.
  • You can keep track of changes, including documentation on the underlying rationale and people involved.

Version control can be done either manually by using naming conventions or by using a version control system.

A prominent example of a version control system is Git, a distributed system for tracking changes that is widely used in software development. Git can, of course, also be used for version control of research data other than software code. Several Git services are available within the Max Planck Society: the MPCDF hosts a GitLab instance, and the GWDG offers a GitLab system as well. The Max Planck Institute for Molecular Genetics runs its own GitHub instance, which is open to other Max Planck researchers (access via the helpdesk). Tutorials on using Git are available from Data Carpentry and GitHub Education.

When working with a manual control system, versions should be numbered consecutively; major changes (v1, v2, v3, …) can be distinguished from minor ones (v1-1, v1-2, v1-3 or 1a, 1b, 1c). Use leading zeros if you expect more than nine versions.  
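
For manual versioning, the next file name can be derived from the current one instead of being typed by hand. The following sketch assumes that the version tag has the form _v01 or _v01-02 and sits directly before the file extension. This follows the examples in this text but is not the only possible convention. The function bump_version is hypothetical, and padding minor numbers to the same width as major numbers is likewise an assumption.

```python
import re

# Matches "_v<major>" with an optional "-<minor>", directly before the extension.
VERSION_PATTERN = re.compile(r"_v(\d+)(?:-(\d+))?(?=\.[^.]+$)")


def bump_version(name: str, minor: bool = False) -> str:
    """Return the file name with its major (or minor) version increased by one."""
    match = VERSION_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no version tag found in {name!r}")
    major_text, minor_text = match.group(1), match.group(2)
    width = len(major_text)  # keep the existing zero padding
    if minor:
        new_minor = int(minor_text or 0) + 1
        tag = f"_v{major_text}-{new_minor:0{width}d}"
    else:
        # A major change starts a new version and drops any minor number.
        tag = f"_v{int(major_text) + 1:0{width}d}"
    return name[:match.start()] + tag + name[match.end():]


print(bump_version("20140618_exp08_co2data_raw_v01.csv"))
# prints 20140618_exp08_co2data_raw_v02.csv
```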

However, do not apply such numbering when using version control software, as it will interfere with automatic versioning.  

Qualifiers such as “raw” or “processed” for data, or “draft” or “internal” for documents are useful. But terms such as “final2”, “final-revised”, “final-changed_again”, “final_ready” can be confusing. In other words: Assign “final” only when you are sure that no further changes will occur.  

Additionally, further versioning information can be documented:

  • What has changed and why?
  • How were changes implemented (e.g. data processing steps)?
  • Who was involved and/or responsible?

Such information may be included within the documents or provided separately as a changelog file.  
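
A separate changelog can be kept as a plain text file with one entry per change. The sketch below shows one possible layout that answers the three questions above. The function names, the entry format and all example values are invented for illustration and are not a required standard.

```python
from datetime import date
from pathlib import Path


def format_changelog_entry(day: date, file_name: str, what: str, how: str, who: str) -> str:
    """Build one plain-text changelog entry answering the three questions above."""
    return (
        f"{day.isoformat()} {file_name}\n"
        f"  What and why: {what}\n"
        f"  How: {how}\n"
        f"  Who: {who}\n"
    )


def append_to_changelog(changelog: Path, entry: str) -> None:
    """Append an entry to the changelog file, creating the file if needed."""
    with changelog.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")


# Example values are made up for illustration.
entry = format_changelog_entry(
    date(2014, 6, 20),
    "20140618_exp08_co2data_processed_v02.csv",
    what="removed readings flagged as sensor dropouts",
    how="filtered rows with the project's cleaning script",
    who="AB",
)
print(entry)
```

Passing the result to append_to_changelog together with a path such as Path("CHANGELOG.txt") (a placeholder name) adds the entry to the end of that file, so earlier entries are never overwritten.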

Further Reading

Cornell University: File management

Princeton University Library: File Naming

Ram, K. (2013): Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1), 7. doi:10.1186/1751-0473-8-7.
