During a research project, various files are generated in specific formats, e.g. raw data, analysis, documentation, media files. Files generated by proprietary software are often stored in a format that cannot be read by other applications. For documentation, commercial office software is likely to be used. With a view to sharing information with others and retrieving it in the future, this section discusses which aspects have to be taken into account for successful data management.
Proprietary vs. Open Formats
Proprietary formats often are not readable without the corresponding (commercial) software and may become obsolete in the future. WordPerfect (.wpd) is just one example of a format that used to be common and popular, but is now at best “exotic”. Readability of data generated by scientific instruments like microscopes or MRI facilities may depend on the underlying software, which usually comes at a cost and may no longer be available. On the other hand, these formats often contain additional information (e.g. technical metadata about the device used). Also, proprietary formats can provide specific details on the subject that open formats don’t offer.
Conversion from proprietary to open formats is often possible, but may result in some loss of data and/or context (formatting, embedded metadata, …). In such cases, it may be advisable to save data in both proprietary and open formats.
The decision either to stick with the proprietary files or to convert them to an open format might depend on the following questions:
- Is there a loss of information?
- To what extent are you committed to the vendor of this particular software/hardware?
- Is there a need to edit saved files?
- Who should have access to those files? Are you planning to disseminate your files widely?
File Format Sustainability Factors
The Library of Congress defines seven “sustainability factors for digital formats” (other sources use similar criteria):
- Disclosure: Do complete specifications and tools for validating technical integrity exist? Are they freely available? Digital preservation requires understanding how the information is represented (encoded) as bits and bytes in digital files.
- Adoption: A widely used format is less likely to become (rapidly) obsolete. If it does become outdated at some stage, tools for migrating the file format or emulating its software and hardware environment are more likely to emerge. However, it is difficult to predict whether a format will remain popular over many years.
- Transparency: Are file formats open to direct analysis with basic tools? If the underlying information is represented simply and directly, ideally human-readable in a text editor, migration is more efficient.
- Self-documentation: Are metadata included within the file? Basic descriptive metadata are analogous to the title page of a book and can help scholars of the future to understand what they observe.
- (Limited) external dependencies: To what degree does a format depend on specific hardware, software, or operating systems, and how complex is dealing with these requirements likely to be in future technical environments? The fewer such dependencies, the better.
- Impact of patents: Patents related to a digital format may inhibit the ability of archival institutions to sustain content in that format. Widespread adoption of a format may indicate that patents do not pose problems for long-term archiving.
- Technical protection mechanisms: Encryption poses problems for long-term archiving and dissemination, migration to new formats, and transfer to new storage media.
Comprehensive coverage by the Library of Congress includes discussion of various content categories (still image, sound, textual, moving image, …). The list of recommended formats is updated annually. The Swiss Coordination Center for Permanent Archiving (KOST-CECO) describes 52 important and common file formats in its catalog and analyzes their suitability for digital archiving.
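The transparency factor above asks whether a file can be inspected with basic tools. As a rough illustration only, the following Python sketch tests whether the beginning of a file decodes as UTF-8 text. The function name `looks_like_utf8_text` and the sample size are assumptions made for this example. The check is a heuristic and is no substitute for proper format identification or validation.

```python
import codecs


def looks_like_utf8_text(path, sample_size=65536):
    """Heuristic: True if the first bytes of the file decode as UTF-8 text.

    Hypothetical helper for illustration; it does not identify the format.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)

    # NUL bytes are a strong hint that the content is binary, not plain text.
    if b"\x00" in sample:
        return False

    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # If the sample was cut off at sample_size, a multi-byte character may
        # be split at the boundary; final=False tolerates that incomplete tail.
        decoder.decode(sample, final=len(sample) < sample_size)
    except UnicodeDecodeError:
        return False
    return True
```

A file that passes this check can at least be opened in a text editor. Whether its content is understandable still depends on documentation and metadata.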
Recommended File Formats
The following list is indicative guidance, not a requirement:
| Type of data | Recommended formats | Comments |
|---|---|---|
| Text | Plain text (.txt); Rich Text Format (.rtf); Open Document Text (.odt, ISO 26300) | Encoding for all formats preferably UTF-8. Alternative: .docx (ISO 29500). PDF/A (PDF format for archival purposes; ISO 19005) emphasizes layout definitions. |
| Tabular data (minimal metadata) | Comma-separated values (.csv); Open Document Spreadsheet (.ods, ISO 26300) | Encoding for all formats preferably UTF-8. Alternative: .xlsx (ISO 29500-1). Note that English-speaking and continental European conventions write numbers differently (“.” versus “,” as the decimal mark). This can cause serious problems in tables with columns of numbers, because a comma can be a decimal mark, a thousands separator, or a field separator. This can be avoided by using the tab-delimited format. |
| Tabular data (extensive metadata) | SPSS portable format (.por) | Alternative: .xlsx (ISO 29500-1) |
| Images | TIFF 6.0 uncompressed (.tif); Scalable Vector Graphics (.svg) | TIFF is proprietary but widely used and well documented. JPEG 2000 (ISO 15444) provides lossless compression but isn’t widely adopted. JPEG has lossy compression; please be aware of the data loss that comes with this format. .svg for vector images. FITS is a flexible open file format for images, spectra and tables, especially used in astronomy. |
| Video | OGG video (.ogv, .ogg) | |
| Audio | Free Lossless Audio Codec (.flac); Waveform Audio File Format (.wav) | Alternative: MPEG-1 Audio Layer 3 (.mp3) has lossy compression, but is widely adopted |
For some data types, there is no broad consensus on preferred formats.
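The separator ambiguity mentioned for tabular data can be illustrated with a minimal Python sketch using the standard `csv` module. The sample values and the helper names `to_tsv` and `from_tsv` are assumptions made for this example. It shows that a value with a decimal comma falls apart when fields are naively joined with commas, while a tab-delimited round trip keeps each value intact.

```python
import csv
import io


def to_tsv(rows):
    """Serialize rows as tab-delimited text (illustrative helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


# Hypothetical measurements written with a decimal comma.
rows = [
    ["sample", "mass_g"],
    ["A", "1,5"],
    ["B", "1.234,5"],
]

# Naive comma joining: the decimal comma is mistaken for a field separator,
# so the two-field row "A", "1,5" is read back as three fields.
naive_fields = ",".join(rows[1]).split(",")

# Tab-delimited text keeps each value as a single field.
tsv_text = to_tsv(rows)
round_trip = from_tsv(tsv_text)
```

When writing to disk, opening the file with `open(path, "w", encoding="utf-8", newline="")` follows the UTF-8 recommendation in the table. Note that a tab delimiter only removes the field-separator ambiguity. Which decimal mark the numbers use should still be documented alongside the data.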