Data Quality

Key points

  • Data quality is (pragmatically) defined from a user perspective as “fit for purpose”. It is thus context-dependent: the same data may be high quality in one context, and (relatively) poor quality in another.
  • Data quality is a multidimensional concept, comprising characteristics of the data themselves as well as how they are presented and documented. Some aspects are subjective, and there can be trade-offs between different dimensions of data quality.
  • Initial concepts of data quality stem mainly from a business or management (theory) context, adapted from criteria for product quality (data and information as a product). They may not be directly transferable to scientific data, but can serve at least as guidance or starting points. Readers of this website may also benefit if they later move from science into other professional environments.
  • For scientists, the implications are twofold: data quality refers to their own data – facilitating reproducibility, possible replication, and re-use also beyond science – as well as to assessing the analogous potential of data from others (including non-scientific sources).

Introduction

This chapter is presented as a “forethought” because aspects of data quality concern the entire research process – from the design of, e.g., experiments through data generation to the point where data are finalized and possibly shared with others. Some of these aspects are (implicitly) covered in later sections of this web page, but this chapter reflects on data quality in general, not limited to scientific research data.

“Data quality” may be considered just a buzzword or vague concept, which is arguably true to some extent: Some elements are subjective (particularly believability and reputation of data). Many aspects cannot be rigorously addressed (for own data) or assessed (for data by others), let alone quantified. Most of all, data quality is context-dependent due to the overarching criterion “fit for purpose” emphasizing the perspective of data (re-)users: “the same data elements or data sources may be deemed high quality for one use and poor quality for a different use” (Kahn et al. 2015 – in a health care context, but applicable in general).

Data quality refers to the data themselves, as well as to how they are documented (metadata) and presented, at the level of individual data elements (e.g. tables) and of information systems. Emphasis is on data re-use – also for purposes not initially considered and in different fields. However, not all re-use cases can be anticipated, and tacit contextual knowledge on how data were generated cannot always be conveyed to data re-users from different (scientific or non-scientific) fields.

Besides the lack of alignment between the intention behind data creation and subsequent usage, challenges stem from the sheer volume of data and their increasing diversity (structured, unstructured, semi-structured, multimedia – video, maps, images), including data from new technologies such as sensor data (Sadiq 2013).

From a scientist perspective

The emphasis of data quality on a user perspective has two implications for scientific researchers:

  • reproducibility and in some cases replicability of their own data, as well as re-use – also in other fields and beyond science
  • re-using data generated by others, including data that weren’t generated for scientific purposes

Data re-use

Reproducibility, replicability, robustness and generalizability are usually assessed by peers from the same scientific field – or by the initial study team, or jointly, involving consultation and collaboration. Data re-use may involve different scientific fields, or occur beyond science. Such re-use cases cannot always be anticipated.

Re-use of scientific data in different fields and transfer into economy and society requires new quality criteria (RfII 2019). Data should be understandable and re-usable without contacting data providers, ideally also machine-readable. For research data, contextual knowledge is required to understand the data generation process and thus assess their potential (but also limitations) for re-use. Documentation and metadata about data generation and data processing are thus relevant – “A data set that does not provide access to relevant metadata may result in the data being unintelligible, misinterpreted, or unintentionally misused” (Price & Shanks 2016).

However, it may not be feasible to document all details – “It is simply impossible for researchers to document every decision they made on data as well as tacit knowledge they used in data creation” (Yoon 2016 – in a social science context but applicable to other fields). If critical information seems to be missing or unclear, data consumers should contact data providers.
Researchers may obviously also take the role of data consumers, re-using scientific data generated by peers. When data from several fields are combined, domain knowledge for each field is beneficial. Statistical expertise is also often required.

Re-using other data

A huge amount of (big) data is available, but the quality (information content) of a dataset isn’t directly related to its size. The challenge can be to find and select data (of sufficient quality) to answer a given research question – or conversely to find an appropriate research question for a given (combination of) dataset(s).

Keller et al. (2017) distinguish between designed data, administrative data and opportunity data.

  • Designed data, traditionally used in scientific discovery, include results of experiments or surveys and intentional observations.
  • Administrative data, generated by government agencies but also e.g. companies, might fill gaps resulting from globally decreasing response rates to surveys. However, they do not assure complete coverage and thus representativeness either, and related documentation tends to be limited or absent.
  • Opportunity data are defined as “generated on an ongoing basis as society moves through its daily pace”. Social media users may not even be aware that they generate data, or do not intend to – let alone consider that such data might be used (combined with data from other users or other sources) for scientific purposes. Researchers don’t have any control over how these data are generated.
    Groves (2011) suggests the term “organic data” – a “now-natural feature of this ecosystem” (with society considered an ecosystem): “What has changed in the current era is that the volume of organic data produced as auxiliary to the Internet and other systems now swamps the volume of designed data.”

Fields of data quality assessments

Initial concepts of data quality stem mainly from a business or management (theory) context, adapted from criteria for product quality as businesses recognized and treated data and information as a product. Here, the consequences and costs of poor data quality are often obvious, e.g. missed business opportunities, decreasing customer (and employee) satisfaction, poor decision-making, but also compliance risks if data do not meet the standards of regulatory authorities. Such concepts aren’t directly applicable to a scientific context but, as Price & Shanks (2016) state, might provide useful guidelines or starting points for “specialized applications”, including scientific data. Data generated by official statistical agencies also follow standardized quality criteria.

Focus on standardized data quality in science has so far been mainly limited to medicine, engineering, major research facilities (e.g. CERN) and specific data types such as satellite and protein data (RfII 2019). Medicine also needs to demonstrate compliance with regulations, and poor data can affect the life and well-being of patients. There are also recent initiatives for biodiversity data (Chapman et al. 2020).

Generally, attention to data quality in research derives from the intrinsic motivation of researchers – science as a profession and mission [in German: Beruf und Berufung] – with some external guidance and steering (RfII 2019). Keller et al. (2017) state that three factors motivate most scientists’ work on data quality:

  • the need to create a strong foundation of data from which to draw their own conclusions
  • the need to protect their data and conclusions from criticism by others
  • the need to understand potential flaws in data collected by others

User perspective

Merino et al. (2016) rephrase “fit for purpose” as “adequacy of data to the purposes of analysis”. Two other quotes emphasize different data quality aspects from a user perspective:

“Data quality is the probability of data to be used effectively, economically and rapidly to inform and evaluate decisions.” (Karr et al. 2006)
Even if data contain all the information needed for a given purpose, they may have limited or no potential for re-use if that information is presented in a poorly accessible fashion. “Probability” also implies that data which are – rightly or wrongly – perceived to be of bad quality are unlikely to be (re-)used.

The U.S. National Academy of Medicine defines quality data as “data strong enough to support conclusions and interpretations equivalent to those derived from error-free data” (Davis et al. 1999, quoted in Zozus, Kahn, Weiskopf 2019).
This acknowledges that data are – in medical and other contexts – rarely (if ever) “perfect”, but need to be “good enough”. Seemingly perfect data may actually be considered suspicious or “too good to be true”, i.e. possibly manipulated/falsified or entirely fabricated.

Data quality categories and dimensions

Data quality is a multidimensional concept; accuracy, as the most obvious criterion, is just one aspect. Following RfII (2019), we focus on “Total Data Quality Management” (TDQM) – developed at the Massachusetts Institute of Technology (MIT) in the 1990s (e.g. Wang & Strong 1996, Wang 1998) and still widely quoted. For other schemes, see the review article by Cichy & Rass (2019) and the brief discussion below.

TDQM names four categories of data or information quality (the two terms aren’t exactly synonymous, but often used interchangeably):

Intrinsic data quality refers to data having quality in their own right, independent of specific (re-)use cases.
Intrinsic DQ dimensions: accuracy, objectivity, believability, reputation.

Contextual data quality focuses on a given use case or “task at hand” – data may have high intrinsic quality but variable, sometimes poor contextual quality.
Contextual DQ dimensions: relevancy, value-added, timeliness, completeness, appropriate amount of data (all for the task at hand).

Representational data quality refers to how data are presented within systems.
Representational DQ dimensions: interpretability, ease of understanding, concise representation, consistent representation, ease of manipulation.

Accessibility data quality also refers to the role of systems – data that are (close to) perfect in the other categories are of little to no value if they cannot be (easily) accessed.
Accessibility DQ dimensions: access, access security.

Other frameworks sometimes use different terms for corresponding or at least similar dimensions, e.g. accuracy is also referred to as (largely) free-of-error, and timeliness as currency. Batini and coworkers (Batini et al. 2015, Batini & Scannapieco 2016) include timeliness within accuracy (temporal/time-related accuracy).

In a science context (limitations of current data management plans), Williams et al. (2017) mention traceability: the ability to reproduce raw data from analysis results and vice versa – documenting all data processing steps and attributing them to individuals or machines. This is needed for the reproducibility of results, and also matters in a business context for auditing purposes. Obviously, raw data cannot be reconstructed from final data if processing includes steps such as data aggregation or averaging.
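As a rough sketch of what such step-level documentation might look like in code (the log format, field names and example steps are illustrative assumptions, not a scheme from Williams et al. 2017):

```python
import hashlib
import json
from datetime import datetime, timezone

provenance = []  # ordered log of processing steps

def log_step(description, agent, data_bytes):
    """Record one processing step; the hash ties the entry to the data state."""
    provenance.append({
        "step": len(provenance) + 1,
        "description": description,
        "agent": agent,  # individual or machine responsible for the step
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    })

raw = b"3.2,3.4,3.1,9.9\n"  # invented sensor readings
log_step("ingest raw sensor file", "sensor-gateway-01", raw)

# A lossy step (outlier removal): the log preserves what was done and by whom,
# even though the raw values can no longer be reconstructed from the output.
cleaned = b"3.2,3.4,3.1\n"
log_step("remove outlier (> 3 SD)", "jane.doe", cleaned)

print(json.dumps(provenance, indent=2))
```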

Practical aspects of individual DQ dimensions

Intrinsic data quality: Accuracy

Accuracy is context-independent and, unlike most other DQ criteria, can be quantified. Still, the required level of accuracy (“good enough”) can differ between contexts:
“It is possible for an incorrect character in a text string to be tolerable in one circumstance but not in another.” (Pipino, Lee, Wang 2002). Likewise, for numerical data a given accuracy (e.g. ± 1%) may suffice for some, but not for all, use cases.
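A minimal sketch of this context dependence for numerical data – the reference values, measurements and tolerances are invented for illustration:

```python
# Share of measured values within a relative tolerance of reference values.
reference = [100.0, 250.0, 80.0]
measured = [101.5, 248.0, 80.3]  # invented measurements

def share_within_tolerance(ref, meas, rel_tol):
    """Fraction of values whose relative error does not exceed rel_tol."""
    ok = sum(abs(m - r) / abs(r) <= rel_tol for r, m in zip(ref, meas))
    return ok / len(ref)

# The same data pass one accuracy requirement and fail a stricter one:
print(share_within_tolerance(reference, measured, 0.02))  # ±2%: 1.0 (all pass)
print(share_within_tolerance(reference, measured, 0.01))  # ±1%: ~0.67 (one fails)
```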

Intrinsic data quality: Objectivity

This refers to data being free of bias – or at least the bias being limited and known, with its implications understood. Olteanu et al. (2019) define bias as “systematic distortion in the sampled data that compromises its representativeness”. Bias arises if coverage (e.g. of a population or a geographic area) is incomplete and data are not missing at random. Such bias may be inherent and unavoidable, as distinct from intentionally biased data (implicitly addressed by the believability and reputation dimensions).

Examples

Social media data aren’t representative of the general population (Olteanu et al. 2019, Hargittai 2020), with differences also between platforms. The likelihood that anyone uses a given platform depends on factors such as age, gender, race/ethnicity and – obviously – Internet literacy. Behavior on social media may differ from behavior in other situations. Moreover, the degree of activity per user varies, not all users can be identified, and content created by individual users is selective (Keller et al. 2017). Accordingly, findings derived from social media data cannot be (easily/directly) generalized to other contexts.
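A simple sketch of how such coverage bias can be made visible – comparing group shares in a sample against known population benchmarks (all numbers invented):

```python
# Compare group shares in a dataset with known population benchmarks.
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}  # e.g. census
sample_share = {"18-29": 0.48, "30-49": 0.38, "50+": 0.14}      # e.g. platform users

for group, pop in population_share.items():
    ratio = sample_share[group] / pop
    flag = "over-covered" if ratio > 1.2 else "under-covered" if ratio < 0.8 else "ok"
    print(f"{group}: sample/population = {ratio:.2f} ({flag})")

# Note: such checks only detect bias along measured variables; self-selection
# on unmeasured characteristics remains invisible to them.
```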

Big Data can contribute to official statistics, as discussed by – among others – Statistics Netherlands and Eurostat. Buelens et al. (2014) point out that Big Data often aren’t representative of a population of interest. Data generation processes vary widely, but are very different from probability sampling (as done by national statistical agencies). Big data are often a by-product of a process not primarily aimed at data collection. Records of events aren’t necessarily directly associated with statistical units such as households, persons or enterprises. Big data are also affected by self-selection: individuals decide whether or not to use the technologies where big data are captured (Beręsewicz et al. 2018).
Braaksma & Zeelenberg (2020) additionally mention “uncontrolled changes in sources that threaten continuity and comparability” – “big data may be highly volatile and selective: the coverage of the population to which they refer may change from day to day, leading to inexplicable jumps in time-series.” Yet they suggest that we “accept the big data just for what they are: an imperfect, yet very timely, indicator of developments in society” – “these data exist and that’s why they are interesting”.

Biodiversity data may display geographic bias: variable coverage between countries, more coverage along roads or rivers due to easy access (Chapman et al. 2020). This has implications for modeling of species’ geographic distribution.

Believability and reputation

These are subjective criteria, reflecting user trust in data sources, data generators and their motivations for generating and sharing data. Data from official government agencies may (with differences between countries) be considered more reliable than e.g. industry data generated and distributed for PR/marketing purposes, or data provided by political parties.

Example

“A used car salesperson may have a good quality car worth thousands of dollars, but because of the poor credibility of used car salespeople, no one will pay what the car is worth. The consumer just doesn’t believe it.” “A very good used car is almost never sold at its true value.” (Fisher et al. 2011)

Contextual data quality

Relevancy and value-added

These refer to data being useful for the task at hand, providing information that wasn’t already known or available from pre-existing sources. Added value can also arise from combining several datasets.

Timeliness

Are data (sufficiently) up-to-date for the task at hand? The frequency of updates can also matter; both needs are related to data volatility, i.e. how often data change over time. For example, weather data are obviously far more volatile than address data – but the latter can also change over time. Height changes for children, but not (or less so) for adults.

Timeliness is affected by the interval between data collection and processing and data release. If trend analysis will be performed, data frequency needs to be appropriate (Merino et al. 2016). Too frequent updates can also be inconvenient – for manufacturing planning and control processes, new information should be delivered neither too often nor too infrequently (Gustavsson & Wänström 2009). For biodiversity data, yearly resolution may be adequate in some use cases, while other use cases (obviously those studying seasonal patterns) need higher resolution (Chapman et al. 2020).

“Accurate information on relevant topics won’t be useful to clients if it arrives after they have to make their decisions.” (Brackstone 1999, on data quality in statistical agencies)
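A small sketch of timeliness as record age checked against volatility-dependent requirements (the datasets and thresholds are invented):

```python
from datetime import datetime, timezone

# Last update per dataset and the maximum acceptable age for a given task.
last_updated = {
    "weather_obs": datetime(2024, 5, 1, 11, 50, tzinfo=timezone.utc),
    "addresses": datetime(2022, 9, 1, tzinfo=timezone.utc),
}
max_age_days = {"weather_obs": 0.01, "addresses": 365}  # ~15 min vs. one year

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
for name, updated in last_updated.items():
    age_days = (now - updated).total_seconds() / 86400
    status = "timely" if age_days <= max_age_days[name] else "stale"
    print(f"{name}: age {age_days:.3f} days -> {status}")
```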

Completeness

Completeness concerns both individual parameters (no missing values) and the amount of information in the entire dataset – “sufficient breadth, depth and scope for the task at hand”, “representing all and only the relevant aspects of the reality of interest”, containing all types of desired or needed data (Batini & Scannapieco 2016, Pipino, Lee, Wang 2002).

The reasons behind a missing value differ: an e-mail address, for example, may not exist, may exist but be unknown, or it may even be unknown whether one exists (Batini & Scannapieco 2016). Approaches to dealing with missing values often assume that data are missing at random, but this may not always be the case – for example, in a clinical context “the reason the data are missing is often associated with the health status of the participant” (Friedman et al. 2015).
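A per-column completeness count is easy to sketch (toy table, pandas assumed), though it cannot by itself reveal why values are missing:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [34, 51, None, 29],
    "email": ["a@x.org", None, None, "d@x.org"],  # nonexistent, or just unknown?
})

# Share of non-missing values per column.
completeness = 1 - df.isna().mean()
print(completeness)

# Caution: such counts implicitly treat values as missing at random; if
# missingness correlates with, e.g., health status, the complete records
# form a biased subset.
```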

Duplicates can also occur, both within a single database and when data from several sources are combined. Information from duplicate records can be conflicting (which record or source is more reliable?), identical and thus redundant, or complementary.
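A sketch of both situations with two toy sources (the column names, values and merge logic are illustrative):

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2, 2], "city": ["Bonn", "Köln", "Köln"]})
b = pd.DataFrame({"id": [2, 3], "city": ["Cologne", "Mainz"]})

# Exact duplicates within a single source: redundant, can simply be dropped.
print(a[a.duplicated(keep=False)])

# Combining sources: the same id may carry conflicting values -- deciding
# which source is more reliable is a judgment the code cannot make.
merged = a.drop_duplicates().merge(b, on="id", how="outer", suffixes=("_a", "_b"))
conflicts = merged.dropna()
conflicts = conflicts[conflicts["city_a"] != conflicts["city_b"]]
print(conflicts)
```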

Appropriate amount of data

This emphasizes that there can also be too much data, leading to information overload. Data should be usable for a given purpose without much filtering (Gustavsson & Wänström 2009).

Representational data quality

Interpretability and ease of understanding

This includes appropriate units and clear definitions (Pipino, Lee, Wang 2002), as well as documentation and metadata. Data content, collection process (including data processing steps and methods), ownership and reliability need to be documented clearly, unambiguously and in a form that is conveniently accessed by users. This impacts data quality via data usability (Karr et al. 2006).

Concise and consistent representation

This refers to how the data themselves are organized and presented. Information should be directly (re-)usable, without the need for prior reworking in terms of format, content and/or structure (Gustavsson & Wänström 2009). For example, importing data into statistical software should be straightforward, with different elements (files) of the same dataset already in the same format. However, if data from several sources are combined, data transformations and standardizations may be required.
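A minimal sketch of such a harmonization step – two invented sources encoding the same information with different date formats and decimal separators (pandas assumed):

```python
import pandas as pd

src1 = pd.DataFrame({"date": ["2024-05-01"], "temp_c": ["21.5"]})
src2 = pd.DataFrame({"date": ["01.05.2024"], "temp_c": ["21,7"]})  # German notation

# Harmonize decimal separator, date format and dtypes before combining.
src2["temp_c"] = src2["temp_c"].str.replace(",", ".", regex=False)
for df, fmt in ((src1, "%Y-%m-%d"), (src2, "%d.%m.%Y")):
    df["date"] = pd.to_datetime(df["date"], format=fmt)
    df["temp_c"] = df["temp_c"].astype(float)

combined = pd.concat([src1, src2], ignore_index=True)
print(combined)  # one consistent representation, ready for analysis
```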

Ease of manipulation

This can refer to the feasibility of such transformations, as well as of any other further data treatments – sorting, filtering, statistical analysis, etc.

Accessibility data quality

Access refers to data being available (for free or at an acceptable cost) easily and quickly, from the users’ locations and operating environments and whenever needed.
Access security comprises several aspects. It may refer to the long-term availability of data – in a scientific context, beyond the lifetime of scientific projects (RfII 2019). It may also refer to access being appropriately restricted, e.g. for sensitive personal data, with safeguards against unauthorized access. It includes data being protected against unauthorized modifications (data integrity). In an industry (but possibly also a scientific) context, it can be relevant that competitors do not have access to data.

Trade-offs between DQ dimensions

The most common trade-off is between timeliness and any of accuracy/completeness/consistency – “having accurate, complete or consistent data may require time and thus timeliness can be negatively affected” (Zaveri et al. 2016, see also Batini & Scannapieco 2016).

Example

Environmental monitoring data may be first shared in (near) real time, with a later version subjected to further data processing and quality control (of values).

Also for other data types, the preferred/best version may depend on requirements of a given use case.

Completeness may also lead to reduced consistency (Zaveri et al. 2016).

Implications of data quality for decision-making

Responses to the COVID-19 pandemic are based on data from a variety of sources, and scientific data from various disciplines are used in decision-making beyond science. A full discussion of an ongoing process is obviously beyond the scope of this page. Aspects include the timeliness of information, the comparability of data between countries, and which data governments use for decision-making. The advantages and disadvantages of early information sharing – sometimes before full analysis, often before independent verification via peer review – illustrate the trade-offs between timeliness and other DQ dimensions.

Even without delays in reporting, data on confirmed infections and deaths from COVID-19 do not reflect contemporary infection scenarios. Thus they are inherently not as timely and fit for purpose – assessing the current dynamics of the pandemic, including the effects of restrictions and of their lifting – as ideally needed. Data from models can provide complementary insights, but all models have their limitations, being based on assumptions and incomplete representations of reality. With decision-making being urgent, it has to rely on the “best” (most relevant, most complete) available data, recognizing their limitations. Some decisions, particularly early ones, may in hindsight be considered questionable, but were based on the data available at that stage of the pandemic.

Fisher & Kingma (2001) ascribe the explosion of the space shuttle Challenger in 1986 to poor data quality – or rather, mainly to poor usage of available information and flawed decision-making due to communication problems between engineers and managers. Besides the data quality variables discussed above – accuracy, timeliness, consistency, completeness, relevancy and (overarching) fitness for use – these authors introduce “moderator variables” influencing the way decision-makers use information: information overload (“too much information and too little time to respond”), the experience level of decision-makers, and time constraints.

The explosion was caused by O-rings failing at cold temperatures. This potential problem had been known for years, with special investigations six months before the accident; however, it was not considered in decision-making. Information overload is partly ascribed to NASA staff reductions; information wasn’t passed on from middle to upper management. Engineers used charts in a format familiar to them, but not well understood by decision-makers. “While the Thiokol engineers believed that the shuttle was not safe, they had difficulty articulating that belief.”

Concluding remarks

This general discussion – with some references to specific fields – obviously cannot provide guidance or “rules” for specific cases of data sharing or data (re-)use. We hope that it can raise awareness of aspects of data quality among data providers and data users – scientists can have both roles. Own data should be presented and documented so as to facilitate re-use also beyond science – but not all cases of data re-use can be anticipated, and the primary scientific research question obviously remains paramount. For data re-use, the “best” available data should be selected, while recognizing their limitations.

References

[key references in bold]

Batini, C., & Scannapieco, M. (2016). Data and information quality. Cham, Switzerland: Springer International Publishing.
[textbook on data quality, including discussion of various specific data types]

Batini, C., Rula, A., Scannapieco, M., & Viscusi, G. (2016). From data quality to big data quality. In Big Data: Concepts, Methodologies, Tools, and Applications (pp. 1934-1956). IGI Global.

Beręsewicz, M., Lehtonen, R., Reis, F., Di Consiglio, L., & Karlberg, M. (2018). An overview of methods for treating selectivity in big data sources. Eurostat Statistical Working Paper. doi:10.2785/312232.

Braaksma, B., & Zeelenberg, K. (2020). Big data in official statistics. Statistics Netherlands Discussion Paper January.

Brackstone, G. (1999). Managing data quality in a statistical agency. Survey Methodology, 25(2), 139-150.

Buelens, B., Daas, P., Burger, J., Puts, M., & van den Brakel, J. (2014). Selectivity of Big data. The Hague: Statistics Netherlands.

Chapman, A. D., Belbin, L., Zermoglio, P. F., Wieczorek, J., Morris, P. J., Nicholls, M., Rees, E. R., Veiga, A. K., Thompson, A., Saraiva, A. M., James, S. A., Gendreau, C., Benson, A., & Schigel, D. (2020). Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data. Biodiversity Information Science and Standards, 4: e50889. doi:10.3897/biss.4.50889.

Cichy, C., & Rass, S. (2019). An overview of data quality frameworks. IEEE Access, 7, 24634-24648. doi:10.1109/ACCESS.2019.2899751.
[review article from a business perspective]

Clemens, M. A. (2017). The meaning of failed replications: A review and proposal. Journal of Economic Surveys, 31(1), 326-342. doi:10.1111/joes.12139.

Davis, J. R., Nolan, V.P., Woodcock, J., & Estabrook, E. W., editors (1999). Assuring data quality and validity in clinical trials for regulatory decision making, Institute of Medicine Workshop report. Roundtable on research and development of drugs, biologics, and medical devices. Washington, DC: National Academy Press. http://books.nap.edu/openbook.php?record_id=9623&page=R1.

Fisher, C., Lauria, E., Chengalur-Smith, S., & Wang, R. (2011). Introduction to Information Quality. AuthorHouse, Bloomington.

Fisher, C. W., & Kingma, B. R. (2001). Criticality of data quality as exemplified in two disasters. Information & Management, 39(2), 109-116. doi:10.1016/S0378-7206(01)00083-0.

Freese, J., & Peterson, D. (2017). Replication in social science. Annual Review of Sociology, 43, 147-165. doi:10.1146/annurev-soc-060116-053450.

Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M., & Granger, C. B. (2015). Fundamentals of clinical trials. Springer.

Goodman, S. N., Fanelli, D., & Ioannidis, J. P. (2016). What does research reproducibility mean? Science translational medicine, 8(341), 341ps12-341ps12. doi:10.1126/scitranslmed.aaf5027.

Groves, R. M. (2011). Three eras of survey research. Public opinion quarterly, 75(5), 861-871. doi:10.1093/poq/nfr057.

Gustavsson, M., & Wänström, C. (2009). Assessing information quality in manufacturing planning and control processes. International Journal of Quality & Reliability Management 26(4), 325-340. doi:10.1108/02656710910950333.

Hargittai, E. (2020). Potential biases in big data: Omitted voices on social media. Social Science Computer Review, 38(1), 10-24. doi:10.1177/0894439318788322.

Kahn, M. G., Brown, J. S., Chun, A. T., Davidson, B. N., Meeker, D., Ryan, P. B., Schilling, L. M., Weiskopf, N. G., Williams, A. E., & Zozus, M. N. (2015). Transparent reporting of data quality in distributed data networks. EGEMS (Washington, DC), 3(1), 1052. doi:10.13063/2327-9214.1052.

Karr, A. F., Sanil, A. P., & Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137-173. doi:10.1016/j.stamet.2005.08.005.

Keller, S., Korkmaz, G., Orr, M., Schroeder, A., & Shipp, S. (2017). The evolution of data quality: Understanding the transdisciplinary origins of data quality concepts and approaches. Annual Review of Statistics and Its Application, 4, 85-108. doi:10.1146/annurev-statistics-060116-054114.
[review article with chapters on physical and biological sciences, engineering/information technology/business, medicine and public health, social and behavioral sciences, statistical sciences and “opportunity data”]

Merino, J., Caballero, I., Rivas, B., Serrano, M., & Piattini, M. (2016). A data quality in use model for big data. Future Generation Computer Systems, 63, 123-130. doi:10.1016/j.future.2015.11.024.

National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press.
[recent textbook with definitions of terms and detailed coverage]

Olteanu, A., Castillo, C., Diaz, F., & Kiciman, E. (2019). Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2, 13. doi:10.3389/fdata.2019.00013.

Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-218. doi:10.1145/505248.506010.

Price, R., & Shanks G. (2016) A Semiotic Information Quality Framework: Development and Comparative Analysis. In: Willcocks L.P., Sauer C., Lacity M.C. (eds) Enacting Research Methods in Information Systems. Palgrave Macmillan, Cham. doi:10.1007/978-3-319-29272-4_7.

Király, P., & Brase, J. (2021). Qualitätsmanagement. In M. Putnings, H. Neuroth, & J. Neumann (Eds.), Praxishandbuch Forschungsdatenmanagement (pp. 357–380). De Gruyter Saur. doi:10.1515/9783110657807-020.

RfII – Rat für Informationsinfrastrukturen: Herausforderung Datenqualität – Empfehlungen zur Zukunftsfähigkeit von Forschung im digitalen Wandel, zweite Auflage, Göttingen 2019, 172 S. http://www.rfii.de/?p=4043 (English version RfII – The Data Quality Challenge: http://www.rfii.de/?p=4203)
[discussion paper partly motivating creation of this website]

Sadiq, S. (Ed.) (2013). Handbook of Data Quality. Berlin, Heidelberg: Springer.

Wang, R. Y. (1998). A product perspective on total data quality management. Communications of the ACM, 41(2), 58-65. doi:10.1145/269012.269022.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-33. doi:10.1080/07421222.1996.11518099.
[key references on Total Data Quality Management (MIT)]

Williams, M., Bagwell, J., & Zozus, M. N. (2017). Data management plans: the missing perspective. Journal of Biomedical Informatics, 71, 130-142. doi:10.1016/j.jbi.2017.05.004.

Yoon, A. (2016). Red flags in data: Learning from failed data reuse experiences. Proceedings of the Association for Information Science and Technology, 53(1). doi:10.1002/pra2.2016.14505301126.

Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., & Auer, S. (2016). Quality assessment for linked data: A survey. Semantic Web, 7(1), 63-93. doi:10.3233/SW-150175.

Zozus, M. N., Kahn, M. G., & Weiskopf, N. G. (2019). Data quality in clinical research. In Clinical research informatics (pp. 213-248). Springer, Cham.