Ensure your datasets are of high quality and fit for purpose

October 5, 2021

Well begun is half done, or, in the case of developing AI systems, quality data is half done.

Good-quality, relevant data is of paramount importance to the performance of AI systems. Your AI system is only as good as the data it is trained on. Data may contain errors, inaccuracies and biases, so it is important to ensure that the data used to train your AI systems is of good quality.

In addition to the quality of the data, its integrity must also be ensured: data must be accurate and consistent. It is therefore important that your datasets contain data that is relevant to the purpose of the AI system and that the data is uniform, e.g. that the same kind of information is recorded in the same way throughout. Using poor-quality, inconsistent or irrelevant data may lead to distorted outcomes, eroding the credibility and trustworthiness of the AI systems and potentially causing losses and other problems at the organizational level.

Thus, the quality, integrity and fitness for purpose of the datasets are important for the correct and efficient functioning of the AI system as well as for its trustworthiness.

Therefore, to ensure that the datasets used are fit for purpose and of good quality and integrity, the processes and datasets used in AI systems must be documented and tested at every step along the way, including during AI system planning, training, testing and deployment. Furthermore, these requirements apply not only to systems designed and developed within the organization, but also to systems planned and developed by third parties.[1]

To ensure that your datasets are of high quality, organizations are advised to ask the following questions:[2] [3]

  • Is your AI system developed or trained by processing or otherwise using personal data?
  • To what extent are you in control of the quality of the external data sources used by your AI systems? How do you ensure the quality of those sources?
  • Who are the parties responsible for data collection, maintenance, and data dissemination?
  • Do you have processes in place that ensure the quality and integrity of your data? What are these processes, and did you consider alternatives?
  • Assess the quality and integrity of the datasets used: Where was the data obtained from? What information is included in the data? Is the included information appropriate for the purpose of the AI system? Which groups are covered in the data, and which groups are underrepresented? Is information missing from the datasets, or are some units only partially covered? What is the geographic coverage and timeframe of the data collection used to develop the AI system?
  • How do you verify that your datasets have not been hacked or otherwise compromised?
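
Some of the questions above can be partly checked in code. The following is a minimal sketch in Python, not a prescribed method: it assumes the dataset is available as raw bytes and as a list of row dictionaries, and all function names and the `"<missing>"` label are hypothetical, chosen for illustration. It compares a SHA-256 digest against one recorded when the data was collected, reports the share of missing values per field, and reports how large a share of the rows each group accounts for.

```python
import hashlib
from collections import Counter


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a dataset's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_digest(data: bytes, recorded_digest: str) -> bool:
    """True if the data still hashes to the digest recorded earlier."""
    return sha256_digest(data) == recorded_digest.lower()


def _is_blank(value) -> bool:
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def missing_share(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """For each field, the fraction of rows where the value is missing."""
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if _is_blank(row.get(field))) / len(rows)
        for field in fields
    }


def group_shares(rows: list[dict], field: str) -> dict[str, float]:
    """Fraction of rows per group in `field`; blanks are counted as '<missing>'."""
    if not rows:
        return {}
    counts = Counter(
        "<missing>" if _is_blank(row.get(field)) else str(row[field]).strip()
        for row in rows
    )
    return {group: count / len(rows) for group, count in counts.items()}
```

A digest comparison only shows that the data has not changed since the digest was recorded; it says nothing about whether the data was sound at that point, so the recorded digest needs to be stored separately from the dataset itself. Likewise, a small group share flags possible underrepresentation but does not decide what share would be adequate for the purpose of the AI system.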

In conclusion, the quality and integrity of data create the foundation for the trustworthiness of AI systems. As the basis for trustworthy AI is laid when the system is developed, it is important to ensure the availability and use of good-quality data. Quality and integrity of the data not only promote the credibility and trustworthiness of AI systems in the eyes of stakeholders and the wider public, but also help the algorithms work more efficiently and require fewer resources to function properly. Furthermore, datasets as well as all related governance measures should be documented. Documentation ensures the transparency and traceability of AI systems and makes it possible to understand the decision-making process behind the code.

[1] High-Level Expert Group on AI by European Commission: Ethics Guidelines for Trustworthy AI.

[2] High-Level Expert Group on AI by European Commission: Assessment List for Trustworthy Artificial Intelligence (ALTAI).

[3] European Union Agency for Fundamental Rights: Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights.
