Spotlight: Data Quality – Dimension 3, Accuracy
An Interview Series with Clarity AI Executive Team on the 8 Dimensions of Data Quality
How does Clarity AI ensure its data is of the highest quality?
Clarity AI uses an 8-dimension framework to ensure data is of the highest quality. Those dimensions are coverage, freshness / timeliness, accuracy, data updates, explainability, consistency, point-in-time, and feedback. In this series of interviews with Clarity AI executives, each of these dimensions is explored and explained. Clarity AI’s expert team creates scientific- and evidence-based methodologies that then leverage powerful, scalable artificial intelligence (e.g., machine learning) to collect, clean, analyze and expand existing data sets to power its sustainability tech platform or to integrate directly into users’ existing workflows.
Dimension 3 – Accuracy
Clarity AI’s VP of Product, Ángel Agudo; Head of Product Research & Innovation, Patricia Pina; Head of Data Strategy, Juan Diego Martín; and Head of Data Science, Ron Potok, discuss – with Chris Ciompi, Clarity AI’s Chief Marketing Officer – the critical dimension of accuracy and its relationship to data quality. The group discusses how a data collection system can leverage algorithms to improve accuracy: algorithms extract and check the data, then send real-time alerts to data collectors if anything seems off, helping ensure accuracy from the start. The group also touches on the importance of explainability in building trust and in distinguishing correct data from incorrect data. Overall, the group emphasizes the need for a data collection system that is efficient, accurate, and transparent in order to build trust with clients and ensure high-quality data and, hence, high-quality results.
Chris Ciompi: Hi, everyone, and thanks for coming to the table again to chat through another dimension of data quality. Let’s talk about accuracy. So, on to Ángel again. Please define accuracy as it relates to data quality.
Ángel Agudo: We use technologies such as Natural Language Processing (NLP) to efficiently collect data from reports. That data is then run through a suite of algorithms in real time that compare each data point with other data points from the same company, across time, and with other companies in the industry to identify potential errors. The algorithms are trained with input from sustainability experts, who challenge each data point with strong theoretical support. Depending on the result, the data point may be accepted as correct, or a human may need to re-collect it and evaluate the issue. In some cases, the reported data point is provided but complemented with an adjusted value, to give a better picture of the reality of the company. All of this ensures that, from an accuracy standpoint, Clarity AI provides the highest quality data in the market.
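The kind of automated check Ángel describes, comparing each extracted value against the company's own history and against industry peers, can be sketched roughly as below. Everything here (the function name, the z-score rule, the threshold) is an illustrative assumption, not Clarity AI's actual implementation:

```python
from statistics import mean, stdev

def flag_for_review(value, company_history, industry_values, z_threshold=3.0):
    """Return reasons to route a freshly extracted data point to a human
    for re-collection; an empty list means it passes the automated checks.
    Illustrative sketch only: a simple z-score test against two contexts."""
    reasons = []
    # Check 1: is the value consistent with the company's own history?
    if len(company_history) >= 2:
        mu, sigma = mean(company_history), stdev(company_history)
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            reasons.append("inconsistent with company history")
    # Check 2: is the value plausible versus peers in the same industry?
    if len(industry_values) >= 2:
        mu, sigma = mean(industry_values), stdev(industry_values)
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            reasons.append("outlier versus industry peers")
    return reasons
```

A value in line with past reporting and industry norms returns no reasons, while an implausible one is flagged on both counts and would trigger the real-time alert to the data collector.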
Chris Ciompi: Thank you. Patricia, why is accuracy important for consumers of sustainability data?
Patricia Pina: Sustainability data is used to make decisions. If you have the wrong data, you will make the wrong decisions, so accuracy is critical. It is the basis, the building block of everything else. Just to illustrate this point: if we look at CO2 emissions data, which happens to be both the most reported metric and the most used in the industry, and we focus on reported data, which is the most stable and mature data in the market, we still see very different numbers floating around. Our research found discrepancies between the numbers offered by data providers for the same companies in 40% of cases. Addressing these discrepancies is important because they make a huge difference in the calculations and reports that market participants use to disclose the emissions of their financial products. They can inflate a carbon footprint by as much as 20%, and even more. And just to put that 20% in perspective: 7% is the annual decrease we need to achieve to meet the Paris Alignment. So, 20-30% are very significant numbers.
Chris Ciompi: Thank you. I’m going to push a little bit on the example, and on the Paris Alignment. When you say Paris Alignment, you mean the 2030 and the 2050 goals, right?
Patricia Pina: Yes, I’m referring to the decarbonization rate that we would need in order to hit the 2030 and the 2050 goals.
Chris Ciompi: Okay, excellent. Thank you. Juan Diego, how accurate is the data across the full span of Clarity AI’s coverage?
Juan Diego Martín: We work to have greater than 99% accuracy in our data, and to get there we employ a strategy we call “four levels of defense”. The first is very strict service-level agreements with everyone involved in the process. The second is technology that allows us to spot anomalies as soon as possible; we have four main assets for that capability: heuristics, competing approaches, accuracy checks using Natural Language Processing (NLP) techniques, and third-party validation. The third line of defense is validation at the master database level, which all our modules use, so everything that is going to be pushed into the platform goes through additional quality controls. The fourth is done at the module level, where dedicated teams for each of our products validate that the data is of the highest possible quality and ready to be delivered to the customer.
Chris Ciompi: Thank you, and I think Ron, there’s probably some fodder in there for you. How is the accuracy of the data at Clarity AI influenced by artificial intelligence?
Ron Potok: Following on what Patricia said, there are discrepancies among data providers in the marketplace for the same CO2 emissions data, meaning that two different providers might report different CO2 emissions for the same company. At Clarity AI, we take a statistical approach. We source data from multiple providers so that we can study it and use it to arrive at the most accurate sustainability data. As a statistician, you might want to average different opinions of sustainability together. But that’s not the approach here. We don’t believe that a company’s CO2 emissions in a given year are an opinion. We believe they are a fact, and there’s a right answer and a wrong answer. So instead, we have built AI technology that helps us determine whether each data point is accurate or not. The information we use to determine that accuracy is context, which we add to every data point. That context could be data previously reported by the company, or normal values within the industry. This ensures that each data point we deliver to the customer is reasonable within its context. There are multiple other ways we ensure quality throughout the process, but what’s special about Clarity AI is that we have access to many different providers and have built models that assign a confidence level to each data point, telling us how confident we are that the data point is correct, irrespective of where it comes from.
Chris Ciompi: On the models, can you explain a little bit how AI is working, powering those models to influence accuracy in a positive way?
Ron Potok: We have several different models. The one I’ll focus on is our reliability model. As I mentioned before, we have built a model that applies context to every data point, and that context comes from the data providers. There may be two or three different providers with different values for a data point, so we ask ourselves: What is the history of that data? Meaning, for example, your Scope 1 emissions as a company last year, two years ago, three years ago. And what is the context of the industry: given the industry you’re in, what are normal values for you to have? We feed all of that information as features into a machine learning model that outputs, for each data point, how likely it is that the data point is correct for a given company.
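As a rough illustration of the reliability model Ron describes, the sketch below turns contextual features (deviation from the company's reporting history, deviation from industry norms, and agreement among providers) into a probability that a candidate value is correct. The feature definitions, hand-set weights, and logistic form are all assumptions made for this sketch; a production model would learn its weights from labelled examples:

```python
import math

def reliability_score(candidate, history, industry_norm, provider_values,
                      weights=(-1.2, -0.8, 2.0), bias=1.0):
    """Illustrative logistic 'reliability model': map contextual features
    to a probability that a candidate data point is correct.
    Assumes non-empty history and provider lists; weights are hand-set."""
    # Feature 1: relative deviation from the company's own reporting history
    hist_mean = sum(history) / len(history)
    f_history = abs(candidate - hist_mean) / max(abs(hist_mean), 1e-9)
    # Feature 2: relative deviation from normal values in the industry
    f_industry = abs(candidate - industry_norm) / max(abs(industry_norm), 1e-9)
    # Feature 3: share of providers agreeing with the candidate (within 5%)
    agree = sum(1 for v in provider_values
                if abs(v - candidate) <= 0.05 * max(abs(candidate), 1e-9))
    f_agreement = agree / len(provider_values)
    # Logistic combination of the features into a confidence score
    z = (bias + weights[0] * f_history
              + weights[1] * f_industry
              + weights[2] * f_agreement)
    return 1.0 / (1.0 + math.exp(-z))
```

A value consistent with the company's history, industry norms, and the other providers scores near 1, while a large outlier that no provider corroborates scores near 0.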
Chris Ciompi: And how complicated would it be to do what you just described without AI?
Ron Potok: The value of AI or machine learning techniques, in general, is conditioning on many different aspects simultaneously. If you set up rules, as in a rule-based system, you end up with a lot of “if statements” that are independent of each other. A model, instead, understands the context of all of those decisions and the likelihood of success based on all of that information at the same time. It is certainly feasible to start with heuristic rules, but it becomes unattractive very quickly, and that’s why we build models: the complexity, and the interaction effects between features, quickly become intractable for humans to capture in hand-written rules.
Chris Ciompi: Perfect. Thank you, Ron. Patricia, how does data accuracy help drive product innovation at Clarity AI?
Patricia Pina: When I think about how accuracy helps us innovate, I think about different pieces. First of all, we want to make sure we have a quick feedback loop with our clients when it comes to accuracy. To do that, we have put in place channels and tools for clients to challenge any data point. Then we get back to them with a full explanation of the data. The other piece is how we can get more sophisticated and smarter with algorithms and checks. One way of doing that is by integrating these algorithms at the very beginning of the data flow to detect any potential issues in the accuracy very early in the process and in real-time, provide feedback to whoever is collecting that data, and adjust it to deliver the highest quality data to our clients with no delay.
Chris Ciompi: When you say “in real-time,” how does that influence innovation?
Patricia Pina: In our data collection system, both for data extraction and validation, we integrate algorithms. The person collecting the data will receive real-time alerts if any of the data seems incorrect based on what we know about the company, as well as other data we have collected in the past. We will do all these checks in real-time and provide feedback to the company collecting the data. If there are errors, they will be corrected at that moment to ensure accuracy from the beginning.
Chris Ciompi: So, this is one way to achieve the 99% plus accuracy that Juan Diego mentioned earlier?
Patricia Pina: Yes, exactly.
Chris Ciompi: Got it. So, this ties back to what Juan Diego said about aiming for 99%-plus accuracy. It’s one of the ways. Ángel, how does the level of data accuracy at Clarity AI influence the capabilities of the tech platform?
Ángel Agudo: Providing the right data and building trust with our clients is critical. Clients often compare different sources of data for the same purpose and might find differences. We need to show them how they can differentiate between what data is right and what is wrong. Explainability is key to building trust, so we need to communicate our data work and corrections in a way that builds that trust. Our real-time data collection and quality checks make us very efficient, and the platform should convey that information to build trust.
Chris Ciompi: Thanks, everyone! Thanks for the great discussion on this dimension of data quality – accuracy.