How accurate is Thematic at analyzing customer feedback?

Alyona Medelyan PhD

In discussions with potential customers, we often hear the question, "How accurate is Thematic at analyzing my open-ended feedback?"

Here is why they are asking this question:

“Manual analysis is the most accurate, but it doesn't scale. I tried different solutions. My CX platform produces generic results that aren’t useful and requires a lot of manual input. I know that AI today has advanced. ChatGPT can be really granular, but it doesn't know my business. Accuracy matters! Can your solution deliver it?"

In this article, we'll explain why measuring and reporting on the accuracy of theming is challenging. We'll also discuss how we think about measuring accuracy and how we ensure that Thematic delivers the best possible accuracy.

Thematic has many different features, each with its own potential accuracy score. Here, we'll concentrate on the accuracy of thematic analysis: discovering themes in open-ended feedback and applying them.

Three key challenges with getting 100% accuracy of feedback analysis

When it comes to analyzing feedback, what does “accurate” really mean?

I have published dozens of academic papers that evaluate the accuracy of AI systems, so I can geek out on this topic for hours. But in the context of customer feedback, these three key points matter the most:

1. 100% accuracy does not exist because analysis is subjective

In multiclass classification, accuracy is the fraction of classifications that are correct.
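
As a minimal illustration of that definition, the Python sketch below computes accuracy for a handful of comments. The function name `accuracy` and the labels are made up for this example and are not part of Thematic.

```python
def accuracy(predicted, correct):
    """Fraction of items whose predicted label matches the correct label."""
    if len(predicted) != len(correct):
        raise ValueError("Both label lists must have the same length")
    if not correct:
        raise ValueError("Need at least one label")
    matches = sum(p == c for p, c in zip(predicted, correct))
    return matches / len(correct)


# Hypothetical single-label classification of four comments.
correct_labels = ["billing", "price", "customer service", "billing"]
predicted_labels = ["billing", "price", "billing", "billing"]

score = accuracy(predicted_labels, correct_labels)
print(f"Accuracy: {score:.2f}")
```

The calculation only works if everyone agrees on `correct_labels`, which is exactly the assumption that breaks down in thematic analysis.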

In thematic analysis, it would be the fraction of correct themes. But which themes are "correct" is subjective. People often disagree on that. In fact, the same person analyzing feedback on a different day will create a different set of themes.

For this reason, academics who study thematic analysis do not measure "accuracy". Instead, they measure inter-rater reliability or inter-indexer consistency, with metrics such as Krippendorff's alpha or Rolling's consistency.
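
For readers who want to see what such a metric involves, here is a minimal Python sketch of Krippendorff's alpha for nominal labels. It assumes each rater assigns exactly one theme per comment, which is a simplification, since real thematic analysis often tags a comment with several themes. The function name and the sample labels (each letter stands for a theme) are illustrative only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: the labels that different raters gave
    to the same item. Items with fewer than two labels are ignored.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from two different raters counts,
        # weighted by 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # How often each label occurs among the pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("Need at least one item labelled by two raters")

    # Observed disagreement vs. disagreement expected by chance.
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = (n * n - sum(t * t for t in totals.values())) / (n * (n - 1))
    if expected == 0:
        raise ValueError("Alpha is undefined when only one label is used")
    return 1 - observed / expected


# Two hypothetical raters labelling the same twelve comments.
rater_1 = list("aabbdcccedda")
rater_2 = list("babbbcccEddd".lower())
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(rater_1, rater_2)])
print(f"Krippendorff's alpha: {alpha:.3f}")
```

On this toy data alpha comes out at roughly 0.69. A value of 1 means perfect agreement, and 0 means agreement no better than chance.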

Rolling's consistency is, in essence, a combination of precision and recall.
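
The sketch below shows that relationship in Python. It assumes the usual formulation of Rolling's consistency, 2C / (A + B), where A and B are the number of themes each person assigned and C is the number they have in common. The theme names and function names are hypothetical.

```python
def rolling_consistency(themes_a, themes_b):
    """Rolling's consistency between two sets of themes: 2C / (A + B)."""
    set_a, set_b = set(themes_a), set(themes_b)
    if not set_a and not set_b:
        # Assumption for this sketch: two empty sets count as full agreement.
        return 1.0
    common = len(set_a & set_b)
    return 2 * common / (len(set_a) + len(set_b))


def f1_score(themes_a, themes_b):
    """Harmonic mean of precision and recall, treating themes_b as reference."""
    set_a, set_b = set(themes_a), set(themes_b)
    common = len(set_a & set_b)
    if common == 0:
        return 0.0
    precision = common / len(set_a)
    recall = common / len(set_b)
    return 2 * precision * recall / (precision + recall)


# Themes two hypothetical analysts found in the same feedback.
analyst_a = {"billing date", "billing accuracy", "price"}
analyst_b = {"billing date", "price", "customer service", "app crashes"}

consistency = rolling_consistency(analyst_a, analyst_b)
f1 = f1_score(analyst_a, analyst_b)
print(f"Rolling's consistency: {consistency:.3f}, F1: {f1:.3f}")
```

Both functions return the same number for any pair of theme sets with at least one theme in common, which is why the measure can be read as a balance of precision and recall.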

Among experts, these metrics range between 40 and 70% depending on the task. These percentages seem low, but remember that there may be dozens or even hundreds of different categories.

Ideally, we would want to compare AI's theming to theming produced by different people who are experts at this task. The ideal outcome is for AI's consistency with these people to be the same as their average consistency with each other.
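
One way to set up that comparison is sketched below in Python. It is an assumption-laden toy, not the procedure from any particular study: the coders, comments and themes are invented, consistency is averaged per comment, and Rolling's consistency (2C / (A + B)) is used as the agreement measure.

```python
from itertools import combinations
from statistics import mean


def rolling_consistency(themes_a, themes_b):
    """Rolling's consistency between two sets of themes: 2C / (A + B)."""
    if not themes_a and not themes_b:
        return 1.0  # assumption: two empty sets count as full agreement
    return 2 * len(themes_a & themes_b) / (len(themes_a) + len(themes_b))


def pair_consistency(coder_a, coder_b):
    """Average per-comment consistency between two coders."""
    return mean(rolling_consistency(coder_a[c], coder_b[c]) for c in coder_a)


# Hypothetical themes assigned to comments 1-3 by three human coders.
humans = {
    "ana": {1: {"billing date"}, 2: {"price"}, 3: {"support wait time"}},
    "ben": {1: {"billing date", "price"}, 2: {"price"}, 3: {"support staff"}},
    "cy": {1: {"billing date"}, 2: {"price", "discounts"}, 3: {"support wait time"}},
}
# Hypothetical themes assigned by the AI to the same comments.
ai = {1: {"billing date"}, 2: {"price"}, 3: {"support wait time", "support staff"}}

# Baseline: how consistent are the people with each other, on average?
human_baseline = mean(
    pair_consistency(humans[a], humans[b]) for a, b in combinations(humans, 2)
)
# How consistent is the AI with each person, on average?
ai_consistency = mean(pair_consistency(ai, coder) for coder in humans.values())

print(f"Humans with each other: {human_baseline:.2f}")
print(f"AI with humans:         {ai_consistency:.2f}")
```

If the second number is close to the first, the AI behaves like one more expert in the group.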

2. Usefulness trumps accuracy

It’s easy to be highly consistent and accurate, if there are very few themes, let's say only 3: billing, price, or customer service.

But is the analysis useful?

Unlikely! You probably already knew these categories in advance.

A solution with 100% accuracy that doesn’t get more specific than "billing" is worse than a solution that has 85% accuracy but tells you which comments are saying “billing date is inconvenient” vs. “billing terms have improved” vs. “billing isn’t accurate.”

Ultimately, it’s more important to get specific and actionable themes, rather than chasing an accuracy number. More on this later.

3. Ease and speed of refinement matter too

Refinement of themes is a key step that many people overlook. No matter how accurate the analysis is, there are always human perspectives and tacit business knowledge that need to be incorporated to make the output actionable.

Oftentimes this knowledge and perspective only becomes apparent after a review of the first results.

Let’s say we have evaluated two solutions:

Solution A is 70% accurate in finding themes. Its decisions are transparent, and anyone can easily refine them to make the results more actionable and relevant. Within a couple of hours you can get to nearly perfect accuracy.

Solution B is 80% accurate, but it’s a black box. It takes weeks of engineers' or scientists' time to refine the output.

I know which one I’d rather choose!

How we achieve high accuracy at Thematic

While 100% accuracy is a myth, that doesn't mean we don't strive for high accuracy. Our goal is to deliver best-in-class text analytics for feedback analysis. We do this by using the latest AI techniques and bringing scientific rigor to finding the best solution. Our R&D team is a group of experts in this field, including myself, the co-founder and CEO. I have a PhD in keyword extraction and more than 2500 academic citations in Natural Language Processing.

Other players in our field are slow to let go of rule-based methods. They created whole ecosystems of consultants who write the rules manually and bill by the hour. At Thematic, we don't rely on consulting hours. Our goal is to automate as much as possible, so we've always been the first to ship products with the latest AI.

The field of AI has been evolving rapidly in recent years (see the arc below). Since Thematic was founded, we've been integrating the latest advances into our products at the same rapid pace.

We've also been mindful of addressing the three challenges with achieving high accuracy explained earlier. Here's how:

1. We reduce the subjectivity of the analysis and evaluation

We have established that feedback analysis is subjective. But if we split the analysis into many individual tasks, we can make the evaluation more objective: Have we captured all important broad themes? Have we found all the right specific themes? Have we found all occurrences of a theme in all sentences?

We also split the question "Is it accurate?" into key elements that together contribute to overall accuracy:

  • Coverage - What's the percentage of sentence segments that have a theme?
  • Exhaustiveness - Have we found all important themes present in the data?
  • Precision - Are the right sentences tagged with that theme?
  • Recall - Have we found all occurrences of a theme in the data?
  • Specificity - Are the themes at the right level of granularity?
  • Comparison - Is system A better than system B at tagging a set of sentences with themes?
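
Three of these elements lend themselves to simple counting once there is a reference set of labels to compare against. The Python sketch below shows one possible way to compute coverage, precision and recall. The data layout (a dictionary from sentence ID to a set of themes), the function names and the sample labels are assumptions made for this example. Exhaustiveness, specificity and comparison need other methods or human judgment and are not covered here.

```python
def coverage(tagged):
    """Share of sentence segments that received at least one theme."""
    if not tagged:
        raise ValueError("Need at least one sentence segment")
    return sum(1 for themes in tagged.values() if themes) / len(tagged)


def precision_recall(tagged, reference, theme):
    """Precision and recall of one theme against reference labels."""
    predicted = {s for s, themes in tagged.items() if theme in themes}
    expected = {s for s, themes in reference.items() if theme in themes}
    hits = len(predicted & expected)
    # Precision: are the right sentences tagged with that theme?
    precision = hits / len(predicted) if predicted else 0.0
    # Recall: have we found all occurrences of the theme?
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical system output: sentence ID -> themes the system assigned.
tagged = {
    "s1": {"billing date"},
    "s2": {"billing date"},
    "s3": set(),
    "s4": {"price"},
    "s5": {"billing date", "price"},
}
# Hypothetical reference labels that reviewers agreed on as "correct".
reference = {
    "s1": {"billing date"},
    "s2": {"price"},
    "s3": {"billing date"},
    "s4": {"price"},
    "s5": {"billing date"},
}

cov = coverage(tagged)
precision, recall = precision_recall(tagged, reference, "billing date")
print(f"Coverage: {cov:.2f}")
print(f"'billing date' precision: {precision:.2f}, recall: {recall:.2f}")
```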

For each task, we then define the evaluation:

  1. The best possible evaluation datasets with labels we agree on as "correct".  
  2. The best metric that fits that data and task.

As new models come out, we constantly optimize the solution using automatic accuracy evaluation. The outcome is an AI model that delivers the best combination of accuracy, speed and cost for that task. With this approach, Thematic's AI always stays current and delivers the best possible accuracy.

We use the same approach in other areas of Thematic, such as sentiment analysis, conversational analytics and question answering. An additional bonus: when tasks are kept precise and discrete, Large Language Models are less likely to hallucinate.

2. We deliver useful analysis: both high-level and granular themes

Our users tell us that they need to be able to zoom in and out when analyzing feedback. When zooming out, they want to see a high-level overview: Which business areas cause the most concern for their customers? When zooming in, they want to see the specific things customers aren't happy about: What specifically is an issue and what can we do about it?

To achieve this, we made sure we discover themes at multiple levels and make it easy to zoom in and out within our interface. For example, below is an overview of themes discovered in support chat for a music learning app:

You can instantly see that Account/Subscription Management is the most common reason people contact the team. Within the base theme Instruments and Hardware, you can review which specific instruments are more likely to cause issues.

To get more granular on a subtheme like Keyboard, you can click into a theme summary and theme clusters that group similar issues encountered by users:

If you click on a cluster, e.g. "Keyboard mapping and compatibility issues", you will see the final level of specificity and the original feedback:

The easy access to original quotes is particularly useful for learning more about the issue or even contacting the user for further details.

3. We make it easy and fast for anyone to refine themes

Finally, at Thematic, we’ve designed an interface that makes it easy to quickly refine themes. You don't need to be a data scientist. If you can move files into folders on your computer, you can use our Themes Editor. Note, however, that this refinement is optional. Even without human help, Thematic can produce themes that are 80-90% accurate, depending on the dataset.

Themes can be regrouped, renamed and deleted. If a theme is not specific enough, you can discover more specific themes within it. Our customer success team helps users by adding an initial set of refinements. Here's a guide and a video on how to use our themes editor.

Most recently, we added an accuracy gauge to help users understand when they can stop editing. Here, the accuracy needed to be calculated without knowing what "correct" themes might be. It also needed to be calculated incredibly fast, since each edit could change the metrics. To achieve this, we simplified the accuracy metrics to just Coverage and Specificity and created a proprietary method to measure these on the fly.

Why a user-friendly themes editor is important:

  • You can validate the AI’s interpretation and curation of themes.
  • You can refine theme names to fit your business' or team's terminology.
  • You can tailor the theme model to fit your business lens.
  • You can guide the AI to the right level of granularity and exhaustiveness.

Our AI learns from anonymized edits, delivering better results in the future.

Case study: Accuracy of student feedback analysis

In the early days of Thematic, we published a guide on how to evaluate analysis of customer feedback, and a white paper on the advanced evaluation of Thematic.

In the white paper we used two different feedback datasets. Each one had themes from 4 people.

We calculated people’s consistency with each other, and compared Thematic’s average consistency with them.

Our research shows that Thematic can be more accurate than people, whose analysis can suffer from personal bias or fatigue.

Try to quickly eyeball the themes that Thematic found on one of the datasets: “How can we improve things in our business school?”

The advanced evaluation results were as follows:

  • Before a user edited the themes, Thematic was slightly less consistent than people. Post-editing, Thematic was better than 2 out of 4 people.
  • Thematic’s results were more specific because it discovered not only themes but also sub-themes. Where people provided only general categories, Thematic found both.

Since we concluded that study, Thematic has kept improving. With the power of Large Language Models, we can now deliver even better accuracy out of the box.

Takeaways

  • Don't believe claims of 100% accuracy. Instead, look at the methodology behind them and at the refinement process.
  • Solutions that are as accurate as people (if not more accurate) exist!
  • You don’t need to compromise the quality of analysis when automating this task.
  • Usefulness (specificity and actionability) matters more than accuracy alone.
  • Ease and speed of refinement matter, and refinement should be accessible to anyone.
  • Make sure to test on your data to get a sense of how useful and actionable the themes are.

At Thematic we care about accuracy a great deal! And we guarantee human-competitive results. Want to test it out? Reach out to request a free analysis of your customer feedback in Thematic!

Alyona Medelyan PhD

Alyona has a PhD in NLP and Machine Learning. Her peer-reviewed articles have been cited by over 2600 academics. Her love of writing comes from years of PhD research.