Guide: Open-End Coding Of Customer Surveys [Updated 2020]

Open-Ended Survey Responses
This article is authored by Alyona Medelyan, Ph.D. in Natural Language Processing. Medelyan spent over 15 years researching ways to extract meaning from text.


Open-ended survey questions often provide the most useful insights, but if you’re dealing with hundreds or thousands of answers, summarising them will give you the biggest headache.

The answer? Coding open-ended questions. But what’s the best way to go about open-end coding customer surveys?

Whether you go for manual or automated coding, it’s a good idea to learn best practices from people who have been dealing with text for decades: qualitative researchers.

Here, you’ll learn how manual coding works.

[Figure: Coding open-ended questions. From text to codes to analysis]


What is coding and why does it matter?

When you hear a term like ‘big data’ it almost always refers to quantitative data: numbers or categories. Statistical and machine learning techniques “love” numbers.

Free text is an example of qualitative data. Dealing with it is hard but important!

Qualitative researchers believe that numbers won’t get you far. They believe that by interviewing people and asking them to answer open-ended questions, you can learn more than by only looking at quantitative data.

Let’s take NPS surveys as an example. The NPS score, calculated from numeric answers to ‘How likely on a scale from 0 to 10 are you to recommend us to a friend or family?’, gives you a single measure of a company’s performance.

But it’s the open-ended answers to the question ‘Why did you give us that score?’ that will teach you how to improve that measure in the future.
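For reference, here is a minimal sketch of how the NPS score mentioned above is usually calculated. It assumes the standard convention of a 0 to 10 scale, where 9 and 10 count as promoters and 0 to 6 as detractors. The function name and the example ratings are made up for illustration.

```python
def nps(ratings):
    """Return the Net Promoter Score for a list of 0-10 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 to 6
    # Percentage of promoters minus percentage of detractors.
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical ratings from ten respondents.
example_ratings = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]
score = nps(example_ratings)
print(score)
```

The result is a single number, which is exactly why it cannot tell you what to fix. That part comes from the open-ended answers.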

Qualitative research produces a lot of text.

Researchers use coding to draw conclusions from this data. Survey questions where respondents can write whatever they like are also called open-ended questions. A response is known as a verbatim.

‘Coding’ or ‘tagging’ each open-ended response with one or more codes helps capture what the response is about, and in turn, summarise the results of the entire survey effectively.
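To make this concrete, here is a small sketch of what coded data can look like and how it turns into a summary. The responses and code names are invented, and the codes are assumed to have been assigned by a human coder.

```python
from collections import Counter

# Hypothetical verbatims, each tagged by hand with one or more codes.
coded_responses = [
    ("Staff were friendly but the room was dusty", ["staff", "cleanliness"]),
    ("Check-in took forever", ["wait time"]),
    ("Spotless room, lovely staff", ["cleanliness", "staff"]),
    ("Waited 20 minutes for someone to answer", ["wait time"]),
    ("The room was dirty", ["cleanliness"]),
]

# Count how many responses mention each code.
code_counts = Counter(code for _, codes in coded_responses for code in codes)

for code, count in code_counts.most_common():
    share = 100 * count / len(coded_responses)
    print(f"{code}: {count} responses ({share:.0f}%)")
```

Once every response carries codes, the free text can be counted and compared like any other categorical data.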

If we compare coding to NLP methods for analyzing text, in some cases coding can be similar to text categorization and in others to keyword extraction.

Next, let’s look at coding and the different methodologies in more detail.

We mostly describe how to perform the task manually, but if you are considering an automated solution, this knowledge will help you understand what matters and choose an effective approach.


What is a coding frame?

Codes are organized into a coding frame. This frame is important because it represents the organizational structure of the codes and influences the usefulness of the coded results.

There are two types of frame, ‘flat’ and ‘hierarchical’:


  • A flat frame means that all codes are at the same level of specificity and importance. It is easy to understand, but if it gets large, organizing and navigating it becomes difficult.
  • Hierarchical frames capture a taxonomy of how the codes relate to one another. They allow you to apply a different level of granularity during coding and analysis of the results.

One interesting application of a hierarchical frame is to support sentiment differences.

If the top-level code describes what the open-ended response is about, a mid-level one can specify if it is positive or negative and a third level the attribute or specific theme.

You can see an example of this type of coding frame below.

[Figure: Using sentiment in a hierarchical coding frame]
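As a rough illustration of a hierarchical frame with a sentiment level, each code can be stored as a path of three levels: topic, sentiment, and specific theme. This is only a sketch. The code names and the `rollup` helper are hypothetical.

```python
from collections import Counter

# Each code is a path: (topic, sentiment, specific theme). All names are invented.
coded = [
    ("Customer service", "positive", "friendly staff"),
    ("Customer service", "negative", "long wait"),
    ("Customer service", "negative", "long wait"),
    ("Product", "positive", "easy to use"),
    ("Product", "negative", "stops working"),
]


def rollup(paths, depth):
    """Count codes after truncating each path to its first `depth` levels."""
    return Counter(path[:depth] for path in paths)


by_topic = rollup(coded, 1)                # coarse view: topic only
by_topic_and_sentiment = rollup(coded, 2)  # finer view: topic plus sentiment
print(by_topic)
print(by_topic_and_sentiment)
```

The same coded data can then be analyzed at whichever level of granularity the question calls for.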


Advantages and disadvantages of code frames

Flat code frame | Hierarchical code frame
Supports fewer codes | Supports a larger code frame
(+) Easier and faster to manually code with | (-) Requires navigating the code frame to find the right one
(+) Easy to provide consistent coding | (-) Prone to a subjective opinion of how each answer is coded
(-) Difficult to capture answers that aren’t common, leading to a large ‘other’ category | (+) Can organize on basis of organizational structure etc.
(-) Doesn’t differentiate between importance and levels of specificity of themes | (+) Allows for different levels of granularity


Coverage and flexibility of a coding frame

A couple of critical things to consider when coding open-ended questions are the size and the coverage of the frame. Make sure to group responses with the same theme under the same code, regardless of wording.

For example, a code ‘cleanliness’ could cover responses mentioning words like ‘clean’, ‘tidy’, ‘dirty’, ‘dusty’ and phrases like ‘looked like a dump’, ‘could eat off the floor’. The coder needs a good understanding of each code and its coverage.
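The coverage of a code can be written down explicitly. The sketch below is one simple way to do that, assuming a hypothetical `code_book` that maps each code to the words and phrases it covers. Plain substring matching stands in for the coder’s judgment here and is much cruder than a human reader.

```python
# Hypothetical code book: each code lists the words and phrases it covers.
code_book = {
    "cleanliness": ["clean", "tidy", "dirty", "dusty",
                    "looked like a dump", "could eat off the floor"],
}


def assign_codes(response, code_book):
    """Return the sorted codes whose words or phrases appear in the response."""
    text = response.lower()
    return sorted(code for code, phrases in code_book.items()
                  if any(phrase in text for phrase in phrases))


print(assign_codes("The lobby looked like a dump", code_book))
print(assign_codes("Breakfast was expensive", code_book))
```

Note that substring matching will also catch words like ‘cleaner’ or ‘unclean’, so a list like this documents what a code covers but does not replace careful reading.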

Having few codes and a fixed frame makes the decision easier.

Having many codes, particularly in a flat frame, makes it harder as there can be ambiguity and sometimes it isn’t clear what a response precisely means.

Manual coding also requires the coder to remember or be able to find all the relevant codes, which is harder with a large coding frame.

Finally, coding frames should be flexible. Coding a survey is a costly task, especially if done manually, and so the results should be usable in different contexts.

Imagine this: You are trying to answer the question ‘what do people think about customer service’ and create codes capturing key answers. Then you find that the same survey responses also have many comments about your company’s products.

If you need to answer ‘what do people say about our products’, you may find yourself having to code from scratch!

Creating a coding frame that is flexible and has good coverage (see the Inductive Style below) is a good way to get value in the future.


Deductive and inductive coding styles

What are the two approaches to manually coding open-ended questions, and which one is best?

1. Deductive coding using a pre-existing frame

With deductive coding, you start with a predefined set of codes. These might come from an existing taxonomy that may cover departments in a business or industry-specific terms.

Here, codes are driven by a project objective and are intended to report back on specific questions.

For example, if the survey is about customer experience and you already know that you are interested in problems that arise from call wait times, then ‘call wait times’ would be one of the codes.

The deductive approach has the benefit that you can guarantee the items you are interested in will be covered, but you need to be careful of bias.

When you use a pre-existing coding frame, you are starting with a bias as to what the answers could be and might miss themes that would emerge naturally from people’s responses.


2. Inductive coding using sampling and re-coding

The alternative coding style is inductive, which is often called ‘grounded’. Here, you start from scratch, and all codes arise directly from the survey responses.

The process for this is iterative:


  1. You read a sample of the data
  2. Create codes that will cover the sample
  3. Reread the sample and apply the codes
  4. Read a new sample of data applying the codes and noting where codes didn’t match
  5. Create new codes
  6. Go back and recode ALL responses again
  7. Repeat from step 4

If you happen to add a new code, split an existing code into two, or change its description, make sure to review how this change will affect all responses.

Otherwise, the same response near the beginning and the end of the survey could end up with different codes!
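The iterative process above, and in particular steps 4 to 6, can be sketched as follows. This is a simplified illustration: keyword matching stands in for the human coder, and the responses, the `code_book` and the `apply_codes` helper are all hypothetical.

```python
def apply_codes(responses, code_book):
    """Code every response from scratch with the current code book."""
    coded = {}
    for response in responses:
        text = response.lower()
        coded[response] = [code for code, words in code_book.items()
                           if any(word in text for word in words)]
    return coded


responses = [
    "Room was dirty",
    "Dusty shelves",
    "Waited an hour on the phone",
    "Long wait at check-in",
]

# Steps 1-3: codes created from a first sample (here, the first two responses).
code_book = {"cleanliness": ["dirty", "dusty"]}

# Step 4: apply the codes to more data and note where no code matched.
coded = apply_codes(responses, code_book)
unmatched = [r for r, codes in coded.items() if not codes]
print(unmatched)

# Step 5: a person reads the unmatched responses and creates a new code.
code_book["wait time"] = ["wait"]

# Step 6: recode ALL responses, not only the unmatched ones.
coded = apply_codes(responses, code_book)
print(coded)
```

Recoding everything after each change to the frame is what keeps early and late responses consistent with each other.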


How to choose high-quality codes

When deciding what codes to create, several things should be considered.

1. Ensure Coverage
Codes should cover as many survey responses as relevant.

The code should be more generic than the comment itself to allow it to cover other responses.

Of course, this needs to be balanced with the usefulness for analysis.

For example, ‘Product’ is a very broad code that will have high coverage but limited value.

On the other hand, ‘Product stops working after using it for 3 hours’ is very specific and is unlikely to cover many responses.
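One way to check this balance is to measure what share of responses each code covers. The sketch below assumes the responses are already coded. The data and the `coverage` function are illustrative only.

```python
def coverage(coded_responses, code):
    """Fraction of responses that carry the given code."""
    hits = sum(1 for codes in coded_responses if code in codes)
    return hits / len(coded_responses)


# Hypothetical coding of five responses; each inner list holds one response's codes.
coded_responses = [
    ["Product", "Product stops after use"],
    ["Product"],
    ["Product", "Product stops after use"],
    ["Customer service"],
    ["Product"],
]

print(coverage(coded_responses, "Product"))                  # broad code
print(coverage(coded_responses, "Product stops after use"))  # narrower code
```

A code that covers nearly everything says little, and a code that covers almost nothing is probably too specific.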

2. Avoid Commonality
Having similar codes is OK, but make sure there is a clear difference between them.

In maths, this is referred to as orthogonality and captures how independent two things are.

‘Customer Service’ and ‘Product’ would be orthogonal while ‘Customer service’ and ‘Customer support’ may have subtle differences but are not orthogonal and may work better as the same code.
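A rough way to spot codes that are not independent is to check how often they are applied to the same responses. The sketch below uses Jaccard overlap for this, which is a stand-in chosen for illustration and not orthogonality in the strict mathematical sense. The response ids and the data are invented.

```python
from itertools import combinations

# Hypothetical result of coding: code -> set of response ids that received it.
responses_by_code = {
    "Customer service": {1, 2, 3, 4},
    "Customer support": {2, 3, 4, 5},
    "Product": {6, 7, 8},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Overlap score for every pair of codes.
overlaps = {
    (x, y): jaccard(responses_by_code[x], responses_by_code[y])
    for x, y in combinations(responses_by_code, 2)
}
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Pairs with a high overlap are candidates for merging. The final decision still depends on whether the difference between the codes matters for your analysis.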

3. Create contrast
Try to create codes that contrast with each other.

Capture both the positive and negative elements of the same thing separately.

For example, ‘Useful product features’ and ‘Unnecessary product features’ would have contrast.

4. Reduce data
Let’s look at the two extremes: There are as many codes as comments, or each code applies to all responses.

In both cases, the coding exercise is pointless.

So, try to think about how to reduce the number of data points so that the analysis is useful.

For example, ‘Product stops working after using it for 3 hours’ would create an unnecessary data point.

Use ‘Product stops after use’ instead.
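The two extremes described above can be detected with a quick check on the coded data. This is a minimal sketch. The `frame_diagnostics` helper and the example codes are hypothetical.

```python
from collections import Counter


def frame_diagnostics(coded_responses):
    """Flag codes that are used only once or on every response."""
    n = len(coded_responses)
    # Count each code at most once per response.
    counts = Counter(code for codes in coded_responses for code in set(codes))
    return {
        "codes_per_response": len(counts) / n,
        "used_once": sorted(c for c, k in counts.items() if k == 1),
        "used_everywhere": sorted(c for c, k in counts.items() if k == n),
    }


# Hypothetical coding of four responses.
coded_responses = [
    ["Feedback", "Product stops working after using it for 3 hours"],
    ["Feedback", "Product stops after use"],
    ["Feedback", "Product stops after use"],
    ["Feedback", "Useful product features"],
]
report = frame_diagnostics(coded_responses)
print(report)
```

Codes used only once may be too specific, and a code used on every response adds no information. A ratio of distinct codes to responses close to 1 suggests the frame is not reducing the data.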


The accuracy of coding open-ended questions

Ensuring consistency is hard regardless of whether coding is deductive or inductive.

A coder’s mental frame and past experiences color how they interpret things. As a result, different people given the same task are very likely to disagree on what the proper codes should be.

In fact, one study has shown that the same person coding the same survey on a different day will produce different results.

Mitigate this by logging all decisions and thoughts that went into coding. Review them when applying existing codes or deciding if a new one is necessary. This process will also mean that the choice of codes can be backed up with evidence.

A different, more expensive, approach to ensure accuracy is through careful testing of the coding reliability.

The ‘test-retest’ method involves the same person coding the data twice, without looking at the results of the first pass.

The ‘independent-coder’ method uses a second coder on the same survey.

In both cases, the results are then compared for consistency and amended as needed.
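The comparison itself can be quantified. The article does not prescribe a metric, so the sketch below shows two common options: simple percent agreement and Cohen’s kappa, which corrects for agreement expected by chance. It assumes, for simplicity, that each response received exactly one code in each pass. The code names and data are made up.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of responses that received the same code in both passes."""
    if len(a) != len(b) or not a:
        raise ValueError("both passes must code the same non-empty set of responses")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement between two passes, corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both passes pick the same code independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both passes used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two coders (or two passes) over the same six responses.
coder_1 = ["service", "service", "product", "product", "price", "service"]
coder_2 = ["service", "product", "product", "product", "price", "service"]

print(round(percent_agreement(coder_1, coder_2), 2))
print(round(cohens_kappa(coder_1, coder_2), 2))
```

Responses where the two passes disagree are the ones worth reviewing, since they often point to codes whose definitions are unclear.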


Top tips for coding open-ended questions


  • Coding is the process of assigning codes to open-ended answers, or other types of text data, after which text can be analyzed just like numerical data.
  • Code frames can be flat (easier and faster to use) or hierarchical (more powerful).
  • Code frames need to have good coverage and be flexible to allow for a complete and varied analysis of open-ended answers.
  • Inductive coding (without a pre-defined code frame) is more difficult but less prone to bias.
  • When creating codes, make sure they contrast with each other and reduce the data.
  • Accuracy means consistent coding – which can be achieved by logging and reviewing decisions.
  • If you’re spending too much time coding qualitative feedback or simply collecting too much to analyze, you should look into automating your open-end coding.

Thematic does exactly that by leveraging AI to analyze (and code) qualitative feedback at scale.

We’d love to chat to see if our algorithms can help you uncover deeper customer insights without all the manual effort.
