Felipe was joined by co-host Dr. Edison Marrese. Both are postdocs working in Deep Learning, with a focus on NLP.
They walked us through a huge amount of material, explaining the maths behind approaches such as distributional word similarity, word embeddings, convolutional and recurrent neural networks.
They also shared some practical tips on models, projects and experimental techniques. The content was adapted from a 3-day course, but they managed to cover the most interesting parts in just the 3 hours of the workshop.
Here are my 3 key learnings:
1. Easily use and adopt models created by other researchers – instead of creating your own
Creating a Deep Learning model requires a huge amount of work: pulling together the right kind and amount of data, setting up the learning environment, and running the algorithms on a server.
The results are often published on GitHub and other websites. Models shared as part of these results can be adopted for similar use cases, often without retraining.
“Fine-tuning the network with your own data is usually the best approach” – Felipe Bravo
For example, the winning solution in a recent sentiment analysis competition was an ensemble of Deep Learning models trained on various data representations.
One model that we can’t wait to experiment with at Thematic is the DeepMoji project, which can distinguish the sentiment of phrases such as “this movie was shit” and “this movie was the shit”.
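Felipe’s fine-tuning advice can be sketched as follows: treat a pretrained network as a frozen feature extractor and train only a small head on your own labels. Below is a minimal pure-Python illustration with made-up weights and data, not any particular framework’s API:

```python
import math

# Hypothetical "pretrained" feature extractor: a fixed projection learned
# elsewhere. During fine-tuning these weights are frozen (never updated).
FROZEN_W = [[0.9, -0.2], [-0.3, 0.8]]

def features(x):
    """Apply the frozen feature extractor to a 2-d input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in FROZEN_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up labelled dataset for the new task.
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]

# Trainable "head": a logistic-regression layer on top of the frozen features.
head_w, head_b = [0.0, 0.0], 0.0
lr = 0.5
for _ in range(200):
    for x, y in data:
        f = features(x)
        p = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)
        err = p - y  # gradient of the log-loss w.r.t. the pre-sigmoid score
        head_w = [w - lr * err * fi for w, fi in zip(head_w, f)]
        head_b -= lr * err

preds = [round(sigmoid(sum(w * fi for w, fi in zip(head_w, features(x))) + head_b))
         for x, _ in data]
print(preds)  # the head now separates the two classes: [1, 1, 0, 0]
```

In practice you’d do this in a real framework with real pretrained weights; the point is only that the pretrained part stays frozen while a small new layer learns your task.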
2. A big trend towards character-based NLP models
Typically, NLP models are trained by splitting text into sequences of words. This works particularly well for the English language.
Unlike Finnish or Russian, it has very few suffixes and endings. Unlike German, it doesn’t typically combine words to form new ones. And unlike Japanese and Chinese, its words are always separated by whitespace.
This makes English one of the easiest languages to analyze.
Interestingly, Deep Learning models trained on character sequences don’t rely on language-specific methods for dealing with special language characteristics, such as tokenization for splitting text into words.
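To make the tokenization point concrete: whitespace splitting recovers words in English but not in, say, Japanese, whereas a character-level model needs no tokenizer at all. A toy illustration (the strings are just examples):

```python
en = "deep learning models"
ja = "深層学習モデル"  # "deep learning models" in Japanese: no whitespace

# Word-level input: whitespace tokenization works for English only.
print(en.split())  # ['deep', 'learning', 'models']
print(ja.split())  # ['深層学習モデル'] -- one opaque "token"

# Character-level input: no language-specific tokenizer needed.
print(list(ja))    # ['深', '層', '学', '習', 'モ', 'デ', 'ル']
```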
Since English has by far the most training data and language-specific tooling, this is good news: other languages can now benefit from character-based models, which need neither.
Similarly, in the 90s, one of the first usable language-detection algorithms identified a language by its most common character-sequence patterns. Languages have hidden statistical properties that most people aren’t aware of, but Deep Learning models can capture them.
3. Customer feedback analysis is one of the hardest NLP tasks
Our post-event discussions with Felipe and Edison about what we do at Thematic were thought-provoking. In the academic world, thematic analysis of people’s reviews is called “aspect detection”.
For example, “room cleanliness”, “breakfast quality”, “location”, “price”, “check out” are all aspects of hotel reviews. Hotel owners need to know which sentiment is attached to each aspect.
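For intuition only, here is a crude keyword-based sketch of aspect detection with per-aspect sentiment. The aspect cues and sentiment lexicons are invented, and real systems (including Thematic’s) are far more sophisticated:

```python
# Hypothetical aspect cues and sentiment word lists, for illustration only.
ASPECTS = {
    "room cleanliness": ["clean", "dirty", "spotless"],
    "breakfast quality": ["breakfast"],
    "location": ["location", "close to", "far from"],
}
POSITIVE = {"great", "spotless", "excellent", "friendly"}
NEGATIVE = {"dirty", "terrible", "noisy"}

def analyze(review):
    """Return {aspect: sentiment} for each aspect mentioned in the review."""
    results = {}
    # Naively split into clauses so each aspect gets its own sentiment.
    for clause in review.lower().replace(",", " but ").split(" but "):
        words = clause.split()
        sent = ("positive" if any(w in POSITIVE for w in words) else
                "negative" if any(w in NEGATIVE for w in words) else "neutral")
        for aspect, cues in ASPECTS.items():
            if any(cue in clause for cue in cues):
                results[aspect] = sent
    return results

print(analyze("The room was dirty but the breakfast was great"))
# {'room cleanliness': 'negative', 'breakfast quality': 'positive'}
```

Even this toy version hints at why the task is hard: cue lists must be hand-built per domain, and one sentence can carry opposite sentiments about different aspects.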
At Thematic, we deal with a variety of businesses, and each has its own unique set of aspects. In fact, our customers don’t just want to know the sentiment of top-level aspects; they need an in-depth understanding of what is actually driving that sentiment.
We solve this problem by automatically extracting the most common aspects, or themes, in customer responses (often hundreds of them). The same algorithm automatically groups these themes into broader categories for easy analysis.
When we explained this process, Felipe and Edison agreed that it’s an extremely hard task to solve across a variety of datasets.
Because most businesses don’t have training data, Deep Learning algorithms can’t easily help, and an approach specifically crafted for this task (as we’ve done at Thematic) is required.
In the academic world, researchers tend to compete on clearly defined tasks and datasets that can be shared. While it’s possible to design a task around hotel reviews, a cross-domain approach is much harder, particularly given how subjective this task is.
I believe that the best ideas come from such interactions, and I’m sure the workshop attendees have benefited from the knowledge shared at this workshop. A huge thanks to GridAKL for sponsoring the venue, and Felipe and Edison for running it.