A Primer on Natural Language Understanding (NLU) Technologies

From interactive conversational agents to sentiment analysis to search query predictions, Natural Language Understanding (NLU) is the technology behind many of today’s most popular language-driven applications across a wide variety of industry domains.

Yet few truly understand what this relatively new field of human language technology entails in practice. The following primer sheds some light on what exactly NLU does, how it works and the state of its development today.

What Does NLU Do Exactly?

As its name implies, NLU refers to how computers understand natural language. But that definition requires us to address three words—or two terms.

Defining Natural Language

Here, “natural language” refers to the language human beings naturally use to communicate from the age of two or three. The communication produced (also called “content”) is made available in textual form: either written directly, or spoken and then transcribed by speech recognition or by professionals.

Defining Understanding

In the context of NLU, “understanding” means first extracting information from such textual content, then relating the different pieces of extracted information to each other, so as to finally take appropriate action (e.g., book a trip or a hotel, order food or check calendar availability).

Case Study: Using NLU in Call Centers

In combination with transcriptions from automatic speech recognition (ASR) technology, NLU is often used in call centers to automatically analyze a phone call transcript, with the end goal of anonymizing the personally identifiable information it contains. To do that, we need to understand, for instance, whether the proper noun “Paris” refers to the name of a person, a god or a city.

What’s The Point of NLU?

In general, NLU’s main goal is to build a document-specific dynamic knowledge graph related to the context of the content/document we are analyzing. Such a knowledge graph provides a bird’s eye view of all the information pieces a document contains (e.g., a date, a person, a price or the CVC number of a credit card) and how they relate to one another (e.g., to meet or to pay).

The knowledge graph needs to be dynamic because information within a document comes in a sequence and can be updated (e.g., during a call, the discussed meeting time can be changed or the price of an item can be corrected). Furthermore, we need reasoning capabilities to precisely instantiate the real values the document refers to: “Tomorrow” is a date, but its value is “2021-08-21” when the date of document creation is August 20, 2021. “Eight o’clock” is 8 p.m. if the context is about dining, and so on.
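The instantiation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RELATIVE_DAYS` table, the function names and the rule that a "dining" context implies an evening time are all hypothetical, and a real system would need far richer rules or a learned model.

```python
from datetime import date, time, timedelta

# Hypothetical lookup of relative day expressions to day offsets.
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_date(expression: str, document_date: date) -> date:
    """Turn a relative day expression into a concrete calendar date."""
    offset = RELATIVE_DAYS[expression.lower()]
    return document_date + timedelta(days=offset)


def resolve_hour(hour: int, context: str) -> time:
    """Resolve a 12-hour clock mention using a coarse context label.

    Assumption: a "dining" context means an evening time; any other
    context leaves the hour unchanged.
    """
    if context == "dining" and hour < 12:
        hour += 12
    return time(hour=hour)


created = date(2021, 8, 20)                 # date of document creation
print(resolve_date("tomorrow", created))    # 2021-08-21
print(resolve_hour(8, "dining"))            # 20:00:00
```

The point of the sketch is that the same surface expression yields different values depending on information that sits outside the sentence itself: the creation date of the document and the topic being discussed.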

If we take the “Paris” example (which, again, can be a person’s name, a city or even a god) and the sentence, “I love Paris. The city is wonderful,” there is no doubt that a probabilistic machine learning system can make use of the context (i.e., the word “city” in the next sentence) to tag “Paris” as a city. A good named entity recognition (NER) system could do the job. But the real question is to understand which “Paris” we are talking about.

An NLU system needs to tell us where this “Paris” is in order to, for example, calculate the time needed to get there or understand that a person referred to in the document is the mayor of that city. “Where” is important because there are 53 places called Paris in the world. Which one is it? Paris, the French capital, is the most widely known city in this list. If the document continues with, “But waiters there are arrogant,” the probability that we are talking about Paris, France, gets higher.

But imagine the following sentence: “Its vicinity to the quiet lake of Oneida does not seem to help.” Here, the context switches completely: If we use our world knowledge, we know we are talking of Paris in the north of New York state. Technically, this final decision can be taken only by making use of another knowledge graph—a general geographical knowledge graph—which contains a link between “Paris” and “Oneida.” Combining the result of our NER neural network model with the general knowledge graph allows us to make the correct instantiation.
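To make the combination of NER output and a general knowledge graph more concrete, here is a minimal Python sketch. The tiny `GEO_GRAPH` dictionary, the `disambiguate` function and the fallback to the most widely known candidate are assumptions made for illustration; a production system would query a full geographical knowledge base and weigh evidence probabilistically.

```python
# Hypothetical miniature geographical knowledge graph: each candidate
# place called "Paris" is linked to entities known to be related to it.
GEO_GRAPH = {
    "Paris, France": {"France", "Seine", "Eiffel Tower"},
    "Paris, New York": {"New York", "Oneida"},
    "Paris, Texas": {"Texas"},
}


def disambiguate(context_mentions, graph=GEO_GRAPH, default="Paris, France"):
    """Pick the candidate that shares the most links with the context.

    If no candidate is linked to anything mentioned in the context, fall
    back to the most widely known candidate (an assumed prior).
    """
    mentions = set(context_mentions)
    scores = {place: len(links & mentions) for place, links in graph.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


# The NER model has tagged "Paris" as a city; the other entities found
# in the document are passed in as context.
print(disambiguate({"Oneida"}))    # Paris, New York
print(disambiguate({"waiters"}))   # Paris, France
```

The mention of “Oneida” is what switches the decision, exactly because the graph contains a link between the two entities.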

How Does NLU Work?

At a high level, NLU provides answers to questions such as who, when, where, what and, ultimately, how. A document can be the transcript of a phone call, a newspaper article, an email or any document formulated in a natural language—such as English. Examples of questions that can be answered by NLU include:

“Who had a car accident?”
“When did the customer have a car accident?”
“Who talked with Biden about Korea?”
“When did Biden talk about Korea with Macron?”

Abstracting Document Content

In order to answer such questions, we first need to abstract a document’s content. To answer the “who” question, we look for an abstracted notion of a person. To answer the “when” question, we look for abstracted notions of date, time and, possibly, date-time intervals or durations. We also need to give a value to each of these abstracted notions, which we call “instantiating” or “resolving.” By abstracting and resolving, we extract a large set of information from a document (e.g., a list of people or locations and their names, or a list of times and their values), but we are still unable to answer the questions above.

To do that, we need a relation between the elements of information we have extracted. In other words, we need to decide if two pieces of information are indeed related to each other and how. In the example above, Biden and Macron are the people related to each other and they are related by talking about Korea. Typically, such a relation is defined at a basic level by syntax (subject-verb-object); but the relation identification process can be more complex: First because the two elements of information to be related are not necessarily in the same sentence; and second because human-to-human communication contains a lot of implicit information.
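One simple way to picture such relations is as small records that link the extracted elements, which can then be queried. The following Python sketch is an assumption-laden illustration: the `Relation` record, its field names and the `who` helper are hypothetical, and the relations are written by hand here rather than extracted from text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A hypothetical record linking extracted pieces of information."""
    predicate: str           # e.g., the normalized verb
    participants: frozenset  # the people involved
    topic: str               # what the relation is about


# Relations assumed to have been extracted from a document.
relations = [
    Relation("talk", frozenset({"Biden", "Macron"}), "Korea"),
    Relation("have_accident", frozenset({"customer"}), "car"),
]


def who(predicate, with_person, topic, facts):
    """Answer 'Who <predicate> with <with_person> about <topic>?'."""
    answers = set()
    for fact in facts:
        if (fact.predicate == predicate
                and fact.topic == topic
                and with_person in fact.participants):
            answers |= fact.participants - {with_person}
    return sorted(answers)


print(who("talk", "Biden", "Korea", relations))  # ['Macron']
```

The hard part in practice is building these records, since the related elements may sit in different sentences or be only implied.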

Reasoning With Information

NLU is also about reasoning—i.e., deciding if a new piece of information that has appeared is replacing previous information or is complementary to it.

To illustrate, imagine the transcript of a phone call in which a new element of a mailing address is mentioned. Is that element part of a new address, does it complement the address previously provided, or is it a confirmation or a repetition? Reasoning allows us to decide what to do with this new piece of information.
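A deliberately crude version of this decision can be written as a slot-update rule. The sketch below is hypothetical (the `merge_address` function, the field names and the sample values are invented for illustration) and it ignores the harder case of detecting that a whole new address has started; it only shows the three outcomes of complementing, repeating and replacing.

```python
def merge_address(current: dict, field: str, value: str) -> str:
    """Update `current` in place and report which decision was taken."""
    if field not in current:
        current[field] = value      # new information fills an empty slot
        return "complement"
    if current[field] == value:
        return "repetition"         # same value again: a confirmation
    current[field] = value          # different value: treat as a correction
    return "replacement"


address = {"street": "12 Main Street"}                 # invented sample data
print(merge_address(address, "city", "Springfield"))   # complement
print(merge_address(address, "city", "Springfield"))   # repetition
print(merge_address(address, "city", "Shelbyville"))   # replacement
```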

Reasoning also comes into play when a reference is made to previously provided information. Take “Jane has a green car; Peter dreams of a red car.” A later reference in the document to a “green car driver” is understood to mean Jane, but a mention of a “red car driver” certainly does not refer to Peter, who only dreams of such a car.

Making The Implicit Explicit

Knowing the context in which a document has been produced allows us to be more precise; but context is often implicit. Put differently, the context is not necessarily put in words within the document—it can be a lot of things: a domain, the profile of a user and so on.

The word “tomorrow,” for instance, can be interpreted correctly by taking into account the context, which could be the date of a request. The same goes for a time slot: If we know the time zone of a person who proposes a time for a meeting, we can produce a calendar invite that is valid in another time zone.
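The time zone case can be sketched with Python’s standard `datetime` module. To keep the example dependency-free, fixed UTC offsets stand in for real time zones; this is an assumption, as are the function name and the sample meeting time. A real system would use named zones so that daylight saving changes are handled correctly.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed summer offsets stand in for the proposer's and the
# invitee's time zones.
PARIS_SUMMER = timezone(timedelta(hours=2), "CEST")
NEW_YORK_SUMMER = timezone(timedelta(hours=-4), "EDT")


def localize_proposal(naive_time, proposer_zone, invitee_zone):
    """Attach the proposer's implicit zone, then convert for the invitee."""
    aware = naive_time.replace(tzinfo=proposer_zone)
    return aware.astimezone(invitee_zone)


# "3 p.m. tomorrow", already resolved against the date of the request.
proposal = datetime(2021, 8, 21, 15, 0)
invite = localize_proposal(proposal, PARIS_SUMMER, NEW_YORK_SUMMER)
print(invite.isoformat())  # 2021-08-21T09:00:00-04:00
```

The time zone never appears in the text of the request; it comes from what we know about the person making it.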

In short, NLU is about abstracting and reasoning within a specific context or domain to derive an action (containing new computable values), as the example of automating a calendar invite illustrates. As outlined above, successfully completing an NLU task in relation to a document heavily depends on the questions we are asking about such a document.

What Is the State of NLU Today?

NLU is most commonly found in speech- and language-driven applications such as personal smart assistants (e.g., Alexa and Google Home), conversational chatbots, Internet of Things (IoT) devices, speech-enabled mobile apps and applications designed to gather sentiment on a given topic. Today, it performs exceedingly well with simple, short, directive utterances where the context is known (e.g., “What’s the local weather tomorrow?”, “I want to order a large pizza with pepperoni” and “Where is the closest dry cleaner?”).

The Impact of Word2vec Embeddings

Introduced in 2013, Word2vec embeddings were a major milestone for NLU technology, as they allowed scientists to move from a discrete letter-based modeling approach to a continuous, high-dimensional vector-based one. As a result, we now have considerably more space at our disposal to model a document: 500 or even 1,000 dimensions, as opposed to the 26 dimensions (corresponding to the 26 letters in the English alphabet) we were dealing with initially. Additionally, this space is continuous, with values between -1 and +1 as opposed to the values 1 or 0.

These embeddings only half-opened the door to a new world, because they produce one embedding per word without taking its semantic class into account. In our “Paris” example, this would place “Paris” in a space between unambiguous person names like “John” and unambiguous cities like “Berlin.” We knew we were in the middle but, in a specific text, we still had no way to decide which class the word “Paris” referred to; we only knew it was ambiguous. A few years later, we were able to train context-dependent embeddings (such as those produced by BERT and RoBERTa), where we could obtain one embedding for “Paris” as a person name and another for “Paris” as a city name. These more complex embeddings are now widely and successfully used in models fine-tuned for downstream tasks like named entity recognition.

The advantage of embeddings is that each word vector produced is positioned in the defined space (e.g., 500 dimensions) with respect to the context within which the word has been observed and also in relation to other embeddings (for example, by grouping proper names in the same sub-space). This means a word vector contains more information than the representation of the word itself—it also contains abstracted information about this word: an internal classification that can be near what a human would call “syntactic-semantic” information. The information stored in a vector is then very useful for the NLU information extraction phase.
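The idea of words being positioned relative to one another can be illustrated with cosine similarity on toy vectors. The three-dimensional vectors below are invented for illustration only and are not output from Word2vec or any trained model; they simply mimic an ambiguous “Paris” sitting between a person name and a city name.

```python
import math

# Invented 3-dimensional vectors, for illustration only. Real embeddings
# have hundreds of dimensions and are learned from data.
TOY_EMBEDDINGS = {
    "John": [0.9, 0.1, 0.0],     # unambiguous person name
    "Berlin": [0.1, 0.9, 0.0],   # unambiguous city name
    "Paris": [0.5, 0.5, 0.1],    # ambiguous: person or city
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


emb = TOY_EMBEDDINGS
print(round(cosine(emb["Paris"], emb["John"]), 2))    # 0.77
print(round(cosine(emb["Paris"], emb["Berlin"]), 2))  # 0.77
print(round(cosine(emb["John"], emb["Berlin"]), 2))   # 0.22
```

With a single static vector, “Paris” is equally close to both classes, which is the ambiguity described above; a context-dependent model would instead produce a different vector for each occurrence.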

Challenges Facing NLU Technology

However, when it comes to inferring the expected, exact and correct meaning/decision/conclusion in relation to the task analyzed within a long-form document or conversation, NLU still often struggles to generate an accurate result. Why?

Embeddings provide us with an internal representation of how each word is to be positioned and categorized or abstracted with respect to other words. But a human being cannot directly interpret this internal representation. For such an interpretation, we need annotated data that can map the model’s internal representation to what the human being needs—for example that “Paris” is the name of a person and not a city, or that the number “nineteen ninety-five” is a price and not a date.

For a NER task, deciding whether a word is the name of a person or a city can be fairly simple: The information is often found locally, within a few words to the left and right of the word to identify, typically within the same sentence. For precise price extraction, though, things are harder. The contextual information needed to decide whether the number we are looking at is a credit card CVC or a money amount is found not at sentence level but at paragraph (or document) level.

Even if implicit contextual information exists (for example, information linked to an inbound call center’s phone number, such as the name of the company called, its location, the service called or the types and names of products), the question remains how to efficiently transfer this information to the model or to the corpus so that we can make use of the metadata that defines the call’s context. Clearly, specific annotated corpora are needed.

Conclusion

One of the primary reasons NLU lags behind other language technologies, such as speech recognition and machine translation, is that it does not have an extensive set of annotated data to fuel it. Creating data for NLU machine learning models is a more complex process and requires a deeper skill set than building ASR corpora.

For example, defining a named entity within a document also requires looking at the context within, and of, the document itself to infer the conclusion (e.g., city = “Paris” and location = “near Oneida” defines the entity “Paris” as “Paris, NY”). Tasks such as analyzing sentiment are even harder and are often subject to a higher degree of inter-annotator disagreement due to their subjective nature. As such, these types of tasks can produce biased output when left unchecked.

Overall, NLU annotation is a complex process due to its very nature: needing to define how, what and why we “understand” what is in the text. To summarize, annotating a document for NLU is more than merely recognizing what the author said or wrote; it is about precisely recognizing what the author wanted to say or write.

Today, we are working to make NLU more “machine learnable” by coming up with innovative ways to efficiently create annotated NLU data. Even if it seems a contradiction, part of the work is building more robust rule-based systems to bootstrap the annotation process. We are also working on how to learn inferencing from data—there are many things to do.

While NLU is still at a relative infancy stage, it is already an exciting component of AI applications and I look forward to where we can take it in the future.