# Named Entity Recognition using Neural Networks

Named-Entity Recognition (NER) is an NLP task that involves finding and classifying sequences of words (tokens) into pre-defined categories. Examples of named entities include person names, locations, organizations and time expressions. At Comtravo one important piece of our NLP pipeline is a NER module which identifies several classes of named-entities. In addition to the standard named entities (persons, organizations) we are also interested in finding airline and airport mentions. In the following blog post we will give an overview of recently published neural network architectures for named-entity recognition. This blog post was originally published on www.davidsbatista.net

Since about 2015-2016, new methods for sequence labelling tasks based on neural networks have been proposed. In this blog post I will do a quick recap of some of these methods. The aim is to understand their architectures and point out what each technique adds, or how they differ from other known methods.

# Introduction

Several NLP tasks involve classifying a sequence of words. A classical example is part-of-speech tagging, in this scenario each $x_{i}$ describes a word and each $y_{i}$ the associated part-of-speech tag (e.g.: noun, verb, adjective, etc.). In named-entity recognition each $x_{i}$ also describes a word and $y_{i}$ is a semantic label associated with that word (e.g. person, location, organization, event, etc.).

# Linear Sequence Models

Classical approaches (prior to the neural networks revolution in NLP) involve methods that make independence assumptions, that is, the predicted tag for each word depends only on the surrounding words, not on the tags of the previous words. Later methods take the label sequence into consideration, i.e. the tag given to the previous word(s) is considered when deciding the tag for the current word. For an overview of these methods you can refer to the following articles:

Recently, the state-of-the-art for most NLP sequence prediction tasks has become neural network methods. Most of these methods combine different neural network architectures in one model. One important architecture common to all the recent methods is the recurrent neural network (RNN). RNNs are designed to store information about history; in the context of NLP sequence tasks, the history encodes information about previous words in a text. The specific architecture used in most of the recent research is a Long Short-Term Memory (LSTM) network.

# Neural Sequence Labelling Models

The first work to use LSTMs for the task of Named Entity Recognition was published back in 2003:

However, the lack of computational power at the time led to small, inexpressive models whose performance was far behind that of other methods. Recently, this performance gap has been closed. I will describe four recent papers that propose neural network architectures for tasks such as NER, chunking and POS-tagging. I will focus specifically on the proposed architectures and leave out details about the datasets or scores.

At time of writing there are newer methods, published in 2017 and 2018, which are currently the state-of-the-art, but I will leave these for another blog post.

## Bidirectional LSTM-CRF Models for Sequence Tagging (2015)

### Architecture

This is, to the best of my knowledge, the first work to apply a bidirectional-LSTM-CRF architecture to sequence tagging. The idea is to use two LSTMs, one reading each word in a sentence from beginning to end and another reading from the end to the beginning. For each word, the Bi-LSTM produces a vector representation from the unfolded LSTM states, i.e. the forward and backward models up to that word. The intuition behind the architecture is that the hidden state vector for each word takes into account the words seen before it, in both directions, thus creating a contextual encoding of each word.

The authors do not mention how the vectors from each LSTM are combined to produce a single vector for each word; I will assume that the vectors are simply concatenated.

The bidirectional-LSTM architecture is combined with a Conditional Random Field (CRF) layer at the top. A CRF layer has a state transition matrix as its parameters; the state transition matrix encodes the probability of moving from one state to another. In the context of sequence tagging, this means, for instance, the probability of applying the organization tag after a person tag. This transition matrix can be used to integrate information about previously predicted tags when predicting the current tag.
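To make this concrete, here is a small sketch of how a transition matrix scores tag bigrams. The tags and all numbers are made up for illustration; in the model these scores are learned parameters:

```python
import numpy as np

# Toy transition matrix over three tags; values are illustrative, not learned.
tags = ["O", "PER", "ORG"]
tag_idx = {t: i for i, t in enumerate(tags)}

# T[i, j] = score of moving from tag i to tag j.
T = np.array([
    [ 1.0,  0.5,  0.5],   # O   -> O / PER / ORG
    [ 0.8,  2.0, -3.0],   # PER -> PER is common, PER -> ORG is penalised
    [ 0.8, -3.0,  2.0],   # ORG -> ...
])

def transition_score(prev_tag, next_tag):
    """Score contributed by a single tag bigram."""
    return T[tag_idx[prev_tag], tag_idx[next_tag]]

print(transition_score("PER", "PER"))  # 2.0
print(transition_score("PER", "ORG"))  # -3.0
```

When predicting the tag for the current word, these bigram scores are added to the per-word scores, so an unlikely tag transition can overrule a locally plausible tag.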

### Features and Embeddings

Word embeddings generated from each state of the LSTM are combined with hand-crafted features:

• spelling, e.g. capitalization, punctuation, word patterns, etc.
• context, e.g. uni-, bi- and tri-gram features

The embeddings used are those produced by Collobert et al., 2011, which have a 130K-word vocabulary; each word corresponds to a 50-dimensional embedding vector.

Features connection tricks:

The input for the model includes word, spelling and context features; however, the authors suggest connecting the hand-crafted features directly to the output layer (i.e. the CRF). This accelerates training and results in very similar tagging accuracy compared to a model without direct connections. The vector representing the hand-crafted features is therefore passed directly to the CRF, not through the bidirectional-LSTM.

### Summary

Overall, the model architecture has three components: an RNN for encoding each word in a document, some hand-crafted features that are useful for the task, and a CRF decoder layer. The Bi-LSTM produces a contextual encoding for each word in a sentence. This encoding is concatenated with a feature vector derived from spelling rules and hand-crafted contextual clues. The final concatenated vector is used to drive a CRF decoder.

## Named Entity Recognition with Bidirectional LSTM-CNNs (2016)

### Architecture

The authors propose a hybrid model combining a bidirectional-LSTM with a Convolutional Neural Network (CNN). The CNN is used to create an encoding of each word; it learns both character- and word-level features. The model therefore makes use of word embeddings, additional hand-crafted word features and CNN-extracted character-level features. All these features, for each word, are fed into a bidirectional-LSTM.

The output vectors of the forward and backward LSTMs at each time step are each decoded by a linear layer and a log-softmax layer into log-probabilities for each tag; these two vectors are then added together.

Character-level features are induced by a CNN architecture, which was successfully applied to Spanish and Portuguese NER (Santos et al., 2015) and German POS-tagging (Labeau et al., 2015). For each word, a convolution and a max-pooling layer are applied to extract a new feature vector from the per-character feature vectors (character embeddings and character type).
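As a rough sketch of the convolution plus max-over-time pooling idea (the sizes and random values are illustrative, not the paper's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 25-dim char embeddings, window of 3 chars, 30 filters.
char_dim, window, n_filters = 25, 3, 30
chars_in_word = 7  # e.g. "germany"

char_embs = rng.normal(size=(chars_in_word, char_dim))      # one row per character
filters = rng.normal(size=(n_filters, window * char_dim))   # CNN filter bank

# Convolution: slide the window over the character sequence...
windows = np.stack([char_embs[i:i + window].ravel()
                    for i in range(chars_in_word - window + 1)])
conv_out = windows @ filters.T                              # (n_windows, n_filters)

# ...then max-over-time pooling keeps the strongest response per filter,
# giving a fixed-size vector regardless of word length.
word_char_features = conv_out.max(axis=0)                   # (n_filters,)
print(word_char_features.shape)  # (30,)
```

The max-over-time pooling is what makes the representation independent of word length: a 4-character and a 12-character word both map to a 30-dimensional vector.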

### Features and Embeddings

Word Embeddings: 50-dimensional word embeddings (Collobert et al. 2011), all words are lower-cased, embeddings are allowed to be modified during training.

Character Embeddings: a randomly initialized lookup table with values drawn from a uniform distribution in the range [−0.5,0.5] to output a character embedding of 25 dimensions. Two special tokens are added for PADDING and UNKNOWN.

Additional Character Features: a lookup table is used to output a 4-dimensional vector representing the type of the character (upper case, lower case, punctuation, other).

Additional Word Features: each word is tagged as allCaps, upperInitial, lowercase, mixedCaps, or noinfo.

Lexicons: partial lexicon matches using a list of known named-entities from DBpedia. The list is then used to perform $n$-gram matches against the words. A match is successful when the $n$-gram matches the prefix or suffix of an entry and is at least half the length of the entry.
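A minimal sketch of this matching rule (the lexicon entries here are invented examples, not DBpedia data):

```python
def partial_lexicon_match(ngram, lexicon):
    """Return True if `ngram` matches the prefix or suffix of any lexicon
    entry and covers at least half of that entry's tokens."""
    ngram_toks = ngram.lower().split()
    n = len(ngram_toks)
    for entry in lexicon:
        entry_toks = entry.lower().split()
        if n * 2 < len(entry_toks):
            continue  # the n-gram is shorter than half the entry: no match
        if entry_toks[:n] == ngram_toks or entry_toks[-n:] == ngram_toks:
            return True
    return False

lexicon = ["Deutsche Lufthansa AG", "Berlin Brandenburg Airport"]
print(partial_lexicon_match("Lufthansa AG", lexicon))  # True  (suffix, 2 of 3 tokens)
print(partial_lexicon_match("AG", lexicon))            # False (only 1 of 3 tokens)
```

Allowing partial matches lets the lexicon feature fire on shortened mentions ("Lufthansa AG" for "Deutsche Lufthansa AG"), while the half-length requirement filters out spurious single-token matches.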

### Summary

The authors also explore several features, some hand-crafted:

• word embeddings
• word shape features
• character-level features (extracted with a CNN)
• lexical features

All these features are concatenated, passed through a bi-LSTM and at each time step decoded by a linear layer and a log-softmax layer into log-probabilities for each tag. The model also learns a tag transition matrix, and at inference time the Viterbi algorithm selects the sequence that maximizes the score over all possible tag-sequences.

## Neural Architectures for Named Entity Recognition (2016)

### Architecture

This was, to the best of my knowledge, the first work on NER to completely drop hand-crafted features, i.e. they do not use any language specific resources or features beyond a small amount of supervised training data and unlabeled corpora.

The two proposed architectures are:

• bidirectional LSTMs + Conditional Random Fields (CRF)
• generating label segments using a transition-based approach inspired by shift-reduce parsers

I will focus on the first model, which follows a similar architecture as the other models presented in this post. I personally like this model because of its simplicity.

As in the previous models, two LSTMs are used to generate a word representation by concatenating its left and right context. These are two distinct LSTMs with different parameters. The tagging decisions are modeled jointly using a CRF layer (Lafferty et al., 2001).

### Embeddings

The authors generate word embeddings from both the characters of the word and from the contexts where the word occurs.

The rationale behind this idea is that many languages have orthographic or morphological evidence for a word or sequence of words being a named-entity; in German, for instance, all nouns are capitalized. The character-level embeddings aim to capture this information. Furthermore, named entities appear in fairly regular contexts in large corpora. They therefore use a large corpus to learn word embeddings that are sensitive to word order.

#### Character Embeddings

A character lookup table containing every character is initialized randomly. The character embeddings corresponding to every character in a word are given in direct and reverse order to a bidirectional-LSTM. The embedding for a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional-LSTM. The hidden dimension of the forward and backward character LSTMs is 25 each.
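The following sketch illustrates the idea of building a word representation from its characters in both directions; for brevity I use a plain tanh RNN cell instead of an LSTM, with random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(1)

char_dim, hidden = 25, 25  # 25-dim char embeddings, 25 hidden units per direction
W_x = rng.normal(scale=0.1, size=(hidden, char_dim))
W_h = rng.normal(scale=0.1, size=(hidden, hidden))

def run_rnn(seq):
    """Run a plain tanh RNN over a sequence and return the final hidden state
    (the paper uses an LSTM cell; this is a simplification)."""
    h = np.zeros(hidden)
    for x in seq:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

word_chars = rng.normal(size=(6, char_dim))  # embeddings for a 6-character word

h_fwd = run_rnn(word_chars)                  # characters read left to right
h_bwd = run_rnn(word_chars[::-1])            # characters read right to left
char_repr = np.concatenate([h_fwd, h_bwd])   # 50-dim character-based word vector
print(char_repr.shape)  # (50,)
```

The forward state summarizes the word's prefix and the backward state its suffix; concatenating them gives a fixed-size character-level representation for any word, including out-of-vocabulary ones.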

#### Word Embeddings

The character-level representation is concatenated with a word-level representation from pre-trained word embeddings. The word embeddings are pre-trained using skip-n-gram (Ling et al., 2015), a variation of skip-gram that accounts for word order.

These embeddings are fine-tuned during training; the authors claim that using pre-trained compared to randomly initialized embeddings results in performance improvements. They also mention that they observe a significant performance improvement by applying a dropout mask to the final embedding layer just before the input to the bidirectional LSTM.

### Summary

This model is relatively simple: the authors use no hand-crafted features, just embeddings. The representation of a word is the concatenation of two vectors: a vector built from character embeddings using two LSTMs run over the characters of the word, and a vector corresponding to word embeddings trained on external data.

The embeddings for each word in a sentence are then passed through a forward and backward LSTM, and the output for each word is fed into a CRF layer.

## End-to-end Sequence Labelling via Bi-directional LSTM-CNNs-CRF (2016)

### Architecture

This system is very similar to the previous one. The authors use a Convolutional Neural Network (CNN) to encode character-level information of a word into its character-level representation. This is combined with a word-level representation and fed into a bidirectional-LSTM to capture contextual information for each word. Finally, the output vectors of the Bi-LSTM are fed to a CRF layer to jointly decode the best label sequence.

### Embeddings

#### Character Embeddings

The CNN is similar to the one in Chiu and Nichols (2015), the second system presented, except that they use only character embeddings as inputs to the CNN, without any character type features. A dropout layer is applied before the character embeddings are input to the CNN.

#### Word Embeddings

The word embeddings are the publicly available GloVe 100-dimensional embeddings trained on 6 billion words from Wikipedia and web text.

### Summary

This model follows basically the same architecture as the one presented before. The only architectural change is the fact that they use a CNN, instead of an LSTM, to generate word-level character embeddings.

# Comparative Summary

I would say the main lessons learned from reading these papers are:

• Use two LSTMs (forward and backward)
• CRF on the top/final layer to model tag transitions
• Final embeddings are a combination of word and character embeddings

In the following table I try to summarize the main characteristics of each of the models.

| Model | Features | Architecture Resume | Structured Tagging | Embeddings |
|---|---|---|---|---|
| Huang et al. 2015 | Yes | bi-LSTM output vectors + feature vectors connected to the CRF | CRF | Collobert et al. 2011, pre-trained, 50 dimensions |
| Chiu and Nichols 2016 | Yes | word embeddings + feature vectors input to a bi-LSTM; the output at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category | Sentence-level log-likelihood | Collobert et al. 2011; char-level embeddings extracted with a CNN |
| Lample et al. 2016 | No | char and word embeddings input to the bi-LSTM; output vectors are fed to the CRF layer to jointly decode the best label sequence | CRF | char-level embeddings extracted with a bi-LSTM; pre-trained word embeddings with skip-n-gram |
| Ma and Hovy 2016 | No | char and word embeddings input to the bi-LSTM; output vectors are fed to the CRF layer to jointly decode the best label sequence | CRF | char embeddings extracted with a CNN; word embeddings: GloVe, 100 dimensions |

# Extra: Why a Conditional Random Field at the top?

Deciding the label for a word independently of the labels of the other words makes sense if correlations between consecutive labels are weak, but independent classification decisions are a limitation on model complexity that is not always appropriate. Strong dependencies between output labels can carry important information for the prediction task. For sequence labelling, or structured prediction tasks in general, it is beneficial to consider the correlations between labels and jointly decode the best chain of labels for a given input sentence. NER is one such task, since interpretable sequences of tags obey constraints: for instance, I-PER cannot follow B-LOC. Another example is POS tagging: an adjective is more likely to be followed by a noun than by a verb.

The idea of using a CRF at the top is to model tagging decisions jointly, that is, the probability of a given label for a word depends on the features associated with that word (i.e. its final word embedding) and on the tags assigned to the word(s) before it. This means that the CRF layer can add constraints to the final predicted labels, ensuring the tag sequences are valid. These constraints are learned automatically by the CRF layer from the annotated samples during training.

### Emission score matrix

The output of the LSTM is given as input to the CRF layer, that is, a matrix $\textrm{P}$ with the scores of the LSTM of size $n \times k$, where $n$ is the number of words in the sentence and $k$ is the number of possible labels each word can have, and $\textrm{P}_{i,j}$ is the score of the $j^{th}$ tag for the $i^{th}$ word in the sentence.

### Transition matrix

$\textrm{T}$ is a matrix of transition scores such that $\textrm{T}_{i,j}$ represents the score of a transition from tag $i$ to tag $j$. Two extra tags, $y_{0}$ and $y_{n}$, the start and end tags of a sentence, are added to the set of possible tags; $\textrm{T}$ is therefore a square matrix of size $\textrm{k}+2$.

### Score of a prediction

For a given sequence of predictions $y$ for a sequence of words $x$:

$$y = (y_{1}, y_{2}, \dots, y_{n})$$

we can compute its score based on the emission and transition matrices:

$$\textrm{score}(x,y) = \sum_{i=0}^{n} \textrm{T}_{y_{i},\,y_{i+1}} + \sum_{i=1}^{n} \textrm{P}_{i,\,y_{i}}$$

so the score of a sequence of predictions is, for each word, the sum of the transition from the currently assigned tag $y_i$ to the next assigned tag $y_{i+1}$, plus the score given by the LSTM to the tag assigned to the current word $i$.
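In code, the score function looks roughly like this (toy numbers; the start and end transitions are omitted for brevity):

```python
import numpy as np

def sequence_score(P, T, tags):
    """score(x, y): emission scores for the chosen tags plus transition
    scores between consecutive tags (start/end transitions omitted)."""
    emission = sum(P[i, t] for i, t in enumerate(tags))
    transition = sum(T[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return emission + transition

# Toy example: 3 words, 2 possible tags.
P = np.array([[1.0, 0.2],
              [0.5, 1.5],
              [0.3, 0.9]])   # P[i, j]: LSTM score of tag j for word i
T = np.array([[0.5, 0.1],
              [0.2, 0.8]])   # T[i, j]: score of moving from tag i to tag j

print(sequence_score(P, T, [0, 1, 1]))  # 1.0 + 1.5 + 0.9 + 0.1 + 0.8 = 4.3
```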

### Training: parameter estimation

During training, we maximize the probability of the correct tag sequence $y$ among all the possible tag sequences.

This is modeled by applying a softmax over all the possible taggings $y'$:

$$p(y \mid x) = \frac{e^{\textrm{score}(x,y)}}{\sum_{y' \in Y(x)} e^{\textrm{score}(x,y')}}$$

where $Y(x)$ denotes the set of all possible label sequences for $x$; this denominator is also known as the partition function. Finding the best sequence is then equivalent to finding the sequence that maximizes $\textrm{score}(x,y)$.

The loss can be defined as the negative log likelihood of the correct tagging $y$:

$$-\log\, p(y \mid x) = -\log \frac{e^{\textrm{score}(x,y)}}{\sum_{y' \in Y(x)} e^{\textrm{score}(x,y')}}$$

In simplifying the function above, a first step is to get rid of the fraction using log equivalences, and then of the $\log e$ in the first term, since they cancel each other out:

$$\log\, p(y \mid x) = \textrm{score}(x,y) - \log \sum_{y' \in Y(x)} e^{\textrm{score}(x,y')}$$

Then the second term can be simplified by applying the log-space addition $\textrm{logadd}$ equivalence, i.e. $\oplus(a, b, c, d) = \log(e^a+e^b+e^c+e^d)$:

$$\log\, p(y \mid x) = \textrm{score}(x,y) - \underset{y' \in Y(x)}{\textrm{logadd}}\ \textrm{score}(x,y')$$

Then, replacing $\textrm{score}$ by its definition:

$$\log\, p(y \mid x) = \sum_{i=0}^{n} \textrm{T}_{y_{i},\,y_{i+1}} + \sum_{i=1}^{n} \textrm{P}_{i,\,y_{i}} - \underset{y' \in Y(x)}{\textrm{logadd}} \left( \sum_{i=0}^{n} \textrm{T}_{y'_{i},\,y'_{i+1}} + \sum_{i=1}^{n} \textrm{P}_{i,\,y'_{i}} \right)$$

The first term is the score for the true data. Computing the second term might be computationally expensive since it requires summing over the $k^{n}$ different sequences in $Y(x)$, i.e. the set of all possible label sequences for $x$. This computation can be solved using a variant of the Viterbi algorithm, the forward algorithm.
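A minimal sketch of the forward algorithm in log-space (without the start/end tags, and checked against brute-force enumeration of all sequences):

```python
import numpy as np
from itertools import product

def log_partition(P, T):
    """Forward algorithm: log of the sum of exp(score) over all k^n tag
    sequences, computed in O(n * k^2) instead of enumerating them."""
    alpha = P[0].copy()                          # log-scores of length-1 prefixes
    for i in range(1, len(P)):
        # alpha[j] = logsumexp over prev of (alpha[prev] + T[prev, j]) + P[i, j]
        scores = alpha[:, None] + T + P[i][None, :]
        alpha = np.logaddexp.reduce(scores, axis=0)
    return np.logaddexp.reduce(alpha)

def score(P, T, tags):
    return (sum(P[i, t] for i, t in enumerate(tags))
            + sum(T[tags[i], tags[i + 1]] for i in range(len(tags) - 1)))

P = np.array([[1.0, 0.2], [0.5, 1.5], [0.3, 0.9]])
T = np.array([[0.5, 0.1], [0.2, 0.8]])

# Sanity check against brute-force enumeration of all 2^3 sequences.
brute = np.log(sum(np.exp(score(P, T, seq))
                   for seq in product(range(2), repeat=3)))
print(np.isclose(log_partition(P, T), brute))  # True
```

The recursion reuses prefix sums: each step only needs the $k$ accumulated log-scores of the previous position, which is exactly what makes the $k^{n}$ sum tractable.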

The gradients are then computed using back-propagation since the CRF is inside the neural-network. Note that the transition scores in the matrix are randomly initialized, but they can also be initialized based on some criteria to speed up training. The parameters will be updated automatically during the training process.

### Inference: determining the most likely label sequence $y$ given $X$

Decoding is equivalent to searching for the single label sequence with the largest joint probability conditioned on the input sequence:

$$\hat{y} = \underset{y' \in Y(x)}{\textrm{argmax}}\ p(y' \mid x;\ \theta)$$

The parameters $\theta$ correspond to the transition and emission matrices; basically, the task is finding the best $\hat{y}$ given the transition matrix $\textrm{T}$ and the matrix $\textrm{P}$ with scores for each tag of each individual word:

$$\hat{y} = \underset{y' \in Y(x)}{\textrm{argmax}}\ \textrm{score}(x, y')$$

A linear-chain sequence CRF model only models interactions between two successive labels, i.e. bi-gram interactions; therefore one can find the sequence $y$ that maximizes the score function above by adopting the Viterbi algorithm (Rabiner, 1989).
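A minimal sketch of Viterbi decoding over toy emission and transition matrices (start/end transitions again omitted):

```python
import numpy as np

def viterbi(P, T):
    """Viterbi decoding: the single tag sequence with the highest score
    under emission matrix P (n x k) and transition matrix T (k x k)."""
    n, k = P.shape
    delta = P[0].copy()                  # best score of each length-1 prefix
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + T + P[i][None, :]   # (prev_tag, next_tag)
        backptr[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    # Follow the back-pointers from the best final tag.
    best = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]

P = np.array([[1.0, 0.2], [0.5, 1.5], [0.3, 0.9]])
T = np.array([[0.5, 0.1], [0.2, 0.8]])
print(viterbi(P, T))  # [0, 1, 1]
```

Note the symmetry with the forward algorithm: Viterbi replaces the log-space sum over previous tags with a max (plus a back-pointer), which is why both run in $O(n \cdot k^2)$.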

### David Batista, PhD

Research Engineer in NLP

David loves to play with data and extract knowledge from large volumes of data, especially to transform natural language into structured data. He holds a Ph.D. degree in Natural Language Processing and Machine Learning from the University of Lisbon. His thesis explored and proposed new methods for semantic relationship extraction.
