Overview of Generative and Retrieval-Based Dialogue Systems

We distinguish two categories of dialogue systems: generative and retrieval-based. In this post I summarize the differences between the two and point to recent research on building each of them.

Generative dialogue systems, as their name indicates, generate a response word by word given a conversation history (which we call the context). Most of these systems are based on the sequence-to-sequence architecture, in which the context is encoded into a summary vector that is then decoded into a response, i.e. the next utterance. Their weakness is that they tend to produce generic responses such as “Thank you”, “Ok” or “Nice” most of the time. Their strength is that they can generate responses that do not appear in the training corpus.

Unlike generative systems, retrieval-based dialogue systems do not generate responses. For each context, they select the best response from a set of predefined responses (called candidate responses). A matching score with the context is computed for each candidate response and used to rank the candidates; the response with the highest score is chosen. These systems return syntactically correct and specific responses, but they are limited to the responses available in the candidate set.
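To make the ranking step concrete, here is a minimal sketch in plain Python. The scoring function is a toy word-overlap (Jaccard) measure that I am assuming purely for illustration; a real system would plug in a learned matching model instead. The names overlap_score and rank_candidates are hypothetical.

```python
def overlap_score(context, response):
    """Toy matching score (assumption): Jaccard overlap between the word sets."""
    c = set(context.lower().split())
    r = set(response.lower().split())
    union = c | r
    return len(c & r) / len(union) if union else 0.0


def rank_candidates(context, candidates, score_fn=overlap_score):
    """Sort the candidate responses from best to worst matching score."""
    return sorted(candidates, key=lambda cand: score_fn(context, cand), reverse=True)


context = "how do i install python on ubuntu"
candidates = ["try sudo apt-get install python", "thank you", "i like cats"]
best = rank_candidates(context, candidates)[0]
print(best)
```

Whatever the scoring function is, the system can only ever return one of the strings in candidates, which is the limitation mentioned above.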

Two challenges in building dialogue systems are their evaluation and the large amount of data needed to train data-driven models. Evaluation of dialogue systems is still an open research problem with no standard metrics. Retrieval-based dialogue systems have mostly been evaluated with Information Retrieval metrics such as Recall@k, Mean Reciprocal Rank (MRR), Precision@k and Mean Average Precision (MAP). Generative dialogue systems have been evaluated with metrics adopted from machine translation and automatic summarization, including ROUGE, BLEU and METEOR, in addition to Precision and Recall.
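As an illustration of one of the retrieval metrics listed above, here is a small sketch of Mean Reciprocal Rank. It assumes that for each context we already know the 1-based rank at which the system placed the ground-truth response; the function name and the sample ranks are made up for the example.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the ground-truth response for each context."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three contexts where the correct response was ranked 1st, 2nd and 4th.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4) / 3
print(mrr)
```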

In the following points, I cite recent papers describing systems of each category:

  1. Generative systems:
    1. End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient: Workshop on Conversational AI, NIPS 2017
    2. End-to-End Optimization of Task-Oriented Dialogue Model with Deep Reinforcement Learning: Workshop on Conversational AI, NIPS 2017
    3. Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability: SIGDIAL 2017
    4. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning
  2. Retrieval systems:
    1. Enhance word representation for out-of-vocabulary on Ubuntu dialogue corpus
    2. Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus: Dialogue and Discourse 2017
    3. Incorporating loose-structured knowledge into conversation modeling via recall-gate LSTM: IJCNN 2017
    4. Response selection with topic clues for retrieval-based chatbots

Some datasets used in building these dialogue systems include:

  1. The Ubuntu Dialogue Corpus Version 2
  2. Douban Conversation Corpus: Available on request
  3. Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset
  4. Dialog State Tracking Challenge 2 (DSTC2)
  5. Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems

Implementation of dual encoder using Keras

I decided to implement the dual encoder using Keras and to give further details about my code here. One thing that motivated me to write this code is that the available implementations are in TensorFlow or Theano, and I found both hard to understand (not intuitive). So let’s start!

The dual encoder architecture was used in this paper, where the task was defined as follows. Given a history (called context) of a conversation between two users, we want to predict the response (called next utterance) by ranking a set of candidate responses. This is a ranking task, which is different from generating a response word by word. In the paper above, the researchers presented the Ubuntu Dialogue Corpus, which is a large dataset of two-user conversations extracted from IRC (the Ubuntu channel). Please refer to the paper for further information.

Dataset description

The Ubuntu Dialogue Corpus is a large dataset of about 1 million multi-turn dialogues (dialogues with at least 3 turns between two users) containing 7 million utterances and 100 million words. The corpus can be downloaded from here with different preprocessing options, including lemmatization and tokenization. The dataset is split into training, dev and test subsets with 1000000, 195600 and 189200 samples respectively. Each sample is composed of a context, a response (utterance) and a label.

The training set contains 50% positive and 50% negative samples. Positive samples pair a context with its true next utterance. Negative samples are obtained by randomly selecting a response from the corpus that differs from the true response of the context. Here are some training examples from the dataset extracted from this interesting blog post.


The tokens __eot__ and __eou__ denote the end of turn and end of utterance respectively. Each sample of the dev and test datasets contains a context with 10 candidate responses: 1 ground truth response and 9 distractors randomly taken from the corpus, as shown in the picture below.
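As a side note, the way negative training samples are built can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions (a tiny in-memory list of dialogues, a hypothetical make_training_pairs helper), not the script that produced the corpus.

```python
import random


def make_training_pairs(dialogues, rng):
    """dialogues: list of (context, true_response) tuples.

    Returns (context, response, label) triples: one positive (label 1) and
    one randomly sampled negative (label 0) per context.
    """
    responses = [resp for _, resp in dialogues]
    samples = []
    for context, true_response in dialogues:
        samples.append((context, true_response, 1))
        # Draw a random response from the corpus that differs from the true one.
        negative = rng.choice([r for r in responses if r != true_response])
        samples.append((context, negative, 0))
    return samples


dialogues = [("ctx a", "resp a"), ("ctx b", "resp b"), ("ctx c", "resp c")]
pairs = make_training_pairs(dialogues, random.Random(0))
for pair in pairs:
    print(pair)
```

This yields the 50/50 balance between positive and negative samples described above. The sketch assumes the corpus contains at least two distinct responses.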


Model architecture and implementation

The model presented by the authors of the paper is based on a dual encoder architecture. Since the context and the response have variable lengths, we use an encoder to obtain a fixed-size vector representing each of them. The context encoder takes as input the words of the context represented as embedding vectors. In the same way, the response encoder receives the word embeddings of the response, as illustrated below:

[Figure: dual encoder architecture, with the context encoded into vector c and the response into vector r]

Each time a word embedding is fed into the context or the response encoder, the encoder updates its hidden state and so builds up a vector representation of the entire text. At the end of the encoding process, the vectors c and r in the illustration above represent the context and the response respectively as fixed-size vectors. Let’s see the implementation of the first part (data loading and encoder architecture) using Keras.

First we load the word embeddings from GloVe. We used them to reproduce the state-of-the-art results, but you can use word2vec or your own embedding vectors. Note that I tried training word2vec with Gensim on the training set, but this did not improve the scores.

import pickle

import numpy as np

print("Start building model...")

# first, build index mapping words in the embeddings set
# to their embedding vector

print('Indexing word vectors.')

embeddings_index = {}
with open(args.embedding_file, 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # skip lines whose values cannot be parsed as floats
            continue
        embeddings_index[word] = coefs

MAX_SEQUENCE_LENGTH, MAX_NB_WORDS, word_index = pickle.load(open(args.input_dir + 'params.pkl', 'rb'))
print("MAX_NB_WORDS: {}".format(MAX_NB_WORDS))
print("Now loading embedding matrix...")
num_words = min(MAX_NB_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, args.emb_dim))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

We followed the state-of-the-art implementation and chose an embedding dimension of 300. We used GloVe pretrained word vectors to initialize embedding_matrix, which gathers the embedding vectors of all vocabulary words into a single variable. Now let’s move on to the most important part of the code: building the model.

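print("Now building dual encoder lstm model...")
# define the LSTM encoder used for the context and the response
# (the two encoder.add lines are reconstructed from the description that follows)
encoder = Sequential()
encoder.add(Embedding(input_dim=num_words, output_dim=args.emb_dim, weights=[embedding_matrix], trainable=True))
encoder.add(LSTM(units=args.hidden_size))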

You can see how easy it is to implement an encoder using Keras 😉 We define a sequential model and add a first layer, an Embedding layer initialized with the word embedding matrix loaded previously. We set trainable to True, which means that the word vectors are fine-tuned during training. Finally we add an LSTM layer, which encodes the whole input (context or response) into one vector of size args.hidden_size. This gives us one encoder, which we apply to both the context and the response (the two branches share the same weights).

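from keras.layers import Dense, Input, multiply
from keras.models import Model

context_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
response_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

# the same encoder is applied to both inputs
context_branch = encoder(context_input)
response_branch = encoder(response_input)

# element-wise product; Keras 2 spelling of the Keras 1 call
# merge([context_branch, response_branch], mode='mul')
concatenated = multiply([context_branch, response_branch])
out = Dense(1, activation="sigmoid")(concatenated)

dual_encoder = Model([context_input, response_input], out)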

Here we simply define the context and response inputs, context_input and response_input. Then we encode the context and the response separately into context_branch and response_branch. We merge these two vectors using element-wise multiplication, which yields a similarity vector between the context and the response. Finally we add a Dense layer of size 1 with a sigmoid activation to turn this vector into a matching probability. Thanks to Keras, we built the model in only a few lines. Now we simply load the data and feed it into our dual encoder model.
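To see what the merge and the final Dense layer compute, here is a small numeric sketch in plain Python, without Keras. The vectors and weights are made-up values, and match_probability is a hypothetical helper; in the real model the weights of the Dense layer are learned.

```python
import math


def match_probability(c, r, weights, bias=0.0):
    """sigmoid(w . (c * r) + b): element-wise product, then a 1-unit dense layer."""
    merged = [ci * ri for ci, ri in zip(c, r)]
    logit = sum(w * m for w, m in zip(weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


c = [0.5, -1.0, 2.0]        # encoded context (toy values)
r_good = [0.4, -0.8, 1.5]   # response vector pointing the same way as c
r_bad = [-0.4, 0.8, -1.5]   # response vector pointing the opposite way
w = [1.0, 1.0, 1.0]         # dense weights (assumed, normally learned)

p_good = match_probability(c, r_good, w)
p_bad = match_probability(c, r_bad, w)
print(p_good, p_bad)
```

With all weights equal to 1, the score reduces to the sigmoid of the dot product between the two vectors. Learned weights let the model decide how much each dimension of the product matters.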

print("Now loading UDC data...")

train_c, train_r, train_l = pickle.load(open(args.input_dir + 'train.pkl', 'rb'))
test_c, test_r, test_l = pickle.load(open(args.input_dir + 'test.pkl', 'rb'))
dev_c, dev_r, dev_l = pickle.load(open(args.input_dir + 'dev.pkl', 'rb'))

print('Found %s training samples.' % len(train_c))
print('Found %s dev samples.' % len(dev_c))
print('Found %s test samples.' % len(test_c))

print("Now training the dual_encoder...")

histories = my_callbacks.Histories()

bestAcc = 0.0
patience = 0

print("\tbatch_size={}, nb_epoch={}".format(args.batch_size, args.n_epochs))

for ep in range(1, args.n_epochs + 1):
    dual_encoder.fit([train_c, train_r], train_l,
                     batch_size=args.batch_size, epochs=1, callbacks=[histories],
                     validation_data=([dev_c, dev_r], dev_l), verbose=1)

    # accuracy recorded by the custom Histories callback for this epoch
    curAcc = histories.accs[0]
    if curAcc >= bestAcc:
        bestAcc = curAcc
        patience = 0
    else:
        patience = patience + 1

    # evaluate on the test set after each epoch
    y_pred = dual_encoder.predict([test_c, test_r])
    print("Perform on test set after Epoch: " + str(ep) + "...!")
    recall_k = compute_recall_ks(y_pred[:,0])

    # stop training once patience exceeds 10
    if patience > 10:
        print("Early stopping at epoch: "+ str(ep))
        break

    # saving the model
    if args.save_dual_encoder:
        print("Now saving the dual_encoder... at {}".format(args.dual_encoder_fname))
        dual_encoder.save(args.dual_encoder_fname)

I have already preprocessed the data and saved it in .pkl format. The train, test and dev files can be downloaded from here. If you want to do the preprocessing yourself, I also shared the preprocessing script on my GitHub repository, so you can download the raw data from the corpus repository and process it however you want using the script. Once the data is loaded into RAM, we simply feed it into the network at each epoch. We follow the baseline and use Recall@k as the evaluation metric. I implemented early stopping based on the evaluation metric: if Recall@1 does not increase for 10 successive epochs, we stop training and save the best model.
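The training loop calls compute_recall_ks, whose body is not shown in this post. Here is a minimal sketch of what such a function could look like. It assumes that the predictions are laid out in consecutive groups of 10 candidates per context, with the ground-truth response first in each group; the parameters ks and group_size are my own additions for the example.

```python
def compute_recall_ks(probas, ks=(1, 2, 5), group_size=10):
    """Recall@k over groups of candidates.

    probas: flat sequence of scores, group_size consecutive scores per context.
    Assumption: index 0 of each group is the ground-truth response.
    """
    n_groups = len(probas) // group_size
    recalls = {}
    for k in ks:
        hits = 0
        for g in range(n_groups):
            group = probas[g * group_size:(g + 1) * group_size]
            # Indices of the k highest-scoring candidates in this group.
            top_k = sorted(range(group_size), key=lambda i: group[i], reverse=True)[:k]
            if 0 in top_k:
                hits += 1
        recalls[k] = hits / n_groups
    return recalls


# Two contexts: the truth is ranked 1st in the first group, 2nd in the second.
scores = [0.9] + [0.1] * 9 + [0.3, 0.8] + [0.1] * 8
recalls = compute_recall_ks(scores)
print(recalls)
```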

That’s all you need to implement a dual encoder for the Next Utterance Ranking task using Keras. I hope you enjoyed reading this post and that it will be useful to you. Please feel free to fork, send pull requests, comment, etc.


Installing Brown Coherence toolkit on Ubuntu 64 bit

I was looking for a good tutorial on how to install the Brown Coherence toolkit on 64-bit Ubuntu. Unfortunately there was only one blog post, and its solution didn’t work for me. This tool makes it possible to reproduce the results of three papers: “Disentangling Chat with Local Coherence Models”, “Extending the Entity Grid with Entity-Specific Features” and “Coreference-inspired Coherence Modeling”. In these papers Prof. Micha Elsner worked on extensions of the entity grid and their application to different tasks. This is the official website of the software, where you can find a full description, installation steps and some examples. The installation steps there are not detailed, and if you follow them as they are you will most likely run into errors. Here are the steps that you can follow to successfully install the software on your Ubuntu machine.

  1. Download WordNet-3.0 from here and follow the instructions in the install file*.
  2. Download GSL from this link, unpack it and follow the instructions in the install file*.
  3. Install Tcl/Tk: download them from this link and this one too, unpack them and follow the installation instructions*.
  4. Install libboost using: sudo apt-get install libboost-iostreams-dev
  5. Once you have prepared the installation environment, download the Brown Coherence software from the official website. To do so, you can run hg clone https://bitbucket.org/melsner/browncoherence or download the zipped file from here.
  6. Navigate into the tool’s directory and edit the Makefile to specify the path to WordNet as below:
     ifeq (${ARCH}, 64)
     WNDIR = /usr/local/WordNet-3.0
     WNINCLUDE = -idirafter$(WNDIR)/include -DUSE_WORDNET
     WNLIBS = -L$(WNDIR)/lib -lwordnet # -lWN
     else
     WNDIR = /usr/local/WordNet-3.0
     WNINCLUDE = -idirafter$(WNDIR)/include -DUSE_WORDNET
     WNLIBS = -L$(WNDIR)/lib -lWN
     endif
  7. Add the flag -fpermissive to CFLAGS in the Makefile.
  8. Edit the scr/stubs.c file and add #define USE_INTERP_RESULT 1 at the beginning of the file (before the includes).
  9. Create two directories, bin64 and lib64, at the same level as the Makefile.
  10. At this point, you have prepared everything for the installation and it should work.
  11. Run make everything to install the toolkit.

To test if everything works as expected you should be able to run bin32/TestGrid data/smokedoc.txt without any errors. It will print the entity grid matrix of the example.

* The installation instructions are the same for all those packages:

  1. unpack the folder.
  2. run ./configure
  3. run make
  4. run sudo make install

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

The aim of this paper is to study the relevance of the evaluation metrics used for generative conversation models. Researchers have adopted automatic evaluation metrics from machine translation, namely METEOR and BLEU, as well as ROUGE, which is generally used in automatic text summarization. In this work, the authors show, through a comparison with human evaluation, that these metrics are poorly suited to response generation: their scores do not correlate with human judgments. These metrics assume that a valid generated response (or translation) shares words with the ground-truth response, which is not always the case.

This example shows a valid generated response that shares no words with the ground-truth response, and for which the BLEU score is therefore 0.
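The same effect is easy to reproduce with a toy word-overlap score. The sketch below computes clipped unigram precision, which is only the 1-gram building block of BLEU (no higher-order n-grams, no brevity penalty). The sentences are invented for the illustration and are not the example from the paper.

```python
from collections import Counter


def unigram_precision(proposed, reference):
    """Clipped unigram precision: the 1-gram building block of BLEU."""
    p = proposed.lower().split()
    g = reference.lower().split()
    if not p:
        return 0.0
    overlap = Counter(p) & Counter(g)  # per-word minimum of the two counts
    return sum(overlap.values()) / len(p)


reference = "nope it is not working"
valid_reply = "unfortunately that still fails"   # same meaning, no shared word
wrong_reply = "it is working"                    # opposite meaning, full overlap

score = unigram_precision(valid_reply, reference)
print(score, unigram_precision(wrong_reply, reference))
```

The reply with the right meaning gets the lowest possible score, while the reply that contradicts the reference gets the highest one, simply because it reuses the same words.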

One reason why machine translation metrics are not useful here is that generating dialogue responses is harder than translating a given text into another language: the set of correct answers is much larger and more diverse. We distinguish two major categories of evaluation metrics:

  1. Word overlap-based metrics: metrics in this category measure the amount of word overlap between the proposed and the ground-truth responses. Let p be the proposed response and g the ground-truth response. BLEU analyzes the co-occurrences of n-grams in p and g. METEOR addresses weaknesses observed in BLEU; it is based on the harmonic mean of precision and recall between p and g. ROUGE, as mentioned before, was initially used for automatic summarization. ROUGE-L is an F-measure based on the Longest Common Subsequence (LCS), i.e. the longest sequence of words that occur in both p and g in the same order.
  2. Embedding-based metrics: the idea behind the metrics in this category is to consider the meaning of the words rather than the exact words shared by the generated and ground-truth responses. Each word has a word embedding, computed using word2vec, which represents its meaning based on its co-occurrence with the other words of the corpus. Two such approaches are greedy matching and embedding average. Greedy matching pairs each word in p with its most similar word in g and averages these similarity scores. Embedding average, on the other hand, computes the average of the word embeddings in p and in g and compares the two resulting vectors.
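The two embedding-based metrics can be sketched as follows. The 2-dimensional embeddings are toy values that I made up (real ones would come from word2vec), and the greedy score is averaged in both directions (p to g and g to p), which is how I understand the usual formulation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_average_score(p, g, emb):
    """Cosine similarity between the mean word vectors of p and g."""
    def mean_vector(words):
        vecs = [emb[w] for w in words if w in emb]
        if not vecs:
            return None
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    mp, mg = mean_vector(p), mean_vector(g)
    return cosine(mp, mg) if mp and mg else 0.0


def greedy_matching_score(p, g, emb):
    """Match each word to its most similar word on the other side, then average."""
    def one_way(src, tgt):
        tgt_vecs = [emb[w] for w in tgt if w in emb]
        src_vecs = [emb[w] for w in src if w in emb]
        if not tgt_vecs or not src_vecs:
            return 0.0
        best = [max(cosine(s, t) for t in tgt_vecs) for s in src_vecs]
        return sum(best) / len(best)

    return (one_way(p, g) + one_way(g, p)) / 2.0


# Toy embeddings (assumed values): "yes" and "sure" are close, "no" is opposite.
emb = {"yes": [1.0, 0.1], "sure": [0.9, 0.2], "no": [-1.0, 0.1]}
g = ["yes"]
print(embedding_average_score(["sure"], g, emb), embedding_average_score(["no"], g, emb))
print(greedy_matching_score(["sure"], g, emb), greedy_matching_score(["no"], g, emb))
```

Unlike word-overlap metrics, "sure" gets a high score against "yes" even though the two responses share no word.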

Another point discussed in this paper is the dialogue response models themselves. There are two kinds of dialogue systems: generative and retrieval models.

Retrieval models

  1. TFIDF: compute the TFIDF vectors of the context and of each candidate utterance and return the utterance whose vector is most similar to that of the context.
  2. Dual Encoder: two RNNs are used to encode the context and the utterance into fixed-size vectors, and the dot product of these two vectors is computed. The higher this product, the better the utterance. The model is trained to minimize the cross-entropy error over all (context, response) pairs.
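The TFIDF retrieval baseline can be sketched in plain Python as follows. This is a toy version under my own assumptions: the IDF statistics are computed on the context and candidates themselves (a real system would use the training corpus), the IDF is smoothed by adding 1, and similarity is measured with cosine. The helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict word -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed IDF (assumption)
    return [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]


def cosine_sparse(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def tfidf_retrieve(context, candidates):
    """Return the index of the candidate most similar to the context."""
    vecs = tfidf_vectors([context] + candidates)
    scores = [cosine_sparse(vecs[0], v) for v in vecs[1:]]
    return max(range(len(candidates)), key=lambda i: scores[i])


context = "my wifi driver is broken"
candidates = ["thanks", "reinstall the wifi driver", "i like cats"]
best_index = tfidf_retrieve(context, candidates)
print(candidates[best_index])
```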

Generative models

  1. Long Short-Term Memory (LSTM) language model: an LSTM model is used to encode the context and another one is used to decode it into the next utterance. The generated utterance is obtained using beam search.
  2. Hierarchical Recurrent Encoder Decoder (HRED): This model handles the long term dependencies between the utterances in the context. Each utterance in the context is encoded by an utterance-level encoder and then the output is encoded by a context-level encoder.

Results of the study

  • Metrics that have been used in the literature don’t correlate with human judgment.
  • Best performing metrics:
    • word-overlaps – BLEU-2 score
    • word embeddings – vector average
  • BLEU correlates better with human judgment when used to evaluate tasks that generate less diverse text such as machine translation and mapping from dialogue acts to natural language sentences.
  • Research interest should be oriented to find new metrics of evaluation to resolve the raised problems.

Discourse segmentation of multi-party conversation

This paper presents a domain-independent topic segmentation algorithm for multi-party speech. The aim of topic segmentation is to automatically divide text, speech or video into lexically related segments. The authors worked on the ICSI meeting corpus, which is freely available for download from here; the paper describing the corpus can be found here. This meeting corpus contains 75 multi-party meeting recordings covering three different recurring meeting types, with an average duration of 60 min. Note that all the meeting recordings were transcribed and each conversation turn was annotated with four pieces of information: speaker, start time, end time and word content.

The aim of this work is to determine the topic boundaries of a given conversation, and thus to perform topic segmentation. This is mainly done with an algorithm called LCseg, which segments a discussion based on lexical cohesion. The steps of the algorithm are as follows:

  1. Preprocess the transcript: tokenize it, remove non-content words and apply stemming.
  2. Identify and weight strong term repetitions using lexical chains.
  3. Hypothesize topic boundaries based on the knowledge of multiple, simultaneous chains of term repetitions extracted in step 2.
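To give an intuition of segmentation by lexical cohesion, here is a deliberately simplified sketch. It is not LCseg: it has no stemming, no lexical chains and no chain weighting. It only compares the bags of content words on both sides of each potential boundary and hypothesizes a boundary where their cosine similarity drops below a threshold. The stopword list, window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter

# Tiny stopword list, assumed for the example.
STOPWORDS = {"the", "a", "is", "to", "and", "we", "of", "in", "for", "it"}


def content_words(utterance):
    return [w for w in utterance.lower().split() if w not in STOPWORDS]


def cohesion(left, right):
    """Cosine similarity between two bags of words."""
    a, b = Counter(left), Counter(right)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(utterances, window=2, threshold=0.1):
    """Return the indices i such that a topic boundary is placed before utterance i."""
    toks = [content_words(u) for u in utterances]
    boundaries = []
    for i in range(1, len(toks)):
        left = [w for t in toks[max(0, i - window):i] for w in t]
        right = [w for t in toks[i:i + window] for w in t]
        if cohesion(left, right) < threshold:
            boundaries.append(i)
    return boundaries


utterances = [
    "the budget needs review",
    "budget review is due friday",
    "lunch menu has pizza",
    "pizza and salad for lunch",
]
boundaries = segment(utterances)
print(boundaries)
```

On this toy meeting, the only boundary is placed between the budget discussion and the lunch discussion, where no content word is repeated across the gap.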

They evaluated this algorithm against state-of-the-art segmentation algorithms that are also based on lexical cohesion and showed that LCseg outperforms them. LCseg not only detects boundaries in a discussion but also computes a segmentation probability at each potential boundary. This information, derived from the content of the discussion through lexical cohesion, is used as a feature in the feature-based segmenter detailed below.

In addition to this content-based feature, other features concerning the form of the discussion were extracted, including cue phrases, silences, overlaps and speaker changes. The classifier was based on two programs: the first generates an unpruned decision tree, which is then analyzed by the second to produce a set of pruned production rules. Evaluation showed that the feature-based segmenter outperforms both LCseg and the state-of-the-art segmenters. Statistics show that lexical cohesion plays a stronger role in determining topic boundaries than the other features.

What I learned from the accelerated deep learning with Tensorflow workshop

Hello everyone!

Recently I attended a 2-day accelerated workshop on deep learning with TensorFlow in London, on 24-25 January 2017. It was a very nice experience for me and I learned many things that I will share with you in this article.

TensorFlow is an open source library published by Google in 2015 to implement deep learning models in Python. Computations in TensorFlow are expressed as stateful dataflow graphs. It is written in Python and C++, and at the time of writing the GitHub repository of TensorFlow has been forked more than 21,000 times. The lectures were given by Dr. Ole Winther, a Professor in Data Science and Complexity at Cognitive Systems at the Technical University of Denmark (DTU). Lab sessions were given by Toke Faurby, M.Sc. student at the same university. I encourage you to visit the website of the workshop and to register for the next edition.

Deep learning has been in fashion since 2015 thanks to the increase in computing capacity (GPUs), and it has been applied to multiple domains including games, self-driving cars, speech recognition, image captioning and many other applications. Before the introduction of deep learning, modular systems were used to learn specific representations using hand-crafted features.

A survey of available corpora for building data-driven dialogue systems

Here are some points that I found important in this paper:

  • The current progress in the domain of NLP is mainly due to three factors:
  1. The availability of public datasets.
  2. The remarkable progress in the field of deep learning.
  3. Computing power.
  • There are two different types of conversational models:
  1. Generative models which are smarter and harder to train:
    • Mainly based on Machine translation techniques.
    • They require more data for training.
    • They currently make grammatical mistakes.
    • They tend to choose generic answers like “I don’t know”.
    • The tendency is using retrieval based models to generate responses.
    • The generated response is usually limited in length.
  2. Retrieval based models are easier but are not that smart:
    • Choose an answer among a set of probable answers.
    • Based on simple rule-based expression match or more sophisticated techniques using machine learning.
    • They don’t make grammatical mistakes.
    • They may fail to find answers on unseen data.
    • Can’t refer to contextual information.
  • Goal driven dialogue systems: technical support services.
  • Non-goal driven systems: language learning tools and computer game characters.
  • Spoken language: informal, more pronouns, lower information contents compared to written language.
  • An idea to reduce misspellings in a corpus is to look up Wikipedia’s list of the most commonly misspelled words and replace them with their correct spellings.
  • High BLEU or METEOR scores mean that the generative system is able to generate human-like utterances, but the converse is not true.

Finding problem solving threads in online forums

In this paper the authors propose a two-step approach to classify threads as solved or not. The approach is based on .

The first step is post-level classification. It consists of determining the category of each post based on the content of the post and its context, using a Naive Bayes classifier with a bag-of-words model. There are 4 post categories: problem, solution, good feedback and bad feedback.

  1. Each post is represented by a binary feature vector using a dictionary of top 600 words.
  2. Next, use a multinomial Naive Bayes classifier.
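The post-level step can be sketched as a small Naive Bayes classifier over binary word-presence features. This is only an illustration under my own assumptions: the four training posts are invented, the vocabulary is the full toy vocabulary rather than a dictionary of the top 600 words, Laplace smoothing is used, and the context of the post is ignored.

```python
import math
from collections import Counter, defaultdict


def train_nb(posts, labels):
    """Count, per class, in how many posts each word is present (binary features)."""
    vocab = sorted({w for p in posts for w in p.lower().split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for post, label in zip(posts, labels):
        word_counts[label].update(set(post.lower().split()))
    return vocab, class_counts, word_counts


def predict_nb(model, post):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    vocab, class_counts, word_counts = model
    total = sum(class_counts.values())
    words = set(post.lower().split()) & set(vocab)
    best_label, best_lp = None, -math.inf
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        lp = math.log(n / total)
        lp += sum(math.log((word_counts[label][w] + 1) / denom) for w in words)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label


posts = [
    "my printer does not work",
    "try reinstalling the driver",
    "thanks that worked",
    "no it still fails",
]
labels = ["problem", "solution", "good feedback", "bad feedback"]
model = train_nb(posts, labels)
print(predict_nb(model, "thanks it worked"))
print(predict_nb(model, "printer does not work"))
```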

The second step is thread level classification. It consists of

Gensim warning

Hi guys, I was running this example on Ubuntu, which computes vector embeddings of sentences (documents) using the doc2vec model of Gensim. Sadly I got the warning below:

# -*- coding: utf-8 -*-

import sys

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps)

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)

model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors


UserWarning: Pattern library is not installed, lemmatization won’t be available.

Unfortunately I didn’t find a solution to this problem online, so I went into the Gensim source code, in the file /usr/local/lib/python2.7/dist-packages/gensim/utils.py, and found that the problem comes from this part of the code:

if not has_pattern():
    raise ImportError("Pattern library is not installed. Pattern library is needed in order to use lemmatize function")
from pattern.en import parse

The simple solution was to install the pattern library using:

sudo pip install pattern

I hope that this will help you guys in case you get the same warning 😉

The Ubuntu Dialogue Corpus: A large dataset for research in unstructured Multi-Turn Dialogue systems

In this paper, the authors present a large corpus of dialogues extracted from the Ubuntu IRC chat logs. The corpus contains 930,000 dialogues and has been sampled to provide a training set of 1,000,000 entries split evenly between positive and negative examples (50% each). The test set contains 18920 entries, where each line contains a context and 10 replies, the first of which is the ground truth. There have been two versions of this corpus. The second was published to fix problems raised in the first: the training, test and validation sets now cover non-overlapping time periods, and the end of utterance is distinguished from the end of turn in each conversation, etc. (more details are given here).

Key points

  • The aim of this paper is to present the Ubuntu Corpus and also to perform Next Utterance Classification over it.
  • To this end, the authors experiment with TFIDF, RNN and LSTM models on the dataset after a preprocessing step.
  • As the task consists of ranking the 10 answers according to their suitability to the given context, Recall@k is used to evaluate the accuracy of the solution. They varied k between 1, 2 and 5 in order to measure the capacity of the system to find the correct reply among the top candidate answers.
  • Results show that the LSTM finds the correct answer in the top 5 in 92.6% of cases, which outperforms the RNN and TFIDF. This is mainly explained by the fact that LSTMs are able to capture the long-term dependencies that we encounter with long contexts.


After analyzing the given corpus (downloadable from here), I think that their approach to constructing the dataset is a bit biased. In fact, the task would be much more difficult, and the results more reliable, if the random replies for the test set were drawn from the same chat room as the context. They don’t give details about the preprocessing they did to obtain these results, and I really have doubts about the results, since it is sometimes too hard even for a human to guess the correct answer among “yes”, “good” and “thanks” when all of them suit a given context. Getting 92% correct is a bit surprising; the only way to verify these results is to implement the same models and run them myself.