Scope and Challenges in Conversational AI using Transformer Models

Conversational AI is an interesting problem in the field of Natural Language Processing, combining language understanding with machine learning. There have been many advancements in this field, with each new model architecture capable of processing more data, better optimisation and execution, handling more parameters, and achieving higher accuracy and efficiency. This paper discusses various trends and advancements in natural language processing and conversational AI: RNNs and RNN-based architectures such as LSTMs, Sequence to Sequence models, and finally the Transformer networks, the latest in NLP and conversational AI. The authors compare the discussed models in terms of efficiency and accuracy, and also discuss the scope and challenges of Transformer models.

Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output compared to other inputs [3]. All inputs are then multiplied by their respective weights and summed. Afterward, the sum is passed through an activation function, which determines the output. If that output exceeds a given threshold, it "fires" (or activates) the node, passing data to the next layer in the network. This results in the output of one node becoming the input of the next node.
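As a minimal illustration of this weighted-sum-and-activation step, the following sketch computes the output of a single node in plain Python; the weights, bias and threshold are arbitrary values chosen for the example, not taken from any particular network.

import math

def node_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid squashes to (0, 1)

# Example: three inputs with hand-picked (illustrative) weights.
inputs = [0.5, 0.2, 0.9]
weights = [0.8, -0.3, 0.4]   # larger |weight| => that input matters more
out = node_output(inputs, weights, bias=0.1)

# "Fire" only if the activation exceeds a chosen threshold.
threshold = 0.5
fires = out > threshold
print(out, fires)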
The manner in which the nodes in a neural network are structured is intimately linked with the learning algorithm used to train the network. In general, there are a few different classes of network architectures:

• Single-layer Feedforward Networks
• Multi-layer Feedforward Networks
• Recurrent Neural Networks
Conversational AI models explicitly make use of recurrent neural networks (RNNs) or a modification of that network architecture. In traditional neural networks it is generally assumed that all inputs (and outputs) are independent of each other, but for many tasks this assumption does not hold. In order to predict the next word in a sentence, it is essential to know which words came before it [4].

II. Recurrent Neural Networks (RNN)

A recurrent neural network (RNN) is a type of artificial neural network which uses sequential data or time series data. RNNs are incorporated into popular applications such as Siri, voice search, and Google Translate [5]. They are distinguished by their "memory", as they take information from prior inputs to influence the current input and output. RNNs are called recurrent because they perform the same task for every element in a sequence, with the output depending on the previous computations and inputs.
Another distinguishing characteristic of recurrent networks is that they share parameters across each layer of the network. While feedforward networks have different weights across each node, recurrent neural networks share the same weight parameter within each layer of the network.
Recurrent Neural Network (RNN) [6]

• Input: x(t) is taken as the input to the network at time step t. For example, x(1) could be a vector corresponding to a word of a sentence.
• Hidden state: h(t) represents a hidden state at time t and acts as the "memory" of the network. h(t) is calculated based on the current input and the previous time step's hidden state: h(t) = f(U x(t) + W h(t-1)), where the function f is a non-linear transformation such as tanh or ReLU (a small sketch of one such step follows this list).
• Weights: The RNN has input-to-hidden connections parameterized by a weight matrix U, hidden-to-hidden recurrent connections parameterized by a weight matrix W, and hidden-to-output connections parameterized by a weight matrix V; all these weights (U, V, W) are shared across time.
• Output: o(t) illustrates the output of the network.
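A minimal sketch of this recurrence in NumPy, using the definitions above (U, W and V shared across every time step, tanh as the non-linearity f); the dimensions and random weights are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 8   # illustrative sizes

# Weights shared across every time step.
U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden (recurrent)
V = rng.normal(size=(output_dim, hidden_dim))   # hidden-to-output

def rnn_step(x_t, h_prev):
    """h(t) = tanh(U x(t) + W h(t-1)); o(t) = V h(t)."""
    h_t = np.tanh(U @ x_t + W @ h_prev)
    o_t = V @ h_t
    return h_t, o_t

# Run a toy sequence of 5 time steps.
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, o = rnn_step(x, h)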
In the figure there are further arrows after o(t), which is often also subjected to a nonlinearity, especially when the network contains further layers downstream [6].

Recurrent neural networks are trained using backpropagation through time (BPTT).

III. Long Short Term Memory (LSTM)

An LSTM has gates to protect and control the cell state [9].

Understanding LSTM Networks - colah's blog [9]

The first step is to decide what information is not relevant and will be "thrown away" from the cell state; this decision is made by a sigmoid layer, the forget gate. Next, an input gate decides which new information will be stored in the cell state. Finally, the output of the cell is generated and forwarded to the next layer. Unsurprisingly, a sigmoid layer is used to output values between 0 and 1, determining which components of the cell state will be part of the next hidden state (based on relevance to the input at hand) [9].
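The gating described above can be sketched as a single LSTM step in NumPy; the gate equations follow the standard LSTM formulation [9], while the weight shapes and values are illustrative placeholders.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: forget, input and output gates control the cell state."""
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    z = np.concatenate([h_prev, x_t])        # previous hidden state + current input
    f = sigmoid(Wf @ z + bf)                 # forget gate: what to throw away
    i = sigmoid(Wi @ z + bi)                 # input gate: what new info to store
    c_tilde = np.tanh(Wc @ z + bc)           # candidate cell values
    c_t = f * c_prev + i * c_tilde           # updated (long-term) cell state
    o = sigmoid(Wo @ z + bo)                 # output gate: what to expose
    h_t = o * np.tanh(c_t)                   # new hidden state
    return h_t, c_t

# Toy usage with random (untrained) parameters.
hidden, inp = 4, 3
rng = np.random.default_rng(1)
params = [rng.normal(size=(hidden, hidden + inp)) for _ in range(4)] + \
         [np.zeros(hidden) for _ in range(4)]
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), params)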
This decoupling of cell state and hidden state is noteworthy, because it means that the network can remember features in the cell state for long periods of time without including them in the hidden state that affects the current prediction [8].

IV. Sequence to Sequence Models (Seq2Seq)

Despite their flexibility and power, deep neural networks can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. This is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a priori [12]. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.
Sequence to sequence models are a straightforward application of the Long Short Term Memory architecture: a neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence [13].
Chatbots with Seq2Seq (suriyadeepan.github.io) [14]

The idea is to use one LSTM (called the encoder) to read the input sequence, one time step at a time, to obtain a fixed-dimensional vector representation, and then use another LSTM (called the decoder) to extract the output sequence from that vector. Each hidden state influences the next hidden state, and the final hidden state is called the context or thought vector, since it represents the intention of the sequence.
From the context vector, the decoder generates another sequence, one symbol at a time.
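The encoder-decoder idea can be sketched as follows, with the simple tanh recurrence from the RNN section standing in for the LSTMs; the network is untrained and all sizes, weights and token ids are illustrative, so the generated symbols are meaningless.

import numpy as np

rng = np.random.default_rng(0)
vocab, emb, hid = 20, 8, 16                       # illustrative sizes
E = rng.normal(size=(vocab, emb))                 # shared embedding table
Ue, We = rng.normal(size=(hid, emb)), rng.normal(size=(hid, hid))   # encoder
Ud, Wd = rng.normal(size=(hid, emb)), rng.normal(size=(hid, hid))   # decoder
Vo = rng.normal(size=(vocab, hid))                # decoder output projection

def encode(token_ids):
    """Read the input one step at a time; the final hidden state is the context."""
    h = np.zeros(hid)
    for t in token_ids:
        h = np.tanh(Ue @ E[t] + We @ h)
    return h

def decode(context, start_id=0, max_len=5):
    """Generate one symbol at a time, feeding each prediction back in."""
    h, tok, out = context, start_id, []
    for _ in range(max_len):
        h = np.tanh(Ud @ E[tok] + Wd @ h)
        tok = int(np.argmax(Vo @ h))              # greedy choice of next symbol
        out.append(tok)
    return out

print(decode(encode([3, 7, 5])))                  # untrained, so output is arbitrary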
In order to handle variable-length inputs, the concept of padding was introduced: prior to training, the dataset is modified so that variable-length sequences are padded to a fixed length. Padding every sequence to a single maximum length is wasteful, so bucketing aims to solve this problem by putting the sequences into buckets of different sizes [14].
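A small sketch of padding and bucketing, assuming a pad token id of 0 and hypothetical bucket boundaries:

def pad(seq, length, pad_id=0):
    """Right-pad a token sequence to a fixed length."""
    return seq + [pad_id] * (length - len(seq))

def bucket(sequences, boundaries=(5, 10, 20)):
    """Group sequences into buckets so each bucket only pads to its own size."""
    buckets = {b: [] for b in boundaries}
    for s in sequences:
        for b in boundaries:
            if len(s) <= b:
                buckets[b].append(pad(s, b))
                break   # sequences longer than the largest bucket are skipped here
    return buckets

print(bucket([[1, 2], [3, 4, 5, 6, 7, 8], [9] * 12]))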

V. Transformer Models
Most sequence generation models have an encoder-decoder structure (as in the sequence to sequence models described above).
In the case of the Transformer model, the encoder maps an input sequence of symbol representations (x1, . . . , xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, . . . , ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next [15].
The encoder and decoder used in the Transformer do not use LSTMs, GRUs or RNNs; hence there are no recurrent connections, and no "memory" of previous states is maintained.
Transformers compensate for this lack of memory by processing entire sequences simultaneously.
Attention Is All You Need - Transformer Model [15]

V.1 Input Embeddings

Below is a visualisation of the embedding process in Natural Language Processing models.
Consider an input from the user: "Do you like Game of Thrones".

Word Embeddings and their representation
Hence, the main purpose of the embedding layer is to select the proper embedding of the input words and pass them on to the positional encoding module.
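A minimal sketch of this embedding lookup for the example input above, using a tiny hypothetical vocabulary and randomly initialised vectors (a trained model would instead use learned embeddings of size dmodel = 512):

import numpy as np

rng = np.random.default_rng(0)
d_model = 512
vocab = {"do": 0, "you": 1, "like": 2, "game": 3, "of": 4, "thrones": 5}
embedding_table = rng.normal(size=(len(vocab), d_model))   # one row per word

tokens = "Do you like Game of Thrones".lower().split()
ids = [vocab[t] for t in tokens]
embedded = embedding_table[ids]        # shape: (sequence length, d_model)
print(embedded.shape)                  # (6, 512), ready for positional encoding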

V.2 Positional Encoding
The position and order of words in a sentence are an essential part of any language; they define the grammar and thus the actual semantics of a sentence.
Recurrent Neural Networks (RNNs) inherently take the order of words into account, i.e. they parse a sentence word by word in a sequential manner, but the Transformer architecture discards the recurrence mechanism in favor of a multi-head self-attention mechanism (discussed later). Avoiding the RNNs' recurrence results in a massive speed-up in training time and, theoretically, allows the model to capture longer dependencies in a sentence [20].
As each word in a sentence flows through the encoder and decoder stacks simultaneously, the model itself has no inherent sense of word position or order, so a positional encoding is added to each input embedding. The sinusoidal positional encoding is defined as

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

where:
• pos is the position index of the word in the input sequence.
• dmodel is the embedding size (which is 512).
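A small NumPy sketch of this sinusoidal encoding; dmodel = 512 follows the value stated above, while the sequence length is arbitrary:

import numpy as np

def positional_encoding(seq_len, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(6)          # one row per word of the example sentence
# embedded + pe would be the input to the first encoder layer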
Scaled Dot Product Attention [18]

After feeding the query, key and value matrices through a linear layer, the queries and keys undergo a dot-product matrix multiplication to produce a score matrix. The score matrix determines how much focus a particular word should put on other words; each word therefore has a score corresponding to every other word at that time step. The higher the score, the more focus. This is how the queries are mapped to the keys.
Example of a Score Matrix [20]

These scores are then scaled down by dividing each score by the square root of the dimension of the queries and keys, √dk. The scaling is done in order to achieve more stable and normalized gradients, since the multiplication can have exploding effects.
Softmax operation on the scaled scores to get attention weights [20]

A softmax of the scaled scores is performed to obtain the attention weights, which are probability values between 0 and 1. The softmax elevates the higher scores and depresses the lower ones, allowing the model to be more confident about which words to attend to.

Multi-Head Attention Layer [18]

A critical and apparent disadvantage of the fixed-length context vector design used in sequence to sequence models is the incapability of the system to remember longer sequences; it often forgets the earlier parts of the sequence once it has processed the entire sequence. The attention mechanism was born to resolve this problem [18]. The multi-headed attention output vector is added to the original positional input embedding; this is called a residual connection. The output of the residual connection then goes through a layer normalization.
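The scaled dot-product attention described above can be sketched in a few lines of NumPy (a single head, without the learned linear projections; shapes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # score matrix: focus of each word on others
    if mask is not None:
        scores = scores + mask             # masked entries vanish after the softmax
    weights = softmax(scores)              # attention weights in (0, 1)
    return weights @ V                     # weighted sum of the values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(6, 64))       # 6 words, d_k = 64 (illustrative)
out = scaled_dot_product_attention(Q, K, V)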

V.3.2 Feed Forward Network and Normalization
The normalized residual output gets projected through a pointwise feed-forward network, i.e. two linear layers with a ReLU activation in between. The feed-forward output is again added to its input through a residual connection and passed through layer normalization.
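A sketch of this feed-forward sub-layer together with its residual connection and layer normalization; the inner dimension of 2048 follows the original Transformer and is an assumption here, as are the random weights:

import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward_sublayer(x):
    """Position-wise FFN applied to each word independently, then add & normalize."""
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2   # linear -> ReLU -> linear
    return layer_norm(x + ffn)                    # residual connection + layer norm

out = feed_forward_sublayer(rng.normal(size=(6, d_model)))   # 6 words in, 6 out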

V.4 The Decoder Stack
The main purpose of the decoder is to generate text sequences. The decoder stack also comprises multiple decoder layers. Each decoder layer has sub-layers similar to those of the encoder layers: two multi-head attention layers (one of which is masked, discussed later), a pointwise feed-forward layer, and layer normalization (add and normalize) layers after each sub-layer.
These sub-layers behave similarly to the layers in the encoder, but each multi-headed attention layer has a different job. The decoder is autoregressive: it begins with a start token and takes in the list of previous outputs as input, as well as the encoder outputs, which contain the attention information from the input [20].
The beginning of the decoder is largely the same as the encoder: the input goes through an embedding layer and a positional encoding layer to obtain positional embeddings. These positional embeddings are fed into the first multi-head attention layer, which computes the attention scores for the decoder's input only.

V.4.1 Masked Multi-head Attention Layer
This multi-headed attention layer operates slightly differently. Since the decoder is autoregressive and generates the sequence word by word, the model needs to prevent it from taking future words into account [20]. Each word can only attend to itself and the words that come before it, so the model must avoid computing attention scores for future word embeddings. This technique is called masking, and hence this attention layer is called the Masked Multi-Head Attention Layer.
To prevent the decoder from taking future words into account, a look-ahead mask is applied to the scaled score matrix. The look-ahead mask is a matrix of the same size as the score matrix, containing only zeros and negative infinities. The mask is simply added to the scaled score matrix to obtain a masked score matrix.
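A small sketch of such a look-ahead mask; a large negative constant stands in for negative infinity, and the sequence length of 5 is arbitrary:

import numpy as np

def look_ahead_mask(size):
    """Zeros on and below the diagonal, large negatives above: position i may only attend to positions <= i."""
    mask = np.triu(np.ones((size, size)), k=1)     # 1s strictly above the diagonal
    return mask * -1e9                              # stands in for negative infinity

print(look_ahead_mask(5))
# Added to the scaled score matrix before the softmax, these entries
# become zero attention weights, so future words are ignored.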
Using Look-Ahead Mask to generate Masked Score Matrix [20]

The reason for the mask is that once the softmax of the masked scores is taken, the negative infinities become zeros, leaving zero attention weights for future tokens.

The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output, allows for significantly more parallelization [15] and can reach a new state of the art in translation quality.

VI. Switch Transformers
Large scale training has been an effective path towards flexible and powerful neural language models.

VII. Result and Discussion
Transformer-based self-supervised pre-trained models have transformed the concept of transfer learning in Natural Language Processing (NLP) using deep learning approaches. The self-attention mechanism has made transformers popular for transfer learning across a broad range of NLP tasks [21].
An experiment carried out and published by Xiaoyu Yin, Dagmar Gromann and Sebastian Rudolph [23] used different datasets to train RNN-based models, CNN-based models and Transformer models and compared their performance, accuracy and BLEU scores. BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Although the models were not trained on large sequences of data, it is still an interesting experiment.

BLEU scores on the English-to-German and English-to-French newstest2014 tests [15]

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)) outperforms the best previously reported models. Newer architectures such as Transformer-XL and many other state-of-the-art models further improve upon the underlying transformer architecture in many aspects, providing higher accuracy and better performance.

VIII. Conclusion and Future Scope
The recent developments in language modeling offer a lot of improvements in the field of Natural Language Processing.