Foreword: This article was written as the Oursky Skylab.ai team recently completed an article generation project for a startup client, and we want to share the techniques we used in the project. The demo is available here.
In this article, I will try to:
- Explain what a language model is.
- Discuss how to use language modeling to generate articles.
- Explain what Generative Pre-Trained Transformer 2 (GPT-2) is and how it can be used for language modeling.
- Visualize GPT-2 model’s internal state and which input words affect the next word prediction the most.
A language model is a probability distribution over sequences of words. For example, given a language model of English, we can ask what the probability is of seeing “All roads lead to Rome” in English.
We could also expect that the probability of seeing grammatically wrong or nonsensical sentences, such as “jump hamburger I,” will be much lower than grammatically correct and more meaningful sentences, such as “I eat hamburger.”
Let’s pull in some mathematical notation to help describe a language model.
P(w1, w2, …, wn) means the probability of having the sentence “w1 w2 … wn”.
Notice that a language model is a probability distribution, not just a single probability. Having a probability distribution means we can tell the value of P(All, roads, lead, to, Rome), P(I, eat, hamburger), or that of any other sentence, as we know P(w1, w2, …, wn) for any choice of words w1…wn and any length n.
Some clarification on the notation: whenever you see P(hello, world), where the things inside P() are actual words, we know what w1…wn and n are, i.e., P() is describing a probability. However, when you see P(w1, w2, …, wn), where the things inside P() are unknowns, P() is describing a probability distribution. Most of the time, I use the terms “probability” and “probability distribution” interchangeably unless further specified.
Sometimes, it is handier if we express P(w1, w2, …, wn) as P(w, context).
What this means is that we lump w1 to wn-1, i.e., all words of the sentence except the last one, into a single bundle that we call the “context”. We then ask what the chance is of being in this context (having seen the previous n-1 words) and seeing the word w at the end. As you can see, the two expressions describe the same thing.
Using chain rule, we could write P(w, context) as P(w | context) P(context). The reason why we do that is because P(w | context) is the thing we usually want to know.
P(w | context) is a conditional probability distribution. It tells us the chance of seeing a word w given that we know what the context is, i.e., what the previous words are.
As P(w | context) is a probability distribution, we could ask for P(apple | context), P(orange | context), or that of any other word in the English dictionary. This means we could use P(w | context) to predict the next word if the sentence were to go on. We could also use this probability distribution to sample the next word, which is the basis of article generation.
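To make sampling from P(w | context) concrete, here is a toy sketch in Python. The distribution `next_word_probs` is entirely made up for illustration; a real language model would supply these probabilities.

```python
import random

# A toy conditional distribution P(w | context) with made-up numbers --
# a real language model would supply these probabilities.
next_word_probs = {
    ("I",): {"eat": 0.6, "read": 0.3, "sleep": 0.1},
    ("I", "eat"): {"hamburger": 0.5, "cake": 0.3, "bread": 0.2},
}

def sample_next_word(context):
    """Sample one word from P(w | context)."""
    dist = next_word_probs[tuple(context)]
    words = list(dist)
    return random.choices(words, weights=[dist[w] for w in words])[0]

print(sample_next_word(["I", "eat"]))  # e.g. "hamburger"
```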
So a language model is a good thing to have, but how can we obtain one? This question is answered in another article, but in brief: one approach is to count how often wn comes after w1 to wn-1 in a large text corpus, which builds an n-gram language model.
Another approach is to directly learn the language model using a neural network by feeding lots of text. In our case, we are using the GPT-2 model to learn the language model.
Article Generation Using Language Model
As mentioned in the last section, P(w | context) is the basis for article generation.
P(w | context) tells us the probability distribution of all English words given all seen words (as context). For example, for P(w | “I eat”), we would expect a higher probability when w is a noun rather than a verb. We would also expect that the probability of w being a noun of food, like “bread,” will have a higher chance than being “book”.
As we have P(w | context), we could use it to predict or generate the next word given all previous words. We could keep adding one word at a time until we have enough words for a sentence or have reached some “ending word,” like a full stop.
There are different approaches on how to pick the next word, and we will discuss some of them below.
One approach for picking the next word is picking the word with the highest probability. Take P(w | “I eat”) as an example, with w being “hamburger” having the highest probability among all words in the dictionary. We will pick “hamburger” as the next word, and now we have “I eat hamburger”. We call this the greedy approach for sentence generation.
This approach is very simple and quick. The main drawback is that for the same set of previous words, we will always generate the same sentence.
Additionally, when we always pick the highest probability, it is very easy to fall in the case of degenerate repetition, i.e., we keep getting the same chunk of text during sentence generation. For example:
I eat hamburger for breakfast. I eat hamburger for breakfast. I eat hamburger for breakfast ...
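The greedy approach can be sketched with a toy, hand-made P(w | context); the probabilities below are invented for illustration.

```python
# A toy P(w | context) with made-up probabilities, for illustration only.
next_word_probs = {
    ("I",): {"eat": 0.6, "read": 0.4},
    ("I", "eat"): {"hamburger": 0.5, "cake": 0.3, "bread": 0.2},
}

def greedy_generate(prefix, max_steps):
    """Repeatedly append the single most probable next word."""
    sentence = list(prefix)
    for _ in range(max_steps):
        dist = next_word_probs.get(tuple(sentence))
        if dist is None:  # no known continuation; stop
            break
        sentence.append(max(dist, key=dist.get))
    return " ".join(sentence)

print(greedy_generate(["I"], 5))  # -> "I eat hamburger"
```

Note that running this twice always prints the same sentence, which is exactly the drawback discussed above.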
In the greedy approach, we pick the word with the highest probability each time we pick the next word. However, we could also generate many sentences first, and then pick the sentence with the highest overall probability.
Let’s assume that there are 20,000 words in the dictionary, and we want to generate a sentence with 5 words starting with the word “I”. The number of all possible sentences that we could generate will be 20,000^4, or one hundred and sixty quadrillion. Clearly, we cannot check all those sentences’ probabilities within a reasonable time, even with a fast computer.
Instead of constructing all possible sentences, we could instead just track top-N partial sentences. At the end, we just need to check the probability on N sentences. By doing so, we hope to search the top-N likeliest sentence without trying all combinations. This kind of searching is called beam search, and N is the beam width.
The figure below illustrates a case of generating a sentence with three words, starting with “I” with N = 2. This means we only track top-2 partial sentences.
In this case, we first check P(w | “I”). Among all the possible words, the language model tells “eat” and “read” are the most probable words that would come next. Hence, in the next step, we will only consider P(w | “I eat”) and P(w | “I read”) and ignore other possibilities like sentences that start with “I drink”.
In the next step, we repeat the same procedure and find the two words that most probably follow “I eat” or “I read”. Among all the sentences that start with “I eat” and “I read”, P(“hamburger” | “I eat”) and P(“cake” | “I eat”) have the two highest probabilities. We will thus only expand the search with the sentence prefixes “I eat hamburger” and “I eat cake”. As you can see, the whole “I read” branch has died out.
We will keep repeating the expansion and pick best-N procedure until we have a sentence with desired length. Finally, we need to report the sentence with the highest probability.
You may have already noticed that when the beam width is reduced to 1, beam search becomes the greedy approach. When the beam width is equal to the size of the dictionary, beam search becomes exhaustive search. Beam search gives us a way to trade off between sentence quality and speed.
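The procedure above can be sketched over a toy, hand-made P(w | context); the probabilities are invented so that, as in the example, the “I read” branch dies out.

```python
import math

# Toy P(w | context) with made-up probabilities, for illustration only.
next_word_probs = {
    ("I",): {"eat": 0.5, "read": 0.3, "drink": 0.2},
    ("I", "eat"): {"hamburger": 0.5, "cake": 0.4, "bread": 0.1},
    ("I", "read"): {"book": 0.6, "news": 0.4},
}

def beam_search(prefix, steps, beam_width=2):
    """Track only the top-N partial sentences at each step."""
    beams = [(0.0, list(prefix))]  # (log-probability, partial sentence)
    for _ in range(steps):
        candidates = []
        for logp, sent in beams:
            for w, p in next_word_probs.get(tuple(sent), {}).items():
                candidates.append((logp + math.log(p), sent + [w]))
        if not candidates:
            break
        # keep only the beam_width most probable partial sentences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return " ".join(beams[0][1])

print(beam_search(["I"], 2))  # -> "I eat hamburger"
```

Log-probabilities are summed instead of multiplying raw probabilities, a common trick to avoid numerical underflow on long sentences.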
With a beam width larger than 1, beam search tends to generate more promising sentences. However, the greedy approach’s issues remain: the same sentence prefix will always lead to the same sentence, and it can easily result in degenerate repetition when the beam width is not large enough.
The issues with beam search and greedy approach are due to the fact that we are picking the most probable choice during the sentence generation. Instead of picking the most probable word from P(w | context), we could sample a word with P(w | context).
For example, for a sentence starting with “I”, we can sample the next word according to P(w | “I”). Since the sampling is random, even if P(“eat” | “I”) > P(“read” | “I”), we could still sample the word “read”. With sampling, we have a very high chance of getting a new sentence in each generation.
Sentences generated from pure sampling are free from degenerate repetition, but they tend to be gibberish.
Top-k Sampling and Sampling with Temperature
There are three common ways to improve pure sampling. The first is Top-k sampling: instead of sampling from the full P(w | context), we only sample from the top k words according to P(w | context).

The second is sampling with temperature, which means we reshape P(w | context) with a temperature factor t, where t is between 0 and 1.
This comes from how we use a neural network to estimate a language model. Instead of probability values (which are in the range of 0 to 1), the network outputs real numbers that could be in any range, called logits. We can convert logits to probability values using the softmax function.

Temperature t comes into play in the step of applying the softmax function to get the probabilities. To reshape the resultant P(w | context), we divide each logit value by t before applying the softmax function. As t is between 0 and 1, dividing by it amplifies the logit values. This makes more probable words even more probable, and less probable words even less probable.
Top-k sampling and sampling with temperature tend to be applied together.
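The two techniques can be sketched together as follows; the logits below are made-up numbers, not the output of a real model.

```python
import math
import random

def top_k_temperature_sample(logits, k, t):
    """Sample from the top-k words after scaling logits by temperature t."""
    # keep only the k words with the highest logits
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # dividing by t (0 < t <= 1) amplifies the logits,
    # sharpening the resulting distribution
    scaled = [(w, v / t) for w, v in top]
    # softmax over the remaining, scaled logits
    m = max(v for _, v in scaled)
    exps = {w: math.exp(v - m) for w, v in scaled}
    z = sum(exps.values())
    words = list(exps)
    return random.choices(words, weights=[exps[w] / z for w in words])[0]

# hypothetical logits for the next word after "I eat"
logits = {"hamburger": 3.2, "cake": 2.9, "book": -1.5, "run": -2.0}
print(top_k_temperature_sample(logits, k=2, t=0.7))  # "hamburger" or "cake"
```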
When using Top-k sampling, we need to decide which k to use, and the best k varies with the context. The idea of Top-k sampling is to ignore words that are very unlikely to be the next word according to P(w | context). We can achieve this goal in another way: instead of focusing on the top k words in sampling, we ignore the least likely words whose probabilities sum to less than a certain threshold, and we only sample from the remaining words.
This approach is called nucleus sampling. According to The Curious Case of Neural Text Degeneration, the original paper that proposed nucleus sampling, we should choose p = 0.95, which implies that the threshold value is 1-p = 0.05. Solely doing nucleus sampling with p = 0.95, we could generate text that is statistically most similar to human-written text.
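A minimal sketch of nucleus sampling, with a made-up P(w | context):

```python
import random

def nucleus_sample(probs, p=0.95):
    """Sample from the smallest set of top words whose probabilities sum to >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, q in ranked:
        kept.append((word, q))
        total += q
        if total >= p:
            break  # the remaining tail words are ignored
    words = [w for w, _ in kept]
    return random.choices(words, weights=[q / total for _, q in kept])[0]

# hypothetical P(w | "I eat"); the tail word "book" is excluded at p = 0.9
probs = {"hamburger": 0.55, "cake": 0.30, "bread": 0.10, "book": 0.05}
print(nucleus_sample(probs, p=0.9))
```

Unlike Top-k, the number of candidate words here adapts to the context: a peaked distribution keeps few words, a flat one keeps many.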
I recommend that you take a look at the nucleus sampling paper. It gives lots of comparisons between texts generated using different approaches (beam search, Top-k sampling, nucleus sampling, etc.) and human-written text, measured by different metrics.
Introduction to GPT-2 Model
As we mentioned in the beginning, we will use the neural network called GPT-2 model from OpenAI to estimate the language model.
GPT-2 is a Transformer-based model trained for language modelling. It can easily be fine-tuned for other natural language processing (NLP) tasks such as text generation, summarization, question answering, translation, and sentiment analysis, among others.
Discussing the GPT-2 model deserves a separate article. Here, I will only focus on a few main concepts. I highly recommend that you read the two awesome articles from Jay Alammar on the Transformer and GPT-2 for more in-depth information.
I will explain how GPT-2 model works by building it piece by piece.
Input and Output
First, let’s describe the input and output of the GPT-2 model.
Given words in their embedded form, GPT-2 can be considered a transformer that transforms the input word-embedding vectors (blue ellipses) into output word embeddings (purple ellipses). This transformation does not change the dimension of the word embedding (although it could). The output word embedding is also known as the hidden state.
During the transformation, the previous words’ input embedding will affect the result of the current word’s output embedding, but not the other way round. In our example, the output embedding of “cake” will depend on the input embedding of “I”, “eat” and “cake”. On the other hand, the output embedding of “I” will only depend on the input embedding of “I”.
Due to this, the output embedding of the last input word somehow captures the essence of the whole input sentence.
To get the language model, we use a matrix WLM whose number of columns equals the dimension of the output embedding and whose number of rows equals the dictionary size, together with a bias vector bLM whose dimension is the dictionary size.

We can then compute the logit of each word in the dictionary by multiplying WLM with the output embedding of the last word and adding bLM. To convert those logits to probabilities, we apply the softmax function; the result can be interpreted as P(w | context).
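To make this concrete, here is a toy sketch with a 3-word dictionary and 2-dimensional embeddings; the values of WLM, bLM, and the output embedding are all made up.

```python
import math

def softmax(logits):
    """Convert arbitrary real-valued logits into probabilities summing to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def language_model_probs(output_embedding, W_LM, b_LM):
    """logits = W_LM @ h + b_LM (one logit per dictionary word), then softmax."""
    logits = [
        sum(w * h for w, h in zip(row, output_embedding)) + b
        for row, b in zip(W_LM, b_LM)
    ]
    return softmax(logits)

# toy setup: 3 dictionary words (rows) x 2 embedding dimensions (columns)
W_LM = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_LM = [0.0, 0.0, 0.1]
probs = language_model_probs([0.2, 0.7], W_LM, b_LM)
print(probs)  # three probabilities summing to 1, i.e. P(w | context)
```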
Inside the GPT-2 Model
Until now, we’ve discussed how output word embeddings are computed from input word embeddings.
Input word embeddings are vectors, and the first step of the transformation is to create even more vectors from them. To be precise, three vectors, namely the key vector, the query vector, and the value vector, will be created from each input word embedding.
Producing these vectors is simple: we just need three matrices, Wkey, Wquery, and Wvalue. Multiplying an input word embedding by each of these three matrices gives that word’s key, query, and value vectors. Wkey, Wquery, and Wvalue are part of the parameters of the GPT-2 model.
To further demonstrate, let’s consider Iinput, the input word embedding of “I”. Here, we have:
Ikey = Wkey Iinput,  Iquery = Wquery Iinput,  Ivalue = Wvalue Iinput
Notice that we will use the same Wkey, Wquery and Wvalue to compute key, query, and value vectors for all other words.
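The computation can be sketched as follows; the weight matrices and the input embedding are made-up 2-dimensional toys, not real GPT-2 parameters.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# toy 2x2 projection matrices; in the real model these are learned parameters
W_key = [[0.9, 0.1], [0.2, 0.8]]
W_query = [[0.5, 0.5], [0.3, 0.7]]
W_value = [[1.0, 0.0], [0.0, 1.0]]

I_input = [0.4, 0.6]  # made-up input embedding of "I"

# the same three matrices are reused for every word in the sentence
I_key = matvec(W_key, I_input)
I_query = matvec(W_query, I_input)
I_value = matvec(W_value, I_input)
print(I_key, I_query, I_value)
```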
After we know how to compute key, query, and value vectors for each input word, it is time to discuss how to use these vectors to compute the output word embedding.
As mentioned in the previous section, the current word’s output embedding will depend on the current word’s input embedding and all the previous words’ input embedding.
Indeed, the output embedding of the current word is the weighted sum of the value vectors of the current word and all its previous words. This also explains why value vectors are called as such.
Let’s take eatoutput as the output embedding of “eat”. Its value is computed by:
eatoutput = IAeat Ivalue + eatAeat eatvalue
Here, IAeat and eatAeat are the attention values. They can be interpreted as how much attention “eat” should pay to “I” and to “eat” itself when computing its output embedding. To avoid blowing up or shrinking the output embedding, the attention values need to sum to 1.
This implies that for the first word, its output embedding will be equal to its value vector; for example, Ioutput is equal to Ivalue.
Each attention value xAy is computed by taking the dot product between the key vector of x and the query vector of y; scaling down the dot product by the square root of the dimension of the key vector; and finally, taking the softmax to ensure the related attention values sum up to 1, as shown below:
xAy = softmax(xkeyT yquery / sqrt(k)), where k is the dimension of key vector.
To recap, we now know that the output embedding is computed as the weighted sum of the value vectors of the current and previous words. The weights used in the sum are called attention values; the attention value between two words is computed by taking the dot product of one word’s key vector and the other word’s query vector. As the weights should sum to 1, we also take the softmax of the dot products.
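Putting the pieces together, here is a toy sketch of computing one output embedding; the key, query, and value vectors are made-up numbers.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def output_embedding(keys, queries, values, i):
    """Output embedding of word i: a weighted sum of value vectors of words 0..i."""
    d = len(keys[0])
    # attention of word i on each word j <= i: softmax(key_j . query_i / sqrt(d))
    scores = [dot(keys[j], queries[i]) / math.sqrt(d) for j in range(i + 1)]
    attention = softmax(scores)
    return [
        sum(a * v[dim] for a, v in zip(attention, values[: i + 1]))
        for dim in range(len(values[0]))
    ]

# made-up key/query/value vectors for the two words "I" and "eat"
keys = [[0.3, 0.1], [0.2, 0.9]]
queries = [[0.5, 0.4], [0.7, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0]]

print(output_embedding(keys, queries, values, 0))  # first word: its own value vector
print(output_embedding(keys, queries, values, 1))  # mix of both value vectors
```

Note that word i only attends to words 0..i, matching the causal (left-to-right) structure described earlier: the first word's output embedding equals its value vector.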
What we’ve discussed so far is just one layer of GPT-2, called the attention layer. This layer covers most of the details, as the rest of the GPT-2 model structure is largely a replication of the attention layer.
Let’s continue our GPT-2 model construction journey. GPT-2 is not just using one attention layer but multiple ones. This is the so-called multi-head attention.
While those attention layers run in parallel, they are not dependent on each other and don’t share weights, i.e., there is a different set of Wkey, Wquery, and Wvalue for each attention layer.
As we have multiple attention layers, we will have multiple output word embeddings for each word. To combine them into one, we first concatenate the output word embeddings from the different attention layers. We then multiply the concatenated vector by a matrix Wproject so that the combined output word embedding has the same dimension as the input word embedding.
Actually, the output word embeddings we have so far are not the final ones. They will further go through a feedforward layer to become the actual output word embeddings.

These attention layers running in parallel, together with the feedforward layer, are grouped into a block called the decoder block¹.
GPT-2 includes not just one decoder block, but a chain of them. We choose the input word embedding and output word embedding to have the same dimensionality so that we can chain decoder blocks.

These decoder blocks have exactly the same structure, but they don’t share weights.

The GPT-2 model comes in different sizes. The sizes differ in the embedding dimensionality; the dimensionality of the key, query, and value vectors; the number of attention layers in each decoder block; and the number of decoder blocks in the model.
Some Omitted Details
The following are some details worth noting; you can take this list as a pointer to learn more about them.
- GPT-2 uses byte pair encoding when tokenizing the input string. One token does not necessarily correspond to one word. GPT-2 works in terms of tokens instead of words.
- Positional embeddings are added to the input embeddings of the first decoder block so as to encode the word order information in the word embedding.
- All residual additions and normalization layers are omitted from our description.
Training of GPT-2 Model
After knowing how GPT-2 works and how it can be used to estimate the language model (by converting the last word’s output embedding to logits using WLM and bLM, then to probabilities), we can briefly discuss how to train the GPT-2 model.

Training the GPT-2 model is language model estimation itself. Given an input string, such as “I eat cake”, GPT-2 can estimate P(eat | “I”) and P(cake | “I eat”).
For this input string, in training, we will assume the following:
P(eat | “I”) = 1, P(w != eat | “I”) = 0
P(cake | “I eat”) = 1, P(w != cake | “I eat”) = 0
Now that we have the estimated and target probability distributions, we can compute the cross-entropy loss between them, and use this loss to update the weights.
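Since the target distribution puts all its mass on the actual next word, the cross-entropy loss reduces to the negative log probability the model assigns to that word. Here is a toy sketch with a made-up predicted distribution:

```python
import math

def cross_entropy_loss(predicted_probs, target_word):
    """The target distribution puts probability 1 on the target word,
    so cross entropy reduces to -log P(target | context)."""
    return -math.log(predicted_probs[target_word])

# hypothetical model output for P(w | "I eat") while training on "I eat cake"
predicted = {"cake": 0.25, "hamburger": 0.60, "bread": 0.15}
loss = cross_entropy_loss(predicted, "cake")
print(loss)  # training updates the weights to push this loss toward 0
```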
As you can see, to train the GPT-2 model, what we need to do is feed it with a large amount of text.
Trying Out GPT-2
To quickly try GPT-2 on article generation, we could use Huggingface🤗 Transformers. It is a Python library that lets developers quickly use pre-trained transformer-based NLP models. Of course, GPT-2 is supported. It also supports both PyTorch and TensorFlow as the underlying deep learning framework.
To fine-tune a pre-trained model, we could use examples/run_language_modeling.py. What we need is a text file containing the training text and another containing the text for evaluation.
Here’s an example of using run_language_modeling.py for fine-tuning a pre-trained model:
```shell
# --output_dir:               the trained model will be stored at ./output
# --model_type:               tell Huggingface Transformers we want to train GPT-2
# --model_name_or_path:       use the pre-trained GPT-2 small model
# --per_gpu_train_batch_size: for GPU training only; increase it if your GPU
#                             has more memory to hold more training data
python run_language_modeling.py \
  --output_dir=output \
  --model_type=gpt2 \
  --model_name_or_path=gpt2 \
  --do_train \
  --train_data_file=$TRAIN_FILE \
  --do_eval \
  --eval_data_file=$TEST_FILE \
  --per_gpu_train_batch_size=1
```
Huggingface🤗 Transformers has a lot of built-in functions, and text generation is one of them.
The following is a code snippet of doing text generation using a pre-trained GPT-2 model:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

sentence_prefix = "I eat"

input_ids = tokenizer.encode(
    sentence_prefix,
    add_special_tokens=False,
    return_tensors="pt",
)

output_ids = model.generate(
    input_ids=input_ids,
    do_sample=True,
    max_length=20,  # desired output sentence length
    pad_token_id=model.config.eos_token_id,
)

generated_text = tokenizer.decode(
    output_ids[0],  # generate() returns a batch; take the first sequence
    clean_up_tokenization_spaces=True,
)
print(generated_text)
```
Thanks to jessevig’s BertViz tool, we could try peeking at how GPT-2 works by visualizing the attention values.
The figure above visualizes the attention values of each decoder block (rows of the grid, from top to bottom, with the first row as the first block) and each attention head (columns, from left to right) of the GPT-2 small model, taking “I disapprove of what you say, but” as input.
On the left is the zoomed-in look at the 2nd block’s 6th attention head’s result.
The words on the left are the output, and those on the right are the input. The opacity of a line indicates how much attention the output word pays to the input word.

One interesting fact we can see here is that, most of the time, the first word receives the most attention. This general pattern remains even when we use other input sentences.
Word Importance Visualization
Purely looking at the attention values doesn’t seem to give us clues on how the input sentence affects how the GPT-2 model picks its next word. One reason could be that there are still the Wproject matrix and the feedforward layers transforming the attention layers’ output, so it is hard to see how the attention is utilized to predict the next word.
We are particularly interested in how the input sentence affects the probability distribution of the next word. More precisely, we want to know which word in the input sentence will affect the next word’s probability distribution the most.
Measure Word Importance Through Input Perturbation
In Towards a Deep and Unified Understanding of Deep Neural Models in NLP, the authors propose a way to answer this question. They also provide the code, which we could use to analyze the GPT-2 model as well.
The paper discusses measuring the importance of each input word. The idea is to assign a value σi to each input word, where σi is initialized to a random value between 0 and 1.

Later on, we generate a noise vector with the same size as the input word embedding. This noise vector is added to the input word embedding with the weight specified by σi. So σi tells how much noise is added to the corresponding input word.
With the original and the perturbed input word embeddings, we feed both of them to our GPT-2 model and get two sets of logits from the last output embedding.

We then measure the difference (using the L2 norm) between these two sets of logits. This difference tells us how severely the perturbation affects the resultant logits that we use to construct the language model. We then optimize σi to minimize this difference.

We keep repeating the process: generate a new noise vector, add it to the original input word embeddings weighted by the updated σi, compute the difference between the resultant logits, and use this difference to guide the update of σi.

During the iterations, we track the best σi, i.e., the one leading to the smallest difference in the resultant logits, and report it as the result after we reach the maximum number of iterations.
The reported σi tells us how much noise the corresponding input word could withstand in a way that will not lead to significant change in the resultant logits.
If a word is important to the resultant logits, we would expect that even a small perturbation on that word’s input embedding will lead to a significant change in the logits. Hence, the reported σi is inversely related to the importance of the word: the smaller the reported σi, the more important the corresponding input word is.
The following is the code snippet for visualizing the word importance. Interpreter.py can be found here.
```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from Interpreter import Interpreter

def Phi(x):
    global model
    result = model(inputs_embeds=x)
    return result  # return the logits of the last word

model_path = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_path, output_attentions=True)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

input_embedding_weight_std = (
    model.get_input_embeddings().weight.view(1, -1).std().item()
)

text = "I disapprove of what you say , but"
inputs = tokenizer.encode_plus(
    text,
    return_tensors="pt",
    add_special_tokens=True,
)
input_ids = inputs["input_ids"]

with torch.no_grad():
    x = model.get_input_embeddings()(input_ids).squeeze()

interpreter = Interpreter(
    x=x,
    Phi=Phi,
    scale=10 * input_embedding_weight_std,
    words=text.split(" "),
).to(model.device)

# This will take some time.
interpreter.optimize(iteration=1000, lr=0.01, show_progress=True)
interpreter.get_sigma()
interpreter.visualize()
```
Below are the reported σi and its visualization. The smaller the value, the darker the color.
array([0.8752377, 1.2462736, 1.3040292, 0.55643 , 1.3775877, 1.2515365, 1.2249271, 0.311358 ], dtype=float32)
From the figures above, we can now know that P( w | “I disapprove of what you say, but”) will be affected by the word “but” the most, followed by “what”, then “I”.
In this article, we discussed what a language model is and how to use it for article generation, covering different approaches to get human-like text.
We also had a brief introduction to the GPT-2 model and some of its internal workings, and saw how to use Huggingface🤗 Transformers to apply the GPT-2 model to text generation.

Finally, we visualized the attention values in the GPT-2 model, and used the input perturbation approach to see which words in the input sentence affect the next-word prediction the most.
- The actual structure of the decoder block consists of only one attention layer. What we describe in this article as an attention layer should be called an attention head: one attention layer includes multiple attention heads and the Wproject matrix for combining the attention heads’ outputs.