Most smartphone keyboards offer next word prediction features; Google also uses next word prediction based on our browsing history. Preloaded data is stored in the keyboard function of our smartphones to predict the next word correctly. In this article, I will train a Deep Learning model for next word prediction using Python. I will use the TensorFlow and Keras libraries in Python for the next word prediction model.
To build a Next Word Prediction model, I will train a Recurrent Neural Network (RNN). So let's start with this task now without wasting any time.
Next Word Prediction Model
To start with our next word prediction model, let's import all the libraries we need for this task:
import numpy as np
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.core import Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq
As I mentioned earlier, Google uses our browsing history to make next word predictions, and smartphone keyboards that predict the next word are trained on some data. So I will also use a dataset. You can download the dataset from here.
Now let’s load the data and have a quick look at what we are going to work with:
path = '1661-0.txt'
text = open(path).read().lower()
print('corpus length:', len(text))
corpus length: 581887
Now I will split the text into individual words, keeping their order but dropping special characters.
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
['project', 'gutenberg', 's', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'by', ............................... , 'our', 'email', 'newsletter', 'to', 'hear', 'about', 'new', 'ebooks']
The next step is to perform feature engineering on our data. For this purpose, we will need a dictionary with each unique word in the data as the key and its index in the list of unique words as the value.
unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
Feature Engineering
Feature Engineering means taking whatever information we have about our problem and turning it into numbers that we can use to build our feature matrix. If you want a detailed tutorial of feature engineering, you can learn it from here.
Here I will define WORD_LENGTH, which represents the number of previous words used to determine the next word. I will use prev_words to keep the five previous words of each sequence and next_words to store each corresponding next word.
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])
['project', 'gutenberg', 's', 'the', 'adventures']
Now I will create two numpy arrays: X for storing the features and Y for storing the corresponding labels. I will iterate over the sequences and, wherever a word occurs, set the corresponding position to 1 (one-hot encoding).
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1
Now, before moving forward, have a look at a single encoded sequence of words:
print(X[0][0])
[False False False … False False False]
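To sanity-check the encoding, we can map the single True position in that row back to its word. This small check is my own addition, using only the arrays defined above:
first_word_index = np.argmax(X[0][0])  # index of the only True entry in the one-hot row
print(unique_words[first_word_index])  # should print 'project', the first word of the first sequence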
Building the Recurrent Neural Network
As I stated earlier, I will use a recurrent neural network for the next word prediction model. Specifically, I will use an LSTM, which is a very powerful type of RNN.
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))
Training the Next Word Prediction Model
I will train the next word prediction model for 2 epochs (you can increase the number of epochs for better accuracy):
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=2, shuffle=True).history
Now that we have successfully trained our model, before moving on to evaluating it, it is a good idea to save the model for future use.
model.save('keras_next_word_model.h5')
pickle.dump(history, open("history.p", "wb"))
model = load_model('keras_next_word_model.h5')
history = pickle.load(open("history.p", "rb"))
Evaluating the Next Word Prediction Model
Now let’s have a quick look at how our model is going to behave based on its accuracy and loss changes while training:
plt.plot(history['acc'])
plt.plot(history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
Testing Next Word Prediction Model
Now let's build a Python program to predict the next word using our trained model. For this, I will define some essential functions that will be used in the process.
def prepare_input(text):
    x = np.zeros((1, SEQUENCE_LENGTH, len(chars)))
    for t, char in enumerate(text):
        x[0, t, char_indices[char]] = 1.
    return x
Now, before moving forward, let's test the function. Make sure you apply lower() to the input:
prepare_input("This is an example of input for our LSTM".lower())
array([[[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]]])
Note that for this version the sequences should be 40 characters (not words) long so that they fit in a tensor of shape (1, 40, 57); it comes from the character-level variant of the model. Now, before moving forward, let's check that the word-level version of the function works correctly.
def prepare_input(text):
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()):
        print(word)
        x[0, t, unique_word_index[word]] = 1
    return x

prepare_input("It is not a lack".lower())
array([[[ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.]]])
Now I will create a function to return samples:
def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return heapq.nlargest(top_n, range(len(preds)), preds.take)
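For example, given a toy probability vector, sample returns the indices of the top_n highest probabilities. This quick illustrative call is my own addition, not from the original article:
print(sample(np.array([0.2, 0.4, 0.05, 0.3, 0.05]), top_n=3))  # -> [1, 3, 0]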
And now I will create a function for next word prediction:
def predict_completion(text):
    original_text = text
    generated = text
    completion = ''
    while True:
        x = prepare_input(text)
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, top_n=1)[0]
        next_char = indices_char[next_index]
        text = text[1:] + next_char
        completion += next_char
        if len(original_text + completion) + 2 > len(original_text) and next_char == ' ':
            return completion
This function predicts characters until a space is generated. It does this by repeatedly feeding the current text to the RNN model and appending the predicted character to the completion. Now I will modify the above function to predict multiple completions:
def predict_completions(text, n=3):
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [indices_char[idx] + predict_completion(text[1:] + indices_char[idx]) for idx in next_indices]
Now I will use sequences of 40 characters as a base for our predictions.
quotes = [
    "It is not a lack of love, but a lack of friendship that makes unhappy marriages.",
    "That which does not kill us makes us stronger.",
    "I'm not upset that you lied to me, I'm upset that from now on I can't believe you.",
    "And those who were seen dancing were thought to be insane by those who could not hear the music.",
    "It is hard enough to remember my opinions, without also remembering my reasons for them!"
]
Now finally, we can use the model to predict the next word:
for q in quotes:
    seq = q[:40].lower()
    print(seq)
    print(predict_completions(seq, 5))
    print()
it is not a lack of love, but a lack of
['the ', 'an ', 'such ', 'man ', 'present, ']

that which does not kill us makes us str
['ength ', 'uggle ', 'ong ', 'ange ', 'ive ']

i'm not upset that you lied to me, i'm u
['nder ', 'pon ', 'ses ', 't ', 'uder ']

and those who were seen dancing were tho
['se ', 're ', 'ugh ', ' servated ', 't ']

it is hard enough to remember my opinion
[' of ', 's ', ', ', 'nof ', 'ed ']
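Since predict_completion and predict_completions complete the text character by character, here is a small word-level alternative that I am adding for clarity. It is only a sketch that reuses the word-level prepare_input, the trained model, and unique_words defined above:
def predict_next_words(text, n=5):
    # Encode the last five words, ask the model for next-word probabilities,
    # and map the n highest-probability indices back to words.
    x = prepare_input(text.lower())
    preds = model.predict(x, verbose=0)[0]
    top_indices = heapq.nlargest(n, range(len(preds)), preds.take)
    return [unique_words[i] for i in top_indices]

print(predict_next_words("it is not a lack"))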
I hope you liked this article on the Next Word Prediction model. Feel free to ask your valuable questions in the comments section below.
Next-Word-Prediction
Our problem statement is to predict the next word of a sentence given its previous words and a corpus for training the model. We have trained neural models and n-gram language models to predict the next word of a sequence.
LSTM Model
FileName: NextWordPrediction_LSTM_optimized-GPU.ipynb
This code needs to be run on Google Colab (25 GB RAM) due to its large memory requirements, unless you have a machine with a lot of RAM.
In Google Colab, set the runtime to GPU, since the code uses GPU-based libraries.
The files that need to be accessed are in my personal Google Drive folder.
You need to mount your drive from the notebook (the code is in the notebook) and add the required files, listed below, to your personal drive.
1.emailTokens
https://drive.google.com/open?id=1AE1Rx2EwWWSu3kpr-nMX1ANnzq8zs2iC
2.vocab.txt
https://drive.google.com/open?id=13fKFol5ARV0yPniQyX8e06Fopq8H3_sy
Add these files to a folder named "Next Word Prediction" in your personal drive.
You can then access them from the code using the path 'drive/My Drive/Next Word Prediction/filename' (code in the notebook).
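For reference, mounting the drive and reading one of the files might look like the minimal sketch below; the exact cells are already in the notebook:
from google.colab import drive

drive.mount('/content/drive')  # authorise access to your Google Drive

# Read the vocabulary file from the "Next Word Prediction" folder
with open('drive/My Drive/Next Word Prediction/vocab.txt') as f:
    vocab = f.read().splitlines()
print(len(vocab))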
You can run the code sequentially cell by cell in the notebook.
Feature Engineering Code:
FileName: NextWordPrediction-FeatureEngineering.ipynb
Get the emails dataset from Kaggle (a 1 GB file).
https://www.kaggle.com/wcukierski/enron-email-dataset
Add it in the path ./emailData/emails.csv with respect to the feature engineering notebook.
Download Google News Vectors, unzip it and place it in the same directory as the current notebook.
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
And then run the notebook cell by cell.
N-gram Model with Laplace smoothing
FileName: Word_Prediction_Laplace.ipynb
Run the code from the block "For testing the model".
For this you need to download all the files from the following folder.
https://drive.google.com/open?id=1w6atfLocHpEAKguwMzirgU8rPSqinLdt
Place them in the same folder as the code file.
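The notebook contains the full model, but the core idea of Laplace (add-one) smoothing for next word prediction can be sketched in a few lines. This toy example is my own illustration, not code from the notebook:
from collections import Counter

# Toy corpus instead of the email data; count every bigram and unigram.
corpus = "the meeting is at noon the meeting is cancelled".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = sorted(set(corpus))

def laplace_bigram_prob(prev_word, word):
    # P(word | prev_word) = (count(prev_word, word) + 1) / (count(prev_word) + |V|)
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + len(vocab))

def predict_next(prev_word, top_n=3):
    return sorted(vocab, key=lambda w: laplace_bigram_prob(prev_word, w), reverse=True)[:top_n]

print(predict_next("meeting"))  # 'is' should rank first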
Project description
Generative Pretrained Transformer 2 (GPT-2) for Language Modeling using the PyTorch-Transformers library.
Installation
Requires python>=3.5, pytorch>=1.6.0, pytorch-transformers>=1.2.0
pip install next-word-prediction
How to use
>>> from next_word_prediction import GPT2
>>> gpt2 = GPT2()
>>> text = "The course starts next"
>>> gpt2.predict_next(text, 5)
The course starts next ['week', 'to', 'month', 'year', 'Monday']
Demo via Streamlit
streamlit run app/run.py
Python Example for Beginners
Two Machine Learning Fields
There are two sides to machine learning:
- Practical Machine Learning: This is about querying databases, cleaning data, writing scripts to transform data, gluing algorithms and libraries together, and writing custom code to squeeze reliable answers from data to satisfy difficult and ill-defined questions. It's the mess of reality.
- Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.
Your output is a TensorFlow tensor, and you can get its max argument (the most probable predicted class) with a TensorFlow function. This tensor normally contains the next word's probabilities.
At "Evaluate the Model" from this page, your output list is y in the following example:

First we'll figure out where we predicted the correct label. tf.argmax is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y, 1) is the label our model thinks is most likely for each input, while tf.argmax(y_, 1) is the true label. We can use tf.equal to check if our prediction matches the truth.

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
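Applied to next word prediction, the same idea maps the argmax index back into a word. Here is a tiny hypothetical illustration; the names preds and index_to_word are my own, not from the tutorial:
import tensorflow as tf

preds = tf.constant([[0.1, 0.7, 0.2]])        # one row of softmax probabilities over a 3-word vocabulary
index_to_word = {0: "cat", 1: "dog", 2: "bird"}

predicted_index = int(tf.argmax(preds, 1)[0])  # index of the highest probability in the row
print(index_to_word[predicted_index])          # -> dog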
A different approach is to work with pre-vectorized (embedded/encoded) words. You could vectorize your words (that is, embed them) with word2vec to accelerate learning; you might want to take a look at this. Each word can be represented as a point in a 300-dimensional space of meaning, and you can automatically find the N words closest to the predicted point at the output of the network. In that case, the argmax way of proceeding no longer works; instead you could compare on cosine similarity with the words you actually wanted to predict, although I am not sure whether this could cause numerical instabilities. In that case, y will not represent words as one-hot features, but as word embeddings with a dimensionality of, say, 100 to 2000 depending on the model. You could Google something like "man woman queen word addition word2vec" to understand the subject of embeddings better.
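To make the embedding-space lookup concrete, here is a rough sketch using gensim and the pre-trained Google News vectors, assuming you have downloaded them; the "predicted" vector is faked for illustration:
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Pretend the network predicted this 300-dimensional point in embedding space.
predicted_vector = np.asarray(wv['queen'])

# Retrieve the N vocabulary words closest to that point by cosine similarity.
print(wv.similar_by_vector(predicted_vector, topn=5))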
Note: when I talk about word2vec here, I mean using an external pre-trained word2vec model so that your network only receives pre-embedded inputs and produces embedding outputs. The words corresponding to those outputs can then be recovered with word2vec by finding the most similar predicted words.
Notice that the approach I suggest is not exact, since it only tells you whether we predicted EXACTLY the word we wanted to predict. For a softer evaluation, you could use ROUGE or BLEU metrics in case you are predicting sentences or something longer than a single word.