Name Prediction with RNN using PyTorch – Part II

This is the second post in the series on predicting a country based on surnames. If you haven't read the earlier post, please read it here.

So far, we have seen a high-level approach to building an RNN. In this post let's work through the code. We'll use the PyTorch framework to build our neural network.

Let’s get started.

1 Read the files and store content in a dictionary

import os, glob
from io import open

## Store the file locations in a list called files

files = glob.glob('./dataset/surnames/names/*.txt')

The glob function lets us select files using a glob pattern (a simplified form of regex). In our case we want to select all files ending with .txt in a particular folder. glob returns a list of file paths which we can loop over in the following steps.

After we execute the above code, the files list will hold the paths of all matching files.
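For illustration, assuming the dataset layout above, the list will look something like this (the exact order depends on your filesystem):

print(files[:3])
# ['./dataset/surnames/names/Arabic.txt',
#  './dataset/surnames/names/Chinese.txt',
#  './dataset/surnames/names/Czech.txt']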

Next we'll create a dictionary called country_names which will have the name of the country as key and a list of surnames as value.

country_names = {}

for file in files:
    country = os.path.basename(file).split(".")[0]
    names = open(file,'r',encoding='UTF-8').read().split('\n')
    
    country_names[country] = names

In the code above, os.path.basename selects only the actual file name and strips off the path. We then split this name (which still carries the .txt extension) on the dot and keep only the first part, to be used as the key in our dictionary.
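For example (an illustrative path, matching the dataset layout above):

os.path.basename('./dataset/surnames/names/Vietnamese.txt')  # 'Vietnamese.txt'
'Vietnamese.txt'.split(".")[0]                                # 'Vietnamese'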

We'll now have the dictionary populated with countries and their names.

1.1 Check for Unicode characters and convert to ASCII

Surnames come from different countries and can contain non-English letters, which are represented by Unicode characters. For this exercise we'll convert Unicode characters that are close to English letters into their ASCII equivalents.

import unicodedata, string
ascii_letters = string.ascii_letters + " .,;'"

def convert_to_ascii(word):
    return "".join(
        letter for letter in unicodedata.normalize("NFD", word)
        if letter in ascii_letters and unicodedata.category(letter) != "Mn"
    )

We define a method which uses the normalize function from the unicodedata package. For more details on the types of normalization available, refer to this link.

After normalization we keep only those characters that appear in our list of ASCII letters and are not combining marks (category 'Mn'), which is what accents become after NFD decomposition. In this exercise we're not processing non-English letters.
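As a quick check you can run yourself:

convert_to_ascii('Ślusàrski')  # 'Slusarski'

NFD decomposes 'Ś' into a plain 'S' plus a separate combining accent character; the accent has category 'Mn' and is not in our ASCII list, so it gets dropped.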

for country,names in country_names.items():
    country_names[country] = [convert_to_ascii(name) for name in names]

The code above loops through each list in the dictionary and calls the method to convert the Unicode names to ASCII.

2 Helper function to get samples from dictionary

import random

def get_sample_name():
    country = random.choice(list(country_names.keys()))
    names = country_names[country]
    name = names[random.randint(0, len(names) - 2)]
    return country, name

The above function randomly chooses a country from the list of countries, then chooses one name from that country and returns both.

The output of the above function will look like ('Vietnamese', 'Hoang').

Each file ends with a newline, so splitting on '\n' leaves a blank string as the last item of each list. Using random.choice on such a list would sometimes select that blank value. Hence we use random.randint and exclude the last index (randint's upper bound is inclusive, so len(names) - 2 is the last real name).
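If you'd rather avoid the blank entry altogether, an alternative (not what we do in this post) is to strip the trailing newline while reading each file:

names = open(file, 'r', encoding='UTF-8').read().strip().split('\n')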

3 Helper function to convert string to tensors

import torch

def get_letter_tensor(letter):

    # One-hot vector of shape (1 x len(ascii_letters))
    tnsr = torch.zeros((1, len(ascii_letters)))

    tnsr[:, ascii_letters.index(letter)] = 1

    return tnsr

def get_word_tensor(word):

    # One one-hot vector per letter: (word length x 1 x len(ascii_letters))
    word_tnsr = torch.zeros(len(word), 1, len(ascii_letters))

    for i, letter in enumerate(word):
        word_tnsr[i, :, :] = get_letter_tensor(letter)

    return word_tnsr

As the names suggest, get_letter_tensor converts a letter into a tensor, and get_word_tensor converts a whole word into a tensor using the previous method.

In the get_letter_tensor function we first define a zero tensor of fixed width, the length of our ASCII character list. Then we look up the given letter's index in ascii_letters and set the value at that index to 1, making it a one-hot encoded vector.
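A quick illustration of both helpers (the index of 'b' is 1 because ascii_letters starts with the lowercase alphabet):

get_letter_tensor('b')        # (1 x 57) tensor with a 1 at index 1
get_word_tensor('ab').shape   # torch.Size([2, 1, 57])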

The function below returns the index of a country within the country_names dictionary, wrapped in a tensor. This index is what the negative log likelihood loss function needs as the target when calculating the error.

all_categories=list(country_names.keys())
def get_cat_tensor(category):
    return torch.tensor([all_categories.index(category)],dtype=torch.long)
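Illustrative usage; the exact index depends on the order in which glob returned your files:

print(get_cat_tensor('Vietnamese'))  # e.g. tensor([17])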

4 A neural network – RNN

import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self,input_size, hidden_size,output_size):
        
        super(RNN,self).__init__()
        
        self.hidden_size=hidden_size
        self.output_calc = nn.Linear(input_size+hidden_size,output_size)
        self.hidden_calc = nn.Linear(input_size+hidden_size,hidden_size)
        
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self,input,hidden):
        combo = torch.cat((input,hidden),dim=1)
        
        hidden_layer = self.hidden_calc(combo)
        output_layer = self.output_calc(combo)
        
        output_layer = self.softmax(output_layer)
        
        return output_layer,hidden_layer
    
    def initHidden(self):
        return torch.zeros(1,self.hidden_size)

In the __init__ function we create three instances:

  1. output_calc : an instance of nn.Linear. It holds a weight matrix and bias that map the concatenated input-plus-hidden vector to an output vector of size output_size.
  2. hidden_calc : another nn.Linear instance, this one mapping the same concatenated vector to the next hidden state of size hidden_size.
  3. softmax : an instance of nn.LogSoftmax, which converts the output vector into log probabilities. (We use log probabilities rather than values squashed between 0 and 1 because the NLLLoss we'll use later expects them.)

The forward method is called once per time step, that is, once for each letter in the word. At each step it computes a new hidden state and an output. The hidden state from the previous time step is concatenated with the current input before the matrix multiplication. Since we're dealing with a classification problem, we only use the output from the last time step.

When we run the forward function for the first time step, we need a starting hidden state. The initHidden method gives us a zero-initialized one.

We'll now create an instance of RNN to be used in training.

hidden_size = 128
rnn = RNN(len(ascii_letters),hidden_size,len(country_names.keys()))

As we already know, input_size is the width of the fixed-size vector we created for each letter, which is the length of ascii_letters, so each letter tensor has size (1 x 57). output_size is the number of keys in the country_names dictionary, i.e. the number of countries. In this example we have 18 countries, so the output will be a vector of size (1 x 18).
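As a quick sanity check (illustrative, with 'Hoang' as an arbitrary example input), we can unroll the untrained network over one word and inspect the shapes:

hidden = rnn.initHidden()             # (1 x 128) zero hidden state
word_tnsr = get_word_tensor('Hoang')  # (5 x 1 x 57), one step per letter
for i in range(word_tnsr.size(0)):
    output, hidden = rnn(word_tnsr[i], hidden)

print(output.shape)  # torch.Size([1, 18]), log probabilities per country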

5 Function to train single example

Let's create an instance of the error function and use it while training on a single example. Negative log likelihood is the loss we'll use for this exercise. It is available as the class nn.NLLLoss; all we need to do is create an instance of it. It expects log probabilities as input, which is why our network ends with LogSoftmax.
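To see what NLLLoss computes, here is a tiny illustrative check on dummy values (not part of the training code): given log probabilities and a target class index, it returns the negative log probability assigned to the target.

log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1]]))
target = torch.tensor([0])
print(nn.NLLLoss()(log_probs, target))  # tensor(0.3567), i.e. -log(0.7)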

criterion = nn.NLLLoss()

learning_rate = 0.005

def train(country_tnsr, word_tnsr):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(word_tnsr.size()[0]):
        output, hidden = rnn(word_tnsr[i], hidden)

    loss = criterion(output, country_tnsr)
    loss.backward()

    # Update parameters
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()

For a single training example, say ('Vietnamese', 'Hoang'), we need to create tensors from it and pass them to the train function.

The train function first creates a zero hidden state, which is used in the first time step of the RNN. The for loop then calls rnn once for each letter in the word, carrying the hidden state forward; after the loop, output holds the prediction from the final time step.

The loss is calculated using the criterion instance we created earlier. loss.backward() performs the backpropagation needed to compute the gradients, and the loop over rnn.parameters() applies a plain gradient descent update using the learning rate.

The function will return the final output value and the loss.
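Putting it together, a single training step looks like this (illustrative):

country, name = get_sample_name()
output, loss = train(get_cat_tensor(country), get_word_tensor(name))
print(loss)  # scalar loss for this one example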

6 Train the network on many examples

We know that the output from the train function is a vector of size 18 whose values are log probabilities, one for each country. We need a function that picks the country with the maximum value. The following function does that:

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i
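Illustrative usage, reusing the output from the single training step above (an untrained network guesses more or less at random):

guess, guess_i = categoryFromOutput(output)
print(guess, guess_i)  # e.g. Vietnamese 17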

n_iters = 100_000
print_every = 5000
plot_every = 1000

current_loss = 0
all_losses = []

for iter in range(1, n_iters + 1):
    ## Get random pair of name & country and convert them into tensors.
    country,name = get_sample_name()
    country_tnsr,name_tnsr= (get_cat_tensor(country),get_word_tensor(name))
    
    output, loss = train(country_tnsr, name_tnsr)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == country else '✗ (%s)' % country
        print('%d %d%% %.4f %s / %s %s' % (iter, iter / n_iters * 100, loss, name, guess, correct))

    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

We run the training loop over 100,000 random (country, name) pairs. In each iteration we draw a random training example and convert it into tensors. We then call the train function, which computes the output and the loss and performs backpropagation. We print the result every 5,000 iterations, and record the average loss every 1,000 iterations for plotting.

7 Evaluating the Results

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)

Plotting the loss over the training iterations shows that it decreases as training progresses.

7.1 Testing new values

## Output for a given word

def evaluate(line_tensor):
    hidden = rnn.initHidden()

    # No gradient tracking needed at inference time
    with torch.no_grad():
        for i in range(line_tensor.size()[0]):
            output, hidden = rnn(line_tensor[i], hidden)

    return output

categoryFromOutput(evaluate(get_word_tensor('Dovesky')))
# ('Russian', 6)

categoryFromOutput(evaluate(get_word_tensor('Jackson')))
# ('Scottish', 14)

Conclusion

We built an RNN model, trained it on (name, country) pairs, and reduced the loss over the course of training. We can also use it to predict the country of origin for a given surname.

You can get the entire code for this post here.

There's still scope for improving accuracy by tuning the hyperparameters and trying out different RNN variants, which we'll cover in future posts.