Categories

# Fast One Hot Encoding using Numpy (No For Loops)

In a lot of artificial intelligence applications, specially in supervised learning classification problems, where the labels for each of the datapoints are available, we often have a one-dimensional array containing the classes of each of the datapoints. Depending upon the machine learning algorithm we are going to apply to our dataset for classification, we might need to one-hot encode our labels.

## What is One Hot Encoding?

Say, we are given a labels array like,

``````>>> Y=numpy.randdom.randint(15,size=10).reshape(-1,1)
>>> Y
array([,
[ 8],
[ 4],
[ 4],
,
[ 0],
,
,
[ 1],
[ 8]])``````

Where each row contains the class number of the correspoding row for the datapoint `X`. Note that, in our example there are 6 unique classes, namely, `0,1,4,8,10,11,12` .One hot encoding for the above array Y will be,

``````>>> one_hot(Y)
array([[0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.]])``````

Where in each row has only one column set to 1 representing the class to which the datapoint belongs. For example, the first row has the last column set to 1. Which represents the column of the class ’12’. Which would mean we would also require the class to column mapping along with the one hot encoding. Something like,

``````array([[ 0],
[ 1],
[ 4],
[ 8],
,
,
])``````

Here the row number of the class represents it’s column in the one hot encoding. It is simply a sorted order of all unique classes in `Y`.

## The easy solution to the easy problem(For loop)

``````def one_hot_for(Y):
data_size=Y.shape
classes=np.unique(Y).reshape(-1,1)
num_classes=classes.shape

one_hot=np.zeros((data_size,num_classes))
for row in range(data_size):
one_hot[row,np.where(classes==Y[row])]=1

return one_hot,classes``````

## The hard solution to the easy problem(vector(ish))

``````def one_hot(Y):
data_size=Y.shape
classes=np.unique(Y).reshape(-1,1)
num_classes=classes.shape

class_mappings=np.arange(0,max(Y)+1)
class_mappings[np.unique(classes)]=np.arange(num_classes)
Y=class_mappings[Y]

one_hot=np.zeros((data_size,num_classes))
one_hot[np.arange(data_size).reshape(-1,1),Y.reshape(-1,1)]=1
class_col=np.sort(classes)
return one_hot,class_col``````

## Speed comparison

#### Generating random labels and storing in a file

``````import random

file_name="randoms.txt"
with open(file_name,"w+") as random_labels:
for i in range(10000):
random_labels.write(str(random.randint(0,1000))+"\n")``````

The above code will generate 10,000 random numbers and store them on individual lines in the file `randoms.txt`

#### Script for comparing the two functions

``````import matplotlib.pyplot as plt
import time
import numpy as np
from helper_functions import one_hot,one_hot_for

filename= "randoms.txt"

with open(filename,"r+") as f:
int_map=map(int,Y)
Y=list(int_map)
Y=np.asarray(Y).reshape(-1,1)

one_hot_timings=[]
one_hot_for_timings=[]
for i in range(100,10000,100):
start=time.time()
_,_=one_hot(Y[:i])
end=time.time()
one_hot_timings.append(end-start)

start=time.time()
_,_=one_hot_for(Y[:i])
end=time.time()
one_hot_for_timings.append(end-start)

plt.plot(one_hot_timings,label="one_hot_vector")
plt.plot(one_hot_for_timings,label="one_hot_for")
plt.xlabel('data_size for every 100 datapoints')
plt.ylabel('time of execution')
plt.legend(loc='best')
plt.show()``````