In many artificial intelligence applications, especially in supervised learning classification problems where a label is available for each datapoint, we often have a one-dimensional array containing the class of every datapoint. Depending on the machine learning algorithm we are going to apply to our dataset, we might need to one-hot encode these labels.

## What is One Hot Encoding?

Say we are given a labels array like this:

```
>>> import numpy as np
>>> Y = np.random.randint(15, size=10).reshape(-1, 1)
>>> Y
array([[12],
       [ 8],
       [ 4],
       [ 4],
       [11],
       [ 0],
       [10],
       [10],
       [ 1],
       [ 8]])
```

where each row contains the class number of the corresponding datapoint in `X`. Note that in our example there are 7 unique classes, namely `0, 1, 4, 8, 10, 11, 12`. The one-hot encoding for the above array `Y` will be:

```
>>> one_hot(Y)
array([[0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.]])
```

Here each row has exactly one column set to 1, representing the class to which the datapoint belongs. For example, the first row has the last column set to 1, which corresponds to the class `12`. This means we also need the class-to-column mapping along with the one-hot encoding, something like:

```
array([[ 0],
       [ 1],
       [ 4],
       [ 8],
       [10],
       [11],
       [12]])
```

Here the row number of a class is its column in the one-hot encoding; it is simply the sorted order of all unique classes in `Y`.
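Given this class-to-column array, decoding a one-hot matrix back into labels is a one-liner. A minimal sketch (the variable names here are illustrative, not part of the original code):

```python
import numpy as np

# Sorted class-to-column mapping, as described above.
classes = np.array([[0], [1], [4], [8], [10], [11], [12]])
# First two rows of the one-hot encoding from the example.
one_hot = np.array([[0., 0., 0., 0., 0., 0., 1.],
                    [0., 0., 0., 1., 0., 0., 0.]])

# argmax along each row gives the column index of the single 1;
# indexing `classes` with it maps each column back to its class label.
labels = classes[one_hot.argmax(axis=1)]
print(labels.ravel())  # the labels 12 and 8
```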

## The easy solution to the easy problem (for loop)

```
import numpy as np

def one_hot_for(Y):
    data_size = Y.shape[0]
    classes = np.unique(Y).reshape(-1, 1)  # sorted unique classes
    num_classes = classes.shape[0]
    one_hot = np.zeros((data_size, num_classes))
    for row in range(data_size):
        # find the column of this row's class and set it to 1
        one_hot[row, np.where(classes == Y[row])[0]] = 1
    return one_hot, classes
```
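The lookup inside the loop is worth a closer look. `classes == Y[row]` broadcasts to a boolean column, and `np.where(...)[0]` extracts the index of the single `True` entry, i.e. the column for that class. A tiny walk-through:

```python
import numpy as np

Y = np.array([[12], [8], [4]])
classes = np.unique(Y).reshape(-1, 1)  # [[4], [8], [12]]

# classes == Y[0] broadcasts to a (3, 1) boolean array;
# np.where(...)[0] gives the row index where it is True.
col = np.where(classes == Y[0])[0]  # class 12 sits in column 2
print(col)  # [2]
```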

## The hard solution to the easy problem (vectorized(ish))

```
def one_hot(Y):
    data_size = Y.shape[0]
    classes = np.unique(Y).reshape(-1, 1)  # already sorted
    num_classes = classes.shape[0]
    # lookup table: raw class value -> compact column index
    class_mappings = np.arange(Y.max() + 1)
    class_mappings[classes.ravel()] = np.arange(num_classes)
    Y = class_mappings[Y]
    one_hot = np.zeros((data_size, num_classes))
    one_hot[np.arange(data_size).reshape(-1, 1), Y.reshape(-1, 1)] = 1
    return one_hot, classes
```
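The lookup-table trick in the middle of this function is the key step, so here is a small worked example of it in isolation (the array values are illustrative):

```python
import numpy as np

Y = np.array([[12], [8], [4], [8]])
classes = np.unique(Y)  # [ 4  8 12]

# Build a table indexed by raw class value whose entries are the
# compact column numbers 0..num_classes-1. Only the slots at the
# actual class values matter; the rest are never looked up.
class_mappings = np.arange(Y.max() + 1)
class_mappings[classes] = np.arange(classes.shape[0])

# Fancy indexing now maps every label to its column in one shot.
cols = class_mappings[Y.ravel()]
print(cols)  # [2 1 0 1]
```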

## Speed comparison

#### Generating random labels and storing them in a file

```
import random

file_name = "randoms.txt"
with open(file_name, "w+") as random_labels:
    for i in range(10000):
        random_labels.write(str(random.randint(0, 1000)) + "\n")
```

The above code generates 10,000 random numbers and stores them on individual lines in the file `randoms.txt`.

#### Script for comparing the two functions

```
import time

import matplotlib.pyplot as plt
import numpy as np

from helper_functions import one_hot, one_hot_for

filename = "randoms.txt"
with open(filename, "r+") as f:
    Y = f.readlines()
int_map = map(int, Y)
Y = list(int_map)
Y = np.asarray(Y).reshape(-1, 1)

one_hot_timings = []
one_hot_for_timings = []
for i in range(100, 10000, 100):
    start = time.time()
    _, _ = one_hot(Y[:i])
    end = time.time()
    one_hot_timings.append(end - start)
    start = time.time()
    _, _ = one_hot_for(Y[:i])
    end = time.time()
    one_hot_for_timings.append(end - start)

plt.plot(one_hot_timings, label="one_hot_vector")
plt.plot(one_hot_for_timings, label="one_hot_for")
plt.xlabel('data_size for every 100 datapoints')
plt.ylabel('time of execution')
plt.legend(loc='best')
plt.show()
```
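As a side note, for micro-benchmarks like this `time.perf_counter` is generally preferable to `time.time`, since it has higher resolution and is monotonic. A minimal sketch of a reusable timing helper (`timed` is a hypothetical name, not part of the script above):

```python
import time

import numpy as np

def timed(fn, *args):
    """Run fn(*args), returning (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Example: time np.unique on 10,000 random labels.
res, elapsed = timed(np.unique, np.random.randint(0, 1000, size=(10000, 1)))
print(f"np.unique took {elapsed:.6f}s")
```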

#### Result

## Additional notes

- Libraries such as SciPy, PyTorch, scikit-learn, etc. could probably do this faster.
- Depending on how often you call these functions in your application, the choice between the two methods may not matter, as both take at most a few seconds.
- The assembled code for the above can be found here.
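On the first point, NumPy itself can also collapse both steps into one call: `np.unique` with `return_inverse=True` directly returns the column index of every label, and indexing an identity matrix with those indices builds the one-hot rows. A compact alternative sketch, equivalent in output to `one_hot()` above:

```python
import numpy as np

def one_hot_unique(Y):
    # return_inverse gives, for each label, its index into the sorted
    # unique classes; rows of the identity matrix are the one-hot vectors.
    classes, inverse = np.unique(Y.ravel(), return_inverse=True)
    return np.eye(classes.shape[0])[inverse], classes.reshape(-1, 1)

encoded, classes = one_hot_unique(np.array([[12], [8], [4], [8]]))
print(encoded)
```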