
Linear vs Logistic Regression, all in Numpy

The two entry-level machine learning algorithms, linear and logistic regression, are quite easy to understand and provide a good way to practice coding the general machine learning pipeline in its vectorized form. Namely,

  • Prepping the dataset, e.g. removing outliers, adding features (polynomial multiples of existing features), normalization and feature scaling.
  • Implementing the learning algorithm function.
  • Calculating the loss through the chosen loss function and hypothesis.
  • Optimization algorithm – updating your model’s parameters based on the loss and the ground truth.
  • Maintaining a config file for the whole training process (a hypothetical sketch follows this list).
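As a small illustration (the file name and keys below are hypothetical, not from the original code), such a config could just be a dictionary collecting the knobs used later in this post:

# config.py -- a hypothetical, minimal config for the runs in this post
CONFIG = {
    "data_file": "diabetes_scaled.csv",  # assumed output path of pre_data_prep
    "split": [0.5, 0.25, 0.25],          # train/val/test ratios (must sum to 1)
    "learning_rate": 0.001,
    "num_iters": 500,                    # 1500 was used for logistic regression
    "threshold": 0.5,
    "seeds": list(range(10)),            # one model per random seed
}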

An entry-level toy dataset, the Pima Indians Diabetes dataset, will be used for this tutorial. It has a single target variable per datapoint (a binary classification task of predicting whether a person has diabetes or not).

About the dataset

You can read all about the dataset on Kaggle. It has 768 datapoints and 8 features, which is quite a sweet spot for building a decent model without worrying about underfitting. The data describes female patients: their BMI, insulin level, age, skin thickness, glucose level, etc. The target tells whether the person has diabetes or not. Of all the datapoints, 500 are non-diabetic and 268 are diabetic. The data is not highly skewed, so plain test accuracy should suffice (no need to rely on precision, recall and F1 score).
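For a sense of scale (simple arithmetic on the class counts above), a classifier that always predicts “non-diabetic” would already be right about 65% of the time:

# Majority-class baseline from the class counts quoted above
baseline_accuracy = 500 / 768  # ~0.651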

Helper Functions
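The code snippets in this post are shown without their imports. A minimal set that the snippets below appear to rely on (an assumption; the complete version lives in the assimilated gist linked at the end) is:

import gzip

import numpy as np
from numpy import loadtxt, savetxt
import matplotlib.pyplot as plt
from tqdm import tqdm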

For normalizing the dataset

This helps in normalizing the data, bringing each feature into the 0 to 1 range.

def scalify_min_max(np_dataframe):
    #Per-column min-max scaling: (x - min) / (max - min)
    minimum_array=np.amin(np_dataframe,axis=0)
    maximum_array=np.amax(np_dataframe,axis=0)
    range_array = maximum_array-minimum_array

    scaled = (np_dataframe-minimum_array)/range_array
    return scaled
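A quick sanity check with toy values (not from the dataset):

>>> scalify_min_max(np.array([[1., 10.], [2., 20.], [3., 30.]]))
array([[0. , 0. ],
       [0.5, 0.5],
       [1. , 1. ]])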

For calculating the accuracy
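The helper below reports, alongside accuracy, the standard confusion-matrix metrics: precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 · precision · recall / (precision + recall), where TP, FP and FN are the counts of true positives, false positives and false negatives.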

def accuracy_calculator(Y_out,Y):
    #Accuracy: fraction of predictions that match the ground truth
    accuracy=np.sum(np.logical_not(np.logical_xor(Y_out,Y)))/Y.shape[0]
    true_positives=np.sum(np.logical_and(Y_out,Y))
    false_positives=np.sum(np.logical_and(Y_out,np.logical_not(Y)))
    false_negatives=np.sum(np.logical_and(np.logical_not(Y_out),Y))
    precision=true_positives/(true_positives+false_positives)
    recall=true_positives/(true_positives+false_negatives)
    print("Precision:",precision,". Recall:",recall)
    F1_score=2*precision*recall/(precision+recall)
    return [accuracy,precision,recall,F1_score]

For preparing the dataset – creating train/val/test splits

def pre_data_prep(filename,dest_fileloc):
    #Read the gzipped CSV, skipping the header row
    with open(filename,'rb') as f:
        gzip_fd=gzip.GzipFile(fileobj=f)
        next(gzip_fd)#Skip the header row
        diabetes_df = loadtxt(gzip_fd,delimiter=',',dtype=np.float32)
    Y=diabetes_df[:,-1]
    #Min-max scale the features, then re-attach the (unscaled) target column
    scaled_diabetes_df = scalify_min_max(diabetes_df[:,:-1])
    concat_diabetes = np.concatenate((scaled_diabetes_df,np.array([Y]).T),axis=1)
    savetxt(dest_fileloc,concat_diabetes,delimiter=',')

def dataprep(fileloc,split):
    assert len(split) == 3
    assert sum(split) == 1
    diabetes_data = loadtxt(fileloc,delimiter=',',dtype=np.float32)
    Y=np.array([diabetes_data[:,-1]]).T
    classes = np.unique(Y)
    assert len(classes) == 2
    X=diabetes_data[:,:-1]
    data_size=X.shape[0]
    print(data_size,X.shape,Y.shape)

    #Splits are taken in file order (no shuffling)
    split_size=int(split[0]*data_size)
    val_split=int(split[1]*data_size)
    X_train=X[:split_size]
    X_val=X[split_size:split_size+val_split]
    X_test=X[split_size+val_split:]
    Y_train=Y[:split_size]
    Y_val=Y[split_size:split_size+val_split]
    Y_test=Y[split_size+val_split:]
    return X_train,X_val,X_test,Y_train,Y_val,Y_test

Evaluation function

For finding the accuracy of the learned model on the validation or test dataset.

def evaluate(theta_params,X,Y=None,thresh=0.5):
    data_size=X.shape[0]
    #Add the bias column and threshold the raw linear score to get 0/1 predictions
    X_extend=np.concatenate((np.ones((data_size,1)),X),axis=1)
    pred = np.greater(np.matmul(X_extend,theta_params),thresh)*1
    #Mean squared error cost (shared by both models); skipped if no labels are given
    cost=None
    if Y is not None:
        cost=np.sum(np.square(np.matmul(X_extend,theta_params)-Y))/(data_size*2)
    return pred,cost

Logistic Regression Function
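For reference, the code below implements the sigmoid hypothesis h(x) = 1 / (1 + exp(-theta^T x)), tracks the binary cross-entropy cost J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ], and takes gradient-descent steps theta <- theta - (alpha/m) * X^T (h - y), which is exactly the matrix expression computed inside the loop.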

def sigmoid_func(theta,X):
    retval = 1/(1+np.exp(-1*np.matmul(theta.T,X)))
    return retval
    
def logistic_regression(X,Y,learning_rate=0.001,num_iters=100,thresh=0.5,rand_seed=None):
    if rand_seed!=None:#For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params=np.array([np.random.randn(X.shape[1]+1)]).T
    #Add bias column to X
    X_extend = np.concatenate((np.ones((data_size,1)),X),axis=1).T
    cost=[]#Keep track of cost after each iteration of learning
    for i in tqdm(range(num_iters),desc="Training.."):
        h_theta=sigmoid_func(theta_params,X_extend).T#mX1
        grad=np.matmul(X_extend,(h_theta-Y))/data_size#nXm*mX1=nX1
        theta_params=theta_params-learning_rate*grad
        cost.append(-1*np.sum(Y*np.log(h_theta)+(1-Y)*np.log(1-h_theta))/(data_size))
    #Note: thresholds the raw linear score (as evaluate() does) rather than sigmoid(score)
    final_pred = np.greater(np.matmul(X_extend.T,theta_params),thresh)*1
    accuracy=np.sum(np.logical_not(np.logical_xor(final_pred,Y)))/data_size
    cost=np.array(cost)
    return theta_params,accuracy,cost

Linear Regression Function
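The linear model is fit to the 0/1 targets directly with the mean squared error cost J(theta) = (1/2m) * sum (theta^T x - y)^2, and the loop below performs the usual gradient-descent update theta <- theta - (alpha/m) * X^T (X theta - y); the regression output is then thresholded at 0.5 to turn it into class labels.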

def linear_regression(X,Y,learning_rate=0.001,num_iters=100,thresh=0.5,rand_seed=None):
    if rand_seed!=None:
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    #print(X.shape,Y.shape)
    theta_params=np.array([np.random.randn(X.shape[1]+1)]).T
    X_extend = np.concatenate((np.ones((data_size,1)),X),axis=1)
    cost=[]
    for i in tqdm(range(num_iters),desc="Training.."):
        theta_params=theta_params-learning_rate*np.matmul((np.matmul(theta_params.T,X_extend.T)-Y.T),X_extend).T/data_size
        cost.append(np.sum(np.square(np.matmul(X_extend,theta_params)-Y))/(data_size*2))
    final_pred = np.greater(np.matmul(X_extend,theta_params),thresh)*1
    accuracy=np.sum(np.logical_not(np.logical_xor(final_pred,Y)))/data_size
    cost=np.array(cost)
    return theta_params,accuracy,cost

Runner functions for Linear and Logistic Regressions

#######################--------Linear RUNNER---------###############################
def regression_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model,train_accuracy,cost=linear_regression(X_train,Y_train,rand_seed=random_seeds[i],num_iters=num_iters)
        print("Trial:",i,".Train Accuracy:",train_accuracy)
        all_models.append(model)
        plt.plot(x_axis,cost,label=str(random_seeds[i]))
        
        val_prediction,val_cost=evaluate(model,X_val,Y_val)
        accuracy_precision=accuracy_calculator(val_prediction,Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation Accuracy:",accuracy_precision)
        print("Validation Cost:",val_cost)

    #plt.legend()
    plt.title("Linear Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    #Pick the model with the best validation accuracy
    max_accuracy_idx=int(np.argmax(all_val_accuracies))
    best_model=all_models[max_accuracy_idx]
    print(best_model.shape)
    #print(X_test.shape,Y_test.shape)
    test_pred,test_cost=evaluate(best_model,X_test,Y_test)
    print(test_pred.shape,test_cost)
    test_accuracy,test_precision,test_recall,test_f1=accuracy_calculator(test_pred,Y_test)
    print("Test accuracy:",test_accuracy,".Test cost:",test_cost)

#####################-------------LOGISTIC RUNNER--------------##########################
def logistic_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=1500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model,train_accuracy,cost=logistic_regression(X_train,Y_train,rand_seed=random_seeds[i],num_iters=num_iters)
        print("Trial:",i,".Train Accuracy:",train_accuracy)
        all_models.append(model)
        plt.plot(x_axis,cost,label=str(random_seeds[i]))

        val_prediction,val_cost=evaluate(model,X_val,Y_val)
        accuracy_precision=accuracy_calculator(val_prediction,Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation Accuracy:",accuracy_precision)
        print("Validation Cost:",val_cost)
    #plt.legend()
    plt.title("Logistic Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    #Pick the model with the best validation accuracy
    max_accuracy_idx=int(np.argmax(all_val_accuracies))
    best_model=all_models[max_accuracy_idx]

    test_pred,test_cost=evaluate(best_model,X_test,Y_test)
    #print(test_pred.shape,print(test_cost))
    test_accuracy,test_precision,test_recall,test_f1=accuracy_calculator(test_pred,Y_test)
    print("Test accuracy:",test_accuracy,".Test cost:",test_cost)

Training Curves

Note that each of the two trainings below was performed with 10 different initial values of theta. The initial value of theta affects the overall training performance, so the best of the 10 models was taken into consideration for the final evaluation on the test dataset.

[Figure: linear regression training curves (cost vs. number of iterations, one curve per seed)]
[Figure: logistic regression training curves (cost vs. number of iterations, one curve per seed)]

Test Accuracies

Linear Regression:
Test accuracy: 0.7068965517241379 .Test cost: 0.14745936729023856

Logistic Regression:
Test accuracy: 0.646551724137931 .Test cost: 0.2865915372479961

Assimilated code

Gist Link

Additional Notes

  • The above code does not use regularization.
  • It may appear that for a few curves training was stopped prematurely, but in fact the test results were close to optimal with the above training parameters.
  • Although linear regression appears to perform better in this case, it might give poorer results on other datasets.
  • Note that logistic regression took 3x as many iterations as linear regression to converge.
  • Initial model parameters are chosen randomly by varying the seed values. Since the initial parameters (theta) affect the overall training performance, 10 such seed values were tried.


Fast One Hot Encoding using Numpy (No For Loops)

In a lot of artificial intelligence applications, especially in supervised classification problems where the labels for each datapoint are available, we often have a one-dimensional array containing the class of each datapoint. Depending on the machine learning algorithm we are going to apply to our dataset for classification, we might need to one-hot encode our labels.

What is One Hot Encoding?

Say, we are given a labels array like,

>>> Y=numpy.random.randint(15,size=10).reshape(-1,1)
>>> Y
array([[12],
       [ 8],
       [ 4],
       [ 4],
       [11],
       [ 0],
       [10],
       [10],
       [ 1],
       [ 8]])

Where each row contains the class number of the corresponding datapoint in X. Note that in our example there are 7 unique classes, namely 0, 1, 4, 8, 10, 11 and 12. The one-hot encoding for the above array Y will be,

>>> one_hot(Y)
array([[0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.]])

Here each row has exactly one column set to 1, representing the class to which the datapoint belongs. For example, the first row has its last column set to 1, which is the column for class ’12’. This means we also need the class-to-column mapping along with the one-hot encoding, something like,

array([[ 0],
       [ 1],
       [ 4],
       [ 8],
       [10],
       [11],
       [12]])

Here the row number of a class represents its column in the one-hot encoding. It is simply a sorted order of all unique classes in Y.

The easy solution to the easy problem (for loop)

def one_hot_for(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)#Sorted unique classes, one per row
    num_classes=classes.shape[0]

    one_hot=np.zeros((data_size,num_classes))
    for row in range(data_size):
        #Set the column corresponding to this datapoint's class to 1
        one_hot[row,np.where(classes==Y[row])[0]]=1

    return one_hot,classes

The hard solution to the easy problem (vectorized(ish))

def one_hot(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)#np.unique also sorts the classes
    num_classes=classes.shape[0]

    #Build a lookup table: original class value -> column index in the one-hot matrix
    class_mappings=np.arange(0,Y.max()+1)
    class_mappings[np.unique(classes)]=np.arange(num_classes)
    Y=class_mappings[Y]#Replace each label with its column index in one shot

    #Scatter the 1s with fancy indexing: one (row, column) pair per datapoint
    one_hot=np.zeros((data_size,num_classes))
    one_hot[np.arange(data_size).reshape(-1,1),Y.reshape(-1,1)]=1
    class_col=np.sort(classes)
    return one_hot,class_col
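To see why the mapping trick works, here is a rough trace with the classes from the example above (0, 1, 4, 8, 10, 11, 12):

>>> class_mappings = np.arange(0, 13)  # 0 .. max(Y), the largest class here being 12
>>> class_mappings[[0, 1, 4, 8, 10, 11, 12]] = np.arange(7)
>>> class_mappings
array([0, 1, 2, 3, 2, 5, 6, 7, 3, 9, 4, 5, 6])

Each actual class value now maps to its column index (for example, class 12 maps to column 6 and class 8 to column 3). The stale entries at positions that are not real classes (2, 3, 5, 6, 7, 9) are never looked up, since Y only ever contains actual class values.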

Speed comparison

Generating random labels and storing in a file

import random

file_name="randoms.txt"
with open(file_name,"w+") as random_labels:
    for i in range(10000):
        random_labels.write(str(random.randint(0,1000))+"\n")

The above code will generate 10,000 random numbers and store them, one per line, in the file randoms.txt.

Script for comparing the two functions

import matplotlib.pyplot as plt
import time
import numpy as np
from helper_functions import one_hot,one_hot_for

filename= "randoms.txt"

with open(filename,"r+") as f:
    Y=f.readlines()
    int_map=map(int,Y)
    Y=list(int_map)
    Y=np.asarray(Y).reshape(-1,1)

one_hot_timings=[]
one_hot_for_timings=[]
for i in range(100,10000,100):
    start=time.time()
    _,_=one_hot(Y[:i])
    end=time.time()
    one_hot_timings.append(end-start)

    start=time.time()
    _,_=one_hot_for(Y[:i])
    end=time.time()
    one_hot_for_timings.append(end-start)

plt.plot(one_hot_timings,label="one_hot_vector")
plt.plot(one_hot_for_timings,label="one_hot_for")
plt.xlabel('data_size for every 100 datapoints')
plt.ylabel('time of execution')
plt.legend(loc='best')
plt.show()

Result

[Figure: comparison of execution time for the two one-hot encoding algorithms]

Additional notes

  • Libraries such as scipy, torch, sklearn, etc. could probably do this faster.
  • Depending on how often you call the above functions in your application, the choice between the two methods may not matter much, since both take at most a few seconds.
  • The assimilated code for the above can be found here.


(Slightly) Non-Trivial Implementation of CLI Hangman. A Hands-on Python Tutorial.

Bored of learning Python the regular way? Make a command-line-interface Hangman game and learn while you’re at it.

First try out the game for yourself! Run the following in your terminal,

pip install hangman-ultimate
hangman-ultimate

Never played the chalkboard game of Hangman before? The rules are simple: you have to guess the name of a random animal letter by letter, and every wrong guess of a letter brings the man closer to hanging. In retrospect, the game is a little dark to have played as a kid.

Alternatively, you can download the code from here.

Remember to like this post and star my repository on GitHub if this post helped you!

Tear apart the code and make sure you have learnt the following by the end of it:
1. Opening a file and reading from it line by line.
2. Making multi-line strings.
3. Python classes and class instances (objects).
4. Class fields (variables) and methods (functions).
5. Passing class objects to functions.
6. Shared and non-shared fields (variables) amongst objects of the same class.
7. Calling system (terminal) commands from the Python script.
8. Catching an error using a try/except block.
9. Making basic command line animations.
10. Making programs compatible with both Python 2 and Python 3.
11. Pip-packaging your Python programs!

Got more ideas for CLI based games? Put them in the comments below!