Categories
Algorithms

Linear vs Logistic Regression, all in Numpy

The two entry level machine learning algorithms , linear and logistic regression are quite easy to understand and provide a good way to practice coding the general machine learning pipeline, in their vectorized form. Namely,

  • Prepping the dataset, eg: removing outliers, adding features(polynomial multiples of existing features), normalization, feature scaling.
  • Implementing the learning algorithm function.
  • Calculating the loss through the chosen loss function and hypothesis.
  • Optimization algorithm – Update you model’s parameters depending on the loss and ground truth.
  • Maintaining a config file for the total training process.

An entry level toy dataset: pima indians diabetes dataset, has a target of one variable for each datapoint(a binary classification task of predicting whether a person has diabetes or not), will be used for the sake of this tutorial blog.

About the dataset

You can know all about the dataset on Kaggle. The dataset has around 768 datapoints and 8 features, which is quite a sweet spot for having a decent model without worrying about underfitting. The data is about female patients, specifically, their BMI, insulin level, age, skin thickness, glucose level etc. The target tells if the person has diabetes or not. 500 of all the datapoints are non-diabetic. 268 are diabetic. The data is not highly skewed so normal test accuracy should suffice(no need to find precision, recall and F1 score).

Helper Functions

For normalizing the dataset

This helps in normalizing the data and bringing them in a range of 0-1.

def scalify_min_max(np_dataframe):
    minimum_array=np.amin(np_dataframe,axis=0)
    maximum_array=np.amax(np_dataframe,axis=0)
    range_array = maximum_array-minimum_array

    scaled = (np_dataframe-minimum_array)/range_array
    return scaled

For calculating the accuracy

def accuracy_calculator(Y_out,Y):
    accuracy=np.sum(np.logical_not(np.logical_xor(Y_out,Y)))/Y.shape[0]
    true_positives=np.sum(np.logical_and(Y_out,Y))
    false_positives=np.sum(np.logical_and(Y_out,np.logical_not(Y)))
    false_negatives=np.sum(np.logical_and(np.logical_not(Y_out),Y))
    precision=true_positives/(true_positives+false_positives)
    recall=true_positives/(true_positives+false_negatives)
    print("Precision:",precision,".Recall:",recall)
    F1_score=precision*recall/(precision+recall)
    return [accuracy,precision,recall,F1_score]

For preparing the dataset – creating train/val/test splits

def pre_data_prep(filename,dest_fileloc):
    with open(filename,'rb') as f:
        gzip_fd=gzip.GzipFile(fileobj=f)
        next(gzip_fd)#Skip first row
        diabetes_df = loadtxt(gzip_fd,delimiter=',',dtype=np.float32)
    Y=diabetes_df[:,-1]
    scaled_diabetes_df = scalify_min_max(diabetes_df[:,:-1])
    concat_diabetes = np.concatenate((scaled_diabetes_df,np.array([Y]).T),axis=1)
    savetxt(dest_fileloc,concat_diabetes,delimiter=',')

def dataprep(fileloc,split):
    assert len(split) == 3
    assert sum(split) == 1
    diabetes_data = loadtxt(fileloc,delimiter=',',dtype=np.float32)
    Y=np.array([diabetes_data[:,-1]]).T
    classes = np.unique(Y)
    assert len(classes) == 2
    X=diabetes_data[:,:-1]
    data_size=X.shape[0]
    print(data_size,X.shape,Y.shape)

    split_size=int(split[0]*data_size)
    val_split=int(split[0]*data_size)
    X_train=X[:split_size]
    X_val=X[split_size:split_size+val_split]
    X_test=X[split_size+val_split:]
    Y_train=Y[:split_size]
    Y_val=Y[split_size:split_size+val_split]
    Y_test=Y[split_size+val_split:]
    return X_train,X_val,X_test,Y_train,Y_val,Y_test

Evaluation function

For for finding accuracy of learned model on the test dataset.

def evaluate(theta_params,X,Y=None,thresh=0.5):
    data_size=X.shape[0]
    X_extend=np.concatenate((np.ones((data_size,1)),X),axis=1)
    pred = np.greater(np.matmul(X_extend,theta_params),thresh)*1
    cost=np.sum(np.square(np.matmul(X_extend,theta_params)-Y))/(data_size*2)
    return pred,cost

Logistic Regression Function

def sigmoid_func(theta,X):
    retval = 1/(1+np.exp(-1*np.matmul(theta.T,X)))
    return retval
    
def logistic_regression(X,Y,learning_rate=0.001,num_iters=100,thresh=0.5,rand_seed=None):
    if rand_seed!=None:#For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params=np.array([np.random.randn(X.shape[1]+1)]).T
    #Add bias column to X
    X_extend = np.concatenate((np.ones((data_size,1)),X),axis=1).T
    cost=[]#Keep track of cost after each iteration of learning
    for i in tqdm(range(num_iters),desc="Training.."):
        h_theta=sigmoid_func(theta_params,X_extend).T#mX1
        grad=np.matmul(X_extend,(h_theta-Y))/data_size#nXm*mX1=nX1
        theta_params=theta_params-learning_rate*grad
        cost.append(-1*np.sum(Y*np.log(h_theta)+(1-Y)*np.log(1-h_theta))/(data_size))
    final_pred = np.greater(np.matmul(X_extend.T,theta_params),thresh)*1
    accuracy=np.sum(np.logical_not(np.logical_xor(final_pred,Y)))/data_size
    cost=np.array(cost)
    return theta_params,accuracy,cost

Linear Regression Function

def linear_regression(X,Y,learning_rate=0.001,num_iters=100,thresh=0.5,rand_seed=None):
    if rand_seed!=None:
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    #print(X.shape,Y.shape)
    theta_params=np.array([np.random.randn(X.shape[1]+1)]).T
    X_extend = np.concatenate((np.ones((data_size,1)),X),axis=1)
    cost=[]
    for i in tqdm(range(num_iters),desc="Training.."):
        theta_params=theta_params-learning_rate*np.matmul((np.matmul(theta_params.T,X_extend.T)-Y.T),X_extend).T/data_size
        cost.append(np.sum(np.square(np.matmul(X_extend,theta_params)-Y)[0])/(data_size*2))
    final_pred = np.greater(np.matmul(X_extend,theta_params),thresh)*1
    accuracy=np.sum(np.logical_not(np.logical_xor(final_pred,Y)))/data_size
    cost=np.array(cost)
    return theta_params,accuracy,cost

Runner functions for Linear and Logistic Regressions

#######################--------Linear RUNNER---------###############################
def regression_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model,train_accuracy,cost=linear_regression(X_train,Y_train,rand_seed=random_seeds[i],num_iters=num_iters)
        print("Trial:",i,".Train Accuracy:",train_accuracy)
        all_models.append(model)
        plt.plot(x_axis,cost,label=str(random_seeds[i]))
        
        val_prediction,val_cost=evaluate(model,X_val,Y_val)
        accuracy_precision=accuracy_calculator(val_prediction,Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation Accuracy:",accuracy_precision)
        print("Validation Cost:",val_cost)

    #plt.legend()
    plt.title("Linear Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    max_accuracy_idx=np.where(all_val_accuracies==np.amax(all_val_accuracies))[0][0]
    best_model=all_models[max_accuracy_idx]
    print(best_model.shape)
    #print(X_test.shape,Y_test.shape)
    test_pred,test_cost=evaluate(best_model,X_test,Y_test)
    print(test_pred.shape,print(test_cost))
    test_accuracy,test_precision,test_recall,test_f1=accuracy_calculator(test_pred,Y_test)
    print("Test accuracy:",test_accuracy,".Test cost:",test_cost)

#####################-------------LOGISTIC RUNNER--------------##########################
def logistic_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=1500
    x_axis=np.arange(num_iters)
    for i in range(10):
        model,train_accuracy,cost=logistic_regression(X_train,Y_train,rand_seed=random_seeds[i],num_iters=num_iters)
        print("Trial:",i,".Train Accuracy:",train_accuracy)
        all_models.append(model)
        plt.plot(x_axis,cost,label=str(random_seeds[i]))

        val_prediction,val_cost=evaluate(model,X_val,Y_val)
        accuracy_precision=accuracy_calculator(val_prediction,Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation Accuracy:",accuracy_precision)
        print("Validation Cost:",val_cost)
    #plt.legend()
    plt.title("Logistic Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    max_accuracy_idx=np.where(all_val_accuracies==np.amax(all_val_accuracies))[0][0]
    best_model=all_models[max_accuracy_idx]

    test_pred,test_cost=evaluate(best_model,X_test,Y_test)
    #print(test_pred.shape,print(test_cost))
    test_accuracy,test_precision,test_recall,test_f1=accuracy_calculator(test_pred,Y_test)
    print("Test accuracy:",test_accuracy,".Test cost:",test_cost)

Training Curves

Note that each of the below two trainings was performed with 10 different values of initial theta. The initial value of theta effects the overall training performance. The best of the 10 was taken in consideration for the final evaluation on the test dataset.

linear regression training
logistic regression training

Test Accuracies

Linear Regression:
Test accuracy: 0.7068965517241379 .Test cost: 0.14745936729023856

Logistic Regression:
Test accuracy: 0.646551724137931 .Test cost: 0.2865915372479961

Assimilated code

Gist Link

Additional Notes

  • The above code does not use regularization.
  • It may appear that for a few curves training was stopped prematurely, but infact the test results were more near optimal for the above training parameters.
  • Although linear regression appears to be performing better for the above case it might give poorer results for other datasets.
  • Note that logistic regression took 3X as many iterations as linear regression to converge.
  • Initial model parameters are chosen randomly by varying the seed values. Initial model parameters(theta) effects the overall training performance, hence 10 such values were taken.

Advertisement
Categories
Algorithms

Fast One Hot Encoding using Numpy (No For Loops)

In a lot of artificial intelligence applications, specially in supervised learning classification problems, where the labels for each of the datapoints are available, we often have a one-dimensional array containing the classes of each of the datapoints. Depending upon the machine learning algorithm we are going to apply to our dataset for classification, we might need to one-hot encode our labels.

What is One Hot Encoding?

Say, we are given a labels array like,

>>> Y=numpy.randdom.randint(15,size=10).reshape(-1,1)
>>> Y
array([[12],
       [ 8],
       [ 4],
       [ 4],
       [11],
       [ 0],
       [10],
       [10],
       [ 1],
       [ 8]])

Where each row contains the class number of the correspoding row for the datapoint X. Note that, in our example there are 6 unique classes, namely, 0,1,4,8,10,11,12 .One hot encoding for the above array Y will be,

>>> one_hot(Y)
array([[0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.]])

Where in each row has only one column set to 1 representing the class to which the datapoint belongs. For example, the first row has the last column set to 1. Which represents the column of the class ’12’. Which would mean we would also require the class to column mapping along with the one hot encoding. Something like,

array([[ 0],
       [ 1],
       [ 4],
       [ 8],
       [10],
       [11],
       [12]])

Here the row number of the class represents it’s column in the one hot encoding. It is simply a sorted order of all unique classes in Y.

The easy solution to the easy problem(For loop)

def one_hot_for(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)
    num_classes=classes.shape[0]

    one_hot=np.zeros((data_size,num_classes))
    for row in range(data_size):
        one_hot[row,np.where(classes==Y[row])[0]]=1

    return one_hot,classes

The hard solution to the easy problem(vector(ish))

def one_hot(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)
    num_classes=classes.shape[0]

    class_mappings=np.arange(0,max(Y)+1)
    class_mappings[np.unique(classes)]=np.arange(num_classes)
    Y=class_mappings[Y]

    one_hot=np.zeros((data_size,num_classes))
    one_hot[np.arange(data_size).reshape(-1,1),Y.reshape(-1,1)]=1
    class_col=np.sort(classes)
    return one_hot,class_col

Speed comparison

Generating random labels and storing in a file

import random

file_name="randoms.txt"
with open(file_name,"w+") as random_labels:
    for i in range(10000):
        random_labels.write(str(random.randint(0,1000))+"\n")

The above code will generate 10,000 random numbers and store them on individual lines in the file randoms.txt

Script for comparing the two functions

import matplotlib.pyplot as plt
import time
import numpy as np
from helper_functions import one_hot,one_hot_for

filename= "randoms.txt"

with open(filename,"r+") as f:
    Y=f.readlines()
    int_map=map(int,Y)
    Y=list(int_map)
    Y=np.asarray(Y).reshape(-1,1)

one_hot_timings=[]
one_hot_for_timings=[]
for i in range(100,10000,100):
    start=time.time()
    _,_=one_hot(Y[:i])
    end=time.time()
    one_hot_timings.append(end-start)

    start=time.time()
    _,_=one_hot_for(Y[:i])
    end=time.time()
    one_hot_for_timings.append(end-start)

plt.plot(one_hot_timings,label="one_hot_vector")
plt.plot(one_hot_for_timings,label="one_hot_for")
plt.xlabel('data_size for every 100 datapoints')
plt.ylabel('time of execution')
plt.legend(loc='best')
plt.show()

Result

one hot encoding algorithm execution time comparison
Comparison for time of execution for different one-hot encoding algos

Additional notes

  • Libraries such as scipy, torch, sklearn etc, could probably do this faster.
  • Depending on how often you call the above functions in your applications, it might not be relevant for you to choose between the above two methods as both need less than a few seconds at most.
  • The assimilated code for the above can be found here.

Categories
Algorithms

When is the Best Sorting Algorithm the best?

There are many a problems in the real world and to each problem, a solution. In computer science, this solution is referred to as an algorithm. However, to a single problem, there often are many a algorithms to achieve the required result. For finding the best solution, we do time complexity analysis(here on referred to as TCA).

For those of you who know the basics of TCA, would know that we select the algorithm which performs best for a very large input size. For example, given below is the graph depicting the run-time of two algorithms A and B  for increasing input sizes.

example

According to the conventional TCA, the algorithm A is better than the algorithm B, as for large input sizes the average time of execution is less for the algorithm A.

The algorithm A, as we can see, has a linear time complexity, O(n), where as its quadratic for algorithm B, O(n²). In general, in TCA we consider an O(n) algorithm better than a O(n²) algorithm, as for a very large input size (an input size tending to infinity actually), a O(n) algorithm will work better than a O(n²) algorithm.

Now consider the following example,example3

Here again according to TCA, the algorithm C is better than the algorithm D. However, notice that for input sizes less than 10,000 , the algorithm D performed better than the algorithm C.

Note that, an input size of 10,000 is quite big, and might be the actual input size of the problem. With the conventional TCA we would have used the algorithm C for our practical application, when it clearly should have been better to use the algorithm D.

A very common problem in computers, is that of sorting. That is given an input array of numbers, give an output array with increasing/decreasing order of the elements of the input array. For example,

Input: [7,3,8,4,2,5,1,6,9,0]

Output: [0,1,2,3,4,5,6,7,8,9]

There are as we know, quite a few solutions to this problem, you can check out these algorithms here.

Five most common sorting algorithms are

  1. Bubble Sort
  2. Insertion Sort
  3. Selection Sort
  4. Quick Sort
  5. Merge Sort

The time complexities of the above algorithms are as follows,

table_low

We know that quick sort is the best sorting algorithm among all the above five algorithms. For big input sizes, you may checkout this comparison.

However, the question that we are trying to address here, is that after what input size is quick sort the best, and which is the best sorting algorithm for input sizes less than that input size?

I wrote a python code which plots the variations of average time of execution (averaged for 1000 inputs of the same size) for various input sizes. I took the actual time of execution on my machine rather than the counts as this is exercise is for finding the most practical solution.

big_size3

The reason for a lot of noise on in the graph are, the operating system overheads. It is to be noted that, I closed as many applications and background processes as I could. 

In spite of the noise, we still are able to make out that for a large input size what is the order or preference for sorting algorithms.

Quick > Merge > Insertion > Selection > Bubble

I ran multiple runs of the above code for small input sizes as well,

Here we notice that for input sizes <40 insertion sort performs the best out of all the above algorithms. So if you are working on some application which requires sorting of less than input sizes 40, you might want to consider using insertion sort rather than quick sort!

The code that I used for the above analysis can be found in the link,

https://github.com/devarshi16/sorting_comparison

Remember to like this post and start my GitHub repository if this post helped you!

Note that, you need to close as many applications as you can before the execution of the code. Once the code is running do not disturb the computer as it will raise additional overheads. If possible try to run the code while the system is offline.

There are still a lot of things that you can do with this code. Two such things are,

  1. Add more sorting algorithms for comparison from the plethora of sorting algorithms available here
  2. If you look carefully, you will see that merge sort performs horribly for small input sizes (probably attributed to the fact that it uses extra space for sorting). However, for big input sizes, it is the second best sorting algorithm. When does that happen? Try to find out!

Feel free to comment your opinions and pointing out errors. Thanks!