
Linear vs Logistic Regression, all in Numpy

Linear and logistic regression, the two entry-level machine learning algorithms, are easy to understand and provide a good way to practice coding the general machine learning pipeline in its vectorized form. Namely,

  • Prepping the dataset, e.g. removing outliers, adding features (polynomial combinations of existing features), normalization, feature scaling.
  • Implementing the learning algorithm function.
  • Calculating the loss through the chosen loss function and hypothesis.
  • Optimization algorithm – updating your model’s parameters based on the loss and the ground truth.
  • Maintaining a config file for the whole training process (a minimal sketch follows this list).
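Since the original config file is not shown in this post, here is a minimal sketch of what such a config could hold, whether as a separate file or a simple dict (all names and values below are illustrative, not from the original gist):

# Hypothetical training config kept in one place (values are illustrative)
CONFIG = {
    "data_file": "diabetes_scaled.csv",   # assumed location of the preprocessed CSV
    "split": (0.6, 0.2, 0.2),             # train/val/test ratios
    "learning_rate": 0.001,
    "num_iters": 500,
    "threshold": 0.5,                      # decision threshold for the binary label
    "random_seeds": list(range(10)),       # one model trained per seed
}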

An entry-level toy dataset, the Pima Indians Diabetes dataset, will be used for this tutorial. It has a single target variable per datapoint: a binary classification task of predicting whether a person has diabetes or not.

About the dataset

You can find everything about the dataset on Kaggle. It has 768 datapoints and 8 features, a sweet spot for getting a decent model without worrying about underfitting. The data is about female patients: their BMI, insulin level, age, skin thickness, glucose level, etc. The target tells whether the person has diabetes or not. Of all the datapoints, 500 are non-diabetic and 268 are diabetic. The classes are not highly skewed, so plain test accuracy should suffice (no strict need for precision, recall and F1 score).
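If you want to verify the class balance yourself, a quick check could look like this (a small sketch; it assumes the raw Kaggle CSV with a header row and the 0/1 Outcome label in the last column):

import numpy as np

# Load the raw CSV, skipping the header row (file name is assumed)
data = np.loadtxt("diabetes.csv", delimiter=",", skiprows=1)
labels = data[:, -1]
print("non-diabetic:", int(np.sum(labels == 0)))   # expected: 500
print("diabetic:", int(np.sum(labels == 1)))       # expected: 268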

Helper Functions

For normalizing the dataset

This normalizes the data, bringing every feature into the 0–1 range (min-max scaling).

def scalify_min_max(np_dataframe):
    # Per-column minimum and maximum of the feature matrix
    minimum_array=np.amin(np_dataframe,axis=0)
    maximum_array=np.amax(np_dataframe,axis=0)
    range_array = maximum_array-minimum_array

    # Scale every feature into [0, 1]
    scaled = (np_dataframe-minimum_array)/range_array
    return scaled

For calculating the accuracy

def accuracy_calculator(Y_out,Y):
    # Fraction of predictions that match the ground truth
    accuracy=np.sum(np.logical_not(np.logical_xor(Y_out,Y)))/Y.shape[0]
    true_positives=np.sum(np.logical_and(Y_out,Y))
    false_positives=np.sum(np.logical_and(Y_out,np.logical_not(Y)))
    false_negatives=np.sum(np.logical_and(np.logical_not(Y_out),Y))
    precision=true_positives/(true_positives+false_positives)
    recall=true_positives/(true_positives+false_negatives)
    print("Precision:",precision,".Recall:",recall)
    F1_score=2*precision*recall/(precision+recall)
    return [accuracy,precision,recall,F1_score]
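Precision is the fraction of predicted positives that are actually positive, recall is the fraction of actual positives that were predicted positive, and the F1 score is their harmonic mean, 2 * precision * recall / (precision + recall).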

For preparing the dataset – creating train/val/test splits

def pre_data_prep(filename,dest_fileloc):
    with open(filename,'rb') as f:
        gzip_fd=gzip.GzipFile(fileobj=f)
        next(gzip_fd)#Skip first row
        diabetes_df = loadtxt(gzip_fd,delimiter=',',dtype=np.float32)
    Y=diabetes_df[:,-1]
    scaled_diabetes_df = scalify_min_max(diabetes_df[:,:-1])
    concat_diabetes = np.concatenate((scaled_diabetes_df,np.array([Y]).T),axis=1)
    savetxt(dest_fileloc,concat_diabetes,delimiter=',')

def dataprep(fileloc,split):
    assert len(split) == 3
    assert sum(split) == 1
    diabetes_data = loadtxt(fileloc,delimiter=',',dtype=np.float32)
    Y=np.array([diabetes_data[:,-1]]).T
    classes = np.unique(Y)
    assert len(classes) == 2
    X=diabetes_data[:,:-1]
    data_size=X.shape[0]
    print(data_size,X.shape,Y.shape)

    split_size=int(split[0]*data_size)
    val_split=int(split[1]*data_size)
    X_train=X[:split_size]
    X_val=X[split_size:split_size+val_split]
    X_test=X[split_size+val_split:]
    Y_train=Y[:split_size]
    Y_val=Y[split_size:split_size+val_split]
    Y_test=Y[split_size+val_split:]
    return X_train,X_val,X_test,Y_train,Y_val,Y_test
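Note that dataprep splits the data in file order without shuffling. If your CSV happens to be ordered by class, a small shuffle step before splitting can help; a minimal sketch (not part of the original pipeline):

def shuffle_rows(data, seed=0):
    # Shuffle datapoints so each split gets a similar class mix
    rng = np.random.default_rng(seed)
    return data[rng.permutation(data.shape[0])]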

Evaluation function

For finding the accuracy of the learned model on the test dataset.

def evaluate(theta_params,X,Y=None,thresh=0.5):
    data_size=X.shape[0]
    # Add the bias column, then score with the linear hypothesis and squared-error cost
    X_extend=np.concatenate((np.ones((data_size,1)),X),axis=1)
    pred = np.greater(np.matmul(X_extend,theta_params),thresh)*1
    cost=np.sum(np.square(np.matmul(X_extend,theta_params)-Y))/(data_size*2)
    return pred,cost
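Note that evaluate scores every model with the linear hypothesis and the squared-error cost, even when the parameters come from logistic regression. If you want the logistic model’s validation/test cost to be directly comparable to its training curve, a sigmoid-based variant could look like this (a sketch, not part of the original gist):

def evaluate_logistic(theta_params, X, Y, thresh=0.5):
    data_size = X.shape[0]
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1)
    h = 1 / (1 + np.exp(-np.matmul(X_extend, theta_params)))  # sigmoid hypothesis
    pred = np.greater(h, thresh) * 1
    eps = 1e-12  # avoid log(0)
    cost = -np.sum(Y * np.log(h + eps) + (1 - Y) * np.log(1 - h + eps)) / data_size
    return pred, cost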

Logistic Regression Function
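The model uses the sigmoid hypothesis h(x) = 1 / (1 + e^(-theta^T x)), minimizes the cross-entropy cost J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) ), and each iteration applies the batch gradient-descent update theta := theta - (alpha/m) * X^T (h - y), which is exactly what the loop below computes.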

def sigmoid_func(theta,X):
    retval = 1/(1+np.exp(-1*np.matmul(theta.T,X)))
    return retval
    
def logistic_regression(X,Y,learning_rate=0.001,num_iters=100,thresh=0.5,rand_seed=None):
    if rand_seed!=None:#For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params=np.array([np.random.randn(X.shape[1]+1)]).T
    #Add bias column to X
    X_extend = np.concatenate((np.ones((data_size,1)),X),axis=1).T
    cost=[]#Keep track of cost after each iteration of learning
    for i in tqdm(range(num_iters),desc="Training.."):
        h_theta=sigmoid_func(theta_params,X_extend).T#mX1
        grad=np.matmul(X_extend,(h_theta-Y))/data_size#nXm*mX1=nX1
        theta_params=theta_params-learning_rate*grad
        cost.append(-1*np.sum(Y*np.log(h_theta)+(1-Y)*np.log(1-h_theta))/(data_size))
    final_pred = np.greater(np.matmul(X_extend.T,theta_params),thresh)*1
    accuracy=np.sum(np.logical_not(np.logical_xor(final_pred,Y)))/data_size
    cost=np.array(cost)
    return theta_params,accuracy,cost

Linear Regression Function
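Here the hypothesis is simply h(x) = theta^T x, the cost is the mean squared error J(theta) = (1/2m) * sum( (theta^T x - y)^2 ), and the update is again theta := theta - (alpha/m) * X^T (X theta - y). Because the task is binary classification, the regression output is thresholded at 0.5 to turn it into a class label.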

def linear_regression(X,Y,learning_rate=0.001,num_iters=100,thresh=0.5,rand_seed=None):
    if rand_seed!=None:
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    #print(X.shape,Y.shape)
    theta_params=np.array([np.random.randn(X.shape[1]+1)]).T
    X_extend = np.concatenate((np.ones((data_size,1)),X),axis=1)
    cost=[]
    for i in tqdm(range(num_iters),desc="Training.."):
        theta_params=theta_params-learning_rate*np.matmul((np.matmul(theta_params.T,X_extend.T)-Y.T),X_extend).T/data_size
        cost.append(np.sum(np.square(np.matmul(X_extend,theta_params)-Y))/(data_size*2))
    final_pred = np.greater(np.matmul(X_extend,theta_params),thresh)*1
    accuracy=np.sum(np.logical_not(np.logical_xor(final_pred,Y)))/data_size
    cost=np.array(cost)
    return theta_params,accuracy,cost

Runner functions for Linear and Logistic Regressions

#######################--------Linear RUNNER---------###############################
def regression_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model,train_accuracy,cost=linear_regression(X_train,Y_train,rand_seed=random_seeds[i],num_iters=num_iters)
        print("Trial:",i,".Train Accuracy:",train_accuracy)
        all_models.append(model)
        plt.plot(x_axis,cost,label=str(random_seeds[i]))
        
        val_prediction,val_cost=evaluate(model,X_val,Y_val)
        accuracy_precision=accuracy_calculator(val_prediction,Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation Accuracy:",accuracy_precision)
        print("Validation Cost:",val_cost)

    #plt.legend()
    plt.title("Linear Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    max_accuracy_idx=np.argmax(all_val_accuracies)
    best_model=all_models[max_accuracy_idx]
    print(best_model.shape)
    #print(X_test.shape,Y_test.shape)
    test_pred,test_cost=evaluate(best_model,X_test,Y_test)
    print(test_pred.shape,test_cost)
    test_accuracy,test_precision,test_recall,test_f1=accuracy_calculator(test_pred,Y_test)
    print("Test accuracy:",test_accuracy,".Test cost:",test_cost)

#####################-------------LOGISTIC RUNNER--------------##########################
def logistic_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=1500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model,train_accuracy,cost=logistic_regression(X_train,Y_train,rand_seed=random_seeds[i],num_iters=num_iters)
        print("Trial:",i,".Train Accuracy:",train_accuracy)
        all_models.append(model)
        plt.plot(x_axis,cost,label=str(random_seeds[i]))

        val_prediction,val_cost=evaluate(model,X_val,Y_val)
        accuracy_precision=accuracy_calculator(val_prediction,Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation Accuracy:",accuracy_precision)
        print("Validation Cost:",val_cost)
    #plt.legend()
    plt.title("Logistic Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    max_accuracy_idx=np.argmax(all_val_accuracies)
    best_model=all_models[max_accuracy_idx]

    test_pred,test_cost=evaluate(best_model,X_test,Y_test)
    #print(test_pred.shape,print(test_cost))
    test_accuracy,test_precision,test_recall,test_f1=accuracy_calculator(test_pred,Y_test)
    print("Test accuracy:",test_accuracy,".Test cost:",test_cost)

Training Curves

Note that each of the two trainings below was performed with 10 different initial values of theta. The initial value of theta affects the overall training performance. The best of the 10 (by validation accuracy) was used for the final evaluation on the test dataset.

linear regression training
logistic regression training

Test Accuracies

Linear Regression:
Test accuracy: 0.7068965517241379 .Test cost: 0.14745936729023856

Logistic Regression:
Test accuracy: 0.646551724137931 .Test cost: 0.2865915372479961

Assimilated code

Gist Link

Additional Notes

  • The above code does not use regularization.
  • It may appear that for a few curves training was stopped prematurely, but in fact the test results were closer to optimal with the above training parameters.
  • Although linear regression appears to perform better in this case, it might give poorer results on other datasets.
  • Note that logistic regression took 3x as many iterations as linear regression to converge.
  • Initial model parameters (theta) are chosen randomly by varying the seed value. Since the initial theta affects the overall training performance, 10 such values were tried.


Fast One Hot Encoding using Numpy (No For Loops)

In a lot of artificial intelligence applications, especially in supervised classification problems where a label is available for each datapoint, we often have a one-dimensional array containing the class of each datapoint. Depending on the machine learning algorithm we are going to apply to the dataset, we might need to one-hot encode these labels.

What is One Hot Encoding?

Say, we are given a labels array like,

>>> Y=numpy.random.randint(15,size=10).reshape(-1,1)
>>> Y
array([[12],
       [ 8],
       [ 4],
       [ 4],
       [11],
       [ 0],
       [10],
       [10],
       [ 1],
       [ 8]])

Each row contains the class number of the corresponding datapoint in X. Note that in our example there are 7 unique classes, namely 0, 1, 4, 8, 10, 11, 12. The one-hot encoding of the above array Y will be,

>>> one_hot(Y)
array([[0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.]])

Each row has exactly one column set to 1, representing the class to which the datapoint belongs. For example, the first row has the last column set to 1, which is the column for class ’12’. This means we also need the class-to-column mapping along with the one-hot encoding. Something like,

array([[ 0],
       [ 1],
       [ 4],
       [ 8],
       [10],
       [11],
       [12]])

Here the row number of a class is its column in the one-hot encoding. It is simply the sorted order of all unique classes in Y.

The easy solution to the easy problem (for loop)

def one_hot_for(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)
    num_classes=classes.shape[0]

    one_hot=np.zeros((data_size,num_classes))
    for row in range(data_size):
        one_hot[row,np.where(classes==Y[row])[0]]=1

    return one_hot,classes

The hard solution to the easy problem (vectorized(ish))

def one_hot(Y):
    data_size=Y.shape[0]
    classes=np.unique(Y).reshape(-1,1)
    num_classes=classes.shape[0]

    # Lookup table mapping each original class label to its column index
    # (0..num_classes-1) in the one-hot matrix
    class_mappings=np.arange(0,Y.max()+1)
    class_mappings[np.unique(classes)]=np.arange(num_classes)
    Y=class_mappings[Y]

    # Set one column per row in a single fancy-indexing assignment
    one_hot=np.zeros((data_size,num_classes))
    one_hot[np.arange(data_size).reshape(-1,1),Y.reshape(-1,1)]=1
    class_col=np.sort(classes)
    return one_hot,class_col
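As a quick sanity check, the two implementations can be compared on the same labels (an illustrative snippet; it assumes both functions live in helper_functions.py, as in the timing script below):

import numpy as np
from helper_functions import one_hot, one_hot_for

Y = np.random.randint(15, size=10).reshape(-1, 1)
enc_for, classes_for = one_hot_for(Y)
enc_vec, classes_vec = one_hot(Y)
assert np.array_equal(enc_for, enc_vec)          # same encodings
assert np.array_equal(classes_for, classes_vec)  # same class-to-column mapping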

Speed comparison

Generating random labels and storing in a file

import random

file_name="randoms.txt"
with open(file_name,"w+") as random_labels:
    for i in range(10000):
        random_labels.write(str(random.randint(0,1000))+"\n")

The above code will generate 10,000 random numbers and store them on individual lines in the file randoms.txt

Script for comparing the two functions

import matplotlib.pyplot as plt
import time
import numpy as np
from helper_functions import one_hot,one_hot_for

filename= "randoms.txt"

with open(filename,"r+") as f:
    Y=f.readlines()
    int_map=map(int,Y)
    Y=list(int_map)
    Y=np.asarray(Y).reshape(-1,1)

one_hot_timings=[]
one_hot_for_timings=[]
for i in range(100,10000,100):
    start=time.time()
    _,_=one_hot(Y[:i])
    end=time.time()
    one_hot_timings.append(end-start)

    start=time.time()
    _,_=one_hot_for(Y[:i])
    end=time.time()
    one_hot_for_timings.append(end-start)

plt.plot(one_hot_timings,label="one_hot_vector")
plt.plot(one_hot_for_timings,label="one_hot_for")
plt.xlabel('data size (x100 datapoints)')
plt.ylabel('time of execution')
plt.legend(loc='best')
plt.show()
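A small aside: time.perf_counter() is generally a better choice than time.time() for measuring short durations, since it uses a monotonic, high-resolution clock; the comparison below would likely look much the same either way.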

Result

one hot encoding algorithm execution time comparison
Comparison of execution time for different one-hot encoding algorithms

Additional notes

  • Libraries such as scipy, torch, sklearn, etc. can probably do this faster.
  • Depending on how often you call these functions in your application, the choice between the two methods may not matter much, as both take at most a few seconds.
  • The assimilated code for the above can be found here.


Image Labelling tool for annotating object detection models

Create or modify your bounding boxes and annotations using this image labelling tool made with Python and tkinter. If you have a custom JSON format, you can modify the existing code to fit your needs.

Link to github

$ python gui.py


(Slightly) Non-Trivial Implementation of CLI Hangman. A Hands-on Python Tutorial.

Bored of learning python the regular way? Make a command line interface Hangman game and learn while you’re at it.

First try out the game for yourself! Run the following in your terminal,

pip install hangman-ultimate
hangman-ultimate

Never played the chalkboard game of Hangman before? The rules are simple: you have to guess the name of a random animal letter by letter. Every wrong guess brings the man closer to hanging. In retrospect, the game is a little dark for me to have played as a kid.

Alternatively, you can download the code from here.

Remember to like this post and star my repository on GitHub if this post helped you!

Tear apart the code and make sure you have learnt the following by the end of it,
1. Opening a file and reading from it line by line.
2. Making multi-line strings.
3. Python classes and class instances (objects).
4. Class fields (variables) and methods (functions).
5. Passing class objects to functions.
6. Shared and non-shared fields (variables) amongst objects of the same class.
7. Calling system (terminal) commands via the Python script.
8. Catching an error using a try/except block.
9. Making basic command line animations.
10. Making programs compatible with both Python 2 and Python 3.
11. Pip packaging your Python programs!

Got more ideas for CLI based games? Put them in the comments below!


Custom ASCII art and random one-liner welcome messages on Linux Terminal start-up

Wouldn’t it be cool if your terminal welcomed you with ASCII art and a random one-liner each time you start it up!

Like this:-

Custom welcome to terminal

Remember to like my post and star my GitHub repository if this post helps you!

Linux users can simply install my poketerm pip package! (Sorry, Mac owners; for the time being, use the custom method described after this.)

Remember to install with sudo permissions!

$ sudo pip install poketerm
$ poketerm -t 1
$ poketerm -h

You can customise the above according to your needs, or you can save yourself some trouble and use my code on GitHub, which does everything that I am going to describe below.

Firstly download and put this file in your /home/$USER/ folder.

All you need to do now is edit the .bashrc shell script that Bash (the terminal) runs whenever it is started interactively. You can put any command in that file that you could type at the command prompt, and it will run whenever you start Bash.

Note: The following makes changes to Bash, the default terminal shell. If you are using some other shell, you will have to make the changes in its analogous config file.

NOTE FOR MAC USERS: The .bashrc equivalent on Mac is .bash_profile; Mac users may use that file for the following changes.

After opening this file with your favourite editor, with a command like

vim ~/.bashrc

you will find a line containing the following near the start:

#!/bin/bash

Just after that line, add this small piece of code that I wrote:

echo "Welcome $USER "
echo "|\_                  _"
echo " \ \               _/_|"
echo "  \ \_          __/ /"
echo "   \  \________/   /"
echo "    |              |"
echo "    /              |"
echo "   |   0       0   |"
echo "   |       _       |"
echo "   |()    __    () |"
echo "    \    (__)      |"
file="/home/$USER/fortunes.txt"
if [ -f "$file" ]
then
    shuf -n 1 $file
fi

The file should look similar to this now:

.bashrc code after addition of lines

Save it and you are good to go!

Most of the above code is self-explanatory. The only thing that you might not get is the statement inside the if conditional: if the file exists, a random line is returned out of the 10,000 lines in the fortunes.txt file.

What more can you do?

You can create your own custom ASCII art, or you can find some here. Just format it and replace my noobish art with it. Keep in mind not to use art with very long lines or too many lines, else it might not appear as intended. Also keep in mind the characters used in the art. Test the piece of code as a separate shell script before you add it to the file. You can also save multiple pieces of art in a file/folder and invoke one randomly, just like the one-liner.


When is the Best Sorting Algorithm the best?

There are many problems in the real world and, to each problem, a solution. In computer science this solution is referred to as an algorithm. However, for a single problem there are often many algorithms that achieve the required result. To find the best one, we do time complexity analysis (from here on referred to as TCA).

Those of you who know the basics of TCA will know that we select the algorithm which performs best for a very large input size. For example, given below is a graph depicting the run-time of two algorithms, A and B, for increasing input sizes.

(Graph: run-time of algorithms A and B vs input size)

According to conventional TCA, algorithm A is better than algorithm B, as for large input sizes the average time of execution is less for algorithm A.

Algorithm A, as we can see, has a linear time complexity, O(n), whereas algorithm B’s is quadratic, O(n²). In general, in TCA we consider an O(n) algorithm better than an O(n²) algorithm, as for a very large input size (an input size tending to infinity, actually) an O(n) algorithm will perform better than an O(n²) one.

Now consider the following example:

(Graph: run-time of algorithms C and D vs input size)

Here, again according to TCA, algorithm C is better than algorithm D. However, notice that for input sizes less than 10,000, algorithm D performed better than algorithm C.

Note that an input size of 10,000 is quite big, and might be the actual input size of the problem. With conventional TCA we would have used algorithm C for our practical application, when it would clearly have been better to use algorithm D.

A very common problem in computing is sorting: given an input array of numbers, produce an output array with the elements of the input array in increasing/decreasing order. For example,

Input: [7,3,8,4,2,5,1,6,9,0]

Output: [0,1,2,3,4,5,6,7,8,9]

As we know, there are quite a few solutions to this problem; you can check out these algorithms here.

The five most common sorting algorithms are

  1. Bubble Sort
  2. Insertion Sort
  3. Selection Sort
  4. Quick Sort
  5. Merge Sort

The time complexities of the above algorithms are as follows,

Algorithm        Average case    Worst case
Bubble Sort      O(n²)           O(n²)
Insertion Sort   O(n²)           O(n²)
Selection Sort   O(n²)           O(n²)
Quick Sort       O(n log n)      O(n²)
Merge Sort       O(n log n)      O(n log n)

We know that quick sort is generally the best of the above five algorithms. For big input sizes, you may check out this comparison.

However, the question we are trying to address here is: after what input size is quick sort the best, and which sorting algorithm is best for input sizes below that?

I wrote a Python script which plots the average time of execution (averaged over 1000 inputs of the same size) for various input sizes. I took the actual time of execution on my machine rather than operation counts, since this exercise is about finding the most practical solution.
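The full script is linked at the end of the post; a minimal sketch of the idea, with illustrative parameters (fewer trials and only the built-in sort, so it runs quickly), looks like this:

import random
import time
import matplotlib.pyplot as plt

def average_sort_time(sort_fn, size, trials=10):
    # Average the wall-clock time of sort_fn over several random arrays of the same size
    total = 0.0
    for _ in range(trials):
        arr = [random.randint(0, 10**6) for _ in range(size)]
        start = time.time()
        sort_fn(arr)
        total += time.time() - start
    return total / trials

sizes = list(range(100, 2001, 100))
# sorted() stands in here for any of the five sorting implementations being compared
timings = [average_sort_time(sorted, n) for n in sizes]
plt.plot(sizes, timings, label="built-in sorted")
plt.xlabel("input size")
plt.ylabel("average execution time (s)")
plt.legend()
plt.show()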

(Graph: average execution time of the five sorting algorithms for large input sizes)

The reason for a lot of noise in the graph is operating system overhead. It is to be noted that I closed as many applications and background processes as I could.

In spite of the noise, we can still make out the order of preference for the sorting algorithms at large input sizes:

Quick > Merge > Insertion > Selection > Bubble

I also ran multiple runs of the above code for small input sizes.

Here we notice that for input sizes below about 40, insertion sort performs best out of all the above algorithms. So if your application requires sorting fewer than roughly 40 elements, you might want to consider using insertion sort rather than quick sort!

The code that I used for the above analysis can be found at the link below,

https://github.com/devarshi16/sorting_comparison

Remember to like this post and star my GitHub repository if this post helped you!

Note that you should close as many applications as you can before executing the code. Once the code is running, do not disturb the computer, as that will add extra overhead. If possible, run the code while the system is offline.

There are still a lot of things that you can do with this code. Two such things are,

  1. Add more sorting algorithms for comparison from the plethora of sorting algorithms available here.
  2. If you look carefully, you will see that merge sort performs horribly for small input sizes (probably because it uses extra space for sorting). However, for big input sizes it is the second-best sorting algorithm. Where does the crossover happen? Try to find out!

Feel free to comment with your opinions and point out errors. Thanks!