Imagine that you have a collection of images that can be divided into a few separate groups. Sorting them out is a classification problem if you know what the groups are, and a clustering problem if you don't.

Today we will learn how to do simple machine learning classification using these Python libraries:

  • scikit-learn
  • numpy
  • matplotlib

What is a classifier? A classifier is an algorithm that you train on examples with known classes and that can then predict the classes of new items.

To solve our image classification problem we will use scikit-learn.

Scikit-learn is a Python library for machine learning. It has state-of-the-art classifiers already implemented for us, and they are simple to use.
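All scikit-learn classifiers share the same fit/predict interface, so the workflow below stays the same whichever one you pick. Here is a minimal sketch of that interface, with made-up toy numbers:

from sklearn.naive_bayes import GaussianNB

# Toy data, made up purely for illustration: two samples per class.
features = [[0.1, 0.2], [0.2, 0.1], [20.0, 22.0], [21.0, 23.0]]
labels = ['mouse', 'mouse', 'elephant', 'elephant']

clf = GaussianNB()
clf.fit(features, labels)            # train on examples with known classes
print(clf.predict([[0.15, 0.15]]))   # -> ['mouse']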

Very simple classification problem

We have to start with data. Let's imagine that we have a zoo.

In our zoo, there are three kinds of animals:

  • mice
  • elephants
  • giraffes

Those animals have features such as height and weight. Given a training set of already known animals, how do we classify newly arrived ones?

Preparing data

Let's create our data:

from random import random


# Each feature tuple is (weight, height). The mice were missing from the
# original snippet, so their ranges here are assumed: small values, clearly
# separated from the other two groups.
mice_features = [(random() * 2, random() * 2) for x in range(10)]
giraffe_features = [(random() * 4 + 3, random() * 2 + 30) for x in range(4)]
elephant_features = [(random() * 3 + 20, (random() - 0.5) * 4 + 23)
                     for x in range(6)]

xs = mice_features + elephant_features + giraffe_features
ys = ['mouse'] * len(mice_features) + ['elephant'] * len(elephant_features) +\
     ['giraffe'] * len(giraffe_features)

Visualization of features

Ok, they're just numbers. Let's visualize them with matplotlib:

from matplotlib import pyplot as plt

fig, axis = plt.subplots(1, 1)

mice_weight, mice_height = zip(*mice_features)
axis.plot(mice_weight, mice_height, 'ro', label='mice')

elephant_weight, elephant_height = zip(*elephant_features)
axis.plot(elephant_weight, elephant_height, 'bo', label='elephants')

giraffe_weight, giraffe_height = zip(*giraffe_features)
axis.plot(giraffe_weight, giraffe_height, 'yo', label='giraffes')

axis.legend(loc=4)
axis.set_xlabel('Weight')
axis.set_ylabel('Height')
plt.show()

[plot: mice, elephants and giraffes in weight-height feature space]

First approach to classification

That looks simple to classify. Now we'll build and train a classifier with scikit-learn. Scikit-learn offers a very wide range of classifiers with different characteristics; its documentation has a classifier comparison example with pictures.

Every classifier has its own benefits and drawbacks. For our example we will use the Gaussian Naive Bayes classifier.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

clf.fit(xs, ys)

new_xses = [[2, 3], [3, 31], [21, 23], [12, 16]]

print(clf.predict(new_xses))

print(clf.predict_proba(new_xses))
['mouse' 'giraffe' 'elephant' 'elephant']
[[  0.00000000e+000   0.00000000e+000   1.00000000e+000]
 [  9.65249329e-273   1.00000000e+000   2.21228571e-285]
 [  1.00000000e+000   5.47092266e-083   0.00000000e+000]
 [  1.00000000e+000   2.73586896e-132   0.00000000e+000]]

It looks good!
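One detail worth knowing: the columns of predict_proba follow the order of clf.classes_, which scikit-learn keeps sorted, so here they correspond to elephant, giraffe and mouse:

print(clf.classes_)
['elephant' 'giraffe' 'mouse']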

Summing up what we did:

  • extracted features: weight and height for each imaginary animal
  • prepared labels, which map features to particular types of animals
  • visualized three groups of animals in feature space - weight on x axis and height on y axis using matplotlib
  • chose classifier and trained with our data
  • predicted new samples

We were able to predict classes for new elements. But we don't know how well our classifier performs, so we cannot guarantee anything.

We have to find a method to score our classifiers to find the best one.

Testing our model

Scikit-learn has a guide on model selection and evaluation. It's worth reading.

The first things we can do are cross-validation with scoring, and visualization of decision boundaries.

import numpy as np
import pylab as pl
from matplotlib.colors import ListedColormap
from sklearn import cross_validation


def plot_classification_results(clf, X, y, title):
    # Divide dataset into training and testing parts
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, y, test_size=0.2)

    # Fit the data with classifier.
    clf.fit(X_train, y_train)

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    h = .02  # step size in the mesh
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.figure()
    pl.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    pl.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold)

    # Plot the testing points in a lighter shade, colored by their
    # predicted labels
    y_predicted = clf.predict(X_test)
    score = clf.score(X_test, y_test)
    pl.scatter(X_test[:, 0], X_test[:, 1], c=y_predicted, alpha=0.5,
               cmap=cmap_bold)
    pl.xlim(xx.min(), xx.max())
    pl.ylim(yy.min(), yy.max())
    pl.title(title)
    return score

We can later use this function like this:

xs = np.array(xs)
ys = [0] * len(mice_features) + [1] * len(elephant_features) + [2] * len(giraffe_features)

score = plot_classification_results(clf, xs, ys, "3-Class classification")
print "Classification score was: %s" % score
Classification score was: 1.0

[plot: decision boundaries with training and testing points]

Cool! But what actually happened there?

First we converted the features to a numpy array and the labels to integer values instead of string names. The array is needed because the plotting function indexes columns with X[:, 0], and integer labels map directly onto matplotlib colormaps, which helps in visualization.

In plotting function we:

  • split the dataset into training and testing parts for cross-validation
  • trained classifier with fit method
  • created meshgrid and predicted Z values on meshgrid to generate decision boundaries
  • plotted decision boundaries
  • plotted training data
  • plotted testing data in lighter color on the same plot
  • scored classifier and returned score

Our dataset was extremely easy to classify. Real datasets tend to be much messier.

Testing our model on more complicated dataset

How will our method work on a more complicated dataset? Scikit-learn has a module with popular machine learning datasets.

One of them is the Iris dataset:

import numpy as np
from sklearn import cross_validation
from sklearn import datasets

iris = datasets.load_iris()

# there are three classes of iris flowers
print(np.unique(iris.target))  # prints: [0 1 2]

Let's look in depth at how our cross-validation works.

We use the standard cross-validation function train_test_split. We pass it the features with labels and get back two randomized subsets of the desired size. It's very handy.

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4)

# Iris has 150 samples, so with test_size=0.4 we get 90 for training
# and 60 for testing
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

There are also more sophisticated cross-validation methods that make use of more of our training data, which is valuable for us.

One of the most popular is KFold. KFold divides the dataset into K groups (folds), chooses K - 1 of them for training and leaves the K-th one for testing. Since there are K ways to choose the held-out fold, we can use it as a generator yielding K pairs of training and testing indices.
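Here is a minimal sketch of how that could look with the same cross_validation module (the choice of 5 folds is arbitrary):

from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validation on Iris; we shuffle because the Iris samples
# are ordered by class. Each iteration yields index arrays for one split.
kf = cross_validation.KFold(len(iris.data), n_folds=5, shuffle=True)
scores = []
for train_index, test_index in kf:
    clf = GaussianNB()
    clf.fit(iris.data[train_index], iris.target[train_index])
    scores.append(clf.score(iris.data[test_index], iris.target[test_index]))
print(sum(scores) / len(scores))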

So let's test how well our classifier performs on the Iris dataset. We will use only two of the four features, for better visualization on the plane.

clf = GaussianNB()

score = plot_classification_results(clf, X_train[:, :2], y_train, "3-Class classification")
print("Classification score was: %s" % score)

[plot: decision boundaries on the Iris dataset]

And our score is about 0.83 (1.0 is the best possible).

Summary

It could be better. We could use all four available features, tune the classifier's parameters, or choose another classifier altogether... There are many ways we could approach improving our classification.
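For instance, a quick way to check the first idea is to score the classifier on all four features with cross_val_score from the same cross_validation module (a sketch; 5 folds is again an arbitrary choice):

from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB

# Average accuracy over 5 folds, this time using all four Iris features.
scores = cross_validation.cross_val_score(
    GaussianNB(), iris.data, iris.target, cv=5)
print(scores.mean())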

In the next post, we'll learn how to create and choose good features, and how to pick the best options for our model.