Imagine that you have a collection of images. Those images can be divided into a few separate groups. Sorting them is a classification problem if you know what the groups are, and a clustering problem if you don't.
Today we will learn how to build a simple machine learning classifier using these Python libraries:
- scikit-learn
- numpy
- matplotlib
What is a classifier? A classifier is an algorithm that you train on items with known classes and that can then predict the classes of new items.
To solve our image classification problem we will use scikit-learn.
Scikit-learn is a Python library for machine learning. It has state-of-the-art classifiers already implemented, and they are simple to use.
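To make the difference between classification and clustering a bit more concrete, here is a minimal sketch. The toy points and the 'small'/'big' labels are made up for illustration, and KMeans is just one possible clustering algorithm, chosen here as an assumption and not something the rest of this post relies on.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels = ['small', 'small', 'small', 'big', 'big', 'big']

# Classification: the groups are known, so we train on labelled examples.
classifier = GaussianNB()
classifier.fit(points, labels)
predicted = classifier.predict([[1.1, 0.9], [8.1, 8.0]])
print(predicted)

# Clustering: no labels are given; the algorithm only assigns group ids.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(points)
print(cluster_ids)
```

The classifier answers with the names we taught it, while the clustering algorithm can only say which points belong together.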
Very simple classification problem
We have to start with data. Let's imagine that we have a zoo.
In our zoo, there are three kinds of animals:
- mice
- elephants
- giraffes
These animals have features such as height and weight. Given a training set of already known animals, how do we classify newly arrived animals?
Preparing data
Let's create our data:
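from random import random

# Each animal is a (weight, height) pair.
# NOTE: the definition of mice_features is missing from the original text;
# the ranges below are an assumption, chosen so that the rest of the code runs.
mice_features = [(random() * 2 + 1, random() * 2 + 2) for x in range(5)]
giraffe_features = [(random() * 4 + 3, random() * 2 + 30) for x in range(4)]
elephant_features = [(random() * 3 + 20, (random() - 0.5) * 4 + 23) for x in range(6)]

# Features and matching labels, in the same order
xs = mice_features + elephant_features + giraffe_features
ys = (['mouse'] * len(mice_features) + ['elephant'] * len(elephant_features) +
      ['giraffe'] * len(giraffe_features))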
Visualization of features
OK, they're just numbers. Let's visualize them with matplotlib:
from matplotlib import pyplot as plt

fig, axis = plt.subplots(1, 1)

mice_weight, mice_height = zip(*mice_features)
axis.plot(mice_weight, mice_height, 'ro', label='mice')

elephant_weight, elephant_height = zip(*elephant_features)
axis.plot(elephant_weight, elephant_height, 'bo', label='elephants')

giraffe_weight, giraffe_height = zip(*giraffe_features)
axis.plot(giraffe_weight, giraffe_height, 'yo', label='giraffes')

axis.legend(loc=4)
axis.set_xlabel('Weight')
axis.set_ylabel('Height')
First approach to classification
That looks simple to classify. Now we'll build and train a classifier with scikit-learn. Scikit-learn offers a very wide range of classifiers with different characteristics. Here is a comparison example with pictures.
Every classifier has its own benefits and drawbacks. For our example we will use the Gaussian naive Bayes classifier.
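from sklearn.naive_bayes import GaussianNB

# Train the classifier on the known animals
clf = GaussianNB()
clf.fit(xs, ys)

# Predict the class, and the per-class probabilities, of new animals
new_xses = [[2, 3], [3, 31], [21, 23], [12, 16]]
print(clf.predict(new_xses))
print(clf.predict_proba(new_xses))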
['mouse' 'giraffe' 'elephant' 'elephant']
[[ 0.00000000e+000 0.00000000e+000 1.00000000e+000]
 [ 9.65249329e-273 1.00000000e+000 2.21228571e-285]
 [ 1.00000000e+000 5.47092266e-083 0.00000000e+000]
 [ 1.00000000e+000 2.73586896e-132 0.00000000e+000]]
It looks good!
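The probability table is easier to read once you know which column belongs to which animal. The columns follow the order of the classifier's classes_ attribute, which holds the labels in sorted order. Here is a small sketch of pairing the two; the fixed sample values are made up and only stand in for the random zoo data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical fixed (weight, height) samples standing in for the random zoo data.
features = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5],         # mice
                     [20.0, 22.0], [22.0, 24.0], [21.0, 23.0],   # elephants
                     [4.0, 30.0], [6.0, 32.0], [5.0, 31.0]])     # giraffes
labels = ['mouse'] * 3 + ['elephant'] * 3 + ['giraffe'] * 3

clf = GaussianNB()
clf.fit(features, labels)

# classes_ holds the labels in sorted order; predict_proba columns follow it.
print(clf.classes_)

# Pair every class label with its probability for one new animal.
new_sample = [[3.0, 31.0]]
probabilities = clf.predict_proba(new_sample)[0]
for label, probability in zip(clf.classes_, probabilities):
    print(label, probability)

# The predicted class is simply the one with the highest probability.
most_likely = clf.classes_[np.argmax(probabilities)]
print(most_likely)
```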
Summing up what we did:
- extracted features: weight and height for each imaginary animal
- prepared labels, which map features to particular types of animals
- visualized three groups of animals in feature space (weight on the x axis, height on the y axis) using matplotlib
- chose a classifier and trained it with our data
- predicted new samples
We were able to predict classes for new elements. But we don't know how well our classifier performs, so we cannot guarantee anything.
We have to find a method to score our classifiers to find the best one.
Testing our model
Scikit-learn has a guide on model selection and evaluation. It's worth reading.
The first things we can do are cross-validation, scoring, and visualization of decision boundaries.
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split


def plot_classification_results(clf, X, y, title):
    # Divide the dataset into training and testing parts
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Fit the classifier on the training part
    clf.fit(X_train, y_train)

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    h = .02  # step size in the mesh

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot the training points
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold)

    # Score on the test part and plot the test points (semi-transparent),
    # colored by their predicted class
    y_predicted = clf.predict(X_test)
    score = clf.score(X_test, y_test)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_predicted, alpha=0.5, cmap=cmap_bold)

    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title(title)
    return score
We can later use this function like this:
# Features as a numpy array, labels as integers (0 = mouse, 1 = elephant, 2 = giraffe)
xs = np.array(xs)
ys = [0] * len(mice_features) + [1] * len(elephant_features) + [2] * len(giraffe_features)

score = plot_classification_results(clf, xs, ys, "3-Class classification")
print("Classification score was: %s" % score)
Classification score was: 1.0
Cool! But what actually happened there?
First we converted the features to a numpy array and the labels to integer values instead of string names. It doesn't change much, but it helps with visualization.
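We built the integer labels by hand, but the same conversion can also be done with scikit-learn's LabelEncoder. The sketch below is an alternative, not what the code above does, and the short list of names is made up. Note that LabelEncoder assigns codes in sorted order, so its mapping (elephant 0, giraffe 1, mouse 2) differs from the hand-made one (mouse 0, elephant 1, giraffe 2).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical list of string labels
names = ['mouse', 'mouse', 'elephant', 'giraffe', 'elephant']

encoder = LabelEncoder()
codes = encoder.fit_transform(names)
print(codes)             # one integer code per name
print(encoder.classes_)  # the position in this array is the integer code

# The encoder can also translate the integers back to names.
restored = encoder.inverse_transform(codes)
print(restored)
```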
In the plotting function we:
- split the dataset into training and testing parts
- trained the classifier with the fit method
- created a meshgrid and predicted Z values on it to generate decision boundaries
- plotted decision boundaries
- plotted training data
- plotted testing data in lighter color on the same plot
- scored classifier and returned score
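The meshgrid step is the least obvious one in that list, so here is a tiny sketch of it on a 3 x 2 grid instead of the fine mesh used for plotting. The threshold rule is only a stand-in for clf.predict, assumed for illustration.

```python
import numpy as np

# A tiny 3 x 2 grid instead of the fine mesh used for plotting.
x_values = np.array([0.0, 1.0, 2.0])
y_values = np.array([10.0, 20.0])
xx, yy = np.meshgrid(x_values, y_values)
print(xx.shape)  # (2, 3): one row per y value, one column per x value

# Flatten both grids and pair them up: one (x, y) row per grid point.
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)

# Stand-in for clf.predict: any function returning one value per point.
fake_predictions = (grid_points[:, 0] > 0.5).astype(int)

# Reshape back to the grid so it lines up with xx and yy for pcolormesh.
Z = fake_predictions.reshape(xx.shape)
print(Z)
```

Every cell of Z holds the predicted class of one grid point, which is why coloring the cells draws the decision regions.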
Our dataset was extremely simple to classify. Real datasets are messier.
Testing our model on a more complicated dataset
How will our method work on a more complicated dataset? Scikit-learn has a module with popular machine learning datasets.
One of them is the iris dataset:
import numpy as np
from sklearn import datasets
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# there are three classes of iris flowers
print(np.unique(iris.target))

Let's look in more depth at how our validation split works. We use the standard train_test_split function. We pass it the features and the labels, and we get back two randomized subsets of the desired size. It's very handy.

# Keep 40% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
There are also more complicated cross-validation methods that use more of our training data, which is valuable for us.
One of the most popular is K-fold. KFold divides the dataset into K groups, uses K - 1 of them for training and leaves the remaining one for testing. The held-out group can be chosen in K ways, so we can use KFold as a generator of K pairs of training and testing sets.
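Here is a minimal sketch of K-fold on the iris dataset, assuming a recent scikit-learn where KFold and cross_val_score live in sklearn.model_selection. The choice of 5 folds, the shuffling and the fixed random_state are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# shuffle=True because the iris samples are stored sorted by class.
# random_state is fixed only to make the sketch repeatable.
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

# Each iteration yields the indices of one training/testing pair.
fold_scores = []
for train_index, test_index in k_fold.split(iris.data):
    clf = GaussianNB()
    clf.fit(iris.data[train_index], iris.target[train_index])
    fold_scores.append(clf.score(iris.data[test_index], iris.target[test_index]))

print(fold_scores)
print(np.mean(fold_scores))

# cross_val_score wraps the same loop in a single call.
shortcut_scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=k_fold)
print(shortcut_scores)
```

Every sample ends up in a test set exactly once, so the mean of the fold scores is a steadier estimate than the score from a single split.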
So let's test how well our classifier performs on the iris dataset. We will use only two of the four features so that we can visualize the result on a plane.
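clf = GaussianNB()
# Use only the first two feature columns so the result can be plotted in 2-D
score = plot_classification_results(clf, X_train[:, :2], y_train, "3-Class classification")
print("Classification score was: %s" % score)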
And our score is 0.83 (1 is the best possible).
Summary
It could be better. We could use all four available features, use better parameters in the classifier, or choose another classifier... There are many ways to approach improving our classification.
In the next post we'll learn how to create and choose good features and how to choose the best options for a model.
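As a quick check of the first idea, this sketch compares the same classifier on two features and on all four. It assumes a recent scikit-learn and uses cross_val_score with 5 folds, which is a choice made for this example and not the split used earlier in the post.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# Mean 5-fold accuracy using only the first two features (as in the plot).
two_feature_score = cross_val_score(
    GaussianNB(), iris.data[:, :2], iris.target, cv=5).mean()

# The same classifier with all four features.
all_feature_score = cross_val_score(
    GaussianNB(), iris.data, iris.target, cv=5).mean()

print(two_feature_score, all_feature_score)
```

The price of using all four features is that we can no longer draw the decision boundaries on a plane.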