Machine Learning Primer
How do you classify pictures by what they represent? How do you cluster similar clients? How do you predict future traffic rates on your server?
Machine learning is a great tool for solving these kinds of problems. It turns out to be really easy to use in Python. You don't have to be a machine learning expert. To use the Python tools you need to know Python; here is a tutorial.
NumPy and SciPy are the backbone of scientific and numerical computing in Python. It's good to know at least the basics of both. Here is a tutorial to get you started.
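To give a taste of what "the basics" look like, here is a minimal sketch of NumPy's array arithmetic. The traffic numbers are made up for illustration; the point is only that whole arrays can be transformed and filtered without writing loops.

```python
import numpy as np

# Hypothetical data: requests per minute observed on a server.
traffic = np.array([120, 135, 128, 160, 210, 190, 175], dtype=float)

# Arithmetic applies element-wise to the whole array at once.
mean = traffic.mean()
std = traffic.std()
standardized = (traffic - mean) / std

# A boolean mask selects only the elements that satisfy a condition.
above_average = traffic[traffic > mean]

print("standardized:", np.round(standardized, 2))
print("above average:", above_average)
```

Standardizing features like this is a common preprocessing step before feeding data to a learning algorithm.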
To visualize data, features and results of learning I use matplotlib. It's a cool, powerful and useful tool.
Kinds of problems you can solve with machine learning
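Machine learning offers methods for solving different kinds of problems. We can divide them into classification, regression and clustering.
There is also the distinction between supervised learning, where the algorithm learns from examples with known answers (classification and regression), and unsupervised learning, where it looks for structure in unlabeled data (clustering).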
Here is a quick overview:
What is a classification problem?
The main goal of classification is to learn how to assign a new element to one of a set of known categories.
Algorithms:
- SVM
- nearest neighbors
- random forest
If you want to learn more, the lecture on classification from Princeton gives some more examples.
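To see how the three classifiers listed above can be tried side by side, here is a minimal sketch using scikit-learn. The choice of the bundled iris dataset, the 75/25 split and the hyperparameters are my assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Iris: 150 flowers, 4 measurements each, 3 species to tell apart.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check how each model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Assumed, untuned settings for each algorithm.
classifiers = {
    "SVM": SVC(),
    "nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # learn from labeled examples
    scores[name] = clf.score(X_test, y_test)   # accuracy on unseen flowers
    print(f"{name}: {scores[name]:.2f}")
```

All three share the same fit/score interface, so swapping one algorithm for another is a one-line change.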
Regression - how to predict continuous variables?
Algorithms:
- SVR
- ridge regression
- Lasso
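Here is a rough sketch of these regressors in scikit-learn. The data is synthetic (generated with make_regression), and the regularization strengths and the linear SVR kernel are assumptions picked for the example, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: 10 features, of which only 5 influence the target.
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumed, untuned settings. SVR is sensitive to feature scale,
# so it is wrapped in a pipeline that standardizes the inputs first.
regressors = {
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="linear", C=100.0)),
    "ridge regression": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

r2 = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    r2[name] = reg.score(X_test, y_test)  # R^2 on held-out data
    print(f"{name}: R^2 = {r2[name]:.3f}")
```

For regressors, score returns the coefficient of determination (R^2) rather than accuracy; 1.0 means a perfect prediction.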
Clustering - grouping similar things together!
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). -- Wikipedia
Applications:
- customer segmentation
- grouping experiment outcomes
- learning more about data
Algorithms:
- k-Means
- spectral clustering
- affinity propagation
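As an illustration of the first algorithm on the list, here is a minimal k-Means sketch with scikit-learn. The points are synthetic blobs, and the number of clusters (3) is assumed to be known in advance, which is rarely the case with real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# No labels are passed in: k-Means groups points by distance alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("cluster centers:")
print(kmeans.cluster_centers_)
```

Note that fit_predict takes only X. That is the practical difference between unsupervised and supervised learning in this API.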
What are good tools and how can I start using them?
If you don't know where to start with solving your machine learning problems, start with a tool. One example of a great machine learning tool is the scikit-learn library.
Its documentation is just amazing. You can learn not only about the library and ways to use it, but also how these methods work (the logic behind them) - look at their clustering guide.
There are tutorials, examples... you can click on any figure to learn how it was generated - as an example: classifier comparison.
I'm twelve and what is this? - How to get some insight?
Although scikit-learn offers great tools for solving problems, it doesn't tell you which is best for a particular case or how the algorithms work in depth.
To gain some insight into using machine learning in Python, I recommend the book Building Machine Learning Systems in Python, along with its source code. Reading this book was both educational and entertaining. It was also pretty easy to follow.
It isn't heavy mathematics, but rather a guided hands-on tutorial that solves toy problems on real datasets. You will still get accustomed to the machine learning approach and learn some basic concepts.
It's not easy to choose the best algorithm for your problem. When you choose a particular algorithm, it's great to understand it well. To learn how particular ML algorithms work, I would recommend some YouTube tutorials on machine learning such as those. They are a great starting point for learning machine learning.
If that's still not enough for you, I found out that Coursera restarts its machine learning course from Stanford this month; it's here.
Tools for more specific problems
If you have a problem which is connected to image processing, you may consider using scikit-image or mahotas, which can make computer vision less painful.
If you work with text, look at nltk. It's the best tool for natural language processing I know. If you want to do some semantic analysis, check out gensim.
Holistic approach
It's easy to forget that machine learning isn't only a pack of fancy algorithms.
If you have a clustering or classification problem, you have to get the features right.
This is a very tricky part, often more complicated than choosing the right machine learning algorithm, because most algorithms will work pretty well for your problem, with different tradeoffs (speed, accuracy, etc.).
Without good features, however, classification starts to be no better than choosing at random. Luckily, there are some helpful algorithms for feature selection too.
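One such feature selection helper in scikit-learn is SelectKBest, which scores each feature against the labels and keeps the top k. The sketch below is an assumed setup for illustration: I pad the iris measurements with 20 columns of random noise and check which columns survive an ANOVA F-test (f_classif).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Append 20 columns of pure noise after the 4 real measurements.
rng = np.random.default_rng(0)
noise = rng.normal(size=(X.shape[0], 20))
X_noisy = np.hstack([X, noise])

# Keep the 4 features whose values differ most between classes.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X_noisy, y)

kept = selector.get_support(indices=True)
print("kept column indices:", kept)
print("shape before:", X_noisy.shape, "after:", X_selected.shape)
```

Columns 0 to 3 are the real measurements, so the kept indices show whether the selector separated signal from noise. This kind of univariate test looks at one feature at a time and can miss features that are only useful in combination.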
Practice and challenges
There are lots of datasets on the internet. Here are dumps of Wikipedia, for example. For testing your algorithms, mlcomp can be helpful.
And if you're looking for challenges, take a look at Kaggle, a good place to start. It's a platform hosting competitions on predictive modelling.
Happy hacking with Machine Learning on board!