Although it is tempting to reach for a popular algorithm like extreme gradient boosting (XGBoost), the algorithm of choice for winning Kaggle competitions, it might not always be the best choice for the problem at hand. Having a logical, time-tested approach to selecting an algorithm, and to finding the right configuration for your dataset by tuning its hyperparameters, is what moves machine learning from art toward science.
In machine learning, there is no “one size fits all” solution. Choosing an algorithm is a comprehensive task that demands the analysis of many factors, which is why it is so important to know how to match a machine learning algorithm to a particular problem. The right algorithm for modelling your data depends on considerations such as computational complexity, interpretability, and ease of implementation.
A simple solution is to try everything and see what works best. But there is a lot more value in being able to understand the trade-offs you’re making when choosing one algorithm over another. So, how do you determine the best algorithm among the tens, if not hundreds, of available algorithms?
To understand the factors that control the choice of algorithm, divide your decision criteria into two groups: data-related aspects and problem-related aspects. The size, behaviour, characteristics, and type of your data give you an initial idea of what algorithm to use. Once you get this right, the problem-related aspects will help you reach a final decision. Below, I outline a few basic steps to follow before you jump in and write that code.
Step 1: Understand your data:
You can narrow your choice of algorithm by observing your input data. Knowing your data is the first and foremost step in deciding on an algorithm. Before you start thinking about different algorithms, you need to familiarize yourself with your data. A simple way to do that is to visualize the data and try to find patterns within it, try to observe its behaviour, and, most important of all, its size.
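As a first-look sketch, a few lines of pandas cover the basics: size, feature types, missing values, and distributions. The small DataFrame here is illustrative; in practice you would load your own file (e.g., with `pd.read_csv`).

```python
# A minimal first look at a dataset with pandas. The DataFrame below is
# a stand-in for your real data; the checks are what matter.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40000, 52000, 71000, 88000, None],
    "churned": ["no", "no", "yes", "yes", "no"],
})

print(df.shape)         # size: (rows, columns)
print(df.dtypes)        # type of each feature
print(df.isna().sum())  # missing values per column
print(df.describe())    # distribution of numeric features
```

Plotting histograms and pairwise scatter plots (e.g., `df.hist()` with matplotlib) is the natural next step for spotting patterns.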
You can categorize algorithms based on the learning type or the input data as follows:
· Supervised learning: for labeled data
· Unsupervised learning: for unlabeled data
· Semi-supervised learning: for partially labeled data
· Reinforcement learning: if the goal is to optimize an objective function by interacting with an environment
Size of training set:
· If your training data is small, or if the dataset has few observations but many features (as in genetics or textual data), choose high-bias/low-variance algorithms like linear regression, Naïve Bayes, or a linear SVM.
· If the training data is large and the number of observations is high compared to the number of features, you can go for low-bias/high-variance algorithms like KNN, decision trees, or a kernel SVM.
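The small-data case above can be checked empirically rather than taken on faith. Here is a sketch, assuming scikit-learn, that cross-validates a high-bias model (Naïve Bayes) against a low-bias/high-variance one (k-NN) on a small synthetic dataset; on your own data the ranking may differ, which is exactly why you measure.

```python
# Sketch: on a small dataset with many features, compare a high-bias
# model (Naive Bayes) against a low-bias/high-variance model (k-NN)
# using 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Small sample, relatively many features: the regime described above.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

for name, model in [("NaiveBayes", GaussianNB()),
                    ("kNN", KNeighborsClassifier())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```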
Step 2: Define the problem statement:
A well-defined problem statement lets you understand the output expected from the model:
- Regression algorithms, if the output is continuous
- Classification algorithms, if the output is categorical
- Clustering algorithms, to identify similarities between objects and group them according to the characteristics they have in common
- Optimization, to provide a data-driven approach to continuous improvement in practically any field
- Forecasting, to make predictions about the future based on past and present data
- Anomaly detection, to catch unusual events such as fraudulent transactions in real time, even for previously unknown types of fraud
- Ranking, to build ranking models, like SearchWiki by Google
- Recommendation systems, to make valuable suggestions to clients to explore more content/options
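The first three categories map directly onto estimator families. The helper below is a toy illustration of that mapping, assuming scikit-learn; the chosen estimators (`LinearRegression`, `LogisticRegression`, `KMeans`) are examples, not the only candidates in each family.

```python
# Illustrative mapping from the expected output type to an estimator
# family; the specific estimators here are just common defaults.
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

def pick_estimator(output_type):
    if output_type == "continuous":   # regression problem
        return LinearRegression()
    if output_type == "categorical":  # classification problem
        return LogisticRegression()
    if output_type == "unlabeled":    # clustering problem
        return KMeans(n_clusters=3, n_init=10)
    raise ValueError(f"no default estimator for {output_type!r}")

print(type(pick_estimator("continuous")).__name__)
```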
Step 3: Data Transforms:
The process of choosing the algorithm isn’t limited to categorizing the problem. You also need to have a closer look at your data because it plays an important role in the selection of the right algorithm for the problem.
Observing the data alone will not give you an idea about the best algorithm to use, let alone the best data transforms to use to prepare the data or the best configuration for a given model. Instead, use controlled experiments to discover what works best for a given dataset.
· Processing. The components of data processing are pre-processing, profiling, cleansing, and pulling together data from different internal and external sources.
· Feature engineering. Transform raw data into features that can represent the underlying problem to the predictive models. It helps to improve accuracy and get the desired results faster.
· Linearity: Linear regression algorithms assume that data trends follow a straight line. This assumption isn’t bad for some problems, but for others it reduces accuracy. Despite their drawbacks, linear algorithms are popular as they tend to be algorithmically simple and fast to train.
· Number of parameters: Parameters are factors such as error tolerance, the number of iterations, or options between variants of how the algorithm behaves. The time required to train a model generally grows with the number of parameters, since more combinations have to be explored. On the other hand, having many parameters typically indicates that an algorithm has greater flexibility.
· Number of features: Feature selection refers to the process of applying statistical tests to inputs, given a specified output. A large number of features can bog down some learning algorithms, making training time infeasibly long. The training time and accuracy of the algorithm can also be sensitive to getting just the right settings.
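The transforms listed above are easiest to manage inside a single pipeline, so they are fitted within each cross-validation fold rather than leaking information from the test split. A minimal sketch, assuming scikit-learn, chaining imputation, scaling, and feature selection before a model:

```python
# Sketch: cleansing (imputation), scaling, and feature selection chained
# in one scikit-learn Pipeline, so the transforms are cross-validated
# together with the model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # normalize feature scales
    ("select", SelectKBest(f_classif, k=10)),      # keep the 10 strongest features
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"pipeline mean accuracy: {scores.mean():.3f}")
```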
Test transforms, models and hyperparameters:
Review the data and select data transforms that make distribution normal, remove outliers, etc. Test a bag of algorithms with default hyperparameters and select one or a few that perform well. Tune the hyperparameters of those top-performing models. Some machine learning methods (e.g., non-linear SVMs) crucially rely on hyperparameter optimization.
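The spot-check-then-tune workflow just described can be sketched in a few lines, assuming scikit-learn and its bundled breast-cancer dataset; the candidate list and the SVM parameter grid are illustrative choices, not a prescription.

```python
# Sketch: test a bag of algorithms with default hyperparameters, then
# tune the hyperparameters of one promising model with a grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: spot-check with defaults.
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
}
for name, model in candidates.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())

# Step 2: tune one front-runner (here the SVM; the grid is illustrative).
grid = GridSearchCV(SVC(),
                    {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```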
Step 4: Have a basket of algorithms for each category:
Keep a personal shortlist of algorithms that have worked well for you, or that are documented to provide the best results in scientific research papers or Kaggle competitions, across all categories. Looking more closely at individual algorithms helps you understand what they provide, how they work, and how to apply them to your data. There are any number of articles explaining how to use a specific algorithm, but knowing when to use it, and how to choose the best algorithm for your data, will be your judgement.
Step 5: Set up a machine learning pipeline:
Compare the performance of the algorithms after taking into consideration the following practical parameters. These parameters vary from project to project, and only you can be the best judge of the trade-offs that you need to make, depending on the nature of the problem that you are trying to solve.
1. Accuracy desired
2. Complexity of the model
3. Time to value
4. Alignment to the business goal
Even the most experienced data scientists can’t tell which algorithm will perform best before trying them. Which algorithm to use depends on the objective of the business problem. If inference is the goal, then restrictive models are better, as they are much more interpretable. Flexible models are better if higher accuracy is the goal. In general, as the flexibility of a method increases, its interpretability decreases.
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and linear and logistic regression are easy to implement and quick to run. Algorithms like SVMs (which involve parameter tuning), neural networks (with long convergence times), and random forests need a lot more time to train.
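The training-time gap is easy to measure directly. A minimal sketch, assuming scikit-learn, times a fast probabilistic model against a kernel SVM on the same data; the absolute numbers will vary by machine, but the relative difference is the point.

```python
# Sketch: wall-clock fit time of a fast model (Naive Bayes) versus a
# slower kernel method (SVC) on identical data. Timings are
# machine-dependent; only the relative gap is meaningful.
import time

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

for name, model in [("NaiveBayes", GaussianNB()), ("kernel SVM", SVC())]:
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name}: {time.perf_counter() - start:.3f}s to fit")
```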
1. Better data often beats better algorithms, and designing good features goes a long way. When performance is comparable, choose your algorithm based on speed or ease of use instead.
2. You can improve the accuracy of an algorithm by spending more time on processing and training the data. Make the decision based on the priorities of your specific project.
3. If you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation.
4. Use an ensemble method to choose them all.
5. If you can achieve similar results using a much simpler algorithm, opt for simplicity.
6. Remember that no single machine learning method performs best on all datasets.
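Point 4 above has a concrete form in scikit-learn's `VotingClassifier`, which combines several classifiers instead of forcing a single choice. A minimal sketch, with an illustrative trio of base models:

```python
# Sketch: "choose them all" by combining several classifiers with soft
# voting, i.e., averaging their predicted class probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft",  # average predicted probabilities across models
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"ensemble mean accuracy: {scores.mean():.3f}")
```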
This is actually my workflow / train of thought whenever I try to solve a new problem. I would love to know the methods that you have adopted that worked for you. Please let me know in the comments.