Straightforward guide to know which ML algorithm to use
Jun 22, 2021
First steps:
- define the problem: what is the objective of the problem?
- categorize the output: is the output a number or a class?
- explore the data
- categorize the data: is it labeled or not?
- check the size of the data: is it a large dataset or not?
- plot the data to familiarize yourself with it
- transform the data into features that represent the underlying problem (as Jason Brownlee puts it)
- familiarize yourself with the features to find the algorithm whose output comes closest to what you need
- run a logistic regression or a linear SVM and check the residual error to see if your problem is linear or not
- start by trying the basic models
- try the more complicated models
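The linearity check from the list above can be sketched roughly like this. It's a minimal example, assuming scikit-learn is installed and using synthetic data from `make_moons` as a stand-in for your own dataset. Instead of raw residuals it compares the cross-validated accuracy of a linear model against a non-linear one, which is a swapped-in but related way to see the same thing.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data with a curved boundary (an assumption for illustration).
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# A basic linear model and a more flexible non-linear one.
linear_model = LogisticRegression()
nonlinear_model = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 cross-validation folds for each model.
linear_acc = cross_val_score(linear_model, X, y, cv=5).mean()
nonlinear_acc = cross_val_score(nonlinear_model, X, y, cv=5).mean()

print(f"linear model accuracy:     {linear_acc:.3f}")
print(f"non-linear model accuracy: {nonlinear_acc:.3f}")
# A clear gap in favour of the non-linear model hints that the problem is not linear.
```

If both scores are about the same, the simpler linear model is probably enough.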
By doing these steps, you should know whether you have a linear problem, whether it's a classification problem, whether the situation is supervised or unsupervised, and what the characteristics of your dataset are. So let's go:
Supervised Learning (labeled data)
If the output is a number, it's a regression problem:
- if you have a large training set, you could use decision trees;
- if you need something that is quick to run, you could use linear regression or a decision tree;
- if you want something more accurate instead: random forest or a neural network;
- if the linearity check from the first steps showed a linear problem: linear regression or a linear support vector machine;
- if you have a non-linear problem: Kernel SVM, neural networks or random forests;
- if you have too many features: PCA to reduce dimensionality;
- if you want to predict from the most similar examples: KNN
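To make the list above more concrete, here's a small sketch that tries a few of those regression models on the same data and compares their test error. It assumes scikit-learn and uses synthetic data from `make_regression`, so the numbers say nothing about your own problem.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a linear relationship plus noise (an assumption for illustration).
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its mean squared error on the held-out data.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {errors[name]:.1f}")
```

Because this toy data is linear by construction, the linear model should come out ahead here. On a non-linear problem the ranking can flip, which is the whole point of trying the basic models first.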
If your output is a class and the number of classes is known, you have a classification problem:
Classification
- if your training data is large: KNN (keep in mind that prediction gets slower as the dataset grows)
- if you want high accuracy: Random Forest, neural network, Kernel SVM
- if you need something quick to run: Naive Bayes, decision trees, logistic regression
- if you have a linear problem (the linearity check from the first steps): logistic regression
- non-linear problems: Kernel SVM, Random Forest, neural network
- too many features: PCA to reduce dimensionality
- if you need a clear margin separating the classes: SVM
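The "PCA to reduce dimensionality" item from the list above could look something like this. It's a minimal sketch, assuming scikit-learn and its bundled digits dataset (64 features). Keeping 20 components is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Handwritten digits: 64 pixel features, 10 known classes.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, shrink 64 features down to 20 components, then classify.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"test accuracy with 20 PCA components: {accuracy:.3f}")
```

Putting PCA inside the pipeline means it is fitted only on the training data, so the test score stays honest.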
Unsupervised Learning (unlabeled data)
If the output of the model is a class but you don't know the number of classes, it's a clustering problem:
Clustering
- if your variables are categorical: k-modes
- if not: k-means
- if you don't want to specify k in advance: DBSCAN
- if there's a large number of features: reduce the dimensionality first (for example with PCA), then cluster
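Here's a small sketch of the difference between k-means and DBSCAN from the list above. It assumes scikit-learn and three well-separated synthetic blobs. The `eps` and `min_samples` values are guesses that suit this toy data and would need tuning on real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated groups of points (an assumption for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)

# k-means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN finds the number of clusters from density; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

n_kmeans = len(set(kmeans_labels))
n_dbscan = len(set(dbscan_labels) - {-1})
print(f"k-means clusters: {n_kmeans}")
print(f"DBSCAN clusters (noise excluded): {n_dbscan}")
```

So with DBSCAN you trade "pick k" for "pick a density", which is easier when you have no idea how many groups to expect.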
If you need to reduce the number of features, it's a dimension reduction problem:
Dimension Reduction
- if it's not a probabilistic problem: Singular Value Decomposition (SVD)
- if there's a large number of features: PCA
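As a rough illustration of dimension reduction with singular value decomposition, this sketch uses scikit-learn's `TruncatedSVD` on the bundled digits dataset. Keeping 10 components is an arbitrary assumption, just to show the shapes before and after.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

# Handwritten digits: each row has 64 pixel features.
X, _ = load_digits(return_X_y=True)

# Keep only the 10 strongest directions found by the SVD.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

# Fraction of the original variance that the 10 components still carry.
explained = svd.explained_variance_ratio_.sum()

print(f"shape before: {X.shape}, shape after: {X_reduced.shape}")
print(f"variance kept: {explained:.2f}")
```

The reduced matrix can then be fed to any of the models above in place of the original features.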
Semi-Supervised Learning
- if you have some labeled data and some unlabeled data, you may use ensemble learning when the final result depends on more than one prediction
- the ensemble can combine multiple models that use different algorithms; pick each model's algorithm based on the data that model will handle
- also, if there are too many features, you may use an SVM
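One way to picture the ensemble idea above is the sketch below. It assumes scikit-learn, synthetic data, and a made-up split where only the first 100 rows are treated as labeled. Three different algorithms vote, and the ensemble's predictions on the unlabeled rows act as pseudo-labels. This is just one possible setup, not the only way to do semi-supervised learning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Pretend only the first 100 rows are labeled (an assumption for illustration).
X_labeled, y_labeled = X[:100], y[:100]
X_unlabeled, y_hidden = X[100:], y[100:]

# Three models with different algorithms, combined by averaging their probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_labeled, y_labeled)

# Pseudo-labels for the rows we pretended not to have labels for.
pseudo_labels = ensemble.predict(X_unlabeled)

# We only know the hidden labels because the data is synthetic.
agreement = (pseudo_labels == y_hidden).mean()
print(f"pseudo-labels matching the hidden labels: {agreement:.2f}")
```

In a real project you would not have `y_hidden`, so you'd judge the pseudo-labels with a small held-out labeled set instead.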
That's it for now. I'll update this guide as I go forward in my studies. You can also read my guide to the most used AI terms, and the references linked below will give you a deeper idea of the process of picking the right algorithm.
Very helpful references:
- https://serokell.io/blog/how-to-choose-ml-technique
- https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
- https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html
- https://blogs.sas.com/content/subconsciousmusings/2020/12/09/machine-learning-algorithm-use/