Straightforward guide to know which ML algorithm to use

Jun 22, 2021

First steps:

define the problem: what is the objective of the problem?
categorize the output: is the output a number or a class?
explore the data
categorize the data: is it labeled or not?
check the size of the data: is it a large dataset or not?
plot the data to familiarize yourself with it
transform the data into features that represent the underlying problem (as Jason Brownlee puts it)
familiarize yourself with the features to find the algorithm whose output comes closest to what you need
run a logistic regression or a linear SVM and check the residual error to see whether your problem is linear or not
start by trying the basic models
try the more complicated models
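To make the linearity check from the steps above a bit more concrete, here is a minimal sketch. It swaps the logistic regression or SVM for an ordinary least-squares line, only because that is the smallest linear model to write by hand, and it summarizes the residual error as R². The helper names `fit_line` and `r_squared` and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """Share of the variance in ys that a straight line explains."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Residual error: what the straight line fails to explain.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data (assumed): one linear relationship and one curved one.
xs = [x / 10 for x in range(-20, 21)]
linear_ys = [3 * x + 1 for x in xs]
curved_ys = [x ** 2 for x in xs]

print("linear data R^2:", r_squared(xs, linear_ys))  # close to 1
print("curved data R^2:", r_squared(xs, curved_ys))  # close to 0
```

A score near 1 suggests a linear model is enough, while a low score with large residuals hints that the problem is non-linear.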

After these steps, you should know whether your problem is linear, whether it is a classification problem, whether the situation is supervised or unsupervised, and what the characteristics of your dataset are. So let's go:
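The branching that the rest of this guide follows can be sketched as a small function. This is only an illustration: the function name `categorize_problem`, its parameters and the returned strings are hypothetical, and they simply mirror the section titles used in this guide.

```python
def categorize_problem(labeled, output_is_number=False, reduce_dimensions=False):
    """Map the answers from the first steps to a section of this guide.

    labeled: "all", "some" or "none" (how much of the data has labels).
    All names here are hypothetical and exist only for illustration.
    """
    if labeled == "some":
        return "semi-supervised learning"
    if labeled == "all":
        # Supervised learning: a numeric output means regression,
        # a class output means classification.
        return "regression" if output_is_number else "classification"
    # No labels at all: unsupervised learning.
    if reduce_dimensions:
        return "dimension reduction"
    return "clustering"


print(categorize_problem("all", output_is_number=True))
print(categorize_problem("none"))
```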

Supervised Learning (labeled data)

If the output is a number, it's a regression problem:

if you have a large training set, you could use decision trees
if you need something that is quick to run, you could use linear regression or a decision tree
if you want something more accurate instead: random forest or a neural network
if the linearity check from the first steps showed a linear problem: support vector machines
if you have a non-linear problem: Kernel SVM, neural networks or random forests
if you have too many features: PCA to reduce dimensionality
if you want to predict from the most similar examples: KNN
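As an illustration of one option from the list above, here is a minimal sketch of KNN used for regression: the prediction is the average target of the k closest training points. The function name `knn_regress` and the toy numbers are assumptions, and a real project would more likely use a library implementation.

```python
def knn_regress(train_x, train_y, query, k=3):
    """Predict a number as the mean target of the k nearest training points."""
    # Squared Euclidean distance from the query to every training point.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy data (assumed): one feature per sample and a numeric target.
train_x = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
train_y = [1.1, 1.9, 3.0, 10.2, 10.9, 12.1]

prediction = knn_regress(train_x, train_y, (2.5,), k=3)
print(prediction)  # about 2.0, the mean of the three closest targets
```

Note that every prediction scans the whole training set, so this naive version gets slower as the data grows.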

If your output is a class and the number of classes is known, you have a

Classification problem

if your training data is large: KNN, keeping in mind that its predictions get slower as the training set grows
if you want high accuracy: Random Forest, neural network, Kernel SVM
if you need something quick to run: Naive Bayes, decision trees, logistic regression
if you have a linear problem (the linearity check from the first steps): logistic regression
if you have a non-linear problem: Kernel SVM, Random Forest, neural network
if you have too many features: PCA to reduce dimensionality
if you need a clear margin between the classes: SVM
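To show what the quick, linear option from the list above looks like, here is a minimal sketch of logistic regression trained with plain gradient descent. The function names, the learning rate, the number of epochs and the toy data are all assumptions chosen for illustration, not tuned values.

```python
import math


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=1000):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            error = p - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the linear score is positive, else class 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


# Toy, linearly separable data (assumed): two features per sample.
features = [(-2, -1), (-1, -2), (-1, -1), (1, 1), (1, 2), (2, 1)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(features, labels)
print([predict(weights, bias, x) for x in features])
```

The decision boundary is a straight line, which is why this model fits the linear case and struggles with the non-linear one.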

Unsupervised Learning (unlabeled data)

If the output of the model is a class but you don't know the number of classes:

Clustering

if your variables are categorical: k-modes
if not: k-means
if you don't want to specify k: DBSCAN
if there's a large number of features: reduce the dimensionality first (for example with PCA), then cluster
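For the k-means option in the list above, a minimal sketch of the algorithm could look like this. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Starting from the first k points is a simplification assumed here (library implementations use smarter initialization), and the function name and toy points are made up for illustration.

```python
def kmeans(points, k, iterations=10):
    """Cluster points into k groups with a fixed number of iterations."""
    # Simplification (assumed): use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy data (assumed): two well-separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that k has to be chosen up front, which is exactly the requirement DBSCAN avoids.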

If you have to do dimension reduction, then it's a

Dimension Reduction

if it's not a probabilistic problem: Singular Value Decomposition (SVD)
if there's a large number of features: PCA
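To give a feel for what dimension reduction does, here is a minimal sketch of PCA, the technique this guide suggests earlier for datasets with too many features. It finds the first principal component with power iteration on the covariance matrix instead of a full SVD, which is a simplification assumed here to keep the example dependency-free. The function names and the toy data are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Return the feature means and the direction of largest variance."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # normalize, which converges to the dominant eigenvector.
    vector = [1.0] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return means, vector


def project(point, means, component):
    """Reduce a point to a single coordinate along the component."""
    return sum((x - m) * c for x, m, c in zip(point, means, component))


# Toy data (assumed): two features that move together, roughly y = 2x.
points = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1), (5, 9.9)]
means, component = first_principal_component(points)
reduced = [project(p, means, component) for p in points]
print(component)
print(reduced)
```

Here two correlated features collapse into one coordinate per sample while keeping most of the variation in the data.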

Semi-Supervised Learning

if you have some labeled data and some unlabeled data, you may use ensemble learning when more than one prediction is needed to produce the final result
the ensemble can combine multiple models built with different algorithms; pick each algorithm based on the data destined for that model
if there are too many features, you may also use SVM
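The ensemble idea above, several models each giving a prediction that is combined into a final result, can be sketched with a simple majority vote. The three threshold "models" below are hypothetical stand-ins for real trained models, and majority voting is only one of several ways to combine predictions.

```python
from collections import Counter


def majority_vote(models, sample):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained models: each one is a simple rule.
def model_a(x):
    return "high" if x > 5 else "low"


def model_b(x):
    return "high" if x > 7 else "low"


def model_c(x):
    return "high" if x > 3 else "low"


models = [model_a, model_b, model_c]
print(majority_vote(models, 6))  # two of the three models vote "high"
print(majority_vote(models, 4))  # two of the three models vote "low"
```

Using an odd number of models for a two-class problem avoids tied votes.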

That's it for now. I'll update this guide as I go forward in my studies. You can also read my guide to the most used AI terms and the references linked below for a deeper understanding of how to pick the right algorithm.

Very helpful references: