A straightforward guide to choosing an ML algorithm

João Raffs
3 min read · Jun 22, 2021

First steps:

  • define the problem: what is the objective of the problem?
  • categorize the output: is the output a number or a class?
  • explore the data
  • categorize the data: is it labeled or not?
  • check the size of the data: is it a large dataset or not?
  • plot the data to familiarize yourself with it
  • transform the data into features that represent the underlying problem (as Jason Brownlee advises)
  • study the features to find the algorithm whose output comes closest to what the problem needs
  • fit a linear model (logistic regression or a linear SVM) and check the residual error to see whether your problem is linear
  • start by trying the basic models
  • try the more complicated models
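
The linearity check in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a single numeric feature and a numeric target, so it fits an ordinary least-squares line instead of the logistic regression or SVM named above. The helper names `fit_line` and `r_squared` and the toy data are made up for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line (1.0 is a perfect fit)."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # residual error
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data (hypothetical): one target follows a straight line, the other a parabola.
xs = [x / 2 for x in range(-10, 11)]
linear_ys = [3 * x + 1 for x in xs]
curved_ys = [x ** 2 for x in xs]

linear_score = r_squared(xs, linear_ys)  # close to 1: a line explains the data
curved_score = r_squared(xs, curved_ys)  # close to 0: a line explains almost nothing
```

A high score suggests that a linear model is a reasonable starting point; a low score is a hint to try the non-linear models listed further down.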

After these steps, you should know whether you have a linear problem, whether it is a classification problem, whether the setting is supervised or unsupervised, and the main characteristics of your dataset. So let's go:

Supervised Learning (labeled data)

If the output is a number, it's a regression problem:

  • if you have a large training set: decision trees
  • if you need something that is quick to run: linear regression or a decision tree
  • if you want something more accurate instead: a random forest or a neural network
  • if the linearity check from the first steps shows a linear problem: support vector machines with a linear kernel
  • if you have a non-linear problem: kernel SVM, neural networks or random forests
  • if you have too many features: PCA to reduce dimensionality
  • if you want to predict from the most similar examples: KNN (KNN does not group data; for grouping, see Clustering)

If your output is a class and the number of classes is known, you have a classification problem.

Classification problems

  • if your training data is large: KNN (keep in mind that its predictions get slower as the training set grows)
  • if you want high accuracy: random forest, neural network, kernel SVM
  • if you need something quick to run: Naive Bayes, decision trees, logistic regression
  • if the linearity check from the first steps shows a linear problem: logistic regression
  • if you have a non-linear problem: kernel SVM, random forest, neural network
  • if you have too many features: PCA to reduce dimensionality
  • if you need to separate the classes with a wide margin: SVM
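
To make one of these options concrete: KNN, listed above, classifies a new point by a majority vote among its nearest training points. The sketch below is a plain-Python illustration, not a production implementation; the function name `knn_predict`, the default `k=3` and the toy points are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of 2-D points.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(train_points, train_labels, (1.5, 1.5))
near_second_group = knn_predict(train_points, train_labels, (8.5, 8.5))
```

Every prediction scans the whole training set, which is why this brute-force form becomes slow as the data grows.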

Unsupervised

If the output of the model is a class but you don't know the number of classes:

Clustering

  • if your variables are categorical: k-modes
  • if not: k-means
  • if you don't want to specify k in advance: DBSCAN
  • if there's a large number of features: reduce the dimensionality first (for example with PCA), then cluster
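
As an illustration of k-means, suggested above for non-categorical variables, here is a minimal plain-Python sketch of the usual assign-then-update loop (Lloyd's algorithm). It assumes the first k points are acceptable starting centroids, which keeps the example deterministic; real libraries choose starting points more carefully. The name `k_means` and the toy points are made up for this sketch.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters; the first k points serve as starting centroids."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Toy data (hypothetical): points alternate between a low group and a high group.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = k_means(points, k=2)
```

Here k has to be chosen up front, which is the limitation that makes DBSCAN attractive when the number of groups is unknown.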

If you need to reduce the number of features, you have a dimension reduction problem.

Dimension Reduction

  • if it's not a probabilistic problem: Singular Value Decomposition (SVD)
  • if there's a large number of features: PCA
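
PCA, which this guide recommends earlier for datasets with too many features, looks for the direction along which the data varies most and projects the data onto it. The sketch below finds only the first such direction using power iteration on the covariance matrix. It is a simplified illustration, assuming small numeric data with non-zero variance; the name `first_principal_component` and the toy rows are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the direction of greatest variance and each row's coordinate along it."""
    n = len(rows)
    dims = len(rows[0])
    # Center the data so every feature has mean zero.
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # constant data has no direction of variance
            break
        v = [x / norm for x in w]
    projected = [sum(c[j] * v[j] for j in range(dims)) for c in centered]
    return v, projected


# Toy data (hypothetical): two features where the second is always twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, reduced = first_principal_component(rows)
```

The two original features collapse into the single column `reduced` without losing information here, because the toy data lies exactly on one line.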

Semi-Supervised Learning

  • if you have some labeled data and some unlabeled data, you may use ensemble learning when more than one prediction has to be combined into a final result
  • the ensemble can combine several models that use different algorithms; pick each model's algorithm based on the data that model will receive
  • if there are too many features, you may use an SVM
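
The idea of combining several predictions into one final result can be sketched as a simple majority vote. This is a hedged illustration only: the three rule-based "models" below are hypothetical stand-ins for trained models, and `majority_vote` is a name made up for this example.

```python
from collections import Counter


def majority_vote(models, sample):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for three trained models with different decision thresholds.
models = [
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 3 else "low",
    lambda x: "high" if x > 8 else "low",
]

vote_for_six = majority_vote(models, 6)  # two of the three models answer "high"
vote_for_two = majority_vote(models, 2)  # all three models answer "low"
```

Using an odd number of models avoids ties when there are only two classes.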

That's it for now. I'll update this guide as I go forward in my studies. You can also read my guide to the most used AI terms, and read the references linked below for a deeper understanding of how to pick the right algorithm.

Very helpful references:
