Commonly used terms in AI research
Recently I've been doing a lot of research in AI-related papers, and I was surprised at how many different types of learning, machines, filters and algorithms I didn't have a clue about. Another surprise was how hard it is to get a quick and simple explanation for each of those items, so I decided to make a very straightforward kind of 'dictionary' for those terms. Here it is:
Types of learning:
Transfer learning:
- Basically recycling a model to do another type of inference, similar to the inference the original model was built for.
- For instance: a model for recognizing truck images is reused to recognize car images by training it on a dataset of car images instead of trucks, along with some modifications to the model (sketched below).
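A minimal sketch of the idea, assuming PyTorch/torchvision (a recent version) and a made-up number of car classes: take a network pretrained on another task, freeze it, and swap its final layer for the new task.

```python
import torch.nn as nn
import torchvision.models as models

# Load a network pretrained on a large image dataset (ImageNet).
# For older torchvision versions, pretrained=True is the equivalent flag.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the reused (pretrained) layers so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for the new task
# (num_car_classes is a placeholder for your own dataset).
num_car_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_car_classes)
# From here, train model.fc on the car image dataset as usual.
```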
Distributed Learning
- Basically lots of servers training a single model.
- One of the most widely adopted machine learning techniques because it is great for handling large amounts of data. It's a complex method because it involves almost all areas of computer science:
- algorithms, statistics and optimization on the theoretical side, plus AI methods and different kinds of models like graphical models and kernels (we'll get to kernels in this article).
- Besides that, distributed learning involves partitioning the data with distributed computing: the data is divided across lots of machines in a cluster (we'll also get to clustering) that each work on their part individually and independently, but are united to act as if they were a single machine. Those data partitions are assumed to be identically distributed for this to work, and the approach is very focused on pooling computing power.
- Then the model is trained by unifying the results from each machine. The main disadvantage of distributed learning is that it doesn't pay off on small datasets.
Federated Learning (a.k.a collaborative learning)
- Federated learning is very similar to distributed learning, but with heterogeneous datasets.
- Instead of "unifying" all the machines to gain more power like distributed learning does, in federated learning the participants just exchange their model parameters (like biases and weights) with one another.
Ensemble Learning
- A combination of machine learning models that can be more accurate than just one member of the group.
- Works by combining different models with different approaches (for instance: a support vector machine, a linear regression model and a neural network). Each of the models works on a type of data, and when combined they cancel out each other's weaknesses.
- Each model works independently (see the sketch below).
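A quick hedged sketch with scikit-learn's VotingClassifier, combining the three kinds of models mentioned above on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Three very different models voting together, so their individual
# weaknesses tend to cancel out.
ensemble = VotingClassifier(estimators=[
    ("svm", SVC(probability=True)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("mlp", MLPClassifier(max_iter=1000)),
], voting="soft")

ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```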
Feature Learning (a.k.a representation learning)
- A set of methods that takes raw data and tries to transform it into a more useful kind of data, in order to automatically discover the representation needed to train a model for a specific task.
- Often used with "real world" data that needs classification, like sensor data.
Adversarial Learning
- Basically feeding 'wrong' data into the model to try to corrupt it, in order to find exploits and vulnerabilities.
Continuous Learning
- One of my favorites: a model that retrains itself when there's new data around. The nice thing is that CL does not throw away the already trained part of the model, it just adds more skills and knowledge.
- An important thing about CL is that the model no longer has access to the old data, but it still has access to the results of the training done on that data.
Reinforcement Learning
- Reinforcement learning is based on reward. Something like "well done, model! You did better this time, so keep going in this direction!".
- It tries to find a balance between exploiting what the model has already learned to do well and exploring the parts it hasn't tried yet (a small sketch follows the example below).
- A great (and classic) example is the Super Mario algorithm: https://www.youtube.com/watch?v=qv6UVOQ0F44
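Not a Mario agent, but a tiny epsilon-greedy bandit in NumPy shows the reward-driven idea and the exploration/exploitation trade-off (the reward probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.5, 0.8])   # unknown to the agent
estimates = np.zeros(3)                    # the agent's value estimates
counts = np.zeros(3)
epsilon = 0.1                              # exploration rate

for step in range(1000):
    # Exploration vs. exploitation: sometimes try a random arm,
    # otherwise keep doing what has worked best so far.
    if rng.random() < epsilon:
        arm = rng.integers(3)
    else:
        arm = int(np.argmax(estimates))
    reward = rng.random() < true_rewards[arm]   # "well done, model!"
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)   # should approach the true reward probabilities
```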
Extreme Learning
- Hard to explain, but it's a neural network (an Extreme Learning Machine) that uses a matrix-based algorithm to find its biases and weights, so it does not use backpropagation.
- It's a feedforward neural network, so the connections between the nodes do not form a cycle; the information only moves forward, contrary to recurrent neural networks.
Deterministic Learning
- An algorithm that accepts inputs and tries to discover relations between events; it is used to study the dynamic learning problems of continuous-time systems. It does not keep information about old events, it just learns from the upcoming inputs and updates the already learned knowledge.
Support Vector Machines
- A set of methods that analyzes and recognizes patterns in order to do classification and regression. It takes a set of inputs and predicts, for each input, which class it belongs to.
- It basically plots a line (hyperplane) that separates the two types of outputs, as in the sketch below.
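A small scikit-learn sketch: fit a linear SVM on two toy blobs and read off the hyperplane it found:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated classes of points.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = SVC(kernel="linear")   # finds the separating hyperplane
clf.fit(X, y)

print(clf.predict([[0.0, 0.0]]))   # which side of the hyperplane?
print(clf.coef_, clf.intercept_)   # parameters of that hyperplane
```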
Cross-Domain Learning
- A type of learning that studies how to adapt a learning model to another domain that shares similar data characteristics. It can also be used to combine multiple learning models into one, but it's a challenging method.
One-shot Learning
- A type of computer vision object classification model, based on a Bayesian framework, that aims to learn how to classify without much data or many training samples.
- At first, object classes are learned on numerous training examples, and then new object classes reuse the model parameters based on the similarity between those classes.
Deep Distance Metric Learning
- A metric is a non-negative function between two points x and y that describes the notion of "distance" between the two points. There are several properties a metric has to satisfy, for example: the distance has to be equal to or above zero, and the distance between x and y has to be equal to the distance between y and x.
- The goal of metric learning is to use those metrics to learn a similarity function from data embeddings/feature vectors in a way that reduces the distance between feature vectors corresponding to faces belonging to the same person and increases the distance between the feature vectors corresponding to different faces.
source: https://towardsdatascience.com/the-why-and-the-how-of-deep-metric-learning-e70e16e199c0
Types of Networks and Models:
Markov Model
- It's a model used for systems that change state in a pseudo-random way.
- It's assumed that the future state of a variable depends only on its current state, and not on the past events that occurred (that's the Markov property). This assumption helps avoid problems that are intractable (problems that could in principle be solved, but can't be in practice due to insufficient processing capacity).
- The Markov model is not a "complete" model; it's more like something you add to your model in order to calculate the probability of the future states of variables and their transition functions.
- Unlike typical models, a Markov model doesn't rely on activation functions: it is described by a transition function that returns, for the current state of the variable, a vector of probabilities over the next states (sketched below).
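A small NumPy sketch: a three-state Markov chain defined by its transition matrix (the probabilities are made up):

```python
import numpy as np

# Transition matrix: row i gives the probabilities of moving from
# state i to each possible next state (rows sum to 1).
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # The next state depends only on the current one (Markov property).
    state = rng.choice(3, p=P[state])
    trajectory.append(int(state))
print(trajectory)
```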
Residual Deep Learning (ResNet)
- An artificial neural network that uses so-called skip connections, or shortcuts, to jump over some layers. It skips layers that contain nonlinearities and batch normalisation in between.
- An additional layer may be used to learn the skip weights.
- There are two main reasons to add skip connections: to avoid vanishing gradient problems and to mitigate the degradation problem. The skip connections also carry information forward, similar to an LSTM, and act as a kind of extra path between layers.
- A residual block is a stack of layers set up in such a way that the output of a layer is taken and added to another layer deeper in the block; it's the part that contains the weights of the skip connections (see the sketch below).
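A minimal residual block sketch in PyTorch, assuming the input and output have the same number of channels (real ResNets also use downsampling blocks):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x (the skip connection)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                      # the shortcut carries x forward
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # add the skip connection back

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```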
State Space Model
- A graphical model that describes the relation between the state variables and the observed measurements using linear difference equations.
Deep Belief Network
- A type of unsupervised neural network whose hidden layers are connected to each other in both directions, but with no connections between the units within a layer. So each hidden layer serves as the visible layer for the next one.
- The main goal is to help the system classify the data into different categories.
Bayesian Network
- A probabilistic method, based on graphical models, that uses Bayesian inference to represent conditional probability (how the occurrence of a variable depends on the state of another variable) via a directed acyclic graph.
- Ideal for working out the possible factors that caused an event, so this type of network can be used to diagnose a problem when given some abnormal factors or symptoms.
Recurrent Neural Network (RNN)
- A network with feedback connections, used to exhibit temporal dynamic behavior. It can use its memory to process inputs of variable length.
- RNN can use Long Short Term Memory (LSTM, we'll get to that) to incorporate time delays and feedback loops.
- Therefore, this type of network uses sequential data or time series data for ordinal or temporal problems.
Perceptron
- Probably the first type of neural network we had: basically an input layer connected straight to an output. Each input is multiplied by a weight, a bias is added, and the result is passed through an activation function (a small sketch follows).
- A perceptron with one or more hidden layers is called a multilayer perceptron, which is a simple neural network.
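A tiny NumPy perceptron learning the logical AND function, to make the update rule concrete:

```python
import numpy as np

# Toy data: learn the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0

for epoch in range(10):
    for xi, target in zip(X, y):
        # Step activation on the weighted sum of the inputs plus the bias.
        prediction = 1 if xi @ w + b > 0 else 0
        error = target - prediction
        w += error * xi          # classic perceptron update rule
        b += error

print(w, b)   # weights and bias that separate the two classes
```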
Convolutional Neural Network
- Basically a multilayer perceptron with convolutional layers that share weights, which lowers computational cost; it's used mainly for image processing.
- A 1D CNN is used for time series data analysis and can have multiple channels; the filter can move in only one direction.
- A 2D CNN is used for image processing and computer vision; it can be multi-channel or single-channel, with one kernel (explained below) per channel, and the filter can move in two directions.
Radial Basis Function Neural Network
- basically a neural network that uses Radial Basis (we'll get to that) as the activation function.
- Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control.
Edge Network
- Not a type of neural network, it's more like an architecture of AI and big data analysis.
- The Edge, which is the physical part of the architecture, collects data from sensors and passes it to the cloud, which does the big data processing, data warehousing and possibly the AI training, and returns the outputs to the Edge.
Filters and Methods
Batch Normalization
- It re-centers and re-scales the inputs of a layer in order to make an ANN more efficient, stable and faster, countering the difficulty caused by the random initial weights of the neurons.
- This addresses a problem called internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network (a minimal sketch follows).
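A minimal NumPy sketch of the training-time computation (real implementations also track running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Re-center and re-scale a batch of activations (training-time view)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

batch = np.random.randn(32, 4) * 10 + 5       # badly scaled activations
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```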
Cyclic Spectral Coherence
- Cross-correlation is a measurement that tracks the movements of two or more sets of time series data relative to one another. It is used to compare multiple time series and objectively determine how well they match up with each other and, in particular, at what point the best match occurs. Cross-correlation may also reveal any periodicities in the data and is commonly used for searching a long signal for a shorter, known feature.
- Cross correlation functions, however, can be normalized to create correlation coefficients.
- The spectral correlation function is a cross correlation and its correlation coefficient is called the coherence and is a useful detection statistic for blindly determining significant cycle frequencies of arbitrary data.
source: https://www.investopedia.com/terms/c/crosscorrelation.asp
Gibbs Sampling
- It's a Markov chain Monte Carlo algorithm for obtaining a sequence of observations approximated from a multivariate probability distribution (that's where the Markov part comes in) when direct sampling is difficult.
- It's commonly used for statistical inference, especially Bayesian inference.
- The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables.
- It can be shown that the sequence of samples constitutes a Markov chain, and the stationary distribution of that Markov chain is just the sought-after joint distribution
- The idea in Gibbs sampling is to generate posterior samples by sweeping through each variable (or block of variables) to sample from its conditional distribution with the remaining variables fixed to their current values.
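A tiny hedged sketch in NumPy, assuming the textbook example of a bivariate normal whose conditional distributions are known in closed form:

```python
import numpy as np

# Gibbs sampling from a bivariate normal with correlation rho,
# using the known conditionals x|y ~ N(rho*y, 1-rho^2) and vice versa.
rho = 0.8
rng = np.random.default_rng(0)
x, y = 0.0, 0.0
samples = []

for _ in range(5000):
    # Sample each variable in turn, conditioned on the current value
    # of the other one (the rest stays fixed).
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # should be close to rho
```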
Cohen's Kappa
- Is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items.
- Inter-rater reliability is the degree of agreement between two or more raters; intra-rater reliability is the degree of agreement between different measurements done by the same person. A rater is someone who rates something.
- Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.
- A simple way to think about this is that Cohen's Kappa is a quantitative measure of reliability for two raters rating the same thing, corrected for how often the raters may agree by chance.
- A score of 0 means that there is random agreement among raters, whereas a score of 1 means that there is a complete agreement between the raters.
- Source: the great article https://towardsdatascience.com/cohens-kappa-9786ceceab58
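As a quick illustration (the rater labels below are made up), scikit-learn ships an implementation:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array(["cat", "dog", "dog", "cat", "bird", "dog"])
rater_b = np.array(["cat", "dog", "cat", "cat", "bird", "dog"])

# Agreement between the two raters, corrected for chance agreement.
print(cohen_kappa_score(rater_a, rater_b))
```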
K-nearest Neighbor Algorithm
- A machine learning algorithm for both regression and classification, in which learning is based on how similar a data point (vector) is to other data points, so the 'training' basically consists of storing a lot of vectors.
- It classifies new data by calculating the distance between its vector and the existing vectors and looking at the classes of the k nearest ones.
- Normalizing the training data is important here and can improve the accuracy a lot (see the sketch below).
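A short scikit-learn sketch, scaling the features before k-NN (k=5 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling the features first matters a lot, because k-NN is based
# purely on distances between vectors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.predict(X[:3]))
```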
XGBoost (Extreme Gradient Boosting)
- Gradient boosting is a regression and classification technique that produces a prediction model in the form of an ensemble of weak models (a mix of a lot of models), usually decision trees, which allows the loss function to be optimized step by step.
- XGBoost is a more regularized model formalization to control over-fitting, designed to "push the extreme of the computation limits of machines to provide a scalable, portable and accurate library."
Kernel
- A class of analysis methods and algorithms that uses a linear classifier to solve a non-linear problem. There are multiple kernel methods that are worth checking out.
- A convolution kernel is used to extract features from an input image: the kernel is a matrix of weights, which is multiplied with the input such that the output is enhanced in a certain desirable manner, in order to find features. Through convolutional neural networks, we can use these kernels to extract latent features (a small sketch follows).
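A hand-rolled sketch of a convolution kernel sliding over an image (strictly speaking this is cross-correlation, which is what CNNs actually compute; the edge-detection kernel is just one common choice):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel (matrix of weights) over an image, no padding."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])          # enhances edges
image = np.random.rand(8, 8)
print(convolve2d(image, edge_kernel).shape)     # (6, 6) feature map
```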
Kalman Filters (a.k.a linear quadratic estimation (LQE))
- An algorithm that uses a series of measurements observed over time. By analysing that series of measurements, it can produce accurate estimations of the 'real' value of variables that contain noise or other uncertainties.
- It works by analysing the time series and weighting each piece of information individually, with the most reliable measurements getting a bigger weight (sketched below).
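A hedged sketch of the simplest possible case: a 1-D filter estimating a constant value from noisy measurements (the noise variances are assumptions, and there is no process noise here):

```python
import numpy as np

# Estimate a constant "real" value from noisy measurements (1-D case).
rng = np.random.default_rng(0)
true_value = 5.0
measurements = true_value + rng.normal(0, 2.0, size=50)   # noisy sensor

estimate, error_var = 0.0, 10.0     # initial guess and its uncertainty
measurement_var = 4.0               # assumed sensor noise variance

for z in measurements:
    # Kalman gain: how much weight the new measurement gets;
    # more reliable (lower variance) information gets more weight.
    gain = error_var / (error_var + measurement_var)
    estimate = estimate + gain * (z - estimate)
    error_var = (1 - gain) * error_var

print(estimate)   # should be close to 5.0
```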
Luenberger observers
- It's a system that makes an estimate of the internal state of a real system based on inputs and outputs of the real system itself.
- That way it can detect failures and errors, but it's more susceptible to sensor noise, so it's usually used together with Kalman Filters.
Particle Swarm Optimization (PSO)
- An algorithm to optimize nonlinear continuous functions. The function being optimized is called the fitness function or objective function.
- A lot of particles are released in the search space (picture a 3D graph) and each one of them tries to find the best global spot; they communicate with each other so that the whole swarm moves toward the best spot, sharing things like each particle's velocity and position.
- The main goal is to find a global minimum, but this optimization method does not guarantee finding the global minimum and often ends up in a local minimum; still, it's useful when gradient methods don't work for that specific case (see the sketch below).
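A bare-bones global-best PSO sketch in NumPy, minimizing a simple sum-of-squares objective; the inertia and attraction coefficients are typical textbook values, not tuned ones:

```python
import numpy as np

def fitness(x):
    """Objective function to minimize (global minimum at the origin)."""
    return np.sum(x**2, axis=1)

rng = np.random.default_rng(0)
n_particles, n_dims = 30, 3
pos = rng.uniform(-5, 5, (n_particles, n_dims))
vel = np.zeros_like(pos)
best_pos = pos.copy()                          # each particle's best spot
best_val = fitness(pos)
global_best = pos[np.argmin(best_val)].copy()  # best spot found by the swarm

for _ in range(200):
    r1, r2 = rng.random((2, n_particles, n_dims))
    # Each particle is pulled toward its own best spot and the swarm's.
    vel = 0.7 * vel + 1.5 * r1 * (best_pos - pos) + 1.5 * r2 * (global_best - pos)
    pos = pos + vel
    val = fitness(pos)
    improved = val < best_val
    best_pos[improved], best_val[improved] = pos[improved], val[improved]
    global_best = best_pos[np.argmin(best_val)].copy()

print(global_best)   # close to [0, 0, 0], though only a local optimum is guaranteed
```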
RLS Filter (Recursive Least Squares)
- Algorithm that minimizes the weighted linear least squares cost function by recursively finding the coefficients relating to the input signals.
- It exhibits extremely fast convergence, but at the cost of high computational complexity.
Negative Selection
- This class of algorithms is typically used for classification and pattern recognition problem domains where the problem space is modeled in the complement of available knowledge. Very used in anomaly detection.
- For example, in the case of an anomaly detection domain, the algorithm prepares a set of exemplar pattern detectors trained on normal (non-anomalous) patterns that model and detect unseen or anomalous patterns.
To understand the next three items, you first have to understand what a decision tree is; here's the definition from Wikipedia:
"Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity."
Gradient boosting
- Machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient boosted trees (see below).
Boosting Trees
- A supervised learning algorithm that builds the model in a stage-wise fashion like other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function. It usually outperforms random forests.
Random Forest
- Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees.
- Random decision forests correct for decision trees’ habit of overfitting to their training set.
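To make the last three items concrete, here's a hedged scikit-learn sketch on synthetic data; the hyperparameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Random forest: many trees trained independently, predictions averaged.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
# Gradient boosting: shallow trees added one at a time, each correcting
# the errors of the ensemble built so far.
boosted = GradientBoostingClassifier(n_estimators=200, random_state=0)

print(cross_val_score(forest, X, y).mean())
print(cross_val_score(boosted, X, y).mean())
```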
Fisher Discriminant Analysis
- A method to find linear separation between classes before later classification.
- Attempts to express one dependent variable as a linear combination of other features or measurements (a small sketch follows).
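A small sketch with scikit-learn's LinearDiscriminantAnalysis (Fisher's idea generalized to more than two classes), using the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project the data onto directions that best separate the classes,
# then classify in that lower-dimensional space.
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)
print(X_projected.shape)   # (150, 2)
print(lda.score(X, y))     # classification accuracy
```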
Transformations
Wavelet
- A signal processing technique used for the automatic detection, classification, compression and storage of signals, transforming a signal from its time domain to its frequency domain.
- Also identifies temporal discontinuity in signals.
- The wavelet packet transform, however, also decomposes the detailed information of the signal in the high frequency domain, thereby overcoming the limitation of the normal wavelet transform, which suffers from a relatively low resolution in the high frequency region.
Fourier
- A technique that transforms a signal into its constituent components and frequencies, basically splitting something up into sine waves.
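A tiny NumPy example: build a signal from two sine waves and recover their frequencies with the FFT (the 5 Hz / 20 Hz values are arbitrary):

```python
import numpy as np

# A signal made of two sine waves (5 Hz and 20 Hz) sampled at 100 Hz.
fs = 100
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.fft.rfft(signal)             # decompose into frequencies
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# The two strongest components should be at 5 Hz and 20 Hz.
print(freqs[np.argsort(np.abs(spectrum))[-2:]])
```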
Laplace transform
- Transforms a signal in the time domain into a signal in a complex frequency domain, so it can be used for the analysis of signals and systems.
Welch power spectrum transformation
- Used for estimating the power of a signal at different frequencies; it also reduces noise in the estimate by averaging over segments of the signal.
Others
Embedding
- In machine learning, an embedding may be used as a layer in a Neural Network that associates elements with vectors via an embedding matrix.
- It's mostly used in Natural Language Processing, representing a big, sparse vocabulary in a much smaller number of dimensions by associating each word with a dense vector (see the sketch below).
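A stripped-down sketch of the idea, with a toy three-word vocabulary and random (untrained) vectors:

```python
import numpy as np

vocabulary = {"cat": 0, "dog": 1, "car": 2}
embedding_dim = 4

# The embedding layer is just a lookup table: one trainable vector per word.
embedding_matrix = np.random.randn(len(vocabulary), embedding_dim)

sentence = ["cat", "dog", "car"]
token_ids = [vocabulary[word] for word in sentence]
vectors = embedding_matrix[token_ids]   # shape (3, 4): one vector per word
print(vectors.shape)
```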
LSTM (Long Short Term Memory)
- Type of Recurrent Network that can process not only single data points, but also sequences of data and time series.
Clustering
- A form of grouping an unlabelled dataset by putting the data points into different clusters, each consisting of similar data points. It works by finding similar patterns in the data, like shape or color, and dividing the points by the presence or absence of those patterns (a small sketch follows).
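A minimal k-means sketch with scikit-learn on synthetic blobs (three clusters is an assumption about the data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster assignment for each point
print(labels[:10], kmeans.cluster_centers_)
```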
Entropy
- Is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
- A high entropy means low information gain, and a low entropy means high information gain.
- So, basically, entropy in machine learning measures the impurity of the system: the lower the entropy, the cleaner the knowledge available (see the sketch below).
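A small NumPy sketch of entropy over class labels, in bits:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of class labels (in bits)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = np.array([1, 1, 1, 1])    # all one class
mixed = np.array([0, 1, 0, 1])   # evenly mixed classes
print(entropy(pure))    # 0.0 -> no randomness, "clean" knowledge
print(entropy(mixed))   # 1.0 -> maximum randomness for two classes
```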
DirRec
- A prediction strategy, with input selection, for long-term prediction of time series. It combines two prediction strategies: Recursive and Direct.
- Recursive: the same model is used over and over again, and the previous prediction is fed back as an input for the next prediction.
- Direct: a different model is used for each time step, but always with the real measured data as inputs, so no approximations are introduced into the input set. Only the original data set values are used in the approximation of the future values.
- Also in Direct: every time step has its own model and may also have its own selection of inputs, if input selection is used. These selections increase the calculation time considerably, but in practice give better results in long-term prediction due to the lack of cumulative error.
- DirRec uses a different model at every time step and introduces the approximations from previous steps into the input set. At every time step the input set is increased with one more input: the approximation of the previous step. When input selection is used, we can determine whether the approximation is accurate enough to be included in the next step, and so on.
- So, in a way, the DirRec strategy gives not only the prediction of each step, but also information about the validity of the approximations done in the previous steps.
- Source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.3203&rep=rep1&type=pdf