## Machine Learning Algorithms and their Applications

In this article, you will learn about some Machine Learning Algorithms and how they work, what the math behind them is, and some of their countless applications.

We will start by talking about Machine Learning in general, we will come up with some definitions, and then I am going to explain how 2 algorithms work, what their differences are when they are used, and what makes them so good/bad. We will start with Linear Regression, and then move to K-Means.

Before Starting I should mention that I will include all the equations and mathematical formulas that make up these algorithms. If you are not familiar with Calculus and Linear Algebra, you will not be able to understand most of these equations, however, I will explain what the equations mean and provide some intuition so that you can follow despite not knowing the mathematics.

### What is Machine Learning?

• Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

(Arthur Samuel)

### The Universally Accepted Definition of Machine Learning

A computer program is said to learn from experience E with respect to some task T and some performance measurement P, if its performance on T, as measurement by P, improves with experience E.

(Tom Mitchell)

### What are some examples of Machine Learning?

The most common example is the housing price model. This algorithm, based on all the data it was given in its training phase, creates a prediction for the price of a house. A lot of factors depend on it, like how old the house is, how many bedrooms there are, its location, the number of restrooms, etc. We use machine learning to create an analogy between all these features and the final prices. In this specific case, it creates a function of n variables, where n denotes the number of features, that return a specific value, namely the housing price. This is a classical example of a Regression Algorithm.

### Different Types of Learning Algorithms

In Machine Learning, there are 3 types of Learning Algorithms:

1) Supervised Learning

2)Unsupervised Learning

3)Reinforcement Learning

All of these types of Learning Algorithms solve different types of problems and have countless applications in our everyday life.

### Supervised Learning

We will first take a look at Supervised Learning Algorithms, define them and look at some examples. Afterward, we will get into our first Algorithm, and do a deep analysis with multiple examples and many diagrams.

However, before starting, we should first talk about notation and symbols. There are many ways to symbolize the things that we will talk about, but in this article, we will use the following notation:

Number of Training Examples: m

Input Variables/Features: x

Output Variables/Target Variables/ Labels: y

Hypothesis: $$h_θ(x)$$

$$i^{th}$$ training example :$$(x^{(i)},y^{(i)})$$(for example, the second training example will be denotes as:$$(x^{(2)},y^{(2)})$$, we will not start our index at 0, since there is a specific values reserved for the 0th elements)

What is Supervised Learning?

In S.L., the program is given a training set. A training set is a set of examples for the program to learn of. Each training example consists of n inputs, where n is the number of features our algorithm takes into consideration when making a prediction. Along with the inputs, we also provide the labels, which are the values that correspond to the respective inputs.

The Supervised Learning Algorithm, then creates a function that we call a hypothesis, denoted by h. This hypothesis is the prediction it creates based  on the inputs. After forming a hypothesis, the algorithm then compares its estimated output, with the label that we have provided. Based on the comparison it adjusts its function(the hypothesis) and iterates over the training set until it reaches the accuracy that the programmer has set as a threshold.

Based on the algorithm of our choice, the hypothesis has a different form. For example, as we will later see, Linear Regression has a form of

$$h_θ(x) = θ_0 + θ_1 x$$

for the Simple Linear Regression, and the form

$$h_θ(x) = θ_0 + θ_1 x_1 + θ_2x_2+…+θ_n x_n$$

for the Multivariable Linear Regression.

On the other hand, the Logistic Regression Model, take the model

$$h_θ(x) = \frac{1}{1 + e^{-θ^Τ x}}$$

The algorithms above might seem complicated, but in reality, they are very very simple and we get to the analysis of these, you will find out that they are very straightforward.

The Hypothesis, as we said is the function that our machine learning program creates.

The values θ are simply constants. That means that the Simple Linear Regression hypothesis is simply the equation for a line, the famous

$$y =ax +b$$

All of these constants are parameters of our hypothesis, hence the notation $$h_θ(x)$$

It means that our hypothesis h is a function of x, parametrized by θ.

Our algorithm changes the values θ based on how close its estimated prediction is to the actual value, given in the training set.

The simplest example of such a function is one where the training set consists of single pairs of numbers, one value, and its associated label. In this case, in each example, x is simply that one value, and then we compare the value of our hypothesis to the label, which we usually notate as y.

The Hypothesis, as we said is the function that our machine learning program creates.

The values θ are simply constants. That means that the Simple Linear Regression hypothesis is simply the equation for a line, the famous

$$y =ax +b$$

All of these constants are parameters of our hypothesis, hence the notation $$h_θ(x)$$

It means that our hypothesis h is a function of x, parametrized by θ.

Our algorithm changes the values θ based on how close its estimated prediction is to the actual value, given in the training set.

The simplest example of such a function is one where the training set consists of single pairs of numbers, one value, and its associated label. In this case, in each example, x is simply that one value, and then we compare the value of our hypothesis to the label, which we usually notate as y.

### Linear Regression

This is the first algorithm that we will look at. Linear Regression, as we said in the previous segment, is an algorithm, that based on the training set creates the best fit line for our training set. We give it a specific starting hypothesis, that could be random, taken from another program, or anything really.

The Algorithm then starts training by comparing its predictions/estimated results, to the actual output variables y, and makes the necessary corrections to the parameters θ. In this article we will only talk about Univariate Linear Regression, and mention Multivariate Linear Regression at the end and see how the Gradient Descent algorithm changes when it is introduced to many variables. However, I will not go into detail

Note, that almost everything we say in one algorithm applies to the rest too. At the start of each algorithm, we will however have a small revision of the previous one.

### Univariate Linear Regression

Univariate Linear Regression is a Linear Regression algorithm that has only 1 input variable, as suggested by its name.

That means that training set for U.L.R, would look like this: The algorithm creates a function, that we call the hypothesis, and based on how close to y the estimated result is, it alters the function to match the training set.

How do we do that?

### The Cost Function

For our algorithm to see how close its estimated result is to the actual label, we need a function that tells us exactly that. That function is called the Cost Function

There isn’t a single correct Cost Function. There are many ways to find the difference between the expected and the estimated value. However, in this article, we will use the Mean Squared Error Function as our Cost Function.

We denote the Cost Function as

$$J(θ)$$

or more specifically, for U.L.R.

$$J(θ_0 , θ_1)$$

I will not directly give you the function. Instead, we will try to “create” it ourselves, so that you can get some intuition.

So first of all, the main objective of our function should be to find how much we are off the labels. We should also try to make the value of the function as simple as possible so that we can later use that value and alter the thetas (θ). In other words, we want our result to be a single value for the whole hypothesis and not a specific value for each x. That also makes our algorithm way faster and more efficient.

The core of our function, as we have said, should calculate the difference between y and $$h_θ(x)$$, however, there is a problem:

We cannot simply subtract one value from the other one, because of the signs.

If our hypothesis was say 12 and the actual result was 10, subtracting y from would yield 2. But if our hypothesis was 8, the result of the subtraction would be -2

See the problem?

Both estimates, 8 and 12, are 2 units away from y, but with the cost function based on $$h_θ(x) -y$$,  we get two very different results, with a different sign.

There is a very simple solution to that. What we can do is to either square the difference or find its absolute value.

Both ways are good, but we will choose the first one, namely squaring the difference.

The reason behind it is that if we square it, it amplifies the difference, obviously, by order of 2. That makes the Cost Function more sensitive, therefore, and faster. There is however a catch. After the difference $$h_θ(x) – y$$ becomes smaller than one, the amplification, is no longer helpful, since it makes a small difference a lot smaller.

In the real world and for the training set sizes that we will talk about, it doesn’t really matter which of these we choose. We can barely see a difference in the speed.

From the name of the Cost Function, you could find the rest of the formula based on logic and, well English, but nevertheless, I will continue explaining everything so that you build a strong intuition.

If you remember, we said that we want our Cost function to have only one dimension, one value, for the whole hypothesis. So what we can now do, is find the average error of our hypothesis, by adding all the differences $$h_θ(x)-y$$ and then dividing by the number of training examples. Just like this, we can calculate the mean (squared) error of our hypothesis. Mathematically we write this as:

$$J(θ_0,θ_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} (h_θ(x^{(i)}) – y^{(i)} )^2$$

Where:

m: Number of Training Examples, $$h_θ(x^{(i)})$$: Hypothesis of the $$i^{th}$$ training example, $$y^{(i)}$$ : Label of $$i^{th}$$ training example

You probably noticed, that in our Cost Function, we also divide by 2. If you are familiar with Mutivariate Calculus, you probably understand why that 2 is there and probably, because of that 2, what we will do to change the parameters to fit the training set.

If you are not familiar with multivariable calculus,  do not worry about that 2, I will soon explain its existence and why we put it there.

Before moving on here are some example so that you get a visual interpretation of what we just said.

It is pretty clear that the second example fits the examples better and is the most accurate model of these 3. That would mean that if we were to calculate the cost function of each of these lines, we would expect that the first has the largest, and the middle the smallest. If we were to actually Calculate we would find exactly that!

### Minimizing the Cost Function

We now want to find a way to minimize our cost Function J. It also needs to be fast and controllable.

In Multivariable Calculus and in Vector Calculus there is a specific operation, called taking the Gradient of a Function. We symbolize it with either $$grad(f)$$ or $$\nabla f$$. This operation tells you whether you should increase the value of the input variables to decrease them so that you can arrive at the local maximum slope in the output variable of the function.

For someone that isn’t familiar with Calculus this is a lot of information, that seems complex and too hard to understand and they might get discouraged by that, but the reason I am writing this article is so that people not in the field can understand the logic behind it, so I will explain what this algorithm does.

Before getting into the gradient, we should first take a step back and talk about functions of 2 variables.

These functions take as input 2 variables and outputthird, dependant, variable. That means, that we can see the function, call it f, like a map, that correlates two specific inputs from one space to an output in a 3rd space. In other words, we can see the 3D space as one 2D input space, and one 1D output space. Essentially, what the gradient of a multivariable function does, is that it tells us, whether we should increase or decrease the values of our input variables so that we get closer to the nearest maximum value of the function. Gradient Descent is essentially that, but opposite, so we plug in a minus. The multivariable function is the Cost Function in our case, and we use G.D. to find its local minimum, rather than the minimum.

We have seen now mathematically what the Gradient Descent is, but now we need to see how we apply it to Linear Regression.

There is a problem with applying this operation in our algorithm. There is a chance that the Gradient Descent overshoots, and instead of going at the local minimum, it goes slightly above. That is why we introduce a number to our algorithm, called the learning rate. We usually denote it with a, and it is a controlled constant that determines, how strong the Gradient Descent is, or how big of steps it tells us to move each time.

We always select a small value for a, so that it doesn’t overshoot, but big enough so that it doesn’t take ages to find the nearest optimum. ( I have used a cost function that depends on one value for demonstrative purposes, you can imagine how it would look like in 3D)

initialize the θ, with random values

repeat until convergence{

$$h_θ(x^{(i)}) = θ_0 + θ_1 x^{(i)}$$

$$J(θ_0,θ_1) = \frac{1}{2m}\sum\limits_{i=1}^{m} (h_θ(x^{(i)})-y^{(i)})^2$$

$$θ_j := θ_j – a\frac{\partial}{\partial θ_j}J( θ_0, θ_1)$$

(for j = 0 and j = 1)}

That is our Algorithm. Notice how we first define the Cost function and then update   the variables (:= is the update operator). We do that since it depends on both thetas, so if we were to first update a theta, then define the cost function, and then update the other theta, we would have an algorithm that doesn’t work as intended and updates the values with a wrong logic. Not to mention that this would take a lot longer to finish as opposed to the algorithm above!

###### This actually is also the algorithm and logic for Multivariate Linear Regression, too. The only difference being the hypothesis and the values j takes:
$$h_θ(x_1, x_2,..,x_n) = θ_0 + θ_1 x_1 + θ_2 x_1 +…+ θ_n x_n$$

j takes values form 1 to n, and we define all the x, as features. We also define $$x_0 = 1$$ for all training examples. In MVL we also perform something called feature scaling, where we scale all the features so that they are around the same values. We do that because t there is a huge difference between values and G.D. takes way too long.

### Unsupervised Learning

Unsupervised Algorithms are completely different from Supervised. In U.L. we have no label, no output variable to predict

U.L. Algorithms take as input all the examples we have and create clusters based on them. After the clusters are created, and the algorithm is sufficiently trained, it classifies any input based on its distance from all the clusters.  For our purpose, we will examine K-Means, an algorithm that does exactly that:  Creates Clusters and Assigns points to them.

K-Kmeans is a very simple but beautiful algorithm, that everyone can intuitively understand without having to know any hardcore mathematics.

### K-Means

We can describe this algorithm in 4 steps:

1. Plot all the Points
2. Pick K random Points (these random points are now called cluster centroids)
3. Assign all the points to centroids based on the distance from them
4. Set the coordinates of the cluster centroid as the mean of the coordinates of all its points

Then repeat steps 3 and 4 until convergence

The first step is called Cluster Initialization. We pick the number of clusters that we want (we denote that with the variable K), and the algorithm picks K random points to be centroids.

After that, we move onto the next step called Cluster Assignment. Now, our algorithm iterates over all the points, and based on their distance to the clusters, it assigns them to the closest one, creating n clusters.

The next step is a shift in the coordinates of the cluster centroids. The Algorithm sets the coordinates of the centroids, equal to the mean coordinates of the points in said cluster.

Finally, K-Means repeats the process, until there is no change in assignments, in coordinates, until everything converges.

### The Mathematics behind K-Means

Before concluding, I think it is only right that we take a look at the mathematics behind K-Means, and the symbols we usually.

The algorithm takes two inputs: K: The number of Clusters,and $$\{x^{(1)},x^{(2)},..,x^{(m)}\}$$ the training set, where $$x^{(i)}\in \mathbb{R}$$ and $$x^{(0)}=1$$

(Remember that this is a generalized version, meaning there are n features in each training example)

##### Symbols

$$c^{(i)}$$ is the cluster $$x^{(i)}$$ is assigned to.

$$μ_{(k)}$$ are the clusters/ cluster centroids

(so if we say that $$μ_{(k)}$$:= …., we mean that we update its coordinates to …..)

$$δ_{ij}$$ The Kronecker Delta. This is a mathematical symbol that is used by many fields. It is equal to 1 if i=j, and 0 otherwise

##### The Process

Initialize K $$μ_k$$‘s

repeat until convergence{

for i in range(m){

$$c^{(i)} = \min\limits_{k} ||x^{(i)}-μ_k||$$

for j in range(K){

$$t_j =\sum\limits_{i}^{m} δ_{jc^{(i)}}$$

$$μ_j = \frac{1}{t_j}\sum\limits_{k = 1}^{m}x^{(k)}δ_{jc^{(i)}}$$

}}}

Thank you all so much for reading and hopefully, you learned something new! If you want to learn more about machine learning, I would suggest taking a look at some of the books I have listed below. Thank you for your time and make sure to come back for our future content.