Gradient Descent
The Practical Way
In this chapter, we will explore the basics of the Gradient Descent algorithm and how to create it using Python.
Overview
1.1  What is Gradient Descent?
Machine Learning is the autonomous process by which a machine can "learn" to accomplish a task rather than relying on a programmer to provide a step-by-step solution. This process is useful when solving a problem with no clear answer. Gradient Descent is one of many machine learning algorithms that can be used to optimise an AI model in this way.
In this tutorial, we will explore the basics of gradient descent, and learn how to create it from scratch using the Python language.
The Aim
2.1  The Aim of Gradient Descent
A critical component of optimising an AI model is establishing the error: by how much do the predictions from the AI model differ from the target values in our real-world training data?
The aim of gradient descent is to tune the model's parameters to reduce this error. Treating this as a minimisation problem, the goal is to find the combination of parameter values that will result in the lowest error our model can achieve.
Each change in the model parameters will lead to a change in the model's accuracy. For a simple problem, such as the example above, we can visualise these results as an "error surface," otherwise referred to as a "cost surface." This netted shape represents the calculated error values for all of the individual combinations of the variables at play, in this case, the inputs named "Weight Value" and "Bias Value".
At some point on this surface, there will be a region where the error value is at its lowest point, known as the "global minimum." Unlike other low points, referred to as "local minima", the global minimum reveals the optimum parameter values for the model. Ultimately, finding this global minimum is the main goal of the gradient descent algorithm.
2.2  The Approach
In reality, searching every combination to find the global minimum is not an efficient use of time or computer resources. Additionally, most real-world problems involve many more variables than the 2D and 3D examples on this page. This is important to remember, considering that most graphs can only represent up to three axis dimensions plus colour.
However, using gradient descent, we can begin with the parameters holding a randomised combination of values and then iteratively "descend" towards the minimum point, vastly reducing the search overhead. A common analogy for this process is to imagine the cost surface as a hill, where we can place a ball at a random position and let it roll down to the lowest point. Where the ball comes to rest indicates the lowest point the ball could reach.
To accomplish this task, we will be taking the following approach:

Initialise the model parameters to starting values

Derive the gradient of the error for the current position in the search space

Adjust the model parameters based on the gradient to reduce the error

Repeat steps 2 and 3 until the training criteria have been satisfied
Examples of training criteria to indicate that training can stop include:

The error score reaching a suitably low level

The training period overrunning a maximum time limit (May need to run again)

A high error score not changing over a significant period of time (May need to run again)
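The four steps above, together with these stopping criteria, can be sketched as a simple loop. This is an illustrative skeleton only; the `gradient_of_error` function, the step size and the stopping thresholds are assumptions for demonstration, not part of the code we build later in this chapter.

```python
import time

def gradient_of_error(x):
    # Placeholder gradient; assumes the simple curve y = x^2 + 2
    # used later in this chapter, whose rate of change is 2x
    return 2 * x

def gradient_descent(x, step_size=0.1, tolerance=1e-6, time_limit=5.0):
    # Step 1: x arrives holding its starting value
    start = time.time()
    while time.time() - start < time_limit:      # stop: maximum time limit
        gradient = gradient_of_error(x)          # Step 2: derive the gradient
        if abs(gradient) < tolerance:            # stop: error suitably low
            break
        x = x - step_size * gradient             # Step 3: adjust the parameter
    return x                                     # Step 4: the loop repeats 2 and 3

print(gradient_descent(4.0))  # settles very close to 0, the minimum
```

Running this from a starting value of 4 walks the parameter down towards zero in ever-smaller steps, exactly the behaviour described above.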
The Walkthrough
3.1  Establishing the Model
Ultimately, an AI model is a complex set of functions produced to compute a solution to a given problem, such as detecting and replicating a pattern found in data. When improving our model, an error function needs to be added to scrutinise the model's performance against supplied training data and output the calculated error score. This combined AI model and error function can be represented as one large overall equation. The more optimised this equation, the more accurate the model's result.
To demonstrate gradient descent more easily, we can substitute the complexity of this overall function with a simplistic quadratic equation. Therefore, if we find the minimum of our placeholder function, we have "learned" a solution for our imaginary model. This will enable us to experiment with gradient descent without the overhead of choosing a problem to solve and gathering example data for this problem.
A quadratic equation gets its name from one or more of the input terms being raised to the power of 2, giving the line a curved, bowl-like shape. The equation y = x^2 + 2 was selected for three reasons. Firstly, the equation only has one variable to tune, simplifying the problem. Secondly, the x^2 produces a simple shape with a single global minimum. And lastly, the + 2 raises the line above the x-axis, allowing the line to be more easily seen at its lowest points.
3.2  Initialise the Model Parameters
The overall function is represented by the following equation:
y = x^2 + 2
The dependent variable "y" acts as the output we monitor, with the independent variable "x" representing the input we control. The other parameters of the function, such as the ^2 and + 2, are locked and cannot be changed; these locked parameters control the characteristics of how the function behaves.
The goal is to find the minimum of the function, or in other words, uncover what value of "x" provides the lowest result in "y". In later examples, this will translate to what values of the model parameters offer the most accurate results.
To improve the function, we first need a starting point. For this scenario, we don't need a complicated method of choosing a starting point, so we can select any point at random, such as x = 4. After this, we can slowly nudge the value of x so that the y output reaches the lowest point, as signified by the arrow.
3.3  Derive the Gradient
After selecting a starting point, we can begin to move towards the optimum point. In order to do this, we need to establish two important values for each variable we will adjust.

How much does the value need to be changed?

In what direction (+/-) does the value need to be changed?
These values can be calculated using the gradient at this point. A gradient represents a rate of change. For example, the gradient can be applied to a slope to measure steepness. Moving along a steep slope will cause a large change in height, whereas a slight slope will lead to a smaller change in height.
The value of the gradient reveals how drastic of a change is needed. For example, if the rate of change is high, then the model is far from the optimum point. However, if the gradient is low, then less change is needed. Therefore, a gradient close to or equal to zero would suggest that the model has reached optimum and no further changes are needed.
The direction of the gradient indicates the direction in which the variable needs to be altered. For example, a negative gradient (pointing down) will show that the value is too low and needs to be increased. However, a positive gradient (pointing up) reveals that the parameter value is too high and needs to decrease.
In short, two important rules can be followed:

The closer the gradient approaches zero, the smaller the change that is needed.

The direction to adjust a parameter is opposite to the sign of the gradient.
It is also important to note that although the gradient indicates a "slope," it represents specifically a rate of change. The point is not to draw a line on a graph but to discover how quickly the value of one variable is changing with respect to another.
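The two rules above can be seen in action with a quick sketch. Here the gradient is estimated as rise-over-run across a tiny interval (a technique expanded on in the Background section); the sample function and test points are chosen purely for illustration.

```python
def f(x):
    # The simple curve y = x^2 + 2 used elsewhere in this chapter
    return x ** 2 + 2

def estimate_gradient(f, x, h=1e-6):
    # Rise-over-run across a tiny interval centred on x
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-4.0, -0.5, 0.0, 0.5, 4.0):
    g = estimate_gradient(f, x)
    if g < 0:
        action = "increase the parameter"   # negative gradient: value too low
    elif g > 0:
        action = "decrease the parameter"   # positive gradient: value too high
    else:
        action = "no change needed"         # gradient of zero: at the optimum
    print(f"x = {x:+.1f}, gradient ~ {g:+.2f} -> {action}")
```

Note how the gradient's size shrinks near the minimum at x = 0, and how its sign always points opposite to the direction we should move.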
3.4  Making a Change
The next stage of gradient descent is to update the model parameters. This is thanks to one equation in particular, at the heart of this machine learning algorithm:
New_Value = Old_Value +/- (Amount to Change)
or
New_Value = Old_Value - (Step_Size * Gradient)
In essence, we get our parameter's new value by either increasing or decreasing the previous value by an appropriate amount to change.
When updating model parameters, it is important to make changes in controlled amounts. Too large of a change can lead to "overshooting", making it impossible to get to the true optimum point. Alternatively, making too small of a change may be more accurate but will take too long. Therefore, we can introduce a "step size", an appropriately small number chosen to nudge the parameter by the right amount.
This step size is multiplied by the gradient to further control the intensity of the change. For example, when the gradient is high, we can make a larger change, and when the gradient is closer to zero, we can make a much finer change.
The direction of the gradient is the opposite of the direction in which the parameter needs to be adjusted. To account for this, we subtract the change amount from the current value rather than adding it.
The general characteristic of the algorithm is displayed in the image above.
At the start, there are large distances between updated points as the algorithm quickly approaches the general region of the solution. Near the end, gradually smaller distances are visible when more precise changes are made. Once at the global minimum, the changes are so small that nothing seems to change, and the training can be stopped.
Overall, we can update the parameter by taking its current value and subtracting a controlled amount based on the gradient.
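A single worked update makes the rule concrete. The starting value, gradient and step size here are illustrative numbers only (the gradient formula for our quadratic is derived in the Background section).

```python
x = 4.0           # current parameter value (an arbitrary starting point)
gradient = 8.0    # rate of change at x = 4 for our example curve (assumed here)
step_size = 0.1   # a small, controlled step size

# New_Value = Old_Value - (Step_Size * Gradient)
x_new = x - step_size * gradient
print(x_new)
```

The parameter moves from 4.0 to roughly 3.2, a nudge towards the minimum at x = 0; repeating the update produces progressively smaller steps as the gradient shrinks.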
The training is done over many iterations, gradually improving the model's accuracy. This loop can be stopped once the training criteria are met, such as when the error score is deemed low enough or when the gradient gets close to zero.
The Background
4.1  What is the Gradient
The gradient is a crucial component of the gradient descent algorithm, as the name suggests. Therefore, it is important to understand how exactly the gradient is calculated.
The gradient, also known as the slope, represents a rate of change. As the gradient increases, the rate at which values change over time increases. Take the equation for a straight line:
y = mx + c
or
y = (gradient * x) + bias
On a typical straight-line graph, this is demonstrated by a greater gradient creating a steeper line, causing values to rise more rapidly. Alternatively, a lesser gradient would lead to a shallower line.
The bias, or "c", controls the y-intercept, where the line cuts through the y-axis. The bias can be said to raise or lower a line, but as we are purely demonstrating gradient, the bias has been ignored for the examples below.
The gradient is typically controlled by a multiplication applied to one or more input variables. For example, as shown in the image above, the changes to the slope of the line are proportional to the numbers multiplied by the x term in the equation. For functions utilising multiple input values, each dimension can have an individual gradient applied.
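This proportionality can be sketched in a few lines. The gradients 0.5, 1 and 2 are arbitrary example values chosen for illustration.

```python
def line(x, gradient, bias=0.0):
    # y = (gradient * x) + bias
    return gradient * x + bias

for m in (0.5, 1.0, 2.0):
    # Moving one unit along x changes y by exactly the gradient m
    rise = line(1.0, m) - line(0.0, m)
    print(f"gradient {m}: one step in x changes y by {rise}")
```

Doubling the gradient doubles how quickly y rises for the same step in x, which is exactly the steepening described above.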
4.2  Finding the Gradient
In many cases, the gradient may not be given. There is a variety of equations available that can calculate, or at least very closely estimate, the gradient. The simplest and most common method is the rise-over-run equation.
The rise-over-run equation calculates the rate of change between two points. This is achieved by dividing the resultant change in the y-axis (Δy) by the measured change in the x-axis (Δx). Applied to a straight-line graph, any two points can be chosen, as the gradient of a straight line is constant.
4.3  The Rise-Over-Run Approach
The rise-over-run equation can look quite daunting at first, so let's work through the following example.
We have been provided with a mystery line, the output of an unknown straight-line equation with an unknown gradient and unknown bias. By applying rise-over-run, we can make short work of recovering these values.
The process to take is as follows:

Choose any two points (Something with whole number values would be easiest)

Calculate the difference in y (y of point 2 - y of point 1)

Calculate the difference in x (x of point 2 - x of point 1)

Divide the change in y by the change in x
In the example above, I have chosen two points that conveniently fall on the grid lines.

Point 1 = (2,2)

Point 2 = (6,4)
This gives us the x and y values of:

X1: 2, Y1: 2

X2: 6, Y2: 4
4.4  Calculating the Gradient
With two points selected, the next stage is to calculate the differences. The first step is to measure the difference in y values by subtracting the first point's y from the second point's y.
Δy = y2 - y1
This provides the result of:
Δy = 4 - 2 = 2
The second measurement is the difference of the x values by subtracting the first point's x from the second point's x.
Δx = x2 - x1
This provides the result of:
Δx = 6 - 2 = 4
From these results, we can conclude that a difference of 4 on the xaxis will lead to a difference of 2 on the yaxis.
Finally, dividing these differences gives us the overall rate of change, revealing the missing gradient value:
Gradient = Δy / Δx = 2 / 4 = 0.5
We can conclude that the gradient of the mystery line is equal to 0.5. In other words, the change in y is half that of a change in x.
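The calculation above can be reproduced in a few lines; the point values come straight from the worked example.

```python
# The two chosen points from the example
x1, y1 = 2, 2
x2, y2 = 6, 4

delta_y = y2 - y1          # rise: 4 - 2 = 2
delta_x = x2 - x1          # run:  6 - 2 = 4

gradient = delta_y / delta_x
print(gradient)  # 0.5
```

Swapping in any other pair of points on the same straight line would produce the same 0.5, since the gradient of a straight line is constant.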
The previous graph shows that the y-intercept, where the line crosses the y-axis, is at the value y = 1, revealing the bias (or "c") to be equal to 1. Putting this to the test, if we choose a random point on the graph and subtract one from the y value at this point, we should see that the reduced y value is exactly half (0.5 times) the x value that produces it, verifying our results.
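This verification can also be scripted. Assuming the mystery line really is y = 0.5x + 1 (the gradient and bias we just recovered), removing the bias from y at any point should leave exactly half the x value.

```python
def mystery_line(x):
    # The recovered equation: gradient 0.5, bias 1
    return 0.5 * x + 1

# Both example points should sit on the reconstructed line
print(mystery_line(2))   # 2.0
print(mystery_line(6))   # 4.0

# Removing the bias leaves exactly half of x
for x in (2, 6, 10):
    assert mystery_line(x) - 1 == 0.5 * x
```

Both chosen points, (2, 2) and (6, 4), land on the reconstructed line, confirming the gradient and bias.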
4.5  NonLinear Gradients
The rise-over-run method is a great way to calculate the gradient. However, it does have some crucial drawbacks that affect its suitability for use within gradient descent.
For example, as stated previously, the gradient of a straight line is constant. That means that anywhere on the line, or for any two points we choose, the gradient will be the same. However, as displayed by the image below, the gradient is different for each point we examine for our scenario's nonlinear equation.
This effectively makes rise-over-run incompatible with our specific use case. Given that any scenario we cover through these tutorials will result in a non-linear equation to minimise, we need to investigate an alternative.
4.6  Introducing the Derivative
Rather than generalising a gradient across the entire function's output, we need to adopt a method of deriving the gradient for a specific point we need to examine.
As gradient descent involves improving a given model, we can use the characteristics of the error function and the model's known equation as a foundation to formulate a way of calculating the gradient. Though the solution may not be known, the formula for the model itself should be.
The common way of achieving this is by calculating the derivative of the equation: a new equation that will provide the gradient for any value we plug in.
The derivative has a deep mathematical background, so we will only cover what we need to know. The idea is primarily based on the rise-over-run method, but the gradient is calculated using "two points" with an almost infinitely small distance between them. This effectively calculates the gradient at a single point, which is exactly what we need.
Fortunately, there is a range of popular mathematical "rules" that allow us to simplify this process by bypassing most of this maths.
4.7  Derivative Notation
The derivative is used to measure an instantaneous rate of change, such as the gradient at a specific point. It can also be focused on a specific input variable, as we will demonstrate in later tutorials.
The derivative of the target equation with respect to the input variable "x" is written as dy/dx.
In other words, this equation asks the question:
If I make a small change in x, how much will this change y?
In the case of derivatives, the symbol signifying "a change in ..." is the letter "d".
4.8  Introducing the Power Rule
The derivative can be calculated by utilising a selection of tools called the Derivative Rules. These rules act as shortcuts to simplify the mathematics in particular scenarios.
Take our quadratic equation:
y = x^2 + 2
The rule we will be using in this scenario is the power rule.
The power rule is a shortcut specifically designed to differentiate x^n, where n can represent any number the x term is raised to the power of.
The process is as follows:

Select an x term in the form x^n

Next, take that power of n, and multiply it by the term's coefficient (the number before x).

Lastly, we reduce the original power by 1.

After these steps, repeat for any remaining terms in the equation in that form.
This will produce the new derivative equation that can calculate the rate of change at any given point for our nonlinear equation.
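The steps above can be captured in a small helper. This is a sketch only; representing each term as a (coefficient, power) pair is an assumption of this example, not notation used elsewhere in the tutorial.

```python
def power_rule(coefficient, power):
    # Differentiate coefficient * x^power:
    # multiply the coefficient by the power, then reduce the power by 1
    return coefficient * power, power - 1

print(power_rule(1, 2))   # (2, 1): d/dx of x^2 is 2x^1
print(power_rule(3, 4))   # (12, 3): d/dx of 3x^4 is 12x^3
```

Applying the helper to each x^n term in an equation, one at a time, builds up the full derivative term by term.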
4.9  Additional Useful Rules
Many rules can be combined to compute or simplify the derivative. Below is a range of other must-knows to add to your derivation tool kit.
We now have an established toolset that can be applied to our original equation to get our full derivative with respect to x.
4.10  Using the Power Rule
Below is the start-to-finish workflow for finding the derivative of our quadratic equation, concluding the background section of this tutorial. Applying the power rule to the x^2 term gives 2x^1 = 2x, while the constant + 2 has a rate of change of zero, leaving:
dy/dx = 2x
In other words, the gradient at a specific point is twice the x value at that point.
4.11  Demonstrating the Derivative
Now that we have the derivative, we can plot this equation as an additional function against the original line. This will allow us to perform some visual tests to ensure the gradient is working as we expect.
Selecting some points on the grid lines allows us to read the values on the graph more accurately without printing the values in a table. From these preliminary tests, it can be seen that the gradient is, in fact, double the x value, and that the gradient line is straight, evenly spread, and proportional to the example line.
More importantly, the gradient line is aligned with the centre of the example line, and we see a negative gradient on the lefthand side and a positive gradient on the righthand side of the centre. Moreover, the further from the middle the x value is, the greater the rate of change.
The characteristics of the gradient match that of the example line, which suggests we have successfully found the derivative and captured the essence of the example line.
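The visual check can be backed up numerically by comparing the derivative 2x against a rise-over-run estimate taken over a very small interval; the sample points below are arbitrary.

```python
def example_function(x):
    return x ** 2 + 2      # y = x^2 + 2

def derivative(x):
    return 2 * x           # dy/dx = 2x, from the power rule

h = 1e-6
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    # Rise-over-run across a tiny interval centred on x
    estimate = (example_function(x + h) - example_function(x - h)) / (2 * h)
    print(f"x = {x:+.1f}: 2x = {derivative(x):+.1f}, rise-over-run ~ {estimate:+.4f}")
```

The estimates agree with 2x at every point, negative on the left of the minimum and positive on the right, matching the behaviour described above.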
The Code Along
5.1  Setting up the Environment
In this tutorial section, we will progress through a code-along walkthrough of the gradient descent process.
To get started, we need to initiate a suitable programming environment to begin coding. I have selected Jupyter Notebook, a free-to-use, web-based environment you can install and run locally on your computer. In addition to being free, I have chosen this tool because it permits programs to be created using "cells", small segments of live code that can be run and altered independently, perfect for experimenting with changes to a larger program.
With a new project made, the next step is to import libraries. Libraries are a selection of premade programs and functions that we can include in our programs to make our job easier. The libraries I have chosen are:

Matplotlib - A free visualisation tool that allows us to create graphs effortlessly.

NumPy - A free mathematics library that contains many optimised mathematics tools.
Python 3
# Import Essential Libraries #

# Import graph library
from matplotlib import pyplot as plt
# Import mathematics library
import numpy as np
Libraries can be imported as a broader package or further specified to include specific tools. A tool imported into a program can also be provided with an alias using the "as" keyword. An alias acts as a nickname that you can use in your program. For example, if I want to use the "pyplot" graphing tool within the "matplotlib" package, I can refer to it using the alias "plt" to make my code shorter and easier to read.
5.2  Declaring the Function
The first stage of the process is to establish the model we intend to optimise.
We can add the function we intend to use by declaring it as a function. A function has two main sections: the function header and the function body. The function header defines the function's name, along with any inputs we need to provide. The function body is where we place the actual behaviour of our function, in this case, returning the result of a calculation.
Python 3
# An example function to optimise
def example_function(x_in):
    # Represents the equation: y = x^2 + 2
    return (x_in ** 2) + 2
The quadratic equation we have shown before only takes in one input, the x value, which has been provided as an input parameter named "x_in". The "x_in" variable is known as a local variable, a variable only available within our function, similar to the function having its own copy of the value we give it. This way, the function is free to work with this variable without affecting any other code outside the function's scope.
The equation is then entered in the function body as a return statement, providing us with the answer once it has been calculated. It is worth noting that raising a value to a power is accomplished using double asterisks.
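The function can be sanity-checked straight away with a few hand-calculated values (the definition is repeated here so the cell runs on its own):

```python
def example_function(x_in):
    # Represents the equation: y = x^2 + 2
    return (x_in ** 2) + 2

print(example_function(0))    # 2: the lowest output, at the global minimum
print(example_function(4))    # 18: 4^2 + 2
print(example_function(-4))   # 18: the curve is symmetric about x = 0
```

The matching outputs for 4 and -4 confirm the bowl shape, with the smallest result sitting at x = 0.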
5.3  Testing the Function
A simple 2D function such as this can be tested by plotting it on a graph. This will allow us to visually check that the function is performing the way we expect it to before we attempt to optimise it. Though not a required step, catching issues early will limit confusion if problems occur later.
To test the function, we will need to do four things:

Create input values for the x-axis

Run the function to generate values for the y-axis

Plot the function on a graph

Display the graph
Python 3
# Create x-axis values
x = np.linspace(-5, 5, 20)
# Generate the y-values by calling the function
function_test = example_function(x)
# Plot the test line
plt.plot(x, function_test, c='b')
# Label the line on the graph
plt.legend(["y = x^2 + 2"])
# Show the graph
plt.show()