Linear Regression Explained, Step by Step
Linear regression is one of the most famous algorithms in statistics and machine learning. In this post you will learn how linear regression works on a fundamental level. You will also implement linear regression both from scratch and with the popular Python library scikit-learn. You will learn when and how to best use linear regression in your machine learning projects. You do not need any prior knowledge to read this article.
You’ve probably heard about linear regression before. Maybe you have even used it in a project. And maybe, like me when I first learned about it, you had some open questions about how linear regression works in detail and why exactly we do things the way we do. If this is you, then you have found the right post. In this article, we’ll walk through linear regression step by step and take a look at everything you need to know in order to use this technique to its full potential. You’ll also understand what exactly we are doing when we perform a linear regression. We’ll go through the intuition, the math, and the code. Let’s begin!
The Main Idea
Imagine we have a dataset of houses for a specific city, where we are given the number of bedrooms for each house as well as the price of each house. The dataset might look like this:

| Number of bedrooms | Price (in thousands of dollars) |
|---|---|
| 1 | 80 |
| 2 | 170 |
| 3 | 100 |
| 3 | 220 |
| 4 | 200 |
| 4 | 270 |
| 5 | 500 |
Now let’s say there is a new house for sale in the city. Wouldn’t it be cool if we could use this existing dataset to somehow predict how much the new house will cost? We can do so by trying to draw a straight line through our data points that captures the overall trend of the data.
Now, if a new house enters the dataset, all we need to do is take the number of bedrooms in that house and read off the price that our line marks at that specific number of bedrooms. In other words, we plug the number of bedrooms into our linear function, and what we receive is the estimated price:
Let’s say our function looks like this:

f(x) = 60 · x

where x is the number of bedrooms in the house and f(x) is the estimated price in thousands of dollars. Our function estimates that a house with one bedroom will cost $60,000, a house with two bedrooms will cost $120,000, and so on. If we plot our function alongside our data points, we get the graph above.
Looks good! But we could have also chosen this function:

g(x) = 70 · x
or this one:

h(x) = 45 · x + 5
In general, we could take any function that has this form:

f(x) = a · x + b

where a determines how steep our function is and b determines the value of our function at x = 0. a is also called the slope of our function and b is called the intercept term (because it determines the value at the point where our function intercepts the y-axis).
We can create three plots and display them side by side:
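Since the side-by-side figure isn’t reproduced here, here is a minimal matplotlib sketch of what such a comparison could look like. The dataset values are the ones we use throughout this post (prices in thousands of dollars); the plotting details such as the figure size and grid are just arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

data = [(1, 80), (2, 170), (3, 100), (3, 220), (4, 200), (4, 270), (5, 500)]
x_data = [x for (x, y) in data]
y_data = [y for (x, y) in data]

candidates = {"f(x) = 60x": lambda x: 60 * x,
              "g(x) = 70x": lambda x: 70 * x,
              "h(x) = 45x + 5": lambda x: 45 * x + 5}

fig, axes = plt.subplots(1, 3, sharey=True, figsize=(12, 4))
x_line = np.linspace(0, 6, 50)
for ax, (label, function) in zip(axes, candidates.items()):
    ax.plot(x_line, function(x_line), "g-")  # the candidate line
    ax.plot(x_data, y_data, "b.")            # our data points
    ax.set_title(label)
    ax.set_xlabel("number of bedrooms")
axes[0].set_ylabel("price (in thousands of dollars)")
plt.show()
```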
But how can we tell which of these three functions best represents the trend of our data? I mean sure, if a function is wildly off from our data points, we can tell at a glance that it is a bad fit, but can we somehow quantify how good or bad a specific function is? Yes, we can!
Finding the Least Squares
Ok, so we want to find out how well our line matches our data points. In other words, we want to know how similar, or how close, our line is to our points. So let’s just calculate the difference between every data point and the value of our function at that point and sum those differences together. So if we have a data point that tells us there was a house on sale with 3 bedrooms for a price of $200,000, but our function suggests a price of $180,000, the difference is $20,000. And now we do this for every point.
Mathematically, we can write this sum down as:

Σᵢ (yᵢ - f(xᵢ))
where yᵢ is the actual price of the i-th house in thousands of dollars, xᵢ is the number of bedrooms in that house, and f(xᵢ) is the price our function predicts for that number of bedrooms. So if we plug in the values from our dataset (using our function f(x) = 60 · x), we get:

(80 - 60) + (170 - 120) + (100 - 180) + (220 - 180) + (200 - 240) + (270 - 240) + (500 - 300) = 220
These individual differences are also called residuals. Let’s call the resulting sum the sum of residuals (or SOR for short).
At this point it might be helpful to visualize this process, so let’s take a look at it in action. At the top of the animation, we will keep track of the total SOR while we add each residual one by one.
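Since the animation itself can’t run here, here is a minimal sketch of the same bookkeeping in plain Python. It uses the dataset from above and the function f(x) = 60 · x; the final value of 220 only applies to this particular function and dataset.

```python
data = [(1, 80), (2, 170), (3, 100), (3, 220), (4, 200), (4, 270), (5, 500)]

def f(x):
    return 60 * x

total_sor = 0
for (x, y) in data:
    residual = y - f(x)  # positive if f underestimates the price, negative if it overestimates
    total_sor += residual
    print(f"residual: {residual:4d}, running SOR: {total_sor:4d}")

# for f(x) = 60 * x on this dataset, the final SOR is 220
```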
The problem with the SOR
Now there is one problem with this. You might have noticed that when we calculate the difference between a data point and our prediction, we get a negative value if our prediction is higher than the data point. This happens whenever a blue data point lies below our green line.
In this case, our SOR gets reduced instead of increased. Not good! This way, individual terms might cancel each other out and our result is distorted. We want to track overestimates as well as underestimates, but right now this is not working as intended. So how can we fix this?
The first workaround that comes to mind would be to just take the absolute value of each difference, like this: Σᵢ |yᵢ - f(xᵢ)|. Let’s call this the sum of absolute residuals (SOAR). An alternative would be to square each term instead, like this: Σᵢ (yᵢ - f(xᵢ))². Let’s call this the sum of squared residuals (SOSR).
SOAR vs SOSR
In practice, the SOAR is used a lot less often than the SOSR. There are a few reasons for this. The reason that is most commonly mentioned is that we avoid the SOAR because taking absolute values makes the derivative of a function more complicated.
We need to take the derivative of our metric if we want to find its minimum, i.e. the function with the lowest SOAR. Since the SOAR tells us how badly a function performs, we are interested in finding the lowest possible value of it, and therefore we need its derivative. We won’t look at derivatives in this post, but they are needed for deriving the normal equation or for performing gradient descent. We will take a look at these two techniques later on in the post.
Now we could try to correct our SOSR by taking the square root of each squared residual. But the thing is, not “correcting” our SOSR might actually be beneficial. Why is that?
If we compare the SOSR with the SOR, you might say: squaring the residuals yields a different result than the one we actually wanted, doesn’t it? We wanted to calculate the sum of residuals, but if we square each term, then large residuals grow a lot more than small residuals! If we have the three residuals 0.5, 10, and 40 and we square these terms, we get 0.25, 100, and 1600. 0.5 decreased, while 10 increased 10-fold and 40 increased 40-fold! This can’t be a good thing, can it? Well, in our case it actually is!
We use the SOSR to measure how well (or rather how poorly) a line fits our data. By squaring the residuals we magnify the effect large residuals have on our SOSR. This just means that we care more about large residuals than we do about small residuals. Let’s say we have two scenarios.
- In the first scenario, we have three residuals, 2, 2, and 2. If we square them, we get 4, 4, and 4.
- In the second scenario, we only have one residual, 6.
3 · 2 is exactly 6, so the total error of the first three residuals is exactly the same as the error of the fourth residual. The SOSR in scenario 1 would be 4 + 4 + 4 = 12. However, the SOSR in the second case would be 6² = 36. The single large residual has a weight three times larger than the three smaller residuals combined, even though their total error is exactly the same! If we were to take the SOAR instead, we would get 6 in both scenarios.
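As a quick sanity check, we can reproduce these two scenarios in a couple of lines of Python (the residual values 2, 2, 2 and 6 are just the example numbers from above):

```python
scenario_1 = [2, 2, 2]
scenario_2 = [6]

soar_1 = sum(abs(r) for r in scenario_1)  # 6
soar_2 = sum(abs(r) for r in scenario_2)  # 6  -> the SOAR treats both scenarios the same
sosr_1 = sum(r**2 for r in scenario_1)    # 12
sosr_2 = sum(r**2 for r in scenario_2)    # 36 -> the SOSR punishes the single large residual
```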
As we see, squaring the residuals puts more weight onto large errors and less weight on smaller errors.
So when minimizing the SOSR, we always take care of the largest residual first. In contrast, when minimizing the SOAR, we treat each residual equally.
And most of the time, this property of the SOSR is a good thing. Making lots and lots of small errors is usually better than making one huge error. There is a lot more depth to this, and the topic of squares vs. absolutes appears again in another context, namely ridge and lasso regression. If you are interested in reading more about this specific topic, I recommend you take a look at the article Ridge and Lasso Regression Explained, Step by Step, where I explain this topic in its full depth. But now back to our linear regression.
Since the SOSR is the default metric, linear regression is often also called least squares regression, since our goal is to find the function that minimizes the squares in our SOSR.
Let’s evaluate our functions
Ok, great, so now we have a procedure that gives us a concrete number (or error) for every function, telling us how bad it is at predicting house prices in our dataset. Now let’s calculate these values for our three functions f, g, and h. For this, let’s quickly implement our three functions as well as our SOSR in Python, so we don’t have to do the math by hand:
```python
def f(x):
    return 60 * x

def g(x):
    return 70 * x

def h(x):
    return 45 * x + 5

def sosr(data, function):
    sosr = 0
    for (x, y) in data:
        residual = y - function(x)
        sosr += residual**2
    return sosr
```
We can model our dataset as a simple list of tuples (number of bedrooms, price in thousands of dollars) in Python, like this:
```python
data = [(1, 80), (2, 170), (3, 100), (3, 220), (4, 200), (4, 270), (5, 500)]
```
If we now call sosr on our dataset and each of our functions f, g, and h, we get:
```
>>> sosr(data, f)
53400
>>> sosr(data, g)
42200
>>> sosr(data, h)
94875
```
Adjusting the SOSR
So it looks like the function g best fits our data! Now I don’t know about you, but I’m a bit unhappy with our metric, the SOSR. I mean, it’s a good metric, but we can’t really interpret a SOSR of 42200. If we did not have the SOSR values for f and h, how could we tell whether a SOSR of 42200 is very good, decent, bad, or terrible? If we take the square root of 42200, we get ~205. This means that our function made a total error of roughly $205,000. Now that’s pretty bad! But this number is a bit tricky to interpret. An error of $205,000 is really bad if we only predicted the prices of 7 houses, but what if we predicted 70,000 houses instead? Then this error suddenly doesn’t look that bad anymore.
Let’s therefore slightly modify our metric. After we sum the squares of every residual, let’s also divide the final result by the number of data points in our dataset. In other words, let’s take the mean (or average) of the squared residuals instead of their sum. This way, we don’t get the combined squared error, but instead the average squared error per point. And that is a lot easier to interpret. Let’s take a look:
```python
def mean_sosr(data, function):
    sosr = 0
    for (x, y) in data:
        residual = y - function(x)
        sosr += residual**2
    mean_sosr = sosr / len(data)
    return mean_sosr
```
Now if we use mean_sosr to evaluate our functions, we get:
```
>>> mean_sosr(data, f)
7628.571428571428
>>> mean_sosr(data, g)
6028.571428571428
>>> mean_sosr(data, h)
13553.57142857143
```
and if we take the square root of each value, we get:
```
>>> from math import sqrt
>>> sqrt(mean_sosr(data, f))
87.34169352933013
>>> sqrt(mean_sosr(data, g))
77.64387566686395
>>> sqrt(mean_sosr(data, h))
116.41980685678631
```
This means that on average, f made an error of ~$87,000, g an error of ~$77,000, and h an error of ~$116,000 per prediction. Now that’s a lot more intuitive if you ask me! We get a very direct feel for the size of the error our functions make. If our function g predicts some price, we know that, on average, that price will be off by around $77,000. Of course, this is just a rough estimate, but it still helps us get a feel for how good or bad our functions are. And most importantly, we know that an average error of $77,000 is pretty bad! And we can see this without having to compare multiple functions to each other.
Clearing Things Up
Ok, so there are a few things that I want to address at this point to make everything as clear as possible. First and foremost, let’s fix our naming conventions.
Fixing the Names
SOAR, SOSR, and mean SOSR might be very descriptive names, but they are not used in practice. Usually, we use more general terms. In practice, the mean SOSR (mean sum of squared residuals) is simply called the MSE (mean squared error), and the mean SOAR is known as the MAE (mean absolute error).
Why not just take the Square Root?
Ok, so we do take the square root to interpret our MSE, so why don’t we include the square root in our metric directly? We could, and it can be helpful at times. The resulting metric is called the root mean squared error, or RMSE, and it is sometimes used in practice. But using the RMSE instead of the MSE would not really help us out much here. Why is that?
First of all, this would again just make our math more complicated. And if we can keep our metric simple, that’s usually a good idea, because a simpler function means a simpler derivative, which means we can compute it faster. Also, taking the square root would partly undo the effect we mentioned earlier, where large residuals are weighted more heavily than small ones. In our case we actually want to weigh large errors more heavily, which is another reason to just stick with the MSE. Would the RMSE still work, though? Yes, it would. But why make things more complicated when the simpler option already does everything we want?
Apart from that, it is also nice to know that a parameter value which minimizes a function also minimizes the square root of that function, because the square root is monotonically increasing.
This is why taking the RMSE instead of the MSE would also work to solve our linear regression problem.
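If you did want to use the RMSE, a minimal sketch could simply wrap the mean_sosr-function we wrote above (the helper name rmse is my own choice here, not an established function):

```python
from math import sqrt

def rmse(data, function):
    # same minimum as mean_sosr, just on a more interpretable scale
    return sqrt(mean_sosr(data, function))

# e.g. rmse(data, g) gives roughly 77.6 (thousand dollars), as we computed earlier
```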
Ok, now that we have these things cleared up, let’s continue with finding the best possible line for our data!
Finding the Best Function: Visualize
Before we think about how to find the best possible function, let’s first think about how this function would behave, or in other words, which properties it should have.
We have said that we can use any function that has the following form:

f(x) = a · x + b

for some a and some b. We then defined our evaluation metric to be the MSE. For our specific case, the MSE is defined like this:

MSE = (1/n) · Σᵢ (yᵢ - f(xᵢ))²
where our data contains n tuples of the shape (xᵢ, yᵢ). We can assume that our data is constant, since we are only looking at our particular dataset. This means that our MSE only depends on the function f. And our function f only depends on a and b, since x is just our input data. With those aspects in mind, we can rewrite our MSE as such:

MSE(a, b) = (1/n) · Σᵢ (yᵢ - (a · xᵢ + b))²
The part a · xᵢ + b is exactly f(xᵢ). Because our xᵢ’s and yᵢ’s are constant, our MSE now only depends on the two variables (or model parameters) a and b.
What we can do now is visualize this definition of our MSE for every value of a and b. This results in a three-dimensional plot, where we have a on the x-axis, b on the y-axis, and our MSE on the z-axis.
The resulting plot looks like this:
Ok, that looks neat! Our final goal is to find the function that, for some a and some b, has the lowest MSE. In this plot, we can see the MSE for every a and every b. Thus, if we can find the lowest point on this surface (i.e. the minimum), we get the lowest possible MSE! And then all we have to do is take the values of a and b at that minimum, plug them into our function, and we’ve solved our linear regression! Ok, that’s the plan. But how do we find this minimum without having to go through every possible value of a and b?
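If you would like to reproduce a static version of this surface yourself, here is a minimal matplotlib sketch. The grid ranges for a and b are arbitrary choices that happen to contain the interesting region for our dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [(1, 80), (2, 170), (3, 100), (3, 220), (4, 200), (4, 270), (5, 500)]

def mse(a, b):
    return sum((y - (a * x + b)) ** 2 for (x, y) in data) / len(data)

a_values = np.linspace(0, 170, 100)     # slope candidates
b_values = np.linspace(-200, 200, 100)  # intercept candidates
A, B = np.meshgrid(a_values, b_values)
Z = np.vectorize(mse)(A, B)             # MSE for every (a, b) pair on the grid

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(A, B, Z, cmap="viridis")
ax.set_xlabel("a (slope)")
ax.set_ylabel("b (intercept)")
ax.set_zlabel("MSE")
plt.show()
```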
I want to present you with two different ways for how we can compute our ideal function.
Finding the Minimum: Normal Equation
Linear regression as a problem is simple enough (mathematically speaking) that we can compute the ideal values for our a and our b directly. This means that we only have to solve one single equation, and we’re done. When something like this exists, we say that there is a closed-form solution for our problem. The corresponding equation is called the normal equation. Before we take a look at the normal equation though, let us first vectorize our MSE. We don’t have to do this, but doing so not only makes the normal equation look much neater, it also makes this procedure easier and more efficient to implement.
Adjustment: Vectorization
We defined our MSE like this:

MSE = (1/n) · Σᵢ (yᵢ - (a · xᵢ + b))²
There is nothing wrong with that definition, but if we wanted to translate our definition into code,
what we would probably do is create a loop that sums up each of the individual residuals.
This is exactly how we’ve implemented our mean_sosr-function.
For our small dataset of 7 points that is no big deal, but imagine we had 100.000 data points (which is not uncommon today).
Every time we wanted to compute our MSE, we would have to perform 100.000 operations through our loop. We can do better than this!
For this, we will vectorize our equation. This has the benefit that we can use NumPy to perform our calculations, and NumPy is a lot faster than just a regular Python loop!
Vectorization is worthy of its own post, which is why I have created a separate article
where I explain in detail what vectorization and its benefits are and how we can vectorize
our linear function as well as the MSE. I would highly recommend you give it a read, but if you want to keep things short, here’s what we do in a nutshell.
We will rewrite our definition of the MSE to this:

MSE = (1/n) · (y - ŷ)ᵀ (y - ŷ)

where ŷ, the vector of predictions our function makes, is equal to:

ŷ = Xθ

and where X and θ are defined as:

X = [[1, x₁], [1, x₂], …, [1, xₙ]] (one row per data point), θ = [b, a]ᵀ
In code, our new implementation will look like this:
```python
import numpy as np

def mse_vectorized(y, y_predicted):
    error = y - y_predicted  # save y - y_predicted in a variable so that we only have to compute it once
    loss = 1 / (y.size) * np.dot(error.T, error)
    return loss
```
If you are interested in how we arrive at this new definition, I encourage you to read the corresponding article, where we go through this particular example step by step. It should only take about 7 minutes to read, so if you are interested, take a look and then return to this post afterward.
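As a quick sanity check, the vectorized version should produce exactly the same value as our loop-based mean_sosr from earlier. Assuming data, g, mean_sosr, and mse_vectorized are all defined as above:

```python
import numpy as np

y_actual = np.array([y for (x, y) in data])
y_predicted = np.array([g(x) for (x, y) in data])

print(mse_vectorized(y_actual, y_predicted))  # ~6028.5714
print(mean_sosr(data, g))                     # ~6028.5714, the value we computed earlier
```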
With the vectorization now cleared up, let’s finally take a look at the normal equation.
The Normal Equation
The equation looks like this:

θ* = (XᵀX)⁻¹ Xᵀ y

where θ* is the optimal parameter vector, meaning it contains the values of b and a that minimize our MSE. The T in Xᵀ means we are transposing X. Loosely speaking, this just means that we swap the rows and columns of X with each other. The ⁻¹ after the brackets means we take the inverse of the matrix that is calculated inside the brackets.
Getting to this normal equation requires a bit more math, so we won’t derive it here.
Implementing the Normal Equation
With that being said, we can simply implement this equation in Python like this:
```python
import numpy as np

def normal_equation_linear_regression(x, y):
    intercept_ones = np.ones((len(x), 1))  # results in array([[1.], ..., [1.]])
    x_b = np.c_[intercept_ones, x]  # add the ones as a new first column to our x
    theta_optimal = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y)  # the normal equation
    return theta_optimal
```
In the first line of our function, we create this array:

```
array([[1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.]])
```
In the second line, we combine this newly created array with our x. We do this by using NumPy’s c_-operator. It takes our x and our intercept_ones and concatenates them along the second dimension. What this means is that it adds our ones as a new column.
The resulting array will look like this:

```
array([[1., 1.],
       [1., 2.],
       [1., 3.],
       [1., 3.],
       [1., 4.],
       [1., 4.],
       [1., 5.]])
```
Lastly, we compute our normal equation. np.linalg.inv computes the inverse of a matrix, whereas .T transposes the matrix it is being called on. The .dot-operator is a matrix multiplication, meaning x_b.T.dot(x_b) is equal to XᵀX.
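As a small side note, recent Python versions also offer the @ operator for matrix multiplication, so these two lines are equivalent:

```python
product = x_b.T.dot(x_b)  # using NumPy's .dot
product = x_b.T @ x_b     # using the @ matrix-multiplication operator (Python 3.5+)
```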
Ok, now let’s plug these values into our function and see what values we get, shall we?
```python
dataset = np.array([(1, 80), (2, 170), (3, 100), (3, 220), (4, 200), (4, 270), (5, 500)])
x = dataset[:, 0]
y = dataset[:, 1]
```
```
>>> normal_equation_linear_regression(x, y)
array([-46.31578947, 84.73684211])
```
Ok, that looks like something. Let’s also calculate the MSE for those values:
```python
def f_ideal(x_b):
    return np.dot(x_b, [-46.31578947, 84.73684211])  # -46.31578947 + x * 84.73684211 for each row

x_b = np.c_[np.ones((len(x), 1)), x]  # our x with the added column of ones

>>> mse_vectorized(y, f_ideal(x_b))
5691.72932330827
```
That looks pretty good! The square root of 5691 is ~75, which means that our ideal function is off by “only” about $75,000 per estimate. That still does not sound great, but it is a lot better than any of our initial functions f, g, and h. If we display the point (-46.32, 84.74) on our three-dimensional plot, we get this:
Indeed, it really is the minimum!
Now we have learned how we can solve any linear regression problem in just a few lines of raw Python code and we also took a look at the math behind it.
Using Scikit-Learn
We can achieve the same result by using the popular library scikit-learn.
Scikit-learn provides a LinearRegression-class we can use for this. But before we can use it, we have to make a small adjustment to our data. Scikit-learn expects our x to be two-dimensional (since, in most cases, we will have more than one feature), but right now our x is only one-dimensional. So we redefine our x like this:
```python
x = np.array([[1], [2], [3], [3], [4], [4], [5]])
y = np.array([80, 170, 100, 220, 200, 270, 500])
```
and now we can solve our linear regression problem like this:
```python
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(x, y)
```
To inspect our variables, we can simply run:
```
>>> print(linreg.coef_, linreg.intercept_)
[84.73684211] -46.31578947368433
```
As we see, scikit-learn got exactly the same result as our own code. Nice!
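Once the model is fitted, we can also let scikit-learn make predictions for us directly. For example, for a house with 3 bedrooms (the exact digits may vary slightly depending on your setup):

```python
>>> linreg.predict(np.array([[3]]))  # estimated price in thousands of dollars
array([207.89473684])
```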
Our Final Line
Let’s now finally take a look at our optimal line:
We can also plot our line directly in Python with the following matplotlib-code:
```python
import matplotlib.pyplot as plt

y_predict = linreg.predict(x)  # the prices our fitted line predicts for our x values
plt.plot(x, y_predict, "g-")  # "g-" means: use green as the color ("g") and draw a line ("-")
plt.plot(x, y, "b.")  # "b." means: use blue as the color ("b") and draw individual points (".")
plt.show()
```
Cool! As you can see, the points lie pretty close to our line, except for the last one. Our mathematical formula gave us the optimal solution, but maybe we can still ask ourselves if we can somehow do better? The answer is: maybe. You may have noticed that our last data point seems a bit off. I added this point on purpose. This last point is a so-called outlier, a value that distances itself significantly from the other values around it. There are a number of different ways we can handle outliers, and if you want to learn more about them, I recommend you read my article Outliers in Data and What You Can Do To Fix Them, where I explain in detail how you can find outliers and what you can do to stop them from ruining your machine learning model. For now though, we can be happy that we found our ideal function!
Finding the Minimum: Gradient Descent
Another way to find our ideal function takes a completely different approach. Instead of directly computing the ideal values for our variables, we instead approach them step by step. In a nutshell, gradient descent starts with a random function and continuously improves it until it can no longer be improved.
I think Aurélien Géron described the idea behind the algorithm with a great analogy in his book “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, so I will share it with you here.
Imagine you are on the side of a hill and you want to get to the valley. But the air is very foggy. So foggy, that you can only see the ground you are standing on and nothing else. But you can still feel the slope of the hill, right? So by taking very small and cautious steps in the downward direction, you will eventually reach the valley. This is in essence how gradient descent works.
Gradient descent is worthy of its own post, so I will not go into further detail here. If you are interested in reading more about how we can use gradient descent to solve our linear regression, I recommend you read Gradient Descent for Linear Regression Explained, Step by Step, where I go into much more detail and show you step by step how we can use this method to find the function that best fits our data.
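Just to give you a rough idea of what this looks like in code, here is a minimal sketch of batch gradient descent for our problem. It reuses the x_b (our x with the added column of ones) and y from the normal equation section; the learning rate and the number of iterations are arbitrary choices and may need tuning:

```python
import numpy as np

def gradient_descent_linear_regression(x_b, y, learning_rate=0.01, n_iterations=10_000):
    theta = np.zeros(x_b.shape[1])  # start with b = 0 and a = 0
    n = len(y)
    for _ in range(n_iterations):
        gradient = (2 / n) * x_b.T.dot(x_b.dot(theta) - y)  # gradient of the MSE with respect to theta
        theta -= learning_rate * gradient                   # take a small step downhill
    return theta

# with enough iterations, this should land close to the normal equation's
# result of array([-46.32, 84.74])
```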
Complexity
To finish off this post let’s talk a bit about complexity. If there are multiple ways to solve our linear regression problem, how can we decide on which one we should use? Let’s perform a small complexity analysis for this.
When we solve our normal equation, we have to compute (XᵀX)⁻¹. X is an n × (m + 1) matrix, where m is the number of features and n is the number of samples in our dataset. So for our X, which looks like this:

X = [[1, 1], [1, 2], [1, 3], [1, 3], [1, 4], [1, 4], [1, 5]]

our m + 1 will be 2 and our n will be 7. The second dimension is m + 1 instead of m, because we add the 1 to every sample to account for our bias term. If we transpose an n × (m + 1) matrix, we get an (m + 1) × n matrix. Multiplying these two matrices together gives us an (m + 1) × (m + 1) matrix.
Inverting this matrix has a time complexity somewhere between O(m^2.4) and O(m^3), because there are multiple ways to compute the inverse of a matrix, and the complexity therefore depends on the specific implementation. This means that if we double the number of features (or input parameters) we have, we multiply our computation time by a factor between 2^2.4 ≈ 5.3 and 2^3 = 8. This is very costly.
The other operations in this equation are all matrix multiplications, which have a complexity of at most O(m² · n). As we see, solving the normal equation has a very bad time complexity with regard to m, the number of features (it’s between quadratic and cubic), and a pretty good time complexity with regard to n, the number of entries in our dataset (it’s linear!).
With gradient descent, we only perform one small step at a time. This means that the complexity of the closed-form solution comes from very few large operations, whereas the complexity of gradient descent comes from very many small operations. Performing one iteration of gradient descent is therefore fairly cheap. The cost of gradient descent comes from performing many of these small steps.
So how do we choose between the two?
If our dataset is very large, solving the normal equation won’t be too difficult, as long as the data fits in memory. But if we have a large number of features, things get ugly.
Beyond a certain number of features, solving the normal equation is no longer feasible, and it becomes more efficient to use gradient descent and iteratively approach the ideal solution rather than computing it directly. So if you have to choose between these two, you can decide based on the number of features in your dataset.
However, most of the time when you want to use linear regression, your dataset will likely not have 1000+ features. If it does, then you are most likely trying to solve a rather complicated problem with machine learning, in which case there might be an even better tool than linear regression. This means that most of the time, you will be just fine using scikit-learn’s LinearRegression-class.
Let’s look at how Scikit-Learn handles this, shall we?
What does Scikit-Learn use?
Scikit-learn uses yet another technique for its LinearRegression-class: Singular Value Decomposition (or SVD for short). The complexity of this approach is roughly O(m² · n). This means that with regard to the number of entries in our dataset, we still have a linear time complexity, but this approach is a bit faster with regard to the number of features in our dataset. Explaining exactly how SVD works in the context of linear regression would make this already long post even longer. If you are interested in reading more about it though, let me know in a comment below so that I know there is demand for it! If you want to use gradient descent for solving a linear regression problem with scikit-learn, you can use the SGDRegressor-class instead.
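A hypothetical usage sketch could look like this; note that SGD is sensitive to hyperparameters and feature scaling, so the result will only approximate the exact solution we computed earlier:

```python
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=10_000)  # hyperparameters chosen arbitrarily for this sketch
sgd.fit(x, y)
print(sgd.coef_, sgd.intercept_)     # should be roughly comparable to LinearRegression's result
```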
Below you can see a plot that shows how many computations we have to perform for specific amounts of input parameters.
You can click on the labels in the top right to toggle the individual graphs on and off. As you see, the differences between the complexities are massive! Because of this, it is crucial that you choose the right method to solve your linear regression problem.
Further Reading
If you have made it this far, congratulations! This post was quite long and I hope you now understand linear regression intuitively, mathematically, visually and are also able to solve linear regression problems using raw Python code as well as with the help of scikit-learn.
I know this can be a lot to take in when you are just starting out, and if you feel like you need some time to go through all of this step-by-step, then that is absolutely fine.
This was also the first post where I heavily integrated these custom interactive visualizations. I would love to know what you think of them! Let me know your opinion in the comments below, I would really value it.
Continuing in the Same Direction
If you still can’t get enough of linear regression, I would highly recommend you read the post Lasso and Ridge Regression Explained, Step by Step next. Lasso and ridge are basically variations of linear regression, and if you understand linear regression, lasso and ridge should not be too difficult to understand either.
Outside the Box
In this article, we looked at an exemplary application of linear regression on a small dataset with one feature (the number of bedrooms) and one target, also called label (the price of the house). Of course, you could also use more than just one feature for linear regression! You could, for example, predict the price of a house based on the number of bedrooms, the number of restrooms, the age of the house, and so on. In this example, we also did not transform our dataset in any way and used it directly to train our model. Sometimes you can achieve even better results by transforming the data first. You could, for example, square the number of bedrooms for each house before plugging it into your linear function. Doing something like this is called feature engineering: picking and modifying the features that are used to train your machine learning model. If this sounds interesting to you, then I recommend you take a look at the category Dataset Optimization, where you’ll find articles explaining topics such as feature engineering, outliers, and much more!
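As a tiny, hypothetical illustration of that idea, we could add a squared-bedrooms column to the x from the scikit-learn section and fit a new model on it (whether this actually improves the results depends entirely on the data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_engineered = np.c_[x, x ** 2]  # original feature plus its square
linreg_sq = LinearRegression().fit(x_engineered, y)
print(linreg_sq.coef_, linreg_sq.intercept_)
```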