1. Outline
2. The main idea
3. Vectorization
1. Example 1: Linear Function
2. Example 2: MSE
4. Speed Comparison
Machine Learning Math

# Vectorization Explained, Step by Step

Vectorization is one of the most useful techniques to make your machine learning code more efficient. In this post, you will learn everything you need to know to start using vectorization efficiently in your machine learning projects.

Share on:

## Outline

In this post, we will go over what vectorization really is, what the benefits of vectorizing your code are, and how you can do so. You do not need to have any specific knowledge prior to reading this post, but it will be very helpful if you are familiar with linear regression and the mean squared error since we will use those as an example. If you are not, I recommend you read the post Linear Regression Explained, Step by Step, where you will learn everything you need to know about those two. This post and the linear regression post, in particular, go hand in hand, as the examples in this post are taken directly from linear regression itself. We will also use the same dataset and terminology that was used in the post about linear regression, so even if you are familiar with linear regression, you might still find it helpful to just quickly skim through the linked post.

## The main idea

If you’ve implemented a few machine learning algorithms on your own or if you perform many calculations in your programs, you know how long those calculations can sometimes take. Let’s say we have a dataset containing 100.000 data points and we want to perform some sort of operation on every single one of those points. If we use a regular loop for this, things will get very inefficient very quickly. If we can instead vectorize our operation and just perform one (or a few) large vector or matrix operation using something like NumPy, our code will run a lot faster

*
Here is an interesting article comparing the two. Take a special look at the “Dot product”-section.
. Ok, so now we know why vectorizing our operations might be a good idea. But how can we do so? Let’s look at two examples.

## Vectorization

### Example 1: Linear Function

Let’s say we want to use linear regression to predict the price of a house based on the number of bedrooms it has. We could create a linear function like this one:

$f(x) = mx+b$

where x is the number of bedrooms in a house. Ok. Now let’s say we want to predict the price of a house based on the number of bedrooms, the year it was built in and the average house price in the city the house stands in. Then our function would look more like this:

$f(x_1,x_2,x_3) = x_1 \cdot m_1 + x_2 \cdot m_2 + x_3 \cdot m_3 + b$

Things got less pretty rather quickly.. So to make our lives easier we will vectorize our initial equation! There are a couple of steps we need to take in order to vectorize our equation. First, we rename our $m$ and $b$ to $\theta_1$ and $\theta_0$. So instead of writing

$f(x) = mx+b$

we now write

$f(x) = \theta_1x + \theta_0$

This will all make sense in a moment, just bear with me for now. Then we change up the notation of our function a little bit. We now write:

$f(x) = \boldsymbol{\theta} \cdot \mathbf{x}_b$

In this step, we rewrote our function to now use vectors instead of scalars. In other words, instead of multiplying a bunch of individual numbers together, we now only need to multiply two vectors. We define $\boldsymbol{\theta}$ and $\textbf{x}_b$ like this:

$\boldsymbol{\theta} = \left[\begin{array}{c} \theta_0 \\ \theta_1 \\ \end{array}\right] , \textbf{x}_b = \left[\begin{array}{c} 1 \\ x \\ \end{array}\right]$

where $x$ is our number of bedrooms for a given house. Let’s first check if this is equal to our initial definition.

If we calculate the dot product of our two vectors $\boldsymbol{\theta}$ and $\textbf{x}_b$, we get:

$\boldsymbol{\theta} \cdot \textbf{x}_b = \theta_0 \cdot 1 + \theta_1 \cdot x = \theta_0 + \theta_1 \cdot x$

As we can see, our function remains unchanged!

But this might still look a bit confusing. Let’s go through the changes step by step.

We rewrote our $\theta_0$ and $\theta_1$ into the vector $\boldsymbol{\theta}$. So far so good. Next up we rewrite our input x into the vector $\textbf{x}_b$. Ok, but what is this 1 doing in our $\textbf{x}_b$? Isn’t our x supposed to be the number of our bedrooms? Since we always want to add our bias term $b$ (now $\theta_0$) without multiplying it with anything (remember: our function looks like this: $mx+b$), we need to account for that. By adding a 1 to our x-vector, we make sure that our bias-term is then multiplied by this 1, and therefor always added as is inside of our equation. This is also the reason why, in many books and tutorials, you will see people adding a 1 to their parameter vector. This is to account for the bias-term. Now you could write out the bias-term separately and just not add the 1 to your x-vector, but this way, our equation looks a bit neater on the surface. Both notations are used in practice, so now you know both sides of the coin! We also add this little $_b$ to our $\textbf{x}$ to signal that we accounted for the bias-term already and to also differentiate this from any input we might also call “x”. So if you see an $\textbf{x}_b$, remember that it means “x, but with a 1 added to account for the bias term”.

### Example 2: MSE

So a non-vectorized definition of the MSE might look like this:

$MSE(m,b) = \frac{1}{7}\sum_{i=1}^7{(y_i-(mx_i+b))^2}$

There is nothing wrong with that definition, but if we wanted to translate our definition into code, what we would probably do is create a loop that sums up each of the individual residuals. The code could look similar to this:

.css-1ov1681{-webkit-display:table-row;display:table-row;padding-right:1em;}.css-i0kwuw{display:table-cell;text-align:right;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none;opacity:0.3;width:0px;} .css-noiznb{display:table-cell;padding-left:1.75em;}def mse_non_vectorized(data,function):     sosr = 0 # sum of squared residuals     for (x,y) in data:         residual = y - function(x)         sosr += residual**2     mse = sosr / len(data)     return mse

Now, for our small dataset of 7 points that is no big deal, but imagine we had 100.000 data points (which is not uncommon today). Every time we wanted to compute our MSE, we would have to perform 100.000 operations through our loop. We can do better than this! So let’s vectorize our MSE!

For this, let’s use an exemplary dataset with two features: the number of bedrooms in a house and the price for that house. The dataset looks like this:

make interactive

In Python, we can implement our dataset as a simple list:

 data = [(1, 80), (2, 170), (3, 100), (3, 220), (4, 200), (4, 270), (5, 500)]

Let’s also create an exemplary linear function so we can calculate our MSE:

 def f(x):     return 60*x #number of bedrooms * 60k\$ = price of the house

If we use our non-vectorized implementation of the MSE, we get:

 >>>mse_non_vectorized(data,f) 53400

Ok, now let’s vectorize it! First, let’s make some small changes. Let’s reshape our individual data values (1,80), (2,270) and so on into two vectors $\textbf{x}$ and $\textbf{y}$. This will be useful later on when we redefine our MSE. Those vectors look like this:

$\textbf{x} = \left[\begin{array}{c} 1 \\ 2 \\ 3 \\ 3 \\ 4 \\ 4 \\ 5 \\ \end{array}\right] , \textbf{y} = \left[\begin{array}{c} 80 \\ 170 \\ 100 \\ 220 \\ 200 \\ 270 \\ 500 \\ \end{array}\right]$

With this, we can now redefine our MSE like this:

$MSE(m,b) = \frac{1}{7} \cdot \sum_{i=1}^7{(y_i-{\color{#26a6ed}(\textbf{x}^{(i)}_b \cdot \boldsymbol{\theta})})}^2$

Remember, the part of the equation that is colored in blue is equivalent to $f(x)$. Ok, but what does this $^{(i)}$ mean? Is this something like the “i-th power?” I’m glad that you ask! If we write out our equation in more detail once again, we get:

$MSE(m,b) = \frac{1}{7} \cdot ((80-{\color{#26a6ed}\begin{pmatrix}1\\1\end{pmatrix} \cdot \begin{pmatrix}\theta_0\\\theta_1\end{pmatrix}})^2 + (170-{\color{#26a6ed}\begin{pmatrix}1\\2\end{pmatrix} \cdot \begin{pmatrix}\theta_0\\\theta_1\end{pmatrix}})^2 + ...)$

As we see, $\textbf{x}^{(i)}_b$ is the $i$-th sample in our $\textbf{x}_b$. Since we’ve already used the $b$ as a subscript, we just put the $i$ at the top, as a superscript. We write the $i$ in brackets to signal that this is not a power, but just an index-notation. I know this looks a bit off when you first see it, but at some point, you will get used to it. Note that we could have just not used $b$ as a subscript and put the $i$ at the bottom instead, but this way we do can show all of the necessary information, even though it looks a bit bloated.

And now, instead of taking the sum over every sample in our dataset, we instead calculate the error for every sample at once by replacing our sum with the following equation:

$y-{\color{#26a6ed}(\textbf{x}_b \cdot \boldsymbol{\theta})}$

The part in blue is just the following:

$\left[\begin{array}{c} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \\ 1 & 4 \\ 1 & 4 \\ 1 & 5 \\ \end{array}\right] \cdot \left[\begin{array}{c} \theta_0 \\ \theta_1 \\ \end{array}\right] = \left[\begin{array}{c} \theta_0 + \theta_1 \cdot 1 \\ \theta_0 + \theta_1 \cdot 2 \\ \theta_0 + \theta_1 \cdot 3 \\ \theta_0 + \theta_1 \cdot 3 \\ \theta_0 + \theta_1 \cdot 4 \\ \theta_0 + \theta_1 \cdot 4 \\ \theta_0 + \theta_1 \cdot 5 \\ \end{array}\right] = {\color{#26a6ed}y_{pred}}$

Since this part represents the predicted prices of our function, we call it ${\color{#26a6ed}y_{pred}}$.

Now we can calculate the squares of residuals by taking the dot product of $y-{\color{#26a6ed}y_{pred}}$ with itself:

$MSE(m,b) = \frac{1}{7} \cdot ( (y_i-{\color{#26a6ed}y_{pred}})^T \cdot (y_i-{\color{#26a6ed}y_{pred}}) )$

This is okay to do since taking the dot product of a row vector with a column vector yields us a scalar. If we plug in the values for our concrete example and for our function $f$, we get:

$y - {\color{#26a6ed}y_{pred}} = \left[\begin{array}{c} 20 \\ 50 \\ -80 \\ 40 \\ -40 \\ 30 \\ 200 \\ \end{array}\right]$

and if we take the dot product of this (transposed) vector with itself, we get:

$\left[\begin{array}{c} 20 & 50 & -80 & 40 & -40 & 30 & 200 \\ \end{array}\right] \cdot \left[\begin{array}{c} 20 \\ 50 \\ -80 \\ 40 \\ -40 \\ 30 \\ 200 \\ \end{array}\right] = 53400$

which is in fact the same value that we got from our non-vectorized implementation of the MSE!

If we translate our new definition into code, we get:

 def mse_vectorized(y, y_predicted):     error = y-y_predicted # save y-y_pred in a variable so that we only have to compute it once     loss = 1/(y.size) * np.dot(error.T, error)     return loss

We now no longer need to pass in our entire dataset as well as a function to our MSE. Instead, we calculate $y_{pred}$ beforehand, like this: y_predicted = f(x). This makes our function a bit more lightweight.

Great, now we have vectorized our MSE as well!

## Speed Comparison

Ok, now let’s test both of our implementations on an exemplary dataset. But for a real test, our small dataset is not going to cut it. Since our dataset does not really have to make sense, we can just use some random numbers. Let’s create an array with roughly one million values and reshape it to contain individual arrays [x,y]. Then we can also save x and y separately.

 arranged_values = np.arange(1,100000,0.1).reshape(-1,2) #shape: (499995, 2) arranged_values.shape
x = arranged_values[:,0] y = arranged_values[:,1]

If we now time the executions of our non-vectorized and our vectorized MSE using the magic command %%timeit (you can learn more about those here), we get this:

As you can see, the vectorized version outperformed the non-vectorized one by a landslide!