Optimizing Octave Code

Octave is a fantastic domain specific language for doing lots of operations on matrices and vectors. If you are reading this page, congratulations -- it probably means you approached your problem by first writing it in Octave/Matlab, and are now hoping to improve the performance of a few hot spots in your code. This is opposed to the poor person who thought they would write their code in raw C ("for speed!"), and have been spending the last couple months debugging their hopeless mess.

However, we're not off the hook either. In this article, I will start with an overview of how to approach optimization in Octave, and conclude with some tips on writing C++ code for Octave user functions.

First things first, as with all optimization a look at your algorithm is probably the best place to start. I'm going to assume you have picked a decent algorithm or approach to your problem, but it never hurts to revisit your assumptions. Taking things from O(n^2) to O(n) can make a bigger difference (depending on your data set) than any of the approaches outlined here.

Next, we can use a very low-tech approach to finding hot spots in your code. As per the Pareto principle, there is probably some small corner of your function that is taking the bulk of the execution time. To find this hot spot, we can make use of the tic() and toc() functions. The former starts a timer and the latter stops the timer, printing out the amount of elapsed time as a side effect if evaluated:

octave-3.0.2:9> m = rand(1000,1000);
octave-3.0.2:10> v = rand(1000,1);
octave-3.0.2:11> tic(); m * v; toc()
Elapsed time is 0.00252795 seconds.

Using these two functions as a primitive profiling tool, you can quickly winnow down what part of your code is causing problems by a binary search approach. Time the whole section of code, then the first and second halves, and so on. It will quickly become apparent where the problem is.

At this point, one of two things has happened: either you have slapped your forehead in realization of something dumb ("I don't need to re-read that huge file at every step in the for-loop!"), or you need to think about how to rewrite some of the problematic code you found. Assuming the latter case, let's review some facts about Octave.

As the Octave documentation will tell you, Octave performs best when it is able to "vectorize" operations. For example, writing a for loop to iterate over elements in a huge matrix and accumulate the sum will take much longer than the equivalent "sum" operation on the matrix. Don't believe me? Take a look:

octave-3.0.2:25> m = rand(1000,1000);
octave-3.0.2:26> tic(); total = 0; for i = 1:1000; for j = 1:1000; total += m(i,j); endfor; endfor; toc()
Elapsed time is 12.471 seconds.
octave-3.0.2:27> total
total = 4.9963e+05
octave-3.0.2:28> tic(); total = sum(sum(m)); toc()
Elapsed time is 0.01094 seconds.
octave-3.0.2:29> total
total = 4.9963e+05

To answer the question of why this is so, you have to consider what the Octave interpreter is doing at any given point. Looking at the Octave source code itself can also be edifying. In the above example, inside the for-loop we are referencing every element in the 1000 x 1000 matrix. Each reference involves at least one function call within Octave itself to get the value after checking that the indices you supplied were within bounds, etc. In the second case, we are calling the sum function, whose sole purpose in life is to accumulate totals of N-dimensional objects along a specified (or default) dimension.

Many operations in Octave can be expressed in terms of vector/matrix operations which will make them run much faster. One good approach is to just look for every for/while loop you have written and ask yourself, "Could I have done this as a vector or matrix operation?". For example, subtracting elements of a vector from every element from each column vector of a matrix.

Assuming you've looked at all of the possible causes of slowness discussed so far, we will now go into using oct-files to achieve even better performance. It's worth mentioning that this is a fairly last-ditch approach, and so please think carefully about whether or not it is really necessary. Most of the time, it can be hard to achieve a good balance between the amount of time it might take you to write and debug the equivalent C++ code versus how much time savings you can achieve with it.

I will use an example that may seem to work against my argument at first. One of the traditional ways to do ordinary least squares regression is to compute inv(XTX) * XT * y. If it turned out that doing regressions was your bottleneck, you might reasonably start to wonder if it is possible to speed this process up by directly calling the LAPACK functions from C++ to compute this yourself.

The first thing to do is get familiar with how to build oct-files using mkoctfile. The documentation has an excellent overview of how to do this, and I encourage you to try a simple example first before attempting anything more complex.

To do the OLS calculation above, it turns out we have to write some fairly complex code. We will need to call the BLAS dgemv_ and dgemm_ functions to do the matrix-matrix and matrix-vector multiplies. Each of these has a large number of parameters you have to get exactly right or you'll get memory corruption and a core dump will result (hopefully?). When I went through this exercise myself, however, it was actually fairly straightforward to use the CLAPACK header on my system (Mac OS X has a framework called vecLib which includes LAPACK, BLAS, etc). One very helpful hint is that the Octave Matrix object has a member function called fortran_vec() which returns a double pointer to the column-major memory the matrix data is stored in. This is exactly the pointer type the LAPACK routines need.

Upon further reflection and reading of the LAPACK documentation, you might realize that it is possible to do OLS by simply calling the LAPACK dgels_ function. This function computes the alternative notation of OLS, Ax = b, in one step. The "x" value is the same betas result obtained by the longer equation above.

Great! Here is the code, for reference:

// solve b = inv(XTX) * XT * y
// aka Ax = b (where A is X, B is y, and x is b (oh boy...))
char trans = 'N';
__CLPK_integer m = X.rows();
__CLPK_integer n = X.columns();
__CLPK_integer nrhs = 1;
__CLPK_integer lda = m;
__CLPK_integer ldb = y.rows();
__CLPK_integer info = 0;
__CLPK_doublereal work[256*64+256];
__CLPK_integer lwork = sizeof(work)/sizeof(__CLPK_doublereal);
dgels_(&trans, &m, &n, &nrhs, X.fortran_vec(), &lda, y.fortran_vec(), &ldb, work, &lwork, &info);
if(info != 0)
{
fprintf(stderr, "Error in dgels_");
return;
}

It certainly is a lot shorter than doing all the multiplies, transposes, etc manually! This sounds like a winner, doesn't it? And, if we do some speed tests comparing the naive OLS written in Octave versus our faster OLS written in C++ calling LAPACK, we see a slight uptick in performance:

/Volumes/AEGIS/nolanw/dev/octave cat calc_test.m
A = [76 27 18 ; 25 89 60 ; 11 51 32; 12 13 14];
b = [10 7 43 25];

printf("Naive solver:\n");
s = tic()
b * (inv(A' * A) * A')'
e = tic()
printf("Octave left division:\n");
s = tic()
A\b'
e = tic()

printf("calc result:\n");
s = tic()
calc(A, b)
e = tic()

/Volumes/AEGIS/nolanw/dev/octave /Applications/Octave.app/Contents/Resources/bin/octave
GNU Octave, version 3.0.2
Copyright (C) 2008 John W. Eaton and others.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or
FITNESS FOR A PARTICULAR PURPOSE. For details, type `warranty'.

Octave was configured for "i386-apple-darwin8.9.1".

Additional information about Octave is available at http://www.octave.org.

Please contribute if you find this software useful.
For more information, visit http://www.octave.org/help-wanted.html

Report bugs to (but first, please read
http://www.octave.org/bugs.html to learn how to write a helpful report).

For information about changes from previous versions, type `news'.

Prepended /Volumes/AEGIS/nolanw/dev/octave/lib/3.0.2/10.4.11 to local library path.
Welcome...
octave-3.0.2:1> calc_test
Naive solver:
s = 1229959469735221
ans =

0.034725 -0.580661 1.291719

e = 1229959469750492
Octave left division:
s = 1229959469750798
ans =

0.034725
-0.580661
1.291719

e = 1229959469751124
calc result:
s = 1229959469751433
ans =

0.034725 -0.580661 1.291719

e = 1229959469766306

The "s" and "e" printouts show the number of "ticks" that elapsed between calls to tic and toc. Divided by 1000000.0, they represent seconds. The naive solver took approximately 0.015271 seconds to calculate the betas (obviously the accuracy and precision of the timing results could be increased by increasing number of trials). (I should mention it is best to run these tests at least twice, once to "warm up" any subsystems that do caching or initialization.)

The "calc" function (which is a function containing the dgels_ code I showed above) took 0.014873 seconds to run, to get the same result. That is a slight improvement... However, we have to keep in mind that the function also makes some significant assumptions about the data types used and their sizes (remember that hardcoded work array?).

However, there is one other item in the test results we haven't talked about -- the left division. What's that? It looks like it took 0.000326 seconds!

It turns out that "left division" (the backslash operator for matrix-vector) in Octave solves the Ax=y equation for you. Reading the documentation or Googling for the problem beforehand would have given us a heads-up on this and saved a lot of time. Looking at the Octave code, it is also apparent that the Matrix object (eventually) calls (a variant of) dgels_ anyway.

So, the last question is why is the built-in operator so much faster than our function, if it is doing all these other range checks, etc but still calling the same LAPACK code we do in a very short function? Inserting printf's of timeofday() values into the C++ reveals the answer. The bulk of the C++ code's time is not spent in dgels_ (in fact, that time is quite close to the total time spent in the left division operator). Most of the time is spent in Octave's actual function call mechanism. If you think about it, it sort of makes sense -- the built in operator will get immediately identified by the lexer and called straight away. A user-defined function needs to be identified as such, looked up in a symbol table, have its arguments marshalled appropriately, and then get dispatched.

As I mentioned, this was something of an anti-example. However, an important lesson to take away is: if you can amortize the cost of the function call overhead, you can actually obtain some impressive speedups writing Octave functions in C++.


Top

Contact email: nolanw followed by underbar then the word "commerce" at willnolan dot com

Last modified: Mon Dec 29 07:24:00 PST 2008