Comparing Python, Go, and C++ on the -Queens Problem
Abstract
Python currently is the dominant language in the field of Machine Learning but is often criticized for being slow to perform certain tasks. In this report, we use the well-known -queens puzzle [1] as a benchmark to show that once compiled using the Numba compiler it becomes competitive with C++ and Go in terms of execution speed while still allowing for very fast prototyping. This is true of both sequential and parallel programs. In most cases that arise in an academic environment, it therefore makes sense to develop in ordinary Python, identify computational bottlenecks, and use Numba to remove them.
1 Introduction
Python currently is the dominant language in the field of Machine Learning and gives easy access to powerful Deep Learning packages such as TensorFlow and PyTorch. However, it is known to be slow to perform some operations such as loops, which are not always easy to vectorize away. In such situations, one might consider switching to another language, such as C++ or the more recent Go language whose similarity to Python makes them potentially attractive replacements. In this note, we will argue that this may be necessary only rarely because the Numba python compiler [8] delivers performance close to those of C++ while preserving the compactness and ease of development that make Python such a powerful prototyping tool. Furthermore, it is easy to use. Once a set of Python functions has been identified as computationally intensive, one simply adds Numba decorators before their definitions to instruct Python to compile them while leaving the rest of the code largely unchanged.
To demonstrate this, we use the well known -queens puzzle [1] as a benchmark. It involves placing chess queens on an chessboard so that no two queens threaten each other. Fig. 1 depicts three solutions on a standard board and one on a larger one. We will focus on finding the number of solutions as a function of . This is easy for small values of but quickly becomes computationally intractable for larger ones because the complexity of our algorithm is exponential with respect to . There is no known formula for the exact number of solutions and, to date, is the larger value of for which the answer has been computed [12].
All our benchmarking code available is available online.111https://github.com/cvlab-epfl/n-queens-benchmark We welcome comments and suggestions for potential improvements. The article is also on ArXiv222https://arxiv.org/abs/2001.02491 and if you find it useful you can cite it. Thanks.
2 Sequential Processing
We started from the recursive algorithm described in [15] to compute one single solution of the 8-queens problem and translated it to Python. Our implementation relies on the fact that for two queens at board locations and not to be in conflict with each other, they must not be on the same row, on the same same column, or on the same diagonal. The first two mean that and . The third holds if and . In other words, the two diagonals going through any location are completely characterized by and .
To exploit this, the function allQueensRec
of Tab. 1 allocates boolean arrays col
, dg1
, and dg2
to keep track on which columns and diagonals are still available to place a new queen on an board. It then invokes the recursive function allQueensRecursive
. At recursion level , for each , it adds a queen at location if it is available, marks the column and the diagonals and as unavailable for additional queens, and calls itself for row . The recursion ends in one of two ways. Either reaches , meaning that all rows have been successfully filled, or no more queen can be added. If the first case, a counter is incremented. In the second case, nothing happens. In both cases, the program backtracks, undoes it earlier marking, and continues until all solutions have been found. This process could be sped up by exploiting the symmetries of the -queens problems. However, this is not required for benchmarking purposes an we chose not to do it to keep the code simple. We also chose to use the C++ and Go naming convention for functions and variables, that is, we use allQueensRec
instead of the more typical all_queens_rec
, so that we can use the same names in all versions of the code we present.
In Fig. 2(a), the red curves depicts the computation time on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system. In the top part of the figure, we plot the wall-clock time as as function of the board size using a standard scale. In the bottom part of the figure, we plot the same computation times using a log-scale instead, which results in an almost straight curve. This serves as a sanity check because the computational complexity grows exponentially with . As allQueensRecursive
performs loops, our vanilla Python implementation is inefficient. To remedy this, we used the Numba python compiler [8] as shown in Tab. 2. The code is almost unchanged except for adding a couple of Numba decorators and making the array types Numba compatible. Yet, as depicted by the green curve in Fig. 2(a), these minor modifications deliver a 33-fold increase in average computing speed of allQueensNmb
over allQueensRec
.
The Numba decorator njit()
that appears in Tab. 2 is short for jit(nopython=True)
. It ensures that if the code compile without errors it will not invoke python while running and will therefore be fast. Additionally, we could have used jit(nopython=True,nogil=True)
to instruct Numba to release the Python Global Interpreter Lock [3] while executing the function, thus allowing several versions to run simultaneously on threads of a single process, something that standard Python code cannot do because of the aforementioned lock. This does not have any significant impact on performance in a sequential execution scenario.
To further assess how effective the Python/Numba combination is, we rewrote the code in Go and C++, as shown in Tabs. 6 and 3. The short variable declarations make the Go code very similar to the Python code while being statically typed. The C++ code is slightly more verbose and one must remember to deallocate what has been allocated because there is no garbage collector. As shown in Fig. 4, this can be remedied by using more sophisticated containers such as the standard vector
of C++ that are automatically deallocated at the end of the scope of their definition. Note that we used vector<uint8_t>
as the type for our boolean arrays instead of the apparently more natural vector<bool>
. We did this because the latter packs the bits densely and has to perform binary arithmetic to extract the requested bit for each access because memory can only be addressed down to whole bytes. In other words, it reduces memory usage at the expense of increased computation. As we are interested in speed, it is therefore more effective to explicitly use bytes (uint8_t
) for our purpose. Nevertheless, we have verified that even when using byte vectors, the implementation of Fig. 4 incurs a small, but noticeable, performance decrease with respect to that of Tab. 3 and we therefore chose to stick with it, even though it is less elegant. In short, unlike Go, C++ gives the programmer great freedom to carry out tasks in many different ways but it takes a lot expertise to exploit it effectively and to avoid the many lurking pitfalls.
For example, unlike Python and Go, C++ does not automatically check that one does not write beyond the bounds of arrays. As a result, the buggy code of Tab. 5 runs but returns nonsensical values. We unintentionally made this mistake while translating the code from Python and, even though this is a short program, it took us a while to spot it. Of course, we could have used a tool such as valgrind, which would have detected the error, but this is far less convenient than being given a runtime warning.
By default Numba does not perform bounds checks but they can be enabled using the decorator njit(boundscheck=True)
, which can be useful while debugging.
The cyan and purple curves of Fig. 2(a) depict the corresponding runtimes. The Numba, Go, and C++ curves are almost superposed. Closer examination of the raw numbers give in Tab. 15 in appendix show that C++ wins. Go in slower by about and Numba by . Numba is slower mostly for low values of , which suggests that the algorithm itself runs just as fast but that calling the Numba function from Python involves an overhead. While the observed differences are statistically significant based on the variances of the different runs, in our daily research practice, they are rarely large enough to justify giving up the development speed that Python provides and to contend with potential bugs such as the one discussed above.
However, there are optimizations that require the low-level control that C++ or Go can provide. For example, in all versions of the code presented here, the memory for the col
, dg1
, and dg2
arrays is allocated dynamically on the heap. The array sizes are decided at runtime and this code could in theory handle the -queens problem for any value of . However the computational cost is exponential and any value of is wildly impractical. If we accept to limit ourselves to , we can use fixed-sized arrays allocated on the stack by declaring them as var col[32]bool
in Go or std::array<bool, 32>
in C++. Unlike in the case discussed above, using bool
instead uint8_t
has no adverse effect. We have checked that the C++ code modified in this manner delivers a gain over Numba, instead of the earlier . Potential explanations are that putting the arrays on the stack works better for the CPU cache or that the optimizer has an easier time reasoning about fixed-size stack arrays. In any event, this shows that C++, and Go, being closer to the hardware may be useful to fine-tune code under some circumstances.
Go can therefore be considered a promising alternative to both Python and C++ because its run-time checks make bugs such as the one of Fig. 5 easy to detect and correct. Furthermore, it is almost as concise as a Python and a little faster than Numba. However, some of its design features make it unwieldy in the prototyping role. For example, insisting that all variables and packages declared in a file be used makes sense for production code but is unhelpful when groping for a solution to a research problem: Commenting out a particular line of code, can mean many modifications in the file, which are unnecessary until a final solution has been found. Similarly not providing a full-fledged class-system can be understood as a way to discourage the writing of hard-to-maintain spaghetti code, which is commendable in production mode but unnecessarily rigid in prototyping mode.
3 Parallel Processing
Nearly every modern computer, including the one we used, has a multicore CPU and we can speed things up by running independent parts of the computation simultaneously on separate cores. In Go and C++, this can be done using multiple threads. Standard Python cannot do this due to the Global Interpreter Lock (GIL) [3] that we have already encountered in the previous section. Fortunately, there are several workarounds and we explored two of them:
-
Using Numba’s automatic parallelization [10]. Numba implements the ability to run loops in parallel as in Open Multi-Processing (OpenMP). The loop’s body is scheduled in separate threads and the system automatically takes care of data privatization and reduction.
-
Using a pool of processes. The pool distributes the computation into separate processes and tasks are sent to the available processors using a FIFO scheduling. Each process has its own interpreter and GIL, so they do not interfere. The price to pay is that objects need to be serialized and sent to the processes.
To test these two approaches, we parallelized the allQueensRec
in a simple way. As shown in Tab. 7, we defined a new function allQueensCol
that puts a queen in column of the first row and then invokes the function allQueensNumba
defined in Tab. 2 starting at the second row instead of the first, as in allQueensRec
. Summing the results for all possible values of yields the same results by performing independent computations. In Tab. 8, we integrate this code into two functions that spread the tasks across separate cores: allQueensPara
uses the first method described above while allQueensPool
uses the second. We will refer to them as para and pool respectively. Note how short the code is. Numba can take parallelization even further and produce functions that exploit the GPU. However, we did not explore this aspect in this study because our problem is not conducive to GPU processing.
In Fig. 2(b), we compare runtimes of the sequential Numba-compiled code of the previous section with our two parallelized versions. As before, the sequential code is depicted by the green curve while the two parallel versions are depicted by the red and blue curves, labeled para and pool respectively. para clearly delivers a significant improvement. However, for smaller values of , we noted that para does not always fully use the 8 cores of our machines, which impacts its performance. For values of up to 13, the overhead involved in spawning new processes dominates the computational cost of pool and makes it uncompetitive. However, for , this overhead becomes negligible with respect to the rest of the computation and pool starts dominating, albeit only by a small margin for large values of , as can be seen in Tab. 15.
To again compare against Go and C++, we rewrote the code in these two languages using their built-in multi-threading capabilities, as shown in Tabs 9 and 10. Note that we used the template function std::async function to make the C++ version compact. The syntax is complex but gives access to the full power of lambda functions. The corresponding performance measurements are depicted by the cyan and purple curves of Fig. 2(b). As before for small values of , para and pool are uncompetitive because the initial overhead is too large. However, for larger values of they catch up and eventually do better than Go and almost as well as C++. In short, there are corner cases in which it might pay to switch from Python to C++ or go but it is not clear how pervasive they are in our research practice.
4 Numba Limitations
In the two previous sections, we have argued that Numba is a powerful tool to painlessly compile potentially slow Python code so that it runs almost as fast as Go and C++. However, it also has limitations: Only a subset of Python [11] and NumPy [9] features are available inside compiled functions. Numba has a compilation mode that generates code able to handle all values as Python objects and uses the Python C API to perform all operations on such objects. Unfortunately, relying on this mode results in almost no performance gain over non-compiled code. This is why we used the njit()
decorator in all our examples. It yields much faster code but requires that the native types of all values in the function can be inferred, which is not necessarily true in standard Python. Otherwise, the compilation fails.
In practice, this imposes an additional workload on the programmer who has to figure out what parts of the code are computationally expensive, encapsulate them in separate functions, and make sure that these functions can be compiled using the no-python mode that njit()
enforces. This is probably why there are ongoing efforts to optimize whole Python programs such as PyPy [13]. Unfortunately, the results are not always compatible with libraries utilizing the C API, such as those routinely used in the field of scientific computing. As discussed in appendix, Julia is a potential alternative to Python/Numba that supports both high performance scientific computing and fast prototyping, is compiled, and could eventually address this problem.
5 Conclusion
As Computer Vision and Machine Learning researchers, we primarily need a language that allows us to test and refine ideas quickly while giving us access to as many mathematical, image processing, and machine learning libraries as possible. The latter spares us the need to reinvent the wheel every time we want to try something new. Maintainability and ability to work in large teams are secondary considerations as our code often stops evolving once the PhD student or post-doctoral researcher who wrote it leaves our lab. Before that happens, we typically make it publicly available to demonstrate that the ideas we published in conference and journals truly work and, in the end, that is often its main function.
Python fits that bill perfectly at the cost of being slow when performing operations such as loops. Fortunately, as we showed in this report, this shortcoming can be largely overcome by using the Numba compiler that delivers performance comparable to that of C++, which itself tends to be faster than Go. This suggests that a perfectly valid workflow is to first write and debug a program in ordinary Python; identify the computational bottlenecks; and use Numba to eliminate them. This will work most of the time. In the rare cases where it does not, we can rewrite the relevant section of the code in C++ and call it from Python, which can be achieved using Cython [2] or pybind11 [4]. Interestingly, this approach harkens to the standard way one used to work in the much older Common Lisp language, as discussed in the appendix.
Appendix
Appendix A Other Languages
In his book [15], N. Wirth proposed the -queens algorithm implemented in Pascal and shown in Tab. 11. It is specific to the case and is designed to stop when the first solution is found. It then returns the corresponding configuration. As shown in Tab. 12, this can be done even more concisely in Prolog, a logic programming language of the same era as Pascal. What makes Prolog particularly interesting is that, of all the languages discussed in this report, it is the only one that forces a truly different approach to programming. It is well suited to performing the kind of systematic exploration and backtracking that solving the -queens problem requires and is still used to solve specific tasks that involve rule-based logical queries such as searching databases.
As not all implementations of the Pascal standard support dynamic arrays, extending the program of Tab. 11 that is specific to the case to the general case would require manually allocating memory using pointers, much as in C. This would not be necessary in the even older Common Lisp language, as shown in Tab. 13. Although among the most ancient languages still in regular use, Lisp offers many of the same amenities as Python, that is, dynamically allocated arrays, sophisticated loop structures, and garbage collection among others. It can be used either in interpreted or compiled form, much like Python without and with Numba. When invoking the compiler, the optional type declarations help it generate faster code. A standard approach to developing in Lisp is therefore to prototype quickly without the declarations and then add them as needed to speed up the code. Interestingly Python is now moving in that direction with its support for type hints but does not yet enforce that variable and argument values match their declared types at runtime. They are only intended to be used by third party tools such as type checkers, IDEs, and linters [14].
As can be seen in Fig. 2(a), because it is compiled, Lisp does not perform so badly compared to the more recent languages we discussed in this report. The even newer Julia [6] language can be understood as being related to it in that it supports rapid prototyping by allowing interactive execution while being compiled. Type declarations are available but optional and the code is very Python-like, as can be seen in Tab. 14. Like Numba, Julia uses LLVM [7] to perform low level optimizations and produce efficient native binary binary code. Unlike in C++ or Go, there is no explicit compilation step and yet it delivers performance that are almost on par with those of the other compiled languages we discussed, as can be seen in Fig. 2(a) and Tab. 15.
This make Julia a potentially attractive alternative to Python/Numba. Unfortunately, there are some significant obstacles to its adoption. First, it still is a new language. Some important features remain experimental and the number of third-party libraries is limited, whereas Python gives access to a wealth of powerful libraries, such as the deep learning ones that have become absolutely central to our research activities. Furthermore, it features some design choices that differentiate it from currently popular languages [5], such as 1-indexed arrays and multiple-dispatch instead of classes. Whether or not these choices are wise, they make it harder to switch from an established language like Python to Julia.
Appendix B Raw Data
The performance numbers we used to produce the plots of Fig. 2 are given in Tab. 15. For both the sequential versions of the code and for each value of , the time for the fastest implementation appears in red and the one for the second best in blue. These numbers were obtained on a 2.9 GHz Quad-Core Intel Core i7 Mac running the Catalina operating system. For all versions of the code we ran 20 trials for , 10 for , and 3 for and computed the mean and variance in each case. We have rerun all these benchmarks on an Intel Xeon X5690 CPU running Ubuntu 18.04 and the overall ranking of the implementations was unchanged.
Sequential | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
---|---|---|---|---|---|---|---|---|---|---|---|
Python | 0.00391 | 0.01625 | 0.07024 | 0.33707 | 1.77808 | 9.58502 | 57.5610 | - | - | - | - |
Lisp | 0.00036 | 0.00148 | 0.00577 | 0.02711 | 0.13346 | 0.72808 | 4.43882 | 29.1243 | - | - | - |
Julia | 0.00010 | 0.00047 | 0.00213 | 0.01057 | 0.05555 | 0.30745 | 1.90335 | 13.1295 | 89.8649 | 634.371 | - |
Numba | 0.00011 | 0.00053 | 0.00219 | 0.01054 | 0.05455 | 0.28898 | 1.74438 | 11.0766 | 75.5948 | 546.312 | 4085.54 |
Go | 0.00009 | 0.00044 | 0.00198 | 0.01007 | 0.05404 | 0.30519 | 1.80617 | 10.9284 | 73.6867 | 527.514 | 4014.05 |
C++ | 0.00009 | 0.00041 | 0.00189 | 0.00935 | 0.04991 | 0.29386 | 1.67399 | 9.99408 | 66.7773 | 491.910 | 3677.10 |
Parallel | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
Numba Para | 0.00008 | 0.000248 | 0.00087 | 0.00473 | 0.02509 | 0.15463 | 0.98384 | 6.6811 | 17.2807 | 135.166 | 1184.33 |
Numba Pool | 0.13120 | 0.130002 | 0.13113 | 0.13105 | 0.13349 | 0.13280 | 0.44365 | 2.52153 | 16.9606 | 135.078 | 994.799 |
Go | 0.00012 | 0.000221 | 0.00055 | 0.00236 | 0.01222 | 0.08580 | 0.51022 | 3.44080 | 22.9841 | 140.860 | 1068.71 |
C++ | 0.00012 | 0.000202 | 0.00063 | 0.00272 | 0.01293 | 0.06945 | 0.39773 | 2.75142 | 18.7342 | 122.827 | 877.398 |
References
- [1] J. Bell and B. Stevens. A survey of known results and research areas for n-queens. Discrete Mathematics, 309(1):1–31, 2009.
- [2] Cython: C-extensions for python. https://cython.org/.
-
[3]
Global interpreter lock.
https://en.wikipedia.org/wiki/Global_interpreter_lock. -
[4]
Wenzel Jakob, Jason Rhinelander, and Dean Moldovan.
pybind11 – seamless operability between c++11 and python, 2017.
https://github.com/pybind/pybind11. -
[5]
Julia: Noteworthy Differences from other Languages.
https://docs.julialang.org/en/v1/manual/noteworthy-differences/. -
[6]
The Julia Programming Language.
https://julialang.org/. -
[7]
The LLVM Compiler Infrastructure Project.
https://llvm.org/. -
[8]
Numba: A High Performance Python Compiler.
https://numba.pydata.org/. -
[9]
Numba: Supported NumPy Features.
https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html. -
[10]
Numba: Automatic parallelization with @jit.
https://numba.pydata.org/numba-doc/dev/user/parallel.html. -
[11]
Numba: Supported Python Features.
https://numba.pydata.org/numba-doc/dev/reference/pysupported.html. - [12] T. Preußer, B. Nägel, and R. Spallek. Putting queens in carry chains. 2009.
-
[13]
PyPy: a fast, compliant alternative implementation of the Python language.
https://pypy.org/. -
[14]
Support for Type Hints in Python.
https://docs.python.org/3/library/typing.html. - [15] Niklaus Wirth et al. Algorithms+ Data Structures. Prentice-Hall, 1976.