Tuesday, November 19, 2019

Tech Book Face Off: Seven Concurrency Models In Seven Weeks Vs. CUDA By Example

Concurrency and parallelism are becoming more important by the day, as processor cores are becoming more numerous per CPU and more widespread in every type of computing device, while single core performance is stagnating. Something that used to be barely accessible to the average programmer is now becoming ubiquitous, which makes it even more pertinent to learn how to utilize all of these supercomputers effectively. Besides, parallel processing is a fascinating topic, and I think it's great that it is now so easy to experiment at home with things that used to be reserved for huge companies and university research departments. In order to become more proficient at programming in this way, I started with the book Seven Concurrency Models in Seven Weeks: When Threads Unravel by Paul Butcher for an overview of the current state of affairs in concurrent and parallel programming. Then I went for an introduction to CUDA programming for GPUs with CUDA by Example by Jason Sanders and Edward Kandrot. I've been looking forward to digging into these fascinating books for a while now, so let's see how they stack up.

[Images: Seven Concurrency Models in Seven Weeks front cover vs. CUDA by Example front cover]

Seven Concurrency Models in Seven Weeks

I had previously enjoyed reading three other Seven in Seven Weeks books, so I figured this one was an obvious choice for a solid book on concurrency, and that hunch held true. Butcher gives an excellent tour of the current state of concurrency and parallelism in the software development world, and he does it with a compelling story that builds up from the foundations of concurrency to the modern state-of-the-art services available for Big Data processing, at least circa 2014.

The main rationale for paying more attention to concurrency and parallelism is that this is where the hardware is taking us. As Butcher argues in the introduction:
The primary driver behind this resurgence of interest is what's become known as the "multicore crisis." Moore's law continues to deliver more transistors per chip, but instead of those transistors being used to make a single CPU faster, we're seeing computers with more and more cores.
As Herb Sutter said, "The free lunch is over." You can no longer make your code run faster by simply waiting for faster hardware. These days if you need more performance, you need to exploit multiple cores, and that means exploiting parallelism.
So if we're going to take advantage of all of these multiplying cores, we'd better figure out how to handle doing multiple things at once in our programs.

Our concurrency story begins with the little things. The first week focuses on the fundamentals of concurrency: threads and locks. Each week is split into three days, each day building on the day before, with the intention that the reader can learn and experiment with the chapter's contents over a weekend. This first week on threads and locks is not meant to show the reader how to do modern parallel programming with threads, but to give a foundation of understanding for the higher-level concepts that come later. Threads are notoriously difficult to use without corrupting program state and crashing programs, and locks are a necessary evil that can help solve those corruption problems but have problems of their own, like deadlocks and livelocks. These problems are especially insidious because they're most often invisible, as Butcher warns:
To my mind, what makes multithreaded programming difficult is not that writing it is hard, but that testing it is hard. It's not the pitfalls that you can fall into; it's the fact that you don't necessarily know whether you've fallen into one of them. 
The first concurrency model gives us a view into that abyss, but then pulls back and moves on to better alternatives right away. The first better model turns out to be an old programming paradigm that has recently become more and more popular: functional programming. One of the biggest problems with programming languages like C or Java is that they have mutable state. That means most of their data structures and variables can and do change by default. Functional languages, on the other hand, default to immutable data structures that don't have the same problems when sharing state across threads.
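
To make the contrast between those first two models concrete, here is a minimal sketch in Python. It is my own illustration, not code from the book, and the function names are made up for the example. The first function shares one mutable counter across threads and needs a lock around every update; the second has each task return its own partial result, so nothing is shared and no lock is needed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_with_lock(n_threads: int, increments: int) -> int:
    """Threads-and-locks style: one shared, mutable counter guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        for _ in range(increments):
            # The lock makes the read-modify-write a single critical section.
            # Without it, updates from different threads could interleave
            # and some increments could be lost.
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_without_shared_state(n_threads: int, increments: int) -> int:
    """Functional style: each task computes its own value; results are combined at the end."""

    def worker(task_id: int) -> int:
        # No shared variable is touched here, so there is nothing to lock.
        return sum(1 for _ in range(increments))

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(worker, range(n_threads)))
```

Both versions give the same answer, but only the first one can be broken by forgetting a lock.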

The next model goes into detail about how one functional language, Clojure, uses the basic advantages of immutable state by separating identity and state. The identity of a data structure is what that data structure is inherently, like a list of names. It doesn't change. The state, which specific names are in the list, can change over time, and a persistent data structure in Clojure will guarantee that if the state changes for one thread, it will not change for other threads unless that state is explicitly passed from one thread to another. This separation of identity and state is accomplished by atoms and agents, but we don't have time to get into the specifics here. It's in the book.

After Clojure, we move on to Elixir, another functional language that takes a different approach to parallelism. Instead of threads, Elixir has extremely lightweight processes that can be used to make highly reliable applications out of unreliable components. The perspective to take when programming in Elixir is to design the application so that individual processes are not critical and can fail. Then instead of trying to do thorough error checking, we can just let them crash and depend on the system to recover and restart them. This approach makes for incredibly reliable systems, and with Elixir running on the Erlang VM, it has a solid foundation for bulletproof systems.
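
The "let it crash" idea can be loosely sketched outside of Elixir. The Python below is only an analogy under my own assumptions: `supervise` and `make_flaky_worker` are hypothetical helpers, and a plain retry loop stands in for the supervision that the Erlang VM provides for real processes. The point it shows is that the worker contains no error handling at all; recovery lives entirely in the supervisor.

```python
from typing import Callable, Tuple


def supervise(worker: Callable[[], str], max_restarts: int) -> Tuple[str, int]:
    """Run the worker, restarting it when it crashes, up to max_restarts times.

    Returns the worker's result and the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise  # give up and let the failure propagate upward
            restarts += 1


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a worker that crashes a fixed number of times, then succeeds."""
    remaining = failures_before_success

    def worker() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RuntimeError("worker crashed")
        return "done"

    return worker


result, restarts = supervise(make_flaky_worker(2), max_restarts=5)
```

Here the worker fails twice and succeeds on the third run, so the supervisor reports two restarts.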

With the next model, we come back to Clojure to explore communicating sequential processes (CSP). Instead of making the endpoints in a message the important thing, CSP concentrates on the communication channel between the endpoints. In Clojure this is implemented with Go Blocks, and it's an intriguing change to the normal way of thinking about message passing between threads or processes.
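
As a rough illustration of the channel-centric view, here is a sketch in Python that uses a bounded `queue.Queue` as a stand-in for a channel. This is not how Clojure's go blocks work internally, and `run_pipeline` and the `DONE` sentinel are names I made up for the example. Notice that the producer and consumer never refer to each other; they only know about the channel.

```python
import queue
import threading

DONE = object()  # sentinel used to signal that the channel is closed


def producer(channel: queue.Queue, items) -> None:
    for item in items:
        channel.put(item)  # blocks while the bounded channel is full
    channel.put(DONE)


def consumer(channel: queue.Queue, results: list) -> None:
    while True:
        item = channel.get()  # blocks until something is available
        if item is DONE:
            break
        results.append(item * item)


def run_pipeline(items) -> list:
    """Connect one producer and one consumer through a small bounded channel."""
    channel = queue.Queue(maxsize=2)
    results = []
    threads = [
        threading.Thread(target=producer, args=(channel, items)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a single producer and a single consumer, the first-in, first-out queue preserves order, so `run_pipeline([1, 2, 3, 4])` yields the squares in the same order the inputs were sent.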

What are we at now, the sixth model? This model steps outside of the CPU and takes a look at the other supercomputer in your PC, the massively parallel GPU. This chapter was a little too short to give a great understanding of the subject, but it does use OpenCL for some simple word-counting applications that run on the GPU. It was neat to see how it works, but it took a lot of boilerplate code that was pretty opaque to me. I'm hoping the other book in this face-off will shed much more light on how to do GPU programming.

The final model takes us into the stratosphere with serious Big Data processing using Hadoop and Storm, frameworks that enable massively parallel data processing on large compute clusters. It was surprising to see how little code was needed to get a program up and running on such an industrial strength framework. Granted, the program was a simple one, but thinking about what the framework accomplishes is pretty intense.

That brings us to the end of the tour of concurrency models. The breadth of topics covered was exceptional, and the book flowed quite nicely. Butcher's explanations were clear, and he did an excellent job covering a wide-ranging, complex topic in a concise 300 pages. If you're looking for an overview of what's out there today in the way of concurrent and parallel programming, this is definitely the book to start you on that journey.

CUDA by Example

CUDA used to be an acronym that stood for Compute Unified Device Architecture, but Nvidia, its creator, rightly decided that such a definition was silly and stopped using it. Now CUDA is just CUDA, and it refers to a programming platform used to turn your Nvidia graphics card into a massively parallel supercomputer. This book takes the reader through how to write this code using the CUDA libraries for your very own graphics card. It does a fairly decent job at this task.

The first chapter starts out with a bit of history on the graphics processing unit (GPU) and why we would need a general-purpose platform such as CUDA for doing computations on it. The short answer is that the prior situation was dire. The longer answer is as follows:
The general approach in the early days of GPU computing was extraordinarily convoluted. Because standard graphics APIs such as OpenGL and DirectX were still the only way to interact with a GPU, any attempt to perform arbitrary computations on a GPU would still be subject to the constraints of programming within a graphics API. Because of this, researchers explored general-purpose computation through graphics APIs by trying to make their problems appear to the GPU to be traditional rendering.
Suffice it to say, people were not particularly satisfied with shoehorning their algorithms into the GPU through graphics programming, so CUDA and OpenCL were a welcome development.

The next chapter goes through how to get everything ready on your computer in order to start writing and running CUDA code, and the chapter after that finally unveils the first program to run on the GPU. It's not exciting, just the standard "Hello, World!" program, but this example does introduce some of the special syntax and keywords that are used in CUDA programming.

Chapter 4 is where the real fun begins. We get to run an honest-to-goodness parallel program on the GPU. It's still simple in that it's only summing two vectors together element by element, but it's doing the calculation with each pair of elements in its own thread. Each thread gets assigned to its own resource on the GPU, so theoretically, if the GPU had at least as many compute resources as there are pairs of elements, all of the additions would happen simultaneously. It may not seem quite right to use compute resources in this way since we're so used to programming on much more serial CPUs, but the GPU hardware is designed specifically to do thousands of small calculations in parallel in a highly efficient manner. It's definitely a programming paradigm shift.
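
The shape of that program can be sketched without a GPU. The Python below is a sequential CPU emulation I wrote for illustration, not CUDA code and not the book's listing; `launch`, `add_kernel`, and `vector_add` are hypothetical names. The kernel body is what a single thread would run, and the index arithmetic mirrors the usual CUDA convention of `blockIdx.x * blockDim.x + threadIdx.x`. On real hardware the two loops in `launch` would not exist, because the threads would run in parallel.

```python
def add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    """Work done by one emulated thread: add a single pair of elements."""
    i = block_idx * block_dim + thread_idx  # this thread's global index
    if i < len(c):  # guard: the last block may have more threads than elements
        c[i] = a[i] + b[i]


def launch(kernel, n_blocks, threads_per_block, *args):
    """Emulate a kernel launch by running every thread one after another."""
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def vector_add(a, b, threads_per_block=4):
    n = len(a)
    # Round up so that every element is covered by some thread.
    n_blocks = (n + threads_per_block - 1) // threads_per_block
    c = [0] * n
    launch(add_kernel, n_blocks, threads_per_block, a, b, c)
    return c
```

For five elements and four threads per block, two blocks are launched and the three surplus threads in the second block simply fail the bounds check and do nothing.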

After another more interesting example of calculating and displaying the Julia Set, a kind of fractal set of numbers, the next chapter follows up with how to synchronize these thousands of threads in calculations that aren't completely parallel. The example here is the dot product calculation, and this example ends up getting used multiple times throughout the rest of the book. So far the examples have been unique, but they'll start to get reused from here on, partly to avoid introducing a new algorithm for each example.
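
To show why the dot product needs synchronization, here is a simplified CPU emulation in Python. It is my own sketch under stated assumptions, not the book's kernel: `block_dot` and `dot` are hypothetical names, each emulated thread handles exactly one element, and the comments mark where a real CUDA kernel would need a barrier such as `__syncthreads()`. The multiplications are fully parallel, but the summation is a tree reduction in which each step depends on the previous one having finished.

```python
def block_dot(a, b, block_idx, threads_per_block):
    """Partial dot product for one emulated block, using a tree reduction."""
    # Phase 1: every thread writes one product into the block's shared cache.
    cache = [0] * threads_per_block
    for tid in range(threads_per_block):
        i = block_idx * threads_per_block + tid
        if i < len(a):
            cache[tid] = a[i] * b[i]
    # A real kernel needs a barrier here: no thread may start summing
    # until every thread has written its product.

    # Phase 2: halve the number of active threads at each step.
    stride = threads_per_block // 2
    while stride > 0:
        for tid in range(stride):
            cache[tid] += cache[tid + stride]
        # Another barrier belongs here before the next, smaller step.
        stride //= 2
    return cache[0]


def dot(a, b, threads_per_block=4):
    if threads_per_block < 1 or threads_per_block & (threads_per_block - 1):
        raise ValueError("threads_per_block must be a power of two")
    n_blocks = (len(a) + threads_per_block - 1) // threads_per_block
    # The per-block partial sums are combined at the end.
    return sum(block_dot(a, b, blk, threads_per_block) for blk in range(n_blocks))
```

The halving trick is also why the block size has to be a power of two in this sketch; otherwise some cache entries would be skipped during the reduction.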

The next couple chapters discuss the different types of memory available in a GPU. A small amount of constant memory is there to hold values that are, well, constant, for fast access instead of needing to keep fetching those unchanging values from main memory or having them fill up the cache unnecessarily. Then there's texture memory available for optimized 2-D memory accesses, which are common in certain algorithms that operate on neighboring memory locations in two dimensions instead of the normal one dimension of vector calculations.

Chapter 8 discusses how to combine the use of the GPU as both a CUDA processor and a graphics processor without needing to copy buffers back and forth to the host memory. Actually, a lot of CUDA programming is optimized by thinking about how best to use the memory resources available. There are now at least three more memories to consider: the GPU main memory, constant memory, and texture memory, in addition to the normal system memory attached to the CPU we're used to thinking about. The options have multiplied, and it's important to use both the CPU and GPU efficiently to get the best performance.

We're nearing the end now, with chapters on using atomics to maintain memory consistency when multiple threads are accessing the same locations, using streams to more fully utilize a GPU's resources, and using multiple GPUs to their full potential, if your system is blessed with more than one GPU. By this point much of the content is starting to feel redundant, with incremental features being added to the mix and most of the examples and explanations of the code being copies of previous examples with minor tweaks for the new features.

The last chapter is a review of what was covered in the book, some recommendations of more resources to learn from, and a quick tour of the debugging tools available for CUDA. While overall this book was fairly good for learning how to do massively parallel programming with CUDA, and I certainly enjoyed coming up to speed with this exciting and powerful technology, the second half of the book especially felt drawn out and repetitive. The explanations got to be too verbose, and frankly, the cringe-worthy sense of humor couldn't carry the redundancy through. The book could have easily been half as long without losing much, although the pace was certainly easy to keep up with. I never struggled to understand anything, and that's always a plus. I've got a couple other CUDA books that may be better, but CUDA by Example is sufficient to learn the ropes in a pinch.


Of these two books, Seven Concurrency Models in Seven Weeks was the more wide-ranging and enlightening book. It gave a wonderful overview of the landscape for concurrent and parallel programming, even though it couldn't go into enough depth on any one topic to do it justice or allow the reader to competently start working in that area. Like all of the Seven in Seven books, its purpose is not to make the reader an expert, but to provide enough information to give the reader a fighting chance at making their own decision on a path. Then, the reader can follow that path further with a more specialized book. CUDA by Example is one such specialized book, although it was somewhat light on the real details of GPU programming. As an introductory book, it was adequate, but I'm hoping the next couple of books I read on GPU programming will have more substance. In any case, parallel programming is growing in importance, and it's exciting to be able to play around with it on consumer-grade hardware today.
