What would you choose if you had to cross a jungle and could only pick either a 4×4 truck or a small car? The 4×4 truck, right? But what if you had to travel hundreds of miles through a city on limited gas? When it comes to Artificial Intelligence, choosing the right hardware is like choosing the right vehicle – it all depends on the task.
Choosing the right hardware configuration depends on what you are trying to do. Are you going to do machine learning? Train deep learning models built on convolutional neural networks? Or perhaps you only need to run inference or preprocess data?
Some time ago, I was asked to find a new hardware configuration for our new server. This server was for deep learning training, inference, and data preprocessing, as well as for some basic machine learning algorithms. I said, “Yes, let’s do this. I will look for the best GPU, with the best CPU, a lot of RAM and, maybe, some NAS storage for good measure.” Little did I realize that I first needed to start with the AI tasks we had to perform – training, inference, or data preprocessing – before selecting the GPU or CPU.
Data Parallelism, Task Parallelism, or High Data Throughput? You Decide
When a deep learning model is trained, the same arithmetic and matrix operations are performed over and over on vast amounts of data. Inference requires less data than training, but the output is usually expected in real time. Data preprocessing performs more varied and complex operations on vast amounts of data. So, training calls for hardware that allows for more data parallelism; data preprocessing calls for hardware that enables task parallelism with fast data throughput; and inference calls for hardware with high data throughput. We have now narrowed down the problem and can look for hardware tailored to data parallelism, task parallelism, or high data throughput.
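To keep that mapping straight, here is a toy Python helper. It is purely illustrative – the task names and the return strings are my own shorthand for the categories above, not anything standard:

```python
# Hypothetical summary of the task-to-hardware-focus mapping described above.
def parallelism_focus(task: str) -> str:
    focus = {
        "training": "data parallelism",
        "preprocessing": "task parallelism with fast data throughput",
        "inference": "high data throughput",
    }
    return focus[task]

print(parallelism_focus("training"))  # -> data parallelism
```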
Historically, parallelism has been correlated with cores: more cores should allow more parallelism. Since you can find GPUs with thousands of cores, while even high-end CPUs top out at a few dozen, a server with multiple GPUs should do the trick. The catch is that GPU cores are limited. A GPU is like a person with much brawn but little finesse: it can handle massive amounts of data at the same time, but each core supports a narrower set of arithmetic and matrix operations and runs at a lower clock speed. A CPU core, by contrast, handles more advanced operations at higher clock rates – GPU cores typically run at around 1–2 GHz, while CPU cores reach 3–5 GHz. Think of it as a fleet of mopeds versus a handful of sports cars. So, CPUs can handle faster data throughput and more task parallelism, while GPUs give us room for more data parallelism. To sum up: focus on GPUs for training tasks, and on CPUs for data preprocessing and inference.
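The distinction can be sketched in a few lines of Python. NumPy's vectorization here is only a CPU-side stand-in for GPU-style data parallelism, and the two preprocessing functions are invented for illustration:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: the SAME operation applied to many elements at once.
# (This is what a GPU excels at; NumPy vectorization is a CPU-side analogy.)
x = np.arange(1_000_000, dtype=np.float32)
y = x * 2.0 + 1.0  # one multiply-add applied to every element

# Task parallelism: DIFFERENT, more complex operations running concurrently.
# (CPUs shine here: each core can run an entirely different task.)
def clean(strings):            # hypothetical preprocessing step
    return [s.strip().lower() for s in strings]

def total_length(strings):     # hypothetical summary step
    return sum(len(s) for s in strings)

raw = ["  Foo ", "BAR", " baz "]
with ThreadPoolExecutor() as pool:
    cleaned = pool.submit(clean, raw).result()
    total = pool.submit(total_length, raw).result()

print(cleaned, total)  # -> ['foo', 'bar', 'baz'] 14
```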
GPU Tensor Cores and CUDA Cores – Another Decision
But what about the model architecture itself? Going back to our vehicle example, let’s say you need to travel through a city with straight roads and a city with winding roads. In both cases, the small car works, right? But wouldn’t you prefer a more stable car for the winding roads, and maybe a faster car for the straight ones? Now we are adding the road type as a variable. The same goes for training in machine learning: depending on the architecture you want to use, you should focus on either GPU Tensor cores or CUDA cores. Today’s GPUs carry both types of core. The first time I heard this, I was quite confused – I thought I just had to look for the GPU with the most cores, and that’s it. That isn’t the case. You need to determine whether the ML architecture performs mostly arithmetic operations or mostly matrix operations. Before continuing, let’s clarify the difference between a Tensor core and a CUDA core. The main difference is that a CUDA core performs one floating-point operation per GPU clock cycle, while a Tensor core performs an entire 4×4 matrix multiply-accumulate per clock cycle. Why does this matter? Without diving into the technical details, let’s look at one of the most common operations in a neural network:
Z = WᵀX + B
For convolutional neural networks, this function changes to:
Z = Wᵀ ∗ X + B
The first function is a plain matrix product, an element-by-element multiply-and-add. In the second, a convolution operation is used (not to be confused with the convolution of signal processing): an element-wise multiplication and addition performed as a kernel slides over the input matrix. So, for architectures with few or no matrix multiplications, focus on CUDA cores; for architectures dominated by matrix multiplications, focus on Tensor cores. CUDA cores are typically slower than Tensor cores, but they compute at higher precision and occupy less die area. Because they are smaller, GPUs tend to carry many more CUDA cores than Tensor cores – so even though Tensor cores are faster, the extra thousands of CUDA cores in a single GPU give you room for more data parallelism. Be very aware of which arithmetic or linear algebra operations your machine learning architecture implements; that way, you can put CUDA cores and Tensor cores to appropriate use.
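Here is a minimal NumPy sketch of the two operations. The shapes and the `conv2d` helper are my own illustrative choices – real frameworks dispatch these to heavily optimized GPU kernels:

```python
import numpy as np

# Fully connected layer: Z = WᵀX + B — a plain matrix product.
W = np.random.randn(4, 3).astype(np.float32)   # weights (4 inputs, 3 units)
X = np.random.randn(4, 1).astype(np.float32)   # input column vector
B = np.random.randn(3, 1).astype(np.float32)   # bias
Z_dense = W.T @ X + B                          # shape (3, 1)

# Convolutional layer: slide the kernel over the input and take an
# element-wise multiply-and-sum at each position ("valid" padding).
def conv2d(image, kernel, bias=0.0):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return out

image = np.arange(16, dtype=np.float32).reshape(4, 4)
kernel = np.ones((2, 2), dtype=np.float32)     # 2×2 summing kernel
Z_conv = conv2d(image, kernel)                 # shape (3, 3)
```

The inner line of `conv2d` – a 2×2 element-wise multiply followed by a sum – is exactly the kind of small matrix multiply-accumulate a Tensor core finishes in a single clock cycle, which is why convolution-heavy architectures benefit from them.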
As you can see, choosing the best hardware configuration for AI is not as straightforward as it seems. It depends on many variables, including the AI tasks and the model architecture, as well as other factors such as available physical space, budget, power consumption limits, how many GPUs the motherboard can host, bandwidth, and so on. For that reason, I would dare to say there is no perfect hardware configuration for every possible AI task – but you can choose a configuration suitable for several key ones.