Friday, January 4, 2013

SMT architectures - Design issues


Simultaneous multithreading (SMT) is a technique for improving CPU performance, where one physical CPU pretends to be two. Instructions from two threads are interleaved in the CPU pipeline, which can yield a performance gain for two reasons.
The first reason is that modern superscalar processors have many execution units, and a single thread rarely uses all of them at once.
The second reason is that when a thread stalls because it needs data that is not in the cache, the CPU ends up waiting hundreds of cycles for memory.
In both of these cases, having a second thread immediately ready to run can fill these gaps and keep resources from going to waste. The important questions are: how much performance benefit can we expect, and how can we take advantage of it? These questions will only become more important as this type of technology makes its way into CPU designs from all the major players: Intel, AMD, IBM and Sun.
According to Intel, SMT (which they call "hyper-threading") can give a performance boost of up to 30%. However, the improvement is application dependent, and there are some applications where hyper-threading actually hurts performance. For example, this DivX encoding benchmark shows a noticeable loss when SMT is enabled, which demonstrates that gaining extra performance from SMT is not trivial. To help shed some light on the question, I found two interesting documents on SMT performance from Intel.
The first is about how SMT is the beginning of a shift towards explicit parallelism in all types of applications, not just servers and scientific computation. The last few generations of CPU improvements, such as pipelining and superscalar execution, exploit the implicit parallelism that exists in machine code, but these techniques are hitting their limits. Sun calls the coming paradigm shift "throughput computing". The challenge is that we need to change our tools and our mindsets to take advantage of the available multithreaded computational power.
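To make "explicit parallelism" concrete, here is a minimal sketch of what it means at the code level: the program itself divides the work between threads, rather than relying on the hardware to find parallelism in a single instruction stream. The array size and thread count are arbitrary choices for illustration.

```c
/* Explicit parallelism with POSIX threads: summing an array with two
 * worker threads instead of one. Compile with: cc -O2 sum.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 2

static double data[N];

struct range { int start, end; double sum; };

static void *partial_sum(void *arg) {
    struct range *r = arg;
    r->sum = 0.0;
    for (int i = r->start; i < r->end; i++)
        r->sum += data[i];
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    /* Divide the work explicitly between the threads. */
    for (int t = 0; t < NTHREADS; t++) {
        r[t].start = t * (N / NTHREADS);
        r[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, partial_sum, &r[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += r[t].sum;
    }
    printf("total = %f\n", total);
    return 0;
}
```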
The second document gives developers some concrete advice for improving application performance. It describes how to use "cache blocking" effectively on SMT processors. Cache blocking is when you adjust the size of the chunks of data you are processing so that they fit in cache. The article presents some interesting graphs which suggest that for a single processor, blocks should be around half to three-quarters the size of the L2 cache; when hyper-threading is enabled, this is too big, because the cache is shared between two threads. For the best SMT performance, the block size should be between a quarter and half the size of the L2 cache, and it is better to err on the small side. (A small C sketch of this idea appears at the end of the post.)

These results also suggest that the operating system needs to be very careful about how it maps threads onto CPUs. The biggest benefit should occur when the two threads share a significant chunk of their working set.

Thinking about the problems and the opportunities in this space has me itching to add multithreading to thttpd. Web serving is naturally a parallel activity, and it would be an interesting challenge to figure out how best to scale its single thread of execution to multiple threads. Maybe I need to do my Ph.D. in parallel computing.
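As promised above, here is a minimal sketch of cache blocking in C. The 512 KB L2 size, the two pass_one/pass_two kernels, and the exact quarter-of-L2 block size are all hypothetical assumptions; the point is only the shape of the loop, which processes the data in L2-sized chunks so that later passes find the block still hot in cache.

```c
/* Cache blocking sketch: process data in blocks sized to a fraction
 * of an assumed 512 KB L2 cache, so both passes over a block hit L2. */
#include <stddef.h>

#define L2_CACHE_BYTES (512 * 1024)            /* assumed L2 size */
#define BLOCK_BYTES    (L2_CACHE_BYTES / 4)    /* quarter of L2 for SMT */
#define BLOCK_ELEMS    (BLOCK_BYTES / sizeof(double))

/* Two hypothetical passes over the data; stand-ins for real work. */
static void pass_one(double *x, size_t n) {
    for (size_t i = 0; i < n; i++) x[i] = x[i] * 2.0 + 1.0;
}
static void pass_two(double *x, size_t n) {
    for (size_t i = 0; i < n; i++) x[i] = x[i] / 3.0;
}

static void process(double *data, size_t n) {
    for (size_t i = 0; i < n; i += BLOCK_ELEMS) {
        size_t chunk = (n - i < BLOCK_ELEMS) ? (n - i) : BLOCK_ELEMS;
        /* Both passes touch the block while it is still in L2,
         * instead of streaming the whole array through twice. */
        pass_one(data + i, chunk);
        pass_two(data + i, chunk);
    }
}

int main(void) {
    static double data[1 << 20];   /* 8 MB of input: many L2 blocks */
    process(data, sizeof(data) / sizeof(data[0]));
    return 0;
}
```

Note that BLOCK_BYTES is really a tuning parameter: on a non-SMT core the article's numbers suggest half to three-quarters of L2, while with hyper-threading the effective per-thread cache is halved, so the block must shrink accordingly.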
