Multiprocessing has been introduced to maximize the processing power and programming capabilities. Upon smart utilization the efficiency of hardware can be maximized which improves performance to large extents. Huge applications that are highly CPU intensive could be processed as well as web pages and their interactive applications could stay responsive while being engaged in other activities by running different threads. Developers have implemented different multithreading models to make processor work as effectively as they could. The related issues with the concept of multithreading at processor level includes thread scheduling and events that should be considered for thread switching.
This paper discusses the different implementation of the multithreading model by the hardware. A brief introduction to the processors implemented with different threading model. Advantages and disadvantages associated with these implementations. Research issues that scientist are considering for improving multithreading at hardware level. This paper doesn’t addresses the steps and research that has been conducted to resolve issues related with multithreading in multiprocessors rather it gives an overview of the entire topic.
Introduction to the concept has been given. Examples of the systems with specific multithreaded model are provided. Some Names of processors and their vendors wherever possible has been included. A diagram has been added to illustrate operating system level mapping of user threads to kernel threads.
Concept of Threading
A thread is considered as a basic unit of CPU utilization, it’s the smallest unit of processing. Threads are managed both at operating system level as well as the processor level and how it happens depends on the operating system used and the architecture of the underlying hardware. At Operating System level threads are classified as user threads and kernel threads. User threads are processes based user managed threads and needs to be mapped to kernel level threads to be given to the processor to execute. Kernel threads are managed entirely by operating system and user interference is not allowed in their activities.
To map user threads to kernel threads different approaches are utilized which includes one-one, many-one, many-many and two level mapping. To provide greater concurrency for an application one-one mapping of user threads to kernel threads is used. This model has the limitation that creation of every other kernel thread for each user thread is an expensive operation having a high overhead there for a lot of systems having this implementation restrict the number of user threads used. This model is used in operating system including Linux, family of Windows 95, 98, NT, 2000 and XP.
Many-many model maps different user threads to smaller or equal number of kernel thread. A case where such model becomes handy for a specific application is by using multiprocessor system where threads belonging to an application can be assigned to different kernel threads. Here concurrency gained is not high. User has no restriction of creating amount of threads.
Two level models follows exactly same pattern with an addition that it bounds user threads to a specific kernel thread, this model is supported in IRIX, HP-UX, and Tru64 UNIX (Silberschatz, Galvin, Gagne, p.131). Many-one mapping is used when many user level threads are mapped to one kernel thread this is managed by thread library of the system and is an efficient mechanism but the entire process will block if a single threads makes a blocking call. The thread libraries that are normally used are Pthreads (POSIX threads), Win32 Threads and Java Threads.
This figure shows how mapping between user threads and kernel threads is done. Use of thread library and multithreaded models introduce in application responsiveness allowing programmers to make them more interactive for instance a multithreaded webpage allows user to still view the page while it makes connection to the database. All the threads belonging to a process share its resources and thus achieve economy in terms of resource utilization. They are best when use with multiprocessors.
The benefit of multithreading can be greatly increased in a processor architecture, where threads may be running in parallel on different processors. A single threaded process can only run on one CPU, no matter how many are available. Multithreading on multi-CPU machine increases concurrency (p.129).
Threads in Distributed Systems
Distributed systems gained popularity with the advent of networked application in 1970’s where Ethernet introduced local area computing and Arpanet introduced wide area networks. The issue of distributed applications was a hot topic in 80’s and become even popular implementation issue in 1990’s. Unlike other system in distributed system processes communicate by message passing instead of reading and writing shared variables. The motivations for building distributed applications include resource sharing, computation speedup, reliability and communication. In distributed system incase of different sites being connected user of one site could share resources of another site, computational tasks could be divided among all the processors using the distributed application this improves performance to a large scale.
This management of job among different processors is called job sharing or load sharing. Though is possible but not commercially used, is the automated load sharing where distributed operating system automatically replaces jobs. Incase of failure of any of the site the distributed applications can switch the sites they are at. If the application is not running at huge autonomous system then a failure of a component could be considered as crucial to the system, this single failure could halt the entire system. But for autonomous system this failure never affects rest of the system. Its also a design issue that failure of the system should be detected and must be replaced at earliest because it could devoid a system from being able to use all the available sites.
In these applications messages are passed just as processes pass messages in single-computer message system. With this feature of message passing very high level functionalities of standalone systems could be introduced in distributed systems. The advantages include companies setting up their businesses by downsizing which means employing networks of workstations or personal computers. In distributed environment user has the access to resources that system provides and this access is provided by data, migration, computation migration or process migration. The distributed system is defined by the collection of processors which doesn’t share memory or the clock. Processors communicate with each other by different communication lines and each processor has its own local memory which can be high speed-buses or telephone lines.
Parallel computing is the essence of distributed computing where processors in the case of cluster computing loosely coupled processors share resources and processing tasks, here threads are shared by these processors to perform computations as quickly as possible. Threads here are called subtasks, where smaller processors call these threads fibers and massive processors call them processes, but thread generally refers to subtasks. For each subtask to execute properly should not interfere each other’s memory and execution.
If some how one causes others to produce incorrect results its referred to as race condition which has to be avoided for this purpose the concept of mutual locks have been introduced that are responsible to take care of concurrency and integrity of data. There are issues related with locking including deadlocks which have been discussed later in case of multiprocessors. These things are introducing parallel slowdown and researchers are looking for methods to introduce software level locking approach. Distributed memory has a non-uniform memory access system and for such system to improve parallel computing has to use cache coherence mechanism where cached values are stored and required for smooth program execution.
Designing efficient cache coherent mechanism is high computer architecture issue. The tasks to be divided among processors system has to use routing tables as processors are networked. Distributed system also use Massive parallel processors (MPPS) that have capabilities of clusters and are very large normally having hundred or more processors and each CPU has its own memory and copy of the operating system and application. High-speed interconnects are used for communication between each sub-system.
Multithreading at Processor level
In multiprocessors we can utilize processors capabilities to large extents by using a multithreaded implementation of operating system where load can be divided among all the underlying processors. This is done primarily to achieve performance benefits. The way multithreading is done at ordinary system its similar to distributed environment where distributed operating system manages user and kernel threads and underlying hardware implements multithreading strategies.
The simplest concept here is when a thread is made to execute on a processor it stops processing when it’s been clocked by any event possibly failure to access cache etc other thread is given a chance to execute. This scenario is called a cache miss and suffered thread has to wait quite a number of CPU cycles and instead of waiting for this operation the execution is given to another thread to efficiently manage processing power and improve performance. This concept is called Block or Cooperative or Coarse-grained multithreading.
A hardware that is multithreaded has been designed to achieve the capability to switch between threads. A blocked thread has to be replaced by a ready to run thread. The whole process involves the replication of different registers. These registers could be program visible or processor visible belonging to each thread associated of a specific processor. For the processor these register may include the program counter etc. Switching to another thread means switching the registers that are being used.
Here efficiency is achieved by using hardware having capability to switch registers in one CPU cycle. The competing threads should view the entire system as having been processing alone and have no other thread to share resources with. If managed like this, there will be very less changes required to be made to an operating system implementation for managing threads. Many microcontrollers and embedded processors have multiple registers to allow quick context switching among blocked or interrupted threads and is called blocked multithreading between user and interrupted threads.
There are a limited maximum number of threads that a block can contain. However, blocks of same dimensionality and size that execute the same kernel can be batched together into a grid of blocks, so that the total number of threads that can be launched in a single kernel invocation is much larger. The maximum number of threads per block is 512. The maximum number of blocks that can run concurrently on a multiprocessor is 8. The maximum number of warps that can run concurrently on a multiprocessor is 24. The maximum number of threads that can run concurrently on a multiprocessor is 768 (NVIDIA CUDA Programming
Guide 1.0, 2007).
For implementation purpose it is more desired to execute more than one block at a time. So by considering the given constraints of the register and memory used its been decided to run as many possible blocks as it could. The goal is to maximize multiprocessor occupancy and let say if a block has 256 threads it makes a wrap of 24 and uses the CPU effectively. But to use any of the utilization patterns, constraints of registers and shared memory should be taken care of. Every multiprocessor carries an independent instruction decoder that means different blocks can run different instruction seven every warp can execute a different instruction stream. Threads within a block can synchronize using __syncThreads (), but blocks cannot synchronize and communicate with each other (NVIDIA CUDA Programming Guide 1.0).
Another kind of processor level multithreading is the interleaved multithreading where processor switches threads every CPU cycle. The purpose here is to minimize data dependency which means one instruction depends on the preceding instruction, from the execution pipeline, execution pipeline refers to the data processing elements connected in serial fashion having the output of one element being the input of the next which are normally executed in parallel or time-sliced fashion by using buffer storage space. Instruction pipelines in processors allow multiple instructions in the same circuitry to execute in overlapping order. The circuitries is usually divided up into stages, having instruction decoding, arithmetic, and register fetching, where in each stage processes one instruction at the particular time.
In interleaved multithreading, the thread changes in each processor cycle, consecutive instructions are issued from different threads, and no data dependencies can stall the pipeline. Enhanced interleaved multithreading maintains a number of additional threads which are used to replace an active thread when it initiates a long-latency operation. Instruction issuing slots, which are lost in pure interleaved multithreading are thus used by instructions from the new thread (Zuberek, W.M., 2004).
This type of multithreading could be understood as the concept of multitasking in operating system where time slice is divided among threads for execution. It is also called time-sliced, fine-grained and pre-emptive multiprocessing. Examples of processors employing this model are Denelcor Heterogeneous Element Processor, Intel Super-threading, Sun Microsystems UltraSPARC T1, Lexra NetVortex, MIPS 34K core which implements the Multi-Threaded ASE and Raza Microelectronics Inc XLR.
There is another very advanced multithreading model called superscalar multithreading, where a processor issue multiple instructions from a single thread in each CPU cycle. There is also a slight variation called simultaneous multithreading where in each CPU cycle the superscalar multiprocessor issue instruction from different threads. Since a single thread has limited amount of instruction level parallelism this model utilizes parallelism available across multiple threads and minimizes the possible waste associated with each slot allocated for processing.
Simultaneous multithreading is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to simultaneously compete for and share processor resources. Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread ILP (Susan Eggers, 2005).
When instructions from any one of the threads is issued, its called temporal multithreading. System implementing this type of multithreading include Alpha AXP EV8 (uncompleted), Intel Hyperthreading, IBM POWER5, Power Processing Element within the Cell microprocessor and Sun Microsystems
HyperThreading Technology (HTT) is Intel’s trademark for their implementation of the simultaneous multithreading technology (SMT) on the Pentium 4 micro architecture. It is basically a more advanced form of SuperThreading that first debuted on the Intel Xeon processors and was later added to Pentium 4 processors (OSDEV, 2006).
A very important issue around is of scheduling the threads. A thread scheduler is supposed to quickly select thread for execution from among the list of ready-to-run threads along with maintain waiting threads and those it has to keep in the list of ready to run threads. Scheduler normally uses different thread prioritizing schemes to manage this process. Thread scheduler could be implemented at both hardware and software level.
For system management software has some visible state including privileged registers if a multithreading model replicate these states it could create a virtual machine for each thread and every thread could run its own operating system on single processor and if only user-level thread is replicated it provides an opportunity for more threads to stay active at one time for the same cost. Other issues that involve multithreading implementations include considering the type of events that could result in a thread switch, which includes cache misses, inter-thread communication, DMA completion, etc. DMA is the dynamic memory accessing schemes which allows hardware subsystems to access system memory independent of central processing unit’s (CPU) intervention.
There is also an issue of mutual exclusion in multiprocessors because of potentially large number of so-operating processes. This is a threat to computational efficiency by introducing more shared locks and memory issues. This problem could also lead to more power consumption by these processors.
The excessive process delays that can result when a process that holds a lock is preempted can be controlled either by making the lock unavailable to a process (that is, by freezing the lock) during the last phase of the process time slice (when the process would not have time to complete its critical section before preemption), or by allowing a lock-holding process to be preempted but then immediately rescheduling it (that is, recovering the process).
The memory contention and network traffic that results from multiple processes accessing a shared lock variable can be reduced either by using a hybrid primitive that employs both a cached and an uncached lock variable to minimize network traffic, or by using an entire set of lock variables to disperse the contention. And the problem of honoring the priorities of processes waiting for a lock (that is, maintaining fairness) can be solved by storing waiting processes in a list that is sorted in priority order (Michael J. Young).
Researchers are extensively researching the issues related to improve productivity of these processors as day by day processor intensive applications are capturing market and corporate environments. A lot of video games require huge image processing and graphics utilization which also seeks a lot of CPU capability.
Michael J. Young. Recent Developments in Mutual Exclusion for Multiprocessor
Systems Retrieved on April 2, 2008. From
NVIDIA CUDA Programming Guide 1.0. (2007). General CUDA GPU
Computing Discussion Retrieved on April 2, 2008. From
OSDEV Community Staff. (2006). HyperThreading Technology – Overview. Retrieved
On April 2, 2008. From< http://www.osdcom.info/content/view/30/39/ >
Silberschatz, Galvin, Gagne. Operating System Concepts. Multithreading Models
(p129131). Seventh Edition.
Susan Eggers. (2005). Simultaneous Multithreading Project Retrieved on April 2, 2008
From < http://www.cs.washington.edu/research/smt/>
Zuberek, W.M. (2004). Enhanced interleaved multithreaded multiprocessors and
their performance analysis, Application of Concurrency to System Design
(p.7-15). IEEE Xplore. Retrieved on April 2, 2008