Total Execution Time

Traditional microarchitectures

Shigeyuki Takano, in Thinking Machines, 2021

2.7.5 Utilization

One consideration to check is how many of the computing resources are active during the total execution time. This utilization, U, is calculated using the following equation.

(2.8) $U \simeq OPS \times T_{latency} = \frac{N_{op} \times f}{N_{cycle} \times f} = \frac{N_{op}}{N_{cycle}}$

Thus, the utilization is an average rate indicating how many operations are conducted in a single cycle. In general, $N_{op} < N_{cycle}$. An extremely easy approach to enhancing the utilization is to fuse multiple stages of the pipeline, which decreases $N_{cycle}$.

In addition, the utilization can be estimated using $N_{th}$ and OPS as follows.

(2.9) $U \simeq N_{th} \times OPS$

The throughput $N_{th}$ does not include the property of the problem (input) size, and thus the utilization does not reflect the problem scale. Simply observing the utilization therefore risks overlooking this aspect; the utilization is effective only as part of an evidence-based approach.

In addition, the utilization can be used to estimate the effective data-level parallelism $N_{op}$ as follows.

(2.10) $N_{op} = 2 N_{mac} \times U$

Thus, to achieve a higher utilization, the number of execution cycles should be close to the number of operation units on the chip, as follows.

(2.11) $N_{cycle} \simeq 2 N_{mac}$

URL: https://www.sciencedirect.com/science/article/pii/B9780128182796000128

Symmetric Multiprocessor Architecture

Thomas Sterling, ... Maciej Brodowicz, in High Performance Computing, 2018

6.5.3 Memory System Performance

It is clear that the time to access a value from a specified variable in the memory system will vary dramatically depending on a number of factors, most specifically where the closest copy of the value is in the memory hierarchy. While analyzing such a complex memory architecture can be very complicated due to the number of levels, the overheads involved, issues of contention, and so forth, a simplified version of the problem still exposes the principal tradeoffs and shows how dramatically the average memory access time can change depending on the hit rates to cache as a consequence of locality. For this purpose, the cache is assumed to be a single intermediate layer between processor core registers and main memory. Without a detailed queuing analysis or similar in-depth model, operational metrics are adopted to capture the specific properties of the architecture and application memory access profile, plus a quality metric of performance. An analytical model is derived to show the sensitivity between delivered performance and the effectiveness of caching.

The quality metric of choice in this case is CPI or cycles per instruction. Time to solution, T, is proportional to cycle time, T cycle , and the number of instructions to be executed for a user task, I count . Because the purpose of this analysis is to expose the implications of memory behavior, the instruction count is partitioned between those instructions associated with the number of register-to-register ALU instructions, I ALU , and the number of memory access instructions, I MEM . For each of these two classes of instructions there is a separate measure of cycles per instruction, one for the register-to-register ALU operation, CPI ALU , and one for the memory instructions, CPI MEM . The total value for time, I count , and CPI can be derived from the breakdown between ALU and memory operations according to Eqs. (6.10)–(6.12).

(6.10) $T = I_{count} \times CPI \times T_{cycle}$

(6.11) $I_{count} = I_{ALU} + I_{MEM}$

(6.12) $CPI = \frac{I_{ALU}}{I_{count}} CPI_{ALU} + \frac{I_{MEM}}{I_{count}} CPI_{MEM}$

The full set of parameters is defined as:

T, total execution time; T cycle, time for a single processor cycle; I count, total number of instructions; I ALU, number of ALU instructions (e.g., register to register); I MEM, number of memory access instructions (e.g., load, store); CPI, average cycles per instruction; CPI ALU, average cycles per ALU instruction; CPI MEM, average cycles per memory instruction; r miss, cache miss rate; r hit, cache hit rate; CPI MEM-MISS, cycles per cache miss; CPI MEM-HIT, cycles per cache hit; M ALU, instruction mix for ALU instructions; and M MEM, instruction mix for memory access instructions.

The idea of an instruction mix simplifies representation of this distinction between ALU and memory operations, providing ratios of each with respect to the total instruction count.

In addition, the parameter that expresses the effect of data reuse is defined as the hit rate, r hit, which establishes the percentage of time that a memory request is found in the cache. The complement of this parameter is also useful: r miss = (1 − r hit). One last distinction is made for CPI MEM depending on whether a hit or a miss occurred. These represent the costs, measured in cycles, of a memory instruction access, depending on whether there was a hit or a miss at the cache. CPI MEM-HIT is the fixed number of cycles required for an access that is served by the cache, and CPI MEM-MISS is the cost in cycles of going all the way to main memory to get a memory request serviced in the case of a cache miss. The relationships among these distinguishing parameters are demonstrated in Eqs. (6.13)–(6.17), associating them with the definition of full execution time.

Instruction mix:

(6.13) $M_{ALU} = \frac{I_{ALU}}{I_{count}}$

(6.14) $M_{MEM} = \frac{I_{MEM}}{I_{count}}$

(6.15) $M_{ALU} + M_{MEM} = 1$

Time to solution:

(6.16) $CPI = (M_{ALU} \times CPI_{ALU}) + (M_{MEM} \times CPI_{MEM})$

(6.17) $T = I_{count} \left[ (M_{ALU} \times CPI_{ALU}) + (M_{MEM} \times CPI_{MEM}) \right] T_{cycle}$

Finally, the values for CPI MEM and T as functions of r miss are presented in Eqs. (6.18) and (6.19). It may appear peculiar that the coefficient of CPI MEM-HIT is not r hit . This is because the cost of getting data from or to the cache occurs whether or not a miss occurs.

(6.18) $CPI_{MEM} = CPI_{\text{MEM-HIT}} + r_{miss} \times CPI_{\text{MEM-MISS}}$

(6.19) $T = I_{count} \left[ (M_{ALU} \times CPI_{ALU}) + M_{MEM} (CPI_{\text{MEM-HIT}} + r_{miss} \times CPI_{\text{MEM-MISS}}) \right] T_{cycle}$

This shows the effect of the application-driven properties, including I count , M MEM , and r miss . Architecture-driven properties are reflected as T cycle , CPI MEM-MISS , and CPI MEM-HIT in determining the final time to solution, T.

Example

As a case study, a system and computation are described in terms of the set of parameters presented above. Typical values are assigned to these to represent conventional practices, architectures, and applications. These are shown below.

I count = 1E11
I MEM = 2E10
CPI ALU = 1
T cycle = 0.5 ns
CPI MEM-MISS = 100
CPI MEM-HIT = 1

The intermediate values for instruction mix are computed as follows:

$I_{ALU} = I_{count} - I_{MEM} = 8\mathrm{E}10$

$M_{ALU} = \frac{I_{ALU}}{I_{count}} = \frac{8\mathrm{E}10}{1\mathrm{E}11} = 0.8$

$M_{MEM} = \frac{I_{MEM}}{I_{count}} = \frac{2\mathrm{E}10}{1\mathrm{E}11} = 0.2$

This example shows the impact of the cache hit rate on the total execution time, which can prove to be one of the most important factors determining application time to solution, and one the user has to be aware of when considering data layout. Two alternative computations are considered. The first is favorable to a cache hierarchy (simplified here to a single layer) with a hit rate of 90%. With this value established, the time to solution can be determined as shown in Eqs. (6.20)–(6.22).

(6.20) $r_{\text{hit-A}} = 0.9$

(6.21) $CPI_{\text{MEM-A}} = CPI_{\text{MEM-HIT}} + r_{\text{miss-A}} \times CPI_{\text{MEM-MISS}} = 1 + (1 - 0.9) \times 100 = 11$

(6.22) $T_A = 1\mathrm{E}11 \times [(0.8 \times 1) + (0.2 \times 11)] \times 5\mathrm{E}{-10}\ \mathrm{s} = 150\ \mathrm{s}$

But if the cache hit rate is lower, in this case 50%, a recalculation with this new value shows a dramatic reduction of performance, as shown in Eqs. (6.23)–(6.25).

(6.23) $r_{\text{hit-B}} = 0.5$

(6.24) $CPI_{\text{MEM-B}} = CPI_{\text{MEM-HIT}} + r_{\text{miss-B}} \times CPI_{\text{MEM-MISS}} = 1 + (1 - 0.5) \times 100 = 51$

(6.25) $T_B = 1\mathrm{E}11 \times [(0.8 \times 1) + (0.2 \times 51)] \times 5\mathrm{E}{-10}\ \mathrm{s} = 550\ \mathrm{s}$

The difference is a performance degradation of more than 3×, caused solely by the change in the cache hit rate.
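The arithmetic above is easy to reproduce; the short Python sketch below implements Eqs. (6.10)–(6.19) with the example parameters (the function and variable names are ours, not the chapter's).

```python
# Memory-system CPI model of Eqs. (6.10)-(6.19), using the example parameters above.
I_COUNT = 1e11        # total instructions
I_MEM = 2e10          # memory access instructions
T_CYCLE = 0.5e-9      # 0.5 ns cycle time
CPI_ALU = 1
CPI_MEM_HIT = 1
CPI_MEM_MISS = 100

M_MEM = I_MEM / I_COUNT          # Eq. (6.14): memory instruction mix
M_ALU = 1.0 - M_MEM              # Eq. (6.15)

def time_to_solution(r_hit: float) -> float:
    """Eq. (6.19): total execution time as a function of the cache hit rate."""
    r_miss = 1.0 - r_hit
    cpi_mem = CPI_MEM_HIT + r_miss * CPI_MEM_MISS          # Eq. (6.18)
    cpi = M_ALU * CPI_ALU + M_MEM * cpi_mem                # Eq. (6.16)
    return I_COUNT * cpi * T_CYCLE                         # Eq. (6.10)

print(time_to_solution(0.9))   # ~150 s, matching Eq. (6.22)
print(time_to_solution(0.5))   # ~550 s, matching Eq. (6.25)
```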

URL: https://www.sciencedirect.com/science/article/pii/B978012420158300006X

Parallel Computing

A. Cortés, ... E. Luque, in Advances in Parallel Computing, 1998

1 INTRODUCTION

This paper focuses on dynamic load balancing strategies designed to minimize the total execution time of a single application running in parallel on a scalable parallel system. We present a local strategy, DASUD (Diffusion Algorithm Searching Unbalanced Domains), and compare it to a diffusion scheme known as SID (Sender Initiated Diffusion) [4]. Both strategies make use of near-neighbor load information to apportion surplus load from heavily loaded processors to underloaded neighbors in the system. In contrast to Diffusion and Dimension-Exchange algorithms [5], the SID and DASUD strategies consider that the portion of extra load to move is not a fixed value but depends on the neighborhood's state. Therefore, load thrashing between processors is reduced and the convergence rate is improved. The SID strategy has the disadvantage that it may produce solutions in which, although the processors are locally balanced, the system proves to be globally unbalanced. The DASUD strategy, in contrast, uses overlapped domains (where every domain includes all immediate neighbors of the underlying processor) to achieve global balancing.
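To make the near-neighbor idea concrete, here is a minimal sketch of one generic sender-initiated diffusion step, under the simplifying assumptions of integer unit tasks and a single balancing round; it is not the exact SID or DASUD algorithm, which the paper defines more precisely.

```python
# Generic sender-initiated diffusion step (illustrative only, not the paper's SID/DASUD).
def diffusion_step(my_load: int, neighbor_loads: list[int]) -> list[int]:
    """Return how many load units to send to each neighbor in this round."""
    domain = [my_load] + neighbor_loads
    avg = sum(domain) / len(domain)
    surplus = my_load - avg
    if surplus <= 0:
        return [0] * len(neighbor_loads)          # not overloaded: send nothing
    deficits = [max(avg - load, 0) for load in neighbor_loads]
    total_deficit = sum(deficits)
    if total_deficit == 0:
        return [0] * len(neighbor_loads)
    # Apportion the surplus in proportion to each underloaded neighbor's deficit.
    return [int(surplus * d / total_deficit) for d in deficits]

print(diffusion_step(10, [2, 4, 8]))   # [2, 1, 0]
```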

For the objective of this paper, we assume a single-application characterization in which the problem to be executed is partitioned into a fixed number of tasks. All tasks are independent and may be executed on any processor in any sequence. We assume that all the tasks have roughly the same execution time, and that the load balancing phases attempt to equalize only the number of tasks at each processor. Examples include back-track searches, branch-and-bound optimization, adaptive refinement techniques for PDEs, and ray tracing [5].

URL: https://www.sciencedirect.com/science/article/pii/S0927545298800981

Processes and Operating Systems

Marilyn Wolf, in High-Performance Embedded Computing (Second Edition), 2014

4.2 Real-time process scheduling

Let us begin with definitions of some common terms. We then review traditional real-time scheduling algorithms in Section 4.2.2. Next, we consider extensions to traditional schedulers for new criteria in Section 4.2.3 and spend Section 4.2.4 concentrating on real-time scheduling for dynamic voltage scaling. We conclude with a discussion of performance estimation for multiple-process systems in Section 4.2.5.

4.2.1 Preliminaries

This section surveys algorithms for scheduling processes on real-time systems. Scheduling has been studied in many contexts. Some general scheduling techniques apply to real-time systems; other techniques are designed for the specific characteristics of real-time computer systems.

Processes and scheduling

One of the most fundamental abstractions in computing is the process. A process is a unique execution of a program—it includes the program and all of its state. Operating systems allow several processes to run concurrently on a single CPU by interleaving the processes' executions, using a technique known as context switching. The time between successive operating system preemptions for scheduling is known as the time quantum. The sequence of times at which processes execute on the CPU is a schedule.

Processes vs. tasks

Several different words are used in the literature for similar concepts, and sometimes the same word is used in different ways. In scheduling, the terms thread, process, and task are all used in various ways. We use the term thread to mean a lightweight process that shares an address space with other threads, and the term process as a generic term for any execution of a program. We use task to mean a collection of processes that must execute together. Tasks are often related by data dependencies. The word task is sometimes used to refer to what we call a process, but we find it useful to distinguish between a single program and a collection of programs. The processes that make up a task may be called subtasks.

Static vs. dynamic

Scheduling algorithms can be divided into two general categories. Static scheduling algorithms determine the schedule offline, before the system begins to operate. Dynamic scheduling algorithms build the schedule on the fly during execution. Many scheduling problems are NP-complete. As a result, we must use heuristics to solve them.

Constructive vs. iterative improvement

Static scheduling algorithms are widely used in both hardware and software design. The major types of static schedulers are constructive and iterative improvement. Constructive schedulers use rules to select the next task in the schedule. Iterative improvement schedulers, in contrast, revisit their decisions to change the schedule.

Priority schedulers

Dynamic schedulers in real-time systems are generally priority schedulers. These schedulers assign priorities (integer or real values), and then use those priorities to determine which process to run next.

Real-time vs. general-purpose

Real-time scheduling algorithms are very different from scheduling policies for general-purpose operating systems. General-purpose systems are generally concerned with fairness—they do not want to starve any process of computation time, while still allowing for some processes to execute more frequently than others. Real-time scheduling, in contrast, is concerned with deadlines. The penalties for missing a deadline may vary, but all real-time scheduling algorithms are somehow aimed at satisfying deadlines or throughput requirements.

Hard vs. soft

The literature often distinguishes between hard and soft real-time scheduling. Some people use the term hard real-time to mean only safety-critical computations, but we prefer to use it for any computation that fails when a deadline is missed. A laser printer, for example, may print a bad page if it misses a deadline, though this does not (usually) cause people to die. We use soft real-time to mean any system that prefers to meet its deadlines but does not fail catastrophically if it misses one. A user interface on a digital television is an example of a system that meets soft real-time deadlines.

Deadline definitions

We need to establish a few terms to specify deadlines and describe process behavior. As shown in Figure 4.1, the deadline is the time by which all computation must finish. In many cases, the process may start executing at the end of the previous deadline, but in some cases, we may want to define a release time after the last deadline. The period T i is the interval between successive deadlines. The deadline and start time are both specifications of desired behavior. The relative deadline is the difference between the process's release time and its deadline.

FIGURE 4.1. Deadline-related terminology.

We also need to describe the actual execution of the process. The initiation time is the time when the process actually starts executing, while the completion time is the time when it finishes. The response time of a process is the time from its release to the time it completes. As illustrated in Figure 4.1, the process may not execute continuously between its initiation and its completion. The execution time or CPU time, which we call C i , is the total amount of time that the process executes; that time is generally independent of the initiation time but often depends on the input data.

We often define deadlines for periodic processes, but we may also want to define a deadline for an aperiodic process. Scheduling such a process must take into account the state of the system when the process executes, but the basic definition of a deadline does not change much.

Process specifications

To define the real-time requirements of a process, we specify its period (and perhaps its start time). When we build tasks out of several processes, we define the deadline for the entire task. Figure 4.2 shows a task with three processes and data dependencies between the processes. The period of the task includes the execution of all the processes in the task. A system may include several tasks, each running at its own rate.

FIGURE 4.2. A task with several processes.

Utilization

We can observe some basic properties of systems of processes and schedules. Clearly, the total execution time of all the processes must be less than the total available time. Given a set of processes 1 . . . n, the total execution time required to complete all the processes is

(EQ 4.1) $C = \sum_{1 \le i \le n} C_i$

If the total time available to execute the processes is t, then the utilization or maximum utilization of the CPU is

(EQ 4.2) $U = \frac{C}{t}$

Utilization is often expressed as a percentage. Clearly, the CPU's maximum utilization cannot exceed 100%.

4.2.2 Real-time scheduling algorithms

Static scheduling algorithms

Static scheduling algorithms have been studied in many different contexts ranging from shop floor scheduling to hardware/software co-design. This section looks at a few static scheduling techniques that can be useful in software design. There is a rather soft line between static scheduling and code synthesis. The last chapter discussed code synthesis techniques that could be viewed as static scheduling algorithms.

Data dependencies and scheduling

Static schedulers often look at data dependencies between processes. Data dependencies make the scheduling problem much more difficult. Without data dependencies, there is little need to use a static scheduler. Data dependencies are represented by directed edges in a graph whose nodes are the processes. Unweighted directed edges simply specify order. Weighted directed edges specify the minimum time required between the completion of the source node process and the initiation of the sink node process.

We can compute some simple bounds on sets of processes with data dependencies that can be used to guide scheduling algorithms. The as-soon-as-possible (ASAP) schedule for a set of processes is constructed by putting each process at its earliest possible time as limited by the data dependencies. In the as-late-as-possible (ALAP) schedule, each process takes place at its latest possible time. If a process is at the same position in both the ASAP and ALAP schedules, then it is a critical process in the schedule; the set of critical nodes and edges from the source to the sink of the graph is the critical path.

Resource dependencies

In contrast to data dependencies, resource dependencies come from the platform architecture. Two processes that need the same memory location, for example, require the same resource. Resource dependencies can be represented by undirected edges in the process graph. Two processes that need the same resource in general do not have to execute in any particular order, but they cannot execute at the same time. If the processes must access the resource in a particular order, then a data dependency can be used to express that fact.

Implementation

We can implement static schedulers in several ways. We can interleave code from several programs to create a unified program. We can also build a state machine that uses subroutines or co-routines to call the processes. Neither of these approaches uses timers, which reduces overhead but also limits the accuracy of program timing. We can also set a timer to the next time a process must run, with the timer under the control of a state-machine-like program.

List scheduler

A common form of constructive scheduler is the list scheduler. As the name implies, this algorithm forms a list of processes to be scheduled, then takes processes at the head of the list to form the schedule. The scheduling heuristic is embodied in the algorithm that determines where in the list a process is placed.

Figure 4.3 shows an example problem for list scheduling. Given two different heuristics for building the lists, we obtain two different schedules. The first heuristic adds processes to the list in order of CPU time; the process with the smallest CPU time goes first. Note that in this case, two schedules are possible since 2 and 4 take the same amount of CPU time. The second heuristic adds processes in order of their periods, with the shortest period going first. The process orders in these two cases are very different. Which heuristic is best depends on the circumstances and goals of the system.

FIGURE 4.3. List scheduling.
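The two heuristics can be sketched in a few lines of Python; the process set below is hypothetical (the values in Figure 4.3 are not reproduced here), with processes P2 and P4 deliberately given equal CPU times to mirror the tie noted above.

```python
# Hypothetical process set: (name, CPU time, period). Not the values from Figure 4.3.
processes = [("P1", 3, 20), ("P2", 1, 10), ("P3", 4, 5), ("P4", 1, 40)]

# Heuristic 1: smallest CPU time first.
by_cpu_time = sorted(processes, key=lambda p: p[1])
# Heuristic 2: shortest period first.
by_period = sorted(processes, key=lambda p: p[2])

print([p[0] for p in by_cpu_time])  # ['P2', 'P4', 'P1', 'P3'] (P2/P4 tie broken by list order)
print([p[0] for p in by_period])    # ['P3', 'P2', 'P1', 'P4']
```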

Interval scheduling

Chou and Borriello developed an interval-scheduling model [Cho95a] and algorithm to statically schedule deadline-driven operations. They represent the processes and their timing constraints as a weighted directed graph. Each node in the graph represents a process; a node is weighted with an [l, u] pair that gives the lower and upper bounds on the execution time of the node. Each directed edge is weighted with a minimum delay required from the source to the sink process. Each node has an anchor node that serves as the source node.

The interval-scheduling algorithm is a constructive algorithm that can be formulated as a recursive procedure. The algorithm searches each subgraph in topological order of the forward edges in the graph. At each step, it selects a node to be scheduled such that the partial schedule will be valid; this is ensured by requiring that both the min-run and the max-run are valid. The algorithm is outlined in Figure 4.4 [Cho95a].

FIGURE 4.4. The interval-scheduling algorithm. From Chou and Borriello [Cho95a] ©1995 ACM press.

Heuristics can be used to select algorithms that satisfy secondary optimization criteria as well as provide correct schedules. Choosing the candidate node with the smallest difference between lower- and upper-bound execution times helps to maximize available slack. Choosing the candidate with the smallest required separation from its parent requires less idle time in the schedule. Choosing the node with the highest indegree helps reduce starvation.

Dynamic and priority-driven scheduling

Dynamic scheduling algorithms for periodic sets of processes are often cast as priority-driven scheduling. A priority-driven scheduler assigns priorities to each process. It then takes the process with the highest priority as the next process to run.

Static vs. dynamic priorities

A priority-driven scheduler may use either static or dynamic priorities. (Note that priority-driven scheduling is dynamic in that the next process to execute is determined at runtime, but the scheduler may use statically assigned priorities to determine scheduling.) In a static priority system, the priority of a process does not change during execution. A dynamic priority system, in contrast, changes the priorities of processes on the fly.

Liu and Layland

In their classic paper, Liu and Layland [Liu73] analyzed examples of both static and dynamic priority scheduling algorithms. Their static priority algorithm was called rate-monotonic scheduling (RMS) or rate-monotonic analysis (RMA). Their dynamic priority algorithm was known as earliest deadline first (EDF). Their analysis made some common assumptions:

There are no data dependencies between processes.

Process periods may have arbitrary relationships.

Context-switching overhead is negligible.

The release time of each process is at the start of its period.

Process execution time (C) is fixed.

In our discussion, we will assume that lower-numbered processes have higher priority, with process 1 having the highest priority. (Readers of Liu and Layland's article should note that they use the term task for what we refer to as process.) The assumptions underlying RMS and EDF mean that a system scheduled using these policies can be scheduled by a static scheduler that implements RMS or EDF. If the system allows dynamically created tasks, we can also use RMS or EDF to check whether the task is admissible and determine how to schedule it.

Because the deadlines of the processes do not have to be related in any way, many different combinations of deadlines are possible. It is not possible to enumerate all possible deadline sets to evaluate schedulability. Liu and Layland's analysis of rate-monotonic scheduling centered on the critical instant. As illustrated in Figure 4.5, the critical instant is the worst-case combination of process executions that will cause the longest delay for the initiation time of a process. Liu and Layland showed that the critical instant for process i occurs when all higher-priority processes are ready to execute—that is, when the deadlines of all higher-priority processes have just expired and new periods have begun. In Figure 4.5 the critical instant for process 4 occurs when processes 1, 2, and 3 become ready; the first three processes must run to completion before process 4 can start executing.

FIGURE 4.5. The critical instant in rate-monotonic scheduling.

RMS priority assignment

Liu and Layland used critical instant analysis to show that process priorities should be assigned in order of period, with the shortest-period process receiving the highest priority. This priority scheme is the motivation for the term "rate-monotonic." Their analysis used two processes, 1 and 2, with process 1 having the shorter period. If process 2 is given the higher priority and the resulting system has a feasible schedule, then it must be true that

(EQ 4.3) $C_1 + C_2 \le T_1$

If process 1 is given the higher priority, the condition for a feasible schedule is

(EQ 4.4) $\left\lfloor \frac{T_2}{T_1} \right\rfloor C_1 + C_2 \le T_2$

Because EQ. (4.4) is satisfied whenever EQ. (4.3) is satisfied, giving process 1 the higher priority produces a feasible schedule whenever giving process 2 the higher priority does. However, the reverse is not true—in some cases, processes 1 and 2 are feasible when process 1 has the higher priority but not if process 2 is given the higher priority. Liu and Layland showed that RMS produces a feasible schedule for any set of processes that has a feasible static priority assignment.

RMS utilization

Liu and Layland defined process utilization as the sum of the utilizations of the component processes:

(EQ 4.5) $U = \sum_{1 \le i \le m} \frac{C_i}{T_i}$

They showed that the least upper bound on utilization for a set of m tasks scheduled by RMS is

(EQ 4.6) $U = m \left( 2^{1/m} - 1 \right)$

If the periods of the processes are simply related, there may in fact be a schedule that allows the utilization to reach 100%. However, in a wide range of circumstances, this least upper bound approaches ln 2 ≈ 0.69 for large m. This result means that we cannot always use 100% of the CPU if we want to guarantee that all processes will meet their deadlines.
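A few lines of Python (not from the text) show how the least upper bound of EQ. (4.6) falls toward ln 2 as m grows:

```python
import math

# Least upper bound on RMS utilization, EQ. (4.6): U = m * (2**(1/m) - 1).
for m in (1, 2, 4, 8, 16, 64):
    print(m, round(m * (2 ** (1 / m) - 1), 4))
# m=1 -> 1.0, m=2 -> 0.8284, ..., approaching ln 2
print(round(math.log(2), 4))   # 0.6931
```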

Earliest deadline first

Liu and Layland also studied earliest deadline first scheduling, which they called deadline-driven scheduling. Priorities are assigned to processes based on the time remaining until the process's deadline—the highest-priority process is the one that is closest to reaching its deadline. These priorities are updated at each potential context switch. Liu and Layland showed that if a set of processes can be scheduled by any algorithm, then it can be scheduled by EDF. However, they also showed that in some cases, it is possible for EDF to overflow and not meet the deadlines of all processes.

Liu [Liu00] gave a feasibility condition for the schedulability of a system of processes using EDF. Given a set of n processes, let D i be the relative deadline of process i. Then the process set must satisfy this relation:

(EQ 4.7) $\sum_{1 \le i \le n} \frac{C_i}{\min(D_i, T_i)} \le 1$
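The feasibility condition of EQ. (4.7) is easy to check directly; the sketch below uses hypothetical task parameters (C, T, D):

```python
# EDF feasibility per EQ. (4.7): the sum of C_i / min(D_i, T_i) must not exceed 1.
# Each task is (C_i, T_i, D_i); the values below are hypothetical.
tasks = [(1, 4, 4), (2, 6, 5), (1, 10, 8)]

demand = sum(c / min(d, t) for (c, t, d) in tasks)
print(f"demand = {demand:.3f}, feasible = {demand <= 1.0}")
# 1/4 + 2/5 + 1/8 = 0.775 -> feasible
```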

Albers and Slomka [Alb05] developed an efficient feasibility test for EDF-scheduled systems.

Least laxity first scheduling

A variation of EDF is least laxity first (LLF) scheduling. Laxity or slack is the difference between the remaining computation time required to finish the process and the time remaining until the deadline. LLF scheduling assigns the highest priority to the process with the smallest laxity or slack value. LLF differs from EDF in that it takes into account the remaining computation time. EDF gives high priority to a process even if it needs only a fraction of its remaining period to finish its work; LLF, in contrast, gives priority to the process that will have the hardest time finishing by its deadline.

Response time

Joseph and Pandya [Jos86] described the response time of a task in a fixed-priority real-time system—the time between the initiation of a request and its completion. Let T i and C i be the period and computation time of task i. Then the response time R i of task i is

(EQ 4.8) $R_i = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j$

where $hp(i)$ is the set of tasks with priority higher than that of i. Note that because the response time appears on both the left-hand and right-hand sides of the equality, it must be solved iteratively.
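EQ. (4.8) is typically solved by fixed-point iteration starting from R_i = C_i; the sketch below uses a hypothetical task set and, as a simplifying assumption, treats non-convergence within the period as unschedulable.

```python
import math

# Fixed-point iteration for EQ. (4.8). Tasks are (C, T), listed from highest to
# lowest priority; the values are hypothetical.
tasks = [(1, 4), (2, 6), (3, 12)]

def response_time(i: int):
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        # Interference from all higher-priority tasks hp(i).
        r_next = c_i + sum(math.ceil(r / t_j) * c_j for (c_j, t_j) in tasks[:i])
        if r_next == r:
            return r
        if r_next > t_i:        # does not converge within the period (assumed deadline)
            return math.inf
        r = r_next

for i in range(len(tasks)):
    print(f"task {i}: R = {response_time(i)}")
# task 0: R = 1, task 1: R = 3, task 2: R = 10
```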

Priority inversion

The analysis of RMA and EDF assumed that a process could always be preempted, but that is not always true in practical systems. If a process includes a critical section, then it cannot be preempted while it is in the critical section. A critical section is often used to protect the process's access to a shared resource. The critical section causes higher-priority processes to be excluded from executing. This phenomenon is known as priority inversion.

An example of priority inversion in a schedule is shown in Figure 4.6. Process 1 is the highest-priority process; it can preempt process 3 during normal operation. But when process 3 enters a critical section that it shares with process 1, then process 1 cannot preempt 3 until 3 finishes with the critical section. However, process 2, which does not share that critical section, can preempt process 3 as many times as it wants since it has higher priority. Process 2's preemptions can delay process 3, and therefore process 1, for an arbitrarily long time, even though process 1 has priority over both processes 2 and 3.

FIGURE 4.6. An example of priority inversion.

Priority inheritance protocols

Sha et al. [Sha90] introduced the priority inheritance protocol to avoid priority inversion. They credit Lampson and Redell with first recognizing priority inversion in the Mesa operating system [Lam80]. Sha et al. recognized that temporarily changing priorities could help reduce the effects of priority inversion.

Their basic priority inheritance protocol causes a process that is inside a critical section to execute at the highest priority of any process that shares that critical section. In the example of Figure 4.6, this means that process 2 cannot interrupt process 3 while 3 is in its critical section. As a result, process 3 finishes more quickly and allows process 1 to run. However, Sha et al. noted that if a process needs several critical sections, it can be blocked sequentially at each of those critical sections by different processes, causing a blocking chain that slows down the higher-priority process. The basic protocol is also vulnerable to deadlocks.

Priority ceiling protocol

Sha et al. developed the priority ceiling protocol to overcome the limitations of the basic protocol. Consider a process i that we want to schedule. In this protocol, each semaphore S j guarding a critical section is assigned its own priority ceiling Π(S j); the priority ceiling of a semaphore is equal to the priority of the highest-priority process that may use the semaphore. Given the set of semaphores that are currently locked by processes other than i, we call the semaphore in that set with the highest priority ceiling S∗. For process i to obtain semaphore S j, it must have a priority higher than Π(S∗). While in a critical section, process i inherits the priority of the highest-priority job that may be blocked by process i; priorities are inherited transitively.

Sha et al. extended Liu and Layland's analysis of RMS (namely, Eq. 4.5) to determine the schedulability of tasks under the priority ceiling protocol. If B i is the worst-case time that process i can be blocked by lower-priority processes, then

(EQ 4.9) $\frac{C_1}{T_1} + \frac{C_2}{T_2} + \cdots + \frac{C_i}{T_i} + \frac{B_i}{T_i} \le i \left( 2^{1/i} - 1 \right)$
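EQ. (4.9) can be checked in the same spirit as the Liu and Layland bound; the C, T, and blocking values B used below are hypothetical.

```python
# Schedulability check under the priority ceiling protocol, EQ. (4.9).
# Tasks are (C, T, B), ordered from highest to lowest priority; values are hypothetical.
tasks = [(1, 5, 1), (2, 10, 2), (3, 30, 0)]

for i in range(1, len(tasks) + 1):
    util = sum(c / t for (c, t, _) in tasks[:i])       # C_1/T_1 + ... + C_i/T_i
    blocking = tasks[i - 1][2] / tasks[i - 1][1]       # B_i / T_i
    bound = i * (2 ** (1 / i) - 1)
    print(f"i={i}: {util + blocking:.3f} <= {bound:.3f}? {util + blocking <= bound}")
```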

Hot swapping

Some systems may use several implementations of a process or change the process parameters to cause a large change in the process's execution time. For example, a communication system may change error-correction codes on the fly to adapt to changing channel characteristics. When the process execution time changes substantially, we must schedule the system to handle not only the steady-state condition of the new system but also the transient effects caused by the switch. Lee et al. [Lee02a] developed a hot-swapping algorithm that combines offline analysis and online scheduling. In the hot-swapping model, switching a process from one implementation to another invokes an additional computation time that corresponds to the setup time for the new implementation. At design time, they analyze the possible combinations of implementation swaps and build a table to be used by the operating system. At runtime, the operating system uses the table to determine how to schedule the transition to the new implementation of the task.

4.2.3 Multi-criticality scheduling

Vestal proposed a multi-criticality model for scheduling [Ves07]. He noted that in complex systems, some software must operate at high confidence levels while other, less critical software may not require the same level of scheduling confidence. The accuracy of a timing estimate for a high-confidence task is most important, but actual timing performance may vary. When actual task timing varies from expected timing, the scheduler should take into account the criticality of the task in adjusting the schedule.

The literature generally assumes two levels of criticality: low and high. Multi-criticality scheduling is particularly important for systems such as aircraft that require design certification—a scheduling algorithm that gives preference to high-criticality software allows certification to concentrate timing verification on the high-criticality components.

Mixed-criticality schedulability

Vestal proposed a version of preemptive fixed priority (PFP) schedulability analysis for mixed-criticality systems. When a high-criticality task exceeds its execution time, its period and execution time are reduced by a factor of n such that the period is at or below that of all lower-criticality tasks.

Load-based schedulability

Li and Baruah [Bar10, Li10] proposed a load-based algorithm for mixed-criticality scheduling. They recursively construct the priority order of tasks, identifying the lowest-priority job at each step. Each job has two worst-case execution time estimates: C i is the less pessimistic WCET estimate and C i ' is the more pessimistic estimate that may be demanded by certification authorities. At each step, a job is designated as lowest priority in one of two cases. For a low-criticality job, consider the interval between its release time and deadline: if the job still has at least C i execution time available in that interval when every other higher-priority job J j executes for time C j , then it is a candidate for lowest priority. For a high-criticality job, consider the same interval: if the job still has at least C i ' execution time available when every other higher-priority job J j executes for time C j ', then it is a candidate for lowest priority. They showed that if the low-criticality and high-criticality loads of the system are each bounded by about 0.62, then the system is guaranteed to be schedulable by their algorithm.

PLRS

Guan et al. [Gua11] developed the PLRS algorithm, which uses a combination of offline and online priority computations. The offline priority computation is similar to the online priority computation of Li and Baruah. Job $J_k^{\delta_k}$ is assigned the lowest priority if

(EQ 4.10) $\sum_{\tau_j \in \tau} \left( \delta_j \times C_j(l_k) \right) \le (\delta_k - 1) \times T_k + D_k$

In this formula, δ j is the number of jobs in task τ j that have not yet been assigned a priority. This process is used to form a priority list Λ j for each task: the members of Λ j are the priorities of the jobs in task j. The priority list is used at runtime to assign priorities to jobs as they are released. The priority list is represented as a set of intervals rather than a separate priority for each job—when a set of jobs in the list have consecutive priority values, the sequence is represented by the lowest and highest priorities in the sequence. When a new job is released while an existing job is running, the scheduler adjusts the set of priority sequences to reflect the priority of the newly scheduled job.

CBEDF

Park and Kim [Par11] developed the criticality-based earliest deadline first (CBEDF) algorithm. Their algorithm uses a combination of offline and online scheduling to make use of available slack time to schedule low-criticality jobs. Their offline scheduling algorithm schedules high-criticality jobs in EDF order, which puts each job as close to its deadline as possible. They then calculate the slacks for each high-criticality job. At runtime, they select a job depending on the state of the ready job queue: no jobs are available; only low-criticality jobs are available; only high-criticality jobs are available; or both low- and high-criticality jobs are available. In the last case, they schedule a low-criticality job if slack is remaining; otherwise they schedule a high-criticality job.

Ductility

Lakshmanan et al. [Lak12] developed mixed-criticality scheduling algorithms for uniprocessors and multiprocessors. They propose ductility as an overload tolerance metric. Given a set of k criticality levels, a system's ductility matrix D is a 0/1 matrix of size $2^k \times k$ in which an entry $d_{r,c} = 1$ denotes that a task at criticality level c can meet its deadlines given a system workload w r , while $d_{r,c} = 0$ indicates that the task cannot meet its deadlines. The ductility matrix can be summarized as a scalar quantity:

(EQ 4.11) $PD(D) = \sum_{c=1}^{k} \left\{ \frac{\sum_{r=1}^{2^k} d_{r,c}}{2^c \times 2^k} \right\}$

This value is normalized to the range [0,1] by dividing by $1 - \frac{1}{2^k}$. Lakshmanan et al. proposed zero-slack scheduling for uniprocessors. Each task in the system is in either the normal or the critical mode. In critical mode, all tasks with lower criticality are suspended. A task is admitted if the system ductility matrix has $d_{r,c} = 1$ for $r \le 2^{c-1}$. They applied zero-slack scheduling to rate-monotonic scheduling to create the zero-slack rate-monotonic (ZSRM) scheduling algorithm. The zero-slack instant of a task is the time at which the task switches from normal to critical mode. They use an offline algorithm to find the zero-slack instant for each task and so determine the admission control policy. An online algorithm then enforces that policy by blocking tasks based on the criticality and zero-slack instants of tasks.

In the case of multiprocessors, different allocations of tasks to processors can change the likelihood of overload situations. Lakshmanan et al. use the normalized ductility value for a given allocation to evaluate its susceptibility to overload. Their compress-on-overload packing algorithm operates in two phases. First, tasks are allocated using a best-fit decreasing heuristic. Some tasks may remain unallocated but those that are allocated are guaranteed to meet their deadlines even under overload. Next, ZSRM is used to allocate additional tasks to processors.

4.2.4 Scheduling for dynamic voltage and frequency scaling

Scheduling for dynamic voltage scaling

Many groups have studied how to schedule tasks on processors that implement dynamic voltage and frequency scaling (DVFS). DVFS is widely used for power management; it is also used for thermal management.

Yao et al. [Yao95] developed early scheduling algorithms for DVFS. They assumed that the processor clock could be varied continuously. They modeled each process j with an arrival time a j , a deadline b j , and a required number of CPU cycles R j . (Execution time depends on the clock frequency as well as the number of clock cycles.) They defined the intensity of an interval I = [z,z′] in the schedule as

(EQ 4.12) $g(I) = \frac{\sum_{[a_j, b_j] \subseteq I} R_j}{z' - z}$

The intensity of an interval defines a lower bound on the average processing speed required to create a feasible schedule. Yao et al. called the interval that maximizes the intensity the critical interval (known as I∗) and the set of processes that run in that interval the critical group. They showed that the processor speed used during the critical interval by an optimal schedule is equal to its intensity g(I∗). They developed an optimal offline scheduling algorithm that repeatedly identified critical intervals and scheduled the critical group. They also developed an online scheduling heuristic. Their average rate heuristic sets the processor speed at

(EQ 4.13) $s(t) = \sum_j \frac{R_j}{b_j - a_j}$

The order in which processes are executed is determined using an EDF policy. They showed that the average rate heuristic (AVR) uses at most a constant factor more energy than an optimal schedule. The slowdown factor of a process is often called either α or η.
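The average rate heuristic of EQ. (4.13) just sums the densities R_j/(b_j − a_j) of the jobs active at time t; the sketch below uses made-up jobs.

```python
# Average rate (AVR) heuristic, EQ. (4.13): the processor speed at time t is the sum
# of the densities R_j / (b_j - a_j) of all jobs active at t. Job values are hypothetical.
jobs = [
    {"a": 0, "b": 10, "R": 20},   # arrival time, deadline, required cycles
    {"a": 2, "b": 6,  "R": 12},
    {"a": 5, "b": 9,  "R": 8},
]

def avr_speed(t: float) -> float:
    return sum(j["R"] / (j["b"] - j["a"]) for j in jobs if j["a"] <= t <= j["b"])

for t in (1, 3, 7):
    print(t, avr_speed(t))   # 2.0, 5.0, 4.0
```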

Pillai and Shin [Pil01] proposed testing for the feasibility of EDF or RMS scheduling under voltage scaling by multiplying the maximum utilization by the slowdown factor. They also proposed a cycle-conserving scheduling algorithm that measured the difference between the actual and worst-case execution times of the tasks and scaled the processor frequency to adjust for the unused execution time. They also proposed predicting runtimes and using that information to scale the processor frequency before the jobs had finished.

Discrete voltages and frequencies

Yao et al.'s assumption of continuously scalable frequencies is naive. In practice, the processor clock and power supply voltage can be set to a relatively small set of discrete values within the operating envelope of the processor.

Ishihara and Yasuura [Ish98a] proved some useful theorems about DVFS with discrete voltages. They used v ideal to refer to the voltage at which the CPU executes a process so that it finishes exactly at the deadline. They showed that if the processor is restricted to a small number of voltage levels, then two voltage levels are sufficient to minimize total energy consumption under a time constraint. This requirement can be formulated as

(EQ 4.14) $V_1^2 x_1 + V_2^2 x_2 + V_3^2 x_3 \ge V_1^2 y_1 + V_2^2 y_2$

where the x's are the execution times at each voltage in the three-voltage system and the y's are the execution times in the two-voltage system. The requirement of Eq. (4.14) is subject to the execution times and execution cycles of the two schedules being the same:

(EQ 4.15) $\frac{x_1}{V_1}(V_1 - V_T)^{\alpha} + \frac{x_2}{V_2}(V_2 - V_T)^{\alpha} + \frac{x_3}{V_3}(V_3 - V_T)^{\alpha} = \frac{y_1}{V_1}(V_1 - V_T)^{\alpha} + \frac{y_2}{V_2}(V_2 - V_T)^{\alpha}$

(EQ 4.16) $x_1 + x_2 + x_3 = y_1 + y_2$

The constraints can be satisfied only if x 1 = y 1. In this case, we can rewrite the constraints in the following form:

(EQ 4.17) $V_1^2 (x_1 - y_1) + V_3^2 x_3 \ge V_2^2 (y_2 - x_2)$

Equation (4.17) implies that voltage scheduling with the three voltages cannot minimize energy consumption when V ideal is between V 1 and V 2 . Similarly, three voltages cannot minimize energy when V ideal is between V 2 and V 3 . Therefore, two voltage levels minimize energy consumption.

Ishihara and Yasuura also showed that the two voltages to use are immediate neighbors to V ideal . Figure 4.7 illustrates the proof. Both the continuous time-energy curve and linear approximations are shown. If, for example, the time constraint falls between t 2 and t 3, then the V 2 to V 3 line gives lower energy consumption than the V 1 to V 3 line.

FIGURE 4.7. Voltage scaling–based scheduling with two voltages.

From Ishihara and Yasuura [Ish98a] ©1998 IEEE.
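A small sketch (not from Ishihara and Yasuura's paper) of the two-voltage idea: when the ideal operating point falls between two available settings, the cycles are split between the neighboring settings so the job finishes exactly at its deadline. The voltage/frequency pairs and the simple E ∝ V² × cycles energy model are illustrative assumptions.

```python
# Split a job's cycles between the two discrete settings that neighbor the ideal
# operating point, so it completes exactly at the deadline. Values are hypothetical,
# and energy is modeled crudely as voltage^2 * cycles executed.
R = 6e8            # required cycles
D = 1.0            # deadline in seconds
settings = [(0.8, 4e8), (1.0, 8e8), (1.2, 1.2e9)]   # (volts, cycles per second)

f_ideal = R / D    # 6e8 cycles/s, which falls between the 4e8 and 8e8 settings

lower = max((s for s in settings if s[1] <= f_ideal), key=lambda s: s[1])
upper = min((s for s in settings if s[1] >= f_ideal), key=lambda s: s[1])

if lower == upper:                     # ideal point is exactly one of the settings
    t_low, t_high = D, 0.0
else:
    f_low, f_high = lower[1], upper[1]
    # Solve t_low + t_high = D and f_low*t_low + f_high*t_high = R.
    t_high = (R - f_low * D) / (f_high - f_low)
    t_low = D - t_high

energy = lower[0] ** 2 * lower[1] * t_low + upper[0] ** 2 * upper[1] * t_high
print(t_low, t_high, energy)   # 0.5 0.5 528000000.0
```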

Slack-based scheduling

Kim et al. [Kim02] developed an algorithm to use slack times from processes to scale processor voltage. Their algorithm takes advantage of slack from both higher-priority and lower-priority tasks. Their algorithm for determining the available slack is shown in Figure 4.8.

FIGURE 4.8. An algorithm for estimating slack for dynamic voltage scaling–based scheduling.

From Kim et al. [Kim02] ©2002 IEEE computer society.

Checkpoint-driven scheduling

Azevedo et al. [Aze02] used profile information to guide DVFS scheduling. They profiled the program using simulators to determine the program's performance and energy consumption for a variety of inputs. The designer can insert checkpoints at arbitrary places in the program to measure performance and energy. During execution, the program generates an event at each checkpoint. The scheduler can use either average power or a maximum power budget as its objective. To schedule the program, the scheduler considers all the possible events from the current checkpoint to the deadline. It then calculates the optimal frequency based on power and time constraints. Azevedo et al.'s scheduling algorithm is shown in Figure 4.9.

FIGURE 4.9. A checkpoint-driven DVFS scheduling algorithm. From Azevedo et al. [Aze02] ©2002 IEEE computer society.

Leakage minimization

Idle modes save even more processor energy than does voltage scaling; this is particularly true for processors with high leakage rates. However, because entering and leaving an idle mode incurs considerable penalties in both energy and performance, we want to maximize the length of any idle time that is entered and minimize the number of distinct idle times. Procrastination scheduling is a family of scheduling algorithms designed to maximize the length of idle periods.

Jejurikar et al. [Jej04] used procrastination scheduling to maximize idle periods. They modeled the power consumption of a processor as three components:

(EQ 4.18) $P = P_{AC} + P_{DC} + P_{on}$

where P A C is dynamic power consumption, P D C is static power consumption, and P o n is on-state power consumption. On-state power is consumed by the clock, I/O devices, and other peripheral circuitry. Jejurikar et al. compute the minimum breakeven time for a shutdown as

(EQ 4.19) $t_{threshold} = \frac{E_{shutdown}}{P_{idle}}$

Given a set of processes, they use EDF scheduling with the deadline equal to the period and a known worst-case execution time. When a new task arrives, they do not wake up the processor immediately; instead, they wait for a procrastination timer to expire before waking up the processor and starting the highest-priority task (using the EDF policy). The procrastination interval of a process is Z i . They showed that the procrastination algorithm guarantees all process deadlines if the procrastination interval satisfies

(EQ 4.20) $\frac{Z_i}{T_i} + \sum_{1 \le k \le i} \frac{C_k}{T_k \, \eta_k} \le 1$

for all i ∈ 1 . . n and for all k < i, $Z_k \le Z_i$. They showed that their policy provided a 5% energy gain over DVFS and a 20% gain over non-DVFS scheduling.
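The condition of EQ. (4.20) is straightforward to check for a candidate set of procrastination intervals; the task parameters and slowdown factors below are hypothetical.

```python
# Check the procrastination condition of EQ. (4.20) for each task.
# Tasks are (C, T, eta, Z), ordered from highest to lowest priority; values are hypothetical.
tasks = [(1, 10, 1.0, 2), (2, 20, 0.8, 3), (4, 50, 0.5, 5)]

all_ok = True
for i, (_, t_i, _, z_i) in enumerate(tasks):
    demand = z_i / t_i + sum(c / (t * eta) for (c, t, eta, _) in tasks[: i + 1])
    ordering = all(tasks[k][3] <= z_i for k in range(i))   # Z_k <= Z_i for k < i
    print(f"task {i}: demand = {demand:.3f} (<= 1? {demand <= 1}), Z ordering ok? {ordering}")
    all_ok = all_ok and demand <= 1 and ordering

print("all deadlines guaranteed:", all_ok)
```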

Leakage and temperature

Gu and Qu [Gu13] developed a model of leakage that incorporated temperature effects. They also derived scheduling algorithms to minimize energy consumption taking into account temperature-dependent leakage. For a single task, they ran at the lowest voltage that allowed the task to finish by its deadline, putting all idle time at the start of the interval. They proposed a heuristic for scheduling multiple tasks: if the total idle time is less than 1.5 R C (where R and C are the thermal resistance and capacitance), they allocate all idle time to the end; if idle time is greater than 1.5 R C , they allocate at least 1.5 R C to the end of the period, leaving the remaining idle time to the front. This heuristic takes advantage of the exponential decay of temperature with time and the relationship between temperature and leakage current. Given a wide range of starting temperatures, after an idle interval of 1.5 R C , temperature profiles have narrowed to a small range that gives a very small difference in leakage current.

We will discuss thermal-aware scheduling for multiprocessors in Section 7.5.

4.2.5 Performance estimation

The assumption that the computation time of a process is fixed is not very realistic. Not only do data-dependent paths in the program cause execution time to vary, but also the cache causes large variations in runtime. We can, however, model the cache to estimate its effects on programs.

Multitasking and caches

We are particularly interested in the effects of multiple tasks on the cache. Kirk and Strosnider [Kir90] proposed a segmented, locked cache. A program could lock a range of lines in the cache so that no other program could modify those cache locations. This would allow the program to keep certain parts of itself in the cache after a preemption. However, it reduces the cache size not only for the program with the lock but also for the other programs in the system.

Program placement for caches

Mueller [Mue95] used software methods to partition the cache for use by multiple processes. His method used the compiler to split the code into portions of equal size, based on the size of the instruction cache partition. Each partition ends in an unconditional jump. Similarly, the method splits data into smaller units. In some cases, splitting large data structures like arrays may require transformations on the parts of the program that manipulate the data structure. Local data is split into partitions by manipulating the stack pointer. Statically linked libraries can be transformed to create fixed-size partitions, but dynamically linked libraries are problematic.

Simplified process caching models

Li and Wolf [Li97b] developed a model for multitasking in caches. They characterized a program as occupying a set of lines in the cache. Each program was modeled with a two-state machine: one state represents the program in the cache, while the other state models the program outside of the cache. The total state of the cache is given by the union of all the models for the programs that use the cache. Li and Wolf modeled the performance of a program with two major numbers: the worst-case execution time when the program is not in the cache and the average-case execution time when the program is in the cache. (For synthesis purposes, they also used a best-case time assuming that there were no cache misses, but it was used only to bound possible schedules.)
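A toy version of this idea is sketched below (Li and Wolf's actual model and synthesis flow are more detailed): each program contributes its average-case time if it is still resident in the abstract cache state and its worst-case time otherwise, and running a program evicts the programs it conflicts with. All numbers and the conflict relation are hypothetical.

```python
# Toy two-state (in-cache / not-in-cache) multitasking execution-time estimate,
# in the spirit of Li and Wolf's model. All values are hypothetical.
programs = {
    # name: (worst-case time when not cached, average-case time when cached, conflicts)
    "A": (100, 60, {"C"}),
    "B": (80, 50, set()),
    "C": (120, 70, {"A"}),
}

def estimate(schedule: list[str]) -> int:
    in_cache: set[str] = set()
    total = 0
    for name in schedule:
        wcet, acet, conflicts = programs[name]
        total += acet if name in in_cache else wcet
        in_cache -= conflicts          # running this program evicts conflicting ones
        in_cache.add(name)
    return total

print(estimate(["A", "B", "A", "C", "A"]))   # 100 + 80 + 60 + 120 + 100 = 460
```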

Given the cache state model and the performance characteristics of each process, they could construct an abstract schedule that approximated the execution time of the multitasking system. They tested the accuracy of this model by comparing it to detailed simulations of the programs, using interleaved sections of the code of the various processes. Figure 4.10 shows the results of simulation that compared the execution times predicted by the two-state model with simulation instruction traces for the interleaved processes using the program profiler QPT and cache simulator Dinero [Li98b].

FIGURE 4.10. Experimental validation of two-state cache model.

From Li [Li98b]

Caches and scheduling

Kastner and Thesing [Kas98] developed a scheduling algorithm that takes cache behavior into account. Their analysis handled associative caches. They determined which memory lines must be in the cache at a given point in the program. Their scheduling algorithm checks the cache state at scheduling decision points to more accurately estimate execution time.

Multitasking and scratch pads

When several programs execute on the CPU, all the programs compete for the scratch pad, much as they compete for the cache. However, because the scratch pad is managed in software, the allocation algorithm must take multitasking into account. Panda et al. propose dividing the scratch pad into segments and assigning each task its own segment. This approach reduces run-time overhead for scratch pad management but results in underutilization of part of the scratch pad. When the programs are prioritized, they weight the total conflict fetch (TCF) by the task priority (see Section 3.3.4), with higher-priority tasks given more weight. (Note that this is the inverse of the convention in real-time systems, in which the highest priority task is given a priority of 1.)

URL: https://www.sciencedirect.com/science/article/pii/B9780124105119000046

Toward Running Scientific Workflows in the Cloud

Wenhong Tian, Yong Zhao, in Optimized Cloud Resource Management and Scheduling, 2015

12.4.3.3 Different number of data blocks experiment

In this experiment, we change the number of input data blocks from 50 blocks to 25 blocks and measure the total execution time with a varying number of workers in the virtual cluster.

In Figure 12.9, we can observe that, with the increase in the number of workers, the execution time decreases accordingly (i.e., execution efficiency improves). However, when using five workers to process the workflow, the system reaches its efficiency peak. After that, the execution time goes up with more workers. This means that the improvement can no longer compensate for the management and registration overhead of the added workers. The time for server and worker creation and worker registration remains unchanged when we change the input size (as shown in Figure 12.5). The experiment indicates that although our virtual resource provisioning overhead is well controlled, we do need to carefully determine the number of workers used in the virtual cluster to achieve resource utilization efficiency.

Figure 12.9. Different input size.

URL: https://www.sciencedirect.com/science/article/pii/B9780128014769000124

Evaluation of Performance and Power Consumption of Storage Systems

Jalil Boukhobza, Pierre Olivier, in Flash Memory Integration, 2017

3.2.1.1 Pure performance evaluation metrics

In this context, classical metrics for evaluating storage systems' performance are encountered: at very high levels, the total execution time of a benchmark (or, more generally, of a test) is often used to compare different systems or different configurations of the same system with each other. At smaller granularity levels, it is also possible to measure the execution time of a given operation (sometimes called latency), for example a read request at block level for a storage peripheral or a read( ) system call for an FFS. Typically, this latency is calculated as an average value of several calls of this operation, or as a distribution of the execution time of several operation calls for a finer analysis. The number of flash operations (read/write/erase) generated by a higher level operation, for example the execution of a write( ) system call or the execution of an entire benchmark, is also a frequently employed metric. Read and write throughput is quite a classical metric as well, and it can be measured at different levels: for example, at application, file system or block level. We can also find metrics that describe the number of satisfied I/O requests per second (IOPS). The IOPS value must always be accompanied by information about, for example, the type of operation concerned, the size of requests, the access patterns, etc. This metric is generally used to describe random access performance with a size of block or request of 4   KB.

Write amplification is a metric that quantifies the phenomenon in which an application request to write a certain amount of data causes a greater amount of data to be written in flash. This is due in particular to possible garbage collection operations that can occur. Note that the existence of caches above the management layer (for example, the Linux page cache situated above the FFS) can cause the system to write less data on the storage device than the amount requested by the application. Therefore, generally, during the calculation of write amplification, it is desirable to consider the workload entering directly into the management layer (independently of the caches). There are several methods for calculating write amplification [HU 09, CHI 11]. An intuitive method consists of dividing the total amount of flash space that is actually written during a test by the amount of space written by the workload that enters the storage system.
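The intuitive calculation described above is a single division; the numbers below are hypothetical and assume the workload is measured where it enters the flash management layer (below any page cache).

```python
# Write amplification: flash bytes physically written / bytes written by the workload
# entering the management layer. Hypothetical numbers.
host_bytes_written = 4 * 1024**3      # 4 GiB issued by the workload
flash_bytes_written = 7 * 1024**3     # 7 GiB actually written to flash
                                      # (extra writes come from garbage collection, etc.)

write_amplification = flash_bytes_written / host_bytes_written
print(f"WA = {write_amplification:.2f}")   # 1.75
```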

URL: https://www.sciencedirect.com/science/article/pii/B9781785481246500049

Analysis of Operating System Components

Paul J. Fortier, Howard E. Michel, in Computer Systems Performance Evaluation and Prediction, 2003

What if the entities do require I/O?

Entities requiring I/O are freed from the CPU and routed through an assign-node, ioreq, which updates the total execution time already spent by the entity in the CPU. After this, entities wait at an await-node, iowait, on the group resource I/O block, which has various standard I/O resources, such as monitor, mouse, keyboard, and printer. After the entities are serviced by the resource, the attribute LTRIB[2] is set to zero, assuming that the entity no longer requires additional I/O. After this the entities are routed through a free-node, free_io, where all the resources allocated to that entity are freed. Finally, the entities are routed back to the ready queue, since they are done with their execution.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781555582609500138

Embedded Multiprocessors

Marilyn Wolf , in Computers as Components (Fourth Edition), 2017

10.4.3 Accelerator performance analysis

In this section, we are most interested in speedup: How much faster is the system with the accelerator than the system without it? We may, of course, be concerned with other metrics such as power consumption and manufacturing cost. However, if the accelerator does not provide an attractive speedup, questions of cost and power will be moot.

Performance analysis of an accelerated system is a more complex task than what we have done thus far. In Chapter 6 we found that performance analysis of a CPU with multiple processes was more complex than the analysis of a single program. When we have multiple processing elements, the task becomes even more difficult.

The speedup factor depends in part on whether the system is single threaded or multithreaded, that is, whether the CPU sits idle while the accelerator runs in the single-threaded case or the CPU can do useful work in parallel with the accelerator in the multithreaded case. Another equivalent description is blocking versus nonblocking: Does the CPU's scheduler block other operations and wait for the accelerator call to complete, or does the CPU allow some other process to run in parallel with the accelerator? The possibilities are shown in Fig. 10.5. Data dependencies allow P2 and P3 to run independently on the CPU, but P2 relies on the results of the A1 process that is implemented by the accelerator. However, in the single-threaded case, the CPU blocks to wait for the accelerator to return the results of its computation. As a result, it does not matter whether P2 or P3 runs next on the CPU. In the multithreaded case, the CPU continues to do useful work while the accelerator runs, so the CPU can start P3 just after starting the accelerator and finish the task earlier.

Figure 10.5. Single-threaded versus multithreaded control of an accelerator.

The first task is to analyze the performance of the accelerator. As illustrated in Fig. 10.6, the execution time for the accelerator depends on more than just the time required to execute the accelerator's function. It also depends on the time required to get the data into the accelerator and back out of it.

Figure 10.6. Components of execution time for an accelerator.

Because the CPU's registers are probably not addressable by the accelerator, the data probably reside in main memory.

Accelerator execution time

A simple accelerator will read all its input data, perform the required computation, and then write all its results. In this case, the total execution time may be written as

(10.1) t_{accel} = t_{in} + t_x + t_{out}

where t_x is the execution time of the accelerator assuming all data are available, and t_in and t_out are the times required for reading and writing the required variables, respectively. The values for t_in and t_out must reflect the time required for the bus transactions, including two factors:

the time required to flush any register or cache values to main memory, if those values are needed in main memory to communicate with the accelerator; and

the time required for transfer of control between the CPU and accelerator.

Transferring data into and out of the accelerator may require the accelerator to become a bus master. Because the CPU may delay bus mastership requests, some worst-case value for bus mastership acquisition must be determined based on the CPU characteristics.
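
The sketch below assembles Eq. (10.1) from hypothetical per-component costs, folding the cache-flush, control-handoff, and bus-mastership terms discussed above into t_in and t_out; all the parameter values are assumptions chosen for illustration.

```python
# Illustrative only: all timing parameters below are hypothetical, chosen to
# show how the flush, handoff, bus-mastership, and transfer terms enter t_in
# and t_out in Eq. (10.1).
def t_accel(t_x, n_in_words, n_out_words,
            t_word=0.02e-6,      # assumed bus transfer time per word
            t_flush=5e-6,        # assumed worst-case register/cache flush time
            t_handoff=1e-6,      # assumed CPU/accelerator control transfer time
            t_bus_grant=2e-6):   # assumed worst-case bus mastership acquisition
    t_in = t_flush + t_handoff + t_bus_grant + n_in_words * t_word
    t_out = t_bus_grant + n_out_words * t_word + t_handoff
    return t_in + t_x + t_out    # Eq. (10.1): t_accel = t_in + t_x + t_out

# Example: 4096 input words, 1024 output words, 50 us of accelerator computation.
print(f"t_accel = {t_accel(50e-6, 4096, 1024) * 1e6:.1f} us")
```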

A more sophisticated accelerator could try to overlap input and output with computation. For example, it could read a few variables and start computing on those values while reading other values in parallel. In this case, the t_in and t_out terms would represent the nonoverlapped read/write times rather than the complete input and output times. One important example of overlapped I/O and computation is streaming data applications such as digital filtering. As illustrated in Fig. 10.7, an accelerator may take in one or more streams of data and output a stream. Latency requirements generally require that outputs be produced on the fly rather than storing up all the data and then computing; furthermore, it may be impractical to store long streams at all. In this case, the t_in and t_out terms are determined by the amount of data read in before starting computation and the length of time between the last computation and the last data output.

Figure 10.7. Streaming data into and out of an accelerator.

We are most interested in the speedup obtained by replacing the software implementation with the accelerator. The total speedup S for a kernel can be written as [Hen94]:

(10.2) S = n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})]

where t_CPU is the execution time of the equivalent function in software on the CPU and n is the number of times the function will be executed. We can use the techniques of Chapter 5 to determine the value of t_CPU. Clearly, the more times the function is evaluated, the more valuable the speedup provided by the accelerator becomes.
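
As a worked illustration of Eq. (10.2), the following sketch computes the time saved by the accelerator over n invocations; the kernel times used are hypothetical.

```python
def accelerator_speedup(n, t_cpu, t_in, t_x, t_out):
    """Eq. (10.2): total time saved over n executions of the kernel,
    S = n * (t_CPU - t_accel), with t_accel = t_in + t_x + t_out."""
    return n * (t_cpu - (t_in + t_x + t_out))

# Hypothetical kernel: 1 ms in software, 0.25 ms on the accelerator including
# data movement, executed 500 times -> 0.375 s of execution time saved.
print(accelerator_speedup(n=500, t_cpu=1.0e-3, t_in=0.05e-3, t_x=0.15e-3, t_out=0.05e-3))
```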

System speedup

Ultimately, we care not so much about the accelerator's speedup as about the speedup for the complete system—that is, how much faster the entire application completes execution. In a single-threaded system, translating the accelerator's speedup into the total system speedup is simple: the system execution time is reduced by S. The reason is illustrated in Fig. 10.8—the single thread of control gives us a single path whose length we can measure to determine the new execution speed.

Figure 10.8. Evaluating system speedup in a single-threaded implementation.

Evaluating system speedup in a multithreaded environment requires more subtlety. As shown in Fig. 10.9, there is now more than one execution path. The total system execution time depends on the longest path from the beginning of execution to the end of execution. In this case, the system execution time depends on the relative speeds of P3 and P2 plus A1. If P2 and A1 together take the most time, P3 will not play a role in determining system execution time. If P3 takes longer, then P2 and A1 will not be a factor. To determine system execution time, we must label each node in the graph with its execution time. In simple cases we can enumerate the paths, measure the length of each, and select the longest one as the system execution time. Efficient graph algorithms can also be used to compute the longest path.

Figure 10.9. Evaluating system speedup in a multithreaded implementation.
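
The longest-path computation described above can be sketched as follows for a small task graph loosely modeled on Fig. 10.9; the node execution times and graph structure are assumed values, not taken from the figure.

```python
import functools

# Hypothetical task graph loosely following Fig. 10.9: P2 depends on the
# accelerated process A1, while P3 runs independently on the CPU.
exec_time = {"start": 0.0, "A1": 4.0, "P2": 3.0, "P3": 6.0, "end": 0.0}
succs = {"start": ["A1", "P3"], "A1": ["P2"], "P2": ["end"], "P3": ["end"], "end": []}

@functools.cache
def longest_path(node):
    """Longest execution path from `node` to the end of execution."""
    if not succs[node]:
        return exec_time[node]
    return exec_time[node] + max(longest_path(s) for s in succs[node])

# System execution time is the longest path: A1 + P2 = 7.0 dominates P3 = 6.0.
print(longest_path("start"))
```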

This analysis shows the importance of selecting the proper functions to be moved to the accelerator. Clearly, if the function selected for speedup is not a big portion of system execution time, taking the number of times it is executed into account, you will not see much system speedup. We also learned from Eq. (10.2) that if too much overhead is incurred getting data into and out of the accelerator, we will not see much speedup.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128053874000108

Accurately modeling GPGPU frequency scaling with the CRISP performance model

R. Nath , D. Tullsen , in Advances in GPU Research and Practice, 2017

3.4 Example

We illustrate the mechanism of our model in Fig. 13, where loads and computations are from all the concurrently running warps in a single SM. The total execution time of the program at clock frequency f is 31 units (5 units stall + 26 units computation). As we scale down the frequency to f/2, all the compute cycles scale by a factor of 2. Because of the scaling of overlapped computation under load A (cycles 0–6) at clock frequency f/2, the compute cycle 8 cannot begin before time unit 14. Similarly, the instruction at compute cycle 28 cannot be issued until time unit 50 because the expansion of overlapped computation under load C (cycles 14–23) pushes it further. Meanwhile, as the computation at cycle 28 expands due to frequency scaling, it overlaps with the subsequent store stalls.

Fig. 13. T_Memory computation for a GPU program running at frequency f with different linear performance models.

For each of the prior performance models, we identify the cycles classified as the nonpipelined portion of computation (T_Memory). Leading load (LEAD) computes T_Memory as the sum of A, B, and D's latencies (8 + 5 + 5 = 18). Miss model (MISS) identifies three contributor misses (A, B, and D) and reports T_Memory to be 24 (8 × 3). Among the two dependent load paths (A + B + D with length 8 + 5 + 5 = 18, and A + C with length 8 + 12 = 20), CRISP computes the LCP as 20 (for the longest path A + C). The stall cycles computed by STALL are 4. We plug the value of T_Memory into Eq. (2) and predict the execution time at clock frequency f/2. As expected, the presence of overlapped computation under T_Memory and store-related stalls introduces an error in the prediction of the prior models. The STALL model overpredicts, while the rest of the models underpredict.

Our model observes two load outstanding phases (cycles 0–7 and 12–27), three pure compute phases (cycles 8–11, 28, and 30), and one store stall phase (cycle 29). CRISP computes the LCP as 20 (8 + 12 for the path A + C). The rest of the cycles (11 = 31 − 20) are assigned to the CSP. The overlapped computation under the LCP is 17 (cycles 0–6 and 14–23). The nonoverlapped computation in the CSP is 10 (cycles 8–13, 26–28, and 30). Note that cycle 29 is the only store stall cycle. CRISP computes the scaled LCP as max(17 × 2, 20) = 34 and the scaled CSP as max(10 × 2, 11) = 20. Finally, the predicted execution time by CRISP at clock frequency f/2 is 54 (34 + 20), which matches the actual execution time (54).
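
The arithmetic of this example can be reproduced directly; the sketch below uses only the cycle counts quoted in the text and the max-based scaling of the LCP and CSP described above.

```python
# Cycle counts quoted in the text for the example at frequency f.
scale = 2                  # frequency scaled from f down to f/2
lcp = 20                   # longest dependent-load path, A + C (8 + 12)
csp = 31 - lcp             # the remaining 11 cycles form the CSP
overlapped = 17            # computation overlapped under the LCP (cycles 0-6, 14-23)
nonoverlapped = 10         # pure computation in the CSP (cycles 8-13, 26-28, 30)

scaled_lcp = max(overlapped * scale, lcp)      # max(34, 20) = 34
scaled_csp = max(nonoverlapped * scale, csp)   # max(20, 11) = 20
print(scaled_lcp + scaled_csp)                 # 54, the execution time at f/2
```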

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128037386000185

Advances in Computers

Amjad Ali , Khalid Saifullah Syed , in Advances in Computers, 2013

9 Parallel Performance Metrics

A number of metrics are available in the literature to quantify the performance of parallel programs on a given parallel platform. The most basic metrics include "total execution time," "relative speedup," and "relative efficiency." In the present discussion, the metrics "relative speedup" and "relative efficiency" might simply be called "speedup" and "efficiency," respectively. The execution time consists of both computation and communication time. It is "the elapsed wall clock time from the start of the execution of the first process of a parallel program to the end of the execution of its last process." The execution time of any code or code segment can be measured using a variety of timer functions available in the language and its implementation.
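
As a minimal illustration of such timing, the sketch below measures the elapsed wall-clock time of a code segment with Python's time.perf_counter(); an MPI program would typically use MPI_Wtime() instead, and the workload shown is only a stand-in.

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(10_000_000))   # stand-in for the code under test
elapsed = time.perf_counter() - start
print(f"elapsed wall-clock time: {elapsed:.3f} s")
```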

Relative speedup, Ş, of a parallel program is "the ratio of the elapsed time, τ_1, taken by one process to solve a problem of size n to the elapsed time, τ_p, taken by p processes to solve the same problem," i.e.,

Ş = Ş(n, p) = \frac{\tau_1}{\tau_p}.

The relative efficiency, E, is defined as

E = E(n, p) = \frac{Ş}{p}.

In general, the observed speedup is less than p and the observed efficiency is between 0 and 1. In an ideal case,

\tau_p = \frac{\tau_1}{p}, \quad Ş = p, \quad \text{and} \quad E = 1.

Sometimes so-called "super-linear speedup" is observed, where the speedup is greater than p. This phenomenon is caused by improved cache efficiency: the data held by each of the p processors is smaller than in the single-processor case.
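
The two definitions above translate directly into code; the sketch below computes speedup and efficiency from hypothetical measured elapsed times.

```python
def speedup_and_efficiency(tau_1, tau_p, p):
    """Relative speedup and relative efficiency from measured elapsed times."""
    s = tau_1 / tau_p
    return s, s / p

# Hypothetical measurements: 120 s on one process, 18 s on 8 processes.
s, e = speedup_and_efficiency(tau_1=120.0, tau_p=18.0, p=8)
print(f"speedup = {s:.2f}, efficiency = {e:.2f}")   # 6.67 and 0.83
```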

Usually, the efficiency per processing element decreases as the number of processing elements increases for a given problem size. On the other hand, the overall speedup increases "significantly" with the number of processing elements until a certain number of processing elements, relative to the given problem size, is reached. Increasing the number of processing elements further brings the point of "diminishing returns." To gain further speedup (or to sustain the efficiency), the problem size per process would need to be enlarged. Scalability of a parallel program is a measure of its ability to sustain its efficiency when the number of processing elements and the problem size are both increased in proportion to each other ([37, pp. 208–218]). A basic scaling test is to observe how a performance metric, speedup or efficiency, of the parallel program varies with the number of processes for a given problem size. The isoefficiency metric determines how much the total problem size must be increased as the number of processing elements increases in order to sustain a given efficiency per processing element. If the total problem size needs to be increased at a rate no higher than the number of processing elements, then the application is considered scalable according to this metric. In such a case, the problem size per process remains bounded. On the other hand, if the problem size needs to be increased at a rate higher than the number of processing elements, then the application is not considered scalable. In such a case, the problem size per process would eventually have to become impracticably large to sustain a given efficiency per processing element.
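
The isoefficiency idea can be illustrated with a toy cost model (the model below is an assumption made for the sake of the example, not a result from the text): for each processor count, search for the smallest problem size that sustains a target efficiency, and check how fast that size grows.

```python
# Toy cost model (an assumption for this sketch): fixed serial time, parallel
# work proportional to n, and an overhead h(n, p) that grows with p.
def efficiency(n, p, tau_serial=1.0, work_per_unit=0.01, overhead_per_proc=0.05):
    tau_1 = tau_serial + work_per_unit * n
    tau_p = tau_serial + work_per_unit * n / p + overhead_per_proc * p
    return (tau_1 / tau_p) / p

def isoefficiency_n(p, target=0.8, n=100, step=100):
    """Smallest problem size (to the nearest step) sustaining the target efficiency."""
    while efficiency(n, p) < target:
        n += step
    return n

# If the required n grows no faster than p, the program scales by this metric.
for p in (2, 4, 8, 16):
    print(p, isoefficiency_n(p))
```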

The literature on parallel performance analysis includes several other models, notably Amdahl's law, the Gustafson–Barsis law, the Karp–Flatt metric, and some refined performance models. Each is based on a different approach and applies to a somewhat different situation, but each can help provide an indication of the achievable performance of a parallel program. In general, parallel performance analysis helps in predicting the performance of a parallel program and in understanding the hurdles in achieving higher performance. A brief description of some performance models is given below; relatively more rigorous theoretical discussions of these models can be found in ([5, pp. 123–130]) and ([38, pp. 161–173]).

The total sequential execution time of a program (i.e., its execution time on one processor) can be considered as consisting of two portions: (i) the time required for inherently serial (non-parallelizable) computations, which may be denoted as τ_serial, and (ii) the time required for the parallelizable computations, which may be denoted as τ_parallel, i.e.,

\text{Sequential Execution Time} = \tau_1 = \tau_{serial} + \tau_{parallel}.

Note that the time required for inherently serial (non-parallelizable) computations, τ_serial, remains fixed irrespective of the number of processors p. On the other hand, the time required for the parallelizable computations, τ_parallel, is reduced by executing the parallel program on a multiprocessor computer. Thus, the time for parallel execution of the program on p processors, including parallel overheads, is given by:

\text{Parallel Execution Time} = \tau_p = \tau_{serial} + \frac{\tau_{parallel}}{p} + h(n, p).

Here h(n, p) is the parallel overhead incurred due to communication, synchronization, redundant operations, and/or load balancing among the p processors. Amdahl's law specifies an upper bound on speedup, as follows:

Ş = Ş(n, p) = \frac{\tau_1}{\tau_p} = \frac{\tau_{serial} + \tau_{parallel}}{\tau_{serial} + \frac{\tau_{parallel}}{p} + h(n, p)} \le \frac{\tau_{serial} + \tau_{parallel}}{\tau_{serial} + \frac{\tau_{parallel}}{p}} = \frac{1}{f + \frac{1 - f}{p}}.

Here f is the fraction of τ_1 that is spent on inherently serial (non-parallelizable) computations, i.e.,

f = \frac{\tau_{serial}}{\tau_1} = \frac{\tau_{serial}}{\tau_{serial} + \tau_{parallel}}.
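
A small sketch of the bound: it computes a speedup from the τ-based model above (with hypothetical times and overhead) and compares it against the Amdahl limit 1/(f + (1 - f)/p).

```python
def parallel_time(tau_serial, tau_parallel, p, h=0.0):
    """tau_p = tau_serial + tau_parallel / p + h(n, p), for a given overhead value."""
    return tau_serial + tau_parallel / p + h

def amdahl_bound(f, p):
    """Amdahl upper bound on speedup: 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)

# Hypothetical times: 5 s serial, 95 s parallelizable, 2 s overhead on p = 16.
tau_serial, tau_parallel, p = 5.0, 95.0, 16
f = tau_serial / (tau_serial + tau_parallel)    # f = 0.05
measured = (tau_serial + tau_parallel) / parallel_time(tau_serial, tau_parallel, p, h=2.0)
print(f"measured speedup = {measured:.2f}, Amdahl bound = {amdahl_bound(f, p):.2f}")
```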

Amdahl's law treats the problem size as a constant and assumes that the goal of parallelism is to minimize the execution time. The law indicates how the execution time decreases and the speedup increases as the number of processors increases. Sometimes parallelism has a different goal: within a fixed available execution time, obtain a more precise/accurate result by increasing the problem size on a large number of processors. This goal is taken into account by the Gustafson–Barsis law, which keeps the execution time fixed and determines the effect of increasing the problem size along with the number of processors. The Gustafson–Barsis law puts an upper bound on speedup as follows:

Ş = Ş(n, p) = \frac{\tau_1}{\tau_p} = \frac{\tau_{serial} + \tau_{parallel}}{\tau_{serial} + \frac{\tau_{parallel}}{p} + h(n, p)} \le \frac{\tau_{serial} + \tau_{parallel}}{\tau_{serial} + \frac{\tau_{parallel}}{p}} = p + (1 - p)g.

Here g is the fraction of the parallel execution time spent on inherently serial (non-parallelizable) computations, i.e.,

g = \frac{\tau_{serial}}{\tau_{serial} + \frac{\tau_{parallel}}{p}}.
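
The Gustafson–Barsis bound is equally direct to evaluate; the serial fraction g used below is a hypothetical measured value.

```python
def gustafson_barsis_bound(g, p):
    """Scaled-speedup bound p + (1 - p) * g for serial fraction g of the parallel run."""
    return p + (1.0 - p) * g

# Hypothetical g = 0.05 measured on the parallel run: the bound keeps growing with p.
for p in (4, 16, 64, 1024):
    print(p, round(gustafson_barsis_bound(0.05, p), 2))
```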

The Karp–Flatt metric provides another measure of parallel performance that, unlike Amdahl's law and the Gustafson–Barsis law, takes the parallel overhead into account. The metric is the "experimentally determined serial fraction" of a parallel program and is given by

e = \frac{\frac{1}{Ş} - \frac{1}{p}}{1 - \frac{1}{p}}.

Here the speedup Ş is an experimentally measured quantity, and hence so is the fraction e. A smaller value of e represents better parallel scaling of the program. The metric can sometimes be used to determine which of the "parallel overhead" and the "inherently sequential computation part" is the cause of the parallel inefficiency of the program. If e increases with the number of processors for a fixed problem size, then the parallel overhead is the cause of the parallel inefficiency, rather than the inherently limited parallelism in the program.
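
A short sketch of the metric in use, with hypothetical measured speedups: the experimentally determined serial fraction e rises with p, which, by the argument above, points to parallel overhead rather than inherently limited parallelism.

```python
def karp_flatt(speedup, p):
    """Experimentally determined serial fraction e from a measured speedup on p processors."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# Hypothetical measured speedups for a fixed problem size.
for p, s in [(2, 1.92), (4, 3.48), (8, 5.63), (16, 7.80)]:
    print(p, round(karp_flatt(s, p), 3))   # e grows with p: overhead-dominated inefficiency
```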

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124080898000033