Performance By Design: A blog devoted to Windows performance, application responsiveness and scalability


Sunday, 27 November 2011

Measuring thread execution state using trace events.

Posted on 22:54 by Unknown
Continuing the discussion from the previous blog entry on event-driven approaches to measuring CPU utilization in Windows...

Besides measuring processor utilization at the system level, the stream of context switch events can also be re-constructed to drill into CPU consumption at the process and thread level. A prime example of this approach is the Visual Studio Profiler’s Concurrency Visualizer, available in Visual Studio 2010. (For reference, see “Performance Tuning with the Concurrency Visualizer in Visual Studio 2010,” an MSDN Magazine article written by the tool’s principal architect, Hazim Shafi.) The Concurrency Visualizer gathers Context Switch events to calculate processor utilization for the application being profiled.
The VS Concurrency Visualizer creates a system-level CPU Utilization View with an interesting twist – the view pivots based on the application you are profiling, a perspective that matches that of a software performance engineer engaged in a performance investigation. Based on the sequence of context switch trace events, the Concurrency Visualizer calculates processor utilization by process, aggregates it for the current selection window, and displays it in the CPU Utilization View. In the CPU Utilization View, all processor activity for processes other than the one being profiled is lumped together under a category called “Other Processes.” System processes and the “Idle process,” which is a bookkeeping mechanism rather than an actual process that is dispatched, are also broken out separately. See Dr. Shafi’s article for more details. (For reference, Figure 12 below illustrates the CPU Utilization View.)
The Concurrency Visualizer’s primary focus is on being able to reconstruct the sequence of events that impact an application’s execution progress. The Concurrency Visualizer’s Threads View is the main display showing an application’s execution path. The view here is of execution progress on a thread by thread basis. For each thread in your application, the Concurrency Visualizer shows the precise sequence of context switch events that occurred. These OS Scheduler events reflect that thread’s execution state over time. See Figure 8 for an example of this view.

 Figure 8. Screen shot of the Concurrency Visualizer illustrating thread preemption by a higher priority system routine.
Figure 8 shows the execution path of six application threads, a Main thread, a generic worker thread, and 4 CLR (the Common Language Runtime for .NET languages) worker threads that the application created by instantiating a .NET ThreadPool object. (There were originally more threads than this, but I chose to hide those that were inactive over the entire run.) For each thread, the execution state of the thread – whether it is running or whether it is blocked – is indicated over time.
The upper half of the display is a timeline that shows the execution state of each thread over time. The execution progress of each thread is drawn horizontally, from left to right, as a series of rectangles that indicate the start and end of a particular thread state. An interval when the thread was running is shown in green. An interval when the thread was sleeping is shown in blue. A ready thread that is blocked from executing because a higher priority thread is running is shown in yellow. (This state is labeled “preemption.”) A thread in a synchronization delay waiting on a lock is shown in red.
On the lower left of the display is a Visible Timeline Profile. This summarizes the state of all threads that are visible within the selected time window. In the screen shot in Figure 8, I have zoomed into a time window that is approximately 150 milliseconds wide. During that interval, the threads shown were actively executing instructions only 11% of the time. For 25% of the time interval, threads were blocked waiting on a lock. Finally, there is a tabbed display at the lower right. If you click on the “Profile Report” tab, a histogram is displayed that summarizes the execution state of each individual thread over the time window. In the screen shot, I have clicked on the “Current stack” tab instead, which displays the call stack associated with the ETW context switch event. If the thread is blocked, the call stack indicates where in the code the thread will resume execution once it unblocks. We will drill into that call stack in a moment.
Note: The Threads View also displays call stacks from the processor utilization samples that ETW gathers on a system-wide basis once per millisecond. Call stack samples are visible during any periods when the thread is executing instructions (and ETW execution sampling is active). One of the ETW OS kernel events that the Concurrency Visualizer does not analyze is the ReadyThread event. The interval between a ReadyThread event and the subsequent Context Switch that signals the ready thread is being dispatched measures CPU queue time delay directly. Using the event data, it is possible to measure CPU queuing delays precisely, far exceeding anything that can be done with Windows performance counters to estimate the impact of CPU queuing delays.
The Concurrency Visualizer screen shot in Figure 8 illustrates the calculation of a running thread’s CPU queuing delay. Thread 6920, which happens to be a CLR thread pool worker thread, is shown at a point in time where it was preempted by a higher priority task. The specific delay that I zoomed in on in the screen shot is preemption due to the scheduling of a high priority LPC or ISR – note this category in the Concurrency Visualizer also encompasses assorted APCs and DPCs. In this specific example, execution of Thread 6920 was delayed for 0.7718 milliseconds. According to the trace, that is the amount of time between Thread 6920 being preempted by a high priority system routine and a subsequent context switch when the ready thread was again re-dispatched.
The tool also displays the call stack of the preempted thread. The call stack indicates that the CLR’s garbage collector (GC) was running at the time that thread execution was preempted. From the call stack, it looks like the GC is sweeping the Large Object Heap (LOH), trying to free up some previously allocated virtual memory. This is not an opportune time to get preempted. You can see that one of the other CLR worker threads, Thread 6420, is also delayed. Notice from the color coding that Thread 6420 is delayed waiting on a lock. Presumably, one of the other active CLR worker threads in the parent process holds the lock that Thread 6420 is waiting for.
This is one of those “Aha” moments. If you click on the synchronization delay that Thread 6420 is experiencing, as illustrated in Figure 9, you can see that the lock that Thread 6420 is trying to acquire is, in fact, currently held by Thread 6920, the one that was preempted somewhere in the midst of running garbage collection. Clicking on the tab that says “Current Stack” (not illustrated) indicates that the duration of the synchronization delay that Thread 6420 suffered in this specific instance of lock contention was about 250 milliseconds.
The scenario here shows one CLR worker thread blocked on a lock that is held by another CLR worker thread, which in turn finds itself being delayed by preemptions from higher priority interrupt processing. We can see that whatever high priority work preempted Thread 6920 has the side effect of also delaying Thread 6420, since 6420 was waiting on a lock that Thread 6920 happened to be holding at the time. The tool in Figure 9 displays the Unblocking stack from Thread 6920, which shows the original memory allocation from the Dictionary.Resize() method call being satisfied. When Thread 6920 resumed execution following its preemption, the GC operation completed, releasing the global GC lock. Thread 6920 continues to execute for another 25 microseconds or so before it is preempted because its time slice expired. Even as Thread 6920 blocks, Thread 6420 continues to wait while a different CLR thread pool thread (4664) begins to execute instead. Finally, after another delay of about 25 microseconds, Thread 6420 resumes execution. For a brief period, both 6420 and 4664 execute in parallel, from approximately the 7640 to the 7650 microsecond milestones. (However, they are subject to frequent preemptions during that period of overlapped execution.)
Welcome to the indeterminacy associated with parallel programming.
I won’t take the time here to go into what this little concurrent CLR (Common Language Runtime) thread pool application is doing. Suffice to say that it instantiates and references a very large Dictionary object in .NET, and I wrote it to illustrate some of the performance issues developers can face when trying to do parallel programming, which is a topic I was blogging about at the time. (I should also note that the test program puts the worker threads to sleep periodically to simulate synchronous I/O waits, to create an execution profile similar to what one could expect when processing a typical ASP.NET web request that needs to access an external database, an excellent idea I appropriated from a colleague, Joe Hellerstein.)
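For readers who want a feel for the shape of that workload, here is a minimal sketch of a program in that style: a handful of CLR thread pool workers, each building a very large Dictionary and sleeping periodically to simulate synchronous I/O waits. This is an illustrative sketch only, not the actual test program; all of the class names and constants below are invented.

```csharp
// A minimal sketch (not the actual test program) of the workload described above:
// CLR thread pool workers that build a large Dictionary, sleeping periodically
// to simulate synchronous I/O waits. All names and constants here are invented.
using System;
using System.Collections.Generic;
using System.Threading;

class ThreadPoolDictionaryStress
{
    const int WorkerCount = 4;
    const int ItemsPerWorker = 1_000_000;

    static void Main()
    {
        using var done = new CountdownEvent(WorkerCount);

        for (int i = 0; i < WorkerCount; i++)
        {
            int workerId = i;
            ThreadPool.QueueUserWorkItem(_ =>
            {
                // Each worker references a very large Dictionary, forcing repeated
                // internal Resize() calls and Large Object Heap allocations.
                var dictionary = new Dictionary<int, string>();
                for (int key = 0; key < ItemsPerWorker; key++)
                {
                    dictionary[key] = $"worker {workerId} item {key}";

                    // Simulate a synchronous I/O wait (e.g., a database call behind
                    // an ASP.NET request) every few thousand inserts.
                    if (key % 10_000 == 0)
                        Thread.Sleep(1);
                }
                done.Signal();
            });
        }

        done.Wait();
        Console.WriteLine("All workers finished.");
    }
}
```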
When I first began to profile this test app using the VS Concurrency Visualizer, I was able to see blocking issues like the one described here where the CLR introduced synchronization and locking considerations that are otherwise opaque to the developer. Well, caveat emptor, I suppose, when it comes to utilizing someone else’s code framework in your application. (See Rico Mariani’s Performance Tidbits blog for a singular discussion of his intriguing proposal that a .NET Framework method provide a performance signature that would allow a developer to make an informed decision before ever calling into some 3rd party’s code. Alas, static code analysis cannot be used to predict the performance of some arbitrarily complex method call embedded in your application, something Rico was eventually forced to concede.)
It turns out that .NET Framework collection classes do use locks to ensure thread-safe operation in a multi-threaded program, whether locking is necessary or not. See the MSDN “Thread-Safe Collections” Help topic for more information. Each worker thread in my test program instantiated and accessed a dedicated instance of the Dictionary class during processing, so locking wasn’t necessary in this little test application. Because I had taken steps to ensure thread-safety issues would never arise in my test program, I was unpleasantly surprised when the tool uncovered lock contention for these Dictionary objects. Unfortunately, there is no way for the developer to explicitly signal the runtime that locking is not necessary. Some of the popular .NET Framework collection classes – like the Hashtable – do provide a Synchronized method that exposes a lock created implicitly. But the Synchronized method is designed to support more complex multi-threaded access patterns, such as a multiple readers and writers scenario. To assist in parallel programming tasks, several newer collection classes were introduced in the System.Collections.Concurrent namespace that use “lock-free” and optimistic locking approaches that promise better scalability for parallel programs.
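To make the contrast concrete, here is a short sketch of the two approaches: a dedicated Dictionary per worker, which requires no locking at all, versus a single shared ConcurrentDictionary from the System.Collections.Concurrent namespace. It illustrates the general idea rather than serving as a benchmark.

```csharp
// Contrasting a per-worker Dictionary (no sharing, so no locking needed) with a
// shared ConcurrentDictionary from System.Collections.Concurrent.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class CollectionChoices
{
    static void Main()
    {
        // Option 1: one Dictionary per worker; no other thread ever touches it,
        // so thread safety (and locking) is simply not an issue.
        Parallel.For(0, 4, worker =>
        {
            var privateDict = new Dictionary<int, int>();
            for (int i = 0; i < 100_000; i++)
                privateDict[i] = i * worker;
        });

        // Option 2: one shared collection; ConcurrentDictionary uses fine-grained
        // and lock-free techniques internally to remain thread-safe.
        var shared = new ConcurrentDictionary<int, int>();
        Parallel.For(0, 400_000, i => shared[i % 100_000] = i);

        Console.WriteLine($"Shared dictionary holds {shared.Count} entries.");
    }
}
```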
I eventually tweaked the test app into an especially ghoulish version I call the LockNestMonster program that uses explicit global locks to shine an even brighter light on these issues.
Figure 9. CLR Worker Thread 6420 blocked because it is waiting on a GC lock that happens to be held by Thread 6920, which is subject to preemption by higher priority system routines.

Time-slicing.

The Concurrency Visualizer also utilizes context switch events to calculate the delays a thread encounters during execution due to preemption, as a result of the expiration of a thread’s time-slice. In Figure 10, I clicked on the large yellow block on the right hand side of the execution time bar graph for Thread 6920 indicating another long delay. As in Figure 9, I have hidden all but the three active CLR thread pool threads. Using a combination of zooming to a point of interest in the event stream and filtering out extraneous threads, as illustrated in Figure 10, the Concurrency Visualizer is able to construct an execution time profile using just those events that are visible in the current time-window.
Overall, the three active CLR worker threads are only able to execute 18% of the time, while they are delayed by synchronization 9% of the time and subject to preemption 39% of the time. (You can click on the Profile Report tab in the middle right portion of the display and see a profile report by thread.)
Figure 10. Using the Concurrency Visualizer to drill into thread preemption delays.
At the point indicated by the selection, the time-slice quantum for Thread 6920 expired and the Scheduler preempted the executing thread in favor of some other ready thread. Looking at the visualization, it should be apparent that the ready thread the Scheduler chose to execute next was another CLR thread pool worker thread, namely Thread 4664, which then blocked Thread 6920 from continuing. The tool reports that a context switch(6920, 4664) occurred, and that Thread 6920 was delayed for about 275 milliseconds before it resumed execution after being preempted.
As illustrated in this example, the Concurrency Visualizer uses the ETW-based event data from a profiling run to construct a state machine that reflects the precise execution state of each application thread over the time interval being monitored. It goes considerably beyond calculating processor queue time at the thread level. It understands how to weave the sequence of ReadyThread and Context Switch events together to create this execution time profile. It summarizes the profiling data, calculating the precise amount of time each thread is delayed by synchronous IO, page faults (i.e., involuntary waits due to memory management overhead[1]), processor contention, preemption by higher priority work, and lock contention over the profiling interval. Furthermore, it analyzes the call stacks gathered at each Context Switch event, looking for signatures that identify the specific blocking reason. And, specifically to help with lock contention issues, which are otherwise often very difficult to identify, it also identifies the thread that ultimately unblocks the thread that was found waiting to acquire a lock.


[1] In the Concurrency Visualizer, memory management waits that are resolved very quickly, usually in well under a millisecond, correspond to soft page faults. When hard page faults occur, the tool will show a corresponding disk IO, and the delay is ordinarily several milliseconds in duration, depending on the speed of the paging disk.
Posted in windows-performance; context switches; application-responsiveness; application-scalability; software-performance-engineering

Monday, 7 November 2011

Is there an ARM server in your future?

Posted on 09:07 by Unknown
I cannot resist adding to some of the industry buzz about the recent HP Project Moonshot announcement and what this potentially means for folks that run Windows. As part of Project Moonshot, HP is planning to release something called the Redstone Server Development Platform in 1H12, an ARM-based, massively parallel supercomputer that uses 90% less energy than comparable Intel microprocessors.
Coupled with the big.Little hardware announcement from the ARM Consortium about two weeks ago, I think this is big news that could shake up the foundations of Windows computing. ARM-based Windows Server machines could well be in all our futures.
Let’s start with ARM and the big.Little architecture announcement, which is pretty interesting in itself. Details are available in this white paper. big.Little is an explicit multi-core architecture where a simple, low power version of the processor is packaged together with a significantly more powerful (~ 2x) version of the processor that also consumes about 3 times the energy. The Little CPU core in the architecture is “an in-order, non-symmetric dual-issue processor with a pipeline length of between 8-stages and 10-stages.” The big guy is “an out-of-order sustained triple-issue processor with a pipeline length of between 15-stages and 24-stages,” significantly more powerful and more complex.
The idea of pairing them is that a task that needs more CPU power can migrate from the Little core to the big core in about 20K cycles, roughly 20 microseconds of delay at 1 GHz. When you want to play back HD graphics on your phone, the architecture can power up the big core and migrate the CPU-bound graphics-rendering thread so fast that the user experience is like a turbocharger kicking in. Whoosh. Meanwhile, the device – a phone or a tablet, probably – can stay powered up and ready for action for an extended period of time using just the low power CPU. Which, at 1 GHz, is still a potent microprocessor.
Apple, Android, Blackberry and Windows smartphones today all run on ARM. ARM is a reduced instruction set computer (RISC) that has significantly lower energy costs, which is why it is so popular on smartphones. The Apple iPad tablet uses ARM, and so do most iPad competitors today from folks like Samsung and HTC. Some of these tablets are dual core ARM machines today. The tablet computer running an early build of Windows 8 that Microsoft gave away at the recent PDC also uses an ARM chip. When Windows 8 does ship next year sometime, we can expect to see a flurry of announcements for Windows 8 tablets based on ARM chips. Portable devices rely on ARM chips because they consume relatively little power, and power is the most significant performance constraint in portable computing of all types.
To be fair, we will probably see some Intel Atom-based tablets, too, running Win8 being released at the same time. (The Atom is a simplified, in-order processor that supports the full Intel x64 instruction set.) But I think ARM will continue to dominate the market for smartphones and tablets in the coming years. And an ARM-compatible version of Windows is quite likely to add to that momentum – most industry observers are guessing that plenty of Windows 8 tablets are going to be sold.
Another piece of the puzzle: the 64-bit ARM extensions were just formally announced this week.
So, ARM is happening in a big way, and it is time to take notice. Plus, Windows is jumping into ARM-based computing in its next release.
BTW, I am not counting Intel out. Intel has the semiconductor industry’s foremost fabrication facilities; it is still the industry leader in manufacturing. It is the low cost, high volume manufacturer. One of the keys to its manufacturing prowess has been high volume, which is necessary to recoup the enormous capital costs associated with constructing a new, leading edge fab plant.
However, the ARM architecture is the first viable, high volume challenger to Intel’s x86/x64 products to emerge in years. In the most recent quarter, about 400 million smartphones were shipped, easily 10 times the number of Intel-compatible PCs that were sold in the same time frame. That kind of high volume success breeds more success in semiconductor manufacturing because it drives down the manufacturing cost per unit, a key component of Moore’s law.
The HP Moonshot Project announcement suggests none of this is lost on the data center where energy consumption is a major cost factor. The goal is to fill that 1-x rack space with an array of ARM computers installed on a compact motherboard with just 10% of the energy consumption of the comparable Intel computing footprint. The projected cost savings over three years are something like 60%. So, at the higher end of HPC, this is starting to happen. The Windows 8 port to ARM then leaves just one small step to get Windows Server running on these same ARM-based servers.
It will be very interesting to see how this all plays out.
Posted in hardware performance; ARM

Wednesday, 2 November 2011

Using xperf to analyze CSwitch events

Posted on 12:40 by Unknown
Continuing the discussion from the previous blog entry on event-driven approaches to measuring CPU utilization in Windows...

Last time around I discussed the CPU busy calculations that the Resource Monitor in Windows Vista and Windows 7 makes. The same calculation can also be performed after the fact using the event data from ETW. This is the technique used in the Windows Performance Toolkit (WPT, better known around Microsoft as xperf), for example, to calculate CPU usage metrics.
Once you have downloaded and installed the Windows Performance Toolkit, you can launch a basic ETW collection session using the following xperf command:
xperf -on DiagEasy
Then, after you have accumulated enough data, issue another command to stop tracing and capture the event stream to a file:
xperf -d cputrace.etl
Next, process the cputrace.etl file using the xperfview app. After the trace file is loaded, xperfview provides visualizations that are very similar to ResMon. See Figure 6 for an example.
Figure 6. CPU utilization graphs in xperfview, based on ETW context switch event data gathered with the xperf utility. Additional ISR and DPC events are used to calculate the amount of time device drivers spend processing interrupts.

With xperfview, you can also request a Summary Table which displays CPU usage by process (and thread) by right-clicking on the CPU Usage graph and accessing the pop-up context menu. An example of the CPU Scheduling Aggregate Summary Table is illustrated in Figure 7. It is similar to the one ResMon produces. The data here was gathered while running a multi-threaded CPU soaker program called ThreadContentionGenerator while ResMon was also active. You can see that the calculation in Figure 7 roughly mirrors the results shown in Figure 5 for ResMon, allowing for some variation that is to be expected since the intervals themselves are not identical. The xperf interval shown in Figure 6 is approximately 500 seconds long, while ResMon maintains a rolling window that is only 60 seconds in duration. The ResMon screen shot was taken somewhere in the middle of the longer xperf tracing session.
Figure 7. The CPU Scheduling Aggregate Summary Table calculated by xperfview. The results of these calculations closely resemble the rolling one-minute calculation reported by the Resource Monitor in Figure 5.

For some perspective on the volume of trace events that can be generated, the binary .etl trace file produced in this example was approximately 325 MB on disk for a DiagEasy trace session that ran for more than ten minutes. Running with the xperf defaults, I received a notification when the trace session closed that three ETW 64K buffers of data were dropped during the trace because xperf was unable to keep pace with processing the event stream in real-time.
The Context Switch event also provides the old thread’s Wait Reason code, which helps you to understand why the sequence of thread scheduling events occurred. For reference, a Windows context switch is defined here, while the contents of the ETW (Event Tracing for Windows) context switch event record are defined here, including a list of the current thread WaitReason codes.
Note that you can measure CPU queue time accurately from the ETW events, an important indicator of processor contention when the processor becomes saturated. As illustrated in Figure 4, the transition from the Wait state to the Ready state is marked by a ReadyThread event record. The time between the ReadyThread event and a subsequent CSwitch event marking its transition to Running is one form of CPU Queue time. A second form of CPU queue time is the time between a CSwitch(…,out,WaitReason,…) where the WaitReason is either a Preempt or time-slice quantum expiration and a subsequent re-dispatch. Both forms of CPU queue time can be measured accurately using ETW.
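As a sketch of those two calculations, assume the relevant kernel events have already been decoded into simple records with timestamps; the type and property names below are invented for illustration and are not the actual ETW schema names.

```csharp
// A sketch of the two CPU queue time calculations described above, assuming the
// ETW events have already been decoded into simple records. The record and
// property names are invented for illustration, not actual ETW schema names.
using System;

record ReadyThreadEvent(DateTime Timestamp, int ThreadId);
record CSwitchEvent(DateTime Timestamp, int NewThreadId, int OldThreadId, string OldThreadWaitReason);

static class CpuQueueTime
{
    // Form 1: Wait -> Ready -> Running. Queue time is the gap between the
    // ReadyThread event and the CSwitch that dispatches the same thread.
    public static TimeSpan ReadyToRunning(ReadyThreadEvent ready, CSwitchEvent dispatch)
    {
        if (ready.ThreadId != dispatch.NewThreadId)
            throw new ArgumentException("Events refer to different threads.");
        return dispatch.Timestamp - ready.Timestamp;
    }

    // Form 2: Running -> Ready (preempted or quantum expired) -> Running.
    // Queue time is the gap between the CSwitch that displaced the thread and
    // the later CSwitch that re-dispatches it.
    public static TimeSpan PreemptedToRedispatch(CSwitchEvent switchedOut, CSwitchEvent redispatch)
    {
        if (switchedOut.OldThreadId != redispatch.NewThreadId)
            throw new ArgumentException("Events refer to different threads.");
        return redispatch.Timestamp - switchedOut.Timestamp;
    }
}

class Demo
{
    static void Main()
    {
        var t0 = DateTime.UtcNow;
        var ready = new ReadyThreadEvent(t0, ThreadId: 6920);
        var dispatch = new CSwitchEvent(t0.AddMilliseconds(0.77), NewThreadId: 6920,
                                        OldThreadId: 4664, OldThreadWaitReason: "Executive");
        Console.WriteLine($"Queue time: {CpuQueueTime.ReadyToRunning(ready, dispatch).TotalMilliseconds} ms");
    }
}
```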
When precision in understanding patterns of CPU consumption is required, post-processing the ETW context switching event stream is a much better way to proceed than attempting to use the Windows % Processor Time performance counters. Measuring CPU consumption from the context switch events is considerably more precise than, for example, the Windows performance counter data available in Perfmon that report processor utilization at the system and process level based on processor state sampling. Such high precision measurements are not always required, of course, and processing the ETW context switching event stream is relatively expensive due to the extremely high volume of trace data that you must deal with.

Figure 6 illustrates several of the CPU utilization graphs that xperfview creates from the context switch event stream. To help make the graph more readable, I filtered out Idle time calculation for all but two of the logical processors on this machine. (The machine illustrated has 8 logical CPUs.) To gain insight into what high priority Interrupt Service Routines (ISRs) and DPCs are running, ISR and DPC events should also be gathered, which the DiagEasy event profile in xperf does automatically. (Windows device driver developers are very interested in them. For an example, see this blog posting discussing using the xperf ETW utility to capture CPU consumption by the TCP/IP network driver stack.)
Posted in windows-performance; context switches; application-responsiveness; application-scalability; software-performance-engineering

Wednesday, 19 October 2011

Measuring Processor Utilization in Windows and Windows applications: Part 2

Posted on 12:41 by Unknown

An event-driven approach to measuring processor execution state.


The limitations of the legacy approach to measuring CPU busy in Windows and the need for more precise measurements of CPU utilization are recognized in many quarters across the Windows development organization at Microsoft. The legacy sampling approach is doubtless very efficient, and this measurement facility was deeply embedded in the OS kernel’s Scheduler facility, a chunk of code that is very risky to tamper with. But, for example, more efficient power management, something that is crucial for battery-powered Windows devices, strongly argues for an event-driven alternative. You do not want the OS to wake up from a low power state regularly on an idle machine just to perform its CPU usage accounting duties, for example.

A straightforward alternative to periodically sampling the processor execution state is to measure the time spent in each processor state directly. This is accomplished by instrumenting the state transitions themselves. Processor state transitions in Windows are known as context switches. A context switch occurs in Windows whenever the processor switches the processor execution context to run a different thread. Processor state transitions also occur as a result of high priority Interrupt Service Routines (ISRs) gaining control following a device interrupt, as well as the Deferred Procedure Calls (DPCs) that ISRs schedule to complete the interrupt processing. By recording the time that each context switch occurs, it is possible to construct a complete and accurate picture of CPU consumption.
(See a two-part article in MSDN Magazine, entitled “Core OS Events in Windows 7,” written by Insung Park and Alex Bendetovers and published beginning in September 2009.  The authors are, respectively, the architect and lead developer of the ETW infrastructure. The article provides a conceptual overview describing how to use the various OS kernel events to reconstruct a state machine for processor execution, along with other diagnostic scenarios. Park and Bendetovers report, “In state machine construction, combining Context Switch, DPC and ISR events enables a very accurate accounting of CPU utilization.”)


It helps to have a good, general understanding of thread scheduling in the OS in order to interpret this stream of events. Figure 3 is a diagram depicting the state machine associated with thread execution. At any point in time, a thread can be in only one of the three states indicated: Waiting, Ready, or Running. The state transition diagram shows the changes in execution state that can occur. A Waiting thread is usually waiting for some event to occur, perhaps a Wait timer to expire, an IO operation to complete, a mouse or keyboard click that signals user interaction with the application, or a synchronization event from another thread that indicates it is OK to continue processing.
A thread that is Ready to run is placed in the Dispatcher’s Ready Queue, which is ordered by priority. When a processor becomes available, the OS Scheduler selects the highest priority thread on the Ready Queue and schedules it for execution on that processor. Once it is running, a thread remains in the Running state until it completes its execution cycle and transitions back to the Wait state. An executing thread can also be interrupted because a higher priority execution unit needs to run (this is known as preemptive scheduling) or it is interrupted by the OS Scheduler because its time-slice has expired. A Running thread can also be delayed because of a page fault, accessing data or an instruction in virtual memory that is not currently resident in physical memory. These thread execution time delays are often referred to as involuntary waits.
Figure 3. A state machine for thread execution.
Figure 4 associates these thread execution state transitions with the ETW events that record when these transitions occur. The most important of these is the CSwitch event record that is written on every processor context switch. The CSwitch event record indicates the thread ID of the thread that is entering the Running state (the new thread id), the thread ID that was displaced (the old thread ID) and provides the Wait Reason code associated with an old thread ID that is transitioning from Running back to the Wait state. The processor number indicating which logical CPU has undergone this state change is provided in an ETW_Buffer_Context structure associated with the ETW standard record header. In addition, it is necessary to know that Thread 0 from Process 0 indicates the Idle thread, which is dispatched on a processor whenever there are no Ready threads waiting for execution. While a thread other than the Idle thread is “active,” the CPU is considered busy.
Conceptually, a context switch event is something like a processor state switch(oldThreadId, newThreadId), with a time stamp identifying when the context switch occurred. The CPU time of a thread is precisely the amount of time it spends in the Running state. It can be measured using the CSwitch events that show the thread transitioning from Ready to the Running state and the CSwitch events that show that thread transitioning back out of the Running state. To calculate processor busy, you sum the amount of time each processor spends running the Idle thread over the measurement interval and subtract that from 100%.
Figure 4. The state transition diagram for thread execution, indicating the ETW trace events that mark thread state transitions.
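Here is a minimal sketch of that bookkeeping, assuming the CSwitch events have already been decoded into simple records. The type and property names are invented for illustration; they are not the actual ETW schema names.

```csharp
// A sketch of the CPU busy calculation described above: accumulate the time each
// processor spends running the Idle thread and subtract that from 100% of the
// measurement interval. Type and property names are invented for illustration.
using System;
using System.Collections.Generic;
using System.Linq;

record CSwitch(DateTime Timestamp, int Processor, int NewThreadId, int OldThreadId);

static class ProcessorBusy
{
    const int IdleThreadId = 0;   // Thread 0 of Process 0 is the Idle thread.

    // Returns % busy per logical processor over the interval spanned by the events.
    public static Dictionary<int, double> Calculate(IEnumerable<CSwitch> events)
    {
        var busy = new Dictionary<int, double>();

        foreach (var perCpu in events.GroupBy(e => e.Processor))
        {
            var ordered = perCpu.OrderBy(e => e.Timestamp).ToList();
            TimeSpan idle = TimeSpan.Zero;

            // Time between a switch to the Idle thread and the next switch is idle time.
            for (int i = 1; i < ordered.Count; i++)
            {
                if (ordered[i - 1].NewThreadId == IdleThreadId)
                    idle += ordered[i].Timestamp - ordered[i - 1].Timestamp;
            }

            var interval = ordered[^1].Timestamp - ordered[0].Timestamp;
            busy[perCpu.Key] = interval > TimeSpan.Zero
                ? 100.0 * (1.0 - idle.TotalMilliseconds / interval.TotalMilliseconds)
                : 0.0;
        }
        return busy;
    }
}
```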
One complication in this approach is that the ETW infrastructure does not guarantee delivery of every event to a Listener application. If the Listener application cannot keep up with the stream of events, then ETW will drop memory-resident buffers filled with events rather than queue them for delivery later. CSwitch events can occur at very high rates (20,000 to 40,000 per second per CPU is not unusual on busy machines), so there is definitely potential to miss enough of the context switch events to bias the resulting calculations. In practice, handling the events efficiently in the Listener application and making appropriate adjustments to the ETW buffering options minimize the potential for missing events.
To see this event-driven processor execution state measurement facility at work, access the Resource Monitor application (resmon.exe) that is available beginning in Vista and Windows Server 2008. Resource Monitor can be launched directly from the command line, or from either the Performance Monitor plug-in or the Task Manager Performance tab. Figure 5 displays a screen shot that shows Resource Monitor in action on a Windows 7 machine, calculating CPU utilization over the last 60 seconds of operation and breaking out that utilization by process. The CPU utilization measurements that ResMon calculates are based on the context switch events. These measurements are very accurate, about as good as it gets from a vantage point inside the OS.
Figure 5. The Windows 7 Resource Monitor application.
The Resource Monitor measures CPU busy in real time by listening to the ETW event stream that generates an event every time a context switch occurs. It also produces similar reports from memory, disk, and network events.
To summarize these developments, this trace-driven measurement source positions the Windows OS so it could replace its legacy CPU measurement facility with something more reliable and accurate sometime in the near future. Unfortunately, converting all existing features in Windows, including Perfmon and Task Manager, to support the new measurements is a big job, not without its complications and not always as straightforward as one would hope. But we can look forward to future versions of the Windows OS where an accurate, event-driven approach to measuring processor utilization supplants the legacy sampling approach that Task Manager and Perfmon rely on today.
In the next blog entry in this series, I will show a quick example using xperf to calculate the same CPU utilization metrics from the ETW event stream. I will point xperfview at an .etl file gathered during the same measurement interval as the one illustrated in Figure 5 using ResMon.
Posted in windows-performance; context switches; application-responsiveness; application-scalability; software-performance-engineering

Sunday, 2 October 2011

Measuring Processor Utilization in Windows and Windows applications: Part 1

Posted on 23:05 by Unknown
Introduction.
This blog entry discusses the legacy technique for measuring processor utilization in Windows, which is based on sampling, and compares and contrasts it with other sampling techniques. It also introduces newer techniques for measuring processor utilization in Windows that are event-driven. The event-driven approaches are distinguished by far greater accuracy. They also entail significantly higher overhead, but measurements indicate this overhead is well within acceptable bounds on today’s high powered server machines.

As of this writing, Windows continues to report measurements of processor utilization based on the legacy sampling technique. The more accurate measurements based on an event-driven approach are gaining ground and can be expected to supplant the legacy measurements in the not too distant future.
While computer performance junkies like me positively salivate at the prospect of obtaining more reliable and more precise processor busy metrics, the event-driven measurements do leave several very important issues in measuring CPU utilization unresolved. These include validity and reliability issues that arise when Windows is running as a guest virtual machine under VMware, Xen, or Hyper-V, which impact the accuracy of most timer-based measurements. (In an aside, mitigation techniques for avoiding some of the worst measurement anomalies associated with virtualization are discussed.)
A final topic concerns characteristics of current Intel-compatible processors that undermine the rationale for using measurements of CPU busy that are based solely on thread execution. This section outlines the appeal of using internal hardware measurements of the processor’s instruction execution rate in addition to or instead of wall-clock timing of thread execution state. While I do try to make the case for using internal hardware measurements of the processor’s instruction execution rate to augment more conventional measures of CPU busy, I will also outline some of the current difficulties that advocates of this approach encounter when they attempt to put it into practice today.

Sampling processor utilization.



The methodology used to calculate processor utilization in Windows was originally designed 20 years ago for Windows NT. Since the original design goal of Windows NT was to be hardware independent, the measurement methodology was also designed so that it was not dependent on any specific set of processor hardware measurement features.
The familiar % Processor Time counters in Perfmon are measurements derived using a sampling technique. The OS Scheduler samples the execution state of the processor once per system clock tick, driven by a high priority timer-based interrupt. Currently, the clock tick interval the OS Scheduler uses is usually 15.6 ms, roughly 64 clock interrupts per second. (The precise value that the OS uses between timer interrupts is available by calling  the GetSystemTimeAdjustment() function.) If the processor is running the Idle loop when the quantum interrupt occurs, it is recorded as an Idle Time sample. If the processor is running some application thread, that is recorded as a busy sample. Busy samples are accumulated continuously at both the thread and process level.
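If you are curious what the clock tick interval is on your own machine, the short sketch below calls the GetSystemTimeAdjustment() API via P/Invoke; the function reports the timer increment in 100-nanosecond units, so a value of 156,250 corresponds to the familiar 15.6 ms Scheduler tick.

```csharp
// Queries the OS clock tick interval discussed above by calling the Win32
// GetSystemTimeAdjustment API. The increment is reported in 100-ns units.
using System;
using System.Runtime.InteropServices;

class ClockTick
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetSystemTimeAdjustment(
        out uint timeAdjustment,        // adjustment applied per tick, in 100-ns units
        out uint timeIncrement,         // interval between clock interrupts, in 100-ns units
        out bool timeAdjustmentDisabled);

    static void Main()
    {
        if (GetSystemTimeAdjustment(out _, out uint increment, out _))
            Console.WriteLine($"Clock tick interval: {increment / 10000.0:F3} ms");
        else
            Console.WriteLine($"GetSystemTimeAdjustment failed, error {Marshal.GetLastWin32Error()}");
    }
}
```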
Figure 1 illustrates the calculation of CPU time based on this sampling of the processor execution state as reported in the Performance tab of the Windows Task Manager.
Figure 1. The Performance tab of the Windows Task Manager reports processor utilization based on sampling the processor execution state once every quantum, approximately 64 times per second.
When a clock interrupt occurs, the Scheduler performs a number of other tasks, including adjusting the dispatching priority of threads that are currently executing, with the intention of stopping the progress of any thread that has exceeded its time slice. Using the same high priority OS Scheduler clock interrupt that is used for CPU accounting to implement processor time-slicing is the reason the interval between Scheduler interrupts is known as the quantum. At one time in Windows NT, the quantum was set based on the speed of the processor: the faster the processor, the shorter the quantum and the more frequently the OS Scheduler would gain control. Today, however, the quantum value is constant across all processors.
Another measurement function that is performed by the OS Scheduler’s clock interrupt is to take a sample of the length of the processor queue consisting of Ready, but waiting threads. The System\Processor Queue Length counter in Perfmon is an instantaneous counter that reflects the last measurement taken by the OS Scheduler’s clock interrupt of the current number of Ready threads waiting in the OS Scheduler queue. Thus, the System\Processor Queue Length counter represents one sampled observation, and needs to be interpreted with that in mind.
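Sampling the counter yourself makes the point; the following minimal sketch uses the .NET PerformanceCounter class to read System\Processor Queue Length a few times, and successive readings can vary quite a bit precisely because each one is a single observation.

```csharp
// Samples the System\Processor Queue Length counter a few times, illustrating
// that it is an instantaneous value rather than an interval average.
using System;
using System.Diagnostics;
using System.Threading;

class QueueLengthSample
{
    static void Main()
    {
        using var queueLength = new PerformanceCounter("System", "Processor Queue Length");

        for (int i = 0; i < 10; i++)
        {
            // Each reading reflects only the most recent observation, so
            // successive samples can jump around considerably.
            Console.WriteLine($"Ready threads waiting: {queueLength.NextValue()}");
            Thread.Sleep(1000);
        }
    }
}
```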
The Processor Queue Length metric is sometimes subject to anomalies due to the kind of phased behavior you sometimes see on an otherwise idle system. Even on a mostly idle system, a sizable number of threads can be waiting on the same clock interrupt (typically, polling the system state once per second), one of which also happens to be the Perfmon measurement thread, also cycling once per second. These sleeping threads tend to clump together so that they get woken up at the exact same time by the timer interrupt. (As I mentioned, this happens mainly when the machine is idling with little or no real work to do.) These awakened threads then flood the OS dispatching queue at exactly the same time. If one of these threads is the Perfmon measurement thread that gathers the Processor Queue Length measurement, you can see how this “clumping” behavior could distort the measurements. The Perfmon measurement thread executes at an elevated priority level of 15, so it is scheduled for execution ahead of any other User mode threads that were also awakened by the same Scheduler clock tick. The effect is that at the precise time when the processor ready queue length is measured, there are likely to be a fair number of Ready threads. Compared to the modeling assumption where processor scheduling is subject to random arrivals, one observes a disproportionate number of Ready threads waiting for service, even (or especially) when the processor itself is not very busy overall.
This anomaly is best characterized as a low-utilization effect that perturbs the measurement when the machine is loafing. It generally ceases to be an issue when processor utilization climbs or there are more available processors on the machine. But this bunching of timer-based interrupts remains a serious concern, for instance, whenever Windows is running as a guest virtual machine under VMware or Hyper-V. Another interesting side discussion is how this clumping of timer-based interrupts interacts with power management, but I do not intend to venture further into that subject here.
To summarize, the CPU utilization measurements at the system, process and thread level in Windows are based on a sampling methodology. Similarly, the processor queue length is also sampled. Like any sampling approach, the data gathered is subject to typical sampling errors, including:
  • accumulating a sufficient number of sample observations to be able to make a reliable statistical inference about the underlying population, and
  • ensuring that there aren’t systemic sources of sampling error that cause sub-classes of the underlying population to be markedly under- or over-sampled.
So, these CPU measurements face familiar issues with sampling size and the potential for systematic sampling bias, as well as the usual difficulty in ensuring that the sample data is representative of the underlying population (something known as non-sampling error). For example, the interpretation of the CPU utilization data that Perfmon gathers at the process and thread level is subject to limitations based on a small sample size for collection intervals less than, for example, 15 seconds. At one minute intervals, there are enough samples to expect accuracy within 1-2%, a reasonable trade-off of precision against overhead. Over even longer measurement intervals, say 5 or 10 minutes, the current sampling approach leads to miniscule sampling errors, except in anomalous cases where there is systematic under-sampling of the processor’s execution state.   
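As a rough back-of-the-envelope check on those numbers, you can treat each clock-tick observation as an independent busy/idle trial and compute the binomial standard error for different collection intervals. The little calculation below is an approximation of my own, since the samples are not truly independent, but it shows why one-minute intervals land in the 1-2% range while 0.1 second intervals do not.

```csharp
// Back-of-the-envelope sampling error estimate: treat each ~15.6 ms clock-tick
// observation as an independent busy/idle Bernoulli trial and compute the
// binomial standard error for several collection interval lengths.
using System;

class SamplingError
{
    static void Main()
    {
        const double samplesPerSecond = 64.0;   // roughly one sample per 15.6 ms clock tick
        const double p = 0.5;                   // worst case: 50% processor utilization

        foreach (double intervalSeconds in new[] { 0.1, 1.0, 15.0, 60.0, 300.0 })
        {
            double n = samplesPerSecond * intervalSeconds;
            double standardError = Math.Sqrt(p * (1 - p) / n);
            Console.WriteLine($"{intervalSeconds,6:F1} s interval: {n,6:F0} samples, standard error ~ {standardError:P2}");
        }
    }
}
```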
Sample size is also the reason that Microsoft does not currently permit Perfmon to gather data at intervals more frequent than once per second. Running performance data collection at intervals of 0.1 seconds, for example, the impact of relying on a very small number of processor execution state samples is quite evident. At 0.1 second intervals, it is possible to accumulate just 5 or 6 samples per interval. If you are running a micro-benchmark and want to access the Thread\% Processor Time counters from Perfmon over 0.1 second intervals, you are looking for trouble. Under these circumstances, the % Processor Time measurements cease to resemble a continuous function over time.
The limitations of the current approach to measuring CPU busy in Windows and the need for more precise measurements of CPU utilization are recognized in many quarters across the Windows development organization at Microsoft. More on how this area is evolving in the next post in this series.
Posted in windows-performance; context switches; application-responsiveness; application-scalability; software-performance-engineering

Wednesday, 11 May 2011

Deconstructing disk performance rules: final thoughts

Posted on 11:12 by Unknown
To summarize the discussion so far:
While my experience with rule-based approaches to computer performance leads me to be very skeptical of their ultimate value, I recognize they can be useful under many circumstances, especially if people understand their inherent limitations. For example, in the last couple of blog entries, I noted the usefulness of threshold rules for filtering the great quantities of esoteric performance data that can readily be gathered on Windows (and other computing platforms). The threshold rules implicitly select among the performance data to be gathered – after all, before you can perform the threshold test, you must first have acquired the data to be analyzed. Concentrating on measurement intervals where the threshold test succeeds also helps you narrow the search to periods of peak load and stress.
However, the mere mechanical iteration over some set of expert-derived performance rules, absent the judgment of an experienced performance analyst, is unlikely to be very useful as a diagnostic tool. A process that mechanically exercises a pre-defined set of threshold-based performance rules using settings arbitrarily defined by some “expert” or other without regard to the particulars of your application and environment is apt to be quite brittle. And if the rule happens to fire, signaling that a problem has possibly occurred, will you understand what to do next? Following confirmation of the diagnosis, will you even know how to begin to resolve the issue you have uncovered?
A crippling deficiency of static performance rules is that the best inference that can be made about whether any symptom of a performance problem that you can measure is “good” or “bad” is frequently, “It depends.” That is why I often try to turn the debate over what the proper setting for the threshold test should be into a discussion that explores the reasoning behind the setting. Knowing what insights the “expert” relied upon in formulating the Rule in the first place is often extremely helpful. Knowing what the rule threshold depends on turns out to be quite useful, especially as you proceed from diagnosis to cure.
*  *  *

Bottleneck analysis
In the last blog entry, I attempted to make this argument more concrete by drilling into a popular disk performance rule from the Microsoft experts behind the PAL tool. This rule primarily serves as a filter, a role that I can endorse without too much hesitation. I then looked at a similar, but more ambitious and more complex disk performance rule that was promulgated by a colleague of mine at Microsoft, Grant Holliday. Unbeknownst to me, this rule was formulated following work we did together to diagnose and solve a serious performance and scalability problem affecting a major server-based application. The rule attempts to encapsulate the analysis I performed to first diagnose the problem and then devise a remedy to fix it.
My style of working on an application performance investigation such as this one can appear bracingly unstructured, but it isn’t. Essentially, the method is one of bottleneck analysis, and is informed by queuing theory. At the outset I sift through large amounts of very detailed and esoteric measurement data very quickly. It may not look like I am being systematic & deliberate, but I am. During these initial stages, I do not always take the time to explain this analytic process. When the problem is diagnosed, of course, I am prepared to defend and justify the diagnosis. But at that point you are backtracking from the solution, crafting a narrative designed to convince and persuade, shorn of all the twists and wrong turns you took in actually arriving at your final destination.
A general method in a performance investigation is to hunt for a bottlenecked resource, defined as the one that saturates around the time response time (or throughput) also begins to back up. Analytically, I am searching for an elongation of queue time at some saturated resource (CPU, disk, network, software locking mechanism, all the usual suspects first) that also appears to be correlated with measures of application response time that are (hopefully) available. You formulate a tentative hypothesis about what bottlenecked resource could be responsible for the problem and then start testing the validity of that hypothesis. If there is a hypothesis that remains still credible after increasingly rigorous testing, that is the one to proceed with.
I can recognize in the disk performance diagnostic rule Grant created some of the line of reasoning I followed in the tuning project we worked on together. At least some of it has a familiar ring to it. The disk response time measurements in Windows include queue time, so an elongation in disk queue time would be reflected in those measurements. Since the disk response time measurements are difficult to interpret in isolation, I also like to examine the Current Disk Queue Length counters; to be sure, these measurements are samples taken at the end of each measurement interval, but they have the virtue of being a direct measurement of the length of a disk device’s current request queue. If the physical disk entity known to Windows is actually a standalone physical disk, it is also possible to derive the disk service time. Then, you can back out the service time from the disk response time measurements taken at the device driver level and calculate the queue time (as detailed here).
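For a standalone physical disk, that derivation boils down to a few lines of arithmetic. The sketch below uses illustrative counter values rather than live measurements, and it assumes the disk’s utilization can be estimated from the Physical Disk % Idle Time counter.

```csharp
// A sketch of the derivation referenced above for a standalone physical disk:
// estimate utilization from % Idle Time, derive service time from utilization
// and the I/O rate, then back service time out of the measured response time
// to estimate queue time. The counter values below are illustrative only.
using System;

class DiskQueueTime
{
    static void Main()
    {
        // Example counter values for one measurement interval.
        double pctIdleTime = 20.0;          // Physical Disk\% Idle Time
        double transfersPerSec = 800.0;     // Physical Disk\Disk Transfers/sec
        double avgSecPerTransfer = 0.012;   // Physical Disk\Avg. Disk sec/Transfer (response time)

        double utilization = (100.0 - pctIdleTime) / 100.0;        // fraction of the interval the disk was busy
        double serviceTimeSec = utilization / transfersPerSec;     // average service time per request
        double queueTimeSec = avgSecPerTransfer - serviceTimeSec;  // response time minus service time

        Console.WriteLine($"Utilization:  {utilization:P0}");
        Console.WriteLine($"Service time: {serviceTimeSec * 1000:F2} ms");
        Console.WriteLine($"Queue time:   {Math.Max(0, queueTimeSec) * 1000:F2} ms");
    }
}
```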
Queuing for the disk is not always evident at the device driver level, however, and may not show up in the Windows counter data. Disk requests can back up inside an application like Microsoft SQL Server that performs its own scheduling of disk activity, for example. That turned out to be the case here, as a matter of fact.
My recollection is that in the tuning project I worked on with Grant, disk response times were generally within acceptable levels of 15-20 ms. In other words, the rule Grant formulated in retrospect would not have been triggered in this instance. However, one of the physical disks was being hammered at rates ranging from 400-1400 requests/second. (The fact that a range of performance levels like that is expected on a physical disk that has caching enabled is discussed in the disk tuning article I referenced above.) I also observed Current Disk Queue Length measurements in the double digits during peak intervals. I needed to look at disk throughput, too, not just response times; the throughput numbers strongly suggested the disk was saturated, something that was not evident from the disk response time measurements alone.
So, it was clear to me pretty quickly that there was a disk device that was severely overloaded. I then had to make the link between what was happening at the disk and the request queue in the application that would also start to back-up when the disk became overloaded. When that causal link was established empirically, I thought we could proceed with confidence with a plan to replace the bottlenecked disk device with a much faster one. This would improve both throughput and response time. The new disk subsystem was quite expensive, and converting SQL Server to use it required a lengthy outage, but, by then, I was confident that the application would benefit enough to justify those costs. By the time we were able to make a switch, Grant had improved service level reporting for the TFS application in place, and he documented a 2x-5x improvement (depending on the request type) in application response time at the 90th percentile of all requests following the disk upgrade. The customer was satisfied with the result.

Deconstructing Grant's Disk Performance Rule
Human reasoning proceeds from the particular to the general, from the concrete to the abstract. Socrates is a man. All men are mortal. Thus, Socrates is mortal. Q.E.D. We generate logical rules in an attempt to encapsulate this reasoning process into a formal one that can be executed repeatedly and with confidence.
But humans, by necessity, incorporate bounded rationality into their decision-making under conditions of uncertainty. Facing time constraints and lacking complete and reliable data, we still need to make timely decisions and execute effectively on them. (Daniel Dennett is particularly good on this subject in Elbow Room, I think.) Once we accept that bounds on the formal reasoning process are necessary to arrive at timely decisions, it is difficult to know how far to relax those bounds. Knowing when to deviate from formal logic calls for experience and judgment, and these are precisely the aspects of expert decision-making that are impossible to encapsulate in static, declarative rules. (We also learn from our mistakes, a process that machine learning approaches can successfully emulate, so there is hope for the Artificial Intelligence agenda.)
In this particular performance investigation that led to Grant’s formulating a disk performance rule, I was careful to explain to Grant what I was thinking and doing at some length, since he was the public face of the project. I wanted to make sure that Grant could, in the short term, faithfully report our findings up the chain of command, which he did. In the long term, I wanted him to be capable of carrying on independent of me in the various meetings and briefings he conducted and the memos and reports he prepared. Ultimately, I wanted him to produce a tuning guide that we could give customers faced with similar product scalability issues. I found Grant to be both willing and capable.
However, I am less than satisfied by the rule that Grant created to encapsulate the bottleneck analysis approach I employed. I find Grant’s rule a little too narrow to be genuinely useful, and certainly it would not have been adequate in the case we worked on together. Without digging considerably deeper, I would not be ready to jump to any conclusions, should this simple rule fire.
While I am not ready to throw out Grant’s proposed rule completely, I would argue that, at a minimum, it needs to be augmented by consideration of the IO rate, the block size of typical requests, the performance characteristics of the underlying physical disk, and the applications accessing the disk. There is a section in the Performance Guide (also available here in somewhat expanded form) that describes steps I recommend in a more thorough (and more complicated) investigation of a potential disk performance problem. I call it a “simplified” approach; at any rate, it is about as simple as I think I can make it, but it is considerably more complicated than any of the disk performance rules that PAL implements.

Final Thoughts on the Disk Performance Counters
In conclusion, as I discussed previously, the average disk response time measurements in Windows are difficult to interpret in isolation. The context is crucial to interpreting them correctly. For example:
·         These counters measure the average response time of disk requests. Response time includes time spent in either the disk driver request queue or the physical disk’s request queue, which complicates matters. Are these queuing mechanisms enabled? (That is usually a pretty good first question to ask if you suspect disk performance is a problem.) How much of the problem is due to poor device performance, and how much is because the disk is simply overloaded? If you are able to answer that question, you already have a leg up on how to solve it.
·         Why choose 15 ms? Why not 20 or 25 ms? This isn’t simply a question that is reducible to experts jockeying over what the best value for the threshold test is. The overall rate of IO requests and the block size of those requests impact the performance expectations that are reasonable for the underlying hardware. And, if there is potential for contention at the disk, that will factor into the response time measurements that include queue time.
·         Why choose Logical Disk, and not the associated measurements of the Physical Disk? Depending on how Logical Disks are mapped to Physical Disks, there could be built-in contention for the underlying physical disk.
·         Doesn’t it also depend on the performance characteristics of the underlying physical disk? What kind of physical disk entity is it? A standard SATA drive running at 7200 RPMs, a high end drive that spins at twice that speed, an SSD that doesn’t spin at all, a low power drive on a lightweight portable, or a logical volume carved out of an expensive SAN controller configuration? Each of these configuration options has different price/performance characteristics, and, consequently, a different set of expectations about what are acceptable and unacceptable levels of performance.
  • Won’t it also depend on how the application is accessing the disk? Is it accessing the file system sequentially, sucking up large blocks in a series of throughput-oriented requests? Or is it requesting small blocks randomly distributed across the disk?
  • What about applications like backup, search, virus scans, and database table scans that intentionally monopolize the disk? These applications queue requests internally, so the driver is unable to measure queue time directly. What do the measurements look like when disk usage is scheduled at the application level like this? And if one of those applications is running, how does that impact the performance of other applications trying to access the same disk concurrently?
  • Is caching enabled at the disk, in the disk subsystem, in the application, or in several of these places at once? How effective is the caching? Doesn’t disk caching change the pattern of physical disk access? What is the impact on the overall response times of requests when caching is enabled?
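To make the threshold question from the list above concrete, here is a back-of-the-envelope calculation using a deliberately simple M/M/1 queueing assumption and hypothetical device service times. It is an illustration of why a single number like 15 ms cannot serve every configuration, not a model of any real device.
```python
# A back-of-the-envelope illustration, not part of any PAL rule: under a simple
# M/M/1 queueing assumption, response time = service time / (1 - utilization),
# so whether 15 ms is "bad" depends on both the device and the request rate.

def mm1_response_time_ms(service_time_ms: float, io_per_sec: float) -> float:
    utilization = io_per_sec * service_time_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")   # saturated: the queue grows without bound
    return service_time_ms / (1.0 - utilization)

# Hypothetical service times: ~8 ms for small random IO on a 7200 RPM SATA
# drive, ~0.2 ms for an SSD. The same 15 ms threshold means very different
# things for the two devices as the IO rate climbs.
for label, service_ms in [("7200 RPM SATA", 8.0), ("SSD", 0.2)]:
    for io_rate in (25, 75, 110):
        print(f"{label:14s} {io_rate:4d} IO/sec -> "
              f"{mm1_response_time_ms(service_ms, io_rate):6.1f} ms")
```
Even this toy model shows the SATA drive blowing past 15 ms at a request rate the SSD would not even notice.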
OK, I admit I am deliberately messing with your head a little here to show how quickly Clint’s original simple and elegant Rule of Thumb about expected levels of disk performance, and Grant’s more nuanced version of it, start to sprout messy complications.
Crucially, Grant’s rule leaves out any consideration of the IO rate. When the disk activity rate is low, even quite poor disk response times will not have a significant impact on the performance of the application. On the other hand, if the disk IO rate is high enough to indicate saturation of the resource, the disk can be a significant bottleneck without any visible degradation in the disk response time measurements, which is essentially what I observed in this case.
In addition, both the block size and the proportion of reads vs. writes are often very important, especially when it comes to figuring out what to do to remedy the situation. I am certain I discussed some of this with Grant at the time (and referred him to the Performance Guide for the rest), but none of those considerations found its way into Grant’s rule. My version of the rule would definitely include a clause that filters out situations where the overall disk activity rate is low. And as the disk activity rate increases, I grow increasingly concerned about disk performance becoming the bottleneck that constrains the overall responsiveness or throughput of the application.
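To be concrete about what such a clause might look like, here is a minimal sketch; every numeric threshold in it is a placeholder chosen for illustration, not a value I am endorsing.
```python
# A minimal sketch of an augmented rule. All thresholds below are placeholders;
# the inputs are the standard Logical Disk counter values for one interval.

def augmented_disk_rule(avg_sec_per_transfer: float,
                        transfers_per_sec: float,
                        avg_bytes_per_transfer: float) -> str:
    response_ms = avg_sec_per_transfer * 1000.0
    # Clause 1: filter out intervals where the disk is barely being used;
    # poor response time at a trivial IO rate rarely hurts the application.
    if transfers_per_sec < 20:
        return "ignore: disk activity too low to matter"
    # Clause 2: scale the latency expectation with the block size, since
    # large transfers legitimately take longer to service than small ones.
    threshold_ms = 15.0 + 5.0 * (avg_bytes_per_transfer / 65536.0)
    if response_ms > threshold_ms:
        return f"investigate: {response_ms:.1f} ms exceeds {threshold_ms:.1f} ms"
    # Clause 3: even with acceptable response times, a sustained high activity
    # rate means the disk may already be the bottleneck.
    if transfers_per_sec > 300:
        return "watch: the activity rate suggests the disk is near saturation"
    return "ok"

# Example: a heavily used disk with unremarkable response times still gets flagged.
print(augmented_disk_rule(0.006, 450.0, 8192.0))
```
Run against the kind of interval I saw in this investigation, a busy disk with unremarkable response times, it is the activity rate clause, not the latency threshold, that fires.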
Finally, knowing what to do to resolve a disk performance problem you have just diagnosed is crucial, but it is something Grant’s rule fails to touch upon. Once you understand the problem, you have to start thinking about the fix. In this case, I felt that swapping out the overloaded device and replacing it with a device from a very expensive, cached, enterprise-class disk subsystem would relieve the bottleneck. (We anticipated that fixing the application code would take considerably longer.) But before making that suggestion, which was both disruptive and expensive, I needed to understand how the application itself was architected. It was multi-tier, with a large SQL Server-based back end and a large variety of web services used to communicate with the database.
The next step was to provide the technical input to justify to senior management a very costly disk upgrade that had not been anticipated or budgeted for. Not only that, I had to justify a significant operational outage to allow the data the application needed to be migrated to the new disk subsystem. Finally, I had to verify after the fact that the change yielded a performance boost significant enough to warrant the time and expense of everyone involved.
