Performance By Design: A blog devoted to Windows performance, application responsiveness and scalability


Tuesday, 26 November 2013

Correcting the Process level measurements of CPU time for Windows guest machines running under VMware ESX

Posted on 16:15 by Unknown
Recently, I have been writing about how Windows guest machine performance counters are affected by running in a virtual environment, including publishing two recent, longish papers on the subject, one about processor utilization metrics and another about memory management. In the processor utilization paper (available here), it is evident that, running under VMware, the Windows performance counters that measure processor utilization are significantly distorted. At a system level, this distortion is not problematic so long as one has recourse to the VMware measurements of actual physical CPU usage by each guest machine.

A key question – one that I failed to address properly, heretofore – is whether it is possible to correct for that distortion in the measurements of processor utilization taken at the process level inside the guest machine OS. The short answer for Windows, at least, is, “Yes.” The % Processor Time performance counters in Windows that are available at the process level are derived using a simple technique that samples the execution state of the machine approximately 64 times per second. VMware does introduce variability in the duration of the interval between samples and sometimes causes the sampling rate itself to fluctuate. However, neither the variability in the time between samples nor possible changes to the sampling rate undermine the validity of the sampling process.

In effect, the sampling data should still accurately reflect the relative proportion of the CPU time used by the individual processes running on the guest machine. If you have access to the actual guest machine CPU usage – from the counters installed by the vmtools, for example – you should be able to adjust the internal guest machine process level data accordingly. You need to accumulate a sufficient number of guest machine samples to ensure that the sample data is a good estimator of the underlying population. And, you also need to make sure that the VMware measurement intervals are reasonably well-synchronized to the guest machine statistics. (The VMware ESX performance metrics are refreshed every 20 seconds.) In practice, that means if you have at least five minutes worth of Windows counter data and VMware guest CPU measurements across a roughly equivalent five minute interval, you should be able to correct for the distortion due to virtualization and reliably estimate processor utilization at the process level within Windows guest machines.

Calculate a correction factor, w, from the ratio of the actual guest machine physical CPU usage as reported by the VMware hypervisor to the average processor utilization reported by the Windows guest:


w = Aggregate Guest CPU Usage / Average % Processor Time

Then multiply the process level processor utilization measurements by this correction factor to estimate the actual CPU usage of individual guest machine processes.

The impetus for my revisiting this subject was the following question that Steve Marksamer posted on a LinkedIn forum devoted to computer capacity and performance:

[What is] the efficacy of using cpu and memory data extracted from Windows O/S objects when this data is for a virtual machine in VMware. Has there been any timing correction allowing for guest to host delay? Your thoughts/comments are appreciated. Any other way to get at process level data more accurately?

This is my attempt to formulate a succinct answer to Steve’s question.

In the spirit of full disclosure, I should also note that Steve and I once collaborated on an article about virtualization technology that was published back in 2007. I don’t think I have seen or talked to Steve since we discussed the final draft of that joint article.

A recent paper of mine focused on the processor utilization measurements that are available in Windows (the full paper is available here), and it contains a section that describes how they are impacted by VMware’s virtualization of timer interrupts and of the Intel RDTSC clock instruction. Meanwhile, on this blog I posted a number of entries that covered similar ground (beginning here), but in smaller chunks. Up to now, however, I have neglected to post any material here related to the impact of virtualization on the Windows processor utilization measurements. Let me start to correct that error of omission.

Considering how VMware affects the CPU measurements reported at the system and process level inside a guest Windows machine, a particularly salient point is that VMware virtualizes the clock timer interrupts that are the basis for the Windows clock. Virtualizing clock timer interrupts has an impact on the legacy method used in Windows to measure CPU utilization at the system and process level, which samples the execution state of the system roughly 64 times per second. When the OS handles this periodic clock interrupt, it examines the execution state of the system prior to the interrupt. If a processor was running the Idle thread, then the CPU accounting function attributes all the time associated with the current interval as Idle Time, which is accumulated at the processor level. If the processor was executing an actual thread, all the time associated with the current interval is attributed to that thread and its process, and accumulated in counters associated with active threads and processes. Utilization at the processor level is calculated by subtracting the amount of Idle time accumulated in a measurement interval from the total amount of time in that interval.

So, the familiar Windows performance counters that report % processor time are derived using a sampling technique. Note that a simple statistical argument based on the sample size strongly suggests that these legacy CPU time measurements (i.e., any of the % Processor Time counters in Windows) are not reliable at intervals of less than 15 seconds, which was the official guidance in the version of the Windows Performance Guide that I wrote for Microsoft Press back in 2005. But so long as the measurement interval is long enough to accumulate a sufficient number of samples – one-minute measurement intervals are based on slightly more than 3600 execution state samples – the Windows performance counters provide reasonable estimates of processor utilization at both the hardware and process levels. You can easily satisfy yourself that this is true by using xperf to calculate CPU utilization precisely from context switch events and comparing those calculations to Windows performance counters gathered over the same measurement interval.
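To make the accounting arithmetic concrete, here is a minimal sketch in Python of the sampling scheme described above. The thread names, the one-minute interval, and the randomly generated observations are all invented for illustration; this is not how the OS computes the counters, just the shape of the bookkeeping, in which each tick charges the full ~15.6 ms interval to whatever happened to be running.

    import random

    SAMPLES_PER_SEC = 64                # legacy Windows clock tick rate
    TICK = 1.0 / SAMPLES_PER_SEC        # ~15.6 ms charged per sample

    def sample_based_utilization(observations, interval_secs):
        # Charge the entire tick to whatever was running when the clock interrupt fired
        busy, idle = {}, 0.0
        for running in observations:
            if running == "Idle":
                idle += TICK
            else:
                busy[running] = busy.get(running, 0.0) + TICK
        overall = (interval_secs - idle) / interval_secs
        return overall, busy

    # A hypothetical one-minute interval of execution state samples
    obs = [random.choice(["Idle", "ProcessA", "ProcessB"]) for _ in range(60 * SAMPLES_PER_SEC)]
    overall, per_process = sample_based_utilization(obs, 60)
    print(f"% Processor Time: {overall:.1%}", per_process)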

In virtualizing clock interrupts, VMware introduces variability into the duration of the interval between processor utilization samples. If a clock interrupt occurs that is destined for a Windows guest machine currently parked in the ESX dispatcher queue, the VMware dispatcher delays delivery of that interrupt until the next time that guest machine is dispatched. A VMware white paper entitled “Timekeeping in VMware Virtual Machines” has an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays. Note that it is possible for VMware guest machine scheduling delays to grow large enough to cause some periodic timer interrupts to be dropped entirely. In VMware terminology, these are known as lost ticks, another tell-tale sign of contention for physical processors. In extreme cases where the backlog of timer interrupts exceeds 60 seconds, VMware attempts a radical re-synchronization of the time of day clock in the guest machine, zeros out its backlog of timer interrupts, and starts over.

Unfortunately, there is currently no way to measure directly the magnitude of these guest machine dispatcher delays, which is the other part of Steve’s question. The VMware measurement that comes closest to characterizing the nature and extent of the delays associated with timely delivery of timer interrupts is CPU Ready milliseconds, which reports the amount of time a guest machine was waiting for execution, but was delayed in the ESX Dispatcher queue. (VMware has a pretty good white paper on the subject available here).

If the underlying ESX Host is not too busy, guest machines are subject to very little CPU Ready delay. Under those circumstances, the delays associated with virtualizing clock timer interrupts will introduce some jitter into the Windows guest CPU utilization measurements, but these should not be too serious. However, if the underlying ESX Host does become very heavily utilized, clock timer interrupts are subject to major delays. At the guest level, these delays impact both the rate of execution state samples and the duration between samples.

In spite of this potential disruption, it is still possible to correct the guest level % processor time measurements at a process and thread level if you have access to the direct measurements of hardware CPU utilization that VMware supplies for each guest machine. The execution state sampling performed by the periodic clock interval handler still amounts to a random sampling of processor utilization. The amount of CPU time accumulated at the process level and reported in the Windows performance counters remains proportional to the actual processor time consumed by processes running on the guest machine. This reconciliation requires processor utilization measurements gathered over a long enough interval, with a reasonable degree of synchronization between the external VMware measurements and the internal Windows guest measurements.

To take a simple example of calculating and applying the factor for correcting the processor utilization measurements, let’s assume a guest machine that is configured to run with just one vCPU. Suppose you see the following performance counter values for the Windows guest machine:



            % Processor Time (virtual)
Overall     73%
Process A   40%
Process B   20%


From the VMware ESX measurements, suppose you then see that the actual CPU utilization of the guest machine is 88%. You can then correct the internal process CPU time measurements by weighting them by the ratio of the actual CPU time consumed to the CPU time reported by the guest machine (88:73, or roughly 1.2).



            % Processor Time (virtual)   % Processor Time (actual)
Overall     73%                          88%
Process A   40%                          48%
Process B   20%                          24%


Similarly, for a guest machine with multiple vCPUs, use the average utilization per processor to derive the weighting factor, since VMware only reports an aggregate CPU utilization for each guest machine.
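Expressed as a quick calculation, here is a minimal sketch in Python of that weighting, using the counter values from the example above. The function name is my own, and dividing the VMware aggregate by the vCPU count is my reading of the multi-vCPU adjustment, not a VMware or Windows API.

    def correct_guest_cpu(vmware_guest_cpu_pct, guest_overall_pct, process_pcts, n_vcpus=1):
        # Weight guest-internal % Processor Time counters by the ratio of the guest CPU
        # usage VMware reports to the utilization the guest reported internally.
        # For multi-vCPU guests, compare per-vCPU averages, since VMware reports an
        # aggregate figure for the whole guest machine.
        w = (vmware_guest_cpu_pct / n_vcpus) / guest_overall_pct
        return {proc: round(pct * w, 1) for proc, pct in process_pcts.items()}

    # Single-vCPU example from the tables above: w = 88/73, or roughly 1.2
    print(correct_guest_cpu(88.0, 73.0, {"Process A": 40.0, "Process B": 20.0}))
    # {'Process A': 48.2, 'Process B': 24.1}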


What about the rest of the Windows performance counters?

Stay tuned. I will discuss the rest of the Windows performance counters in the next blog post.
Posted in VMware, Windows, windows-performance; application-responsiveness; application-scalability; software-performance-engineering

Friday, 25 October 2013

Page Load Time and the YSlow scalability model of web application performance

Posted on 13:29 by Unknown
This is the first of a new series of blog posts where I intend to drill into an example of a scalability model that has been particularly influential. (I discussed the role of scalability models in performance engineering in an earlier blog post.) This is the model of web application performance encapsulated in the YSlow tool, originally developed at Yahoo by Steve Souders. The YSlow approach focuses on the time it takes for a web page to load and become ready to respond to user input, a measurement known as Page Load Time. Based on measurement data, the YSlow program makes specific recommendations to help minimize Page Load Time for a given HTML web page.

The conceptual model of web application performance underpinning the YSlow tool has influenced the way developers think about web application performance. YSlow and similar tools influenced directly by YSlow that attempt to measure Page Load Time are among the ones used most frequently in web application performance and tuning. Souders is also the author of an influential book on the subject called “High Performance Web Sites,” published by O’Reilly in 2007. Souders’ book is frequently the first place web application developers turn for guidance when they face a web application that is not performing well.

Page Load Time

Conceptually, Page Load Time is the delay between the time that an HTTP GET Request for a new web page is issued and the time that the browser was able to complete the task of rendering the page requested. It includes the network delay involved in sending the request, the processing time at the web server to generate an appropriate Response message, and the network transmission delays associated with sending the Response message. Meanwhile, the web browser client requires some amount of additional processing time to compose the document requested in the GET Request from the Response message, ultimately rendering it correctly in a browser window, as depicted in Figure 1.
Figure 1. Page Load Time is the delay between the time that an HTTP GET Request for a new web page is issued and the time that the browser was able to complete the task of rendering the page requested from the Response message received from the designated web server.
Note that, as used above, document is intended as a technical term that refers to the Document Object Model, or DOM, that specifies the proper form of an HTTP Response message such that the web browser client understands how to render the web page requested correctly. Collectively, IP, TCP and HTTP are the primary Internet networking protocols. Initially, the Internet protocols were designed to handle simple GET Requests for static documents. The HTML standard defined a page composition process where the web browser assembles all the elements needed to build the document for display on the screen, often requiring additional GET Requests to retrieve elements referenced in the original Response message. For instance, the initial Response message often contains references to additional resources – image files and style sheets, for example – that the web client must then request. And, as these Response messages are received, the browser integrates these added elements into the document being composed for display. Some of these additional Response messages may contain embedded requests for additional resources. The composition process proceeds recursively until all the elements referenced are assembled.

Page Load Time measures the time from the original GET Request through all the subsequent GET Requests required to compose the full page. It also encompasses the client-side processing of both the mark-up language and the style sheets by the browser’s layout engine to format the display in its final form, as illustrated in Figure 2.
Figure 2. Multiple GET Requests are frequently required to assemble all the elements of a page that are referenced in Response messages.



This conceptual model of web application response time, as depicted in Figure 2, suggests the need to minimize both the number and duration of web server requests, each of which requires messages to traverse the network between the web client and the web server. Minimizing this back and forth across the network will minimize Page Load Time:

Page Load Time ≈ RoundTrips * Round Trip Time

The original YSlow tool does not attempt to measure Page Load Time directly. Instead, it attempts to assess the network impact of constructing a given web page by examining each element of the DOM after the page has been assembled and fully constructed. The YSlow tuning recommendations are based on a static analysis of the number and sizes of the objects that it found in the fully rendered page since the number of network round trips required can faithfully be estimated based on the size of the objects that are transmitted. For each Response object, then:

RoundTrips = (httpObjectSize / packet size) + 1 

which is then summed over all the elements of the DOM that required GET Requests (including the original Response message).
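As a minimal sketch of that static estimate, the following Python fragment sums the round-trip formula over a set of hypothetical Response objects. The 1460-byte packet payload, the object sizes, and the 20 ms round-trip time are all assumed values for illustration, not anything YSlow itself reports.

    PACKET_SIZE = 1460    # assumed TCP payload bytes per packet (a typical MSS)

    def estimated_round_trips(http_object_sizes):
        # YSlow-style estimate, summed over every object the page references,
        # including the original Response message
        return sum(size // PACKET_SIZE + 1 for size in http_object_sizes)

    def estimated_page_load_time(http_object_sizes, round_trip_time):
        return estimated_round_trips(http_object_sizes) * round_trip_time

    # A hypothetical page: base HTML, a style sheet, a script, and two images
    sizes = [28_000, 15_000, 60_000, 45_000, 120_000]
    print(estimated_round_trips(sizes), "round trips")
    print(estimated_page_load_time(sizes, 0.020), "seconds")   # assuming a 20 ms RTT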

Later, after Souders left Yahoo for Google, he was responsible for a similar tool being incorporated into the Chrome developer tools that ship with Google’s web browser. The Google tool that corresponds to YSlow is called PageSpeed. (Meanwhile, the original YSlow tool continues to be available as a Chrome or IE plug-in.)

In the terminology made famous by Thomas S. Kuhn in his book “The Structure of Scientific Revolutions,” the YSlow model of web application performance generated a paradigm shift in the emerging field of web application performance, a paradigm that continues to hold sway today. The suggestion that developers focus on page load time gave prominence to a measurement of service time as perceived by the customer. Not only is page load time a measure of service time, there is ample evidence that it is highly correlated with customer satisfaction. (See, for example, Joshua Bixby’s 2012 slide presentation.)

While the YSlow paradigm is genuinely useful, it can also be abused. For instance, the YSlow rules and recommendations for improving the page load time of your application often need to be tempered by experience and judgment. (This is hardly unusual in rule-based approaches to computer performance, something I discussed in an earlier series of blog posts.) Suggestions to minify all your Javascript files or consolidate multiple scripts into one script file are often contraindicated by other important software engineering considerations. For example, it may be important to factor Javascript code into separate files when the scripts originate in different development teams and have very different revision and release cycles. Moreover, remember that YSlow does not measure page load time directly. Instead, it reasons about page load time based on the elements of the page that it discovers in the DOM and its knowledge of how the DOM is assembled.

Subsequently, other tools, including free tools like VRTA and Fiddler and commercial application performance monitoring tools like Opnet and DynaTrace, take a more direct approach to measuring Page Load Time. Many of these tools analyze the network traffic generated by HTTP requests. These network capture tools estimate Page Load Time as the time from the first network packet the client transmits for the initial HTTP GET Request to the last packet sent by the web server in the final Response message associated with that GET. Network-oriented tools like Fiddler are easy for web developers to use, and Fiddler, in particular, has many additional facilities, including ones that help in debugging web applications.

Over time, the Internet protocols began developing the capabilities associated with serving up content generated on the fly by applications. This entailed supporting the generation of dynamic HTML, where Response messages are constructed on demand by web applications, customized based on factors such as the identity of the person who issued the Request, where in the world the requestor was located when the Request was initiated, etc. With dynamic HTML requests, the still relatively simple process illustrated in Figure 2 potentially can grow considerably more complex. One of these developments included the use of Javascript code running inside the web browser to manipulate the DOM directly on the client, without the need to ever contact the web server, once the script itself was downloaded. Note that web application performance tools that rely solely on the analysis of the network traffic associated with HTTP requests cannot measure the amount of time spent inside the web browser executing Javascript code that constructs a web page dynamically.

However, developer-oriented timeline tools running inside the web browser client can gain access to the complete sequence of events associated with composing and rendering the DOM, including Javascript execution time. The developer-oriented performance tools in Chrome, which include the PageSpeed tool, then influenced the design and development of similar web developer-oriented performance tools that Microsoft started to build into the Internet Explorer web browser. Running inside the web browser, the Chrome and IE performance tools have direct access to all the timing data associated with Page Load Time, including Javascript execution time.

A recent and very promising development in this area is a W3C spec that standardizes the web client timeline events associated with page load time, which also specifies an interface for accessing the performance timeline data from Javascript. Chrome, Internet Explorer, and webkit have all adopted this standard, paving the way for tool developers for the first time to gain access to complete and accurate page load time measurements in a consistent fashion across multiple web clients.

I will continue this discussion of YSlow and its progeny in the next blog post.

Thursday, 26 September 2013

A comment on “Memory Overcommitment in the ESX Server”

Posted on 16:29 by Unknown

The VMware Technical Journal recently published a paper entitled “Memory Overcommitment in the ESX Server.” It traverses similar ground to my recent blog entries on the subject of VMware memory management, and similarly illustrates the large potential impact paging can have on the performance of applications running under virtualization. Using synthetic benchmarks, the VMware study replicates the major findings from the VMware benchmark data that I recently reported on beginning here. 

That VMware is willing to air publicly performance results that are less than benign is a very positive sign. Unfortunately, it is all too easy for VMware customers to configure machines where memory is overcommitted and subject to severe performance problems. VMware customers require frank guidance from their vendor to help them recognize when this happens and understand what steps to take to keep these problems from arising in the future. The publication of this article in the VMTJ is a solid first step in that direction.


Memory overcommitment factor

One of the interesting results reported in the VMTJ paper is Figure 6, reproduced below. It reports application throughput (in operations per minute for some synthetic benchmark workload that simulates an online DVD store) against a memory overcommitment factor, calculated as: 

Σ guest machine allocated virtual memory / available machine memory



This overcommitment factor is essentially a V:R ratio, the amount of virtual memory allocated to guest machines divided by the amount of RAM available to the VMware ESX host.
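In code form, the calculation is trivial. The even 10 GB per-guest split below is an assumption (the paper reports only the roughly 20 GB total), but the 42, 12 and 7 GB host configurations are the ones described in the article.

    def memory_overcommitment_factor(guest_allocated_gb, machine_memory_gb):
        # V:R ratio: virtual memory allocated to all guests over available machine memory
        return sum(guest_allocated_gb) / machine_memory_gb

    guests = [10, 10]    # roughly 20 GB allocated across the two test guests (assumed split)
    for ram_gb in (42, 12, 7):
        print(ram_gb, "GB host:", round(memory_overcommitment_factor(guests, ram_gb), 2))
    # 42 GB host: 0.48   12 GB host: 1.67   7 GB host: 2.86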

Figure 6. Reproduced from the article, “Memory Overcommitment in the ESX Server” published in the VMware Technical Journal.  

The two virtual machines configured in the test allocate about 20 GB of RAM, according to the authors. The benchmark was initially run on a physical machine configured with 42 GB of RAM. With machine memory not over-committed, throughput of the test workload was measured at almost 26K operations per minute. The authors then began to tighten the screws on the amount of machine memory available for the VMware Host machine to manage. The performance of the workload degraded gradually as machine memory was reduced in stages down to 12 GB. (The memory overcommitment factor for the 12 GB machine is 20/12, or about 1.67.)

Finally, the authors attempted to run the benchmark workload in a machine configured with only 7 GB of RAM, yielding the degradation shown on the line chart. Under severe memory constraints, the benchmark app managed less than 9,000 operations/minute, a degradation of about 3X.

The same sequence of experiments was repeated on the machine, but this time with a solid state disk (SSD) configured for VMware to use for swapping (plotted using the green line). With the faster paging device configured, throughput of the test workload increased about 10% on the 12 GB machine. The magnitude of the improvements associated with using the SSD increased even more for the 7 GB machine, but the workload still suffered serious performance degradation, as shown.

Reading the SSD comparison, I experienced an extremely acute (but very pleasant) sense of déjà vu. Thirty years ago, I worked for StorageTek, which was selling one of the industry’s first SSDs, known as the 4305 because it emulated a high speed disk drum that IBM built called the 2305. Due to its limited capacity and the fact that the storage itself was volatile RAM, the 4305 was best suited to serve as a fast paging device. It was often attached to an IBM mainframe running the MVS operating system.

In the system engineering organization at STK, where I worked, I learned to assess the potential improvement in performance from plugging in the 4305 by calculating a V:R ratio using the customer’s data. I was very taken with the idea that we could use this V:R calculation for performance prediction, something I explored more rigorously a few years later in a paper called “Nonlinear Regression Models and CPE Applications,” available at the Computer Measurement Group’s web site. V:R remains a very useful metric to calculate, and I have also used it subsequently when discussing Windows virtual memory management.

By way of comparison to the results reported in the VMTJ paper, the overcommitment factor in the VMware ESX benchmarks I reported on was 2:1; the performance impact was an approximately 3X degradation in the elapsed time of the benchmark app under VMware memory ballooning and swapping.

Flaws in the analysis

I can recommend reading the VMTJ paper on its merits, but I also noticed it was flawed in several places, which I will mention here.

Transparent memory sharing

The first is an error of omission. The authors fail to report any of the shared memory measurements that are available. Instead, they accept on faith the notion that VMware’s transparent memory sharing feature is effective even when machine memory is substantially over-committed. That is not what I found. 

Transparent memory sharing is an extremely attractive mechanism for packing as many idle guest Windows machines as possible onto a single VMware ESX host. What I discovered (and reported here) in the benchmarks we ran was that the amount of memory sharing was effectively reduced to near zero as the guest workloads started to become active.

The VMTJ authors do note that when the Windows OS initializes, it writes zeros to every physical page of RAM that is available. Since VMware can map all the pages on the OS Zero list to a single resident machine memory page, transparent memory sharing is a spectacularly successful strategy initially. However, once the Windows OS’s own virtual memory management kicks in when the guest machine becomes active, the contents of the Zero list, for example, become quite dynamic. This apparently defeats the simple mechanism ESX uses to detect identical virtual pages in real-time.

VMware transparent memory sharing is a wonderful mechanism for packing as many mostly idle guest Windows machines onto ESX Servers as possible. On the other hand, it can be a recipe for disaster when you are configuring production workloads to run as VMware virtual machines.

Measuring guest machine Active Memory

A second flaw in the paper is another error of omission. The VMTJ authors decline to discuss the sampling methodology VMware uses to measure a guest machine’s current working set. This is what VMware calls guest machine active memory, the measurement that is plugged into the formula for calculating the memory over-commitment factor as the numerator. To avoid duplicating all the effort that the guest machine OS expends trying to implement some form of LRU-based page replacement, VMware calculates a guest machine’s active memory using a sampling technique. The benchmark results I reported on showed the guest machine Active Memory measurement VMware reports can be subject to troubling sampling errors. 

The VMTJ authors manage to avoid using this dicey metric to calculate the memory overcommitment factor by relying on a benchmark program that is effectively a memory “sponge.” These are programs designed specifically to stress LRU-based memory management by touching at random all the physical pages that are allocated to the guest machine.

With their sponge program running, the authors can reliably calculate memory overcommitment factor without using the Active Memory counter that is subject to measurement anomalies. With the sponge program, the amount of guest machine committed physical memory is precisely equal to the amount of active memory. For the other DVDStore workload, they were able to assess its active memory requirements by running it in a stand-alone mode without any memory contention.

Again, things aren’t nearly so simple in the real world of virtualization. In order to calculate the memory overcommitment factor for real-world VMware workloads, customers need to use the Active Memory counter, which is not a wholly reliable number due to the way it is calculated using sampling.

To manage VMware, access to a reliable measure of a guest machine’s active memory is very important. Assessing just how much virtual memory your Windows application needs isn’t always a simple matter, as I tried to show in my discussion of the VMware benchmark results. A complication arises in Windows whenever any well-behaved Windows application is running that (1) expands to fill the size of physical RAM, (2) listens for low physical memory notifications from the OS, and (3) uses these notifications to trigger virtual memory management routines internally. Examples of applications that do this include Microsoft SQL Server. Any 64-bit .NET Framework application, including 64-bit ASP.NET web sites, is also in this category because the Framework’s Common Language Runtime (CLR) triggers garbage collection in its managed Heaps whenever it receives low physical memory notifications from the OS. With processes designed to take advantage of whatever physical memory is available, it is actually quite difficult to assess how much RAM they require for good performance, absent, of course, internal performance statistics on throughput or response time.

Memory management in Hyper-V

Finally, the VMware paper mischaracterizes the memory management features in Microsoft’s Hyper-V product, dismissing it as an inferior derivative of what VMware accomplishes in a more comprehensive fashion. Moreover, this dismissal is unaccompanied by any hard performance data to back up the claim that VMware ESX is better. It may well be true that VMware is superior, but I am not aware of any data in the public domain that would enable anyone to argue conclusively one way or the other. (Hmmm. I smell an opportunity there.)

It seems entirely possible to me that Hyper-V currently holds the advantage when it comes to virtual memory management. Memory management in Hyper-V, which installs on top of the Windows Server OS, can leverage the Windows Memory Manager. It benefits from the fact that the OS understands the guest machine’s actual pattern of physical memory usage, since the guest machine appears as simply another process address space executing on the virtualization host.

In its original incarnation, Hyper-V could not take advantage of this because the entire container process that the guest machine executed in was relegated to non-paged memory in the virtualization host. Guest machines acquired their full allotment of virtualized “physical” memory at initialization and retained it throughout as static non-paged memory, which put Hyper-V at a significant disadvantage compared to VMware, which granted guest machine memory on demand.

In the case of a Windows guest machine, VMware would initially have to grant the guest all the physical memory it requested because of the zero memory initialization routine Windows runs at start-up to clear the contents of RAM. While a Windows guest machine is idle, however, VMware’s transparent memory sharing feature reduces the guest machine memory footprint by mapping all those identical, empty zero-filled pages to a single page in machine memory. This behavior was evident in the benchmark test results I reported.

So, VMware’s dynamic approach to memory management lets you over-commit machine memory and pack significantly more guest machines into the same memory footprint. This, as I have indicated, works just great so long as those guest machines are mostly idle, making machines devoted to application testing and QA, for example, no-brainer candidates for virtualization.

However, the most recent versions of Hyper-V have a feature that Microsoft calls “Dynamic Memory” that changes all that. It is fair to say that Dynamic Memory gives Hyper-V parity with VMware in this area. RAM is allocated on demand, similar to VMware. When you configure a Hyper-V guest, you specify the maximum amount of RAM that Hyper-V will allocate on its behalf. But that RAM is only granted on demand. To improve start-up performance, you can also specify the amount of an initial grant of RAM to the guest machine. (I suspect a Windows guest machine is “enlightened” when it detects it is running under Hyper-V and skips the zero memory initialization routine.)

There is one other configurable parameter the new dynamic memory management feature uses called the memory buffer. The VMTJ authors mistakenly conflate the Hyper-V memory buffer with the ballooning technique that VMware pioneered. In fact, since the Hyper-V Host understands the memory usage pattern of its guests, there is no reason to resort to anything like ballooning to drive page replacement at the guest machine level when the virtualization host faces memory over-commitment.

The configurable Hyper-V guest machine memory buffer is a feature which tries to maintain a pool of free or available pages to stay ahead of the demand for new pages. In Hyper-V, the virtualization host allocates machine memory pages for the guest based on guest machine committed memory. (My guess is that the allocation mechanism Hyper-V relies on is the guest machine’s construction of the Page Table entry for the virtual memory that is allocated.)

The memory buffer is a cushion that Hyper-V adds to the guest machine’s Committed bytes – but only when sufficient machine memory is available – to accommodate rapid expansion since the number of committed bytes fluctuates as processes come and go, for example. Requests for new pages are satisfied from this pool of available memory immediately without incurring any of the delays associated with page replacement, which can often be relegated to background operations.

Configuring a guest machine’s memory buffer is basically a hint to the Hyper-V virtualization host memory manager – essentially, the Windows Memory Manager – to allocate and maintain some amount of additional virtual memory for the guest machine to use when it needs to expand. This is a good practice in any virtual memory management system. It is not a ballooning mechanism at all.
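As a rough sketch of the idea in Python: the 20 percent buffer and the 6 GB committed figure are illustrative values, not measurements, and this is my reading of the cushion described above rather than Microsoft's documented formula.

    def hyperv_memory_target_gb(guest_committed_gb, buffer_pct=20):
        # Machine memory the Hyper-V host tries to maintain for the guest: its committed
        # bytes plus the configured memory buffer cushion, granted only when sufficient
        # machine memory is available
        return guest_committed_gb * (1 + buffer_pct / 100)

    print(hyperv_memory_target_gb(6.0))    # 7.2 GB with a 20% buffer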

Sunday, 14 July 2013

Virtual memory management in VMware: Final thoughts

Posted on 17:00 by Unknown
This is the final blog post in a series on VMware memory management. The previous post in the series is here.

Final Thoughts

I have been discussing in some detail a case study we constructed in which VMware memory over-commitment led to guest machine memory ballooning and swapping, which, in turn, had a substantial impact on the performance of the applications that were running. When memory contention was present, the benchmark application executed to completion three times slower than the same application run standalone. The difference was entirely due to memory management “overhead,” the cost of demand paging when the supply of machine memory was insufficient to the task.

Analysis of the case study results unequivocally shows that the cost equation associated with aggressive server consolidation using VMware needs to be adjusted based on the performance risks that can arise when memory is over-committed. When configuring the memory on a VMware Host machine for optimal performance, it is important to realize that virtual memory systems do not degrade gracefully. When virtual memory workloads overflow the amount of physical memory available for them to execute, they are subject to page fault resolution delays that are punishing in nature. The delay incurred while performing the disk I/O necessary to bring a block of code or data from the paging file into memory to resolve a page fault is several orders of magnitude larger than almost any other sort of execution time delay a running thread is ever likely to encounter.

VMware implements a policy of memory over-commitment in order to support aggressive server consolidation. In many operational environments, like Server hosting or application testing, guest machines are frequently dormant. But when they are active, they are extremely active in bursts. These kinds of environments are well-served by aggressive guest machine consolidation on server hardware that is massively over-provisioned.

On the other hand, implementing overly aggressive server consolidation of active production workloads with more predictable levels of activity presents a very different set of operational challenges. One entirely unexpected result of the benchmark was the data on Transparent Memory Sharing reported in an earlier post that showed the benefits of memory sharing evaporating almost completely in the face of guest machines actively using their allotted physical memory. Since the guest machines used in the benchmark were configured identically, down to running the same exact application code, it was surprising to see how ineffective memory sharing proved to be once the benchmark applications started to execute on their respective guest machines. Certainly the same memory over-commitment mechanism is extremely effective when guest machines are idle for extended periods of time. This finding that memory sharing is ineffective when the guest machines are active, if it can be replicated in other environments, would call for re-evaluating the value of the whole approach, especially since idle machines can be swapped out of memory entirely.

Moreover, the performance-related risks for critical workloads that arise when memory over-commitment leads to ballooning and swapping are substantial. Consider that if an appropriate amount of physical memory was chosen for a guest machine configuration at the outset, any pages removed from the guest machine memory footprint via ballooning and/or swapping are potentially very damaging. For this reason, warnings from SQL Server DBAs about VMware’s policy of over-committing machine memory are very prominent in blog posts. See http://www.sqlskills.com/blogs/jonathan/the-accidental-dba-day-5-of-30-vm-considerations/ for an example.

In the benchmark test discussed here, each of the guest machines ran identical workloads that, when a sufficient number of them were run in tandem, combined to stress the virtual memory management capabilities of the VMware Host. Using the ballooning technique, VMware successfully transmitted the external memory contention in effect to the individual guest machines. This successful transmission diffused the response to the external problem, but did not in any way lessen its performance impact.

More typical of a production environment, perhaps, is the case where a single guest machine is the primary source of the memory contention. Just as in a single OS image when one user process consuming an excess of physical memory can create a resource shortage with a global impact, a single guest machine consuming an excess of machine memory can generate a resource shortage that impacts multiple tenants in the virtualization environment.


Memory Reservations.

In VMware, customers do have the ability to prioritize guest machines so that all tenants sharing an over-committed virtualization Host machine are not penalized equally when there is a resource shortage. The most effective way to protect a critical guest machine from being subjected to ballooning and swapping due to a co-resident guest is to set up a machine memory Reservation. A machine memory Reservation establishes a floor guaranteeing that a certain amount of machine memory is always granted to the guest. With a Reservation value set, VMware will not subject a guest machine to ballooning or swapping that will result in the machine memory granted to the guest falling below that minimum. 

But in order to set an optimal memory Reservation size, it is first necessary to understand how much physical memory the guest machine requires, not always an easy task. A Reservation value that is set too high on a Host machine experiencing memory contention will have the effect of increasing the level of memory reclamation activity on the remaining co-tenants of the VMware Host.

Another challenge is how to set an optimal Reservation value for guest machines running applications that, like the .NET Framework application used in the benchmark discussed here, dynamically expand their working set to grab as much physical memory as possible on the machine. Microsoft SQL Server is one of the more prominent Windows server applications that does that, but others include the MS Exchange Store process (fundamentally also a database application), and ASP.NET web sites. Like the benchmark application, SQL Server and Store listen for Low Memory notifications from the OS, and will trim back their working set of resident pages in response. If the memory remaining proves inadequate to the task, there are performance ramifications.

With server applications like SQL Server that expand to fill the size of RAM, it is often very difficult to determine how much RAM is optimal, except through trial and error. The configuration flexibility inherent in virtualization technology does offer a way to experiment with different machine memory configurations. Once the appropriate set of performance “experiments” have been run, the results can then be used to reserve the right amount of machine memory for these guest machines. Of course, these workloads are also subject to growth and change over time, so once memory reservation parameters are set, they need to be actively monitored at both the VMware Host and guest machine and application levels.
Posted in memory management, VMware

Wednesday, 10 July 2013

Virtual memory management in VMware: Swapping

Posted on 11:49 by Unknown
This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.


Swapping

To relieve a serious shortage of machine memory, VMware has recourse to stealing physical memory pages granted to a guest OS at random, a mechanism VMware terms swapping. When free machine memory drops below a 4% threshold, swapping is triggered.

During the case study, VMware resorted to swapping beginning around 9:10 AM when the Memory State variable reported a memory state transition to the “Hard” memory state, as shown in Figure 19. Initially, VMware swapped out almost 600 MB of machine memory granted to the four guest machines. Also, note that swapping is very biased. The ESXAS12B guest machine was barely touched, while at one point 400 MB of machine memory from the ESXAS12E machine was swapped out.
Figure 19. VMware resorted to random page replacement – or swapping – to relieve a critical shortage of machine memory when usage of machine memory exceeded 96%. Swapping was biased – not all guest machines were penalized equally.

Given how infrequently random page replacement policies are implemented, it is surprising to discover they often perform reasonably well in simulations, although they still perform much worse than stack algorithms that order candidates for page replacement based on Least Recently Used criteria. Because VMware selects pages from a guest machine’s allotted machine memory for swapping at random, it is entirely possible for VMware to select truly awful candidates for removal from a guest machine’s current working set of machine memory pages. With random page replacement, some worst case scenarios are entirely possible. For example, VMware might choose to swap out a frequently referenced page that contains operating system kernel code or Page Table entries, pages that the guest OS would rank among those least likely to be chosen for page replacement.

To see how effective VMware’s random page replacement policy is, the rate of pages swapped out was compared to the swap-in rate. This comparison is shown in Figure 20. There were two large bursts of swap out activity, the first one taking place at 9:10 AM when the swap out rate was reported at about 8 MB/sec. The swap-in rate never exceeded 1 MB/sec, but a small amount of swap-in activity continued to be necessary over the next 90 minutes of the benchmark run, until the guest machines were shut down and machine memory was no longer over-committed. In clustered VMware environments, the vMotion facility can be invoked automatically to migrate a guest machine from an over-committed ESX Host to another machine in the cluster that is not currently experiencing memory contention. This action may relieve the immediate memory over-commitment, but may also succeed in simply shifting the problem to another VM Host.

As noted in the previous blog entry, the benchmark program took three times longer to execute when there was memory contention from all four active guest machines, compared to running in a standalone guest machine. Delays due to VMware swapping were certainly one of the important factors contributing to elongated program run-times.

Figure 20. Comparing pages swapped out to pages swapped in.
This entry on VMware swapping concludes the presentation of the results of the case study that stressed the virtual memory management facilities of a VMware ESX host machine. Based on an analysis of the performance data on memory usage gathered at the level of both the VMware Host and internally in the Windows guest machines, it was possible to observe the virtual memory management mechanisms used by VMware in operation very clearly.

With this clearer understanding of VMware memory management in mind, I'll discuss some of the broader implications for performance and capacity planning of large scale virtualized computing infrastructures in the next (and last) post in this series.
Posted in memory management, VMware

Monday, 1 July 2013

Virtual memory management in VMware: memory ballooning

Posted on 18:45 by Unknown
This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.


Ballooning


Ballooning is a complicated topic, so bear with me if this post is much longer than the previous ones in this series.


As described earlier, VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate” when it begins to encounter contention for machine memory, defined as the amount of free machine memory available for new guest machine allocation requests dropping below 6%. In the benchmark example I am discussing here, the Memory Usage counter rose to 98% allocation levels and remained there for the duration of the test while all four virtual guest machines were active.

Figure 7, which shows the guest machine Memory Granted counter for each guest, with an overlay showing the value of the Memory State counter reported at the end of each one-minute measurement interval, should help to clarify the state of VMware memory-management during the case study. The Memory state transitions indicated mean that VMware would attempt to use both ballooning and swapping to try to relieve the over-committed virtual memory condition.

The impact of ballooning will be discussed first.

Figure 7. Memory State transitions that are associated with active memory granted to guest machines that drained the supply of free machine memory pages.

Ballooning occurs when the VMware Host recognizes that there is a shortage of machine memory and must be replenished using page replacement. Since VMware has only limited knowledge of current page access patterns, it is not in a position to implement an optimal LRU-based page replacement strategy. Ballooning attempts to shift responsibility for page replacement to the guest machine OS, which presumably can implement a more optimal page replacement strategy than the VMware hypervisor. Essentially, using ballooning, VMware reduces the amount of physical memory available for internal use within the guest machine, forcing the guest OS into exercising its memory management policies. 

The impact that ballooning has on virtual memory management being performed internally by the guest OS suggests that it will be well worth looking inside the guest OS to assess how it detects and responds to the shortage of physical memory that ballooning induces. Absent ballooning, when the VMware Host recognizes that there is a shortage of machine memory, this external contention for machine memory is not necessarily manifest inside the guest OS, where, for example, its physical memory as configured might, in fact, be sized very appropriately. Unfortunately, for Windows guest machines the problem is not simply that a guest machine configured to run well with x amount of physical memory abruptly finds itself able to access far less than the amount configured for it to use, which is serious enough. An additional complication is that well-behaved Windows applications listen for notifications from the OS and attempt to manage their own process virtual memory address spaces in response to these low memory events.

Baseline measurements without memory contention.

To help understand what happened during the memory contention benchmark, it will be useful to compare those results to a standalone baseline set of measurements gathered when there was no contention for machine memory. We begin by reviewing some of the baseline memory measurements taken from inside Windows when the benchmark program was executed standalone on the VMware ESX server with only one guest machine active. 

When the benchmark program was executed standalone with only one guest machine active, there was no memory contention. A single 8 GB guest defined to run on a 16 GB VMware ESX machine could count on all the machine memory granted to it being available. For the baseline, the VMware Host reported overall machine memory usage never exceeding 60% and the balloon targets communicated to the guest machine balloon driver are zero values. The Windows guest machine does experience some internal memory contention, though, because there is a significant amount of demand paging that occurred during the baseline.

Physical memory usage inside the Windows guest machine during the baseline run is profiled in Figure 8. The Available Bytes counter is shown in light blue, while memory allocated to process address spaces is reported in dark blue. Windows performs page replacement when the pool of available bytes drops below a threshold number of free pages on the Zero list. When the benchmark process is run, beginning at 2:50 pm, physical memory begins to be allocated to the benchmark process, shrinking the pool of available bytes. Over the course of the execution of the benchmark process – a period of approximately 30 minutes – the Windows page replacement policy periodically needs to replenish the pool of available bytes by trimming back the number of working set pages allocated to running processes. 

In a standalone mode, when the benchmark process is active, the Windows guest machine manages to utilize all the 8 GB of physical memory granted to it under VMware. This is primarily a result of the automatic memory management policy built into the .NET Framework runtime which the benchmark program uses. While a .NET Framework program specifically allocates virtual memory as required, the runtime, not the program, is responsible for reclaiming any memory that was previously allocated but is no longer needed. The .NET Framework runtime periodically reclaims currently unused memory by scheduling a garbage collection when it receives a notification from the OS that there is a shortage of physical memory Available Bytes. The result is a tug of war: the benchmark process continually grows its working set to fill physical memory, and Windows memory management periodically signals the CLR of an impending shortage of physical memory, causing the CLR to free some of the virtual memory previously allocated by the managed process.

The demand paging rate of the Windows guest is reported in Figure 8 as a dotted line chart, plotted against the right axis. There are several one-minute spikes reaching 150 hard page faults per second. The Windows page replacement policy leads to the OS trimming physical pages that are then re-accessed during the benchmark run that subsequently need to be retrieved from the paging file. In summary, during a standalone execution of the benchmark workload, the benchmark process allocates and uses enough physical memory to trigger the Windows page replacement policy on the 8 GB guest machine.



Figure 8. Physical Memory usage by the Windows guest OS when a single guest machine was run in a standalone mode.

The physical memory utilization profile of the Windows guest machine reflects the use of virtual memory by the benchmark process, which is the only application running. This process, ThreadContentionGenerator.exe, is a multithreaded 64-bit program written in C# that deliberately stresses the automated memory management functions of the .NET runtime. The benchmark program's usage of process virtual memory is highlighted in Figure 9.

The benchmark program allocates some very large data structures and persists them through long processing cycles that access, modify and update them at random. Inside the program, the program allocates these data structures using the managed Heaps associated with the .NET Framework’s Common Language Runtime (CLR). The CLR periodically schedules a garbage collection thread that automatically deletes and compacts the memory previously allocated to objects that are no longer actively referenced. (Jeff Richter's book, CLR via C#, is a good reference on garbage collection inside a .NET Framework process.)

Figure 9. The sizes of the managed heaps inside the benchmarking process address space when there was no "external" memory contention. The size of the process’s Large Object Heap varies between 4 and 8 GB during a standalone run.

Figure 9 reports the sizes of the four managed Heaps during the standalone benchmark run. It shows the amount of process private virtual memory bytes allocated ranging between 4 and 8 GBs during the run. The size of the Large Object Heap, which is never compacted during garbage collection, dwarfs the sizes of the other 3 managed Heaps, which are generation-based. Normally in a .NET application process, garbage collection is initiated automatically when the size of the Generation 0 heap grows to exceed a threshold value, which is chosen in 64-bit Windows based on the amount of physical memory that is available. But the Generation 0 heap in this case remains quite small, smaller than the Generation 0 “budget.” 

The CLR also initiates garbage collection when it receives a LowMemoryResourceNotification event from the OS, a signal that page trimming is about to occur. Well-behaved Windows applications that allow their working sets to expand until they reach the machine’s physical memory capacity wait on this notification. In response to the Low Memory resource Notification, the CLR dispatches its garbage collection thread to reclaim whatever unused virtual memory it can find inside the process address space. Garbage collections initiated by LowMemoryResourceNotification events cause the size of the Large Object Heap to fluctuate greatly during the standalone benchmark run. 

To complete the picture of virtual memory management at the .NET process level, Figure 10 charts the cumulative number of garbage collections that were performed inside the ThreadContentionGenerator.exe process address space. For this discussion, it is appropriate to focus on the number of Generation 0 garbage collections; the fact that so many Generation 0 collections escalate to Gen 1 and Gen 2 collections is a byproduct of so much of the virtual memory used by the ThreadContentionGenerator program being allocated in the Large Object Heap. Figure 10 shows about 1,000 Generation 0 garbage collections occurring, which serves as a rough but reasonable estimate of the number of low-memory resource notifications the OS generated during the run to trigger garbage collection.
Figure 10. The number of CLR Garbage collections inside the benchmarking process when the guest Windows machine was run in standalone mode.
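The garbage collection counts charted in Figure 10 come from the .NET CLR Memory counter set that the runtime publishes for every managed process. A minimal sketch of sampling those counters with System.Diagnostics follows; the counter instance name "ThreadContentionGenerator" is an assumption based on the process name.

using System;
using System.Diagnostics;
using System.Threading;

class GcCollectionSampler
{
    static void Main()
    {
        // Counter instance names generally follow the process name;
        // "ThreadContentionGenerator" is assumed here for illustration.
        const string instance = "ThreadContentionGenerator";

        var gen0 = new PerformanceCounter(".NET CLR Memory", "# Gen 0 Collections", instance);
        var gen1 = new PerformanceCounter(".NET CLR Memory", "# Gen 1 Collections", instance);
        var gen2 = new PerformanceCounter(".NET CLR Memory", "# Gen 2 Collections", instance);

        while (true)
        {
            // These counters are cumulative, matching the cumulative chart in Figure 10.
            Console.WriteLine("{0:T}  Gen0={1}  Gen1={2}  Gen2={3}",
                DateTime.Now, gen0.RawValue, gen1.RawValue, gen2.RawValue);
            Thread.Sleep(15000);
        }
    }
}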

In summary, the benchmark program is a multi-threaded, 64-bit .NET Framework application that will allocate virtual memory up to the physical memory limits of the machine. When the Windows OS encounters a shortage of free, zeroed pages as a result of these virtual memory allocations, it issues a low-memory notification that is received and processed by the CLR. Upon receipt of this notification, the CLR schedules a garbage collection to reclaim any private bytes previously allocated on the managed heaps that are no longer in use.

Introducing external memory contention.

Now, let’s review the same memory management statistics when the same VMware Host is asked to manage a configuration of four such Windows guest machines running concurrently, all actively attempting to allocate and use 8 GB of physical memory. 

As shown in Figure 11, ballooning begins to kick in around 9:10 AM during the benchmark run, which also corresponds to an interval in which the Memory State transitioned to the “hard” state, where both ballooning and swapping would be initiated. (From Figure 2, we saw this corresponds to intervals where machine memory usage was reported running about 98% full.) Figure 11 reports the memory balloon targets communicated to the balloon driver software resident in each guest OS, with balloon targets rising to over 4 GB per machine. An increase in the target instructs the balloon driver to “inflate” by allocating memory, while a decrease in the target causes the balloon driver to deflate. As the balloon drivers in each guest machine inflate, the guest machines eventually encounter contention for physical memory, which they respond to by using their page replacement policies to identify older pages to be trimmed from physical memory.

Figure 11. The memory balloon targets for each guest machine increase to about 4 GB when machine memory fills.


Note that when the VMware Tools are installed in the Windows guest machine, the balloon driver’s memory targets are also reported inside the guest in the VM Memory performance counters. 
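If the VMware Tools counters are present, they can be read with the same System.Diagnostics machinery as any other Windows performance counter. Because the exact counter names vary with the Tools version, the sketch below simply enumerates whatever the "VM Memory" category exposes rather than assuming particular counter names.

using System;
using System.Diagnostics;

class VmMemoryCounterList
{
    static void Main()
    {
        // "VM Memory" is the counter category installed by the VMware Tools;
        // this constructor throws if the Tools (and the category) are not present.
        var category = new PerformanceCounterCategory("VM Memory");

        // The category is assumed here to be single-instance; GetCounters() lists
        // every counter it publishes, including the balloon target.
        foreach (PerformanceCounter counter in category.GetCounters())
            Console.WriteLine(counter.CounterName);
    }
}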

In Windows, when VMware’s vmmemctl.sys balloon driver inflates, it allocates physical memory pages and pins them in physical memory until they are explicitly released. To determine how effectively ballooning relieves a shortage of machine memory, it is useful to drill into the guest machine performance counters and look for signs of increased demand paging and other indicators of memory contention. Based on how Windows virtual memory management works [3], we looked for the following telltale signs that virtual memory was under stress as a result of the balloon driver inflating inside the Windows guest (a sketch of sampling these counters follows the list): 
  • memory allocated in the nonpaged pool should spike due to allocation requests from the VMware balloon driver 
  • a reduction in Available Bytes, leading to an increase in hard page faults (Page Reads/sec) as a result of increased contention for virtual memory 
  • applications that listen for low-memory notifications from the OS will initiate page replacement to trim their resident sets voluntarily 
(As discussed above, the number of garbage collections performed inside the ThreadContentionGenerator process address space corresponds to the number of low memory notifications received from the OS indicating that page trimming is about to occur.) 
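As promised above, here is a minimal sketch of watching those three symptoms from inside the guest, using the standard Memory counter set:

using System;
using System.Diagnostics;
using System.Threading;

class BalloonSymptomSampler
{
    static void Main()
    {
        // The three guest-level symptoms listed above, all taken from the
        // standard "Memory" counter set that every Windows machine exposes.
        var nonpagedPool   = new PerformanceCounter("Memory", "Pool Nonpaged Bytes");
        var availableBytes = new PerformanceCounter("Memory", "Available Bytes");
        var pageReads      = new PerformanceCounter("Memory", "Page Reads/sec");

        pageReads.NextValue();  // the first read of a rate counter always returns zero

        while (true)
        {
            Thread.Sleep(15000);  // sample on roughly the same interval as the charts shown here
            Console.WriteLine("{0:T}  NonpagedPool={1:N0}  AvailableBytes={2:N0}  PageReads/sec={3:F1}",
                DateTime.Now, nonpagedPool.NextValue(), availableBytes.NextValue(), pageReads.NextValue());
        }
    }
}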

Generally speaking, if ballooning is effective, it should cause a reduction in the working set of the benchmark process address space, since that is the main consumer of physical memory inside each guest. Let's take a look.


Physical memory usage.

Figure 12 shows physical memory usage inside one of the active Windows guest machines, reporting the same physical memory usage metrics as Figure 8. (All four guests show a similar pattern.) Beginning at 9:10 AM, when the VMware balloon driver inflates, process working sets (again shown in dark blue) are reduced, which shrinks the physical memory footprint of the guest machine to approximately 4 GB. Concurrently, the Available Bytes counter, which includes the Standby list, also drops precipitously.


Figure 12. Windows guest machine memory usage counters show a reduction in process working sets when the balloon driver inflates.

Figure 12 also shows an overlay line graph of the Page Reads/sec counter, which counts the hard page faults that have to be resolved from disk. Page Reads/sec is quite erratic while the benchmark program is active. What tends to happen when the OS is short of physical memory is that demand paging to and from disk increases until the paging disk saturates. Predictably, in systems with virtual memory, a physical memory bottleneck is transformed into a disk I/O bottleneck, and the throughput of the paging disk then serves as an upper limit on performance when there is contention for physical memory. In a VMware configuration, this paging I/O bottleneck is compounded when guest machines share the same physical disks, which was the case here. 

Process virtual memory and paging.


Figure 13 drills into the process working set for the benchmarking application, ThreadContentionGenerator.exe. The process working set is evidently impacted by the ballooning action, decreasing in size from a peak value of 6 GB down to less than 4 GB. The overlay line in Figure 13 shows the amount of Available Bytes in Windows. The reduction in the number of Available Bytes triggers the low memory notifications that the OS delivers and .NET Framework CLR listens for. When the CLR receives an OS LowMemoryResourceNotification event, it schedules a garbage collection run to release previously allocated, but currently unused, virtual memory inside the application process address space.

Figure 13. The working set of the benchmarking application process is reduced from over 4 GB down to about 2 GB when VMware ballooning occurs.
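The working set plotted in Figure 13 can also be sampled directly from inside the guest with the System.Diagnostics.Process class; the process name below is an assumption, and a counter-based approach using the Process\Working Set counter would work equally well.

using System;
using System.Diagnostics;
using System.Threading;

class WorkingSetSampler
{
    static void Main()
    {
        for (int i = 0; i < 20; i++)  // take 20 samples, 15 seconds apart
        {
            // The benchmark process name, without the .exe extension, is assumed here.
            var processes = Process.GetProcessesByName("ThreadContentionGenerator");
            if (processes.Length > 0)
            {
                processes[0].Refresh();  // re-read the process memory statistics
                Console.WriteLine("{0:T}  WorkingSet={1:N0} bytes",
                    DateTime.Now, processes[0].WorkingSet64);
            }
            Thread.Sleep(15000);
        }
    }
}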


Figure 14 looks for additional evidence that the VMware balloon driver induces memory contention inside the Windows guest machine when the balloon inflates. It graphs the counters associated with physical memory allocations and shows a sharp increase in the number of bytes allocated in the nonpaged pool, corresponding to the period when the balloon driver begins to inflate. The size of the nonpaged pool rises sharply from 30 MB to 50 MB, beginning shortly after 9:10 AM. The balloon evidently deflates shortly before 10:30 AM, over an hour later, when VMware no longer experiences memory contention.

What is curious, however, is that the magnitude of the increase in the size of the nonpaged pool shown in Figure 14 is so much smaller than the VMware guest machine balloon targets reported in Figure 11. The guest machine balloon target in Figure 11 is approximately 4 GB, and it is evident in Figure 12 that inflating the balloon reduced the physical memory footprint of the guest OS by approximately 4 GB. However, the increase in the size of the nonpaged pool (in light blue) reported in Figure 14 is only about 20 MB. This discrepancy requires some explanation.

What seems likely is that the balloon driver inflates by calling MmProbeAndLockPages, allocating physical memory pages that are not associated with either of the standard paged or nonpaged system memory pools. (Note that the http.sys IIS kernel-mode driver similarly allocates a physical-memory-resident cache that lies outside the range of both the nonpaged and paged pools and is not part of any process address space either. Like the balloon driver's memory, the size of the http.sys cache is not reported in any of the Windows Memory performance counters. By allocating physical memory that is outside the system’s virtual memory addressing scheme, the VMware balloon driver can inflate effectively even in 32-bit Windows, where the virtual memory size of the nonpaged pool is constrained architecturally.)

The size of the VMware memory balloon is not captured directly by any of the standard Windows memory performance counters. The balloon inflating does appear to cause an increase in the size of the nonpaged pool, probably reflecting the data structures that the Windows Memory Manager places there to keep track of the locked pages that the balloon driver allocates.

Figure 14. Physical memory allocations from inside one of the Windows guest machines. The size of the nonpaged pool shows a sharp increase from 30 MB to 50 MB, beginning shortly after 9:10 AM when the VMware balloon driver inflates.

Figure 14 also clearly shows a gap in the measurement data at around 9:15 AM. This gap is probably a missed data collection interval, caused by VMware taking drastic action to alleviate memory contention by blocking execution of the guest machine when the amount of free machine memory dipped below 2%.

Figure 15 provides some evidence that ballooning induces an increased level of demand paging inside the Windows guest machines. Windows reports demand paging rates using the Page Reads/sec counter, which shows consistently higher levels of demand paging activity once ballooning is triggered. The increased level of demand paging is less pronounced than might otherwise be expected from the extent of the process working set reduction that was revealed in Figure 13. As discussed above, the demand paging rate is predictably constrained by the bandwidth of the paging disk, which in this configuration is a disk shared by all the guest machines.

Figure 15. Ballooning causes an increased level of demand paging inside the Windows guest machines. The Pages Read/sec counter from one of the active Windows guest machines is shown. Note, however, the demand paging rate is constrained by the bandwidth of the paging disk, which in this test configuration is a single disk shared by all the guest machines.

It is instructive to compare the demand paging rates in Windows from one of the guest machines with the same guest machine running the benchmark workload standalone, when only a single Windows guest was allowed to execute. In Figure 16, the demand paging rate for the ESXAS12B Windows guest during the memory contention benchmark is contrasted with the same machine and workload running the standalone baseline. The performance counter data from the standalone run is graphed as a line chart overlaying the data from the memory contention benchmark. In standalone mode, the Windows guest machine has exclusive access to the virtual disk where the OS paging file resides; during the contention run, the disk where the guest machines’ paging files reside is shared. In standalone mode, where there is no disk contention, Figure 16 shows that the Windows guest machine is able to sustain higher demand paging rates.

Due to the capacity constraint imposed by the bandwidth of the shared paging disk, the most striking comparison in the measurement data shown in Figure 16 is the difference in run-times. The benchmark workload that took about 30 minutes to execute in a standalone mode ran for over 90 minutes when there was memory contention, almost three times longer. Longer execution times are the real performance impact of memory contention in VMware, as in any other operating system that supports virtual memory. The hard page faults that occur inside the guest OS delay the execution of every instruction thread that experiences them, even operating system threads. Moreover, when VMware-initiated swapping is also occurring – more about that in a moment – execution of the guest OS workload is also potentially impacted by page faults whose source is external.


Figure 16. Hard page fault rates for one of the guest machines, comparing the rate of Page Reads/sec with and without memory contention. In standalone mode – where there is no disk contention – the Windows guest machine is able to sustain higher demand paging rates. Moreover, the benchmark workload that took about 30 minutes to execute in a standalone mode ran for over 90 minutes when there was memory contention.

VMware ballooning appears to cause Windows to send low memory notifications to the CLR, which then schedules a garbage collection run that attempts to shrink the amount of virtual memory the benchmark process uses. 

Garbage collection inside a managed process.


Finally, let's look inside the process address space of the benchmark application at the memory management performed by the .NET Framework's CLR. The cumulative number of .NET Framework garbage collections that occurred inside the ThreadContentionGenerator process address space is reported in Figure 17 for the configuration where there was memory contention. Figure 18 shows the size of the managed heaps inside the ThreadContentionGenerator process address space and the impact of ballooning. Again, the size of the Large Object Heap dwarfs the other managed heaps. 

Comparing the usage of virtual memory by the benchmarking process in Figure 18 to the working set pages that are actually resident in physical memory, as reported in Figure 13, it is apparent that the Windows OS is aggressively trimming pages from the benchmarking process address space. Figure 13 shows the working set of the benchmarking process address space constrained to just 2 GB, once the balloon driver inflated. Meanwhile, the size of the Large Object Heap shown in Figure 18 remains above 4 GB. When an active 4 GB virtual memory address space is confined to running within just 2 GB of physical memory, the inevitable result is that the application’s performance will suffer from numerous paging-related delays. We conclude that ballooning successfully transforms the external contention for machine memory that the VMware Host detects into contention for physical memory that the Windows guest machine needs to manage internally.

Comparing Figure 17 to Figure 10, we see that in both situations, the CLR initiated about 1000 Generation 0 garbage collections. During the memory contention test, the benchmark process takes much longer to execute, so these garbage collections are reported over a much larger interval. This causes the shape of the cumulative distribution in Figure 17 to flatten out, compared to Figure 10. The shape of the distribution changes because the execution time of the application elongates.


Figure 17. CLR Garbage collections inside the benchmarking process when there was "external" memory contention.

Comparing the sizes of the Large Object Heap allocated by the ThreadContentionGenerator process in the two runs, Figure 9 from the standalone run shows the Large Object Heap fluctuating between almost 8 GB and 4 GB, while in Figure 18 it is constrained to between 4 and 6 GB once the VMware balloon driver began to inflate. The size of the Large Object Heap is constrained due to aggressive page trimming by the Windows OS.

Figure 18. The sizes of the managed heaps inside the benchmarking process address space when there was "external" memory contention. The size of the Large Object Heap dwarfs the sizes of the other managed heaps, but is constrained to between 4 and 6 GB due to VMware ballooning.

That completes our look at VMware ballooning during a benchmark run where machine memory was substantially over-committed and VMware was forced into responding, using its ballooning technique.

In the next post, we will consider the impact of VMware's random page replacement algorithm, which it calls swapping.
