Performance By Design: A blog devoted to Windows performance, application responsiveness and scalability


Monday, 24 June 2013

Virtual memory management in VMware: Transparent memory sharing

Posted on 16:19 by Unknown
This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.

In this installment, I will discuss the impact and effectiveness of transparent memory sharing, using the performance data that was gathered during a benchmark that stressed VMware's virtual memory management capabilities.


Transparent memory sharing.

Transparent memory sharing is one of the key memory management mechanisms that supports aggressive server consolidation. VMware dynamically detects memory pages that are identical within or across guest machine images. When identical pages are detected, VMware maps them to a single page in machine memory. When guest machines are largely idle, transparent memory sharing enables VMware to pack guest machine images efficiently onto a single hardware platform, especially machines running the same OS, the same OS version, and the same applications. When guest machines are active, however, the benefits of transparent memory sharing are greatly reduced, as will soon be apparent.

VMware uses a background thread that scans guest machine pages continuously, looking for duplicates. This process is illustrated in Figure 5. Candidates for memory sharing are found by calculating a hash value from the contents of a page and looking for a collision in a hash table built from the hash values of other current pages. If a collision is found, the candidate for sharing is compared to the base page byte by byte. If the contents of the candidate and the base page match, VMware points the PTE of the copy to the same page of machine memory that backs the base page.
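
To make the scan concrete, here is a minimal Python sketch of the hash-and-compare idea. It is not VMware's actual implementation; the page representation, hash function, and data structures are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

def share_scan(pages):
    """pages: dict mapping a (guest, page_number) id to the page contents (bytes).
    Returns a mapping from each page id to the id of the page that backs it."""
    hash_table = defaultdict(list)   # content hash -> ids of "base" pages seen so far
    backing = {}                     # page id -> id of the single shared backing page

    for page_id, contents in pages.items():
        digest = hashlib.sha1(contents).digest()      # hash the page contents
        for base_id in hash_table[digest]:            # hash collision: possible duplicate
            if pages[base_id] == contents:            # confirm with a byte-by-byte compare
                backing[page_id] = base_id            # point the copy at the base page
                break
        else:
            hash_table[digest].append(page_id)        # unique so far: becomes a base page
            backing[page_id] = page_id
    return backing
```

In this toy version, any set of byte-identical pages, such as the zero-filled pages of largely idle guests, ends up backed by whichever of them was scanned first.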

Memory sharing is provisional. VMware uses a Copy on Write mechanism to handle the case where a shared page is modified and can no longer be shared. This is accomplished by flagging the shared page's PTE as Read Only. Then, when an instruction attempts to store data in the page, the hardware generates an addressing exception. VMware handles the exception by creating a duplicate of the page and re-executing the failed Store instruction against the duplicate.
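
A minimal sketch of the copy-on-write break, assuming dictionary-style PTEs and a handler invoked on the write-protection fault (illustrative structures, not VMware's):

```python
def handle_write_fault(guest_pte, machine_memory, free_pages):
    """Break sharing when a guest writes to a read-only shared page."""
    shared_page = guest_pte["machine_page"]
    new_page = free_pages.pop()                                        # allocate a private machine page
    machine_memory[new_page] = bytearray(machine_memory[shared_page])  # duplicate the contents
    guest_pte["machine_page"] = new_page                               # remap the guest to its own copy
    guest_pte["read_only"] = False                                     # drop the write protection
    # the faulting store instruction is then re-executed against the private copy
```
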
Transparent memory sharing has great potential benefits, but there is some overhead necessary to support the feature. One source of overhead is the processing performed by the background scan thread. There are tuning parameters that control the rate at which these background memory scans run, but, unfortunately, there are no associated performance counters that would help the system administrator adjust these parameters. The other source of overhead results from the Copy on Write mechanism, which entails handling the additional hardware interrupts associated with these soft page faults. Nor is there a metric that reports the rate at which these additional soft page faults occur.


Figure 5. Transparent memory sharing uses a background thread to scan memory pages, compute a hash code from their contents, and compare it to the hash codes that have already been computed. In the case of a collision, the contents of the page that is a candidate for sharing are compared byte by byte to the collided page. If the pages have identical contents, VMware points both pages to the same machine memory location.

In the case study, transparent memory sharing is initially extremely effective – while the guest machines are largely idle. Figure 6 renders the Memory Shared performance counter from each of the guest machines as a stacked area chart. At 9 AM, when the guest machines are still idle, almost all of the 8 GB granted to each of three of the machines (ESXAS12C, ESXAS12D, and ESXAS12E) is being shared by pointing those pages to the machine memory pages assigned to the 4th guest machine (ESXAS12B). Together, these three guest machines have about 22 GB of shared memory, which allows VMware to pack 4 x 8-GB OS images into a machine memory footprint of about 10-12 GB.

However, once the benchmark programs start to execute, the amount of shared memory dwindles to near zero. This is an interesting result. With this workload of identically configured virtual machines, even when the benchmark programs are active, there should still be significant opportunities to share identical code pages. But VMware is apparently unable to capitalize much on this opportunity once the guest machines become active. A likely explanation for the diminished returns from memory sharing is simply that the virtual memory management performed by each of the active guest Windows machines causes the contents of too many virtual memory pages to change too frequently, overwhelming the duplicate-detection sharing mechanism.[1]




[1] Since the benchmark programs are also consuming CPU resources, another possible explanation for the lack of memory sharing is severe processor contention that prevents the memory scanning thread from being dispatched while the benchmark programs were active. However, the VMware Host reported overall processor utilization of only about 40-60% throughout most of the active benchmarking period, so this hypothesis was rejected. Here is where some resource accounting that could report the memory scan rate or the amount of time the scan thread was active would be quite helpful.
Figure 6. The impact of transparent memory sharing dwindled to near zero once the benchmarking workloads became active.
In the next post in this series, we will dig into VMware's use of ballooning.
Posted in memory management, VMware | No comments

Tuesday, 18 June 2013

Virtual memory management in VMware: a case study

Posted on 10:42 by Unknown
This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.

Case Study.

The case study reported here is based on a benchmark using a simulated workload that generates contention for machine memory. VMware ESX Server software was installed on a Dell Optiplex 990 with an Intel i7 quad-core processor and 16 GB of RAM. (Hyper-Threading was disabled on the processor through the BIOS.) Four identical Windows Server 2012 guest machines were then defined, each configured to run with 8 GB of physical memory. Each Windows guest runs a 64-bit benchmark application, ThreadContentionGenerator.exe, which the author developed.

The benchmark program was written using the .NET Framework. The program allocates a very large block of private memory and accesses that memory randomly. The benchmark program is multi-threaded and updates the allocated array using explicit locking to maintain the integrity of its internal data structures. Executing threads also simulate IO waits periodically by sleeping, rather than issuing reads or writes against the file system, to avoid exercising the machine's physical disks. Performance data from both VMware and the Windows guest machines was gathered at one-minute intervals for the duration of the benchmark test, approximately two hours. For comparison purposes, a single guest machine was activated to execute the same benchmark in a standalone environment where there was no contention for machine memory. Running standalone, with no memory contention, the benchmark executes in about 30 minutes.
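
The original benchmark is a .NET program and is not reproduced here. The following is only a rough Python sketch of the access pattern it generates: a large private allocation touched at random by multiple threads under a lock, with periodic sleeps standing in for IO waits. The sizes, thread count, and intervals are illustrative assumptions.

```python
import random
import threading
import time

ARRAY_BYTES = 1 * 1024**3      # illustrative; the real benchmark allocates a very large block
THREADS = 8                    # illustrative thread count
RUN_SECONDS = 60               # illustrative run time

data = bytearray(ARRAY_BYTES)  # one large block of private, committed memory
lock = threading.Lock()        # explicit locking protects the shared structure

def worker(stop_at):
    rng = random.Random()
    while time.time() < stop_at:
        for _ in range(10_000):                     # touch pages at random
            offset = rng.randrange(ARRAY_BYTES)
            with lock:
                data[offset] = (data[offset] + 1) & 0xFF
        time.sleep(0.01)                            # simulated IO wait; no real disk access

if __name__ == "__main__":
    stop_at = time.time() + RUN_SECONDS
    threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```
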


Memory allocation on demand

Figure 2 tracks three key ESX memory performance metrics during the test: the total Memory Granted to the four guest Windows machines, the total Memory Active for the same four guest machines, and the VMware Host's Memory Usage counter, reported as a percentage of the total machine memory available. The total Memory Granted counter increases at the outset in 8 GB steps as each of the Windows guest machines spins up. The memory benchmarking programs were started just before 9 AM and continued executing over the next two hours, finally winding down near 11 AM. The benchmark programs drive Active Memory to almost 15 GB shortly after 9 AM, and overall Memory Usage to 98%. (In this configuration, 2% free memory translates into about 300 MB of available physical memory.)

Notice that the Memory Active counter, which purports to measure the guest OS working set of resident pages, exhibits some anomalies, presumably associated with the way it is estimated using sampling. There are periodic spikes in the counter at the beginning of the testing period, when the guest machines have just been activated but are not yet active. Toward the end of the benchmark period, after many of the benchmark worker threads have completed, there is another spike resembling the earlier ones. This later spike shows total guest machine active memory briefly reaching some 20 GB, which, of course, is physically impossible.


Figure 2. Memory Granted, Memory Active and % Memory Used during the benchmark.

As the benchmark programs execute in each of the guest machines, the Memory Granted counter plunges from 32 GB down to about 15 GB. The vCenter Performance Counters documentation provides this definition of the counter: “The amount of memory that was granted to the VM by the host. Memory is not granted to the [guest] until it is touched one time and granted memory may be swapped out or ballooned away if the VMkernel needs the memory.” Evidently, during initialization, a Windows Server machine touches every page of its physical memory, so 8 GB of RAM are initially granted to each guest machine. But in this case study, only 16 GB of physical RAM are available in total. As VMware detects memory contention, the memory granted to each guest machine is reduced through page replacement, using the ballooning and swapping mechanisms.

Figure 3 attempts to show the breakdown of machine memory allocated by adding the allotments associated with the VMkernel to the sum of the Active memory consumed by each of the guest machines. The anomalous spike in Active Memory near the end of the benchmark test pushes overall machine memory usage beyond the amount of RAM actually installed, which, as noted above, is physically impossible. This measurement anomaly, possibly associated with a systematic sampling error, is troubling because it makes it difficult to obtain a reliable, precise breakdown of machine memory allocation and usage in VMware.

Figure 3. Machine memory allocations, including the areas of memory allocated by the VMKernel.
Figure 3 also shows a dotted-line overlay that reports the value of the Memory State counter. The Memory State counter reports the memory state at the end of each measurement interval, so these values should be interpreted as sample observations. There were three sample observations in which the memory state was “Soft,” indicating that ballooning was taking place, and an earlier sample observation in which the memory state was “Hard,” indicating that swapping was triggered.

Figure 4 shows the same counter data as Figure 3, without the Memory Active counter data. We see that the VMware Host management functions consume about 1.5 GB of RAM altogether. This includes the Memory Overhead counter, which reports the space that the shadow page tables occupy. The amount of machine memory that the VMware hypervisor consumes remains flat throughout the active benchmarking period.


Figure 4. Machine memory areas allocated by the VMKernel, including memory management “overhead.”

In the next post in this series, we will look at the effectiveness of another VMware memory management feature, transparent memory sharing. 
Posted in memory management, VMware | No comments

Monday, 10 June 2013

Virtual memory management in VMware.

Posted on 12:08 by Unknown
Server virtualization technology, as practiced by products such as the VMware ESX hypervisor, applies similar virtual memory management techniques in order to operate an environment where multiple virtual guest machines are provided separate address spaces so they can execute concurrently, sharing a single hardware platform. To avoid confusion, in this section machine memory will refer to the actual physical memory (or RAM) installed on the underlying VMware Host platform. Virtual memory will continue to refer to the virtual address space a guest OS builds for a process address space. Physical memory will refer to the virtualized view of machine memory that VMware grants to each guest machine. Virtualization thus adds a second level of memory address virtualization. (A white paper published by VMware entitled “Understanding Memory Resource Management in VMware® ESX™ Server” is a good reference.)

When VMware spins up a new virtual guest machine, it grants that machine a set of contiguous virtual memory addresses that correspond to a fixed amount of physical memory, as specified by configuration parameters. The fact that this grant of physical memory pages does not reflect a commitment of actual machine memory is transparent to the guest OS, which then proceeds to create page tables and allocate this (virtualized) physical memory to running processes the same as it would if the OS were running natively on the hardware. The VMware hypervisor is then responsible for maintaining a second set of physical:machine memory mapping tables, which VMware calls shadow page tables. Just as the page tables maintained by the OS map virtual addresses to (virtualized) physical addresses, the shadow page tables map the virtualized physical addresses granted to the guest OS to actual machine memory pages, which are managed by the VMware hypervisor.

VMware maintains a set of shadow page tables that map virtualized physical addresses to machine memory addresses for each guest machine it is executing. In effect, a second level of virtual to physical address translation occurs each time a program executing inside a guest machine references a virtual memory address: once by the guest OS to map the process virtual address to a virtualized physical address, and then by the VMware hypervisor to map the virtualized physical address to an actual machine memory address. Server hardware is available that supports this two-phase virtual:physical address mapping, as illustrated in Figure 1. In a couple of white papers, VMware reports that this hardware greatly reduces the effort required by the VMware Host software to maintain the shadow page tables.

Figure 1. Two levels of Page Tables are maintained in virtualization Hosts. The first level is the normal set of Page Tables that the guest machines build to map virtual address spaces to (virtualized) physical memory. The virtualization layer builds a second set of shadow Page Tables that are involved in a two-step address translation process to derive the actual machine memory address during instruction execution.
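
To make the two-step translation concrete, here is a minimal Python sketch, assuming simple dictionary page tables rather than the hardware-defined formats:

```python
PAGE_SHIFT = 12                      # 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def translate(virtual_addr, guest_page_table, shadow_page_table):
    """Map a guest virtual address to a machine memory address.

    guest_page_table:  guest virtual page number  -> guest "physical" page number
    shadow_page_table: guest physical page number -> machine page number
    """
    vpn = virtual_addr >> PAGE_SHIFT
    offset = virtual_addr & PAGE_MASK

    gppn = guest_page_table[vpn]          # step 1: the guest OS page tables
    mpn = shadow_page_table[gppn]         # step 2: the hypervisor shadow page tables
    return (mpn << PAGE_SHIFT) | offset

# e.g. translate(0x1234, {0x1: 0x7}, {0x7: 0x42}) == (0x42 << 12) | 0x234
```
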

Ballooning.

VMware attempts to manage virtual memory on demand without unnecessarily duplicating all the effort that its guest machines already expend on managing virtual memory. The VMware hypervisor, which also needs to scale effectively on machines with very large amounts of physical memory, gathers only a minimal amount of information on the memory access patterns of the virtual machine guests it is currently running. When VMware needs to replenish its inventory of available pages, it attempts to pressure the resident virtual machines to make those page replacement decisions themselves by inducing paging within the guest OS, using a technique known as ballooning.

The VMware memory manager intervenes to handle the page faults that occur when a page initially granted to a guest OS is first referenced. This first reference triggers the allocation of machine memory to back the affected page, and results in the hypervisor setting the valid bit of the corresponding shadow Page Table entry. Because the valid bit is set on this first reference, VMware knows that it is an active page. But, following the initial access, VMware does very little to try to understand the reference patterns of the active pages of a guest OS. Neither does it attempt to use an LRU-based page replacement algorithm.

VMware does try to understand how many of the pages allocated to a guest machine are actually active, using sampling. At random, it periodically selects a small sample of the guest machine's active pages and flips the valid bit in the shadow PTE.[1] This is mainly done to identify guest machines that are idle and to calculate what is known as an idle machine tax; pages from idle guest machines are preferred when VMware needs to perform page replacement. If any of the sampled pages that were flagged as invalid are referenced again, they are soft-faulted back into the guest OS working set with little delay. The percentage of sampled pages that are re-referenced within the sampling period is used to estimate the total number of Active pages in the guest machine working set. Note that it is only an estimate.
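
The estimate itself is simple arithmetic. Here is a minimal Python sketch of it, assuming a 100-page sample; it is not VMware's actual code.

```python
import random

def estimate_active_memory(resident_pages, was_touched, sample_size=100):
    """Estimate a guest's active working set by sampling.

    resident_pages: list of page ids currently granted to the guest
    was_touched:    callable(page_id) -> True if the page was referenced during the
                    sampling period (i.e. it soft-faulted back in after its shadow
                    PTE valid bit was cleared)
    """
    if not resident_pages:
        return 0
    sample = random.sample(resident_pages, min(sample_size, len(resident_pages)))
    touched = sum(1 for page in sample if was_touched(page))
    fraction_active = touched / len(sample)
    return int(fraction_active * len(resident_pages))   # estimated count of active pages
```
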

Using the page fault mechanism described above, VMware assigns free machine memory pages to a guest OS on demand. When the amount of free physical memory available for new guest machine allocation requests drops below 6%, ballooning is triggered. Ballooning is an attempt to induce page stealing in the guest OS. Ballooning works as follows. VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate.” vmmemctl.sys is the VMware balloon device driver installed inside a guest Windows machine that “inflates” on command. The vmmemctl.sys balloon driver uses a private communications channel to poll the VMware Host once per second to obtain a ballooning target. Waldspurger [7] reports that in Windows, the balloon inflates by calling standard routines that are available to device drivers that need to pin virtual memory pages in memory. The two memory allocation APIs Waldspurger references are MmProbeAndLockPages and MmAllocatePagesForMDLEx. These APIs return pages that remain resident in physical memory until the device driver explicitly frees them.

After allocating these balloon pages, which remain empty of any content, the balloon driver sends a return message to the VMware Host, providing a list of the physical addresses of the pages it has acquired. Since these pages will remain unused, the VMware memory manager can reclaim the machine memory backing them immediately upon receipt of this reply. So ballooning itself has no guaranteed immediate impact on physical memory contention inside the guest. The intent, however, is to pin enough guest OS pages in physical memory to trigger the guest machine's own page replacement policy. If ballooning does not cause the guest OS to experience memory contention, i.e., if the balloon request can be satisfied without triggering the guest machine's page replacement policy, there will be no visible impact inside the guest machine. If there is no relief from the memory contention, VMware may, of course, continue to increase the guest machine's balloon target until the guest machine starts to shed pages. We will see how effectively this process works in the next blog entry in this series.
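
The balloon driver's control loop can be sketched roughly as follows. This is only an illustrative Python sketch of the poll-and-inflate protocol described above; the interfaces (get_balloon_target, pin_page, and so on) are hypothetical stand-ins, not the actual driver or VMkernel calls.

```python
import time

def balloon_loop(host, guest, poll_interval=1.0):
    """Poll the host for a balloon target and pin or release guest pages to meet it."""
    pinned = []                                      # guest physical pages held by the balloon
    while True:
        target = host.get_balloon_target()           # hypothetical: once-per-second poll
        while len(pinned) < target:                  # inflate: pin more empty guest pages
            page = guest.pin_page()                  # hypothetical: lock a resident page
            pinned.append(page)
            host.reclaim_machine_page(page)          # host reclaims the backing machine page
        while len(pinned) > target:                  # deflate: give pages back to the guest
            guest.unpin_page(pinned.pop())
        time.sleep(poll_interval)
```
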

Because inducing page replacement at the guest machine level using ballooning may not act quickly enough to relieve a machine memory shortage, VMware will also resort to random page replacement from guest OS working sets when necessary. In VMware, this is called swapping. Swapping is triggered when the amount of free physical memory available for new guest machine allocation requests drops below 4%. Random page replacement is a page replacement policy that can be performed without gathering any information about the age of resident pages, and while it is less optimal than an LRU-based approach, simulation studies show that its performance can be reasonably effective.

VMware’s current level of physical memory contention is encapsulated in a performance counter called Memory State. This Memory State variable is set based on the amount of Free memory available. Memory state transitions trigger the reclamation actions reported in Table 1:


State   Value   Free Memory Threshold   Reclamation Action
High    0       ≥ 6%                    None
Soft    1       < 6%                    Ballooning
Hard    2       < 4%                    Swapping to disk or pages compressed
Low     3       < 2%                    Blocks execution of active VMs > target allocations

Table 1. The values reported in the ESX Host Memory State performance counter.

In monitoring the performance of a VMware Host configuration, the Memory State counter is one of the key metrics to track.
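
A minimal sketch of how the Table 1 thresholds map onto the Memory State value and the associated reclamation action (the thresholds come from Table 1; everything else is illustrative):

```python
def memory_state(free_fraction):
    """Map the fraction of free machine memory to the ESX Memory State value."""
    if free_fraction >= 0.06:
        return 0, "High", "none"
    if free_fraction >= 0.04:
        return 1, "Soft", "ballooning"
    if free_fraction >= 0.02:
        return 2, "Hard", "swapping to disk or page compression"
    return 3, "Low", "block execution of VMs above their target allocations"

# e.g. memory_state(0.05) -> (1, 'Soft', 'ballooning')
```
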

In the case study discussed beginning in the next blog entry, a benchmarking workload was executed that generated contention for machine memory on a VMware ESX server. During the benchmark, we observed the memory state transitioning to both the “soft” and “hard” paging states shown in Table 1, triggering both ballooning and swapping.



[1] According to the “Understanding Memory Resource Management in VMware® ESX™ Server” white paper, ESX selects 100 physical pages at random from each guest machine and records how many of the selected pages were accessed in the next 60 seconds. The sampling rate can be adjusted by changing Mem.SamplePeriod in the ESX advanced settings.
Posted in memory management, VMware | No comments

Tuesday, 4 June 2013

VMware virtual memory management

Posted on 17:19 by Unknown
Virtual memory management refers to techniques that operating systems employ to manage the allocation of physical memory resources (or RAM) on demand, transparent to the applications that execute on the machine. Modern operating systems, including IBM’s proprietary mainframe OSes, virtually all flavors of Unix and Linux, as well as Microsoft Windows, have uniformly adopted virtual memory management techniques, ever since the first tentative results from using on-demand virtual memory management demonstrated more effective utilization of RAM, compared to static memory partitioning schemes. All but the simplest processor hardware offer support for virtual memory management.

VMware is a hypervisor, responsible for running one or more virtual machine guests and providing each guest machine with a virtualized set of CPU, memory, disk and network resources that VMware is then responsible for allocating and managing. With regard to managing physical memory, VMware initially grants each running virtual machine a virtualized physical address space, the size of which is specified during configuration. From within the guest machine, there is no indication to either the operating system or the process address spaces that execute under the guest OS that physical addresses are virtualized. Unaware that physical addresses are virtualized, the guest machine OS manages its physical memory in its customary manner, allocating physical memory on demand and replacing older pages with new pages whenever it detects contention for “virtualized” physical memory.

To make it possible for guest machines to execute, VMware provides an additional layer of virtual memory:physical memory mapping for each guest machine. VMware is responsible for maintaining a hardware-specific virtual:physical address translation capability, permitting guest machine instructions to access their full virtualized physical address range. Meanwhile, inside the VMware Host, actual physical memory is allocated on demand, as guest machines execute and reference virtualized physical addresses. As actual physical memory fills, VMware must similarly implement page replacement. Unlike the guest OSes it hosts, VMware itself gathers very little information concerning page reference patterns – due to overhead concerns – that would be useful in performing page replacement. Consequently, VMware's principal page replacement strategy is to try to induce paging inside the guest OS, where, presumably, better-informed decisions can be made. This is known as guest machine ballooning.


To support more aggressive consolidation of guest virtual machine images onto VMware servers, VMware also attempts dynamically to identify identical instances of virtual memory pages within the guest machine or across guest machines that would allow them to be mapped to a single copy of a physical memory page, thus saving on overall physical memory usage. This feature is known as transparent memory sharing. 

Virtual addressing

Virtual memory refers to the virtualized linear address space that an OS builds and presents to each application. 64-bit address registers, for example, can access a breathtaking range of 2^64 virtual addresses, even though the actual physical memory configuration is much, much smaller. Virtual memory addressing permits applications to be written that can execute (largely) independent of the underlying physical memory configuration.

Transparent to the application process address space, the operating system maintains a set of tables that map virtual addresses to actual physical memory addresses during runtime. This mapping is performed at the level of a page, a block of contiguous memory addresses. A page size of 4K addresses (2^12 bytes) is frequently used, although other page sizes are possible. (Some computer hardware allows the OS to select and use a range of supported page sizes.)

To support virtual memory management, the operating system maintains page tables that map virtual memory addresses to physical memory addresses for each process being executed. The precise form of the page tables necessary to perform this mapping is specified by the underlying hardware platform. As a computer program executes, the processor hardware performs the necessary translation of virtual memory addresses to physical memory addresses dynamically at run-time. Operating system functions that support virtual memory management include setting up and maintaining the per-process page tables used to perform this dynamic mapping, and instructing the hardware about the location of these memory address translation tables in physical memory, which is accomplished by loading a dedicated control register that points to the process-specific set of address mapping tables. When one running process blocks, the operating system performs a context switch that loads a different set of page tables, allowing translation of the next process's valid virtual addresses.
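
A toy Python model of the bookkeeping just described, with per-process page tables plus a register-like pointer that selects the current table on a context switch (illustrative structures, not any particular OS):

```python
class MMU:
    """Toy model: per-process page tables and a register pointing at the current one."""
    PAGE_SHIFT = 12

    def __init__(self):
        self.page_tables = {}        # pid -> {virtual page number: physical page number}
        self.current = None          # models the control register loaded on context switch

    def create_process(self, pid):
        self.page_tables[pid] = {}

    def context_switch(self, pid):
        self.current = self.page_tables[pid]   # point the "hardware" at this process's tables

    def translate(self, virtual_addr):
        vpn = virtual_addr >> self.PAGE_SHIFT
        offset = virtual_addr & ((1 << self.PAGE_SHIFT) - 1)
        ppn = self.current[vpn]                # a missing entry is analogous to a page fault
        return (ppn << self.PAGE_SHIFT) | offset
```
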

The techniques that allow an operating system to execute multiple processes concurrently and switch between them dynamically are collectively known as multiprogramming. Modern operating systems evolved rapidly to support multiprogramming across multiple processors, where each CPU is capable of accessing the full range of installed physical memory locations.

(Large scale multi-core multiprocessors are frequently configured with more than one memory bank, where the result is a NUMA (non-uniform memory access) architecture. In machines with NUMA characteristics – something that is quite common in blade servers – accessing a location that resides in a remote memory bank takes longer than a local memory access, a fact that can have serious performance implications. For optimal performance on NUMA machines, the OS memory manager must factor in the NUMA topology into memory allocation decisions, something which VMware evidently does. Further discussion of NUMA architectures and the implications for the performance of guest machines is beyond the scope of the current inquiry, however. Single core multiprocessors from Intel have uniform memory access latency, while AMD single-core multiprocessors have NUMA characteristics.)

Virtual memory management allocates memory on demand, which is demonstrably more effective in managing physical RAM than static partitioning schemes where each executing process acquires a fixed set of physical memory addresses for the duration of its execution. In addition, virtual memory provides a secure foundation for executing multiple processes concurrently since each running process has no capability to access and store data in physical memory locations outside the range of its own unique set of dedicated virtual memory addresses. The OS ensures that each virtual address space is mapped to a disjoint set of physical memory pages. The virtual addresses associated with the OS itself represent a set of pages that are shared in common across all of the process address spaces, a scheme that enables threads in each process to call OS services directly, including the system services enabling interprocess communication (or IPC).

The operating system presents each running process with a range of virtual memory addresses to use that often exceeds the size of physical RAM. Virtualizing memory addressing allows applications to be written that are largely unconcerned with the physical limits of the underlying computer hardware, greatly simplifying their construction. Permitting applications to be portable across a wide variety of hardware configurations, irrespective of the amount of physical memory that is actually available for them to execute, is also of considerable benefit.

The virtual:physical memory mapping and translation that occurs during instruction execution is transparent to the application that is running. However, there are OS functions, including setting up and maintaining the Page Tables, which need to understand and utilize physical memory locations. In addition, device driver software, installed alongside and serving as an extension to the OS, is directly responsible for communicating with all manner of peripheral devices. Device driver software must communicate with those devices using actual physical addresses, because peripheral devices use Direct Memory Access (DMA) interfaces that do not have access to the processor's virtual address to physical address mapping capability during execution.

Memory over-commitment

Allowing applications access to a range of virtual memory addresses that individually or collectively exceeds the amount of physical memory actually available during execution inevitably leads to situations where physical memory is over-committed. When physical memory is over-committed, the operating system implements a page replacement policy that dynamically manages the contents of physical memory, reclaiming a previously allocated physical page and re-purposing it to back an entirely different set of virtual memory addresses, possibly in an entirely different process address space. Dynamically replacing the pages of applications that have not been accessed recently with more recently accessed pages has proven to be an effective way to manage this over-commitment. This is known as demand paging.

Allowing applications to collectively commit more virtual memory pages than are actually present in physical memory, but biasing the contents of physical memory based on current usage patterns, permits operating systems that support virtual memory addressing to utilize physical memory resources very effectively. Over-commitment of physical memory works because applications frequently exhibit phased behavior during execution in which they actively access only a relatively small subset of the overall memory locations they have allocated. The subset of the total number of allocated virtual memory pages that are currently active and resident in physical memory is known as the application’s working set of active pages.

Under virtual memory management, a process address space acquires virtual memory addresses a page at a time, dynamically, on demand. The application process normally requests the OS to allocate a block of contiguous virtual memory addresses for its use. (Since RAM, by definition, is randomly addressable, the process seldom cares where within the address space this block of memory addresses is located. But because fragmentation of the address space can occur, potentially leading to allocation failures when a large enough contiguous block of free addresses is not available to satisfy an allocation request, there is usually some effort on the part of the OS Memory Manager to keep virtual allocations contiguous, where possible.)

In Windows, for example, these allocated pages are known as committed pages because the OS has committed to backing the virtual page in either physical memory or in auxiliary memory, which is another name for the paging file located on disk. Windows also has a commit limit, an upper limit on the number of virtual memory pages it is willing to allocate. The commit limit is equal to the sum of the size of RAM and the size of the paging file(s).

The Page Table entry, or PTE, the format of which is specified by the hardware, is the basic mechanism used by the hardware and operating system to communicate the current allocation status of a virtual page. Two bits in the PTE, the valid bit and the dirty bit, are the key status indicators. When the PTE is flagged as invalid, it is a signal to the hardware not to perform virtual address translation. When the PTE is marked valid, it contains the address of the physical memory page that the OS allocated, which is used in address translation. When the PTE is marked invalid for address translation, the remaining bits in the PTE can be used by the operating system. For example, if the page in question currently resides on the paging file, the data necessary to access the page from the paging file are usually stored in the PTE. (Additional hardware-specified bits in the PTE are used to indicate that the page is Read-only, the page size, and other status data associated with the page.)

Initially, when a virtual memory page is first allocated, it is marked as invalid because the OS has not yet allocated a physical memory page for it. Once it is accessed, and the OS does allocate a physical memory page, the PTE is marked valid and updated to reflect the physical memory address the OS assigned. The hardware sets the “dirty” bit to indicate that an instruction has written or changed data on the page. The OS checks the dirty bit during page replacement to determine whether the contents of the page must be written to the paging file before the physical memory page can be “re-purposed” for use by a different range of virtual addresses.
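
A minimal sketch of these PTE state transitions, using an illustrative bit layout rather than any real hardware format:

```python
VALID = 0x1      # the hardware may translate through this entry
DIRTY = 0x2      # an instruction has stored into the page
ACCESSED = 0x4   # an instruction has referenced the page

def allocate_virtual_page():
    return {"flags": 0, "pfn": None, "pagefile_slot": None}   # starts out invalid

def first_touch(pte, free_page_list):
    """First access: back the page with a physical page and mark the PTE valid."""
    pte["pfn"] = free_page_list.pop()
    pte["flags"] |= VALID

def hardware_store(pte):
    """The hardware sets the dirty bit when an instruction writes to the page."""
    pte["flags"] |= DIRTY | ACCESSED

def must_write_to_paging_file(pte):
    """During page replacement, dirty pages must be written out before re-purposing."""
    return bool(pte["flags"] & DIRTY)
```
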

Page fault resolution

It is not until an instruction executing inside the process attempts to access a virtual memory address during execution that the OS maps the virtual address to a corresponding physical page in RAM. When the Page Table entry (PTE) used for virtual:physical address translation indicates no corresponding physical memory page has been assigned, an instruction that references a virtual address on that page generates an addressing exception. This condition is known as a page fault. The operating system intercepts this page fault and allocates an available page from physical memory, modifying the corresponding PTE to reflect the change. Once a valid virtual:physical mapping exists, the original failing instruction can be re-executed successfully.
In Windows, in resolving a page fault that results from an initial access, the OS assigns an empty page from its Zero Page list to the process address space that generated the fault and marks the corresponding PTE as valid. These operations are known as Demand Zero page faults.

Page fault resolution is transparent to the underlying process address space, but it does have a performance impact. The instruction thread that was executing when the page fault occurred is blocked until after the page fault is resolved. (In Windows, a Thread Wait State Reason is assigned that indicates an involuntary wait status, waiting for the OS to release the thread again, following page fault resolution.) The operating system attempts to minimize page fault resolution time by maintaining a queue of free physical memory pages that are available to be allocated immediately whenever a demand zero page fault occurs. Resolving a page fault by supplying an empty page from the Zero list is regarded as a “soft” fault in Windows because the whole operation is designed to be handled very quickly and usually does not necessitate a disk operation.

Hard page faults are any that need to be resolved from disk. When a thread from the process address space first writes data to the page, changing its contents, the hardware flags the page's PTE “dirty” bit. Later, if a dirty page is “stolen” from the process address space during a page trimming scan, the dirty bit indicates to the OS that the contents of the page must be written to the paging file before the page can be “re-purposed.” When the contents of the page have been written to disk, the PTE is updated to show the page's location in the paging file. If, subsequently, a thread from the original process executes and re-references an address on a previously stolen page, a page fault is generated. During hard page fault resolution, the OS determines from the PTE that the page is currently on disk. It initiates a Page Read operation that copies the current contents of the page from the paging file into an empty page from the Zero list. When this disk IO operation completes, the OS updates the PTE and re-dispatches the thread that was blocked for the duration of the disk IO.
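
The soft and hard resolution paths can be summarized in a short sketch, reusing the illustrative PTE layout from the sketch above:

```python
VALID = 0x1   # as in the PTE sketch above

def resolve_page_fault(pte, zero_list, read_page_from_disk):
    """Resolve a fault against an invalid PTE (illustrative structures, as above)."""
    if pte["pagefile_slot"] is None:
        # Demand-zero ("soft") fault: hand out a pre-zeroed page; no disk IO required.
        pte["pfn"] = zero_list.pop()
    else:
        # Hard fault: the page was stolen earlier and its contents written to the paging file.
        pte["pfn"] = zero_list.pop()
        read_page_from_disk(pte["pagefile_slot"], pte["pfn"])   # blocking Page Read operation
        pte["pagefile_slot"] = None
    pte["flags"] |= VALID
    # the faulting thread is then unblocked and the failing instruction re-executed
```
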

LRU-based page replacement

Whenever the queue of available physical memory pages on the Zero list becomes depleted, however, the operating system needs to invoke its page replacement policy to replenish it. Page replacement, also known as page stealing or, more euphemistically, page trimming, involves scanning physical memory looking for good candidates for page replacement, based on their pattern of usage.

Specifically, operating systems like Windows or Linux implement page replacement policies that choose to replace pages based on actual memory usage patterns, which requires them to keep track – to a degree – of which virtual memory pages an application allocates that are actually currently in use. A page replacement policy that can identify those allocated pages which are Least Recently Used (LRU) and target them for removal has generally proven quite effective. Most cache management algorithms – and it is quite reasonable to conceptualize physical memory as a “cache” for a virtual address space – in use today use some form of LRU-based page replacement.

In order to identify which allocated pages an application is actually using at a given time, it is necessary for the OS to gather information on page usage patterns. Physical memory hardware provides very basic functions that the OS can then exploit to track physical memory usage. The hardware sets an access bit in the Page Table Entry (PTE) associated with that corresponding range of physical addresses, indicating that an instruction accessed some address resident on the page. (Similarly, the hardware sets a “dirty” bit to indicate that an instruction has stored new data somewhere in the page.)

How the OS uses this information from the PTE access bit to keep track of the age of a page varies from vendor to vendor. For instance, some form of “clock algorithm” that periodically resets the access bits of every page that was recently accessed is the approach used in the IBM mainframe OS. In the next clock interval in which the aging code is dispatched, it scans memory and resets the access bit for any page that was accessed during the previous interval. Meanwhile, the clock aging algorithm increments the unreferenced interval count for any page that was not accessed during the interval. Over time, the distribution of unreferenced interval counts for allocated pages yields a partial order over the age of each page on the machine. This partial order allows the page replacement routine to target the oldest pages on the system for page stealing.
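
A minimal Python sketch of one clock-style aging pass over a set of PTEs (illustrative fields, not the mainframe implementation):

```python
def clock_aging_pass(ptes):
    """One pass of a clock-style aging scan.

    Each PTE is a dict with an 'accessed' flag (set by the hardware on reference)
    and an 'age' counter (the unreferenced interval count maintained by the OS).
    """
    for pte in ptes:
        if pte["accessed"]:
            pte["age"] = 0           # referenced since the last pass: reset its age
            pte["accessed"] = False  # clear the bit so the next pass sees new references
        else:
            pte["age"] += 1          # one more interval without a reference

def replacement_candidates(ptes, count):
    """Target the oldest pages (largest unreferenced interval counts) for stealing."""
    return sorted(ptes, key=lambda pte: pte["age"], reverse=True)[:count]
```
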

The clock algorithm provides an incremental level of detail on memory usage patterns that is potentially quite useful for performance and capacity planning purposes [3], but it also has some known limitations, especially with regard to performance. One limitation is that the execution time of a memory scan varies linearly with the size of RAM. On very large machines, with larger amounts of RAM to manage, scanning page table entries is time-consuming. And it is precisely those machines with the most memory and the least memory contention where the overhead of maintaining memory usage data is the highest.

Windows adopted a form of interval-oriented, clock-based page aging algorithm that, hopefully, requires far fewer resources to run, allowing memory management to scale better on machines with very large amounts of RAM to manage. In Windows, the Balance Set Manager is dispatched once per second to “trim” pages aggressively from processes whose working sets exceed their target values, which by default are set to arbitrarily low levels. Pages stolen from the address space in this fashion are, in fact, only stolen provisionally. In effect, they are placed in a memory-resident cache managed as a FIFO queue called the Standby list. (In some official Windows documentation sources, the Standby list is referred to simply as “the cache.”) When the process references any previously stolen pages that are still resident in the FIFO cache, these pages can be “soft-faulted” back into the process's working set without the need for any IO to the paging disk.

Pages in the Standby list that remain unreferenced are aged during successive page trimming cycles, eventually being pushed to the head of the queue. The Windows OS zero paging thread, which is awakened whenever the Zero list needs replenishing, pulls aged pages from the head of the Standby list, and writes zero values to the page, erasing the previous contents. After being zeroed, the page is then moved to the Zero list, which is used to satisfy any process requests for new page allocations. (Stolen pages that have their “dirty” bit set are detoured first to the Modified List prior to being added to the Standby cache.)
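
The flow of trimmed pages through the Modified, Standby and Zero lists can be modeled roughly as follows. This is a toy Python sketch of the lists described above, not the Windows memory manager itself.

```python
from collections import deque

class PageLists:
    """Toy model of trimmed pages flowing through the Modified, Standby and Zero lists."""

    def __init__(self):
        self.modified = deque()   # dirty stolen pages awaiting a write to the paging file
        self.standby = deque()    # clean stolen pages, contents intact, managed FIFO
        self.zero = deque()       # zeroed pages ready to satisfy new allocation requests

    def trim(self, pfn, dirty):
        """A page stolen by the working-set trimmer enters one of the two caches."""
        (self.modified if dirty else self.standby).append(pfn)

    def flush_modified(self):
        """After the paging-file write completes, dirty pages join the Standby list."""
        while self.modified:
            self.standby.append(self.modified.popleft())

    def soft_fault(self, pfn):
        """A process re-references a page still on the Standby list: no disk IO needed."""
        self.standby.remove(pfn)
        return pfn

    def replenish_zero_list(self, count):
        """The zero-page thread pulls aged pages from the head of the Standby list."""
        for _ in range(min(count, len(self.standby))):
            pfn = self.standby.popleft()          # the oldest pages are at the head
            self.zero.append(pfn)                 # contents erased (zeroed) before reuse
```
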

So long as an application's set of resident virtual memory pages corresponds reasonably well to its working set of active pages, relatively few hard page faults will occur during execution, and managing virtual memory on demand will have very little impact on the execution time of the application. Moreover, so long as the operating system succeeds in maintaining an adequate inventory of available physical pages in advance of their actual usage by running processes, the page faults that do occur can be resolved relatively quickly, minimizing the execution time delays that running processes incur. However, the performance impact of virtual memory management on the execution time of running tasks can be substantial if, for example, the demand for new pages exceeds the supply, or if replenishing the inventory of available physical pages forces the OS to steal pages that are apt to be accessed again quite soon, once a blocked process is re-dispatched. This situation, where the impact of virtual memory management on performance is significant, is commonly referred to as thrashing, conjuring up an image of the machine exerting a great deal of effort moving virtual memory pages in and out of physical memory to the detriment of performing useful work.
Posted in memory management, VMware | No comments

Performance Management in the Virtual Data Center: Virtual Memory Management, Part 1

Posted on 17:06 by Unknown
This is the beginning of a new series of blog posts that explores the strategies VMware ESX employs to manage machine memory, focusing on the ones designed to support aggressive consolidation of virtual machine guests on server hardware. Server consolidation is one of the prime cost justifications for the use of VMware's virtualization technology. Typical rack-mounted blade servers deployed in data centers contain far more processing power than most application servers require. From a capacity planning perspective, it is simply not cost-effective to configure many server images today to run directly on native hardware.

Virtualization software permits server resources – CPUs, memory, disk and network – to be carved up into functional sub-units and then shared among multiple tenants, known as guest machines. Aggregating multiple server images onto blade servers using virtualization provides compelling operational benefits, including rapid recovery from failures, because it is so quick and easy to spin up a new guest machine using VMware. With current-generation processor, disk and networking hardware designed with virtualization in mind, guest machine performance approaches the performance of the same applications running on native hardware, so long as the virtualization Host itself is not overloaded. If the virtualization Host is not adequately provisioned, however, performance issues will arise due to contention for those shared resources.

Diagnosing performance problems in the virtualization environment can, unfortunately, be quite complicated. This is partly due to the fact that the configuration itself can be quite complicated, especially when a typical VMware Host is managing many guest machines. In addition, there are often many VMware Hosts interconnected to a shared disk IO farm and/or networking fabric. When any of this shared hardware infrastructure becomes overloaded and performance suffers, the task of sorting out the root cause of this problem can prove quite daunting.

The focus of this series is on the impact of sharing physical memory, or RAM. To support aggressive server consolidation, the VMware Host grants physical memory to guest machines on demand. By design, VMware allows physical memory to be over-committed, where the overall amount of virtualized physical memory granted to guest machines exceeds the amount of actual machine memory that is available. VMware also looks for opportunities for guest machines to share hardware memory pages when the contents of any two (or more) pages are identical. Identical guest machine pages, once identified, are mapped to a single, common page in RAM.

The outline for this series of blog posts is as follows. I begin with a brief introduction to virtual memory management concepts. This is largely a basic review of the topic and the terminology. If this is an area you already understand well, feel free to skip over it.

Next, I discuss the specific approach to virtual memory management used in VMware. In this section, I will stick to information on virtual memory management that is available from published VMware sources. Much of the existing documentation is, unfortunately, very sketchy.

Finally, I will analyze a case study of VMware under stress. The case study vividly illustrates what happens when the VMware hypervisor confronts a configuration of guest machines that demands access to more physical memory addresses than are available on the underlying hardware configuration.

The case study analyzed here proved very instructive. It provides an opportunity to observe the effectiveness of the strategies VMware employs to manage virtual memory and the potential impact of those strategies on the performance of the underlying applications running on virtualized hardware whenever there is significant contention for RAM.

If you are ready to start reading, the first part of this series of blog posts is here.
Posted in memory management, VMware | No comments