Performance By Design: a blog devoted to Windows performance, application responsiveness and scalability


Monday, 28 February 2011

Rules in PAL: the Performance Analysis of Logs tool

Posted on 20:30 by Unknown
In spite of their limitations, some of which were discussed in an earlier blog entry, rule-based bromides for automating computer performance analysis endure. A core problem that the rule-based approach attempts to address is that, with all the sources of performance data that are available, we are simply awash in esoteric data that very few people understand or know how to interpret properly. At a bare minimum, we need expert advice on how to navigate through this morass of data, separating the wheat from the chaff, and transforming the raw data into useful information to aid IT decision-making. Performance rules that filter the data, with additional suggestions about which measurements are potentially important, can be quite helpful.
A good, current example is the Performance Analysis of Logs (PAL) tool, a free Windows counter analysis tool developed by Clint Huffman, a performance analyst of the first rank who works in Microsoft’s Premier Field Engineering team. PAL serves the need of someone like Clint, a hired gun who walks into an unfamiliar environment where they are experiencing problems and wants to be able to size up the situation quickly. PAL uses pre-defined data collection sets to gather a core set of Windows counter data and thresholds to filter the results. Threshold templates are stored in files that can easily be edited and changed. The program potentially fills an important gap; it is capable of analyzing large quantities of esoteric performance measurement data gathered using Perfmon and, when used properly, returns a manageable number of cases that are potentially worth investigating further.
As a diagnostic tool, PAL is deliberately designed merely to scratch at the surface of a complex problem. In the hands of a skilled performance analyst like Clint, it does an excellent job of quickly digging through the clutter in a very messy room. Understanding that these are mainly filtering rules, heuristics that help a skilled performance analyst size up a new environment quickly, is the key to using a tool like PAL effectively; that, plus a healthy skepticism about how much analysis these simple, declarative “expert” performance rules can provide apart from the expertise and judgment of the person wielding the tool.
Setting rule thresholds.

PAL defuses the problem that “experts” are likely to disagree over what the threshold values for various rules should be by giving you access to those settings so that they can be easily changed. Experts, of course, love to argue about whether the threshold for the rule on the desirable level of, for example, processor utilization is 70% or 80% or 90% or whatever. IMHO, these arguments are not a productive way to spend time. Rather than obsessing over whether 80% or 90% CPU busy is the proper value for the rule’s firing threshold, I like to transform the debate into a discussion about the reasoning the “expert” used in selecting that specific target value. This tends to be a much more productive discussion. Knowing what the specific threshold setting depends on is often useful information. If you understand why a specific threshold setting was suggested, it helps you decide whether or not the rule is appropriate for your needs, whether the threshold should be adjusted for your specific environment, and so on.
Consider the CPU, for example. The rule about excessive levels of CPU utilization is shorthand for what can happen to the performance of your application if, when it needs to use a processor, it finds that the CPUs are all busy with other, higher priority work. When the processors are overloaded, threads are delayed waiting to be scheduled.
If an application thread is Ready to execute, but all the CPUs are currently busy doing higher priority work, the thread is delayed in the OS Scheduler’s Ready Queue. In addition, if the application thread is executing, but some higher priority thread needs to execute (often, as a result of a device Interrupt being received and processed) and no other CPU is available, then the higher priority thread will preemptively interrupt the currently running thread. If this happens frequently enough, the throughput and/or response time of the application will be impacted by these queuing delays that are associated with thread scheduling.
Unfortunately, the amount of queuing delay in thread scheduling is not measured directly in any of the available Windows performance counters. Since thread scheduling delays should only occur when the CPU resources are approaching saturation, measurements of processor utilization, which are readily available, are used as a proxy for detecting thread queuing problems directly. If it were possible to gather measurements of thread execution queuing delay directly, the arguments over what CPU busy threshold to alert based on would surely evaporate. (Note: I described how to use the CSwitch and ReadyThread events in Windows to measure thread execution queue time directly from ETW in an earlier blog post entitled “Measuring Processor Utilization and Queuing Delays in Windows applications.”)
So, the first complication with a CPU busy rule is that processor utilization is really a proxy for a measurement that is not available directly. Fortunately, queuing theory can be invoked to predict mathematically the relationship between processor utilization and thread CPU queue time. A simple M/M/n queuing model predicts the following relationship between processor utilization and queue time, for example:
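(As a point of reference, and summarizing the standard textbook result rather than anything taken from PAL: in the single-server special case, M/M/1, the response time R is just the average service time S inflated by the utilization ρ, while the general M/M/n case weights the queue time Wq by the Erlang C probability C(n, nρ) that all n servers are busy at once.)

      R_{M/M/1} \;=\; S + W_q \;=\; \frac{S}{1 - \rho}

      W_q^{(M/M/n)} \;=\; \frac{C(n,\, n\rho)\; S}{n\,(1 - \rho)}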
Figure 1. Response time vs. utilization in a simple M/M/n model.
The chart in Figure 1 shows response time, which is the sum of service time + queue time, rising sharply as the utilization of some resource begins to approach 100%. This particular chart uses the assumption that the average service time of a request (without queuing) is 10 ms., and then calculates the overall response time observed as utilization at the server is varied. At 50% utilization (and this leads to a nice Rule of Thumb), the queue time = the service time, according to the model, which means that overall response time is 2 * the average service time. At 75% utilization, the queue time increases to 3 * service time; an increase to 80% utilization increases the queue time to 4 * the service time, etc.
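To make that arithmetic concrete, here is a minimal sketch (mine, not something that ships with PAL) that assumes the single-server M/M/1 special case, where response time = service time / (1 - utilization), and reproduces the multipliers quoted above:

# A quick check of the numbers above, assuming the M/M/1 special case:
# response time R = service time S / (1 - utilization); queue time = R - S.

SERVICE_TIME_MS = 10.0  # the 10 ms. average service time assumed for Figure 1

def mm1_response_time(utilization, service_time=SERVICE_TIME_MS):
    """Response time (service + queue time) predicted by an M/M/1 model."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time / (1.0 - utilization)

for u in (0.50, 0.75, 0.80, 0.90, 0.95):
    r = mm1_response_time(u)
    q = r - SERVICE_TIME_MS
    print("%3.0f%% busy: response time = %6.1f ms, queue time = %.0fx the service time"
          % (u * 100, r, q / SERVICE_TIME_MS))

Past 90% busy, the predicted queue time dwarfs the service time, which is the “knee” discussed next.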
The overall shape of the response time curve in Figure 1 is one which swoops upward towards infinity, with a characteristic “knee” corresponding to a steep spike in response time as the server approaches saturation. Assuming for the moment that the response time curve depicted in Figure 1 is a realistic representation of OS thread scheduling, you should be able to see how this curve motivates formulating a rule that alerts us whenever CPU busy exceeds a 75%, 80%, or 85% threshold.
The response time curve in Figure 1 does model behavior that we often observe when computer systems begin to experience performance problems: the problems seem to arise suddenly, as if out of nowhere. The M/M/n response time curve mimics that behavior. Response time remains relatively flat while the system is lightly loaded; performance does not degrade gradually. Then, in the face of some capacity constraint or bottleneck, response times spike. As utilization of the CPU increases, an M/M/n model predicts that thread execution time will elongate due to queuing delays. According to the curve shown in Figure 1, queuing delays are apt to be significant at, say, 80% utilization. A performance rule that fires when the processor is observed running at 80% or higher encapsulates that knowledge.
But in order to formulate a useful diagnostic rule that warns us reliably to watch for potential thread execution time delays when the processor is running close to saturation, we need to become more familiar with the M/M/n queuing model and understand how realistically it can be used to represent what actually goes on during OS thread scheduling. It turns out that the specific response time curve from an M/M/n model, depicted in Figure 1, is not an adequate model for predicting queuing delays in OS thread scheduling.
Multiple CPUs. The first important wrinkle in applying the formula to a threshold rule is how to calculate utilization. In the case of a single CPU (in which case we are dealing with an M/M/1 model), the calculation is straightforward, and so is the rule. When there are multiple CPUs and a thread can be scheduled for execution on any available engine, however, the utilization value that is fed into the model is the probability that all CPUs are busy simultaneously. This is known as the joint probability. A ready thread is only forced to wait for a processor if all CPUs are currently busy. If each processor CPUi is busy with probability pi, the joint probability that all n processors are busy is p1 * p2 * … * pn, or p^n when every processor is equally busy. For example, if each CPU in a 4-way multiprocessor is 80% busy, the probability that all CPUs are busy simultaneously is 0.8^4, or about 41%.
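Running that arithmetic in the other direction shows how large the adjustment has to be (my back-of-the-envelope calculation, not a threshold that ships with PAL): for a joint-probability rule on a 4-way machine to be equivalent to an 80% rule on a single CPU, each processor would have to average

      p \;=\; 0.8^{1/4} \;\approx\; 0.946,

or roughly 95% busy.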
Clearly, a rule threshold that is based on how busy one processor is needs to be adjusted upward significantly when multiple CPUs are configured. At a minimum, the rules engine should calculate the joint probability that all CPUs are busy. Elsewhere, I have blogged about all sorts of interpretation issues involving the % Processor Time counters in Windows. So long as it is understood that the CPU busy threshold is a proxy for CPU queue time delays, it is OK to ignore most of these considerations. How to interpret these CPU utilization measurements on NUMA machines, SMT hardware, processor hardware that dynamically overclocks itself, or under virtualization are all extremely funky issues. The fact that OS thread scheduling in Windows uses priority queuing with preemptive scheduling means that the simple response time formula from an M/M/n model is not a very adequate representation of reality. Finally, given that a thread does not compete with itself for access to CPU resources, you should really calculate CPU utilization relative to all other, higher priority executing threads.
At this point, let’s not try to go there. However, if you have to make a specific hardware or software recommendation for a mission-critical application that also involves a good deal of time and money changing hands, all of these murky areas may need to be investigated in detail.
To summarize: a simple processor utilization rule has value as a filtering rule, based on the potential response time delays predicted by simple queuing models when there is heavy contention for CPU resources, provided that the rule threshold is based on calculating the joint probability that all CPUs are busy, and that the rule’s firing is understood to be a proxy for a direct measurement of thread execution time delays due to CPU queuing. A minimal sketch of what such a filtering rule might look like follows.
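Here is that sketch, assuming per-CPU % Processor Time samples expressed as fractions; it is an illustration of the joint-probability idea only, and the function and parameter names are mine, not PAL’s actual rule format or default thresholds.

from math import prod

def cpu_queuing_rule(per_cpu_busy, threshold=0.80):
    """Fire when the probability that ALL CPUs are busy at once exceeds the threshold.

    per_cpu_busy -- average utilization of each logical CPU, as fractions (0.0-1.0)
    threshold    -- the single-CPU busy level the rule is meant to be equivalent to
    """
    joint_probability = prod(per_cpu_busy)  # p1 * p2 * ... * pn
    return joint_probability >= threshold

# A 4-way box averaging 80% busy per CPU does NOT fire an 80% rule:
# the joint probability is only 0.8 ** 4, about 41%.
print(cpu_queuing_rule([0.80, 0.80, 0.80, 0.80]))  # False
print(cpu_queuing_rule([0.96, 0.95, 0.97, 0.94]))  # True (joint probability ~0.83)

The point of keeping the threshold a parameter is that the rule stays a filter: when it fires, the next step is to look at thread queuing directly, not to declare a CPU problem.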
More on the Rules in PAL in the next post.
Posted in windows-performance; context switches; application-responsiveness; application-scalability; software-performance-engineering | No comments

Wednesday, 16 February 2011

Watson computer smoking hot at Jeopardy challenge

Posted on 12:18 by Unknown
Well, the contest isn't over yet, but the outcome looks like a foregone conclusion. After two days, the Watson computer is poised to defeat the two human champions it is playing. The computer’s performance has been impressive, to say the least, and has left the human contestants looking dazed and confused.
And who wouldn’t be? The computer was both ruthless & relentless. (There I go, anthropomorphizing again.) The two human champions were barely able to answer a question or two as Watson virtually ran the board on the second day of the competition. Watson, which has to generate an answer in real time, was so successful at beating the human contestants to the punch that it generated speculation about whether the computer had some kind of unfair time advantage from being fed the questions electronically. As reported here (thanks, Phillip), according to IBM, Watson actually cedes a slight “reaction time” advantage to the human contestants. Given how successful Watson is at determining the correct answer so quickly, I think it would be more sporting to give the poor, deserving human players an even bigger head start. Hey, give us a break!
After day 1, the computer and one of the contestants were tied, and it looked as if things would get interesting. After Tuesday’s totally one-sided shellacking, though, commentators were reduced to wondering about the few missteps and obvious quirks that the computer did exhibit on occasion. See, for example: http://www.wired.com/epicenter/2011/02/watson-does-well-and-not/, which analyzes the prodigious strengths the program displayed, as well as describing its few weak spots.
I am afraid that the computer is so good at answering trivia questions that the contest isn’t turning into much of a drama. (It is turning into a great promo, though, for the IBM Watson Research lab.)
However, it remains a challenge of mythic proportions, which is very cool. Like John Henry, the steel-driving man vs. a steam-powered machine, or Charlie Chaplin trapped inside the assembly line in “Modern Times.” On Ray Kurzweil’s web site (he is the author of “The Singularity is Near”), I can almost hear the champagne glasses clinking.
Posted in artificial intelligence; automated decision-making; Watson; Jeopardy | No comments

Sunday, 13 February 2011

The Smartest Machine on Earth Plays Jeopardy

Posted on 15:41 by Unknown
I don't know if anyone out there besides me saw the NOVA TV show "Smartest Machine on Earth" about the IBM Research Watson computer. Watson is scheduled to play two human Jeopardy champions on TV on Monday-Wednesday (Feb 14-16) of next week. I thought the show was excellent.
Here's a link to the broadcast: http://www.pbs.org/wgbh/nova/tech/smartest-machine-on-earth.html.


If you are interested in going deeper, the current issue of AI Magazine is devoted to Question Answering, and contains an article by the Watson researchers. After the IBM Deep Blue chess computer successfully challenged the reigning human chess champion in 1997, AI researchers at IBM turned to other “hard problems” in AI. I am not much of a chess player myself, but I enjoyed following the progress of man against machine at the time, and I expect to tune in to watch the new IBM software play Jeopardy next week.
I admit I enjoy the drama of these human vs. computer challenges. A computer that plays Jeopardy models the famous “Turing test” for artificial intelligence proposed by mathematician and computer pioneer Alan Turing. Today, the Turing test has been largely supplanted by John Searle’s Chinese room thought experiment, a challenge to the AI research agenda that is taken quite seriously. This, perhaps, explains why IBM is willing to spend millions of dollars on this Jeopardy effort.
Essentially, Searle’s philosophical argument is that humans have minds, while computer programs that perform automated reasoning based on encoded rules do not. Searle’s challenge encapsulates the gulf between syntax in language, which is indisputably governed by formal rules, and semantic knowledge, which may or may not be. The gulf between syntax and semantics is very wide indeed, but it is one that many AI researchers are actively engaged in trying to bridge. (Things like the Semantic Web come to mind.)
Of course, I also found the show relevant in the context of my current blog topic, where I have been discussing rule-based “expert systems” approaches to computer performance analysis. As I have written earlier, I am not a huge fan of the approach, but I do acknowledge some of its benefits, particularly in filtering very large sets of performance-oriented data, like the ones associated with huge server farms, for example. My assessment of the value of the rule-based, automated reasoning approach does appear to square with current academic thinking in the AI world. Today, engineering-oriented approaches dominate much of the current research in AI. The emphasis of the machine learning approach, for example, is on the underlying performance of the system, not the extent to which the cognitive capabilities of humans are modeled or imitated.
The NOVA show on Watson featured several AI luminaries from the academic world. Doug Lenat, a prominent AI researcher at Stanford who is still pursuing the rule-based approach, was on camera. Lenat’s current focus is a reasoning engine in which millions of “common sense” rules are represented in CycL, a language he developed that is derived from the predicate calculus. On the NOVA program, Lenat said that the CYC knowledge base currently consists of more than 6 million assertions in CycL.
A sample CycL assertion looks like this:
(#$implies
      (#$isa ?A #$Animal)
      (#$thereExists ?M
         (#$and (#$mother ?A ?M)
          (#$isa ?M #$FemaleAnimal))))

CycL is certainly interesting as an example of a Knowledge Representation (KR) language. The problem is that, by nature (pun intended), biological categories are messy. If you think about it, the assertion in the example should probably say something about the mother object being the same species as its offspring. This is both an important biological and logical constraint. The assertion I learned in biology is:
a IsA Animal => thereExists m such that a HasA femaleParent m,
      where m IsA Animal and a.species == m.species
This, if you think about it, also implies that a new species coming into existence is a (bio)logical contradiction. I don’t know why creationists don’t argue this; the logical inconsistency seems pretty explicit to me, but, perhaps, their positions aren’t grounded in logic to begin with.
The CycL rule doesn’t even mention animals like snails that are hermaphrodites and can self-fertilize their own eggs, a pretty neat trick, but not entirely unknown in the Animal Kingdom. It turns out that there is more to heaven and earth than is dreamt of in this set of categorical Rules that evaluate as either true or false using an automated reasoning program. Whether individual specimens belong to the same or different species is often in dispute. I remember learning in science class that there were nine planets in our solar system; now astronomers aren’t so sure. Poor Pluto. It has been demoted. There are some people that are devastated by the demotion. Poor Pluto and its acolytes.
In KR, this is known as the problem of ontologies. The problem is that the differences between a planet, an asteroid, and a comet are not always clear-cut. Worse, we are blind to our own tacit assumptions. A central thesis of cultural anthropology is the extent to which reality is culturally determined. Lévi-Strauss, in La pensée sauvage, argues that plant and animal classification schemes used by so-called “primitive” societies are no less rigorous than the one we use that originated with Linnaeus. The American linguist (and darling of the Left) George Lakoff also writes about the socially-constructed, culturally-determined “cognitive models” that shape our thinking in “Women, Fire, and Dangerous Things.” We see the world “through a glass, darkly.” We are like the prisoners in Plato’s cave who mistake the shadows on the wall for reality.
Less philosophically, there are mathematical and logical objections to the automated reasoning approach. The fact that first-order logic is undecidable (a line of results running from Gödel’s incompleteness theorems through Church and Turing), or that computer programs of arbitrary complexity are subject to the Halting problem (Turing, again), ought to give proponents of the rule-based approach in AI pause, but it doesn’t seem to. They have faith in mathematical modes of reasoning that I guess I must lack.
Given some of these inherent limitations, however, the trend in AI research today is away from Lenat’s rule-based reasoning approach. For instance, Terry Winograd also appeared in the NOVA show. When he was a graduate student at MIT, Winograd conducted ground-breaking research in AI, building a program called SHRDLU that could carry out simple tasks about a small domain of physical objects (called the Blocks world) using a natural language interface. (For a very amusing account of the origin of the name SHRDLU, see http://hci.stanford.edu/~winograd/shrdlu/name.html.) Winograd’s doctoral dissertation was later published as a book, “Understanding Natural Language” (currently out of print).
Back when I was in graduate school, Winograd’s SHRDLU program was considered one of the great success stories in “strong AI.” But then Winograd, one of the rising stars in AI, became disenchanted with the mechanistic reasoning approach he used in building SHRDLU, essentially a parser for a context-free grammar with back-tracking, which is a very rigid and limited approximation of natural language speech recognition. Winograd famously repudiated the rule-based reasoning approach to AI in a second book, “Understanding Computers and Cognition: A New Foundation for Design.” His critique, coming from someone deep within the orthodoxy, was notorious. But, in fact, if you look at the way computer technology is used in speech recognition today, it is very far removed from the approach Winograd used back in the day. (I am thinking of the statistical approach described in Jelinek, “Statistical Methods for Speech Recognition,” that relies on Hidden Markov Models.) These statistical techniques are quite effective in distinguishing human speech, but I doubt anyone would mistake them for simulating or imitating what it is we humans do when we converse with each other.
On the NOVA episode, Winograd demoed a version of Eliza, another celebrated AI program from the sixties that “simulated” conversing with a sympathetic therapist. The syntactically-oriented approach used in Eliza is easy to defeat, as Winograd demonstrated to some comic effect. Unlike Watson, the program could never hope to pass Searle’s Chinese Room test, although maybe today’s computers, several orders of magnitude more powerful, can.
Despite Eliza’s simple-minded capabilities, many human subjects who interacted with Eliza were comfortable having extended “conversations” with the computer, which surprised its author, given how limited a range of human interaction the program imitated. What seems to happen with Eliza is that human subjects project human attributes onto something that exhibits or mimics recognizably human behavior. Cognitive scientists claim we develop in early childhood a “Theory of Mind” that aids us in social interaction with other humans, something clinical researchers noted was absent in autistic children. When we encounter a computer that walks like a duck and quacks like a duck, it is normal for us to assume it is a duck. Similarly, participants in the Eliza experiments naively assumed that the replies Eliza generated reflected empathy from a recognizably human mind.
Searle’s Chinese room challenge turns Eliza on its head. It begins with a skeptic’s perspective: can the computer program present a thoroughly convincing simulation of human interaction? Can it tell a joke, can it be ironic, or coin a metaphor? Can it be intuitive? Can it truly exhibit sympathy? These are human qualities and capabilities, products of our evolution, that may require elements that are not wholly logical.
Finally, Tom Mitchell, one of the prominent researchers in the machine learning school, was featured on the NOVA show. Mitchell wrote the first textbook on the subject in 1997. Several of the recently minted PhDs at Microsoft Research I worked with on computer performance issues were trained in Mitchell’s “machine learning” approach. Machine learning is a broad term, encompassing a variety of (mainly) statistical techniques for improving the performance of task-oriented computer programs through iteration and feedback (or “training”). The Watson Jeopardy-playing computer is programmed using these machine learning techniques.
The iteration and feedback aspects of the machine learning approach are really trial and error, or more succinctly, error-correcting, procedures. Not only can they be quite effective, they also seem to model the incremental and adaptive procedures that biological agents (like homo sapiens) use to learn a new skill or hone an existing one. The Watson computer trains on Jeopardy questions, and its learning algorithms are modified and adjusted to improve the probability that the program will choose the correct answers. Similarly, if you are human and you want to get better at answering questions on the SAT exam, you take an SAT prep course where you practice answering a whole lot of questions from previous exams. Some of what you might learn in the class helps you with the content of the test (like vocabulary and rules of English grammar). But learning about the kinds of questions and the manner in which they are asked – on an exam where questions are often deliberately designed to trick or confuse you – can also be extremely helpful. Having Watson train on a dataset of existing Jeopardy questions is essentially the same, proven strategy.
In the upcoming televised contest, Watson is competing against two reigning Jeopardy champions, the most skilled human contestants alive. I don’t know whether Watson vs. the human Jeopardy champions is going to be David vs. Goliath or Achilles vs. Hector, but I expect it will be a very intriguing human drama.
Posted in artificial intelligence; automated decision-making; Watson; Jeopardy | No comments