Original Link: https://www.anandtech.com/show/3425
The secrets of Virtual benchmarketing
by Johan De Gelas on August 17, 2008 12:00 AM EST - Posted in
- Virtualization
Marketing and benchmarking can lead to some pretty smart, but very misleading results. As promised, I will give you a very good example, one promoted by some of the smartest people in the industry, in fact. But first, a bit of background.
If you read our article about the nuts and bolts of virtualization, you might remember that paravirtualization is one of the best ideas in the virtualization space. Basically, a paravirtualized guest OS is adapted so that it hands over control to the hypervisor whenever necessary. One of the big advantages is that the hypervisor has to intervene far less than with hypervisors that do not use paravirtualized guest OSs. Xen, the flagship of paravirtualization, has a lot of other tricks to offer excellent performance, such as the fact that drivers in the guest OS are linked to real native Linux drivers (in domain 0).
The basic concept and technology behind Xen are - in theory - superior to hypervisors that make use of emulation and/or binary translation. It is one of the weapons XenSource (and other Xen-based virtualization solutions) can leverage against the undisputed king of virtualization land, VMware.
XenSource came up with a number of benchmarks that were meant to prove that Xen was by far superior performance-wise. "Near Native Performance" became the new battle cry when the Xen benchmarketing team stormed the VMware stronghold. Unfortunately, some of the benchmarks are nothing more than marketing, and have little to do with reality.
To be honest, Xen can in fact offer superior performance in some virtualization scenarios. We'll show you in one of our detailed articles. And the other virtualization vendors have come up with some pretty insane benchmarks too, and will again.
The purpose of exposing this benchmark is to help you recognize bad virtualization benchmarks. We have the greatest respect for Ian Pratt and his team, as they made one of the most innovative and valuable contributions to the IT market. But that doesn't mean we shouldn't be critical of the benchmarks they present :-)
Here it is:
Yes, it is a benchmark that is already a year old. But it is still a very important benchmark to XenSource. Simon Crosby (CTO) talks about it in a blog post of July 2nd, 2008:
"....at a time when Xen already offers Linux a typical overhead of under 1% (SPECJBB)..."
So what is wrong? Several things.
No work for the hypervisor
SPECjbb (2005) hardly touches the hypervisor. Less than 1% of the CPU time is spent in the kernel [1]. To put this in perspective: even a CPU-intensive load such as SPECint spends about 5% of its time in the kernel, and a typical OLTP workload makes the OS work for 20 to 30% of the CPU time.
The hypervisor hardly ever intervenes, so a SPECjbb test is one of the worst ways of showing how powerful your virtualization technology is. It is like sitting in a Porsche in a traffic jam and saying to your companion: "do you feel how much horsepower is available in my newest supercar?"
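As an illustration, here is a minimal Python sketch of how that kernel-time fraction can be estimated from two /proc/stat "cpu" samples - the same information vmstat summarizes in its "sy" column. The field layout follows proc(5); the sample values below are made up to mimic a SPECjbb-like run, not measured data.

```python
# Sketch: estimate the fraction of CPU time spent in the kernel from two
# /proc/stat "cpu" snapshots. Fields after "cpu" are, in jiffies:
# user, nice, system, idle, iowait, irq, softirq, steal (proc(5)).

def kernel_time_fraction(before: str, after: str) -> float:
    """Return system (kernel) time as a fraction of all CPU time
    elapsed between two /proc/stat 'cpu' lines."""
    def fields(line):
        return [int(x) for x in line.split()[1:]]
    b, a = fields(before), fields(after)
    delta = [x - y for x, y in zip(a, b)]
    total = sum(delta)
    system = delta[2] + delta[5] + delta[6]  # system + irq + softirq
    return system / total if total else 0.0

# Hypothetical samples: a CPU-bound Java run that barely enters the kernel.
t0 = "cpu  10000 0 50 2000 100 5 5 0"
t1 = "cpu  19800 0 130 2100 120 10 10 0"
print(f"{kernel_time_fraction(t0, t1):.1%} of CPU time in the kernel")
# → 0.9% of CPU time in the kernel
```

On a live Linux box you would read /proc/stat twice a few seconds apart instead of using canned strings; the point is simply that a workload like this gives the hypervisor almost nothing to do.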
No I/O
SPECjbb2005 from SPEC (the Standard Performance Evaluation Corporation) evaluates the performance of server-side Java by emulating a three-tier client/server system with emphasis on the middle tier. Instead of a potentially disk-intensive database system, SPECjbb stores its data in tables of objects implemented with Java Collections rather than in a separate database.
For native testing, that is wonderful. It means that you do not have to set up an extremely expensive disk system (contrary to TPC-C), and it makes the benchmark a lot easier and faster to run. SPECjbb 2005 gives you an idea of how your specific CPU + memory + JVM + JVM tuning combination can perform. If your own Java application looks a lot like SPECjbb, keeping the disk system out of the benchmark is not such a bad idea: you can always size your disk system later.
In the case of a virtualized benchmark scenario, excluding (disk) I/O is a very bad idea. Depending on the virtualization scenario (RAID card, available drivers, 32 vs 64 bit, etc.), the hypervisor overhead of accessing the disk can range from insignificant to "a complete performance disaster".
To make a long story short, SPECjbb 2005 produces neither disk nor network activity, and those are the two main reasons why some virtualization solutions stumble and fall. Cut those out of the benchmark, and the value of your "hypervisor comparison" drops like a bank share after a credit crisis.
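To get a feel for the I/O overhead a SPECjbb-only comparison hides, a crude probe like the one below can be run once natively and once inside a guest; the gap between the two numbers is exactly the cost the benchmark never exercises. This probe and its parameters are our own sketch, not anything XenSource published.

```python
# Sketch: average latency of a small synchronous write (write + fsync),
# the kind of operation a database-backed middle tier performs constantly
# and SPECjbb 2005 performs never. Run natively and in a VM and compare.

import os
import tempfile
import time

def fsync_latency_ms(writes: int = 100, size: int = 4096) -> float:
    """Average milliseconds per write()+fsync() of `size` bytes."""
    block = b"\0" * size
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / writes * 1000.0

print(f"avg write+fsync latency: {fsync_latency_ms():.2f} ms")
```

The absolute number depends heavily on the disk and filesystem; what matters for a hypervisor comparison is the ratio between the native and virtualized runs.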
Native performance is too low
One of the weaknesses of the current Xen 3.x is that it does not support large pages. The use of large pages improves the performance of server workloads, and SPECjbb is no exception. It is well known that large pages can boost the performance of SPECjbb 2005 by about 20%. So it is pretty clear that large pages were not enabled when XenSource tested native performance: performance was lower than it could have been. In the real world, if performance really matters, large pages will be enabled, especially now that both the Windows and Linux platforms make it so much easier.
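A back-of-the-envelope sketch shows why large pages matter so much for a heap-heavy workload like SPECjbb: with a fixed number of TLB entries, 2 MB pages let the CPU reach far more of the Java heap without a TLB miss than 4 KB pages do. The entry counts below are illustrative round numbers, not tied to any specific CPU of the era.

```python
# Sketch: memory reachable without a TLB miss, 4 KB vs 2 MB pages.
# Entry counts are illustrative assumptions, not a real CPU's TLB sizes.

def tlb_coverage_mb(entries: int, page_size_kb: int) -> float:
    """Memory (in MB) addressable through `entries` TLB slots."""
    return entries * page_size_kb / 1024

small = tlb_coverage_mb(entries=512, page_size_kb=4)     # 4 KB pages
large = tlb_coverage_mb(entries=32, page_size_kb=2048)   # 2 MB pages
print(f"4 KB pages: {small:.0f} MB covered; 2 MB pages: {large:.0f} MB covered")
# → 4 KB pages: 2 MB covered; 2 MB pages: 64 MB covered

# For reference: on Linux, a HotSpot JVM can use large pages via the
# -XX:+UseLargePages flag once hugepages are reserved (vm.nr_hugepages);
# on Windows, via the "lock pages in memory" privilege.
```

With a multi-gigabyte SPECjbb heap walking its object tables, the difference between covering a couple of megabytes and tens of megabytes per TLB translates directly into the roughly 20% gain the article mentions.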
So what does the XenSource SPECjbb benchmark prove? That if the hypervisor sees almost no action and you have an application that does no I/O whatsoever (unlike any real-world server application), virtualized performance is very close to native performance ... native performance that is not well tuned, that is. In other words: the message that this benchmark brings is close to meaningless.
Don't get us wrong. This is not a post against virtualizing your workloads. Most virtualized workloads out there perform very close to native, and all the goodies that virtualization brings (fast provisioning and incredible cost savings, to name a few) make it more than worth paying a small performance penalty. But quite a few of the benchmarks out there are quite misleading. If a vendor really wants to show how powerful its hypervisor is, it has to show a benchmark that stresses the hypervisor, not one that leaves it alone. We will be showing quite a few benchmarks soon.
[1] Measured with vmstat in Linux in our lab. "Evaluating Non-deterministic Multi-threaded Commercial Workloads" (2002) by Alaa R. Alameldeen, Carl J. Mauer, Min Xu, Pacia J. Harper, Milo M. K. Martin, Daniel J. Sorin, Mark D. Hill, and David A. Wood shows similar results for SPECjbb 2000.