The Pieces of MRG

December 4, 2007 | Red Hat Enterprise MRG Team | 7-minute read

Contribution by Carl Trieloff, Senior Consulting Software Engineer at Red Hat

Today, the Red Hat crew announced a new offering called “Red Hat Enterprise MRG.” What is MRG? It is an interesting set of technologies (Messaging, Realtime and Grid) which, when combined, we believe will provide unprecedented value, power and flexibility to our customers. We have received feedback from our customers stating that, in the same way they were able to get better performance using less hardware and at less cost when moving to Red Hat Enterprise Linux from other operating systems, they expect the same will happen when using MRG. This provides the next big building block in Red Hat’s Linux Automation story. More on this later. First, let’s look at some of the pieces.

Realtime

Although the realtime component of MRG is a replacement kernel, we have kept glibc and the math libraries the same so that no application changes are needed and there is binary compatibility between the standard kernel and the MRG realtime kernel, making it a drop-in replacement. Realtime isn’t primarily about being quicker, but rather about determinism, i.e. consistency and predictability. So far, we have seen good results on customer workloads ranging from Tibco to Wombat to IBM’s realtime Java implementation. Realtime is not a panacea: it can’t make up for or resolve issues in a badly written application. Rather, realtime works best when the application has been tuned on stock Red Hat Enterprise Linux and then switched to MRG’s realtime kernel to provide determinism. So tune, then flip in MRG Realtime to get better determinism. Additionally, MRG Realtime has a greater amount of preemption to be able to deliver its determinism. This comes at the cost of slightly higher CPU usage. From our testing with the messaging component of MRG, we have found that realtime works best in boxes up to 80 percent utilized. A good example is that we have seen a 19 percent run-to-run variability on the stock kernel and a <1 percent variance for the same 10 million message run on a tuned MRG Realtime kernel. These results will vary based on the specifics of what the application is doing, but this illustrates the point.
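To make “run-to-run variability” concrete, here is a minimal sketch of one way such a figure could be computed. The definition used (spread between the slowest and fastest run, as a percentage of the mean) and the function name are assumptions for illustration, not necessarily the metric behind the 19 percent and <1 percent figures, and the sample timings are made up.

```python
from statistics import mean


def run_to_run_variability(run_times):
    """Spread between slowest and fastest run, as a percentage of the mean.

    ASSUMPTION: this definition is illustrative; the post does not say how
    its variability figures were calculated.
    """
    if not run_times:
        raise ValueError("need at least one run")
    return 100.0 * (max(run_times) - min(run_times)) / mean(run_times)


if __name__ == "__main__":
    # Hypothetical wall-clock times (seconds) for repeated runs of one test.
    noisy_runs = [100.0, 110.0, 90.0]
    steady_runs = [100.0, 100.5, 99.5]
    print(f"noisy:  {run_to_run_variability(noisy_runs):.1f}%")
    print(f"steady: {run_to_run_variability(steady_runs):.1f}%")
```

A deterministic system is one where this number stays small across repeated runs, even if the mean itself is no faster.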

Messaging

The messaging part of MRG is built using the AMQP specification. What is AMQP (Advanced Message Queuing Protocol)? Years back, when I started our messaging work at Red Hat, I wanted to build a messaging platform that could be used for both OS-level functionality and enterprise applications. We started by looking at what many of our customers had done in-house to solve similar issues. Many projects have been written in the industry in this space. In the end we teamed up with John O’Hara from JPMC. Why? John had been working on a similar vision, and the key part was that he had written down what he, his team and his contractors had done. Having a document as a solid starting point meant that we could create a “Working Group” to collaboratively write an open, royalty-free specification for messaging that anyone could implement. The working group began with six companies from the OS, network, middleware and end-user domains. Additional companies have joined and continue to join the group. The AMQP working group has made significant progress since its inception, and you can expect to see the next public release from the working group soon. If you are hardcore, you can go here and browse the subversion repository and JIRA for AMQP, since all of the AMQP working group’s technical work for upcoming releases is done in the open.

I submitted a project proposal, Qpid, to Apache. It is now in incubation and implements AMQP. The Qpid implementation has a diverse set of individuals contributing to it. This project serves as our ‘upstream’ for the messaging component in MRG. MRG takes this code base, tests it, adds some modules and integrates it with the other MRG components.

The MRG implementation of AMQP is not just a “me too” implementation; it is breaking new ground. If you download the initial MRG beta, you will see not only a broker, but also Java, JMS, C++ and Python clients in the initial beta repo. For Beta 2, we will add clustering, InfiniBand plugins, and a management console. Ruby and .NET clients will also be included in MRG for GA.

So what is some of this new ground in MRG Messaging? First off would be the async, non-blocking O_DIRECT persistence store for the broker. The Qpid implementation has been tracking the changes to the specification from the AMQP working group as additions have been voted in and made public. Some of the work done on the to-be-numbered version of AMQP has benefits for durable messaging, specifically the ability to acknowledge asynchronously in ranges, allowing for single-message reliability without head-of-line blocking. What does this mean? We have been able to work with the Red Hat kernel team and AIO engineers to create a journal for MRG that can write at around full LUN speeds for sustained message transfers. With its custom-written page caches, this store writes to the SAN (one of our test arrays) at the rate of ~400,000 DMA write blocks per second, which equates to ~500,000 small durable messages (enqueue/dequeue) per second per LUN, using a fraction of one CPU core to run the store for this specific disk array in this test. If the disk array is scaled up, these numbers will increase proportionally. As a rule of thumb, the AIO disk rate divided by the record size (172 bytes + data size, rounded up to the 128 byte boundary) is within 5 percent of your durable message rate. Sure, it will be slower if DTX (XA distributed transactions) is used, but we are redefining what can be achieved in single-message reliable transfer. We also have high goals for transient message rates, on which we are still working, and we will update that information during the beta. Extensive work is being done by Red Hat and our partners (see the MRG partners page) to characterize MRG Messaging so that you can compare meaningful rates for your specific message size and type, as rates and CPU usage vary based on these.
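The rule of thumb for durable message rates can be sketched as a small calculation. This is a rough illustration under stated assumptions, not the store's actual accounting: it assumes the AIO disk rate is expressed in bytes per second, that the 172 byte overhead and the payload are summed before rounding up to the next 128 byte boundary, and that the estimate is the disk rate divided by that record size (the direction that yields messages per second). The function names and the 100 MB/s disk rate in the example are hypothetical.

```python
import math

RECORD_OVERHEAD_BYTES = 172  # per-message overhead quoted in the post
BLOCK_BYTES = 128            # records are rounded up to this boundary


def record_bytes(payload_bytes):
    """Estimated on-disk size of one message record.

    ASSUMPTION: overhead and payload are summed first, then rounded up to
    the next multiple of 128 bytes.
    """
    raw = RECORD_OVERHEAD_BYTES + payload_bytes
    return math.ceil(raw / BLOCK_BYTES) * BLOCK_BYTES


def estimated_durable_rate(aio_bytes_per_sec, payload_bytes):
    """Rough durable messages per second for a given sustained AIO write rate.

    The post says the real rate should land within about 5 percent of this.
    """
    return aio_bytes_per_sec / record_bytes(payload_bytes)


if __name__ == "__main__":
    disk_rate = 100_000_000  # hypothetical sustained AIO write rate, bytes/s
    for payload in (64, 256, 1024):
        rate = estimated_durable_rate(disk_rate, payload)
        print(f"{payload:5d} byte payload -> {record_bytes(payload):5d} byte record, "
              f"~{rate:,.0f} msgs/s")
```

Because of the rounding, payloads of 1 to 84 bytes all cost the same 256 byte record under these assumptions, which is one reason rates need to be compared at your own message size.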

MRG Messaging has goals for impressive numbers. But don’t be fooled by numbers alone. It is easy to make test results look good. For example, another AMQP implementation recently published a rate of 1.3M transient messages per second. However, if you look at the details on its user list (small print): “each datagram contained 16 OPRA messages, and was sent as one 256 byte AMQP message…. Ingress of 80,000 AMQP messages per second.” And this was done on a 16-CPU Caneland box. Numbers can be made to say anything: batch, multiply by the batch count and round up. Enough said.
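The arithmetic behind that headline figure is easy to reproduce. The sketch below uses only the two numbers quoted above (80,000 AMQP messages per second, 16 OPRA messages per AMQP message); the function name is illustrative.

```python
def headline_rate(wire_msgs_per_sec, batch_size):
    """Headline figure obtained by counting every batched item as a message."""
    return wire_msgs_per_sec * batch_size


if __name__ == "__main__":
    wire_rate = 80_000  # AMQP messages per second actually sent
    batch = 16          # OPRA messages packed into each AMQP message
    total = headline_rate(wire_rate, batch)
    print(f"{wire_rate:,} AMQP msgs/s x {batch} = {total:,} batched items/s")
    print(f"rounded: {total / 1e6:.1f}M")
```

The normalized rate, the one comparable across implementations, is the 80,000 AMQP messages per second, not the product.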

So what do we look at? The important numbers are the normalized message rate, end-to-end latency and CPU usage. In addition, these numbers change based on mode (fan-out, topic, p2p, pub-sub, durable, transient, dtx, etc.). Not only do we optimize to achieve the best rate/latency/CPU profile, but we will also make it possible to cluster the broker for scale-up. MRG’s clustering module utilizes technologies from the Red Hat Enterprise Linux 5.1 OS-level clustering facilities. The MRG team is also working to set a new baseline of value together with the MRG realtime kernel team and our partners. (This is the best crew I have ever worked with.) Expect more data from this quest at the time of GA.

Some of the other interesting work Red Hat is doing for MRG Messaging is around native InfiniBand support, the management console and clustering. To avoid a very long post, I will leave those for another day.

Grid

Let me turn to the ‘G’ in MRG now before I put it all together. Grid (or High Performance Computing, Distributed Computing, Cloud Computing, High Throughput Computing, High Productivity Computing, whatever you may call it) is the ability to schedule work at scale, whether for time-critical or for cycle-stealing workloads. Red Hat has worked with the University of Wisconsin to open source Condor jointly under an OSI license, making it possible for Condor to be included in projects like Fedora. In addition, Red Hat is building out a team locally, working in tandem with the University of Wisconsin Condor project team. We share a joint vision and believe that by working together, the community at large will benefit. There is still a lot that can be done around this technology; however, it is clear what can be achieved. If you look at the announcement of the Red Hat/Amazon EC2 deal, you can imagine scheduling critical work with options ranging from cycle stealing, for workloads that are not time-critical, to going out to the cloud for rented capacity, all from a single job and management interface. This is only a preview of what is to come.

MRG

MRG is bringing together a new platform, fabric or pick your favorite marketing term. Staying away from buzzwords, the point is that this combination of technologies is the next layer of Linux Automation, allowing a much larger set of tasks to be moved onto the Linux platform and into open source, and making way for the same type of cost savings and performance increases that moving to Linux has delivered for our customers.

One simple example of this is that MRG will provide a single management console for messaging and realtime, so not only can you look at your workloads, but you can also get kernel/OS data overlaid in the same time graphs. If day trading goes wild, or an e-store gets 100x orders on a given day, it is possible to see the rate of acceleration of transaction load, and at the same time see how your hardware (CPU, memory, etc.) is responding, plotted against transaction load. The next step is the ability to update the scheduling of lower-priority tasks, or live migrate them, to make capacity, or to cross-utilize infrastructure to deal with load. (We are even reusing MRG’s technology set internally and in other projects around the OS.) MRG has an almost unending set of possibilities to add scale, determinism and flexibility to the ability to run any application anywhere, at any time. I think of it as the freight train in Linux automation.

For more details, or to sign up for more information on the beta, go here.