Replication between data centers with JBoss Data Grid

Bela Ban, Red Hat

Red Hat Summit Boston, June 2013

What is data center replication ?

What problems does it solve ?

How is it done ?

How do I set it up ?

Demo ?


  • Data center (site): a local cluster of N nodes in which data is stored redundantly on multiple nodes
  • Replication: sending modifications to all nodes which have the data
  • Site master: a node in the site which is responsible for relaying of modifications to other sites
  • xsite: cross-site (traffic), e.g. replication between data centers
  • JDG: JBoss Data Grid, a cluster in which data is stored redundantly (productized Infinispan)
## What is replication between data centers ?
## What problems does it solve ?
## Site failure An entire site failing ? What are the chances of that ? Slim. But if it happens it can have a big (negative) impact on a company !
## What can cause a site failure ? * All nodes crashing at the same time (very unlikely !) * Switches/routers crashing, causing partitions * If we favor CP in CAP, the site may turn read-only * Clients losing access to the site due to connectivity loss * Catastrophic events * Floods, hurricanes (Sandy), terrorist attacks (9/11) * Data centers under water (World Trade One by fire water) * Sandy flooded Lower Manhattan with corrosive saltwater
## Follow the sun * Not only data centers, but also client are located around the world * E.g. clients in LON, BOS, SFO * First the LON clients are up, then BOS, then SFO * Clients want to access their local DC * A LON client accessing the SFO DC has latency of ~30 ms * Requirements * Clients are directed to the data center closest to them * Follow the sun: only one DC is active at any time * Replication between active and passive DCs
## How is xsite replication done ? ### The whole picture

Example of async xsite replication

  • The TX originator completes TX and forwards changes to the local site master (as one message)
  • Site master relays message to backup site master(s)
  • Backup site master starts local TX and applies TX
## Drill down ### A look at two JDG nodes in a site
## How do I set this up ? ### (Configuration)
### Demo * Can be downloaded from []( * Provides XML configuration for sites LON, BOS and SFO * Provides scripts to start nodes in each site * Simply copy the config and start scripts into a JDG installation * Can be run on the same host
### Running lon1: ./ lon1 0 translates into ./ -c lon.xml -Dsite=LON -b \ -Djboss.socket.binding.port-offset=0 -u ### Running lon2: ./ lon2 500 translates into ./ -c lon.xml -Dsite=LON -b \ -Djboss.socket.binding.port-offset=500 -u ### Running bos1: ./ bos1 1000 translates into ./ -c bos.xml -Dsite=BOS -b \ -Djboss.socket.binding.port-offset=1000 -u
## Roadmap
### SiteMaster as bottleneck * All update requests are handled and applied by the SiteMaster * This creates a bottleneck, thread pool exhaustion * ==> SiteMaster forwards update requests to random node within the same site for processing
### State transfer between sites * When a new site is started, grab the state from another site * The new site is only operational when the entire state has been transferred * Pull request submitted, needs to be integrated
### IRAC * Infinispan Reliable Asynchronous Clustering * Synchronous transactional replication is not good * Slow due to latency between sites * Increased risk of TX rollback due to many involved nodes * IRAC provides * (Almost) the speed of async repl with * the reliability of sync transactional replication * Smaller bandwidth requirement between sites * [](
### Summary * Xsite replication backs up a site to one or more sites * Can be used to implement follow-the-sun, or as backup for site failures * Easy to configure * Try it out for yourself with JDG 6.1
### Links * __JDG__: []( * __Infinispan__: []( * __JGroups__: []( * __Demo__: []( * __This talk__: [](