Caché fits into all common high-availability configurations supplied
by operating system providers including Microsoft, IBM, HP, and EMC. Caché
provides easy-to-use, often automatic, mechanisms that integrate with
the operating system to provide high availability.
There are four general approaches to system failover. In order of increasing
availability they are:
- No failover
- Cold failover
- Warm failover
- Hot failover
Each strategy has varying recovery time, expense, and user impact, as
outlined in the following table.
There are variations on these strategies; for example, many large enterprise
clients have implemented hot failover and also use cold failover for disaster
recovery.
It is important to differentiate between failover and disaster recovery.
Failover is a methodology to resume system availability in an acceptable
period of time, while disaster recovery is a methodology to resume system
availability when all failover strategies have failed.
If you require further information to help you develop a failover and
backup strategy tailored for your environment, or to review your current practices,
please contact the InterSystems Worldwide Response Center (WRC).
With no failover in place, your Caché database integrity is still
protected from production system failure. Structural database integrity is
maintained by Caché write image journal (WIJ) technology. Logical integrity
is maintained through global journaling and transaction processing. While
WIJ, global journaling, and transaction processing are optional, InterSystems
highly recommends using them.
If a production system failure occurs, such as a hardware failure, the
database and application are generally unaffected. Disk degradation, of course,
is an exception. Disk redundancy and good backup procedures are vital to mitigate
problems arising from disk failure.
With no failover strategy in place, system failures can result in significant
downtime, depending on the cause and your ability to isolate and resolve it.
If a CPU has failed, you replace it and restart, while application users wait
for the system to become available. For many applications that are not business-critical,
this risk may be acceptable. Customers who adopt this approach share the
following common traits:
- Clear and detailed operational recovery procedures
- Well-trained, responsive staff
- Ability to replace hardware quickly
- Disk redundancy (RAID and/or disk mirroring)
- Enabled global journaling and WIJ
- 24x7 maintenance contracts with all vendors
- Expectations from application users who tolerate moderate downtime
- Management acceptance of risk of an extended outage
Some clients cannot afford to purchase adequate redundancy to achieve
higher availability. With these clients in mind, InterSystems strives to make
Caché 100% reliable.
A common and often inexpensive approach to recovery after failure is
to maintain a standby system to assume the production workload in the event
of a production system failure. A typical configuration has two identical
computers with shared access to a disk subsystem.
After a failure, the standby system takes over the applications formerly
running on the failed system. Microsoft Windows Clusters, HP MC/Serviceguard,
Tru64 UNIX TruClusters, OpenVMS Clusters, and IBM HACMP provide a common approach
for implementing cold failover. In these technologies, the standby system
senses a heartbeat from the production system on a frequent and regular basis.
If no heartbeat arrives for a defined period of time, the standby system
automatically assumes the IP address and the disk formerly associated with
the failed system. The standby can then run any applications (Caché,
for example) that were on the failed system. In this scenario, when the standby
system takes over the application, it executes a pre-configured start script
to bring the databases online. Users can then reconnect to the databases that
are now running on the standby server. Again, WIJ, global journaling, and
transaction processing are used to maintain structural and data integrity.
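
The heartbeat-and-takeover sequence described above can be sketched in a few lines. This is a minimal illustration only, not the mechanism of any of the cluster products named here: the class `StandbyMonitor`, its fields, and the 10-second timeout are all hypothetical, and the start script is represented by a plain callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StandbyMonitor:
    """Hypothetical standby-side monitor; not part of any cluster product."""
    timeout_seconds: float            # silence tolerated before takeover (assumed value)
    start_script: Callable[[], None]  # stands in for the pre-configured start script
    last_heartbeat: float = 0.0
    active: bool = False
    log: List[str] = field(default_factory=list)

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the production system."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Take over once if no heartbeat has arrived within the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout_seconds:
            self.log.append("assume IP address and shared disk")
            self.start_script()       # bring the databases online
            self.active = True
        return self.active


started: List[str] = []
monitor = StandbyMonitor(
    timeout_seconds=10.0,
    start_script=lambda: started.append("databases online"),
)
monitor.heartbeat(now=0.0)
print(monitor.check(now=5.0))    # heartbeat is recent, standby stays passive
print(monitor.check(now=20.0))   # heartbeat missed for more than 10 s, standby takes over
```

Timestamps are passed in explicitly so the takeover decision is easy to follow and to test. A real cluster manager would also guard against both systems believing they own the disk at the same time, which this sketch does not attempt.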
Customers generally configure the failover server to mirror the main
server with an identical CPU and memory capacity to sustain production workloads
for an extended period of time. The following diagram depicts a common configuration:
Cold Failover Configuration
Note:
Shadow journaling, where the production journal file is continuously
applied to a standby database, includes inherent latency and is therefore
not recommended as an approach to high availability. Any use of a shadow system
for availability or disaster recovery needs should take these latency issues
into consideration.
The warm failover approach exploits a standby system that is immediately
available to accept user connections after a production system failure. This
type of failover requires concurrent access to disk files, such as that
provided by OpenVMS clusters and Tru64 UNIX TruClusters.
In this type of failover, two or more servers, each running an instance
of Caché and each with access to all disks, concurrently provide access
to all data. If one machine fails, users can immediately reconnect to the
cluster of servers.
A simple example is a group of OpenVMS servers with cluster-mounted
disks. Each server has an instance of Caché running. If one server
fails, the users can reconnect to another server and begin working again.
Warm Failover Configuration
The 600 users on A and C are unaware of B's failure, but the 300 users
that were on the failed server are affected.
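
The effect of a single server failure on user sessions can be modeled as follows. This is a rough sketch under stated assumptions: the function `reconnect_after_failure` is hypothetical, the 600 users on A and C are assumed to be split evenly at 300 each, and the round-robin spread of reconnecting users is chosen only for illustration.

```python
from typing import Dict, List


def reconnect_after_failure(sessions: Dict[str, List[str]], failed: str) -> List[str]:
    """Move users of the failed server onto the surviving servers.

    Returns the users who had to reconnect. Users are spread round-robin
    across survivors, which is an assumption made for illustration.
    """
    survivors = sorted(name for name in sessions if name != failed)
    if not survivors:
        raise RuntimeError("no surviving server to reconnect to")
    displaced = sessions.pop(failed)
    for i, user in enumerate(displaced):
        sessions[survivors[i % len(survivors)]].append(user)
    return displaced


# Three servers, assumed to carry 300 users each.
sessions = {name: [f"{name}-user{i}" for i in range(300)] for name in ("A", "B", "C")}
affected = reconnect_after_failure(sessions, failed="B")
print(len(affected))                                         # users forced to reconnect
print({name: len(users) for name, users in sessions.items()})  # load on the survivors
```

Only the sessions that were on the failed server appear in `affected`; the sessions already on the surviving servers are never touched, which matches the behavior described for users on A and C.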
The hot failover approach can be complicated and expensive, but comes
closest to ensuring 100% uptime. It requires the same failover capability as
cold or warm failover, and additionally requires that the state of a running
user process be preserved so that the process can resume on a failover server.
One approach, for example, uses a three-tier configuration of clients and
servers.
Hot Failover Configuration
Thousands of users on terminal browsers connect through TCP sockets
to a bank of application servers. Each application server has a backup server
ready to automatically start in case of a server failure. In turn, the application
servers are each connected to a bank of data servers, each with its own backup
server.
If a data server fails, any application server waiting for a response
automatically resubmits its request to a different data server while the backup
server is started. Similarly, any user terminal that sends a request to an
application server that fails automatically reissues its request to an alternate
application server.
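
The automatic resubmission described above might look like the following from the point of view of an application server. This is a minimal sketch, not InterSystems code: `ServerDown`, `submit_with_failover`, and the two stand-in data servers are hypothetical names, and each data server is modeled as a plain function.

```python
from typing import Callable, List


class ServerDown(Exception):
    """Raised by a hypothetical data server that cannot answer."""


def submit_with_failover(request: str, servers: List[Callable[[str], str]]) -> str:
    """Try each data server in turn, resubmitting the same request on failure."""
    for server in servers:
        try:
            return server(request)
        except ServerDown:
            continue  # resubmit the request to the next data server
    raise ServerDown("no data server answered")


def failed_server(request: str) -> str:
    # Stand-in for a data server that has gone down.
    raise ServerDown("data server 1 is down")


def healthy_server(request: str) -> str:
    # Stand-in for an alternate data server that is still running.
    return f"result for {request}"


response = submit_with_failover("lookup 42", [failed_server, healthy_server])
print(response)
```

Blind resubmission is only safe when repeating a request cannot apply the same update twice, so a real implementation would typically rely on transaction processing or request identifiers to make retries harmless.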