Applicable release versions: AP 6.1, AP/Unix
Description: Describes a 'hot backup' configuration, where one machine is in standby mode, ready to take over the load from a failing system.
Some installations cannot tolerate down time due to a hardware or software failure, or must reduce it to a very short time. A solution more affordable than fault tolerance is to duplicate all necessary hardware resources and maintain the data base on two standard systems. This document examines the issues involved in this 'hot backup' configuration, its advantages, its limitations and the system administration procedures.
'Hot Backup' Solution Overview
The 'hot backup' configuration involves two systems: a 'master' system, which is the system in operation, and a 'slave' system, which is in stand-by mode. The two machines are connected by a fast TCP/IP connection. Users are normally connected to the main system. The backup system is also booted and holds a copy of the main system's data base. The two machines do not need to be absolutely identical: the backup machine just needs the resources (disk, memory, connectivity, ...) necessary to support the application(s).
During normal operations, all updates to the data base on the main system are applied to the backup system, over the network.
In case of a failure of the main system, the users are switched to the backup machine, and the application is restarted. The down time is limited to the switch-over time (possibly just the time for the terminal concentrators to establish an Ethernet connection to the other machine), and the data loss is limited to the updates not yet transmitted to the backup machine. This loss is usually a few seconds' worth of work.
Note that the backup machine is not necessarily idle. Other applications can be loaded on it. Also, since the backup machine has an exact copy of the data base, it can be used for editing reports, doing the file saves, etc...
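The update flow from master to slave can be sketched as follows. This is a hypothetical Python model, not the actual AP server code: the message format, the names and the gap check are all assumptions made for illustration. It shows the general idea of streaming each logical update, tagged with a sequence number, so that the receiving side can detect a lost or out-of-order message.

```python
# Hypothetical sketch of the master-to-slave update stream.
# The JSON message format and all names are illustrative assumptions,
# NOT the actual AP hot backup protocol.
import json

def encode_update(seq, file_name, item_id, data):
    """Master side: wrap one logical data base update with a sequence number."""
    return json.dumps({"seq": seq, "file": file_name,
                       "id": item_id, "data": data}).encode()

class SlaveApplier:
    """Slave side: apply updates in sequence order; raise on any gap."""
    def __init__(self):
        self.expected_seq = 1
        self.db = {}            # (file, item-id) -> data: stand-in for the data base

    def apply(self, message):
        update = json.loads(message.decode())
        if update["seq"] != self.expected_seq:
            raise RuntimeError(f"sequence gap: expected {self.expected_seq}, "
                               f"got {update['seq']}")
        self.db[(update["file"], update["id"])] = update["data"]
        self.expected_seq += 1

# Two updates made on the master are replayed, in order, on the slave.
slave = SlaveApplier()
for seq, (f, i, d) in enumerate([("INVOICES", "1001", "paid"),
                                 ("INVOICES", "1002", "open")], start=1):
    slave.apply(encode_update(seq, f, i, d))
print(slave.db[("INVOICES", "1001")])   # the backup copy now reflects the update
```

In the real configuration the messages travel over the TCP connection between the two machines; the point of the sketch is only the ordering and gap detection, which is what limits the loss on a failure to the updates still in flight.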
- The cost is less than a traditional fault tolerant solution, when absolute fault tolerance is not required. The second machine does not need to be as powerful as the main system. A slightly slower machine can be used, as long as it can provide an acceptable level of service should the main system become unavailable.
- The backup system is not necessarily idle. As long as the main data base is not updated on the backup machine, it can be used to edit reports and to do the file saves, which relieves the main system of that load; it can also be used for development, etc...
- The machines do not have to be physically close to each other. The machines can be in two different locations, which provides protection against major accidents.
- Since the updates are applied at the logical level, as opposed to mirroring of the data on disks, by a system process which is different from the application process used on the main machine, operating system failures are less likely to create corruptions on the backup system.
- The slave system can be the backup of more than one master system. A slave system with a very large disk capacity can act as an on-line archive system for several applications.
- On AP 6.1, the amount of data loss in case of system failure is uncontrolled. If the network bandwidth is sufficient, the amount of lost data will be 'small', but unknown. This can create problems on some applications. This problem is corrected on AP versions 6.2 and later.
- The system administration and recovery procedures require some manual interventions. The system relies heavily on Unix networking which must be understood by the System Administrator.
Main System Failure Recovery
This section outlines the operations required to recover from a failure of the main system. After the failure, the users are switched to the backup system and the application is restarted. The main machine is repaired and must now be brought back to the same level as the backup machine.
While the main machine is down, the data base on the backup machine naturally keeps evolving. All updates on the backup machine are recorded, using the transaction logger mechanism. If the repair time of the main machine is expected to be short (a few hours), the transaction journal can be left on disk. If the repair time is expected to be longer, it is probably better and safer to write the transactions to tape.
Assuming the main machine's data base has been completely destroyed, following a multiple disk crash, re-synchronizing the main machine 'simply' involves doing a full save on the backup machine, restoring it on the main machine, and switching the users back to the main system. The problem is that the file save and restore operation can be very long, potentially taking days. It would obviously be unacceptable to stop operations during this time. Therefore, while the save and restore proceeds, updates to the data base must be logged. On version 6.1 and later, the updates can be stored on tape, since multi-tape is supported. On earlier versions, there is no choice but to do the logging on disk. After the restore has been completed on the main machine, the transactions which have accumulated during the save/restore operation are applied to the main data base. During this transaction log load, it is likely that more updates will be done on the backup machine, resulting in more transaction tapes. Depending on the volume of data, there may be a few iterations of this process: load a transaction log tape on the main system while more transaction tapes are being created on the backup machine. Eventually, the two systems will be almost in sync. The users are then disconnected from the backup machine, the very last transactions are written to a final tape, and this tape is loaded on the main system. All operations must stop for this short time. Both systems are now in sync. Users can be reconnected to the main machine, the transaction log across the network can be restarted from the main machine to the backup, and the system is operational again.
If there is enough disk space on the backup machine, and if the down time of the main system (including the file save/restore) is expected to be 'small', it is possible to leave all the updates on disk. Re-synchronizing the two machines is then simpler: after the restore, start the hot backup process across the network from the BACKUP machine, which now acts as the 'master', TO the MAIN machine, now acting as the 'slave'. This will transfer all the updates made to the backup machine. When the queue is emptied, the users can be switched back to the main machine. This avoids tape manipulation, but involves a higher risk, should a major problem occur on the backup system.
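The iterative catch-up described above can be modeled with simple arithmetic. The Python sketch below is a toy model with assumed figures (updates per hour and transfer rates are invented, not measurements); it only illustrates why the backlog shrinks on each pass, as long as transactions can be loaded faster than new ones are created, until a short final freeze clears the rest.

```python
# Toy model of the iterative transaction-log catch-up.
# All rates and volumes are invented for illustration.

def catch_up(initial_backlog, update_rate, transfer_rate, freeze_threshold):
    """Return (passes, residue): how many log loads are needed before the
    remaining backlog is small enough for the final freeze."""
    backlog, passes = initial_backlog, 0
    while backlog > freeze_threshold:
        hours = backlog / transfer_rate      # time to load the current log
        backlog = update_rate * hours        # updates accumulated meanwhile
        passes += 1
    return passes, backlog

# 100,000 queued updates; 1,000 new updates/hour; 10,000 loaded/hour;
# stop iterating once fewer than 500 updates remain for the final freeze.
passes, residue = catch_up(initial_backlog=100_000, update_rate=1_000,
                           transfer_rate=10_000, freeze_threshold=500)
print(passes, residue)   # each pass shrinks the backlog by a factor of 10
```

With these assumed figures the backlog drops 100,000 to 10,000 to 1,000 to 100 over three passes; if the update rate exceeded the transfer rate, the loop would never converge, which is why network bandwidth matters in this configuration.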
Backup System Failure Recovery
If the backup system fails, a procedure similar to the one described for the main system recovery must be applied. The only difference is that the users are never stopped. Essentially, a full save is taken from the main machine and restored on the backup machine, then all the updates are applied to the backup machine. The only impact on normal operations is a higher system load due to the file save, and, obviously, a higher risk, since there is no backup.
Making sure it works
This configuration is usually applied to very large data bases, and making sure everything works and that no data loss occurs is of utmost importance. Network reliability is obviously critical. The various processes (servers) involved in the communication constantly check on each other, assign numbers to the messages on the network, and also make sure the transaction logging mechanism itself is operating normally by periodically writing some test data and making sure the updates are sent over. The System Administrator can control the data bases by periodically running some application report and making sure the results are identical. All network incidents, as well as unusual circumstances, are reported to a predetermined list of users, so that an incident does not stay unnoticed for a long time. The section "hot-backup, TCL" describes the major system incidents and suggests some corrective actions.
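The periodic report comparison can be as simple as comparing digests of the two outputs rather than the full reports. A minimal Python sketch, where the report text is a made-up stand-in for the output of an actual application report run on each machine:

```python
# Compare the same application report run on both machines by digest.
# The report text below is invented; in practice it would be captured
# from the report run on each system.
import hashlib

def report_digest(report_text):
    """Digest of a report's output, cheap to compare between the two systems."""
    return hashlib.sha256(report_text.encode()).hexdigest()

main_report   = "INVOICES total: 1,204 items, balance 58,310.00"
backup_report = "INVOICES total: 1,204 items, balance 58,310.00"

in_sync = report_digest(main_report) == report_digest(backup_report)
print("data bases agree" if in_sync else "MISMATCH - investigate")
```

Comparing digests avoids shipping a full report across the network; any difference in the two outputs, however small, changes the digest and flags the data bases for investigation.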
System Setup
To set up a 'hot backup' system, the System Administrator must do the following steps. Each operation is detailed in the section "hot-backup, TCL".
- Establish a network between the two systems. This network must support TCP/IP (e.g., Ethernet, Token Ring, etc...). The System Administrator must set the network names of both systems, even though only the receiver's host name is used. The hot backup connection only requires access to TCP. Other elements like NFS, FTP, etc... are not required.
- Determine a free TCP/IP port number. Use the "netstat -a" Unix command to see what is currently in use. A value like 2000 or 3000 is usually safe.
- Load the Pick data base on the main machine. This will include setting the application, the user files, etc...
- On the master system, determine which files are going to be set as DL, i.e., for which files updates will be sent to the backup machine. It is generally not advisable to set the system so that all updates to all files are sent to the backup system, since this has the side effect of also mirroring system files. It is better to exclude the 'dm' account from the transaction log. Use the "set-dptr" TCL command to change the attributes of files and/or accounts.
- Do a save of the main machine, and restore it on the backup system. This can also be done over the network, as detailed in the Advanced Pick Reference Manual section "network save/restore, General". Alternatively, the save/restore can be done on tape.
- Setup the servers on both systems (see the section "Server Setup" in the Advanced Pick Reference Manual documentation "hot-backup, TCL").
- Start the master and slave servers.
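As a complement to the "netstat -a" check in the steps above, a candidate port can also be tested programmatically: try to bind it, and if the bind fails, the port is in use. A minimal Python sketch (the candidate port list follows the 2000/3000 suggestion above and is otherwise an assumption):

```python
# Check whether a candidate TCP port is free by attempting to bind it.
# Candidate values follow the "2000 or 3000 is usually safe" guideline.
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to the given TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

candidate = next(p for p in (2000, 3000, 4000) if port_is_free(p))
print(f"using TCP port {candidate} for the hot backup connection")
```

A successful bind only proves the port is free at that instant; the chosen number must still be configured identically on both systems and kept out of use by other services.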