tcl.hot-backup Verb: Access/TCL

Command tcl.hot-backup Verb: Access/TCL
Applicable release versions: AP 6.1, AP/Unix
Category TCL (746)
Description Allows the setup and control of a hot backup configuration. Without any arguments, a menu is displayed; this is the normal form of operation. See the section "hot backup, General" for a discussion of the important notions of the hot backup configuration.


Using the menus :
All operations are controlled through menus. If the terminal allows it, arrow keys can be used where indicated:
ENTER Validate the highlighted choice.
number From 0 to 9. Select the corresponding choice. '0' selects option 10.
CTRL-N Move cursor down. (down arrow)
CTRL-B Move cursor up. (up arrow)
CTRL-X Cancel. Applicable only when input is requested.
ESC Quit. Go back to previous menu, or back to TCL. This key can be used to terminate all menus.
Q Quit. Go back to previous menu.
X Exit. Go back to TCL from any menu.

When the cursor is moved to a new field, a short help is displayed in the message area.


Screen layout :
The screen is divided into two sections:
- The menu section, where menus are displayed.
- The message section, where results, messages or help are displayed.


Definitions :
server
A background process. Servers work in pairs: one on each system.

master
The 'master' system is the main system, where users are normally working. By extension, the server running on the master system is the 'master server'.

slave
The 'slave' system is the backup of a master system. By extension, the server running on the slave system is the 'slave server'.

Transaction log queue
Updates to the master system are not transmitted directly to the slave system. Instead, they are put into a 'transaction log queue' on the master system, which the master server empties regularly in FIFO order. In normal operation, this queue should always be very small. When a transaction is extracted from the queue, a transaction number is allocated for it. A transaction is physically removed from the queue only when the slave server has acknowledged it, so that if anything goes wrong before, for example, the item is stored in the remote file, the transaction can be sent again to the slave.
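The acknowledge-before-remove behavior described above can be sketched as follows. This is only an illustrative model in Python, not the server's actual implementation; the class name, the method names and the numbering scheme starting at 1 are all assumptions made for the example.

```python
from collections import deque


class TransactionLogQueue:
    """Toy model: an entry is dropped only after the slave acknowledges it."""

    def __init__(self):
        self._pending = deque()   # updates not yet extracted, oldest first
        self._in_flight = {}      # transaction number -> update awaiting ACK
        self._next_number = 1     # hypothetical numbering scheme

    def enqueue(self, update):
        """Record an update made on the master system."""
        self._pending.append(update)

    def extract(self):
        """Take the oldest update (FIFO) and allocate a transaction number."""
        update = self._pending.popleft()
        number = self._next_number
        self._next_number += 1
        self._in_flight[number] = update  # kept until acknowledged
        return number, update

    def acknowledge(self, number):
        """Physically remove a transaction once the slave has ACK'ed it."""
        del self._in_flight[number]

    def unacknowledged(self):
        """Transactions that would be sent again after a failure."""
        return sorted(self._in_flight.items())


queue = TransactionLogQueue()
queue.enqueue("update item A")
queue.enqueue("update item B")
first, _ = queue.extract()
second, _ = queue.extract()
queue.acknowledge(first)
print(queue.unacknowledged())
```

In this model, the second transaction stays available for retransmission until its own acknowledgment arrives.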

Log slave updates
Normally, on a slave system, updates made by the slave server itself should not be logged in the transaction log queue. However, it is possible to set the system so that a slave system acts as a master system to another slave, to cascade updates. This is defined by the 'Log slave update' option.

Apply Updates
It is possible to instruct a slave server to NOT modify the database on the slave system. Updates received from the master system are NOT applied to the slave database. Instead they are stored in a temporary file. The purpose is to have a quiescent database on the slave system without having to stop the master system, to be able to do a full save of the data base, for example. Updates can then be re-applied to the slave database at a later time.
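As a rough illustration of the 'Apply Updates' behavior, the Python sketch below holds incoming updates in a list while applying is disabled and replays them in arrival order when applying is restarted. The class, its attributes and the use of an in-memory dictionary as the "database" are assumptions for the example; the real server stores pending updates in a temporary file.

```python
class SlaveApplier:
    """Toy model of a slave server that can defer incoming updates."""

    def __init__(self):
        self.database = {}         # stands in for the slave database
        self.pending = []          # stands in for the temporary hold file
        self.apply_updates = True

    def receive(self, item_id, value):
        """Apply an update immediately, or hold it if applying is disabled."""
        if self.apply_updates:
            self.database[item_id] = value
        else:
            self.pending.append((item_id, value))

    def stop_applying(self):
        """Leave the database quiescent, for example during a full save."""
        self.apply_updates = False

    def restart_applying(self):
        """Replay held updates in arrival order, then resume normal state."""
        for item_id, value in self.pending:
            self.database[item_id] = value
        self.pending.clear()
        self.apply_updates = True


slave = SlaveApplier()
slave.receive("a", 1)
slave.stop_applying()
slave.receive("b", 2)          # held, database unchanged
held = len(slave.pending)
snapshot = dict(slave.database)
slave.restart_applying()
print(snapshot, held, slave.database)
```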

File and account auto-create
The slave database is supposed to be an exact image of the master database. However, optionally, the slave server will attempt to create any account or file missing on the slave system.

Helper
On the slave system, long operations, like clearing large files, are not done by the slave server itself, since it would not be able to process any incoming data. The slave server creates phantom processes (helpers) to help it perform these lengthy operations.

Large file creation
Files with a modulo larger than 100 are created in a special way. A smaller file (101) is created by the slave server itself, and a helper process is created to resize the file to its actual size. This has the advantage that the file is created almost instantly, thus becoming available on the slave system, without stopping the slave server. The slave server checks periodically on the helper to make sure it completes.

Temporary hold files
The slave server sometimes creates temporary files on the slave system to hold data it cannot store in their final destination immediately. For example, when a clear file command is received, a helper does the actual clearing, and a temporary hold file is used so that updates to the file being cleared received AFTER the clear command are not also deleted.


D Pointer Controls :
In order to be able to write data in files, the slave server performs some checks on, and makes some changes to, the file-defining D pointers on the slave system:
- Update protections are removed. This is to ensure that the slave server will be able to write into the file.

- CALLX correlatives are changed to use explicit path names to the subroutines. The account and file name where the subroutine resides is found in the cataloged object in the MD of the account where the file is located. If the subroutine is not found, the CALLX correlative is removed.

- Bridges to other files are controlled to use an explicit path to the target file. If the file cannot be found, the bridge is removed.

However, the number of checks is necessarily limited. A-correlatives are not checked, and, obviously, references to files in BASIC subroutines are not checked either. Therefore, if precautions are not taken, it is possible that the slave server will abort trying to update such a file. Note that NO data loss would occur, though, because the offending transaction will not be acknowledged and will be retransmitted. Starting the server with the traces ON will identify which item and file cause the problem.

Installation :
Log on to the 'dm' account and type:
hot-backup
The first time this command is invoked, a server must be defined. The following menu is displayed:
Setup Server
1 New Server

Select and press <RETURN>

Type <RETURN> and fill in the appropriate information as explained in the section 'Setup Server' below.


Setup Server :
The 'Setup Server' menu is brought up automatically the first time 'hot-backup' is installed, or by selecting an option in the main menu.
Server name :
Any alphabetic string of up to 8 characters. This name will reference the server in all the other menus. If there is only one server defined on the Pick virtual machine, this server will always be implied in all commands.
Server type :
Master or Slave server. Depending on the server type, some of the remaining fields in the menu become inapplicable.
Host name :
Master only. Name of the Unix host where the associated slave server is located.
TCP port number :
For a Slave Server, TCP port number, from 1024 to 32767, on which it is listening. For a master server, TCP port number of the associated slave server.
Protocol : inet
Cannot be modified.
Log slave update :
Slave only. If 'ON', all updates received by the slave are logged in the transaction log queue, to be sent to another system. If 'OFF', the updates are not enqueued. The default is 'OFF'.
Check period :
Master only. Period, in seconds, with which the master server will perform a variety of system checks. 0 disables all checks. A period of 300 seconds (5 minutes) should be suitable for most systems.
TxLog timer :
Master only. Period, in seconds, with which the master server will empty the transaction log queue. If the queue is not empty, then the server empties it continuously. If it becomes empty at some point, the master server goes to sleep for this period. A small value (2 or 3 seconds) is suitable.
Notify list :
List of Pick users or Pick port numbers to notify in case of error, separated by commas. The port numbers are expressed in a syntax similar to the TCL command 'msg'. For example, to notify the users 'dm' and 'bob', and the line 0, whether it is logged on or not:
dm,bob,!0
Apply updates :
Slave only. 'ON' or 'OFF'. Instruct the slave server to apply the updates it receives from the associated master server immediately (ON) or delay them (OFF) until it receives an explicit command. Normally, this option should be set to 'ON'. Note that, when later disabled, the server definition will be modified automatically. When the slave server starts, it will report the status of this option. Applying updates should not be disabled for too long, since the slave data base is not an exact image of the master data base while updates are not applied.
Auto create :
Slave only. 'ON' or 'OFF'. Create automatically any missing account and file on the slave system. It is a good precaution to leave this option ON.
Comments :
Any text.

Confirm (y/n/q):
'y' to confirm the server creation. 'n' to go back to any of the previous fields. 'q' to quit and abandon.
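The constraints stated for the 'Server name', 'TCP port number' and 'Notify list' fields can be expressed as simple checks. The Python sketch below is only an illustration of those rules; the function names are hypothetical, and the menu itself may enforce the rules differently.

```python
def validate_server_name(name):
    """Server name: an alphabetic string of up to 8 characters."""
    return name.isascii() and name.isalpha() and len(name) <= 8


def validate_tcp_port(port):
    """TCP port number: from 1024 to 32767 inclusive."""
    return 1024 <= port <= 32767


def parse_notify_list(text):
    """Split a comma-separated notify list such as 'dm,bob,!0' into entries.

    Entries are returned as written; no attempt is made here to interpret
    the 'msg'-like port syntax.
    """
    return [entry.strip() for entry in text.split(",") if entry.strip()]


print(validate_server_name("fromprod"))
print(validate_tcp_port(2000))
print(parse_notify_list("dm,bob,!0"))
```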


Current Server Selection :
Most menu commands refer to a 'current' server, identified by its name. If there is only one server on the Pick virtual machine, then this server is always implied. If there is more than one server, the following menu is displayed when first entering 'hot-backup':
Operation title
1 server.name
2 server.name
...
Select and press <RETURN>

Type the selected number and <RETURN> to select a server. The 'current' server name is displayed under the hot backup header. In the following, 'selected server' will designate either the only server on the machine, or the current server selected by this menu. To select another 'current' server, hit 'ESC' or 'Q' back to the server selection menu.


Main Menu
1 Status
Display the last messages displayed by the selected server. The title of the message section on screen indicates the server name and additional information. The messages are displayed in reverse chronological order (most recent first). The server does not need to be running for this command.
2 Query Server
Query the selected server. The server must be running. See the section 'Query Server' below for the detail of the information returned.
3 Start Server
Start the selected server. This process displays various information about the starting of the server.
4 Stop Server
Stop the selected server. Confirmation is requested.
5 Show Server
Show the setup of the selected server. Type 'y' to return to the main menu. The description of the displayed information is identical to the 'Setup Server' menu above.
6 Show statistics
Transaction statistics. If run on the slave server, the result is displayed immediately, if the server is running. If run on the master server, a request is sent to the slave server and displayed later. Both servers must be running. The response time depends on how busy the slave server is. This command is a quick way of determining whether the communication is established. The statistics are accumulated by the slave server and cleared every time it restarts.
7 Display Queue
Display the status of the transaction log queue. The following information is shown:
n frames in queue
DL or ALL files logged
File updates enqueued for all processes
or
File updates no longer enqueued
Current transaction: nnn
Last ACK'ed transaction: nnn
Current transaction: nnn
Upd itm 'item.id' in 'file.reference'
The number of frames is a rough indication of the amount of data not yet transmitted. The current transaction number is the number of the transaction being transmitted, if non zero. The last ACK'ed transaction number is the last transaction which was received and successfully processed by the associated slave server. If the last ACK'ed number is not reported, this means the query was done at a moment when all transactions had been acknowledged. The last line is an indication of the first transaction not yet acknowledged by the associated server. This can be an update item, a delete item, etc. If a problem occurs, this is the first transaction which would be retransmitted.
8 Special operations
Perform less frequently used operations. See the section "Special Operations" below.
9 Transaction log menu
Control the transaction log queue. See the section "Transaction Log Menu" below.


Special Operations
1 Status
Display the last messages displayed by the selected server. Same as on the main menu.
2 Turn traces ON
Turn traces ON on the selected server. Traces slow down the server considerably and should not be left ON in normal operations. This is a debug tool only.
3 Turn traces OFF
Disable the traces on the selected server.
4 Setup Server
Set up a new server or modify the selected server. See the section 'Setup Server' above.
5 Delete server
Delete the selected server definition. Confirmation is requested.
6 List permanent log
List the permanent log of all messages logged by all servers. The log is shown in the message area on screen. Use CTRL-N to see the next entries, and CTRL-B to go back in the log. Type ':' (colon) to enter a special command line which allows searching for the first log entry after a given date or time, and searching for a string. Type 'q' or ESC to return to the special operation menu.
7 Clear permanent log
Clear the permanent log file. A sub menu offers the choice of clearing the whole file, entries older than one week, or older than one month. Confirmation is requested.
8 Stop applying updates
Slave only. Instruct the selected server to stop applying updates coming from its associated master server. The updates are stored in a temporary hold file 'hb.log,apply'. The number of pending updates is shown in response to a 'Query Server' operation from the main menu. Updates can be applied by selecting the next option.
9 (Re)start applying updates
Slave only. Apply the pending updates to the database, if any, and restart the normal state of applying new updates as they arrive.
10 Close opened files
Slave only. The slave server maintains a list of all the files it has opened since it started. If it becomes necessary to delete a file on the slave system, or after recovering from a missing file-of-files item on the master system (see the discussion below about the error message 'Cannot read FOF item'), the slave server must be instructed to close all the files it has opened. This option allows this. If the files are accessed again, the slave server will re-open them.


Transaction Log Menu
The transaction log menu should be used on a system which is set up as a master system. It controls what updates are transmitted to the associated slave system, and how.
1 Display queue
Display the status of the transaction log queue. Same as option 7 on the main menu.
2 Log DL files only
Log only the files which have a DL attribute in the D pointer. This option affects the entire system. This mode of operation allows fine control over what is being sent to the other system.
3 Log ALL files
Log updates to all files on the system. Note the slave server will NOT apply any modification to selected files in the 'dm' account on the slave server: abs, accounts, devs, errors, file-of-files, hb.log (hot backup control files), jobs, pibs resizing. This option should be used with caution, since the slave server may create accounts and files to receive data.
4 Stop enqueuing
Stop enqueuing updates. Confirmation is requested. This option should be used with extreme care, since, once selected, all updates to the master data base will NEVER be transmitted to the slave system. The only legitimate use of this option is to stop enqueuing updates following a major problem on the main system which required switching users to the slave system. This event is also logged in the 'errors' file.
5 Start enqueueing
(re)Start enqueuing updates. This event is also logged in the 'errors' file.
6 Clear log queue
Clear the transaction log queue. Confirmation is requested. All queued updates are lost. This event is also logged in the 'errors' file.


Query Server Results :
On a Slave server, the following information is displayed:

Server <servername>. PIB <pib>. PID <pid>. Server type [Slave|Master]. Traces [ON|OFF].
Total [recvd|sent] 289K. {Cur trans 1041.} ACK'ed trans 1039. {updates {NOT} applied.}
{n file(s) failed to open successfully}
{n pending update(s)}
Status <status>

Where:

servername
Name of the server.

pib
Pick port number of the server.

pid
Unix process id of the server.

Cur trans
Slave only. Current transaction number received by the slave server. This number should increase as data is exchanged between the servers.

ACK'ed trans
Current transaction number successfully acknowledged by the slave server. Unless the queue is empty, the ACK'ed transaction number should be slightly less than the transaction number received by the slave.

{updates {NOT} applied}
Slave only. If present, indicates whether updates received from the master system are currently applied to the slave database.

Total [sent|recvd]
Total size, in kilobytes, sent or received for the current day. Note that the master reports a slightly higher number than the slave, due to additional protocol information.

{n file(s) failed to open successfully}
Slave only. If present, indicates that the slave server received requests to open files and was not able to do so. At any given time, there may be a few pending open files, so repeat the command to confirm that there really is a problem. Examine the permanent log file to get more information.

{n pending updates}
Slave only. If present, indicates that the updates to the slave data base are currently disabled, and that there are 'n' pending updates.

Status
Server status:

Reading network
This is the normal state of the slave server.

Wait response to call
Master only. The master server is attempting to call its associated slave.

Wait incoming call
Slave only. The slave server is waiting for its associated master server to establish a communication.

Stopped
The server has stopped. The previous message in the status or permanent log file indicates the reason for the termination.

Connected
Intermediate state right after a connection is established.

Writing to network
Sending data to the associated server. If this state persists and no data transfer occurs, this indicates the receiver is not reading data. There is a 30 second time out after which the connection will be shut down and restarted.

Waiting for ACK
Master only. Wait for an acknowledgment from the associated slave server.

Idle
Master only. The queue is empty and there is no traffic.

Reading queue
Master only. Intermediate state while the master server is extracting messages from the transaction log queue.
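For automated monitoring, the 'Query Server' display could be parsed along the following lines. This Python sketch assumes the text matches the layout shown above exactly; the sample text (server name, PIB, PID) is made up for the example, and the function name and dictionary keys are hypothetical.

```python
import re

# First line of the display, as laid out in the format above.
HEADER = re.compile(
    r"Server (\S+)\. PIB (\d+)\. PID (\d+)\. "
    r"Server type (Slave|Master)\. Traces (ON|OFF)\."
)


def parse_query_result(text):
    """Extract the documented fields; optional fields are simply omitted."""
    result = {}
    header = HEADER.search(text)
    if header:
        result["server"] = header.group(1)
        result["pib"] = int(header.group(2))
        result["pid"] = int(header.group(3))
        result["type"] = header.group(4)
        result["traces"] = header.group(5)
    total = re.search(r"Total (recvd|sent) (\d+)K\.", text)
    if total:
        result["total_kb"] = int(total.group(2))
    cur = re.search(r"Cur trans (\d+)\.", text)
    if cur:
        result["cur_trans"] = int(cur.group(1))
    acked = re.search(r"ACK'ed trans (\d+)\.", text)
    if acked:
        result["acked_trans"] = int(acked.group(1))
    applied = re.search(r"updates (NOT )?applied", text)
    if applied:
        result["updates_applied"] = applied.group(1) is None
    status = re.search(r"^Status (.+)$", text, re.MULTILINE)
    if status:
        result["status"] = status.group(1).strip()
    return result


# Hypothetical sample display for a slave server.
sample = (
    "Server fromprod. PIB 12. PID 3456. Server type Slave. Traces OFF.\n"
    "Total recvd 289K. Cur trans 1041. ACK'ed trans 1039. updates applied.\n"
    "Status Reading network\n"
)
info = parse_query_result(sample)
print(info["cur_trans"] - info["acked_trans"])
```

The difference between the current and ACK'ed transaction numbers gives a quick view of how far acknowledgments lag behind reception.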


Non Menu Operation :
It is possible to perform some operations from TCL by specifying a 'command' on the TCL line. If there is more than one server, the server must be specified by a 'server=server.name' argument. This form is useful to perform some automatic commands in macros.

'command':

start
Start the specified server.

stop
Stop the specified server.

status
Display the status information (same as option 1 on the main menu) of the specified server.

queue
Display the transaction log queue information (same as option 7 in the main menu).

debug
Force the specified server to enter the BASIC debugger. This option is for testing only. The server can then be debugged using tandem.
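When the non-menu form is generated by a script, the command line can be assembled from the documented syntax 'hot-backup {{command} {server=name}}'. The Python helper below is a hypothetical sketch that only builds the text of the TCL line; it does not run anything, and it assumes that 'server=' is meaningful only together with a command.

```python
COMMANDS = ("start", "stop", "status", "queue", "debug")


def build_command(command=None, server=None):
    """Build a 'hot-backup' TCL line from an optional command and server."""
    parts = ["hot-backup"]
    if command is None:
        if server is not None:
            # Assumption: the server argument is only used with a command.
            raise ValueError("server= requires a command")
        return parts[0]  # menu form
    if command not in COMMANDS:
        raise ValueError("unknown command: " + command)
    parts.append(command)
    if server is not None:
        parts.append("server=" + server)
    return " ".join(parts)


print(build_command())
print(build_command("start"))
print(build_command("status", "fromprod"))
```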


Main Error Messages :

This section lists the main error or warning messages that can either be displayed in the message area or logged in the permanent log file.

ERROR: This command applies only to Slave servers
This command can only be used when a slave server is selected. Some functions cannot be performed by master servers.

Not a phantom. Running on PIB xxx. PID yyy
This message can appear when interrogating a server which is either started in foreground (for debugging purpose) or, most likely, which has aborted, or was killed, and left an inconsistent status in the jobs file.

Terminated abnormally
The server phantom process has terminated, but without finishing properly. The process was probably logged off, or encountered a fatal error.

Notice from server 'servername'
The message in the message area is coming from the specified server.

Waiting for server 'servername' to start
Wait for the phantom process to start.

Waiting for server 'servername' to initialize
The phantom process has started and is now waiting for the server to initialize itself.

FAILED: Cannot start phantom process
The phantom process failed to start. Check the scheduler, make sure there are not too many queued jobs. Use 'list-jobs' to determine the cause of the abort.

FAILED: Server 'servername' encountered an error and stopped
The phantom process started normally, but the server could not complete its initialization. Use the 'status' menu option to see the cause of the failure. The most common cause is a 'BIND' error, indicating that the TCP port is already in use. Often, when stopping a server, the TCP connection is not purged immediately and the TCP port number remains in use. This usually cleans up by itself after the various TCP protocol time outs have expired. Stop the master server. Retry a few minutes later. If the BIND error persists, change the port number using the 'setup' option, on BOTH the master and the slave sides.

WARNING: Server 'servername' is already running on PIB xxx
A start server failed because the server appears to be already running. This has no action on a running server. If the server is NOT running, and this message still appears, this might indicate a corruption of the hot backup control file. Make sure the server is not running (using where) and do:
delete hb.log servername

WARNING: Transaction are NOT enqueued
This warning is issued by the master server to indicate that updates to the data base are not being enqueued, and therefore not being transferred to the slave system. The databases are now different.

WARNING: Transaction queue has grown to xxx frames
Periodically, the master server checks if the queue has grown to more than 500 frames above the previous value. This message is an indication that the slave server is not able to keep up. Make sure the input server is not stopped, or has not aborted. This message MAY be normal in exceptional situations where massive updates are being done.

malloc error. err=xxx
The system could not allocate memory. This is an indication of a serious Unix problem. Check the swap space.

socket creation failed
The socket library is not linked properly to the Monitor being run by the phantom. Refer to the installation guide to link the proper socket libraries for implementations on which this library is an optional package.

Cannot call host 'hostname'. Errno=xxx
The master server cannot establish a communication. 'xxx' is an indication of the error. The error code is very implementation dependent. Make sure the host is reachable by using the Unix command 'ping hostname'. If so, the most likely reason is that the slave server is not started or has encountered an error. Use the status menu option on the slave.

Cannot read FOF item 'xxx'
Due to a corruption in the file-of-files file on the master system, the system is unable to identify the file being updated. 'xxx' is the file number (as known on the master system). Data is normally transferred to the slave system but stored in a temporary hold file named "hb.missing,mastername*xxx", where 'mastername' is the name of the master server, and 'xxx' the file number. To correct this situation, the file-of-files file on the master system must be rebuilt by doing a full file save (possibly a dummy save), with the (S) option. The missing fof item can also be rebuilt by hand if the file name can be identified from the content of the file being stored in the temporary hold file on the backup system. The data can then be copied to the real file. Instruct the slave server to close its opened files, using the special operation menu, to force it to open the new file.
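The periodic check behind 'WARNING: Transaction queue has grown to xxx frames' above can be pictured as a comparison between two successive queue sizes. The Python sketch below is an assumption about the shape of that check, taking "previous value" to mean the size seen at the previous check; the function name is hypothetical.

```python
GROWTH_THRESHOLD = 500  # frames, as stated in the message description


def queue_growth_warning(previous_frames, current_frames):
    """Return the warning text if the queue grew by more than 500 frames."""
    if current_frames - previous_frames > GROWTH_THRESHOLD:
        return "WARNING: Transaction queue has grown to %d frames" % current_frames
    return None


print(queue_growth_warning(100, 700))
print(queue_growth_warning(100, 600))
```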


Lost+Found File :
When a slave server is unable to process a transaction (for example, the account does not exist and the automatic creation of missing accounts is disabled), the transaction is stored in the trash file 'hb.log,lost+found'. The item-ids of this file are unique time-date stamps. This file can be used to recover the salvaged data. The format of an item is:

1 Item type:
1: Unknown transaction code.
2: Slave server was reset.
3: Internal error.
4: Time out on transaction. These errors can occur when stopping the master server.
5: Bad network message.
6: Cannot perform the operation. The error message usually indicates the cause of the problem.
7: D pointer cannot be updated. There is a D-pointer-like element in the master dictionary or file dictionary preventing the creation of the file. This item was deleted to allow the file creation, and its content is written starting at attribute 10.
8: The MD of the account in which the slave server is running (normally 'dm') contains an item which collides with the name of an account. Though this is normally authorized, it is not good practice. The server removes this item and dumps its content.
9: The slave server removed a correlative from a D pointer because it could not resolve a reference to a CALLX subroutine or to a bridge. The lost+found item contains the removed correlative.
10: Transaction could not be processed when applying delayed updates.
11: Incorrect transaction length.
2 Transaction opcode, if applicable, or '?'.
3 Server name.
4 Log internal time.
5 Log internal date.
6 Transaction number.
7 Last ACK'ed transaction number.
8 Message.
9 Time date in external format.
10 Body of the transaction which could not be processed or any data the server could not handle. More than one attribute can be used. This data may be incomplete.
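To inspect salvaged data outside the system, a lost+found item can be split into the attributes listed above. The Python sketch below assumes the item has been exported as a single string with attributes separated by the attribute mark (character 254); the sample item content, the function name and the dictionary keys are hypothetical.

```python
ATTRIBUTE_MARK = chr(254)

ITEM_TYPES = {
    1: "Unknown transaction code",
    2: "Slave server was reset",
    3: "Internal error",
    4: "Time out on transaction",
    5: "Bad network message",
    6: "Cannot perform the operation",
    7: "D pointer cannot be updated",
    8: "MD item collides with an account name",
    9: "Correlative removed from a D pointer",
    10: "Transaction could not be processed when applying delayed updates",
    11: "Incorrect transaction length",
}


def parse_lost_found_item(raw):
    """Map attributes 1 to 9 to named fields; attribute 10 onward is the body."""
    attrs = raw.split(ATTRIBUTE_MARK)
    attrs += [""] * (9 - len(attrs))  # pad short items (no-op if long enough)
    code = int(attrs[0]) if attrs[0].isdigit() else None
    return {
        "type_code": code,
        "type_text": ITEM_TYPES.get(code, "?"),
        "opcode": attrs[1],
        "server": attrs[2],
        "internal_time": attrs[3],
        "internal_date": attrs[4],
        "transaction": attrs[5],
        "last_acked": attrs[6],
        "message": attrs[7],
        "external_time_date": attrs[8],
        "body": attrs[9:],
    }


# Made-up item, for illustration only.
raw_item = ATTRIBUTE_MARK.join(
    ["6", "?", "fromprod", "43200", "9500", "1041", "1039",
     "sample message", "sample time date", "body line 1", "body line 2"]
)
item = parse_lost_found_item(raw_item)
print(item["type_text"], item["body"])
```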
Syntax hot-backup {{command} {server=name}}
Options Q Quiet. Valid only for the non menu operation. Suppresses all messages.

V Verbose. Turn traces on when starting the server from the menu.
Example
Examples :

hot-backup
  Enter the main menu.

hot-backup start
  Start the (only) server on the virtual machine.

hot-backup start server=todev
hot-backup start server=fromprod
hot-backup status server=todev
hot-backup status server=fromprod
  On a system which is both slave and master, start a master server 
'todev' and a slave server 'fromprod'. Obtain status from 
both of these servers.
Purpose
Related general.hot.backup