NTP Interleaved On-Wire Protocol

gif

from Pogo, Walt Kelly

If it walks like a duck...

Introduction

The NTP protocol operations in basic clien/server, symmetric and broadcast modes have been described in previous chapters. As pointed out in previous sections of this chapter, the highest accuracy obtainable requires the use of driver or hardware timestamps; however, this requires intervention in the kernel driver or hardware device.

This section describes two new modes, interleaved symmetric and interleaved broadcast, which can improve accuracy by avoiding output queuing delays. In the reference implementation the receive timestamp is captured shortly after the input device interrupt and before adding the buffer to the input queue. This avoids additional jittter due to queuing and buffering overhead. The transmit timestamp is ordinarily captured just before the message digest computaion and then adding the buffer to the output queue. If nothing further were done, there would be additional jitter due to queueing and buffering delays.

The output queuing delays can be avoided by capturing an additional timestamp a shortly after the output device interrupt and sending it sent as the transmit timestamp of the following packet. With the state machine and protocol described in this secton, operation is automatic and transparent to other clients with and without interleaved capabilities.

Interleaved symmetric mode is an extension of basic symmetric mode, while interleaved broadcast mode is an extension of basic broadcast mode. The interleaved modes require servers and peers to retain state, so these modes do not support basic client/server mode. Mode selection can be determined by configuration option or automatically by the protocol. In all modes the protocol can detect lost, duplicate and bogus packets and, in addition, ldetect packets that violate the interleaving rules.

The interleaved modes described in this section have been implemented and tested in the NTP reference implementation. In its present form only software timestamps (softstamps) are used, with the advantage that the actual transmit timestamp is captured upon return from the I/O routine and more accurately represents the on-wire hardware timestamp (hardstamp). Future plans include using driver timestamps or NIC timestamps as the opportunity arises.

The plan for this section is first to describe the protocol state machine, including the packet header variables, state variables and flowcharts. This is followed by a detailed description of each mode and and examples of operation. Finally, an example is presented that shows how the protocol automatically reverts to basic mode when confronted with an NTPv4 implementation that does not support interleaved modes.

One-Step and Two-Step Protocols

With hardware or driver assist the packet itself does not always contain the transmit timestamps, since those timestamps are available only after the packet has been sent. This requires a modification of the protocol to include the timestamps in a later packet.

gif

Figure 2

The two-step method used by IEEE 1588 is shown in Figure 2. The master starts by broadcasting a Sync packet, which the slave receives at T2. Later, the master broadcasts a Follow_Up packet containing T1. At T3 the slave sends a Delay_Req packet to the master, which sends a Delay_Resp packet containg T4. The slave calculates the offset and delay as in Equations (1) and (2) of Chapter 3, but notes the offset has opposite sign. Note that the Sync and Follow_Up packets are sent only once during each round; the Delay_Req/Delay_Resp exchange must be completed for each slave separately.

gif

Figure 3

Figure 3 shows the current NTP basic symmetric one-step on-wire protocol. Each message sent includes two timestamps as shown in parens. A completes two rounds T1 -> T4 and T5 -> T8, while B completes two rounds T3 -> T6 and T7 -> T10.

gif

Figure 4

Figure 4 shows the NTP interleavved symmetric two-step on-wire protocol. Note that the transmit timestamp is sent in the following packet, so that A completes one round T1 -> T8, while B completes one round T3 -> T10. Note that during the T1 -> T8 round A has initiated a new round at T7 and during the T3 -> T10 round B nas initiated a new round at T7, so the rounds run concurrently. The NTP two-step interleaved two-step protocol is described in detail on following sections.

Protocol State Machines

There are two protocolo state machines, one for the transmit process, which runs for each transmit packet, and the other for the receiver processm, which runs for each receive packet. The processes can run in one of four modes, basic symmetric and broadcast modes, and interleaved symmetric and broadcast modes. Ordinarily, the transmit and receive state machines run in the same mode; however, as described later, the receiver can automatically switch modes if the transmitter switches modes.

gif

The table above shows three packet header variables, torg, trec and txmt, and seven state variables rec, dst, aorg, borg, xmt, b and x used in the flowcharts to follow. Some state variables are not used in some modes. By convention in the following, the pseudo-state variable clock designates the system clock at the time of reading.

gif

The above flowcharts show the transmit operations in symmetric and broadcast modes. Note that x serves both as a switch to indicate basic modes or interleaved modes, as well as in interleaved modes a means to alternate the transmit timestamp between aorg and borg. It is set at configuration time to zero for basic or one for interleaved. Note also that in interleaved broadcast mode torg, ordinarily zero in basic broadcast mode, is hijacked to hold the previous transmit timestamp. This is how a broadcast client recognizes whether basic or interleaved mode. Clients conforming to the NTPv4 specification ignore torg and trec.

gif

The above flowcharts show the receive operations in in the various modes. In this and later flow charts, labels with capitals are error stubs. DUPE indicatdes a duplicate packet, SYNC indicates a synchronization error and BOGUS indicates a bogus packet. The underlined actions represnt calls to one of the remaining three routines basic, symmetric or broadcast. The b state variable is a switch used in the calibration phase when the interleaved broadcast client first starts up. Note that the basic symmetric flowchart can also used for the client in basic client/server mode, so the same code can be used for both basic and interleaved operations.

gif

The above flowchart shows the receive operatinos in interleaved symmetric and broadcast modes.

Basic Mode (p. 42)

Basic mode represents the current NTPv4 specification and implementation. Following is an example illustrates typical operation in basic symmetric mode starting from an unsynchronized condition.

gif

The above figure illustrates the most general case where symmetric peers A and B independently measure the offset and delay relative to the other. Each packet transmitted is shown as an arrow with the receive timestamp at the head and the transmit timestamp at the tail. The blue boxes hold timestamps captured from the system clock. The other boxes hold values copied from them, other state variables or packet headers.

The labels shown are after arrival or departure of a packet. When a packet arrives, txmt is copied to rec and the receive timestamp to dst. When a packet departs, rec is copied to torg, dst to trec and the transmit timestamp to org and txmt. After the state variables have been updated, the timestamps are T1 = torg T2 = trec, T3 = txmt and T4 = dst.

In the following descriptions the packet contents are given in the form (torg, trec, txmt). To begin the round at t1, peer A sends (0, 0, t1) to B. Later, B sends (t1, t2, t3) to A. At t4 the header timestamps t1, t2, t3 and dst variable t4 are available to compute the offset and delay of B relative to A. In a similar fashion at t6, t3, t4, t5 and t6 are available to compute the offset and delay of A relative to B. The protocol is said to be synchronized when both A and B have computed the offset and delay; that is, at t6.

In symmetric modes each peer independently polls the other peer, but not necessarily at identical intervals. Thus, one or the other peer might receive none, one or more than one packet between polls. This can result in lost or duplicate packets and even cases where packets overlap each other in flight resulting in out of sequence or bogus packets. In addition, provisions must be made to reset and restart the protocol, if necessary. These details are in the imlementation, but beyond the scope of this discussion.

There are three sanity checks designed to deflect unsynchronized, duplicate or bogus packets. Before the state variables are updated, the packet is discarded as a duplicate if txmt matches rec from the last packet received. After the state variables have been updated, the packet is discarded as unsynchronized if either T1 or T2 are zero. The packet is discarded as bogus if T1 does not match org. Finally, although not shown in the figure, org is set to zero in order to deflect a replay of the first packet. Thus, the round is protected against replays of either the first or second packets.

The protocol is inherently resistant to lost packets and overlapped packets. For instance, if packet t3->t4 is lost, the next set of timestamps t5, t6, t7 and t8, are available to compute offset and delay. If packets t3->t4 and t5->t6 overlap, peer A will discard the latter as bogus and use the next set of timestamps as before.

Interleaved Symmetric Mode

In interleaved modes the transmit timestamp is captured after the packet has been sent, so it is sent in the next following packet. The problem for the protocol is to detect and avoid possible errors if a packet is lost or duplicated. A solution for this is the two-step or interleaved protocol described in this section. In this variant the transmit hardstamp for one packet is actually transmitted in the immediately following packet. The trick, however, is to implement the interleaved protocol without changing the NTP packet header format, without compromising backwards compatibility and without compromising the error recovery properties. Following is a typical example of operation starting from an unsynchronized condition.

gif

The above figure illustrates a typical scenario involving peers A and B. Note that the receive (even-numbered) hardstamps are available immediately after the packet has been received, but the transmit (odd-numbered) hardstamps are available only after the packet has been sent, so are sent in the next following packet.

In contrast to the basic protocol, which requires one complete round to calculate offset and delay, the interleaved round requires two basic rounds. The round that begins at t1 is not complete until t8 and the round that begins at t5 is not complete until t12. However, the rate of offset/delay calculations is the same as the basic protocol. The NTP packet header fields are the same in the interleaved protocol as in the basic protocol, but carry different values.

Each peer requires the five state variables, rec, dst, aorg, borg and xmt. In addition, there is an x bit which alternates between +1 and -1 for each packet sent. It is convenient to interpret the value x = 0 as an instance of the basic mode protocol when both basic and interleaved modes are available.

When a packet is received, trec is copied to rec and the receive hardstamp to dst. When a packet is transmitted, rec is copied to torg and dst to trec. If x = +1, the transmit hardstamp is copied to aorg and borg is copied to txmt. If x = -1, the transmit hardstamp is copied to borg and aorg is copied to txmt. Upon receipt and before the state variables have been updated, the timestamps are T2 = rec, T3 = txmt, and T4 = dst. If x = +1, T1 = aorg; if x = -1, T1 = borg. After this, trec is copied to rec and the receive hardstamp to dst.

There are three sanity checks designed to deflect duplicate, unsynchronized or bogus packets. While not shown in the figure, the xmt state variable is used only to detect duplicate packets. Before the state variables are updated, the packet is discarded as a duplicate if txmt matches xmt from the last packet received. Otherwise, txmt is copied to xmt. After the state variables and timestamps have been updated, the packet is discarded as unsynchronized if either T1, T2 or T3 are zero. The packet is discarded as bogus if torg is nonzero and does not match T4.

Note that the difference between the transmit hardstamp T4 and corresponding softstamp aorg represents the output MD5 digest and operating system delay. If this quantity is less than zero or greater than the maximum roundtrip delay, the packet is misordered and should be discarded. This is called the delay test; in practice, it has proved a very reliable means to detect lost or crossed packets. If the delay test fails, the protocol is restarted, wich requires an interleaved round (two basic rounds) to recover.

In the above figure t8 is the end of the interleaved round that begins at t1. Before the state variables have been updated, the four timestamps t1, t2, t3 and t4 are available to determine the offset and delay of B relative to A. The reader can verify the four timestamps t3, t4, t5 and t6 are available at t10 to determine the offset and delay of A relative to B.

Interleaved Broadcast Mode

In broadcast mode the client responds to the first broadcast received by executing one round of client/server mode protocol in order to calibrate the difference between the offset between the apparent time delivered via the spanning tree topology and the unicast topology. This is complicated both during the calibration round and subsequently in broadcast mode due to possble lost or duplicated packets.

An IEEE 1588 Precision Time Protocol (PTP) master broadcasts a Sync message to the slaves, which capture the receive timestamp T2. Immediately thereafter the master broadcasts a Follow_up message including the transmit timestamp of the Sync message T1. Some time later each client separately sends a Delay_req message to the master and captures the transmit timestamp T3. The master returns a Delay_resp message containing the receive timestamp T4 for the Delay_req message. The client collects these timestamps to calculate the offset and delay as in the NTP protocol with the offset sign inverted.

The principal difference between the interleaved broadcast protocol and PTP is that the broadcast transmit timestamp T1 is actually sent in the following broadcast packet and the remaining timestamps are obtained by the same stateless protocol used in NTP client/server modes, although slightly modified.

gif

The figure above shows a typical scenario where the clock is updated as each broadcast packet arrives. The protocol uses the same packet header format as the basic protocol, but includes the transmit softstamp sent in the packet followed by the corresponding hardstamp sent in the following packet. To emphasize this, the softstamps in the figure are starred. Softstamps are not used in the interleaved protocol; they are included to support both basic and interleaved modes with the same packet stream.

In the current NTP specification and implementation the torg and trec header fields are unused in broadcast mode and ordinarily set to zero. Clients that support interleaved broadcast mode will note that torg is nonzero and in that case set state variable b = 1. The broadcast client now executes a stateless client/server mode exchange ending at t8. Note that, if b = 1 trec is copied to rec for use when the next broadcast packet arrives.

Upon the arrival of the next broadcast packet at t12, the timestamps T1 = torg, T2 = borg, T3 = aorg, and T4 = dst can be used to calculated the offset and delay in the usual way. However, note that the sign of the offset is reversed. If the stateless exchange is not repeated, the client keeps the delay to use as later broadcast packets arrive.

Note that the difference between the transmit hardstamp T4 and corresponding softstamp rec (before rec is updated) represents the output MD5 digest and operating delay. As in the interleaved symmetric delay test, if this quantity is less than zero or greater than the maximum roundtrip delay, the packet is misordered and should be discarded.

Error Detection and Recovery

Recovery from a lost, duplicate or bogus packet can be tricky in the interleaved modes. In interleaved symmetric mode a single lost or duplicate packet is rejected by the various sanity tests, while in either interleaved mode bogus packets are rejected by the delay test.

In the scheme of things, error packets are rare, especially in fast LANs where propagation delays are very small and poll intervals are in the order of many seconds. Tests using the current reference implementation with simulated packet drop probability 0.1 in either interleaved mode show occasional bogus and delay rejections, but no offset or delay errors.

However, there is one instance where this can be useful. It occurs when a peer is initially configured to operate in interleaved mode and is presented with a client capable only of basic mode. This is illustrated in the figure below.

gif

In this figure A is the client and B the server. The client begins operation in basic mode as usual. The server stumbles along until failing the bogus test at t10. As per protocol, it resets and begins operating in basic mode. After synchronizing again, both client and server assume normal operation. A similar case occurs if the server is started in basic mode and the client starts in interleaved mode.

Reference Implementation and Scenarios

As previously mentioned, the interleaved modes have been implemented and tested in the reference implementation. However, a code surveyor might find it difficult to follow the code flow, as it is intertwined with the authentication and rate management systems. There is, however, a simple test program called lev.c which can be used to simulate all modes described in this document, as well as common error scenarios.

Currently, the reference implementation uses only software timestamps (softstamps) captured at or near the device interrupt time. The receive softstamp is captured upon return from the receive-packet interrupt routine and routine before the receive buffer is queued for later processing. The transmit softstamp is captured upon return from the send-packet interrupt routine. Both softstamps are probably as accurate as possible without driver or hardware intervention.

The reference implementation captures a softstamp before the message digest routine and another after the send-packet routine. The difference, called the interleaved or output delay, varies from 16 ms for a dual-core, 2.8 GHz Pentium 4 running FreeBSD 6.1 to 1100 ms for a Sun Blade 1500 running Solaris 10. In the following scenarios the network is a switched Ethernet operating at 100 Mb and the NTP packet is about 1000 bits or 10 ms.

On two identical Pentium 4 machines in symmetric mode, the measured output delay is 16 ms and remaining one-way delay components 45-150 ms. Two LAN segments account for 20 ms, which leaves 25-130 ms for input delay. The RMS jitter is 30-50 ms measured by the clock filter.

On two identical UltraSPARC machines running Solaris 10 in symmetric mode, the measured output delay is 160 ms and remaining one-way delay components 195 ms. Two LAN segments account for 20 ms, which leaves 175 ms for input delay. The RMS jitter is 40-60 ms measured by the clock filter. A natural conclusion is that the output delay is in general quite stable and most of the jitter is contributed by the network and input delay.

Performance Improvements

There are two ways to improve accuracy - driver timestamping and hardware timestamping. In driver timestamping the operating system device driver is augmented to capture driver timestamps (drivestamps) as close to the hardware as possible. If activated by a socket option, receive drivestamps can be saved in the buffer structure for later retrieval. Transmit drivestamps can be returned to the daemon in a socket option or as a pseudo-input packet as if received from the remote server or peer. Current Unix operating systems are aggresively optimized for throughput and not necessarily for accurate timestamp capture, so this is not easy.

The following example is from FreeBSD, but other systems may have a similar structure. Perhaps the most convenient way to implement drivestamps is using the ifnet or if_data structure. A logical place to capture drivestamps may be in the ether_input() and ether_output() routines in if_ethersubr() source code file. While possibly not as accurate as in the driver itself, it has the advantage that the scheme is driver independent.

In the simplest design, the only use for drivestamps or hardstamps is to calculate the amount of time to add to the softstamp to determine the hardstamp, so it doesn't matter whether the timestamp is represented in Unix, NTP or IEEE 1588 format. The simplest way might be to read the TSC/PCC and let the NTP deamon figure it out. What the daemon can do is read the TSC/PCC at softstamp time and use the drivestamp to calculate the actual hardstamp. This requires adjusting for the TSC/PCC rate, which is availble somewhere in the kernel. A similar thing can be done on input.

Hardware timestamping will most likely require intervention in the driver itself. IEEE 1588 NIC designs known to me have provisions to arm an input or output timestamp capture and hold the catured value in a register which can be retrieved in a separate operation. In addition, there are provisions to adjust the frequence of the NIC clock, which is usually a TCXO, and adjustable overflow counter. In driver designs known to me the captured values can be retrieved and the frequency adjusted using ioctls. One design uses the AMD PCNET chip and modified PHY using a FPGA.

The FPGA appears as a separate device from the Ethernet device. The user hangs a read on the FPGA device and waits for an interrupt before reading the hardstamp. The Ethernet driver interrupt routine is modified to test for a IEEE 1588 packet and when found triggers the FPGA interrupt routine. In priciple, the FreeBSD pcn driver could be modified to do the same thing.