TENNIS: A STANDARD FOR DATA FORMATTING, INPUT, AND OUTPUT

Thomas L. Garrard

California Institute of Technology

ABSTRACT

A technique for documenting, formatting, and storing data is introduced and briefly described. The documentation standard, now labeled the tennis standard, allows creation of function libraries and utilities to gain access to the data, greatly facilitating input and output; these functions and utilities are described. A comparison of this standard with the better known HDF and CDF standards is made.

INTRODUCTION

The tennis standard was originally created to allow straightforward documentation of packet-style data*1* which has been concatenated into relatively large physical records for efficiency of storage. The rules for concatenation are simple so that the documentation can be simple. A very important side benefit of the standard is facilitation of creation of a library of functions for purposes such as input and output (i/o) of individual packets from application programs. This library eases i/o for individual users of the data and also protects them from changes in computer hardware and/or operating system.

The intended advantages of the standard include easier documentation of both the current data set and its creation history, efficient data storage that is independent of operating system and storage medium, and reduced duplication of user programming effort.

A new version of the standard is in progress at this time. It includes independence of computer hardware, operating system, and data storage medium as explicit goals. The data format is described in Parameter Value Language*2* (PVL) in a fashion which is intended to make tennis data sets resemble Standard Formatted Data Units*3* (SFDUs) very strongly. Some effort is made in the new version to accommodate data which is not packet style, i.e., data which contains large and infrequent data units such as images. Library utilities for translation of such data to the more widely known Hierarchical Data Format*4* (HDF) or Common Data Format*5* (CDF) standards are also planned.

Many library functions assume that the data form a time-ordered sequence; these library functions can then operate on time variables.

The standard and the library are designed to encourage the user/application programs to maintain a history or pedigree of the data, such that an audit trail back to the original spacecraft telemetry is always available.

THE STANDARD

The data are described in terms of a hierarchy, for which tennis nomenclature has been adapted since it avoids the preconceptions associated with more common computer-oriented nomenclature. The sequence is

name:     meaning or example:
bit       bit, no likelihood of confusion
point     variable or word, such as a 2-byte integer or 8-byte real
game      a group of points with closely related meanings
set       a group of games forming a logical record or a packet
match     a collection of sets forming a physical record
          (if the storage medium allows)
tourney   a collection of matches forming a complete data volume,
          such as a disk file or a magnetic tape

The standard specifies that each tourney (tape, etc.) begin with a sequence of special sets (metasets) which describe in PVL the format and contents of all the other sets in the tourney. Each set begins with a two-character key which identifies the proper description from the metasets. Example sets include individual measurements of magnetic field or signals from a single cosmic ray. The description specifies where to find games within sets and where to find points within games, and also includes textual descriptions of the meaning of the points. Sets are stored sequentially in matches (records, etc.); they are not split across matches (hence the difficulties with very large sets).
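
As an illustration of how the two-character key selects a description, the following Python sketch splits one match into its sets. It is not the library's code: the keys MF and CR, the point names, the fixed-length big-endian layouts, and the function split_match are all assumptions made for this example, standing in for what real PVL metasets would supply.

```python
import struct

# Hypothetical descriptions, standing in for what the PVL metasets would supply.
# Each two-character key maps to (point names, binary layout of the set body).
DESCRIPTIONS = {
    b"MF": (("time", "bx", "by", "bz"), ">Ihhh"),  # one magnetic field sample
    b"CR": (("time", "pulse_height"), ">IH"),      # one cosmic-ray event
}


def split_match(match: bytes):
    """Yield (key, {point name: value}) for each set stored in one match."""
    offset = 0
    while offset < len(match):
        key = match[offset:offset + 2]
        if key not in DESCRIPTIONS:
            raise ValueError(f"no description for set key {key!r}")
        names, layout = DESCRIPTIONS[key]
        body_size = struct.calcsize(layout)
        body = match[offset + 2:offset + 2 + body_size]
        if len(body) != body_size:
            # Sets are never split across matches, so a short body is an error.
            raise ValueError("truncated set at end of match")
        yield key, dict(zip(names, struct.unpack(layout, body)))
        offset += 2 + body_size
```

A caller would iterate over split_match(match) and look up points by name, never by byte offset, so a change in layout only requires a change in the description.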

Thus, the standard format consists of a simple list of data sets, initiated by a description metaset and occasionally punctuated with pedigree and marker metasets. It is intended for sequential access but (depending on the storage medium) various fast-forward or direct access schemes can be implemented.

Pedigree sets maintain a history of what programs and what input data were used to create the current tourney. Thus programming changes and the like can be traced in complete detail. Marker sets are placed at the beginning and end of each tourney and match, so that structure can be maintained even on media without physical record structure, such as Unix pipes.
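
The audit trail that pedigree sets make possible can be pictured with a short Python sketch. The Pedigree record, its field names, and the function audit_trail are hypothetical, chosen only for illustration; the real pedigree metasets are written in PVL and their contents are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Pedigree:
    """Hypothetical pedigree record: which program wrote a tourney, and from what."""
    tourney: str
    program: str
    inputs: list = field(default_factory=list)  # Pedigree records of the input tourneys


def audit_trail(pedigree, depth=0):
    """Return one indented line per ancestor, tracing back to the original data."""
    lines = [f"{'  ' * depth}{pedigree.tourney} <- {pedigree.program}"]
    for parent in pedigree.inputs:
        lines.extend(audit_trail(parent, depth + 1))
    return lines
```

Because each output tourney carries the pedigree of its inputs, the trail can be rebuilt from the final tourney alone.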

THE LIBRARY

The library includes a collection of functions (Unix C, Unix Fortran, and VMS Fortran are currently in progress for the revised standard) which perform i/o with a variety of media including tape, disk, and Unix pipes. The library greatly eases applications programming and is one of the payoffs for the scientist's effort invested in creating the description metasets. The operations include, for example,

name:        function:
get_set      get a pointer to the next set from the input storage medium
put_set      put a set on the output storage medium
get_game     get a pointer to a game within the current set
put_game,
get_point,
put_point    obvious extensions
copy_set     move a set from input to output
time_seek    fast-forward input to requested time (if the storage medium allows)
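
The way these operations combine in an application can be sketched in Python. This is an in-memory stand-in, not the library: only the names get_set and time_seek come from the table above, while the TourneyReader class, the (time, payload) representation of a set, the function copy_from, and all argument lists are assumptions made for the example.

```python
from bisect import bisect_left


class TourneyReader:
    """In-memory stand-in for one input unit; sets are (time, payload) in time order."""

    def __init__(self, sets):
        self._sets = list(sets)
        self._times = [time for time, _ in self._sets]
        self._pos = 0

    def get_set(self):
        """Return the next set, or None at the end of the tourney."""
        if self._pos >= len(self._sets):
            return None
        current = self._sets[self._pos]
        self._pos += 1
        return current

    def time_seek(self, time):
        """Fast-forward (never backward) to the first set at or after `time`."""
        self._pos = bisect_left(self._times, time, self._pos)


def copy_from(reader, output, start_time):
    """Copy every set from start_time onward; list.append plays the role of put_set."""
    reader.time_seek(start_time)
    while (current := reader.get_set()) is not None:
        output.append(current)
```

On a medium that cannot seek, time_seek would have to read and discard sets instead; the calling program would look the same either way.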

There are also multiple unit equivalents of these functions, so that one may, for instance, read input from two tourneys and merge them into a single output tourney.
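
A minimal Python sketch of such a merge, assuming each input tourney is already time-ordered and that a set can be represented as a tuple whose first element is its time. The function merge_tourneys and the tuple layout are hypothetical; the library's multiple-unit functions are not shown here.

```python
import heapq
from operator import itemgetter


def merge_tourneys(*inputs):
    """Lazily merge time-ordered inputs into one time-ordered stream of sets.

    Each input is an iterable of tuples whose first element is the set's time.
    Sets with equal times come out in the order the inputs were given.
    """
    return heapq.merge(*inputs, key=itemgetter(0))
```

Because the merge is lazy, only one pending set per input is held in memory, which matters when the inputs are read sequentially from tape or a pipe.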

The library also includes a collection of utilities, such as

verify          print statistics of the data content of a tourney: start time,
                stop time, gaps, numbers of sets, matches, and bytes, etc.
index_volume    create a database describing all the tourneys on this disk
                drive: such data as name and start time, types of sets,
                comments, length, etc.
source_get      print the source language of the program which output this
                tourney and the name of the input tourney of that program.

ADVANTAGES AND DISADVANTAGES

The very tight link between the self-documentation of the data in the metasets and the convenience of using the library functions provides a great deal of motivation for doing the documentation. HDFs, for example, make it easy to do self-documentation; the tennis standard makes it very difficult to avoid doing the self-documentation.

For what I call packet-style data, data which is logically organized in a list of short logical records, the standard greatly facilitates documentation of the data format -- its primary purpose -- and it utilizes storage very efficiently. The ability to utilize media other than disk drives can be a great advantage. It is perfectly feasible to process a 5-gigabyte tourney from (and/or to) 8-mm tape directly, without having to find disk space for it.

Conversely, for data such as very large arrays or images, where the logical record or set is larger than any reasonable physical record or match (8-mm tapes with 32 kB physical records, for example), additional complexity is introduced to fit the data into the tennis standard, and fractional overhead is small for any of the standards compared: tennis, NCSA HDF, or NASA CDF.

The maintenance of a pedigree history of the data as an explicit part of the tennis standard eliminates a great deal of confusion when errors or changes must be tracked down. As with self-documentation, it is possible to do this in other standards, but the motivation is less strong.

The tennis standard has independence of computer hardware, computer operating system, and data storage medium as an explicit goal. Although it is not clear to what extent this goal is realizable, it is clear that this standard will be more portable than those which assume that the data is on a random-access disk file. Note also that since the pedigree metaset specifies the computer hardware used to create a data set and the description metaset specifies data types, it is quite straightforward to do automatic translation of, for example, the various floating-point number formats used by VAX, Sun, and IBM PC or standards such as XDR.
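
One such translation can be sketched in Python: decoding a 4-byte VAX F_floating value into a native float. The function name vax_f_to_float is hypothetical, and the sketch covers only the F_floating format; in practice the pedigree and description metasets would tell a translator which conversion to apply to which point.

```python
import math
import struct


def vax_f_to_float(raw: bytes) -> float:
    """Decode one 4-byte VAX F_floating value, given in VAX memory order."""
    if len(raw) != 4:
        raise ValueError("F_floating values are exactly 4 bytes")
    # The value is stored as two little-endian 16-bit words. The first word
    # holds the sign, the 8-bit excess-128 exponent, and the top 7 fraction bits.
    word0, word1 = struct.unpack("<HH", raw)
    sign = word0 >> 15
    exponent = (word0 >> 7) & 0xFF
    if exponent == 0:
        # True zero when the sign is clear; a set sign marks a reserved
        # operand, which this sketch simply maps to zero as well.
        return 0.0
    fraction = ((word0 & 0x7F) << 16) | word1
    # The mantissa is 0.1f in binary: a hidden leading 1 followed by 23 bits.
    magnitude = math.ldexp(fraction | 0x800000, exponent - 128 - 24)
    return -magnitude if sign else magnitude
```

For example, the VAX bytes 80 40 00 00 (hexadecimal) decode to 1.0.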

Caltech has defined and maintained the standard and the library since about 1978 (with much appreciated advice and support from "partner" institutions, especially Washington University). We now find that one of the major objections to its use is its status as a private rather than public standard. We are exploring avenues for "upgrading" the status of the standard. Presumably our loss of control will be compensated by increased external support for library creation, if we are successful.

CONCLUSIONS

The tennis standard is an extremely useful system for documenting and storing data, especially spacecraft telemetry and related data. For packet-style data, it is probably more appropriate than HDF, and for large logical records, such as images, it should be trivial to translate to HDF.

ACKNOWLEDGEMENTS

This work has been supported by the National Aeronautics and Space Administration under numerous grants and contracts. The library creation is the work of a number of good programmers, among whom Brownlee Gauld stands out. The current expansions of the library to include VMS and Fortran versions are being supported at the University of Maryland and the Langley Research Center.

NOTES AND REFERENCES

1) Packet-style data means data consisting of short logical records occurring sequentially. A very relevant example is the telemetry data from the SAMPEX spacecraft, which meets the packet standard of the Consultative Committee for Space Data Systems: CCSDS 102.0-B-2, Packet Telemetry. CCSDS standards are available from NASA Headquarters, Code OS, Washington, D.C. 20546.

2) PVL is documented in CCSDS 641.0-R-0.2, Parameter Value Language Specification. It is a keyword = value language for naming and expressing data values.

3) SFDUs are documented in CCSDS 620.0-R-1.1, Standard Formatted Data Units -- Structure and Construction Rules. PVL is an important part of the SFDU specification.

4) HDFs are a standard of the National Center for Supercomputing Applications at the University of Illinois. Documentation and software are available by anonymous ftp from ftp.ncsa.uiuc.edu (or 128.174.20.50) or send paper mail to HDF, 152 Computing Applications Bldg., 605 E. Springfield Ave., Champaign, IL 61820.

5) CDFs are a standard of the National Space Science Data Center. Documentation and software are available by anonymous ftp from nssdca.nasa.gov or call the Support Office at (301) 286-9506.