The Virtual Observatory (VO) brings the promise of yet more complex, interrelated, distributed data; the development of VOTable shows a recognition of the importance of XML. Metadata is seen as a key component of the Virtual Observatory, as is the provenance of information gathered or created by remote, automated processes using the VO. Complex interrelationships between large numbers of files are likely to become commonplace in the VO. Rather than develop an astronomy-specific semantic toolkit, it seems sensible to position oneself to use the tools being produced by the wider WWW community, such as RDF and related standards.
We therefore see requirements for something flexible, extensible, capable of storing hierarchical information, able to deal with distributed data yet usable locally, and backed by a sophisticated astronomical data model. In addition it should be open to the new tools and standards which are bound to be produced in the near future outside astronomy. It should also facilitate interoperation between applications--something which will become increasingly difficult as complex structures are generated.
In addition to data containers such as Table and N-dimensional array, there is a Structure Object which can contain other objects, including other Structure Objects--allowing a hierarchical data structure to be built.
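The containment rule above is essentially the composite pattern. The following sketch is purely illustrative--none of these class names belong to a published HDX API--but it shows how allowing a Structure Object to hold both leaf containers and further Structure Objects yields an arbitrarily deep tree:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical stand-ins for the data model's
// container and structuring elements, not a real HDX interface.
interface DataObject {
    String name();
}

// Leaf containers, e.g. a table or an N-dimensional array.
class TableContainer implements DataObject {
    private final String name;
    TableContainer(String name) { this.name = name; }
    public String name() { return name; }
}

class ArrayContainer implements DataObject {
    private final String name;
    ArrayContainer(String name) { this.name = name; }
    public String name() { return name; }
}

// The Structure Object: may contain any DataObject, including
// other Structure Objects, giving a hierarchy.
class StructureObject implements DataObject {
    private final String name;
    private final List<DataObject> children = new ArrayList<>();
    StructureObject(String name) { this.name = name; }
    public String name() { return name; }
    StructureObject add(DataObject child) { children.add(child); return this; }

    // Depth of the containment tree rooted here.
    int depth() {
        int max = 0;
        for (DataObject c : children) {
            if (c instanceof StructureObject) {
                max = Math.max(max, ((StructureObject) c).depth());
            }
        }
        return 1 + max;
    }
}
```

Because the structuring element is just another object the containers know nothing about, the leaf containers stay as simple as the data they hold, which is the design point made below.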
It is important to remember that we are not talking about any particular data format such as FITS. Instead we are considering conceptual data structures which may be serialised in a number of different ways.
Separating the data container elements from the structuring elements has several advantages. It keeps the containers from becoming more complicated than necessary, and it leaves one free to consider additional metadata--recognised as being of fundamental importance for the VO--without being forced to devise encodings which fit within the constraints of FITS keywords. Instead one is free to consider something like XML, with its promise of a large number of standards and tools, as a serialisation mechanism.
Starlink's experience with hierarchical data structures, based on the Hierarchical Data System (HDS), over the past ten years or more is of use here. It shows that one must strike a balance between being very prescriptive about what applications can write out (e.g., simple FITS files) on the one hand, and allowing anarchy on the other. The most common problems were applications which did not understand the relationships between components, leading to erroneous processing, and metadata which, not being understood, was not correctly passed on to downstream applications.
Our experience is that one needs some fairly simple rules which applications must obey, and some pre-defined components within which to hide additional structures, so that common operations can be dealt with uniformly and correctly. An example of this is NDX (based on Starlink's NDF), which is described below. In addition, it must be possible for an application to adequately check the validity of a hierarchical file with which it is presented. We refer the reader to the HDX documentation for a full discussion of these rules.
HDX is a particular, simple, Structure Object. From an application's point of view an HDX is a W3C DOM (http://www.w3.org/DOM/) which has a top-level element <hdx> and which is valid. It is valid if each of the document element's children is either unknown to the HDX system or, if known, is validated by its declared validator (a software component which HDX can find).
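That validity rule can be made concrete with standard DOM machinery. The sketch below is an assumption-laden illustration, not the real HDX implementation: the <ndx> element name, its requirement of an <image> child, and the validator registry are all invented for the example. What it does encode faithfully is the rule stated above: children of <hdx> which are unknown are permitted, while known children must pass their registered validator.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class HdxCheck {

    // Hypothetical registry of validators for known child types.
    // Here "ndx" is assumed known, and assumed to need an <image> child.
    static final Map<String, Predicate<Element>> VALIDATORS = Map.of(
        "ndx", el -> el.getElementsByTagName("image").getLength() > 0
    );

    // An HDX DOM is valid if its document element is <hdx> and every
    // child element is either unknown, or accepted by its validator.
    static boolean isValidHdx(Document doc) {
        Element top = doc.getDocumentElement();
        if (!"hdx".equals(top.getTagName())) return false;
        NodeList kids = top.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (!(n instanceof Element)) continue;
            Predicate<Element> v = VALIDATORS.get(((Element) n).getTagName());
            if (v != null && !v.test((Element) n)) return false;
        }
        return true;
    }

    // Convenience: parse an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that the "unknown elements are permitted" clause is what lets downstream tools attach extra metadata without invalidating the file for applications which do not understand it.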
The abstract HDX data model has been implemented in a Java data-access library, but others, such as a Perl implementation, will be produced. Note however that support for the underlying data containers is distinct from support for the various HDX types which are defined. Further design aims are: to have low (or even zero) overhead, to the extent that applications can work using, for example, bare FITS files; to make it easy to extend the system to support new types; to make it easy to support new data storage resources, such as new file formats or a database serving an archive; and to allow these to be implemented very efficiently.
Simple operations on NDXs (e.g., ndx1.add(ndx2)) take care of variance, quality, WCS, etc. (where these components are present). Access is available to individual arrays (called NDArray objects) to allow more complex algorithms to be used.
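The value of an operation like ndx1.add(ndx2) is that it propagates the ancillary components consistently. The sketch below is a hypothetical stand-in for the real NDX interface--the class, its fields, and the use of NaN as the bad-value flag are all assumptions made for illustration. It shows the kind of book-keeping a single add() hides: image pixels sum, variances of independent measurements sum, and bad values propagate.

```java
// Illustrative only: a hypothetical NDX-like object, not the real API.
class NdxLike {
    static final double BAD = Double.NaN;   // assumed bad-value flag
    final double[] image;
    final double[] variance;                // null if the component is absent

    NdxLike(double[] image, double[] variance) {
        this.image = image;
        this.variance = variance;
    }

    // Pixel-wise addition which also handles variance and bad values,
    // so that the caller never has to touch the components directly.
    NdxLike add(NdxLike other) {
        double[] im = new double[image.length];
        boolean haveVar = variance != null && other.variance != null;
        double[] var = haveVar ? new double[image.length] : null;
        for (int i = 0; i < image.length; i++) {
            if (Double.isNaN(image[i]) || Double.isNaN(other.image[i])) {
                im[i] = BAD;                        // bad values propagate
                if (haveVar) var[i] = BAD;
            } else {
                im[i] = image[i] + other.image[i];
                // Variances of independent measurements sum.
                if (haveVar) var[i] = variance[i] + other.variance[i];
            }
        }
        return new NdxLike(im, var);
    }
}
```

Algorithms which cannot be expressed through such whole-object operations drop down to the individual arrays, which is what the NDArray access mentioned above is for.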
The philosophy and design goals behind NDArray/NDX included the ability to process arrays of unlimited size, comprehensive and transparent bad-value processing, direct and transparent array access across different formats, and location-transparent resource naming.
To help clarify the relationship between the underlying data containers and NDX, Figure 2 shows an application (Treeview) examining the same data held as FITS, HDS and XML. On the left of the figure one sees the individual components; on the right one sees that all the data is viewable as identical NDX components.