Use of this indexing scheme in more catalogs will enormously simplify cross-matching of objects. Using a new computing paradigm, we recently realized a quantum leap in performance that makes this scheme competitive with bit-interleaving and requires very little memory.
The Flux-space Indexing used is a traditional tree. The space is 5 dimensional, 5 being the number of SDSS-filters. The specialization to astronomical data has been achieved by modeling the location of the main branch in this space and applying the tree subdivisions only to its confined area. The outliers are indexed separately. Most of the interesting data points come directly from the outlier part of the index, with no additional analytical effort.
Additionally, the key-lookup index feature of object-oriented databases is exploited for much-used parameters like flags.
The SDSS Science Archive (SX) makes use of two indexing techniques for user queries. (The SDSS project and Science Archive is described in further detail in articles by A. S. Szalay and by A. R. Thakar in this volume.) The first index is called the Hierarchical Triangular Mesh (HTM), which is used to subdivide the surface of the unit sphere into roughly equal-sized spherical triangles. (This method was advocated by P. Barrett (1994); proposed in detail by A. Szalay, R. Brunner et al., 1996. See also H. Samet, 1990) The HTM, a quad-tree, is used as the spatial index of the SX.
The second major index is an enhanced k-d tree that is used on the 5-dimensional color space of the SDSS data (the dedicated SDSS camera has 5 passband filters). Both indices are used to reduce the total data volume that needs to be scanned for the user query to complete.
Object coordinates are located on the surface of the celestial (unit) sphere, usually given by . For the HTM, we will convert to Cartesian coordinates to simplify our notation. The HTM is a quad-tree, its top nodes are the 8 segments that define the octahedron on the sphere (its sides are given by the intersection of the sphere with the planes). The 6 corners of the octahedron are, enumerated from 0 to 5 as follows: ; ; ; ; ; .
The nodes of the HTM quad-tree are always spherical triangles with great circle segments connecting the corners. The triangles are specified by their vertices, the vertices always given in counterclockwise order. The 8 root nodes are, using the enumeration of the 6 octahedron corners from above: ; ; ; ; ; ; ; .
The quad-tree is obtained by subdividing the triangles into 4 new triangles, using the current corners and the side-midpoints as the corners for the new triangles (as illustrated by figure 1). The new nodes obtained in this fashion are named by appending the index 0,1,2,3 to the parent's name, depending which vertex the new triangle shares with its parent (3 for the middle triangle).
At each level of the HTM, the average triangle area is of their parent triangle's average area, i.e. exactly . There will be no triangle, however, with exactly this area at any but the root level. Already at the first subdivision, the middle triangles have different areas and side-lengths (the angle at each corner of the root triangles is exactly , which is not true anymore for the next level). The scatter of the different triangle areas remains within 70% of the mean area at any level. It can be shown that the ratio of the minimal and maximal triangle areas quickly converges to a constant value (roughly 2).
The length of each triangle side is thus also distributed between a minimal and maximal length for each level. It is easy to see that the minimal side length is given by the subdivisions of the 3 major big circles that define the 8 root nodes, giving minimal side lengths of at level . It can be shown that the maximal side-length converges to times the minimal length. So for example at level 14, the side-lengths are between 20 and 30 arcseconds.
To cross-match two catalogs, we retrieve each object's location from the catalog with the better resolution (i.e. where the ID is stored to a higher level ) and determine the triangles that are in its vicinity at the HTM level of the second catalog (). This is done by finding all the triangles in level that are touched by a sphere segment with origin at and diameter of the minimal side-length of level ; at most 6 triangles. Now we just look up these 6 IDs in the second catalog whether they match any of the objects within.
This procedure is basically an ID lookup in both catalogs. If the ID is key indexed in both catalogs, the cross-matching should be working with the same speed as the lookup of the catalog objects one by one, i.e. we have an order(N) algorithm.
The tree used in the SX differs from standard trees in two respects. The first is that we don't use the standard method to cut the data volume in alternating dimensions but we calculate the bounding box of each cell at each level and cut in the direction with the largest box side. Thus we get a minimal aspect ratio tree. It has been shown recently (Duncan 1999) that trees constructed in this way perform very well as generic indexing trees for various aspects of indexed lookup.
The second enhancement is to model the object's predicted location in color-space (the stellar locus for example) and construct a convex hull of that area. The first ``cut'' of our -tree is deciding whether an object lies inside or outside the predicted area. If it is an outlier, it is stored in one branch, and the rest is indexed using the -tree described above. In such a way, many of the most interesting objects are found using a simple index query.
The physical data layout of the SX makes use of both the HTM and the -tree. The underlying commercial database of SX is Objectivity, which is an object-oriented DBMS. In an Objectivity federation, ``databases'' are single files that contain the objects, grouped into ``containers''. In the SX, we define the databases to be the level-5 leaf nodes of the HTM, (defining the database file name) and we order the objects inside the database according to their -tree leaf node number. This way, if we query for objects confined to a limited area on the 3D sphere, we only need to open the level-5 HTM databases that contain the corresponding area. If a query involves a domain of color space (5-dimensional for SDSS), we do need to open all databases, but the data will be localized close to each other in each of them. This makes very effective use of the hardware and software caching mechanisms.
The current implementation of the HTM is available on the SDSS Science Archive Website in C, C++ and JAVA. The code has been tested on various Unix platforms (Solaris, OSF/1, IRIX, Linux) and on Windows NT/98/95. It should run on other platforms as well. We do want to encourage everyone building catalogs to include the HTM ID with their catalog objects so that cross-matching between catalogs is simplified. The software also comes with a variety of query capabilities; basically any shape on the spherical surface can be submitted to it to get the triangles that are fully contained or partially contained within that surface. The code is freely available.
Barrett, P. 1994, in ASP Conf. Ser., Vol. 77, Astronomical Data Analysis Software and Systems IV, ed. R. A. Shaw, H. E. Payne, & J. J. E. Hayes (San Francisco: ASP), 492
Duncan, C. 1999, Thesis, Johns Hopkins Univ., Dept. of Computer Science
Samet, H. 1990, Applications of Spatial Data Structures and The Design and Analysis of Spatial Data Structures; Addison-Wesley.
Szalay, A. S. 2000, this volume, 405
Szalay, A. S., Brunner, R., et. al. proposal, 1996
Thakar, A. 2000, this volume, 231