
Plante, R. L., Pound, M. W., Mehringer, D. M., Scott, S. L., Beard, A. D., Daniel, P., Hobbs, R., Kraybill, J. C., Wright, M., Leitch, E., Amarnath, N. S., Rauch, K. P., & Teuben, P. J. 2003, in ASP Conf. Ser., Vol. 295, Astronomical Data Analysis Software and Systems XII, eds. H. E. Payne, R. I. Jedrzejewski, & R. N. Hook (San Francisco: ASP), 269

CARMA Data Storage, Archiving, Pipeline Processing, and the Quest for a Data Format

Raymond Plante1, Marc W. Pound2, David M. Mehringer3, Stephen L. Scott4, Andy Beard5, Paul Daniel6, Rick Hobbs7, J. Colby Kraybill8, Melvyn Wright9, Erik Leitch10, N. S. Amarnath11, Kevin P. Rauch12, Peter J. Teuben13

Abstract:

In 2005, the BIMA and OVRO mm-wave interferometers will be merged into a new array, the Combined Array for Research in Millimeter-wave Astronomy (CARMA). Each existing array has its own visibility data format, storage facility, and tradition of data analysis software. The choice for CARMA was between adopting one of a number of existing formats and devising a new format that combined the best of each. Furthermore, the choice had to address three important considerations. First, the CARMA data format must satisfy the sometimes orthogonal needs of both astronomers and engineers. Second, forcing all users to adopt a single off-line reduction package is not practical; thus, multiple end-user formats are necessary. Finally, CARMA is on a strict schedule to first light; thus, any solution must meet the restrictions of an accelerated software development cycle and take advantage of code reuse as much as possible. We describe our solution, in which the pipelined data pass through two forms: a low-level database-based format oriented toward engineers and a high-level dataset-based form oriented toward scientists.

The BIMA Data Archive at NCSA has been operating in production mode for a decade and will be reused for CARMA with enhanced search capabilities. The integrated BIMA Image Pipeline developed at NCSA will be used to produce calibrated visibility data and images for end-users. We describe the data flow from the CARMA telescope correlator to delivery to astronomers over the web and show current examples of pipeline-processed images of BIMA observations.

1. Data Storage

The AIPS++ Measurement Set 2 (MS2) format will be the canonical format for astronomical data products. This will allow CARMA software components requiring high-level science-oriented access to take advantage of the existing functionality of the AIPS++ toolkit. An example of this would be automatic data quality evaluation.
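As an illustration of the kind of automatic data quality evaluation that such science-oriented access makes possible, here is a minimal sketch, assuming the visibility amplitudes have already been extracted from an MS2 into a NumPy array; the outlier statistic and threshold are illustrative choices, not the CARMA algorithm:

    import numpy as np

    def flag_outlier_visibilities(amplitudes, n_sigma=5.0):
        """Return a boolean mask flagging amplitudes that deviate from the
        median by more than n_sigma robust standard deviations.

        `amplitudes` is assumed to be a 1-D array of visibility amplitudes
        already read from an MS2 dataset; the median/MAD statistic and the
        threshold are illustrative, not the actual CARMA quality criteria.
        """
        median = np.median(amplitudes)
        # 1.4826 * MAD approximates the standard deviation for Gaussian noise.
        robust_sigma = 1.4826 * np.median(np.abs(amplitudes - median))
        return np.abs(amplitudes - median) > n_sigma * robust_sigma

    # Example: the 25.0 point in this synthetic set is flagged.
    amps = np.array([1.0, 1.1, 0.9, 1.05, 25.0, 0.95])
    print(flag_outlier_visibilities(amps))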

In addition to the visibility data obtained by astronomical observing, the CARMA antennas also produce fast streams of telemetry data, called monitor points. These streams are sampled every half-second and the array as a whole will ultimately contain thousands of monitor points. The monitor data are important for tracking the health of the array, diagnosing problems, and assessing long-term trends. As such, they must be stored in a way that allows easy access and comparison among subsystems. A relational database is a natural solution.
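As a sketch only, using SQLite and hypothetical table and column names rather than the actual CARMA database engine or schema, the half-second monitor samples might be laid out along these lines:

    import sqlite3

    # Hypothetical layout for monitor points and their half-second samples;
    # the table and column names are illustrative assumptions, not the
    # CARMA schema.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE monitor_point (
        point_id   INTEGER PRIMARY KEY,
        subsystem  TEXT NOT NULL,        -- e.g. 'antenna5.cryo'
        name       TEXT NOT NULL         -- e.g. 'dewarTemperature'
    );
    CREATE TABLE monitor_sample (
        point_id   INTEGER REFERENCES monitor_point(point_id),
        mjd        REAL NOT NULL,        -- sample time (half-second cadence)
        value      REAL,
        valid      INTEGER DEFAULT 1     -- simple data-quality flag
    );
    CREATE INDEX idx_sample_time ON monitor_sample(point_id, mjd);
    """)

    # Comparison among subsystems then becomes an ordinary SQL join, e.g.
    # selecting all samples of one monitor point over a time range:
    rows = conn.execute(
        "SELECT s.mjd, s.value FROM monitor_sample s "
        "JOIN monitor_point p USING (point_id) "
        "WHERE p.name = ? AND s.mjd BETWEEN ? AND ?",
        ("dewarTemperature", 52640.0, 52641.0),
    ).fetchall()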

The visibilities from the telescope are initially written as a ``binary brick,'' and subsequently combined with the monitor data to create the MS2 (Figure 1).

2. Archiving

The CARMA Data Archive will be an extension of the BIMA Data Archive currently in use. Within the archive, data are organized in hierarchical collections that reflect how astronomers interact with their data. The broadest collection is a Project, which covers all data resulting from a single proposal and can contain a number of different Experiments. Within each Experiment are a number of Trial collections. Data from each observing track will be in its own Trial collection, and processed data are also collected into their own Trial collections. As with most archives, users can search and browse the collections through the web.
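A minimal sketch of the Project / Experiment / Trial hierarchy as nested Python data structures; the field names and example values are illustrative assumptions, not the archive's actual metadata model:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Trial:
        trial_id: str                # one observing track or processed product
        kind: str = "raw"            # e.g. "raw" or "processed"
        datasets: List[str] = field(default_factory=list)   # MS2 dataset names

    @dataclass
    class Experiment:
        name: str
        trials: List[Trial] = field(default_factory=list)

    @dataclass
    class Project:
        proposal_id: str             # one Project per accepted proposal
        experiments: List[Experiment] = field(default_factory=list)

    # A project with one experiment observed over two tracks, plus a
    # processed-data Trial produced by the pipeline.
    proj = Project("c0123", [
        Experiment("NGC1333-SiO", [
            Trial("trial1", "raw", ["c0123.ngc1333.1.ms"]),
            Trial("trial2", "raw", ["c0123.ngc1333.2.ms"]),
            Trial("pipeline1", "processed", ["c0123.ngc1333.cube.im"]),
        ]),
    ])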

High level metadata describing the observational experiments are very important for driving the pipeline (see below). These ultimately come from the astronomer during the planning stage and will include science-related information such as the spectral lines of interest and target sensitivity. This information will be used to fill the Observational Programs database (see Figure 1). In addition to being used to schedule the telescope, the metadata will be packaged up with the science (MS2) datasets and shipped to the archive.

New features of the archive include the ability to request the data in any one of the three formats for off-line processing in the AIPS++, Miriad, or Mir packages. This conversion can take place on-the-fly. The converted version will be temporarily cached in the archive in case that format is desired again later. New searching capabilities will also be added to support the searching and downloading of historical engineering data.
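A minimal sketch of the on-the-fly conversion and caching idea; the converter commands, cache layout, and function name are hypothetical, not the archive's actual implementation:

    import os
    import subprocess

    # Hypothetical external converters from the canonical MS2 format; the
    # command names are placeholders, not real archive tools.
    CONVERTERS = {
        "ms2":    None,                  # canonical format, no conversion needed
        "miriad": ["ms2_to_miriad"],
        "mir":    ["ms2_to_mir"],
    }

    def fetch_in_format(ms2_path, fmt, cache_dir="/archive/cache"):
        """Return a path to `ms2_path` in the requested format, converting
        on the fly and caching the result in case it is requested again."""
        if CONVERTERS[fmt] is None:
            return ms2_path
        cached = os.path.join(cache_dir, os.path.basename(ms2_path) + "." + fmt)
        if os.path.exists(cached):
            return cached                # converted earlier, reuse the cache
        os.makedirs(cache_dir, exist_ok=True)
        subprocess.run(CONVERTERS[fmt] + [ms2_path, cached], check=True)
        return cached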

Figure 1: Left) The 15 CARMA antennas may operate as one array or, as pictured above, as two independent subarrays. For each subarray, visibilities are written out as a binary ``brick'' with header values stored in databases. Monitor points are stored in three databases: at the full half-second rate, in one-minute averages, and averaged to the astronomical integration time. The relevant pieces are put together into an MS2 data file before shipment to the CARMA/NCSA archive, which happens in near real-time. Right) When the data arrive at NCSA, metadata are extracted for entry in the searchable archive. Visibility data are calibrated and imaged using the BIMA Imaging Pipeline (see Figure 2). The astronomer can download the unprocessed visibilities, the calibrated visibilities, and the processed images. The CARMA/NCSA archive will support MS2, MIRIAD, and Mir as export formats for visibilities, as well as engineering tables of monitor data. Converters will also be available on-site to allow observers to inspect or analyze the data locally using the respective packages.
[Figure 1: P7.11_1.eps]

3. Pipeline Processing

The CARMA Pipeline will be an extension of the existing BIMA Image Pipeline; Figure 2 illustrates its different components. Processing is triggered automatically whenever new data arrive in the archive. The pipeline analyzes the metadata associated with the collection to determine what needs to be done. This includes special processing parameters and science-related information provided by the astronomer during the planning stage. After processing, the new products--the calibrated visibilities and deconvolved images--are sent back to the archive to be ingested and made available to astronomers. These new data can trigger additional processing; for example, after all requested observing tracks have been calibrated, new processing is triggered to image and deconvolve the data from all tracks into a single image cube.
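A minimal sketch of this trigger logic, assuming the metadata arrive as a simple dictionary; the keys and recipe names are illustrative, not the pipeline's actual interface:

    def select_processing(metadata):
        """Decide what the pipeline should do when a new collection arrives.

        `metadata` is assumed to carry the astronomer-supplied planning
        information (e.g. spectral lines, target sensitivity) plus archive
        bookkeeping about which tracks have already been calibrated.
        """
        steps = []
        if metadata.get("kind") == "raw":
            steps.append(("calibrate", {"refant": metadata.get("refant", 1)}))
        # Once every requested track is calibrated, combine them all and image.
        if metadata.get("tracks_calibrated") == metadata.get("tracks_requested"):
            steps.append(("image_and_deconvolve",
                          {"lines": metadata.get("spectral_lines", []),
                           "target_rms": metadata.get("target_sensitivity")}))
        return steps

    # Example: the last track of a three-track experiment was just calibrated,
    # so only the combined imaging step is returned.
    print(select_processing({"kind": "processed",
                             "tracks_calibrated": 3,
                             "tracks_requested": 3,
                             "spectral_lines": ["CO(1-0)"],
                             "target_sensitivity": 0.05}))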

The actual processing is done with AIPS++, enabled for parallel processing, using NCSA SGI and Linux clusters. Users have access not only to the processed data but also to the AIPS++ (Glish) scripts used; this allows them to alter and redo the processing off-line.

The use of Grid-based computing technologies will open up interesting opportunities for distributed computing. For example, we plan to use the TeraGrid, a national Grid of distributed teraflop-scale computing resources linked by a high-speed backbone, to process the data. This will allow processing to be distributed between Caltech and NCSA. We can also use the Grid to set up partial mirrors of the archive at the other consortium sites, as well as give users greater access to the Pipeline for reprocessing of data.

Figure 2: When a new data collection arrives in the archive, a message is sent to the Event Server, which determines what processing needs to be done by retrieving and analyzing metadata about the collection. The metadata are forwarded to the script generator, which prepares the scripts by drawing on ``recipes'' in a recipe library. The scripts, together with instructions on the order in which they should be run (i.e., the ``work-flow''), are sent to the Queue Manager, which retrieves the input data from the archive through the Data Manager and submits the scripts and data to the Grid for processing. In practice, serial processing (e.g., calibration) is done on different machines from the parallel parts (e.g., imaging). The resulting data products are then sent back to the archive to be ingested.
[Figure 2: P7.11_2.eps]



Footnotes

1. Raymond Plante: NCSA/University of Illinois
2. Marc W. Pound: University of Maryland
3. David M. Mehringer: NCSA/University of Illinois
4. Stephen L. Scott: Caltech/OVRO
5. Andy Beard: Caltech/OVRO
6. Paul Daniel: Caltech/OVRO
7. Rick Hobbs: Caltech/OVRO
8. J. Colby Kraybill: University of California, Berkeley
9. Melvyn Wright: University of California, Berkeley
10. Erik Leitch: University of Chicago
11. N. S. Amarnath: University of Maryland
12. Kevin P. Rauch: University of Maryland
13. Peter J. Teuben: University of Maryland

© Copyright 2003 Astronomical Society of the Pacific, 390 Ashton Avenue, San Francisco, California 94112, USA