``AstroBrowse'' Workshop Report
  
Providing a Common Search Facility for
Independent Astronomical Data Services

Stephen S. Murray
Smithsonian Astrophysical Observatory

and

Robert J. Hanisch
Space Telescope Science Institute

December 1995

Executive Summary

The astronomical research community has assembled large and diverse data holdings - catalogs, databases, simulations, and digital data archives from both space- and ground-based telescopes - most of which are available on-line via network services. The goal of this workshop was to bring together representatives of major data providers, astronomical researchers, and technology experts to find a way to link these data resources so that astronomers can easily locate data of interest regardless of which of the many facilities actually holds that data. Participants in the workshop reviewed current data resources and technologies and defined the requirements for a common data search facility. An approach to providing this facility that imposes minimal impact on data providers, utilizes existing network access tools, and requires little or no addition to project budgets for implementation, was discussed and developed. A prototype implementation is planned for early 1996.

1.  Introduction

The Astrophysics Science Operations Management Operations Working Group (SOMOWG) recommended to the NASA Astrophysics Science Operations Branch that a workshop be organized in order to bring together data providers, representative users, and network information system specialists to see if there is a low impact (low cost) solution to providing a general data locator function for astronomical data. This workshop was organized by Dr. Stephen Murray (SAO) and Dr. Robert Hanisch (STScI) and held at SAO on 27-28 September 1995. Participants in the workshop are listed below.

Roger Brissenden        SAO/ASC
Cynthia Cheung          NASA/GSFC NSSDC
Richard Crutcher        NCSA/U Illinois
Daniel Durand           DAO/CADC
Jim Fullton             CNIDR
Robert Hanisch          STScI
George Helou            IPAC/NED (e-mail input)
Susan Kleinmann         U. Massachusetts
Tom McGlynn             NASA/GSFC HEASARC
Stephen Murray          SAO
Eric Olson              UC Berkeley CEA
Benoit Pirenne          ST-ECF
Karen Strom             Five College Observatory
Douglas Tody            NOAO
Gustaaf van Moorsel     NRAO
Michael Van Steenberg   NASA/GSFC NSSDC
Archibald Warnock       A/WWW Enterprises
Marc Wenger             CDS

Table 1: Participants in the workshop.

Participants worked to define the primary user requirements for a data locator service, to determine the technology tools needed and available for implementing such a service, and to understand the impact on existing data services (which was required to be minimal - the data locator service is not intended as a replacement for the user interfaces and data services already provided by each data center).

Problem Statement

The diversity and volume of astronomical data available via electronic means has become so large that it is often impossible to keep track of what information exists about a particular object or region of the sky. With current technology (especially the World Wide Web) it should be possible to provide an easy way to determine the existence and location of data sets of interest via queries to a relatively simple network service.

Astronomers have a wealth of network-based research tools available to them, ranging from mission- and facility-based data archive and catalog services (with great depth of information for particular projects) to very broad, loosely coordinated facilities such as the AstroWeb (a volunteer-based effort to maintain a list of WWW sites relevant to astronomy). However, it is not possible to ask a simple question such as ``where can I find optical or UV images of 3C273'' without doing extensive manual searches on separate data services.

Solution Framework

The purpose of the workshop was to come up with a plan for an enabling technology or standard to allow, but not mandate, interoperability of the various astronomy data holdings scattered around the world.

The point was to define something that each site operating a data server could implement to tie into this loose network. A distributed, heterogeneous approach with a low threshold of acceptance was preferred over a "big systems" approach mandating that things be done a certain way or emphasizing any form of centralization. In other words, the goal was to utilize technology like the World Wide Web, which is simple enough to allow any site to participate, and which is fully distributed.

2.  Current Data Resources

A major component of the workshop was a series of presentations by the participants representing data providers on the services already in existence. All data providers have WWW services that allow at least a basic search capability, and many also provide more specialized interfaces that support complex queries and/or complex data structures. As WWW services improve it is likely to be possible to support even these more complicated user access functions via the WWW. These data services are summarized below, and a more detailed listing is given in the Appendix.

Perhaps the strongest element of commonality amongst the astronomical data providers is that the various user interfaces allow specification of an object position (often provided via either the SIMBAD or NED name resolution services) and object name. This in itself indicates that a unified search facility should not be difficult to achieve.

Archives of both ground- and space-based data sets are of substantial size (exceeding 1 TB and growing at rates of 1 TB/year or more) and have associated catalogs of varying completeness and complexity. Ground-based archive facilities are generally less complete or sophisticated, often acting more as back-up services than true archives, but there is clearly a desire to expand upon these services. Observation catalogs are now available from a number of ground-based observatories.

Data centers also have a large number of astronomical catalogs available and on-line searchable. The CDS, for example, provides over 800 catalogs. Other data providers are beginning to archive reduced data (e.g., the BIMA mm-array consortium now requires observers to provide their reduced, final images to the NCSA/UIUC Digital Image Library), and we expect that simulation data sets will also be available for on-line retrieval.

The problem remains, however, that a general search for data on a given object or given region of the sky requires the astronomer to manually search each data set or catalog.

3.  Enabling Technologies

A number of tools and resources have been developed in the past few years that enable one to integrate network information resources and build distributed systems. The World Wide Web, based on hypertext documents and HTTP servers, is the foremost example of this technology. However, the implementation of a distributed data search and retrieval system requires one to exchange requests for information that are well structured. The ANSI Z39.50 protocol provides a mechanism for the exchange of structured queries and responses. HTML/HTTP is probably adequate for prototyping this system and has the immediate benefit of widespread distribution and support.

A distributed astronomical software documentation service (ASDS) is now being developed by two of the workshop participants (Hanisch and Warnock) using software developed by CNIDR - ISITE. Aspects of this system may be appropriate for use in a distributed data searching service, though ASDS is really designed for text search and retrieval.

The key element for success is defining minimal acceptable standards to be used within these technological frameworks, and which support resolution of queries that establish the existence of data meeting the user's specifications.

4.  Proposed Approach and Implementation Plan

4.1  User Requirements

From the perspective of an astronomer using the system, a search for data on a given object or region of the sky has to have at least a modest set of qualifications in order to avoid irrelevant information. (This is important from the perspective of data providers as well, who will be concerned about handling poorly qualified queries that result in hits against a great many records in their catalogs and databases.) The minimum set of qualifications for a query is given in Table 2.

NAME          object name specification
POSITION      a,d,r or amin,amax, dmin,dmax
              (positions should be available through
              name resolution services (NED, SIMBAD))
DATA CLASS    pointed observation
              catalog
              survey
              reference data
              simulation
DATA TYPE     image
              spectrum
              time series
              flux (photometric) measurement
              visibility data
BANDPASS      gamma-ray
              x-ray (high energy, low energy)
              UV (EUV, FUV, UV, NUV)
              optical
              IR
              mm/sub-mm
              radio
TIME          date or range of dates
OBSERVATORY   mission, facility, or observatory

Table 2: Minimum search qualifications for user interface.
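
As a minimal sketch of how the qualifications in Table 2 might be held in software, the Python fragment below defines one query record and checks its controlled-vocabulary fields. The class name LocatorQuery, the field names, and the lowercase vocabulary spellings are illustrative assumptions; the workshop did not define a concrete encoding. The position is assumed here to be a cone, reading a,d,r as right ascension, declination, and radius in degrees.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Vocabularies transcribed from Table 2; the lowercase spellings are an assumption.
DATA_CLASSES = {"pointed observation", "catalog", "survey", "reference data", "simulation"}
DATA_TYPES = {"image", "spectrum", "time series", "flux measurement", "visibility data"}
BANDPASSES = {"gamma-ray", "x-ray", "uv", "optical", "ir", "mm/sub-mm", "radio"}


def _check(field: str, value: Optional[str], allowed: Set[str]) -> None:
    """Reject a value that is set but not in the agreed vocabulary."""
    if value is not None and value not in allowed:
        raise ValueError(f"{field}={value!r} is not one of {sorted(allowed)}")


@dataclass(frozen=True)
class LocatorQuery:
    """One query (hypothetical layout); None means 'not specified'."""
    name: Optional[str] = None
    cone: Optional[Tuple[float, float, float]] = None  # (ra_deg, dec_deg, radius_deg)
    data_class: Optional[str] = None
    data_type: Optional[str] = None
    bandpass: Optional[str] = None
    date_range: Optional[Tuple[str, str]] = None  # (start, end) as date strings
    observatory: Optional[str] = None

    def __post_init__(self) -> None:
        _check("data_class", self.data_class, DATA_CLASSES)
        _check("data_type", self.data_type, DATA_TYPES)
        _check("bandpass", self.bandpass, BANDPASSES)
        if self.cone is not None:
            ra, dec, radius = self.cone
            if not (0.0 <= ra < 360.0 and -90.0 <= dec <= 90.0 and radius > 0.0):
                raise ValueError(f"cone {self.cone!r} is out of range")


# The example question from the Problem Statement, restricted to optical images.
query = LocatorQuery(name="3C273", data_type="image", bandpass="optical")
print(query)
```

Validating against a fixed vocabulary on the client side is one way to keep poorly formed queries from ever reaching the data providers.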

Given a specification of the sort shown in Table 2 (e.g., via an HTML form), queries would be distributed to known data providers. A provider would respond if it holds data relevant to the query, either with a simple yes/no or, preferably, with an indication of the amount of data that pertains (e.g., the number of hits in the catalog or database). The response includes a hypertext link to the data provider that allows for further refinement of the search criteria and retrieval of the data.
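
The flow just described could be sketched as follows. This is an illustration under stated assumptions, not a design agreed at the workshop: providers are modelled as in-memory Python callables that return a hit count, where a real system would issue network requests, and the provider names and URLs are made up.

```python
from typing import Callable, Dict, List, NamedTuple


class Provider(NamedTuple):
    name: str                                   # hypothetical provider name
    url: str                                    # made-up entry point for refining the search
    count_hits: Callable[[Dict[str, str]], int]


class Response(NamedTuple):
    provider: str
    hits: int
    url: str


def distribute(query: Dict[str, str], providers: List[Provider]) -> List[Response]:
    """Send one query to every provider and keep only those reporting data."""
    responses = []
    for p in providers:
        try:
            hits = p.count_hits(query)
        except Exception:
            continue  # a failing provider should not sink the whole search
        if hits > 0:
            responses.append(Response(p.name, hits, p.url))
    # Largest holdings first, so the user sees the richest sources at the top.
    return sorted(responses, key=lambda r: r.hits, reverse=True)


def _broken(query: Dict[str, str]) -> int:
    raise ConnectionError("server unreachable")


providers = [
    Provider("archive-a", "http://archive-a.example/search",
             lambda q: 12 if q.get("bandpass") == "optical" else 0),
    Provider("archive-b", "http://archive-b.example/search", lambda q: 0),
    Provider("archive-c", "http://archive-c.example/search", lambda q: 40),
    Provider("archive-d", "http://archive-d.example/search", _broken),
]

for r in distribute({"name": "3C273", "bandpass": "optical"}, providers):
    print(f"{r.provider}: {r.hits} hits, refine at {r.url}")
```

Providers that report no hits, or that fail to answer, are simply left out of the result list; only the hit count and the link back to the provider are carried forward.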

4.2  Technology and Implementation

It was not possible during a 1 1/2 day workshop to fully define all aspects of the system implementation. However, a few basic design principles were generally agreed upon, and details will be fleshed out in the coming weeks. There are three primary components needed in order to implement this type of service:

1. Server locator.
This is a mechanism for obtaining a list of all participating servers, e.g., as the entry point URLs. Two approaches were discussed: (a) a central server that maintains a registry of participating data centers, and (b) a completely distributed approach where each server knows n other servers and the entire network can be determined dynamically with a few network queries. The latter approach is how the WWW works, and has the strength that the system is largely self-maintaining. Having a central server may assure a greater level of completeness but requires an ongoing maintenance effort. It is likely that the initial development of the server locator would be based on a central registry and expanded to a more dynamic configuration later.
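
To illustrate option (b), the sketch below discovers the network by breadth-first traversal, starting from one known server and repeatedly asking each newly found server for the peers it knows. The peer lists are held in an in-memory dictionary that stands in for those network queries, and the server URLs are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def discover(start_url: str, known_peers: Dict[str, List[str]]) -> Set[str]:
    """Return every server reachable from start_url by following peer lists."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        current = queue.popleft()
        # In a real system this lookup would be a query to the server at `current`.
        for peer in known_peers.get(current, []):
            if peer not in seen:  # guards against cycles in the peer graph
                seen.add(peer)
                queue.append(peer)
    return seen


# Hypothetical peer lists: each server knows a few others.
known_peers = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": [],
    "http://d.example/": ["http://b.example/"],
    "http://e.example/": ["http://a.example/"],  # knows others, but nobody links to it
}

print(sorted(discover("http://a.example/", known_peers)))
```

In this example the server at e.example is never found, because no other server lists it. This is the completeness concern that a central registry addresses.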

2. Data holdings query.
This is a standard query each server would implement which returns a fairly static description of the data holdings available on the server. For example, for each database a number of standard attributes would be returned such as the database name, a textual description of the database, wavelength coverage, type of database, possibly the sky coverage, and so on. This is needed to find out what is out there and to optimize queries. The characteristics of the data holdings would correspond to the categories described in Table 2.
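
A holdings description of this kind might look like the record below, with a helper that uses the descriptions to decide which databases are worth querying at all. The record layout, the field names, and the example servers are assumptions made for illustration; the actual attributes were left to be defined.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Holding:
    """Static description of one database on one server (hypothetical layout)."""
    server: str
    database: str
    description: str
    bandpasses: FrozenSet[str]
    data_types: FrozenSet[str]


def candidate_holdings(holdings: List[Holding],
                       bandpass: Optional[str] = None,
                       data_type: Optional[str] = None) -> List[Holding]:
    """Keep holdings compatible with the query; an unset criterion matches all."""
    return [
        h for h in holdings
        if (bandpass is None or bandpass in h.bandpasses)
        and (data_type is None or data_type in h.data_types)
    ]


# Made-up holdings for three imaginary servers.
holdings = [
    Holding("archive-a", "uv_spectra", "UV spectra from a space mission",
            frozenset({"uv"}), frozenset({"spectrum"})),
    Holding("archive-b", "sky_images", "Optical and UV imaging survey",
            frozenset({"optical", "uv"}), frozenset({"image"})),
    Holding("archive-c", "radio_vis", "Radio interferometer visibilities",
            frozenset({"radio"}), frozenset({"visibility data"})),
]

for h in candidate_holdings(holdings, bandpass="uv", data_type="image"):
    print(h.server, h.database)
```

Because the descriptions change rarely, a client could cache them and send data locator queries only to the databases that survive this filter.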

3. Data locator query.
This is a query to find all data objects matching a standard template or profile. A simple profile might be something like a given region of the sky and wavelength region (UV, optical, IR, X-ray, etc.), or an object name. Other attributes such as the name of the database to be searched could be used to refine the query. The minimal response is the number of objects found. A more complete response includes a block of text for each object describing the object found and including one or more URLs to be used to obtain more information about the object. If this URL is selected one enters the server domain, and what happens there is up to the particular server; the access protocol doesn't attempt to standardize this.
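
A server-side sketch of such a query is given below for the positional case. It assumes the profile is a cone on the sky (right ascension, declination, and radius in degrees) with an optional bandpass, and it searches a small in-memory list of made-up records; the record fields and URLs are hypothetical. The angular separation uses the standard haversine formula.

```python
import math
from typing import Dict, List, Optional


def angular_separation_deg(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def locate(records: List[Dict], ra: float, dec: float, radius: float,
           bandpass: Optional[str] = None, detailed: bool = False) -> Dict:
    """Return the minimal response (a count) or, if detailed, per-object text and URLs."""
    matches = [
        r for r in records
        if angular_separation_deg(ra, dec, r["ra"], r["dec"]) <= radius
        and (bandpass is None or r["bandpass"] == bandpass)
    ]
    response: Dict = {"count": len(matches)}
    if detailed:
        response["objects"] = [
            {"description": r["description"], "urls": [r["url"]]} for r in matches
        ]
    return response


# Made-up records; positions are in degrees.
records = [
    {"ra": 10.0, "dec": 20.0, "bandpass": "optical",
     "description": "optical image, field 1", "url": "http://archive.example/obs/1"},
    {"ra": 10.1, "dec": 20.0, "bandpass": "uv",
     "description": "UV spectrum, field 1", "url": "http://archive.example/obs/2"},
    {"ra": 50.0, "dec": -30.0, "bandpass": "optical",
     "description": "optical image, field 2", "url": "http://archive.example/obs/3"},
]

print(locate(records, 10.0, 20.0, 0.5))
print(locate(records, 10.0, 20.0, 0.5, bandpass="uv", detailed=True))
```

The URLs in the detailed response are the hand-off point: following one takes the user into the provider's own interface, which this sketch does not model.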

A key point is that the queries in 2) and 3) above are static and predefined, consisting of a number of fields which may or may not be filled out. Only the filled out fields are used to form the query. These fields and the units they are specified in are an external interface and need not bear any relationship to what is used internally by the server. Some sort of translation may be required before the query can be processed internally by the server. Free-form queries are not permitted. Apart from the textual responses to the data holdings and data locator queries, the form in which data files or other information are returned to the user is not specified.

The fields or attributes required for the data holdings and data locator queries still need to be defined. At any given time these will be fixed and predefined, but new fields could be added in the future as the facility evolves. A given server may not support all of the possible fields, in which case they are not used to refine the query.
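
The handling of optional and unsupported fields, and the translation from the external interface to a server's internal representation, might be sketched like this. The field names, the translation table, and the internal units are all hypothetical; they illustrate the principle rather than any agreed interface.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def filled_and_supported(form: Dict[str, Optional[str]],
                         supported: Set[str]) -> Dict[str, str]:
    """Keep only fields that were filled out and that this server supports."""
    return {
        field: value.strip()
        for field, value in form.items()
        if value is not None and value.strip() and field in supported
    }


# Hypothetical mapping from external field names and units to one server's own:
# external field -> (internal column, converter).
TRANSLATION: Dict[str, Tuple[str, Callable[[str], object]]] = {
    "name": ("target", str.upper),
    "radius_deg": ("radius_arcmin", lambda v: float(v) * 60.0),
}


def to_internal(constraints: Dict[str, str]) -> Dict[str, object]:
    """Translate external constraints into this server's internal form."""
    internal = {}
    for field, value in constraints.items():
        column, convert = TRANSLATION[field]
        internal[column] = convert(value)
    return internal


form = {"name": "3c273", "radius_deg": "0.5", "bandpass": "", "observatory": "HST"}
constraints = filled_and_supported(form, supported=set(TRANSLATION))
print(constraints)
print(to_internal(constraints))
```

Here the empty bandpass field is dropped because it was not filled out, and the observatory field is dropped because this imaginary server does not support it, so neither is used to refine the query.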

4.3  Impact on Data Providers

The implementation strategy described above places minimal requirements on data providers. An agreement needs to be reached on interfaces - data profiles and standard/minimal conformance for responses to queries - but there is no need for data providers to modify their existing data access facilities.

Data providers would have to either register their servers with a centralized coordinator, or broadcast the services they provide using an agreed upon syntax. This registry information would need to include

Data providers need to be protected against poorly qualified queries that can result in huge numbers of hits against their databases. A simple way to limit such queries is to limit the number of hits the server will provide (e.g., 50) in response to a single query. In principle, however, the vulnerability here is no different than what already exists given that users can access the data providers' catalogs and databases now, one at a time.
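
A hit limit of this kind takes only a few lines on the server side. The sketch below uses the example figure of 50 from the text; reporting the total count together with a truncation flag is an assumption about what a response might contain, and the function name is hypothetical.

```python
from typing import Dict, Sequence

MAX_HITS = 50  # example limit from the text; each server would choose its own


def capped_response(matches: Sequence[str], limit: int = MAX_HITS) -> Dict[str, object]:
    """Return at most `limit` matches, flagging when the result was cut short."""
    return {
        "total": len(matches),
        "returned": list(matches[:limit]),
        "truncated": len(matches) > limit,
    }


# A poorly qualified query that matches 1200 made-up records.
hits = [f"record-{i}" for i in range(1200)]
response = capped_response(hits)
print(response["total"], len(response["returned"]), response["truncated"])
```

A truncation flag lets the client tell the user to refine the search instead of silently presenting a partial list.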

While we envision participants in the workshop as being the first to (jointly?) implement a client user interface, the use of standard interfaces will allow any number of users and data providers to develop or customize the client applications. Browsers can be made as simple or as sophisticated as desired, or as resources permit.

4.4  Schedule and Resources

The workshop participants agreed that a prototype system could be developed in a matter of a few months, and that future conferences (ADASS meeting, October 1995; AAS meeting, January 1996) will provide a chance for ongoing discussion. It should be possible to get a prototype going which links a limited number of data providers by early 1996.

The prototype can probably be built using existing resources or with extremely modest augmentations to already funded projects. Participants felt it was important to continue the coordination and planning efforts as exemplified by this first workshop on an annual basis, with meetings planned in conjunction with events such as the ADASS Conference. A modest incremental budget might be needed to support these meetings. Overall, however, the cost of implementing a distributed data locator service should be trivial compared to the cost of acquiring the original data and supporting the mission- and facility-based archives and databases.

Appendix

The table below gives a synopsis of the data services provided by the organizations represented at the workshop, and does not list all data services in astronomy.

Organization   Data Holdings                                           Volume

HEASARC        Various high-energy data sets (ROSAT, Einstein, etc.)   400 GB + 2-4 GB/day for XTE operations
               Astronomical catalogs, all wavelength regions           144 catalogs
               ``skymap'' virtual telescope
NSSDC/NDADS    Various NASA mission data sets                          ?? GB
               Astronomical catalogs, all wavelength regions           ?? catalogs
CEA            EUVE data                                               100 GB reduced, 500 GB raw
NCSA/U Ill     BIMA data (mm-array radio telescope)                    just beginning
               Astronomy Digital Image Library                         just beginning
DAO/CADC       HST data (partial copy)                                 500 GB ??
               CFHT archive                                            ?? GB
ST-ECF         HST data (full copy)                                    1800 GB
               ESO archives                                            ?? GB
               Infrared Space Observatory archive                      not yet launched, 2000 GB when complete
STScI          HST data                                                1800 GB + 750 GB/yr
               VLA FIRST Survey, radio data                            10 GB
               Digitized Sky Survey                                    60 GB
NRAO           VLA data (visibilities)                                 1500 GB + 200 GB/yr
NOAO           KPNO data                                               900 GB + 500 GB/yr
SAO/ASC        Einstein mission                                        ?? GB
               AXAF mission                                            not yet launched, 2000 GB when complete
CDS            SIMBAD database (1,000,000 objects), 800 catalogs