Sloan Digital Sky Survey

Work submitted to the Grid Workflow Workshop, Global Grid Forum (GGF10), Berlin, 9 March 2004


Extending the SDSS Batch Query System to 
the National Virtual Observatory Grid

María A. Nieto-Santisteban (1)
William O'Mullane (1)
Jim Gray (2)
Nolan Li (1)
Tamás Budavari (1)
Alex Szalay (1)
Aniruddha R. Thakar (1)

(1) The Johns Hopkins University
(2) Microsoft Research


The Sloan Digital Sky Survey science database is approaching 1 TB in size.
While the vast majority of queries execute in seconds or minutes, a small
fraction take hours or days to run, either because they require non-index
scans of the largest tables or because they request very large result sets.
These long-running queries can disproportionately delay the otherwise prompt
execution of the rest.  In response, a job submission and tracking system
with multiple queues has been developed.  Transferring very large query
result sets over the network is another serious problem.  Statistics
suggested that much of this data transfer is unnecessary; users would prefer
to store results locally in order to allow further cross-matching and
filtering.  To allow local analysis, a system was developed that gives users
their own personal database (MYDB) at the portal site.  Users may transfer
data to their MYDB and then perform further analysis before extracting the
results to their own machine.
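
As a rough illustration of the queueing and MYDB ideas described above, the
following Python sketch routes submitted queries to one of two queues and
writes each result set into a per-user store instead of returning it over the
network. It is not the CasJobs implementation. The routing rule (estimated
runtime against a 60-second threshold), the queue names, and the names `Job`,
`BatchQueues`, and `QUICK_LIMIT_SECONDS` are all assumptions made for
illustration, and in-memory SQLite databases stand in for both the science
database and MYDB.

```python
import sqlite3
from collections import deque
from dataclasses import dataclass, field
from itertools import count

# Hypothetical threshold; the source does not say how the queues are split.
QUICK_LIMIT_SECONDS = 60

_job_ids = count(1)


@dataclass
class Job:
    user: str
    sql: str                  # query to run against the science database
    target_table: str         # table to create in the user's MYDB
    estimated_seconds: float  # assumed to come from some cost estimate
    job_id: int = field(default_factory=lambda: next(_job_ids))
    status: str = "queued"


class BatchQueues:
    """Toy job submission and tracking system with two queues."""

    def __init__(self, science_db):
        self.science_db = science_db
        self.queues = {"quick": deque(), "long": deque()}
        self.mydb = {}   # user name -> SQLite connection standing in for MYDB
        self.jobs = {}   # job_id -> Job, so that status can be tracked

    def submit(self, job):
        # Route by estimated runtime so long jobs cannot block short ones.
        name = "quick" if job.estimated_seconds <= QUICK_LIMIT_SECONDS else "long"
        self.queues[name].append(job)
        self.jobs[job.job_id] = job
        return name

    def run_next(self, queue_name):
        # Run the oldest job in the given queue; return None if it is empty.
        if not self.queues[queue_name]:
            return None
        job = self.queues[queue_name].popleft()
        job.status = "running"
        cursor = self.science_db.execute(job.sql)
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()

        # Store the result in the user's MYDB rather than shipping it out.
        if job.user not in self.mydb:
            self.mydb[job.user] = sqlite3.connect(":memory:")
        store = self.mydb[job.user]
        col_list = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        store.execute(f'CREATE TABLE "{job.target_table}" ({col_list})')
        store.executemany(
            f'INSERT INTO "{job.target_table}" VALUES ({marks})', rows
        )
        store.commit()
        job.status = "finished"
        return job
```

In this sketch a user would submit a `Job`, poll its `status`, and then issue
follow-up queries against the table in their own store, extracting only the
final, filtered result to their machine.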

We now intend to extend the MYDB and asynchronous query ideas to multiple NVO
nodes. This implies developing, in a distributed manner, several features
that have already been demonstrated for a single node in the SDSS Batch Query
System (CasJobs). Generalizing asynchronous queries requires some form of
MYDB storage and workflow tracking services on each node, as well as
coordination strategies among nodes.

María Nieto-Santisteban
Last Modified: Wednesday, November 24, 2004 at 9:29:59 AM, $Revision 1.2 $