Research Contacts

Project Links

DataCutter                                         
 

The continuing increase in the capabilities of high-performance computers and the continuing decrease in the cost of secondary and tertiary storage systems are making it increasingly feasible to generate and archive very large (e.g., petabyte and larger) datasets.  Applications are also increasingly likely to use archived data obtained from different types of sensors, including microscopy and radiology imagery.

Simulation or sensor datasets generated or acquired by one group may need to be accessed over a wide-area network by other groups.  These datasets frequently describe collections of very large structured or unstructured grids in which each grid point is associated with several variables.  Applications frequently need only portions of a dataset, and the required data may correspond to a particular region in multidimensional space.  An application may need all of the data within a multidimensional region, or it may need only certain variable values at a subsampled set of spatial locations.  In some cases, applications may also require data products obtained by aggregating data in one way or another.  For instance, a user might require time-averaged or space-averaged data.
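The three access patterns described above (a range query over a multidimensional region, selection of certain variables at subsampled locations, and aggregation such as time averaging) can be illustrated with a small sketch.  This is not DataCutter code: the toy in-memory grid and the names make_grid, range_query, and time_average are hypothetical, chosen only to make the access patterns concrete.

```python
from statistics import mean

# Hypothetical toy dataset (not a DataCutter structure): one record per
# grid point (x, y) and time step t, each holding several variables.
def make_grid(nx, ny, nt):
    return {
        (x, y, t): {"temp": float(x + y + t), "pressure": float(x * y)}
        for x in range(nx)
        for y in range(ny)
        for t in range(nt)
    }

def range_query(grid, lo, hi, variables=None, stride=1):
    """Select points with lo <= coord <= hi in every dimension (inclusive).

    Only every `stride`-th point in x and y (counted from lo) is kept, and
    only the requested variables are returned (all of them if None).
    """
    result = {}
    for coord, values in grid.items():
        inside = all(l <= c <= h for c, l, h in zip(coord, lo, hi))
        x, y, _t = coord
        on_lattice = (x - lo[0]) % stride == 0 and (y - lo[1]) % stride == 0
        if inside and on_lattice:
            result[coord] = {
                name: value
                for name, value in values.items()
                if variables is None or name in variables
            }
    return result

def time_average(points, variable):
    """Aggregate: average one variable over time at each (x, y) location."""
    series = {}
    for (x, y, _t), values in points.items():
        series.setdefault((x, y), []).append(values[variable])
    return {xy: mean(vals) for xy, vals in series.items()}

grid = make_grid(nx=4, ny=4, nt=3)
# Region x, y in [0, 2], all three time steps, every second spatial point,
# and only the "temp" variable.
subset = range_query(grid, lo=(0, 0, 0), hi=(2, 2, 2),
                     variables={"temp"}, stride=2)
averaged = time_average(subset, "temp")
```

Here the query returns 12 of the 48 stored records, and the aggregation reduces them further to four time-averaged values, which is the kind of reduction that makes it worthwhile to process data before it crosses the network.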

DataCutter is a middleware infrastructure developed by researchers in the University of Maryland Computer Science and the Johns Hopkins Pathology Informatics Departments.  It enables subsetting and user-defined filtering of multi-dimensional datasets stored in archival storage systems across a wide-area network.  DataCutter also provides a core set of services, on top of which application developers can implement more application-specific services, or which they can combine with existing Grid services such as metadata management, resource management, and authentication services.

The main design objective in DataCutter is to support accessing subsets of datasets via range queries, and carrying out user-defined aggregations and transformations, for very large datasets held in archival storage systems in a shared distributed computing environment.  To make efficient use of distributed shared resources within the DataCutter framework, the application processing structure is decomposed into a set of processes, called filters.  DataCutter uses these distributed processes to carry out a rich set of queries and application-specific data transformations.  Filters can execute anywhere (e.g., on computational farms), but are intended to run on a machine close (in terms of network connectivity) to the archival storage server, or within a proxy close to co-located clients.  Filter-based algorithms are designed with predictable resource requirements, which makes them well suited to carrying out data transformations on shared distributed computational resources.

Another goal of DataCutter is to provide common support for subsetting very large datasets through multi-dimensional range queries.  A very large dataset may be stored as a large set of large data files, and thus presents a large space to index.  A single index for such a dataset could itself be very large and expensive to query and manipulate.  To ensure scalability, DataCutter uses a multi-level hierarchical indexing scheme.
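The two ideas above, a multi-level index that prunes the search space and a chain of filters that processes what remains, can be sketched together.  The sketch below is illustrative only and does not reproduce the DataCutter API: the text does not describe the index layout, so the example assumes one bounding box per data file above one bounding box per chunk, and every name in it (Chunk, DataFile, find_chunks, read_filter, clip_filter, sum_filter) is hypothetical.  Filters are modelled as Python generators in a single process rather than as distributed processes.

```python
from dataclasses import dataclass

# A box is ((lo_x, lo_y), (hi_x, hi_y)) with inclusive bounds.
def overlaps(a, b):
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

@dataclass
class Chunk:
    box: tuple     # bounding box of the points in this chunk
    points: list   # list of ((x, y), value) pairs

@dataclass
class DataFile:
    box: tuple     # bounding box covering every chunk in the file
    chunks: list

def find_chunks(files, query):
    """Assumed two-level lookup: prune whole files by their bounding box,
    then test chunk boxes only inside the files that survive."""
    for data_file in files:
        if overlaps(data_file.box, query):
            for chunk in data_file.chunks:
                if overlaps(chunk.box, query):
                    yield chunk

# Filters: each stage consumes a stream and produces a stream or a result.
def read_filter(chunks):
    for chunk in chunks:
        yield from chunk.points

def clip_filter(points, query):
    lo, hi = query
    for coord, value in points:
        if all(l <= c <= h for c, l, h in zip(coord, lo, hi)):
            yield coord, value

def sum_filter(points):
    total, count = 0.0, 0
    for _coord, value in points:
        total += value
        count += 1
    return total, count

def make_chunk(x0, y0, size):
    points = [((x, y), float(x + y))
              for x in range(x0, x0 + size)
              for y in range(y0, y0 + size)]
    return Chunk(((x0, y0), (x0 + size - 1, y0 + size - 1)), points)

files = [
    DataFile(((0, 0), (3, 1)), [make_chunk(0, 0, 2), make_chunk(2, 0, 2)]),
    DataFile(((0, 2), (3, 3)), [make_chunk(0, 2, 2), make_chunk(2, 2, 2)]),
]
query = ((1, 0), (2, 1))

selected = list(find_chunks(files, query))
total, count = sum_filter(clip_filter(read_filter(selected), query))
```

In this toy run the second file is rejected by its file-level box without any of its chunks being examined, and only the aggregated result (a sum and a count) leaves the filter chain.  In a distributed setting, the same reasoning motivates placing the reading and clipping stages close to the storage server.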

 


Tahsin Kurc, Ph.D.

cs.umd.edu/

Alan Sussman, Ph.D.

Joel H. Saltz, M.D., Ph.D.

Michael Beynon

This description of DataCutter was excerpted from "DataCutter: Middleware for Subsetting and Filtering Very Large Multidimensional Datasets on Archival Storage Systems" by Kurc T., Beynon M., Sussman A., and Saltz J.