An Overview of the Earth System Science Workbench

James Frew
Rajendra Bose
Donald Bren School of Environmental Science and Management
University of California, Santa Barbara
http://essw.bren.ucsb.edu/
April, 2001

Terms in bold are listed in the Glossary of Terms in section 7.

1. Introduction

A relatively small number of Earth science data centers, such as the EROS Data Center in Sioux Falls, SD, and the Goddard Space Flight Center in Greenbelt, MD, now produce and distribute the large data sets derived from environmental models and global satellite imagery.  Myriad research groups use these data sets to generate useful higher-level data products that, in their totality, would pose a challenge for these same few data centers to collect and redistribute.  A recent National Research Council report [1] suggests that, in the future, scientists with useful algorithms for processing Earth observing instrument data should be prepared to manage self-contained data production facilities for their research communities.

The Earth System Science Workbench (ESSW) is a nonintrusive data management infrastructure for these researchers who must also be data producers.  ESSW is designed to support the publishing and archiving of data products.  ESSW is transparent, in that the data processing tasks scientists already perform are exposed to the outside world.  ESSW also adds robustness to computing environments, providing a richer and more stable interface to a hierarchical file system.

These concepts are implemented in ESSW’s Lab Notebook and No Duplicate-Write Once Read Many (ND-WORM) services.  The Lab Notebook logs processes (experiments) and their relationships via a custom application programming interface (API) to XML documents stored in a relational database.  Lab Notebook Tools allow product searching and ordering, and file and metadata management.  The ND-WORM provides a managed storage archive for the Lab Notebook by keeping unique file digests and namespace metadata, also in a relational database.

Ideally, ESSW enables researchers to:

  • store metadata for their experiments with little effort,
  • reveal the lineage of their data products, and
  • manage and control product storage.

1.1. Project Background

The ESSW project, led by James Frew, professor in the Donald Bren School of Environmental Science and Management at the University of California Santa Barbara (UCSB), is one of two dozen original groups selected in December 1997 for participation in the NASA-sponsored Earth Science Information Partner (ESIP) Federation.

The ESSW project at the Bren School uses a local ESSW installation at UCSB to archive and publish Earth science data products generated by several ongoing studies at UCSB and elsewhere. The project goal is to provide proof-of-concept for new methods that will enable existing and future ESIPs to meet the challenges of supplying data products to the research community.

1.2. The Components of ESSW

ESSW is a data management infrastructure for Earth science products.  It consists of two basic components: the Lab Notebook and Labware.

1.2.1. Lab Notebook

The Lab Notebook component of ESSW is the digital analog of a scientist’s handwritten laboratory notes.  The Lab Notebook is a Java client/server application supported by a relational database.  A Lab Notebook Client uses an API, currently implemented for Perl, to communicate with the Lab Notebook Server.

The Lab Notebook Server functions via LN Daemon and LN Console Java applications that communicate with an LN Database using the Java LN API and the JDBC/ODBC database connectivity protocols. (Note that throughout this paper, LN is our standard abbreviation for Lab Notebook).

1.2.2. Lab Notebook Tools

The metadata stored in the Lab Notebook assists researchers or others with identifying particular data products, as well as tracking the steps leading to creation of a product, or the product lineage.   Lab Notebook Tools are applications allowing Web browser access to this metadata.  The Tools are implemented through common gateway interface (CGI) interaction with the LN Database.

1.2.3. Labware

Labware is the digital analog of the collection of equipment and instruments in a scientist’s laboratory, just as the Lab Notebook is the analog of handwritten laboratory notes.  Labware reflects trends in scientific computing such as the use of large online data storage and inexpensive computing clusters, and includes No Duplicate-Write Once Read Many (ND-WORM) services to provide robust file archiving.  Labware ND-WORM services are composed of a Java server application, a supporting database, and a dedicated disk storage area.  A Lab Notebook Client can send instructions to the ND-WORM Server Java application.

2. Lab Notebook

The challenge facing the ESSW team at the outset of the project was building a tool for automated metadata and lineage collection to support the widest variety of computational, Earth science-related research applications possible.  Essentially, the team sought to construct a generic, computer-based Lab Notebook to enable existing and future ESIPs to manage and publish Earth science data products.  A lightweight client/server architecture for metadata collection and storage was developed that could work in concert with existing applications on different platforms.  The use of Perl scripting and XML as vehicles for data transfer contributes to the flexibility built into the Lab Notebook.

The following sections elaborate on our use of “science objects” to define the structure of an Earth science model, describe the overall structure of the Lab Notebook, and provide details about how Lab Notebook Clients and Server operate.

2.1. Science Objects

At the conceptual core of the Earth System Science Workbench are generic science objects that serve as a useful framework for defining and collecting metadata for environmental and Earth science models.  Essentially, science objects are entities in ESSW that represent real world items such as files and data processing routines.

Within ESSW, an experiment is defined as one particular instance, or execution, of some science model.  Each experiment consists of one or more experiment steps.  Each experiment step is a computational process with inputs and outputs.  ESSW designates models, experiments, experiment steps, inputs, and outputs as science objects.

Science objects also contribute to our definition of science workflow, the execution of a sequence of experiments or experiment steps.  Researchers map out their experiment flow in terms of the science objects they want the system to track.  A way for scientists to create a visual map of their science workflow is to draw a directed acyclic graph (DAG) showing the linked steps in their experiments (Figure 2.1).  A DAG for a science model is thus composed of science objects that will be assigned metadata descriptions during the execution of an experiment.
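To make the DAG idea concrete, the following is a minimal Python sketch, not part of ESSW, that represents a one-step workflow as a mapping from each science object to the objects it depends on and derives a valid execution order.  The object names (raw_input, process_step, derived_output) and the function execution_order are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical science objects for a one-step experiment.
# Each key maps an object to the set of objects it depends on.
workflow = {
    "process_step": {"raw_input"},        # the step consumes the input file
    "derived_output": {"process_step"},   # the output is produced by the step
}


def execution_order(dag):
    """Return science objects in an order that respects every dependency.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Because the graph must be acyclic, a workflow in which an output is fed back into the step that produced it would be rejected by the sorter rather than silently accepted.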

2.2. Lab Notebook Client/Server Architecture

In a Lab Notebook installation, a researcher’s workstation acts as a client that communicates with one or more servers through Perl LN Client API calls (Figure 2.2).  The LN Daemon application constructs XML documents with specific metadata values sent from the client, connects with the LN Database, and transfers the XML documents, as well as certain parsed metadata values from them, to database records.  Metadata and lineage can then be accessed through database queries.  The next sections provide more detail about the Lab Notebook client/server architecture.

2.2.1. Lab Notebook Client

A Lab Notebook Client functions through one or more Perl scripts with calls to the custom Perl LN Client API.  The LN Client API is a set of client-side Perl modules, or function libraries, that sends instructions to a server-side Java daemon application, the LN Daemon.

One goal of the ESSW project is to provide automated metadata and lineage collection while minimizing the disruption to pre-existing research computing tasks.  Perl scripts provide a way to “wrap” existing methodologies with the LN Client API calls necessary for Lab Notebook Server functionality.  The decision to use Perl wrappers as a mechanism for automated metadata and lineage collection was based on several factors:  Perl is already widely used within the scientific programming community for computational tasks; Perl software is free and available for most platforms; a scripting language is easier to learn than a full programming language such as Java; and a scripting language such as Perl requires less computing overhead than a full programming language such as Java.  The only software installation required for Lab Notebook Client workstations is Perl 5 as well as the LN Client API Perl modules.

2.2.2. Lab Notebook Server

The Lab Notebook Server collects metadata values from Lab Notebook Clients, constructs XML documents using these values according to predefined metadata templates, and stores the XML documents and parsed metadata values from them.

Two primary components of the Lab Notebook Server are the LN Daemon and a dedicated LN Database.  The LN Daemon is a Java daemon application residing on an application server that responds to LN Client API calls in client-side Perl scripts.  The LN Daemon uses the class libraries of the IBM XML 4J Parser for Java [2] to construct and validate XML documents.  The LN Daemon communicates with the LN Database using the Java LN API, which creates tables and executes SQL queries through a JDBC/ODBC interface.

A third component of the Lab Notebook Server is the LN Console, a Java application providing a command line interface which communicates directly with the LN Database.  The LN Console, viewed as an ESSW administrator’s tool, supports a subset of LN API functions as well as specializations not found in the LN Client API.  In the current implementation of ESSW, the LN Console provides the only way to define science models, submit XML document type definitions (DTDs) to the Lab Notebook, and create metadata templates for science objects based on the submitted DTDs.

2.2.3. The LN Database

A version of ESSW has been developed using the freely available MySQL [3] relational database as the LN Database, although an initial version was developed using the Informix Dynamic Server [4] object-relational database management system.  A set of base tables in the LN Database schema (Figure 2.3) holds details about science models used in the system, and the subsequent uses of the models, or experiments.  For each science object defined in a particular model, a new table is created via SQL by Perl LN Client API commands.  These tables hold metadata in the form of XML documents, and database fields are created for certain XML tags.  Records are then programmatically added to these tables each time an experiment is performed.
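The table-per-science-object layout described above can be sketched as follows.  This is an illustrative Python example using an in-memory SQLite database, not the actual LN Database schema (which is shown in Figure 2.3); the table name, the column names object_id and xml_doc, the sample values, and both helper functions are assumptions.

```python
import sqlite3


def create_object_table(conn, name, fields):
    """Create one table per science object: the XML document plus parsed fields.

    Identifiers are interpolated into the SQL text, so they must come from a
    trusted source (here, the metadata template), never from user input.
    """
    columns = ", ".join(f'"{field}" TEXT' for field in fields)
    conn.execute(
        f'CREATE TABLE "{name}" '
        f"(object_id INTEGER PRIMARY KEY, xml_doc TEXT, {columns})"
    )


def add_record(conn, name, xml_doc, values):
    """Store the XML document and the values parsed from selected tags."""
    names = ", ".join(f'"{field}"' for field in values)
    marks = ", ".join("?" for _ in values)
    cursor = conn.execute(
        f'INSERT INTO "{name}" (xml_doc, {names}) VALUES (?, {marks})',
        [xml_doc, *values.values()],
    )
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
create_object_table(conn, "process_step", ["host_name", "who"])
doc = "<process_step><host_name>teal</host_name><who>pete</who></process_step>"
object_id = add_record(conn, "process_step", doc, {"host_name": "teal", "who": "pete"})
row = conn.execute(
    'SELECT host_name, who FROM "process_step" WHERE object_id = ?', (object_id,)
).fetchone()
print(row)
```

Keeping the full XML document alongside a few parsed columns lets queries filter on common fields without parsing XML, while the complete metadata record remains available.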

2.3. Using the Lab Notebook to Collect Metadata

2.3.1. Overview

As described in section 1, ESSW minimizes disruption to preexisting research computing tasks because of the flexibility built into its implementation.  Researchers map out their experiment flow only in terms of the science objects they want to track.  The DAG created for a science model is thus composed of science objects that will be assigned metadata descriptions each time an experiment is performed (Figure 2.4, (A)).

The metadata describing each object is collected and stored by the Lab Notebook.  The Lab Notebook must therefore know in advance about the types of science objects defined for a science model, prior to use of that model.  The procedure of specifying to the Lab Notebook the name and kind of metadata to capture for a science object is viewed as adding a metadata template to the Lab Notebook (Figure 2.4, (B)).  Adding a metadata template to the Lab Notebook, performed with the LN Console, requires reference to a formal XML DTD.

After adding metadata templates, the Lab Notebook is aware of the science objects and the metadata specifications for each object.  Yet the Lab Notebook does not know the relationships between the various science objects until the first instance of the model, or experiment, has been performed.  The “wrapper” or other script(s) written to execute the experiment performs three or four basic tasks.

The script creates an instance of each science object defined in the DAG for an experiment, with each object automatically assigned a unique object_id (Figure 2.4, (C)), and supplies specific metadata values for each science object instance.  The script also defines how the science objects are linked together, viewed as registering inputs and outputs for experiments and each experiment step (Figure 2.4, (C)).  These linkage definitions allow a researcher to later recall the lineage for a particular science object.
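The three tasks just described (creating instances, supplying metadata, and registering inputs and outputs) can be sketched in a few lines of Python.  This is an in-memory illustration only: the functions create_instance, register_io, and lineage are hypothetical stand-ins and do not reproduce the Perl LN Client API, and the object names and metadata values are invented.

```python
from itertools import count

_next_id = count(1)   # stand-in for the Lab Notebook's object_id assignment
objects = {}          # object_id -> (template name, metadata dict)
inputs_of = {}        # object_id -> list of object_ids it was derived from


def create_instance(template, metadata):
    """Create a science object instance and return its unique object_id."""
    object_id = next(_next_id)
    objects[object_id] = (template, metadata)
    inputs_of[object_id] = []
    return object_id


def register_io(step_id, input_ids, output_ids):
    """Record that a step consumed input_ids and produced output_ids."""
    inputs_of[step_id].extend(input_ids)
    for output_id in output_ids:
        inputs_of[output_id].append(step_id)


def lineage(object_id):
    """Return the ids of every ancestor of object_id, nearest first."""
    seen, queue = [], list(inputs_of[object_id])
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(inputs_of[current])
    return seen


raw = create_instance("input_file", {"source": "example"})
step = create_instance("process_step", {"host_name": "example-host"})
out = create_instance("output_file", {})
register_io(step, [raw], [out])
print(lineage(out))
```

Recalling lineage is then a backward walk over the registered links: starting from an output, the walk visits the step that produced it and then that step's inputs.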

Optionally, the wrapper may also instruct the Lab Notebook to permanently store an experiment or experiment step output in ESSW’s No Duplicate-Write Once Read Many (ND-WORM) storage area (Figure 2.4, (D)).  ESSW will then record the existence of this permanently stored output file using the uniquely identifying digest calculated for the file. This process is called cataloging a file.  The ND-WORM is discussed in section 3.2.2.

2.3.2. Defining metadata templates using standard elements

In order for the Lab Notebook to support automated metadata collection, researchers need to define metadata templates for the science objects they want to document as they perform their experiments.  These metadata templates are in the form of XML DTDs that provide specifications for individual metadata elements within a template.  Metadata elements are generally attributes of scientific data or processes.  For example, a metadata template might specify that the following six metadata elements be used to represent a particular processing algorithm:  account_used, command_line, control_file, host_name, operating_system, and who.  See section 2.3.4 for a specific example of defining metadata templates.

Rather than forcing researchers to construct their metadata templates from scratch for each model they use, the Lab Notebook will include an element registry in a future release to help manage metadata elements.  Metadata elements will be able to be grouped together in element sets.  The element registry will manage this scientific metadata by providing supporting documentation (e.g., definitions, value domains) for better understanding and sharing of scientific data, and will provide tools for finding, reusing, and creating metadata elements.

The element registry will include controlled vocabularies.  These vocabularies contain lists of terms and their relationships to each other.  The vocabularies are “controlled” in that new terms are added by an administrator according to the ANSI/NISO Z39.19 thesaurus standard.  Terms can be associated with element sets, and therefore can provide a standardized means of indexing and searching for sets of elements.  The Lab Notebook will include three controlled vocabularies:  the NASA GCMD’s “sources valids”, “sensor valids”, and “projects valids.”  We are assembling Web-based tools to assist researchers in creating metadata templates with the future Lab Notebook element registry.

2.3.3. Using the Lab Notebook: a step-by-step guide

The following section presents a summary of the steps required for a researcher to perform automated metadata collection for repeated runs of a processing model using ESSW.

Step 1.  Review the model and define the science objects of interest.  Sketching a DAG to represent the model may help the researcher decide which science objects need metadata.  Each science object shown in the DAG needs to be assigned a unique name.

Step 2.  Enter the model name into the Lab Notebook.  The name and description of the processing model are entered into the Lab Notebook using the LN Console.

Step 3.  Decide what metadata to collect for each science object.  The researcher may find it useful to examine existing metadata templates or Lab Notebook entries to see what metadata has been defined for similar science objects used in other models.  (The future element registry will serve as a resource for this step.)

Step 4.  Ensure XML DTDs for metadata definitions are submitted to Lab Notebook.  The researcher may need to create new XML DTDs and submit them to the Lab Notebook using the LN Console.  Or, the researcher may use existing DTDs that have been previously submitted to the Lab Notebook.

Step 5.  Add metadata templates for each science object to the Lab Notebook.  The researcher uses the LN Console to add a metadata template for each science object in the model: this defines what metadata to collect for each science object.  Adding a metadata template requires referencing an existing XML DTD in the Lab Notebook.

Step 6.  Write script(s) to perform an experiment.  There are two approaches to using scripts to execute an experiment, i.e., perform a model run.  The first approach is to use a Perl wrapper, and the second approach is to use a metadata ingest technique.

A Perl wrapper script is a text-based script which combines commands necessary to perform an experiment with LN Client API commands to communicate a series of instructions to the Lab Notebook Server.  Running the wrapper performs the experiment, and additionally creates an instance of each science object in the Lab Notebook, associates specific metadata with an object according to the metadata definition, and registers the relationship between objects.

Instead of using a wrapper, the same tasks could also be accomplished by writing some other type of script(s) to perform the experiment.  The script(s) create simple output text files to hold science object metadata in some specified format.  A separate Perl script is then run; this Perl script includes LN Client API commands that “ingest,” or import, the metadata in these text files for the science objects involved in the experiment.

Step 7.  Modify script(s) to perform subsequent experiments.  If the experiment is structured to the satisfaction of the researcher, performing another experiment may be as easy as making simple modifications to the original Perl wrapper or other scripts.

2.3.4. Case study: tracking radiative transfer experiments

At the Institute for Computational Earth System Science (ICESS) at UCSB, a method, called 2001, exists to compute net surface solar radiative flux for a variety of spectral regions.  2001 is based on simple, physical modeling of the most important radiative processes occurring within the atmosphere, namely scattering and absorption by molecules, clouds, and aerosols [5].

2001 can be thought of as a set of processing steps that begins with the acquisition of satellite derived input data produced by the International Satellite Cloud Climatology Project (ISCCP).  The ISCCP input data exist at 3-hour intervals for the entire globe from July 1983 through August 1994, at 2.5 x 2.5 degree spatial resolution.  Instantaneous fields of shortwave and photosynthetically active radiation (PAR) are computed every 3 hours for the daylight regions of the globe.  These are regridded and averaged into daily maps at 2.5 x 2.5 degree resolution.  The 2001 model furnishes broadband [6], PAR [7], and UV-A and UV-B fluxes [8] as output.

The 2001 model serves as one of the case studies for applying the tools ESSW provides to publish and archive data products.  As described below, running the 2001 model uses the metadata ingest technique rather than using a Perl wrapper to assemble metadata for each experiment run.  The following sections provide an end-to-end example of how the ESSW Lab Notebook is used according to the steps outlined in section 2.3.3.

Step 1.  Review the model and define the science objects of interest.

A DAG (Figure 2.5) shows how the ICESS researcher working with 2001 processing decided to break the procedure up into three distinct models, each with one processing step.  A total of ten science objects are pictured in the DAG: three models, three model or experiment steps, and four inputs and/or outputs.  As shown in the diagram, the three model objects have the names: “2001_preproc,” “2001_run” and “2001_calcAveDay.”  The three model/experiment step objects have the names: “idl_preprocess,” “run_2001_D1” and “calc_aveday_D1.”  The four input/output objects are also named, but these names are not shown in the figure.

The DAG shows that each 2001 experiment performed actually consists of three experiments that share input and output science objects.

Step 2.  Enter the model name into the Lab Notebook.

The following LN API command entered in the LN Console application creates a model in the Lab Notebook called “2001_preproc” based on the XML document “2001_preproc.xml.”

add metadata 2001_preproc.xml -n 2001_preproc

The contents of “2001_preproc.xml,” an XML document that provides a brief description and URL for the model, are shown below:

<?xml version="1.0" encoding='UTF-8' ?>
<!DOCTYPE model SYSTEM "template://model">
<model>
<brief>2001 model - preprocessing experiment</brief>
<url>http://www.icess.ucsb.edu/</url>
<DAG_url>http://www.icess.ucsb.edu</DAG_url>
</model>

The two other model names, “2001_run” and “2001_calcAveDay,” are entered into the Lab Notebook in the same fashion.

Step 3.  Decide what metadata to collect for each science object.

The metadata to collect for a particular science object is specified in an XML DTD that has the same name as the science object.  For example, the XML DTD for the model/experiment step “run_2001_D1” that defines the six metadata elements account_used, command_line, control_file, host_name, operating_system, and who is shown below:

<!ELEMENT run_2001_D1 (account_used, command_line, control_file, host_name, operating_system, who?)>
    <!ELEMENT account_used (#PCDATA)>
    <!ATTLIST account_used dbtype CDATA #FIXED "char(15)"
                    description CDATA #FIXED "account name that ran the script">
    <!ELEMENT command_line (#PCDATA)>
    <!ATTLIST command_line dbtype CDATA #FIXED "varchar(128,8)"
                    description CDATA #FIXED "command line used to run 2001 model">
    <!ELEMENT control_file (#PCDATA)>
    <!ATTLIST control_file dbtype CDATA #FIXED "varchar(128,8)"
                    description CDATA #FIXED "control file path and filename">
    <!ELEMENT host_name (#PCDATA)>
    <!ATTLIST host_name dbtype CDATA #FIXED "char(15)"
                    description CDATA #FIXED "host name where the script ran">
    <!ELEMENT operating_system (#PCDATA)>
    <!ATTLIST operating_system dbtype CDATA #FIXED "char(15)"
                    description CDATA #FIXED "installed OS which runs 2001">
    <!ELEMENT who (#PCDATA)>
    <!ATTLIST who dbtype CDATA #FIXED "char(15)"
                    description CDATA #FIXED "who ran the script (if different
                    than account_used)">

This DTD was created using a simple text editor.  Individual DTDs were also created for the six remaining science objects.
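As an illustration of what a metadata record for this template might look like, the following Python sketch builds an XML document whose child elements follow the order required by the run_2001_D1 content model, with who treated as optional.  The function build_document and all metadata values are hypothetical, and the standard-library ElementTree module does not validate against a DTD, so the ordering and required-element checks are done by hand here.

```python
import xml.etree.ElementTree as ET

# Element order required by the run_2001_D1 content model; "who" is optional.
REQUIRED = ["account_used", "command_line", "control_file",
            "host_name", "operating_system"]
OPTIONAL = ["who"]


def build_document(values):
    """Return an XML string for run_2001_D1 from a dict of metadata values."""
    missing = [name for name in REQUIRED if name not in values]
    if missing:
        raise ValueError(f"missing required elements: {missing}")
    root = ET.Element("run_2001_D1")
    # Emit children in declaration order, skipping an absent optional element.
    for name in REQUIRED + OPTIONAL:
        if name in values:
            ET.SubElement(root, name).text = values[name]
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
example = build_document({
    "account_used": "pete",
    "command_line": "run_model control.txt",
    "control_file": "/tmp/control.txt",
    "host_name": "teal",
    "operating_system": "SunOS 5.7",
})
print(example)
```

A validating parser, such as the one the LN Daemon uses, would additionally check the document against the DTD itself.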

Step 4.  Ensure XML DTDs for metadata definitions are submitted to Lab Notebook. 

In the previous step, seven DTDs were created to define the metadata for the seven science objects of interest.  The Lab Notebook needs to know about the existence of these DTDs.  A simple Perl script containing LN Client API commands can be written to add all seven DTDs to the Lab Notebook.  The script is shown below:

add dtdlib D1_2001_product.dtd
add dtdlib calc_aveday_D1.dtd
add dtdlib ftp_data.dtd
add dtdlib idl_preprocess.dtd
add dtdlib input_2001_D1.dtd
add dtdlib inst_D1_output.dtd
add dtdlib run_2001_D1.dtd

Step 5.  Add metadata templates for each science object to the Lab Notebook. 

Metadata templates for the seven science objects are then created in the Lab Notebook with reference to DTDs bearing the same name.  In this case a simple Perl script containing LN Client API commands is written to add templates for all seven science objects to the Lab Notebook.  The script is shown below:

add template D1_2001_product -r D1_2001_product
add template calc_aveday_D1 -r calc_aveday_D1
add template ftp_data -r ftp_data
add template idl_preprocess -r idl_preprocess
add template input_2001_D1 -r input_2001_D1
add template inst_D1_output -r inst_D1_output
add template run_2001_D1 -r run_2001_D1

In practice, steps 2, 4 and 5 are combined into one Perl script.

Step 6.  Write script(s) to perform an experiment.

For 2001 processing, the script(s) that carry out the processing steps for each experiment create simple output text files that hold science object metadata in some specified format.  For example, the preprocessing model “2001_preproc” consists of one experiment step that is executed by an IDL script.  Embedded within one of the loops of the IDL script are instructions to create a text file consisting of lines of text with the processing step name, metadata element name, and metadata value separated by a vertical line character delimiter.  Thus, metadata is collected dynamically during processing.  An excerpt of the preprocessing IDL script is shown in Appendix 1, with the code that creates the text file shown in italics.

The metadata text file created with the name “idl_preprocess.1993.0501.metadata,” for example, thus contains the metadata for the preprocessing step using ISCCP data from 05/01/1993:

idl_preprocess | account_used | pete
idl_preprocess | host_name | teal
idl_preprocess | idl_version | 5.2.1
idl_preprocess | local_script_file | /home/data168/archive/pete/ISCCP/IDL/New.2000/read-n-write_D1.opr
idl_preprocess | operating_system | SunOS 5.7
idl_preprocess | who | pete

A separate but complementary Perl script is then created that reads in the contents of these metadata text files for each science object and includes LN Client API commands to create entries in the Lab Notebook.  The Perl script uses the output text files to “ingest” or import the metadata in the text files for the science objects involved in the experiment.
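The parsing half of the ingest technique can be sketched as follows.  The ESSW ingest script is written in Perl and calls the LN Client API; this Python example is only an illustration of reading the vertical-line-delimited format shown above, and the function name parse_metadata is an assumption.  The calls that would create Lab Notebook entries are not shown because their signatures are not given in this paper.

```python
def parse_metadata(text):
    """Group 'step | element | value' lines by processing step name."""
    steps = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first two delimiters only, so a value may contain "|".
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            raise ValueError(f"malformed metadata line: {line!r}")
        step, element, value = parts
        steps.setdefault(step, {})[element] = value
    return steps


sample = """\
idl_preprocess | account_used | pete
idl_preprocess | host_name | teal
idl_preprocess | idl_version | 5.2.1
idl_preprocess | operating_system | SunOS 5.7
idl_preprocess | who | pete
"""

metadata = parse_metadata(sample)
print(metadata["idl_preprocess"]["host_name"])
```

Each inner dictionary then holds the element names and values for one science object instance, ready to be passed to whatever call creates the corresponding Lab Notebook entry.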

An example of a PAR data product resulting from the 2001 model is shown in Figure 2.6.

Step 7.  Modify Perl wrapper to perform subsequent experiments.

2.4. Using Lab Notebook Tools to Access Metadata

Researchers interested in the outcome or details of an experiment, or others trying to locate relevant data products, query the LN Database to perform searches.  The ESSW team chose to create Web-based tools to have experiment results available to the widest audience possible.

2.4.1. Using the Lab Notebook’s Web-based tools

The tools used to access UCSB Lab Notebook entries are Web applications available from the ESSW web site (essw.bren.ucsb.edu).  The “Notebook Tools” service provides search and display capabilities for any model, experiment, or other science object in the Lab Notebook (Figure 2.7).  The metadata for a particular object can be shown (Figure 2.8), or the lineage for a science object can be displayed (Figure 2.9).

2.4.2. Creating the Lab Notebook’s Web-based tools

The Lab Notebook Web-based access tools are implemented as Perl CGI scripts which call ESSW-specific Perl modules.

Lab Notebook queries are performed by calling a custom ESSW Perl module to connect to the LN Database.  The Perl query function is then used to send SQL text statements to the database and receive the resulting record sets.  Some of the HTML pages generated by the CGI scripts include JavaScript-based functionality.

3. Labware

Labware refers to a research computing environment configured to take advantage of the services provided by ESSW for the production and distribution of Earth science data products.

3.1. ESSW Project Computing Infrastructure

The Frew Lab at the Bren School of Environmental Science and Management provides the computing infrastructure to support the ESSW project.  The basic project strategy is to explore scalable, inexpensive solutions for meeting the needs of researchers using ESSW from locations both inside and outside of the Frew Lab. The Earth science research efforts using ESSW require an array of dependable services, including the resources to perform intensive processing and the availability of large data storage.

A simplified network diagram of the Frew Lab (Figure 3.1) shows that the bulk of special purpose server machines, as well as our computing cluster and data storage, are housed in a separate server room.  This climate controlled room is equipped with a mid-range uninterruptible power supply (UPS), suitable for power outages less than an hour in duration.

To keep hardware and software costs low while maintaining a high level of performance, we use Linux as an inexpensive alternative to other operating systems, and thus contribute to its development. The user machines in the Lab itself are a mix of Windows and Linux desktop workstations.  In our server implementation plan, Sun systems are used to provide a high degree of stability for database, processing and disk servers.  Other servers, including those responsible for our network information service (NIS) and network file service (NFS), are running Linux.

The design for the Frew Lab computing environment is highly scalable.  Rather than relying on a limited supply of high performance machines to shoulder the burden of our computing services, we have the goal of reaching a comparable level of performance by distributing ESSW services among a greater number of less expensive machines.

Many satellite imagery files and large data sets are stored on our file servers.  Thus our weekly disk backups use a small number of high capacity tapes.  We are using the Legato NetWorker 5.5 backup application in conjunction with an eighty tape jukebox which is shared among a number of research groups.

3.2. Data Storage

To meet the data storage needs of Earth science researchers using ESSW, our computing environment features the use of RAID technology to provide access to over a terabyte of disk space without the use of tapes.

3.2.1. Storage concepts

The technology for using compact magnetic tapes for storage needs is improving, and is widely used in many institutions and research groups for large file storage.  Disk space, however, is inexpensive and scalable.  The ESSW project is moving away from the use of tapes as traditional tertiary storage for data access, and instead is using only online, disk based storage (Figure 3.2a and Figure 3.2b).  Hardware and software RAID solutions are improving and are providing an inexpensive way to assemble terabytes of storage space.  Tapes are still used in the ESSW computing environment, but only for backups to ensure recovery from system failures.

3.2.2. File archives for the Lab Notebook: ND-WORM services

A common way for an organization to archive data products and other files is to simply use agreed-upon policies for labeling items in a useful manner.  For example, a set of naming conventions could be created to label directories and files using some combination of version or revision numbers, date, time, and location.  The shortcomings of this method quickly become apparent if files are ever mistakenly renamed, moved, modified, overwritten, or corrupted.

More robust archiving may be important for Lab Notebook functionality. For example, as described earlier, the Lab Notebook tracks metadata for science objects defined for a set of computational experiments.  The science objects refer to items such as files and data processing routines, but the actual files used in the experiment are not stored in the LN Database.  For a series of related experiments such as outlined for the 2001 radiative transfer model in section 2.3.4, where the output of one experiment serves as the input for another, the unambiguous identity and location of experiment inputs and outputs becomes crucial.  We need to ensure a researcher can reuse science objects, and thus avoid the creation of a duplicate instance of an existing science object.   Our ND-WORM services have been designed to uniquely identify files and safeguard against undesirable manipulation and duplication.

Labware ND-WORM services are provided by a client/server Java application (Figure 3.3).  We currently use custom Perl functions in client-side scripts to construct the client Java commands that communicate with the server.  We are working to add commands to the LN Client API that directly access ND-WORM services.

When a file is copied into the dedicated disk storage area, the ND-WORM Server calculates an identifier for the file using the RSA MD5 algorithm [9].  The algorithm takes as input a message of arbitrary length and returns a 128-bit value, written as a 32-character hexadecimal string: the digest for the file.  Because two different files are extremely unlikely to share an MD5 digest, the digest can be used both to distinguish different versions of a file that share a name and to recognize identical content, and thus prevent the storage of duplicates.  True to its name, the ND-WORM does not allow modifications to stored files once they are cataloged, or “checked in” to the system.  Instead, changes can only be made to files that have been “checked out” of the archive.  Figure 3.4 shows how the Lab Notebook works in conjunction with ND-WORM services.
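The digest-based check-in described above can be sketched in a few lines of Python.  This is a minimal illustration under stated assumptions, not the ND-WORM implementation: the class ToyArchive and its check_in method are hypothetical names, and the catalog is kept in memory rather than in a relational database.

```python
import hashlib
from pathlib import Path


def md5_digest(path, chunk_size=1 << 20):
    """Return the MD5 digest of a file as a 32-character hexadecimal string."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data products need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


class ToyArchive:
    """Hypothetical in-memory catalog keyed by digest (not the ND-WORM API)."""

    def __init__(self):
        # digest -> names under which that content has been checked in
        self._by_digest = {}

    def check_in(self, path):
        """Catalog a file and return (digest, is_new).

        is_new is False when identical content is already cataloged,
        in which case a real archive would not store a second copy.
        """
        digest = md5_digest(path)
        is_new = digest not in self._by_digest
        self._by_digest.setdefault(digest, []).append(Path(path).name)
        return digest, is_new

    def names_for(self, digest):
        """Return every name recorded for a given digest."""
        return list(self._by_digest.get(digest, []))
```

Checking in two files with identical content under different names yields the same digest and is_new set to False on the second call, while two different versions saved under the same name yield different digests.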

3.3. Computing Cluster

Our Linux-based computing cluster provides an inexpensive solution for distributed, compute-intensive jobs.  As in a Beowulf class machine [10], the cluster consists of commercial off-the-shelf hardware and software, without any custom components (Figure 3.5).

One server node serves as a gateway between the cluster and the ESSW project research group.  The server is responsible for scheduling processing jobs to the other dedicated client nodes in the cluster.  The cluster operates on what is essentially a private network, but the individual cluster machines can access certain user directories and our terabyte file server.

Our researchers submit processing jobs for their science experiments to the server node, which uses freely available software called PBS (the Portable Batch System) to queue incoming computing jobs.  The client nodes consist of two subgroups of machines: one group is a collection of very inexpensive PCs (available for roughly $500 apiece at major retail stores) with 400 MHz Intel Celeron processors; the other group consists of more select, expensive PCs equipped with 450 MHz Intel Pentium III processors.

4. Case Studies

Section 2.3.4 discussed tracking global solar radiation data products generated by the 2001 model.  The following sections describe additional ESSW case studies.

4.1. Satellite Imagery Processing

One application of the current implementation of ESSW is to provide our archive of locally-acquired satellite imagery to the Web-based science community.

ICESS maintains a TeraScan ground station at UCSB capable of receiving High-Resolution Picture Transmission (HRPT) image telemetry from the Advanced Very-High Resolution Radiometer (AVHRR) sensors on board the National Oceanic and Atmospheric Administration (NOAA) polar-orbiting satellites [11]. Each AVHRR sensor continuously images a 2000 km swath of the Earth's surface, in five spectral channels (visible, near-infrared, and thermal), at a spatial resolution of 1 km. We acquire both the day and night swaths over the western United States and eastern Pacific Ocean from the NOAA-12 and NOAA-14 satellites. An example of an AVHRR image acquired at ICESS is shown in Figure 4.1.

We currently provide several UCSB research groups and others, including Regional Earth Science Application Centers (RESACs), with a variety of custom, interim AVHRR products. These interim products are then used by scientists to generate other related products for their own use. The AVHRR HRPT telemetry typically undergoes several standard processing steps, including navigation, calibration and registration, to produce the interim AVHRR products we provide to others.  These steps are outlined in the DAG shown in Figure 4.2.
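A processing DAG of this kind can be represented as a mapping from each step to the steps it depends on.  The sketch below is illustrative only: the edges are assumed (a simple chain through the three named steps) and are not taken from Figure 4.2, and the node names are hypothetical.  It uses the standard graphlib module, available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter

# Each key is a processing step; its value is the set of steps it depends on.
# These edges are assumed for illustration and are not read from Figure 4.2.
DEPENDS_ON = {
    "navigation": {"hrpt_telemetry"},
    "calibration": {"navigation"},
    "registration": {"calibration"},
    "interim_product": {"registration"},
}


def processing_order(depends_on):
    """Return the steps in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(depends_on).static_order())


def lineage(depends_on, product):
    """Return every upstream node that the given product depends on."""
    seen = set()
    stack = [product]
    while stack:
        node = stack.pop()
        for parent in depends_on.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

The lineage function walks the edges backwards from a product, which is the same traversal needed to answer the question of which inputs and steps produced a given file.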

These standard processing steps are accomplished with Perl scripts that wrap TeraScan processing commands.  Metadata collection instructions and Lab Notebook client API commands are now included in these Perl scripts to track AVHRR processing steps in the Lab Notebook installation at UCSB.
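The wrapper pattern described here, running an unmodified command while recording what was run, can be sketched as follows.  The sketch is in Python rather than Perl, and everything in it is an assumption made for illustration: NOTEBOOK is a plain list standing in for the Lab Notebook, run_step is a hypothetical helper rather than an LN Client API command, and no real TeraScan command names are used.

```python
import subprocess
import time

# Hypothetical stand-in for Lab Notebook entries; the real system stores
# its records through the LN Client API, which is not modeled here.
NOTEBOOK = []


def run_step(name, command, inputs, outputs):
    """Run a command unchanged and record the step that was performed.

    command is a list of program arguments, as accepted by subprocess.run.
    inputs and outputs are the file names the step reads and writes.
    """
    started = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    entry = {
        "step": name,
        "command": list(command),
        "inputs": list(inputs),
        "outputs": list(outputs),
        "exit_code": result.returncode,
        "seconds": time.time() - started,
    }
    NOTEBOOK.append(entry)
    return entry
```

Because the wrapped command is passed through untouched, the processing code itself needs no changes; only the script that invokes it gains the bookkeeping call.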

A Web-based front end has been developed for custom product generation of the AVHRR data.  With a standard Web browser, a user can specify a temporal, spectral, and/or spatial subset of any of the AVHRR swaths in our archive.  The spatial and spectral subsetting information are stored in an AVHRR profile object to expedite online product ordering (Figure 4.3).
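One way to picture a stored profile is as a small record holding the requested channels and a spatial window, which can then be applied to any swath.  The following is a sketch under assumptions: SubsetProfile and apply_profile are hypothetical names, the swath is modeled as a dictionary of channel number to a two-dimensional list of pixel values, and the window is expressed in rows and columns rather than geographic coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SubsetProfile:
    """Hypothetical stored subset request: channels plus a row/column window."""

    channels: Tuple[int, ...]  # 1-based channel numbers (AVHRR has five)
    rows: Tuple[int, int]      # half-open [start, stop) range of scan lines
    cols: Tuple[int, int]      # half-open [start, stop) range of pixels


def apply_profile(
    swath: Dict[int, List[List[int]]], profile: SubsetProfile
) -> Dict[int, List[List[int]]]:
    """Return only the requested channels, cropped to the requested window."""
    r0, r1 = profile.rows
    c0, c1 = profile.cols
    return {
        channel: [row[c0:c1] for row in swath[channel][r0:r1]]
        for channel in profile.channels
    }
```

Storing the profile separately from any one swath means the same spatial and spectral selection can be reused across orders; the temporal part of a request would then only select which swaths the profile is applied to.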

Figure 4.4 shows an example of a subset AVHRR image ordered with ESSW.

4.2. Fractional Snow Cover Maps

One goal of snow research at UCSB is to provide information that will improve understanding of the Earth's hydrology, through the creation and distribution of accurate, precise snow products that require analysis of the spectral reflectance of the surface. Such products include regional fractional snow cover and snow cover albedo.

The Snow Hydrology Group at ICESS uses retrieved reflectance values from AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) [12] as input into the MEMSCAG (Multiple Endmember Snow Covered Area and Grain Size) model to produce maps of subpixel snow-covered area and other data products.  These steps are shown in the DAG in Figure 4.5.  Currently, a sample of twelve maps of fractional snow-covered area for portions of the Sierra Nevada from 1994 through 1998 is available through the UCSB ESSW installation.  Figure 4.6 shows an example of this data product for Mammoth Mountain, CA, in 1994.  These maps are available for download in IDL image and JPEG formats.

As in the case of the global solar radiation data products generated by the 2001 model, described in Section 2.3.4, the script(s) that carry out the processing steps for each MEMSCAG experiment create simple output text files that hold science object metadata in some specified format.  A separate Perl script ingests the metadata text files for each science object and includes LN Client API commands to create entries in the Lab Notebook.
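The ingest step can be sketched as a small parser followed by a call that records the result.  The paper does not specify the text format, so the sketch assumes one "name: value" pair per line; the function names, the list used as a notebook, and the sample field names are all hypothetical, and the code is in Python rather than Perl.

```python
def parse_metadata(text):
    """Parse 'name: value' lines into a dictionary.

    Blank lines and lines starting with '#' are skipped.  Only the first
    colon on a line separates the name from the value, so values may
    themselves contain colons (for example, timestamps).
    """
    metadata = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"line {line_number}: expected 'name: value'")
        metadata[name.strip()] = value.strip()
    return metadata


def ingest(text, notebook):
    """Parse one metadata file and append the result to the notebook.

    notebook is a plain list standing in for the Lab Notebook; a real
    ingest script would issue LN Client API commands at this point.
    """
    entry = parse_metadata(text)
    notebook.append(entry)
    return entry
```

Keeping the parser separate from the recording call mirrors the division in the text: the processing scripts only have to write simple text files, and a single ingest script owns the interaction with the Lab Notebook.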

5. Continuing Work

One goal of the project is to provide fully customized automated product generation as an intrinsic capability of ESSW.  Although each of the three data products discussed in this paper is currently ordered through a separate, custom Web application, we are planning to create more generic product ordering interfaces.

A future version of the Lab Notebook will include the metadata element registry, discussed in section 2.3.2, which consists of a set of LN Database tables and tools for accessing them.

6. Acknowledgements

This white paper was written with the assistance of:  Mike Colee, Debbie Donahue, Calin Duma, Erik Fields, Ben Klaas, Steve Miley, Jordan Morris, Tom Painter, Mark Pelletier, Pete Peterson, Dave Siegel, and Peter Slaughter.

7. Glossary of Terms

application programming interface (API): a collection of specialized commands created to extend the capabilities of an existing programming or scripting language

adding a metadata template: creating and submitting an XML DTD to the Lab Notebook that defines metadata elements for a particular science object

cataloging a file: submitting a file to the ND-WORM and receiving a unique identifying digest for that file

common gateway interface (CGI): a standard interface through which a Web server runs a server-side application that provides a service to a client

controlled vocabulary: a standard set of terms used in a given domain

daemon: a background computing process that “listens” for some particular computing event; the event that is “heard” by the daemon can then trigger other computing events

digest: a fixed-length string computed from the contents of a file by a function such as MD5; used as an effectively unique identifier for the file

directed acyclic graph (DAG): a graph is a collection of nodes (points or boxes) and edges (connector lines) often used in computer science to describe a sequence of computing events; a directed graph flows in a particular direction; a directed acyclic graph does not “double back” on itself

document type definition (DTD): a “template” document used to define the elements allowed in a particular type of an XML document

element registry: a set of tables in the LN Database that tracks metadata elements defined by researchers

element set: a set of related metadata elements

experiment: one particular instance, or execution, of a science model; consists of one or more experiment steps

experiment step: a computational process with inputs and outputs within an experiment

lineage: the sequence of processing steps leading to the creation of some data product; can also be thought of as the processing history of a product

metadata element: an XML tag that serves to uniquely identify a data set

metadata ingest: the execution of a script that imports or reads the contents of a custom format text file

Perl: Practical Extraction and Report Language, a widely used, freely available scripting language

RAID: Redundant Array of Independent Disks, a disk storage technology

registering inputs and outputs: specifying existing science objects that serve as inputs or outputs to another science object

science object: processing models, experiments, experiment steps, inputs, and outputs

science workflow: the execution of a sequence of experiments or experiment steps

wrapper: a script used to execute some base code without modifying it

XML: eXtensible Markup Language; a standard, or meta-language, that allows user-defined special instructions, or tags, within a text document; this standard can be used to create text documents structured in a custom format, which may be used for data transfer

8. References

[1] National Research Council, Global Environmental Change: Research Pathways for the Next Decade. Washington, DC: National Academy Press, 1999.

[2] "XML Parser for Java: another alphaWorks technology,"  2000, IBM, <http://www.alphaworks.ibm.com/tech/xml4j> (20 Nov 2000).

[3] "MySQL,"  2001, MySQL AB, <http://www.mysql.com> (9 Feb 2001).

[4] "Informix Corporation,"  2001, Informix Corporation, <http://www.informix.com> (9 Feb 2001).

[5] C. Gautier, G. Diak, and S. Masse, "A simple physical model to estimate incident solar radiation at the surface from GOES satellite data," Journal of Applied Meteorology, vol. 19, pp. 1005-1012, 1980.

[6] G. R. Diak and C. Gautier, "Improvements to a simple physical model for estimating insolation from GOES data," Journal of Climate and Applied Meteorology, vol. 22, pp. 505-508, 1983.

[7] R. Frouin, D. Lingner, and C. Gautier, "A simple analytical formula to compute clear sky total and photosynthetically available solar irradiance at the ocean surface," Journal of Geophysical Research, vol. 94(C7), pp. 9731-9742, 1989.

[8] C. Gautier and M. Landsfeld, "Surface solar radiation flux and cloud radiative forcing for the Atmospheric Radiation Measurement (ARM) Southern Great Plains (SGP): a satellite, surface observations, and radiative transfer model study," Journal of Atmospheric Science, 1997.

[9] R. Rivest, "The MD5 Message Digest Algorithm," RFC 1321, MIT Laboratory for Computer Science, Cambridge, MA, April 1992. Available at <http://www.freesoft.org/CIE/RFC/1321/index.htm>

[10] J. Radajewski and D. Eadline, "Beowulf HOWTO,"  1998, <http://beowulf-underground.org/doc_project/HOWTO/english/Beowulf-HOWTO.html> (18 Nov 2000).

[11] K. Kidwell, "NOAA Polar Orbiter Data User's Guide," U.S. Department of Commerce, Washington DC, 1991.

[12] "AVIRIS Home Page,"  2001, California Institute of Technology, <http://makalu.jpl.nasa.gov/> (21 Apr 2001).

 

