Cross is a library providing core functionality for dependency injection, workflow definition, validation, and execution, as well as efficient data structures for sequential and parallel data processing. It is used by Maltcms and Maui in the application domain of chromatography-mass spectrometry data from analytical chemistry, metabolomics, and proteomics.
Cross is implemented in the platform-independent Java programming language. The individual modules of Cross are managed and built using Maven.
If you want to use Cross in your own projects, please see the Getting Started page for more details. Cross is dual-licensed under either the GNU Lesser General Public License (LGPL v3) or, at the licensee's discretion, the terms of the Eclipse Public License (EPL v1).
A workflow in Cross is made up of a sequence of fragment command objects that use file fragments as their input and output type. The number of input and output file fragments processed by a fragment command can differ, allowing map-reduce-like processing schemes or, more generally, schemes with differing or equal arities. The basic configuration of all workflow elements is performed using a Spring Application Context with Spring Beans-based XML configuration, supplemented by runtime properties.
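A Spring-based workflow wiring might look roughly as follows; the bean ids and class names below are invented for illustration and are not the actual Cross classes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical workflow configuration sketch: a workflow bean holding
     an ordered list of fragment command beans. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="preprocessing" class="com.example.PreprocessingCommand"/>
    <bean id="alignment" class="com.example.AlignmentCommand"/>

    <bean id="workflow" class="com.example.Workflow">
        <property name="commands">
            <list>
                <ref bean="preprocessing"/>
                <ref bean="alignment"/>
            </list>
        </property>
    </bean>
</beans>
```

Runtime properties can then override individual bean properties without editing the XML.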
Cross allows fragment commands to define their required variable fragments by adding class-level annotations. Additionally, fragment commands may define which variable fragments they provide. Thus, Cross can validate the accessibility of all variables required by a workflow before the workflow is actually executed. This helps avoid running computationally expensive workflows on invalid data.
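The mechanism can be sketched as follows: class-level annotations declare required and provided variables, and a validator walks the command sequence via reflection before execution. The annotation and class names here are illustrative, not the actual Cross API:

```java
import java.lang.annotation.*;
import java.util.*;

// Hypothetical annotations mirroring the mechanism described above.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface RequiresVariables { String[] names(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ProvidesVariables { String[] names(); }

// Example fragment command declaring its variable contract.
@RequiresVariables(names = {"mass_values"})
@ProvidesVariables(names = {"binned_mass_values"})
class BinningCommand { }

public class WorkflowValidator {
    // Checks that every required variable is either part of the initial
    // input or provided by an earlier command in the sequence.
    public static boolean validate(List<Class<?>> commands, Set<String> available) {
        Set<String> known = new HashSet<>(available);
        for (Class<?> c : commands) {
            RequiresVariables req = c.getAnnotation(RequiresVariables.class);
            if (req != null && !known.containsAll(Arrays.asList(req.names()))) {
                return false; // a required variable would be missing at runtime
            }
            ProvidesVariables prov = c.getAnnotation(ProvidesVariables.class);
            if (prov != null) {
                known.addAll(Arrays.asList(prov.names()));
            }
        }
        return true;
    }
}
```

Because validation only needs the annotations, no data is read and no command is executed before the check passes.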
Monitoring and Transformation
A workflow monitors the fragment commands it executes and notifies registered listeners of various workflow-related events. These include the creation of primary and secondary processing results, as well as general progress information. A workflow logs all completed tasks and their results in a unique (depending on configuration) output directory that is self-contained except for the initial input data. This output directory contains all information necessary to re-run the workflow with the exact same parameters and conditions. Workflows in Cross are therefore self-descriptive and repeatable.
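The notification mechanism follows the standard observer pattern; a minimal sketch, with listener and event names invented for illustration:

```java
import java.util.*;

// Illustrative listener interface; the actual Cross event API differs.
interface WorkflowListener {
    void onEvent(String event);
}

public class Workflow {
    private final List<WorkflowListener> listeners = new ArrayList<>();
    private final List<String> log = new ArrayList<>();

    public void addListener(WorkflowListener l) {
        listeners.add(l);
    }

    // Records a completed task in the workflow log and notifies
    // all registered listeners of the event.
    public void runTask(String name) {
        String event = name + " completed";
        log.add(event);
        for (WorkflowListener l : listeners) {
            l.onEvent(event);
        }
    }

    public List<String> getLog() {
        return log;
    }
}
```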
Efficient Data Structures
A file fragment is an aggregation of variable fragment objects, defined by a storage location URI. File fragment objects may reference an arbitrary number of source files, thereby allowing virtual aggregation of processing result variables of previous fragment commands. Shadowing allows file fragments to hide the existence of an upstream variable of the same name from downstream file fragments. DataSource implementations allow different URI extensions to be handled, so that file fragment objects can exist as simple files on disk or within a distributed database system. Custom implementations provide the mapping from binary or textual storage formats to the variable-based abstraction used by Cross.
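Variable lookup through source files, including shadowing, can be sketched like this; the class and method names are simplified stand-ins for the actual Cross types:

```java
import java.util.*;

// Simplified sketch: a fragment resolves a variable locally first,
// then searches its referenced source files recursively.
public class FileFragment {
    private final Map<String, String> variables = new HashMap<>();
    private final List<FileFragment> sourceFiles = new ArrayList<>();

    public void addSourceFile(FileFragment f) {
        sourceFiles.add(f);
    }

    public void addVariable(String name, String value) {
        variables.put(name, value);
    }

    // A variable defined locally shadows any upstream variable
    // of the same name; otherwise the source-file chain is searched.
    public String getVariable(String name) {
        if (variables.containsKey(name)) {
            return variables.get(name);
        }
        for (FileFragment f : sourceFiles) {
            String v = f.getVariable(name);
            if (v != null) {
                return v;
            }
        }
        return null;
    }
}
```

In this scheme, a downstream fragment command sees the redefined value of a shadowed variable, while the original value remains untouched in the upstream fragment.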
Caching of Intermediate Results
File fragments have access to a user-definable caching implementation. Currently, Ehcache (in memory and on disk), a volatile in-memory weak-reference hash map, and a non-volatile hash map cache are available. Caches based on Ehcache can be either volatile or session-persistent, depending on the use case. Other cache implementations can be plugged in through the same mechanism.
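A volatile in-memory cache of this kind can be sketched with reference objects from `java.lang.ref`; this uses soft references rather than the exact reference type Cross employs, and the class name is invented:

```java
import java.lang.ref.SoftReference;
import java.util.*;

// Sketch of a volatile in-memory cache: entries may be reclaimed by the
// garbage collector under memory pressure and recomputed on demand.
public class VolatileCache<K, V> {
    private final Map<K, SoftReference<V>> store = new HashMap<>();

    public void put(K key, V value) {
        store.put(key, new SoftReference<>(value));
    }

    // Returns null if the entry was never cached or has been reclaimed.
    public V get(K key) {
        SoftReference<V> ref = store.get(key);
        return ref == null ? null : ref.get();
    }
}
```

A non-volatile variant would simply hold strong references, trading memory for guaranteed availability of intermediate results.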
Controlled Vocabularies
Cross variables have simple String-based names. However, the same variable name can have different meanings in different contexts. Cross therefore supports namespaced controlled vocabularies for specific domains that translate a variable placeholder name to the actual, CV-backed clear name. The CV system also supports deprecation of terms, allowing the vocabulary to evolve.
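The translation with deprecation support can be sketched as a lookup that follows redirects from deprecated placeholders to their current replacements; all names here are invented for illustration:

```java
import java.util.*;

// Sketch of a controlled-vocabulary lookup: placeholders map to clear
// names, and deprecated placeholders redirect to their successors.
public class ControlledVocabulary {
    private final Map<String, String> terms = new HashMap<>();
    private final Map<String, String> deprecated = new HashMap<>();

    public void define(String placeholder, String clearName) {
        terms.put(placeholder, clearName);
    }

    public void deprecate(String oldPlaceholder, String replacement) {
        deprecated.put(oldPlaceholder, replacement);
    }

    // Follows deprecation redirects, then resolves the current placeholder
    // to its clear name; returns null for unknown placeholders.
    public String resolve(String placeholder) {
        while (deprecated.containsKey(placeholder)) {
            placeholder = deprecated.get(placeholder);
        }
        return terms.get(placeholder);
    }
}
```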
Parallel Processing
Cross uses the Mpaxs API for transparent parallelization of Runnable and Callable tasks, either within the local virtual machine or on remote machines coordinated through remote method invocation (RMI). Mpaxs provides a standard Executor- and Future-compatible implementation to allow easy scale-out of parallel jobs. Scaling the degree of parallelization up and down can be managed automatically by Mpaxs, for example using its Open Grid Engine (Oracle Grid Engine) compliant compute host launcher implementation. Mpaxs uses round-robin scheduling to utilize all available hosts as fairly as possible.
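Because Mpaxs exposes standard Executor- and Future-compatible interfaces, client code can be written against `java.util.concurrent` and later scaled out without change. In this sketch a plain local thread pool stands in for an Mpaxs-provided executor:

```java
import java.util.*;
import java.util.concurrent.*;

// Scatter-gather over Callable tasks via the standard ExecutorService /
// Future interfaces; a local fixed thread pool stands in here for a
// (hypothetical) Mpaxs-backed executor handling remote hosts.
public class ParallelSum {
    public static long sum(int[][] chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Submit one Callable per chunk; each Future holds a partial sum.
            List<Future<Long>> futures = new ArrayList<>();
            for (int[] chunk : chunks) {
                futures.add(pool.submit(() -> {
                    long s = 0;
                    for (int v : chunk) s += v;
                    return s;
                }));
            }
            // Gather the partial results.
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Swapping the thread pool for a distributed executor changes where the tasks run, not how the client code is written.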