PrIMe Data Model

1. INTRODUCTION

This is a brief description of the current PrIMe Data Model (PDM); fuller documentation is forthcoming [1]. The reader should view the current PDM as an initial attempt at organizing data in support of the archival, curation, and use of data for chemical kinetics models; further development is anticipated with the involvement of the community.

2. PRINCIPAL CONCEPTS AND REQUIREMENTS OF DATA ORGANIZATION

2.1. Depository and Library. The PrIMe Data Warehouse consists, conceptually, of two parts: the Depository and the Library. The Depository is a repository of data provided by the community. The only requirement for accepting data into the Depository is completeness of the data record, which is checked electronically using data schemas. The Library is a set of data evaluated by the PrIMe Work Groups (PWG). A set of primekinetics codes is designed to assist in both the initial submission and the PWG evaluation of the data.

2.2. Single Record for Every Entity: No Data Duplication. There are several aspects to this requirement. First, the same property value is never duplicated. Suppose the enthalpy of formation of a species (such as OH) were allowed to appear in numerous records. When that value is revised in a new study, every record storing it would have to be updated, which invites omissions and hence inconsistency in later use of the data. PDM stores every entity value once and only once, with all other listings referring to the original record. In this way, updating the value requires updating just that single record.

Second, PDM distinguishes between primary and derived properties. For instance, PrIMe species records do not contain the species molecular weight; it is calculated on the fly from atomic masses when needed. Thus, as a general rule, PDM strives to record primary properties and to delegate the evaluation of derived properties to the data management software.
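The distinction between primary and derived properties can be illustrated with a minimal sketch. The record layout and the names used here (ATOMIC_MASS, oh_record, molecular_weight) are assumptions made for illustration only; they are not the PrIMe schema or the primekinetics code. The atomic masses are standard values in g/mol.

```python
# Illustrative sketch only: the record layout and all names below are
# assumptions, not the actual PrIMe schema or primekinetics software.

# Primary data: atomic masses (g/mol), stored once per element.
ATOMIC_MASS = {"H": 1.008, "O": 15.999, "Ar": 39.948}

# A hypothetical species record holds only primary data: its composition.
oh_record = {"name": "OH", "composition": {"O": 1, "H": 1}}


def molecular_weight(species, atomic_mass=ATOMIC_MASS):
    """Derive the molecular weight (g/mol) on demand from the composition."""
    return sum(atomic_mass[element] * count
               for element, count in species["composition"].items())


mw_oh = molecular_weight(oh_record)  # computed when needed, never stored
```

Because the molecular weight is never stored, a revised atomic mass would need to be changed in one place only, and every derived value would follow automatically.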
Finally, the above considerations argue for a single (virtual) Depository overall.

2.3. Preservation of Data Records. The primekinetics data management software is designed to capture and store the metadata associated with every data submission and data action (such as who performed it and when). No PrIMe Data Warehouse record (file) will be erased or modified. Instead, when a record is changed, a new record will be created and the old one moved to the data attic.

3. PrIMe DATA ORGANIZATION

3.1. PrIMe Index. Every PrIMe data entity is assigned a unique index, the primeID. This was necessitated by the fact that, even for chemical species, the suggested "unique identifiers" (such as CAS numbers and InChI strings) turned out to be not fully unique for some species, and for other species they have not yet been developed. The primeID is a string composed of a sequence of one to three lowercase letters, unique to a data record type ("s" for species, "e" for chemical element, "rk" for reaction rate coefficient), followed by an eight-digit sequence. For example, the species data record for the argon atom has primeID s00000049.

3.2. Data Collections. The PrIMe Warehouse is composed of data collections. Presently there are the following data collections:
Each of the above collections is subdivided into catalog and data collections (directories, folders). The catalog collections contain XML files defining the corresponding entity; e.g., a catalog file in elements defines a chemical element, a species/catalog file defines a chemical species, etc. The data subdirectory is further subdivided into individual directories corresponding to the individual catalog XML files. These individual directories are populated with additional data files, as needed. This organization is illustrated with the following example.

3.2.1. Species Collection. The following figure is a schematic diagram of the PrIMe Warehouse species collection.
Figure 1. A schematic diagram of the PrIMe Warehouse species collection.

The catalog files are named with the letter s followed by an 8-digit number; this name is the species primeID (such as s00000001 or s00001234). The same primeID is used to name the corresponding subdirectories of species/data. These subdirectories contain data files, which can be of any type, as illustrated in the diagram.

3.3. Data Definition Files. The XML files of the catalog collections and of some data subdirectories will be validated against schemas. XML schemas express shared vocabularies and allow machines to carry out rules made by people. They provide a means for defining the structure, content, and semantics of XML documents. Passing XML validation means that the submitted or created XML documents adhere to the scientific rules set by the PrIMe community.

4. REFERENCES