Prepare datasets
Prepare your datasets with proper formats, units and documentation.
Project documentation
README file. You should document the whole history of your dataset in a central place. We propose adding a plain-text “README” file in the Doc folder of your project. Every processing step from data collection to archiving should be documented in this file, together with the date of action.
Oceanography data in NetCDF format compliant with the CF conventions should contain an “audit trail of processing operations” in the history attribute of the NetCDF file.
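A minimal sketch of how such an audit-trail entry could be composed, assuming one timestamped line per processing step, appended in chronological order. The helper name add_history_entry and the sample message are hypothetical; writing the resulting string back to the file is left to whatever NetCDF library you use.

```python
from datetime import datetime, timezone

def add_history_entry(history, message, when=None):
    """Return the history text with one timestamped line appended.

    `when` should be a timezone-aware datetime; it defaults to the current time.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry = f"{when.strftime('%Y-%m-%dT%H:%M:%SZ')} {message}"
    return f"{history}\n{entry}" if history else entry

# Hypothetical processing step, dated with the example timestamp used in this guide.
history = add_history_entry(
    "",
    "raw logger export converted to NetCDF",
    datetime(2019, 3, 21, 10, 57, tzinfo=timezone.utc),
)
print(history)  # 2019-03-21T10:57:00Z raw logger export converted to NetCDF
```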
Data format
Data for archiving shall be in an open and standard file format. The preferred format is JSON. For more specialised applications, NetCDF or CSV are also possible. Spatial vector data can be delivered in GeoJSON. ESRI shapefiles are only accepted under certain circumstances. For raster or image data, TIFF or GeoTIFF are the preferred formats. Field logs and similar documents should be converted to a common rich-text format such as PDF or HTML. If need be, they may also be scanned and saved as PDF, preferably with automatic text recognition.
All text-based formats should use UTF-8 encoding.
If a proprietary data format is used, any specialised software that is needed for reading and processing of the data must be referenced or, even better, be uploaded together with the data and the software documentation (manual).
File naming
Files should be named in a logical, descriptive and consistent way. During processing, the file names (or folder structure) should contain an id that is characteristic for the project and the data type, the date of original creation, a sequence number and, if necessary, a version number or date of last change. See this comic for an example of bad file names.
Names of archived files should be self-explanatory. An external user should be able to know what is in the file just from the file name.
To prevent errors with different software and operating systems, file names should only contain English letters, numbers, hyphen and underscore. Spaces should be replaced by underscores.
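The naming rules above can be sketched as follows. This is only an illustration: the function names, the order of the name parts, the "v" prefix for versions and the project id "troll" are assumptions, as is allowing a single dot before the file extension.

```python
import re

# Letters, digits, hyphen and underscore, plus one optional extension.
# Allowing the dot before the extension is an assumption of this sketch.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")

def is_valid_file_name(name):
    """True if the name uses only the permitted characters."""
    return ALLOWED.fullmatch(name) is not None

def build_file_name(project_id, data_type, created, sequence, version=None, ext="csv"):
    """Join the name parts with underscores and replace spaces by underscores."""
    parts = [project_id, data_type, created, f"{sequence:03d}"]
    if version is not None:
        parts.append(f"v{version}")
    name = "_".join(parts).replace(" ", "_") + "." + ext
    if not is_valid_file_name(name):
        raise ValueError(f"file name contains characters that are not allowed: {name}")
    return name

name = build_file_name("troll", "air temperature", "2019-03-21", 1, version=2)
print(name)  # troll_air_temperature_2019-03-21_001_v2.csv
print(is_valid_file_name("field log (final).pdf"))  # False
```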
Parameter names and vocabularies
It is important for the quality of the data that all data in the same series follow a consistent field naming and controlled vocabulary. Field names and table structures should be identical among all parts of the dataset. This way, they are comparable and easily archivable.
Measured values should have self-explanatory parameter names that are consistent with the common practice in the scientific community. Some examples:
Not WD, but wind_from_direction
Not T10, but air_temperature_mean_10_min
Parameter names should be explained either as part of the metadata, in a separate file called “units_parameters_vocabulary”, or as part of the data itself. For fields that contain a controlled vocabulary rather than free text, the vocabulary should be explained in the same way: as part of the metadata, in the “units_parameters_vocabulary” file, or as part of the data itself.
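As an illustration of consistent field naming, the short names from the examples above can be mapped to their descriptive names with one lookup table applied to every record. The table FIELD_NAMES, the helper rename_fields and the sample values are hypothetical.

```python
# Mapping from short field names to descriptive parameter names.
FIELD_NAMES = {
    "WD": "wind_from_direction",
    "T10": "air_temperature_mean_10_min",
}

def rename_fields(record, mapping=FIELD_NAMES):
    """Return a copy of the record with known short names replaced."""
    return {mapping.get(key, key): value for key, value in record.items()}

# Sample values are made up for the example.
record = rename_fields({"WD": 270, "T10": -3.2})
print(record)
```

Keeping the mapping in one place makes it easy to apply the same names to every part of the dataset.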
In some scientific communities, standards have been established for data models, controlled vocabularies, and metadata that should be followed. These are, for example:
Units and parameter formats
Units and parameter formats should be consistent throughout the dataset and be described in the metadata, as part of the NetCDF file, or in a separate file called “units_parameters_vocabulary”. The explanation of the column headers and the list of controlled vocabulary terms should be included with the data. All measured values must be given in SI or derived units. The precision of measurements should be stated in the metadata where applicable. For dates, times, and georeferencing, a consistent format is especially important.
Date and time shall be given in ISO 8601 and always with UTC as time zone. An example would be 2019-03-21T10:57:00Z. The “T” separates date from time and “Z” stands for UTC.
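A minimal sketch of producing this format with the Python standard library. The function name to_iso_utc and the local time zone offset are assumptions for the example; the function refuses naive datetimes because their time zone is unknown.

```python
from datetime import datetime, timedelta, timezone

def to_iso_utc(moment):
    """Format a timezone-aware datetime as ISO 8601 in UTC with a trailing Z."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: the time zone is unknown")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# 11:57 local time at UTC+01:00 is 10:57 UTC.
local = datetime(2019, 3, 21, 11, 57, tzinfo=timezone(timedelta(hours=1)))
print(to_iso_utc(local))  # 2019-03-21T10:57:00Z
```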
Durations and time resolutions shall be given in ISO 8601 periods. A duration of 3 months and 5 days would be P3M5D and a resolution of 10 minutes PT10M. Note the “T” that indicates time instead of date, so the “M” in the first example (without “T”) means “months”, whereas in the second example, it means “minutes”.
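The role of the “T” can be illustrated with a small parser. This is a sketch that handles only whole-number years, months, days, hours, minutes and seconds, not the full ISO 8601 grammar; the name parse_period is hypothetical.

```python
import re

# Date components come before "T", time components after it.
PERIOD = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)

def parse_period(text):
    """Return the components of an ISO 8601 period as a dict of integers."""
    match = PERIOD.fullmatch(text)
    parts = {}
    if match is not None and not text.endswith("T"):
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"not a supported ISO 8601 period: {text!r}")
    return parts

print(parse_period("P3M5D"))  # {'months': 3, 'days': 5}
print(parse_period("PT10M"))  # {'minutes': 10}
```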
Georeferencing must use decimal degrees with WGS84 as datum, and height must be given in meters. Southern latitudes and western longitudes are given with a negative sign. Coordinates are written as latitude, longitude: Tromsø lies at 69.682778, 18.942778, whereas Troll station lies at -72.011389, 2.535.
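If positions were recorded in degrees, minutes and seconds, they can be converted along these lines. The function name dms_to_decimal is hypothetical, and the degree-minute-second inputs below are back-calculated for illustration so that the results match the coordinates given above; rounding to six decimals is also an assumption.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    value = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative.
    if hemisphere in ("S", "W"):
        value = -value
    return round(value, 6)

tromso = (dms_to_decimal(69, 40, 58, "N"), dms_to_decimal(18, 56, 34, "E"))
troll = (dms_to_decimal(72, 0, 41, "S"), dms_to_decimal(2, 32, 6, "E"))
print(tromso)  # (69.682778, 18.942778)
print(troll)   # (-72.011389, 2.535)
```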
Data quality control
Manually registered data must be digitised to an open, preferably text-based format. Data must be checked for consistency, missing values, duplicates, and other possible errors. Unusable data should be removed. The quality of data may be documented in the metadatabase. An example of a quality control scheme is SeaDataNet.
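A minimal sketch of two of these checks, duplicates and missing values, on records held as dictionaries. The function name check_rows, the choice of required fields and the sample rows are assumptions; real checks would also cover value ranges and consistency between fields.

```python
def check_rows(rows, required):
    """Return (clean_rows, problems) for a list of dict records.

    A row is dropped from clean_rows if it repeats an earlier row exactly
    or if a required field is missing, None or empty.
    """
    seen = set()
    clean, problems = [], []
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        missing = [name for name in required if row.get(name) in (None, "")]
        if key in seen:
            problems.append((index, "duplicate"))
        elif missing:
            problems.append((index, "missing: " + ", ".join(missing)))
        else:
            clean.append(row)
        seen.add(key)
    return clean, problems

# Sample rows are made up for the example.
rows = [
    {"time": "2019-03-21T10:50:00Z", "air_temperature_mean_10_min": -3.2},
    {"time": "2019-03-21T10:50:00Z", "air_temperature_mean_10_min": -3.2},
    {"time": "2019-03-21T11:00:00Z", "air_temperature_mean_10_min": None},
]
clean, problems = check_rows(rows, ["time", "air_temperature_mean_10_min"])
print(len(clean))  # 1
print(problems)    # [(1, 'duplicate'), (2, 'missing: air_temperature_mean_10_min')]
```

Returning the list of problems alongside the cleaned rows makes it possible to record what was removed in the README file.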