# Filter usage
The following sections only show snippets of commands, as there are quite a number of filters available.
## Spectral filters

* `center` - subtracts the column mean from the columns (batch filter)
* `downsample` - extracts every n-th wave number
* `equi-distance` - evenly spaces the wave numbers
* `log` - log-transforms the amplitudes
* `pca` - applies principal components analysis for dimensionality reduction
* `pls1` - applies the PLS1 partial-least-squares algorithm (batch filter)
* `rownorm` (aka `standard-normal-variate`) - subtracts mean and divides by standard deviation (illustrated below)
* `savitzky-golay` and `savitzky-golay2` - the Savitzky-Golay smoothing algorithm
* `simpls` - applies the SIMPLS partial-least-squares algorithm (batch filter)
* `standardize` - column-wise subtracts the column mean and divides by the column stdev (batch filter)
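As an illustration of what the `rownorm`/`standard-normal-variate` transform computes, here is a minimal NumPy sketch (conceptual only, not how `sdc-convert` implements it):

```python
import numpy as np

def standard_normal_variate(spectrum: np.ndarray) -> np.ndarray:
    """Subtract the spectrum's mean amplitude and divide by its standard deviation."""
    return (spectrum - spectrum.mean()) / spectrum.std()

# dummy amplitudes of a single spectrum
amplitudes = np.array([0.12, 0.15, 0.31, 0.28, 0.22])
print(standard_normal_variate(amplitudes))  # result has zero mean and unit standard deviation
```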
Applying PLS1:
```bash
sdc-convert -l INFO -b \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
pls1 \
-l INFO \
-n 4 \
-r al.ext_usda.a1056_mg.kg \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
```
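Conceptually, partial least squares learns a projection of the spectra that is maximally correlated with a reference value (here presumably the `al.ext_usda.a1056_mg.kg` meta-data field, selected via `-r`) and keeps a fixed number of components (presumably `-n 4`). The following scikit-learn sketch shows the same idea on dummy data; scikit-learn's `PLSRegression` (NIPALS) is not necessarily the exact PLS1 variant the filter uses:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# dummy batch: 20 spectra with 100 wave numbers each, plus reference values
rng = np.random.default_rng(1)
X = rng.random((20, 100))   # amplitudes
y = rng.random(20)          # e.g. al.ext_usda.a1056_mg.kg measurements

pls = PLSRegression(n_components=4)   # analogous to -n 4
pls.fit(X, y)
scores = pls.transform(X)             # spectra reduced to 4 PLS components
print(scores.shape)                   # (20, 4)
```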
Using Savitzky-Golay:
```bash
sdc-convert -l INFO \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
savitzky-golay \
-l INFO \
-L 3 \
-R 5 \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
```
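Savitzky-Golay smoothing fits a low-order polynomial to a sliding window of points and replaces each amplitude with the fitted value; `-L 3` and `-R 5` presumably set the number of points to the left and right of the current point. A SciPy sketch of the same kind of smoothing (SciPy only supports symmetric windows, so 3 left + centre + 5 right is approximated here with a 9-point window):

```python
import numpy as np
from scipy.signal import savgol_filter

# noisy synthetic "spectrum"
x = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(x) + np.random.default_rng(0).normal(scale=0.1, size=x.size)

# fit a 2nd-order polynomial within each 9-point window and keep the fitted centre value
smoothed = savgol_filter(noisy, window_length=9, polyorder=2)
```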
## Meta-data management

* `metadata` - allows comparisons on meta-data values and whether to keep or discard a record in case of a match
* `metadata-from-name` - allows extraction of a meta-data value from the spectrum name via a regular expression (see the sketch below)
* `metadata-to-placeholder` - turns meta-data into placeholders, which can be used for redirecting output of writers
* `split-records` - adds the field `split` to the meta-data of the record passing through, which can be acted on with other filters (or stored in the output)
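What `metadata-from-name` does conceptually, shown with Python's `re` module; the naming scheme and group names below are hypothetical and only serve as illustration:

```python
import re

# hypothetical spectrum name encoding a sample ID and a repeat number
name = "sample-1234_rep2.spec"

match = re.match(r"sample-(?P<sample_id>\d+)_rep(?P<repeat>\d+)", name)
if match:
    metadata = match.groupdict()
    print(metadata)   # {'sample_id': '1234', 'repeat': '2'}
```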
Splitting data into train/test and training `center` on the `train` batch:
```bash
sdc-convert -l INFO -b \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
split-records \
--split_names train test \
--split_ratios 50 50 \
center \
-l INFO \
-k split \
--batch_order train test \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
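Training `center` on the `train` batch means the column means are computed from the training spectra only and then subtracted from both splits, so no information from the test data leaks into the pre-processing. A scikit-learn sketch of that pattern on dummy data (an assumption about the filter's behaviour, not the tool's implementation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.random((10, 50))   # 10 training spectra, 50 wave numbers
test = rng.random((10, 50))    # 10 test spectra

# centering only: subtract column means learned from the training batch
center = StandardScaler(with_mean=True, with_std=False)
train_centered = center.fit_transform(train)   # column means computed here
test_centered = center.transform(test)         # same means re-used on the test batch
```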
## Record management

A number of generic record management filters are available:

* `check-duplicate-filenames` - when using multiple batches as input, duplicate file names can be an issue when creating a combined output
* `discard-by-name` - discards spectra based on their name, either using explicit names or regular expressions, e.g., excluding quality control samples
* `max-records` - limits the number of records passing through
* `randomize-records` - when processing batches, this filter can randomize them (seeded or unseeded)
* `record-window` - only lets a certain window of records pass through (e.g., the first 1000)
* `rename` - allows renaming of spectra, e.g., prefixing them with a batch number/ID
* `sample` - for selecting a random sub-sample from the stream
Randomizing records and outputting the first 100:
```bash
sdc-convert -l INFO \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
randomize-records \
-l INFO \
-s 1 \
max-records \
-l INFO \
-m 100 \
to-adams \
-l INFO \
-o {CWD}/test/adams/output/ \
--output_sampledata
```
## Cleaning data

Using the `apply-cleaner` batch filter, you can clean batches of data using any of the defined cleaners. The following applies the IQR cleaner to remove spectra whose amplitudes are considered outliers:
```bash
sdc-convert -l INFO -b \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
apply-cleaner \
-l INFO \
-c "iqr-cl -l INFO -f 4.25" \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
```
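An IQR-based cleaner flags records falling outside `[Q1 - factor*IQR, Q3 + factor*IQR]`, with the factor here set via `-f 4.25`. The NumPy sketch below shows that rule conceptually; how the actual cleaner summarises each spectrum is not documented here, so reducing each spectrum to its maximum amplitude is purely an assumption for illustration:

```python
import numpy as np

def iqr_outliers(values: np.ndarray, factor: float = 4.25) -> np.ndarray:
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)

# e.g., one summary value per spectrum in the batch (dummy data)
max_amplitudes = np.array([1.02, 0.98, 1.05, 0.99, 7.50, 1.01])
print(iqr_outliers(max_amplitudes))   # only the 7.50 spectrum is flagged
```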