# Filter usage
The following sections only show snippets of commands, as there are quite a number of filters available.
## Spectral filters

* `center` - subtracts the column mean from the columns (batch filter)
* `downsample` - extracts every n-th wave number
* `equi-distance` - evenly spaces the wave numbers
* `log` - log-transforms the amplitudes
* `pca` - applies principal components analysis for dimensionality reduction
* `pls1` - applies the PLS1 partial-least-squares algorithm (batch filter)
* `rownorm` (aka `standard-normal-variate`) - subtracts mean and divides by standard deviation (illustrated below)
* `savitzky-golay` and `savitzky-golay2` - the Savitzky-Golay smoothing algorithm
* `simpls` - applies the SIMPLS partial-least-squares algorithm (batch filter)
* `standardize` - column-wise subtracts the column mean and divides by the column stdev (batch filter)
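As an illustration of what the `rownorm`/`standard-normal-variate` transform computes, here is a minimal NumPy sketch (conceptual only, not how `sdc-convert` implements it):

```python
import numpy as np

def standard_normal_variate(spectrum: np.ndarray) -> np.ndarray:
    """Subtract the spectrum's mean amplitude and divide by its standard deviation."""
    return (spectrum - spectrum.mean()) / spectrum.std()

# dummy amplitudes of a single spectrum
amplitudes = np.array([0.12, 0.15, 0.31, 0.28, 0.22])
print(standard_normal_variate(amplitudes))  # result has zero mean and unit standard deviation
```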
Applying PLS1:
```bash
sdc-convert -l INFO -b \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
pls1 \
-l INFO \
-n 4 \
-r al.ext_usda.a1056_mg.kg \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
```
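Conceptually, partial least squares learns a projection of the spectra that is maximally correlated with a reference value (here presumably the `al.ext_usda.a1056_mg.kg` meta-data field, selected via `-r`) and keeps a fixed number of components (presumably `-n 4`). The following scikit-learn sketch shows the same idea on dummy data; scikit-learn's `PLSRegression` (NIPALS) is not necessarily the exact PLS1 variant the filter uses:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# dummy batch: 20 spectra with 100 wave numbers each, plus reference values
rng = np.random.default_rng(1)
X = rng.random((20, 100))   # amplitudes
y = rng.random(20)          # e.g. al.ext_usda.a1056_mg.kg measurements

pls = PLSRegression(n_components=4)   # analogous to -n 4
pls.fit(X, y)
scores = pls.transform(X)             # spectra reduced to 4 PLS components
print(scores.shape)                   # (20, 4)
```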
Using Savitzky-Golay:
```bash
sdc-convert -l INFO \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
savitzky-golay \
-l INFO \
-L 3 \
-R 5 \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
```
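Savitzky-Golay smoothing fits a low-order polynomial to a sliding window of points and replaces each amplitude with the fitted value; `-L 3` and `-R 5` presumably set the number of points to the left and right of the current point. A SciPy sketch of the same kind of smoothing (SciPy only supports symmetric windows, so 3 left + centre + 5 right is approximated here with a 9-point window):

```python
import numpy as np
from scipy.signal import savgol_filter

# noisy synthetic "spectrum"
x = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(x) + np.random.default_rng(0).normal(scale=0.1, size=x.size)

# fit a 2nd-order polynomial within each 9-point window and keep the fitted centre value
smoothed = savgol_filter(noisy, window_length=9, polyorder=2)
```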
## Meta-data management

* `metadata` - allows comparisons on meta-data values and whether to keep or discard a record in case of a match
* `metadata-from-name` - allows extraction of a meta-data value from the spectrum name via a regular expression (see the sketch below)
* `metadata-to-placeholder` - turns meta-data into placeholders, which can be used for redirecting output of writers
* `split-records` - adds the field `split` to the meta-data of the record passing through, which can be acted on with other filters (or stored in the output)
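What `metadata-from-name` does conceptually, shown with Python's `re` module; the naming scheme and group names below are hypothetical and only serve as illustration:

```python
import re

# hypothetical spectrum name encoding a sample ID and a repeat number
name = "sample-1234_rep2.spec"

match = re.match(r"sample-(?P<sample_id>\d+)_rep(?P<repeat>\d+)", name)
if match:
    metadata = match.groupdict()
    print(metadata)   # {'sample_id': '1234', 'repeat': '2'}
```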
Splitting data into train/test and training `center` on the `train` batch:
```bash
sdc-convert -l INFO -b \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
split-records \
--split_names train test \
--split_ratios 50 50 \
center \
-l INFO \
-k split \
--batch_order train test \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
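Training `center` on the `train` batch means the column means are computed from the training spectra only and then subtracted from both splits, so no information from the test data leaks into the pre-processing. A scikit-learn sketch of that pattern on dummy data (an assumption about the filter's behaviour, not the tool's implementation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.random((10, 50))   # 10 training spectra, 50 wave numbers
test = rng.random((10, 50))    # 10 test spectra

# centering only: subtract column means learned from the training batch
center = StandardScaler(with_mean=True, with_std=False)
train_centered = center.fit_transform(train)   # column means computed here
test_centered = center.transform(test)         # same means re-used on the test batch
```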
## Record management

A number of generic record management filters are available:

* `check-duplicate-filenames` - when using multiple batches as input, duplicate file names can be an issue when creating a combined output
* `discard-by-name` - discards spectra based on their name, either using explicit names or regular expressions, e.g., excluding quality control samples
* `max-records` - limits the number of records passing through
* `randomize-records` - when processing batches, this filter can randomize them (seeded or unseeded)
* `record-window` - only lets a certain window of records pass through (e.g., the first 1000)
* `rename` - allows renaming of spectra, e.g., prefixing them with a batch number/ID
* `sample` - for selecting a random sub-sample from the stream
Randomizing records and outputting the first 100:
```bash
sdc-convert -l INFO \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
randomize-records \
-l INFO \
-s 1 \
max-records \
-l INFO \
-m 100 \
to-adams \
-l INFO \
-o {CWD}/test/adams/output/ \
--output_sampledata
```
## Cleaning data

Using the `apply-cleaner` batch filter, you can clean batches of data using any of the defined cleaners. The following applies the IQR cleaner to remove spectra whose amplitudes are considered outliers:
```bash
sdc-convert -l INFO -b \
from-adams \
-l INFO \
-i {CWD}/input/*.spec \
apply-cleaner \
-l INFO \
-c "iqr-cl -l INFO -f 4.25" \
to-adams \
-l INFO \
-o {CWD}/output/ \
--output_sampledata
```
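An IQR-based cleaner flags records falling outside `[Q1 - factor*IQR, Q3 + factor*IQR]`, with the factor here set via `-f 4.25`. The NumPy sketch below shows that rule conceptually; how the actual cleaner summarises each spectrum is not documented here, so reducing each spectrum to its maximum amplitude is purely an assumption for illustration:

```python
import numpy as np

def iqr_outliers(values: np.ndarray, factor: float = 4.25) -> np.ndarray:
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)

# e.g., one summary value per spectrum in the batch (dummy data)
max_amplitudes = np.array([1.02, 0.98, 1.05, 0.99, 7.50, 1.01])
print(iqr_outliers(max_amplitudes))   # only the 7.50 spectrum is flagged
```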