DaMiRseq - Data Mining for RNA-seq data: normalization, feature selection
and classification
The DaMiRseq package offers a tidy pipeline of data mining
procedures to identify transcriptional biomarkers and exploit
them for both binary and multi-class classification purposes.
The package accepts any kind of data presented as a table of
raw counts and allows the inclusion of both continuous and
factorial variables that accompany the experimental setting. A
series of functions enables the user to clean up the data by
filtering genomic features and samples, to adjust the data by
identifying and removing unwanted sources of variation (i.e.
batches and confounding factors) and to select the best
predictors for modeling. Finally, a "stacking" ensemble learning
technique is
applied to build a robust classification model. Every step
includes a checkpoint that the user may exploit to assess the
effects of data management by looking at diagnostic plots, such
as clustering and heatmaps, RLE boxplots, and MDS or correlation
plots.
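
As an illustration of one of the diagnostics mentioned above, the
values behind an RLE (Relative Log Expression) boxplot can be
sketched as follows. This is a minimal Python sketch under stated
assumptions, not the package's own implementation: the function
names, the log2 transform and the pseudocount of 1 are hypothetical
choices made for this example. For each feature, the log count of
every sample is centred on the feature's median across samples;
samples whose centred values sit far from zero may still carry
unwanted variation.

```python
import math
from statistics import median


def rle_values(counts, pseudocount=1.0):
    """Return per-feature deviations from the across-sample median.

    counts: list of rows, one row per feature, each row holding the
    raw counts of that feature in every sample (hypothetical layout).
    """
    rle = []
    for row in counts:
        # Log-transform; the pseudocount avoids log2(0).
        logs = [math.log2(c + pseudocount) for c in row]
        med = median(logs)
        # Centre each sample's value on the feature median.
        rle.append([v - med for v in logs])
    return rle


def sample_medians(rle):
    """Median deviation per sample (the centre line of each box)."""
    n_samples = len(rle[0])
    return [median(row[j] for row in rle) for j in range(n_samples)]


# Toy table: 3 features (rows) x 3 samples (columns).
toy_counts = [[0, 0, 0], [1, 3, 7], [3, 3, 15]]
toy_rle = rle_values(toy_counts)
toy_medians = sample_medians(toy_rle)
```

In this toy table the third sample has systematically higher counts,
so its median deviation is shifted away from zero while the other
two stay centred.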