Title: | Ensemble Models of Rank-Based Trees with Extracted Decision Rules |
---|---|
Description: | Fast computation of an ensemble of rank-based trees via boosting or random forests on binary and multi-class problems. It converts continuous gene expression profiles into ranked gene pairs, for which variable importance indices are computed and adopted for dimension reduction. Decision rules can be extracted from the trees. |
Authors: | Ruijie Yin [aut],
Chen Ye [aut],
Min Lu [aut, cre] |
Maintainer: | Min Lu <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.21 |
Built: | 2025-03-12 06:02:41 UTC |
Source: | https://github.com/transbioinfolab/ranktreeensemble |
Extract rules from a random forest (rfsrc) object
extract.rules(object, subtrees = 5, treedepth = 2, digit = 2, pairs = TRUE)
object |
A random forest object, such as one grown by rforest or by rfsrc from the randomForestSRC package. |
subtrees |
Number of trees from which to extract rules. |
treedepth |
Maximum tree depth. The larger the number, the longer the extracted rules. |
digit |
Number of digits displayed in the extracted rules. |
pairs |
Are variables in object binary ranked pairs converted via the pair function? Set pairs = FALSE for a forest grown on the original (unpaired) predictors. |
rule |
Interpretable extracted rules. Note that the displayed performance score can be inaccurate when computed from few samples. |
rule.raw |
Rules directly extracted from trees for prediction purposes. |
data |
Data used to grow the trees, carried over from the argument object. |
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
obj <- rforest(subtype~., data = tnbc[1:100, c(1:5, 337)])
objr <- extract.rules(obj)
objr$rule[, 1:3]

#### extract rules from a regular random forest
library(randomForestSRC)
obj2 <- rfsrc(subtype~., data = tnbc[1:100, c(1:5, 337)])
objr2 <- extract.rules(obj2, pairs = FALSE)
objr2$rule[, 1:3]
The function computes the variable importance of each predictor from a rank-based random forest or boosting model. A higher value indicates a more important predictor. The random forest implementation is provided by the function vimp, directly imported from the randomForestSRC package; use the command package?randomForestSRC for more information. The boosting implementation is provided by the function relative.influence, directly imported from the gbm package. For technical details, see the vignette: utils::browseVignettes("gbm").
importance(object, ...)
object |
An object of class (rfsrc, grow) returned by rforest, or of class gbm returned by rboost. |
... |
Further arguments passed to or from other methods. |
For the boosting model, a vector of variable importance values is given. For the random forest model, a matrix of variable importance values is given: the first column contains the variable importance index across all class labels, followed by one column for each class label.
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
######################################################
# Random Forest
######################################################
obj <- rforest(subtype~., data = tnbc[, c(1:10, 337)])
importance(obj)
######################################################
# Boosting
######################################################
obj <- rboost(subtype~., data = tnbc[, c(1:10, 337)])
importance(obj)
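Because the forest object returned by rforest inherits class (rfsrc, grow), a roughly equivalent call through the upstream package is possible. This is a sketch only, relying on the vimp import named above; the output layout may differ from importance.

library(ranktreeEnsemble)
library(randomForestSRC)
data(tnbc)
obj <- rforest(subtype~., data = tnbc[, c(1:10, 337)])
# importance(obj) wraps randomForestSRC::vimp for forests; calling vimp
# directly on the grown forest should yield comparable indices.
vimp(obj)$importance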
The function transforms a dataset with continuous predictors into binary predictors of ranked pairs.
pair(data, yvar.name = NULL)
data |
A data frame containing continuous predictors and, optionally, the outcome variable. |
yvar.name |
The column name of the outcome (dependent) variable in data, if included. |
A data frame with the transformed data. The dependent variable is moved to the last column of the data.
The function is efficiently coded in C++.
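As a conceptual illustration of the ranked-pair encoding, the following plain-R sketch (not the package's C++ implementation; pair_sketch is a hypothetical helper, and the comparison direction used internally by pair may differ) builds the indicator x_i < x_j for every pair of columns:

# Hypothetical helper mimicking the ranked-pair encoding: for each pair of
# columns (i, j) with i < j, emit the binary indicator x_i < x_j.
pair_sketch <- function(X) {
  idx <- combn(ncol(X), 2)
  out <- apply(idx, 2, function(k) as.numeric(X[, k[1]] < X[, k[2]]))
  colnames(out) <- apply(idx, 2, function(k)
    paste(colnames(X)[k], collapse = "<"))
  as.data.frame(out)
}
data(tnbc)
pair_sketch(tnbc[101:105, 1:3])  # compare with pair(tnbc[101:105, 1:3])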
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
datp <- pair(tnbc[101:105, c(1:5, 337)], "subtype")
datp
datp <- pair(tnbc[105:110, 1:5])
datp
Obtain predicted values using a random forest (rfsrc), random forest extracted rule (rules), or boosting (gbm) object. If no new data is provided, the function extracts the out-of-bag predicted values of the outcome for the training data.
predict(object, newdata = NULL, newdata.pair = FALSE, ...)
object |
An object of class (rfsrc, grow) from rforest, rules from extract.rules, or gbm from rboost. |
newdata |
Test data. If missing, the original training data is used for extracting the out-of-bag predicted values without running the model again. |
newdata.pair |
Is newdata already converted into binary ranked pairs via the pair function? |
... |
Further arguments passed to or from other methods. |
For the boosting (gbm) object, the cross-validation predicted values are provided if cv.folds >= 2.
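For example (a minimal sketch on the tnbc data used in the examples below), omitting newdata returns these cross-validated labels for the training samples:

data(tnbc)
objb <- rboost(subtype~., data = tnbc[1:100, c(1:5, 337)])  # cv.folds = 5 by default
head(predict(objb)$label)  # cross-validated predicted labels, no new data needed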
value |
Predicted value of the outcome. For the random forest model, the predicted probability of each class label is given. |
label |
Predicted label of the outcome. |
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
######################################################
# Random Forest
######################################################
obj <- rforest(subtype~., data = tnbc[1:100, c(1:5, 337)])
predict(obj)$label
predict(obj, tnbc[101:110, 1:5])$label
datp <- pair(tnbc[101:110, 1:5])
predict(obj, datp, newdata.pair = TRUE)$label
######################################################
# Random Forest Extracted Rule
######################################################
objr <- extract.rules(obj)
predict(objr)$label[1:5]
predict(objr, tnbc[101:110, 1:5])$label
######################################################
# Boosting
######################################################
obj <- rboost(subtype~., data = tnbc[1:100, c(1:5, 337)])
predict(obj)$label
predict(obj, tnbc[101:110, 1:5])$label
The package ranktreeEnsemble implements ensembles of rank-based trees via boosting with the LogitBoost cost and via random forests, on both binary and multi-class problems. It converts continuous gene expression profiles into ranked gene pairs, for which the variable importance indices are computed and adopted for dimension reduction. Interpretable rules can be extracted from the trees.
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
library(ranktreeEnsemble)
data(tnbc)
########### performance of Random Rank Forest
obj <- rforest(subtype~., data = tnbc[, c(1:10, 337)])
obj
# variable importance
importance(obj)
########### predict new data from Random Rank Forest
predict(obj, tnbc[101:110, 1:10])$label
########### extract decision rules from rank-based trees
objr <- extract.rules(obj)
objr$rule[1:5, ]
predict(objr, tnbc[101:110, 1:10])$label
########### filter decision rules with higher performance
objrs <- select.rules(objr, tnbc[110:130, c(1:10, 337)])
predict(objrs, tnbc[101:110, 1:10])$label
The function fits generalized boosted models via rank-based trees on both binary and multi-class problems. It converts continuous gene expression profiles into ranked gene pairs, for which the variable importance indices are computed and adopted for dimension reduction. The boosting implementation is directly imported from the gbm package. For technical details, see the vignette: utils::browseVignettes("gbm").
rboost(formula, data, dimreduce = TRUE, datrank = TRUE,
       distribution = "multinomial", weights, ntree = 100,
       nodedepth = 3, nodesize = 5, shrinkage = 0.05,
       bag.fraction = 0.5, train.fraction = 1, cv.folds = 5,
       keep.data = TRUE, verbose = TRUE, class.stratify.cv = TRUE,
       n.cores = NULL)
formula |
Object of class 'formula' describing the model to fit. |
data |
Data frame containing the y-outcome and x-variables. |
dimreduce |
Dimension reduction via variable importance weighted forests. |
datrank |
Whether to use ranked raw data when fitting the dimension reduction model. |
distribution |
Character string specifying the name of the distribution to use. If the response has only 2 unique values, "bernoulli" is used; otherwise the default "multinomial" is used. |
weights |
An optional vector of weights to be used in the fitting process. Weights must be positive but need not be normalized. |
ntree |
Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion, and matches n.trees in the gbm package. |
nodedepth |
Integer specifying the maximum depth of each tree. A value of 1 implies an additive model. This matches interaction.depth in the gbm package. |
nodesize |
Integer specifying the minimum number of observations in the terminal nodes of the trees. This matches n.minobsinnode in the gbm package. |
shrinkage |
Shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. Values from 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.05. |
bag.fraction |
The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1, running the same model twice will produce similar but different fits; use set.seed to reproduce results. |
train.fraction |
The first train.fraction * nrow(data) observations are used to fit the model, and the remainder are used to compute out-of-sample estimates of the loss function. |
cv.folds |
Number of cross-validation folds to perform. If cv.folds > 1, cross-validation is performed in addition to the usual fit, and the estimated generalization error is returned in cv.error. |
keep.data |
A logical variable indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more (from the gbm package) faster, at the cost of storing an extra copy of the dataset. |
verbose |
Logical indicating whether to print out progress and performance indicators (TRUE) or to run silently (FALSE). |
class.stratify.cv |
Logical indicating whether or not the cross-validation should be stratified by class. The purpose of stratifying the cross-validation is to help avoid situations in which training sets do not contain all classes. |
n.cores |
The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds off to different cores. If n.cores is not specified, it is guessed using the detectCores function from the parallel package. |
fit |
A vector containing the fitted values on the scale of the regression function (e.g., the log-odds scale for bernoulli). |
train.error |
A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the training data. |
valid.error |
A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the validation data. |
cv.error |
If cv.folds >= 2, a vector of length equal to the number of fitted trees containing the cross-validated estimate of the loss function at each boosting iteration. |
oobag.improve |
A vector of length equal to the number of fitted trees containing an out-of-bag estimate of the marginal reduction in the expected value of the loss function. The out-of-bag estimate uses only the training data and is useful for estimating the optimal number of boosting iterations. See gbm.perf in the gbm package. |
cv.fitted |
If cross-validation was performed, the cross-validation predicted values on the scale of the linear predictor. That is, the fitted values from the i-th CV-fold, for the model having been trained on the data in all other folds. |
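As a sketch of how these components might be used (assuming they are accessible by name as listed above), the per-iteration cross-validation error can suggest how many boosting iterations are worthwhile:

data(tnbc)
fit <- rboost(subtype~., data = tnbc[, c(1:10, 337)])  # cv.folds = 5 by default
best.iter <- which.min(fit$cv.error)  # iteration with the smallest CV error
plot(fit$cv.error, type = "l", xlab = "boosting iteration", ylab = "CV error")
abline(v = best.iter, lty = 2)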
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
obj <- rboost(subtype~., data = tnbc[, c(1:10, 337)])
obj
The function implements ensembles of rank-based trees in random forests on both binary and multi-class problems. It converts continuous gene expression profiles into ranked gene pairs, for which the variable importance indices are computed and adopted for dimension reduction. The random forest implementation is directly imported from the randomForestSRC package; use the command package?randomForestSRC for more information.
rforest(formula, data, dimreduce = TRUE, datrank = TRUE,
        ntree = 500, mtry = NULL, nodesize = NULL, nodedepth = NULL,
        splitrule = NULL, nsplit = NULL,
        importance = c(FALSE, TRUE, "none", "anti", "permute", "random"),
        bootstrap = c("by.root", "none"), membership = FALSE,
        na.action = c("na.omit", "na.impute"), nimpute = 1,
        perf.type = NULL, xvar.wt = NULL, yvar.wt = NULL,
        split.wt = NULL, case.wt = NULL, forest = TRUE,
        var.used = c(FALSE, "all.trees", "by.tree"),
        split.depth = c(FALSE, "all.trees", "by.tree"),
        seed = NULL, statistics = FALSE, ...)

## convenient interface for growing a rank-based tree
rforest.tree(formula, data, dimreduce = FALSE, ntree = 1,
             mtry = ncol(data), bootstrap = "none", ...)
formula |
Object of class 'formula' describing the model to fit. Interaction terms are not supported. |
data |
Data frame containing the y-outcome and x-variables. |
dimreduce |
Dimension reduction via variable importance weighted forests. |
datrank |
Whether to use ranked raw data when fitting the dimension reduction model. |
ntree |
Number of trees. |
mtry |
Number of variables to possibly split at each node. Default is the number of variables divided by 3 for regression; for all other families (including unsupervised settings), the square root of the number of variables. Values are rounded up. |
nodesize |
Minimum size of terminal node. The defaults are: survival (15), competing risk (15), regression (5), classification (1), mixed outcomes (3), unsupervised (3). It is recommended to experiment with different nodesize values. |
nodedepth |
Maximum depth to which a tree should be grown. Parameter is ignored by default. |
splitrule |
Splitting rule (see below). |
nsplit |
Non-negative integer specifying number of random splits for splitting a variable. When zero, all split values are used (deterministic splitting), which can be slower. By default 10 is used. |
importance |
Method for computing variable importance (VIMP). Default action is "none" (equivalently FALSE). |
bootstrap |
Bootstrap protocol. The default "by.root" bootstraps the data by sampling with replacement at the root node before growing the tree; "none" grows the trees without resampling. |
membership |
Should terminal node membership and inbag information be returned? |
na.action |
Action taken if the data contain NA values. The default "na.omit" removes the entire record if any entry is NA; "na.impute" imputes the data. |
nimpute |
Number of iterations of the missing data algorithm. Performance measures such as out-of-bag (OOB) error rates are optimistic if nimpute is greater than 1. |
perf.type |
Optional character value specifying the metric used for predicted value, variable importance (VIMP), and error rate. Reverts to the family default metric if not specified. Values allowed for univariate/multivariate classification are: "misclass" (default), "brier", and "gmean". |
xvar.wt |
Vector of non-negative weights (does not have to sum to 1) representing the probability of selecting a variable for splitting. Default is uniform weights. |
yvar.wt |
Used for sending in features with custom splitting. For expert use only. |
split.wt |
Vector of non-negative weights used for multiplying the split statistic for a variable. A large value encourages the node to split on a specific variable. Default is uniform weights. |
case.wt |
Vector of non-negative weights (does not have to sum to 1) for sampling cases. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples. It is generally better to use real weights rather than integers. See the class-imbalance sketch following this table. |
forest |
Save key forest values? Used for prediction on new data and required by many of the package functions. Turn this off if you are only interested in training a forest. |
var.used |
Return statistics on the number of times a variable split? Default is FALSE; "all.trees" returns counts aggregated over the forest, while "by.tree" returns counts for each tree. |
split.depth |
Records the minimal depth for each variable. Default is FALSE; "all.trees" records minimal depth aggregated over the forest, while "by.tree" records it for each tree. |
seed |
Negative integer specifying seed for the random number generator. |
statistics |
Should split statistics be returned? Values can be parsed using the stat.split function from the randomForestSRC package. |
... |
Further arguments passed to or from other methods. |
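As a hedged illustration of case.wt on class-imbalanced data (a sketch on the tnbc example data; the 2:1 weighting is an arbitrary choice, not a package recommendation):

data(tnbc)
# Upweight the least frequent subtype so it is sampled more often in the
# bootstrap; weights need not sum to one.
tab <- table(tnbc$subtype)
wt <- ifelse(tnbc$subtype == names(which.min(tab)), 2, 1)
obj.wt <- rforest(subtype~., data = tnbc[, c(1:10, 337)], case.wt = wt)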
Splitting

Splitting rules are specified by the option splitrule. For all families, pure random splitting can be invoked by setting splitrule = "random". For all families, computational speed can be increased using randomized splitting invoked by the option nsplit; see the randomForestSRC documentation on improving computational speed.
Available splitting rules

splitrule = "gini" (default): Gini index splitting (Breiman et al. 1984, Chapter 4.3).

splitrule = "auc": AUC (area under the ROC curve) splitting for both two-class and multiclass settings. AUC splitting is appropriate for imbalanced data; see the imbalanced function in randomForestSRC for more information.

splitrule = "entropy": entropy splitting (Breiman et al. 1984, Chapters 2.5 and 4.3).
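For example (a sketch on the tnbc example data), AUC splitting with randomized splitting can be requested as:

data(tnbc)
obj.auc <- rforest(subtype~., data = tnbc[, c(1:10, 337)],
                   splitrule = "auc", nsplit = 10)
obj.auc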
An object of class (rfsrc, grow) with the following components:
call |
The original call to rforest. |
family |
The family used in the analysis. |
n |
Sample size of the data (depends upon NA values and the na.action setting). |
ntree |
Number of trees grown. |
mtry |
Number of variables randomly selected for splitting at each node. |
nodesize |
Minimum size of terminal nodes. |
nodedepth |
Maximum depth allowed for a tree. |
splitrule |
Splitting rule used. |
nsplit |
Number of randomly selected split points. |
yvar |
y-outcome values. |
yvar.names |
A character vector of the y-outcome names. |
xvar |
Data frame of x-variables. |
xvar.names |
A character vector of the x-variable names. |
xvar.wt |
Vector of non-negative weights for dimension reduction which specify the probability used to select a variable for splitting a node. |
split.wt |
Vector of non-negative weights specifying multiplier by which the split statistic for a covariate is adjusted. |
cause.wt |
Vector of weights used for the composite competing risk splitting rule. |
leaf.count |
Number of terminal nodes for each tree in the forest. Vector of length ntree. |
proximity |
Proximity matrix recording how frequently pairs of data points occur within the same terminal node. |
forest |
If forest = TRUE, the forest object is returned. It is used for prediction with new test data and is required by many of the package functions. |
membership |
Matrix recording terminal node membership, where each row corresponds to a case and each column records that case's terminal node in a tree. |
inbag |
Matrix recording in-bag membership, where each entry gives the number of times a case (row) appears in the bootstrap sample of a tree (column). |
var.used |
Count of the number of times a variable is used in growing the forest. |
imputed.indv |
Vector of indices for cases with missing values. |
imputed.data |
Data frame of the imputed data. The first column(s) are reserved for the y-outcomes, after which the x-variables are listed. |
split.depth |
Matrix (i,j) or array (i,j,k) recording the minimal depth for variable j for case i, either averaged over the forest, or by tree k. |
node.stats |
Split statistics, returned when statistics = TRUE. |
err.rate |
Tree cumulative OOB error rate. |
importance |
Variable importance (VIMP) for each x-variable. |
predicted |
In-bag predicted value. |
predicted.oob |
OOB predicted value. |
class |
In-bag predicted class labels. |
class.oob |
OOB predicted class labels. |
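A brief sketch of inspecting a few of these components (assuming they are accessible by name as listed above):

data(tnbc)
obj <- rforest(subtype~., data = tnbc[, c(1:10, 337)])
tail(obj$err.rate, 1)  # cumulative OOB error rate at the final tree
head(obj$class.oob)    # OOB predicted class labels
obj$leaf.count[1:5]    # terminal node counts for the first five trees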
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
########### performance of Random Rank Forest
obj <- rforest(subtype~., data = tnbc[, c(1:10, 337)])
obj
Select rules from an extract.rules (rules) object
select.rules(object, data, data.pair = FALSE)
object |
An extracted rule object of class rules from extract.rules. |
data |
A validation dataset for selecting rules. |
data.pair |
Is data already converted into binary ranked pairs via the pair function? |
rule |
Interpretable selected rules. Note that the displayed performance score can be inaccurate when based on the few samples from the original argument object. |
rule.raw |
Rules directly extracted from trees for prediction purposes. |
data |
Data used to grow the trees, carried over from the argument object. |
Ruijie Yin, Chen Ye and Min Lu (Maintainer, <[email protected]>)
Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.
data(tnbc)
obj <- rforest(subtype~., data = tnbc[1:100, c(1:5, 337)])
objr <- extract.rules(obj)
predict(objr, tnbc[101:110, 1:5])$label
objrs <- select.rules(objr, tnbc[110:130, c(1:5, 337)])
predict(objrs, tnbc[101:110, 1:5])$label
Gene expression profiles in triple-negative breast cancer cells with 215 observations and 337 variables. Gene expression values were randomly chosen from the original dataset. The outcome variable is subtype.
data(tnbc)
Chen, X., Li, J., Gray, W. H., Lehmann, B. D., Bauer, J. A., Shyr, Y., & Pietenpol, J. A. (2012). TNBCtype: a subtyping tool for triple-negative breast cancer. Cancer informatics, 11, CIN-S9983.
data(tnbc)
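A quick check of the dimensions and class distribution described above (a minimal sketch):

data(tnbc)
dim(tnbc)            # 215 observations, 337 variables
table(tnbc$subtype)  # distribution of the outcome subtype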