Apply an imputation method to the data set.
Usage
impute(
  dataSet,
  imputeType = "LocalMinVal",
  reqPercentPresent = 0.51,
  k = 10,
  rowmax = 0.5,
  colmax = 0.8,
  maxp = 1500,
  rng.seed = 362436069,
  rank.max = NULL,
  lambda = NULL,
  thresh = 1e-05,
  maxit = 100,
  final.svd = TRUE,
  reportImputing = FALSE
)
Arguments
- dataSet
The 2d data set of experimental values.
- imputeType
A character string (default = "LocalMinVal") specifying which imputation method to use:
"LocalMinVal": replace missing values with the lowest value from the protein-by-condition combination.
"GlobalMinVal": replace missing values with the lowest value found within the entire data set.
"knn": replace missing values using the k-nearest neighbors algorithm (Troyanskaya et al. 2001).
"seq-knn": replace missing values using the sequential k-nearest neighbors algorithm (Kim et al. 2004).
"trunc-knn": replace missing values using the truncated k-nearest neighbors algorithm (Shah et al. 2017).
"nuc-norm": replace missing values using nuclear-norm regularization (Hastie et al. 2015).
- reqPercentPresent
A scalar (default = 0.51) specifying the proportion of values that must be present in a given protein-by-condition combination for values to be imputed when imputeType = "LocalMinVal".
- k
An integer (default = 10) indicating the number of neighbors to be used in the imputation when imputeType is "knn", "seq-knn", or "trunc-knn".
- rowmax
A scalar (default = 0.5) specifying the maximum proportion of missing data allowed in any row when imputeType = "knn". Rows with more than rowmax*100% missing values are imputed using the overall mean per sample.
- colmax
A scalar (default = 0.8) specifying the maximum proportion of missing data allowed in any column when imputeType = "knn". If any column has more than colmax*100% missing data, the program halts and reports an error.
- maxp
An integer (default = 1500) indicating the largest block of proteins imputed using the k-nearest neighbors algorithm when imputeType = "knn". Larger blocks are recursively divided by two-means clustering prior to imputation.
- rng.seed
An integer (default = 362436069) specifying the seed for the random number generator, used for reproducibility when imputeType = "knn".
- rank.max
An integer specifying the maximum rank of the solution for imputeType = "nuc-norm". When left at the default (NULL), it is set to one less than the minimum dimension of the data set.
- lambda
A scalar specifying the nuclear-norm regularization parameter for imputeType = "nuc-norm". If lambda = 0, the algorithm typically converges more slowly. When left at the default (NULL), it is set to the maximum singular value obtained from the singular value decomposition (SVD) of the data set.
- thresh
A scalar (default = 1e-5) specifying the convergence threshold for imputeType = "nuc-norm", measured as the relative change in the Frobenius norm between two successive estimates.
- maxit
An integer (default = 100) specifying the maximum number of iterations to run if convergence is not reached first, for imputeType = "nuc-norm".
- final.svd
A boolean (default = TRUE) specifying whether to perform a one-step unregularized iteration at the final iteration for imputeType = "nuc-norm", followed by soft-thresholding of the singular values, resulting in hard zeros.
- reportImputing
A boolean (default = FALSE) specifying whether to also return a shadow data frame of imputation labels, where 1 indicates that the corresponding entry was imputed and 0 indicates otherwise. Setting this to TRUE changes the return structure.
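The difference between the two minimum-value options, and the role of reqPercentPresent, can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the toy layout (one row per sample, a `condition` column, one column per protein), the helper name `local_min_fill`, and the use of `>=` for the threshold comparison are all hypothetical.

```r
# Toy data (hypothetical layout): rows are samples, columns are proteins.
toy <- data.frame(
  condition = c("A", "A", "A", "B", "B", "B"),
  P1 = c(10, NA, 12, 20, 21, NA),
  P2 = c(NA, NA, 5, 7, 8, 9)
)
proteins <- c("P1", "P2")

# Fill NAs in one protein-by-condition group with the group minimum,
# but only when enough of the group's values were observed.
# Assumption: the threshold is inclusive (>=).
local_min_fill <- function(x, req_present = 0.51) {
  present <- mean(!is.na(x))
  if (present >= req_present && present < 1) {
    x[is.na(x)] <- min(x, na.rm = TRUE)
  }
  x
}

# "LocalMinVal"-style fill: ave() applies the helper within each condition
# and returns the values in the original row order.
local_filled <- toy
for (p in proteins) {
  local_filled[[p]] <- ave(toy[[p]], toy$condition, FUN = local_min_fill)
}

# "GlobalMinVal"-style fill: every NA gets the smallest observed value
# in the whole data set.
global_min <- min(as.matrix(toy[proteins]), na.rm = TRUE)
global_filled <- toy
for (p in proteins) {
  global_filled[[p]][is.na(toy[[p]])] <- global_min
}

print(local_filled)
print(global_filled)
```

In this toy example, P2 under condition A has only one of three values present, which is below 0.51, so the local fill leaves its missing values untouched while the global fill replaces them.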
Value
If reportImputing = FALSE, the function returns the imputed 2d data frame.
If reportImputing = TRUE, the function returns a list containing the imputed 2d data frame and a shadow matrix showing which protein-by-replicate entries were imputed.
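The idea of a shadow matrix can be sketched in base R as follows. The matrix, the stand-in fill rule, and the list element names `imputed` and `shadow` are hypothetical and chosen only for illustration; the names in the list actually returned by the function may differ.

```r
# Hypothetical numeric matrix with two missing entries.
before <- matrix(
  c(1, NA, 3, 4, NA, 6),
  nrow = 2,
  dimnames = list(c("s1", "s2"), c("P1", "P2", "P3"))
)

# Stand-in imputation: fill every NA with the global minimum.
after <- before
after[is.na(before)] <- min(before, na.rm = TRUE)

# Shadow matrix: 1 where an entry was imputed, 0 otherwise.
# Adding 0L converts the logical mask to integers and keeps the dimnames.
shadow <- is.na(before) + 0L

# A list pairing the imputed values with their labels.
result <- list(imputed = after, shadow = shadow)
print(result)
```

Keeping the labels in a separate object with the same shape as the data makes it straightforward to flag or exclude imputed entries in later steps.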
References
Hastie T, Mazumder R, Lee JD, Zadeh R (2015).
“Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.”
Journal of Machine Learning Research, 16(104), 3367–3402.
http://jmlr.org/papers/v16/hastie15a.html.
Kim K, Kim B, Yi G (2004).
“Reuse of Imputed Data in Microarray Analysis Increases Imputation Efficiency.”
BMC Bioinformatics, 5, 160.
doi:10.1186/1471-2105-5-160.
Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN (2017).
“Distribution Based Nearest Neighbor Imputation for Truncated High Dimensional Data with Applications to Pre-Clinical and Clinical Metabolomics Studies.”
BMC Bioinformatics, 18, 114.
doi:10.1186/s12859-017-1547-6.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001).
“Missing Value Estimation Methods for DNA Microarrays.”
Bioinformatics, 17(6), 520–525.
doi:10.1093/bioinformatics/17.6.520.