Apply an imputation method to the data set.
imputeType = "LocalMinVal",
reqPercentPresent = 0.51,
k = 10,
rowmax = 0.5,
colmax = 0.8,
maxp = 1500,
rng.seed = 362436069,
rank.max = NULL,
lambda = NULL,
thresh = 1e-05,
maxit = 100,
final.svd = TRUE,
reportImputing = FALSE
- dataSet
The 2d data set of experimental values.
- imputeType
A character string (default = "LocalMinVal") specifying which imputation method to use:
"LocalMinVal": replace missing values with the lowest value from the protein by condition combination.
"GlobalMinVal": replace missing values with the lowest value found within the entire data set.
"knn": replace missing values using the k-nearest neighbors algorithm (Troyanskaya et al. 2001).
"seq-knn": replace missing values using the sequential k-nearest neighbors algorithm (Kim et al. 2004).
"trunc-knn": replace missing values using the truncated k-nearest neighbors algorithm (Shah et al. 2017).
"nuc-norm": replace missing values using nuclear-norm regularization (Hastie et al. 2015).
- reqPercentPresent
A scalar (default = 0.51) specifying the required proportion of values that must be present in a given protein by condition combination for values to be imputed when imputeType = "LocalMinVal".
- k
An integer (default = 10) indicating the number of neighbors to be used in the imputation when imputeType = "knn", "seq-knn", or "trunc-knn".
- rowmax
A scalar (default = 0.5) specifying the maximum proportion of missing data allowed in any row when imputeType = "knn". Any row with more than rowmax*100% missing values is imputed using the overall mean per sample.
- colmax
A scalar (default = 0.8) specifying the maximum proportion of missing data allowed in any column when imputeType = "knn". If any column has more than colmax*100% missing data, the program halts and reports an error.
- maxp
An integer (default = 1500) indicating the largest block of proteins imputed using the k-nearest neighbors algorithm when imputeType = "knn". Larger blocks are divided by two-means clustering (recursively) prior to imputation.
- rng.seed
An integer (default = 362436069) specifying the seed used for the random number generator for reproducibility when imputeType = "knn".
- rank.max
An integer specifying the restriction on the rank of the solution for imputeType = "nuc-norm". The default is set to one less than the minimum dimension of the data set.
- lambda
A scalar specifying the nuclear-norm regularization parameter for imputeType = "nuc-norm". If lambda = 0, the algorithm typically converges more slowly. The default is set to the maximum singular value obtained from the singular value decomposition (SVD) of the data set.
- thresh
A scalar (default = 1e-5) specifying the convergence threshold for imputeType = "nuc-norm", measured as the relative change in the Frobenius norm between two successive estimates.
- maxit
An integer (default = 100) specifying the maximum number of iterations allowed before convergence is reached for imputeType = "nuc-norm".
- final.svd
A boolean (default = TRUE) specifying whether to perform a one-step unregularized iteration at the final iteration for imputeType = "nuc-norm", followed by soft-thresholding of the singular values, resulting in hard zeros.
- reportImputing
A boolean (default = FALSE) specifying whether to provide a shadow data frame with imputed data labels, where 1 indicates that the corresponding entry has been imputed and 0 indicates otherwise. Alters the return structure.
If reportImputing = FALSE, the function returns the imputed 2d data frame.
If reportImputing = TRUE, the function returns a list of the imputed 2d data frame and a shadow matrix showing which proteins by replicate were imputed.
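The two return structures can be illustrated with a small base R sketch. This is not the package's implementation: the helper name impute_with_shadow, the list element names imputedData and shadowMatrix, and the toy data are all hypothetical, and a simple global-minimum fill stands in for the chosen imputation method.

```r
# Hypothetical helper: fills NAs with the global minimum and optionally
# reports which entries were imputed (1 = imputed, 0 = observed).
impute_with_shadow <- function(x, report = FALSE) {
  # Build the shadow matrix before any values are filled in.
  shadow <- 1 * is.na(as.matrix(x))
  filled <- x
  filled[is.na(filled)] <- min(as.matrix(x), na.rm = TRUE)
  # Element names below are assumptions for illustration only.
  if (report) list(imputedData = filled, shadowMatrix = shadow) else filled
}

# Toy data: rows are replicates, columns are proteins (assumed layout).
toy <- data.frame(P1 = c(1, NA), P2 = c(3, 4))
plain <- impute_with_shadow(toy)                 # data frame only
both  <- impute_with_shadow(toy, report = TRUE)  # list with shadow matrix
```

The shadow matrix is computed from the original data, so it records the missingness pattern regardless of which method fills the values.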
Hastie T, Mazumder R, Lee JD, Zadeh R (2015).
“Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.”
Journal of Machine Learning Research, 16(104), 3367–3402.
Kim K, Kim B, Yi G (2004).
“Reuse of Imputed Data in Microarray Analysis Increases Imputation Efficiency.”
BMC Bioinformatics, 5, 160.
Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN (2017).
“Distribution Based Nearest Neighbor Imputation for Truncated High Dimensional Data with Applications to Pre-Clinical and Clinical Metabolomics Studies.”
BMC Bioinformatics, 18, 114.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001).
“Missing Value Estimation Methods for DNA Microarrays.”
Bioinformatics, 17(6), 520–525.
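The two minimum-value methods described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: the helper names global_min_fill and local_min_fill, the req_present argument (mirroring reqPercentPresent), the toy layout (rows as replicates, columns as proteins, with a separate condition vector), and the inclusive comparison against the threshold are all hypothetical.

```r
# Toy data: rows are replicates, columns are proteins (assumed layout).
condition <- c("A", "A", "A", "B", "B", "B")
x <- data.frame(
  P1 = c(10, NA, 12, 20, 21, NA),
  P2 = c(5, 6, NA, NA, NA, 9)
)

# "GlobalMinVal"-style fill: every NA becomes the smallest observed value.
global_min_fill <- function(x) {
  filled <- x
  filled[is.na(filled)] <- min(as.matrix(x), na.rm = TRUE)
  filled
}

# "LocalMinVal"-style fill: within each protein by condition combination,
# NAs become that combination's minimum, but only when the fraction of
# present values is at least req_present (inclusive bound is an assumption).
local_min_fill <- function(x, condition, req_present = 0.51) {
  filled <- x
  for (protein in names(x)) {
    for (cond in unique(condition)) {
      rows <- condition == cond
      values <- x[rows, protein]
      present <- mean(!is.na(values))
      if (present < 1 && present >= req_present) {
        filled[rows & is.na(x[[protein]]), protein] <- min(values, na.rm = TRUE)
      }
    }
  }
  filled
}

global_filled <- global_min_fill(x)
local_filled <- local_min_fill(x, condition)
```

In this toy data, P2 under condition B has only one of three values present, which falls below the 0.51 threshold, so the sketch leaves those two entries missing.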