Apply an imputation method to the data set.

Usage

impute(
  dataSet,
  imputeType = "LocalMinVal",
  reqPercentPresent = 0.51,
  k = 10,
  rowmax = 0.5,
  colmax = 0.8,
  maxp = 1500,
  rng.seed = 362436069,
  rank.max = NULL,
  lambda = NULL,
  thresh = 1e-05,
  maxit = 100,
  final.svd = TRUE,
  reportImputing = FALSE
)

Arguments

dataSet

The 2d data set of experimental values.

imputeType

A character string (default = "LocalMinVal") specifying which imputation method to use:

  1. "LocalMinVal": replace missing values with the lowest value from the protein by condition combination.

  2. "GlobalMinVal": replace missing values with the lowest value found within the entire data set.

  3. "knn": replace missing values using the k-nearest neighbors algorithm (Troyanskaya et al. 2001).

  4. "seq-knn": replace missing values using the sequential k-nearest neighbors algorithm (Kim et al. 2004).

  5. "trunc-knn": replace missing values using the truncated k-nearest neighbors algorithm (Shah et al. 2017).

  6. "nuc-norm": replace missing values using nuclear-norm regularization (Hastie et al. 2015).
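
The difference between the two minimum-value methods can be sketched in base R. This is an illustration of the idea only, not the package's implementation; the toy data frame, its layout (replicates in rows, proteins in columns), and the column names are all assumptions.

```r
# Toy data: hypothetical layout with one condition column and two protein columns.
toy <- data.frame(
  condition = c("A", "A", "A", "B", "B", "B"),
  prot1 = c(10, NA, 12, 20, 21, NA),
  prot2 = c(5, 6, NA, NA, 8, 9)
)
protein_cols <- c("prot1", "prot2")

# "GlobalMinVal" idea: a single replacement value for the entire data set.
global_min <- min(as.matrix(toy[protein_cols]), na.rm = TRUE)
global_filled <- toy
for (p in protein_cols) {
  global_filled[[p]][is.na(global_filled[[p]])] <- global_min
}

# "LocalMinVal" idea: one replacement value per protein by condition combination.
local_filled <- toy
for (p in protein_cols) {
  local_filled[[p]] <- ave(toy[[p]], toy$condition, FUN = function(v) {
    v[is.na(v)] <- min(v, na.rm = TRUE)
    v
  })
}
```

In this toy case the global method fills every gap with 5, while the local method fills the gap in prot1 under condition B with 20, the lowest value observed for that combination.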

reqPercentPresent

A scalar (default = 0.51) specifying the required proportion of values that must be present in a given protein by condition combination for values to be imputed when imputeType = "LocalMinVal".
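
A minimal sketch of how such a presence threshold could gate the imputation of one protein by condition group. The helper name `fill_if_present` is hypothetical, and whether the package compares with `>=` or `>` is an assumption.

```r
# Hypothetical helper: fill with the group minimum only when enough values are observed.
fill_if_present <- function(values, req_present = 0.51) {
  present <- mean(!is.na(values))  # proportion of observed values in the group
  if (present >= req_present) {    # assumption: the threshold is inclusive
    values[is.na(values)] <- min(values, na.rm = TRUE)
  }
  values
}

dense_group  <- fill_if_present(c(10, NA, 12))  # 2 of 3 present: gap is filled
sparse_group <- fill_if_present(c(NA, NA, 12))  # 1 of 3 present: left unchanged
```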

k

An integer (default = 10) indicating the number of neighbors to be used in the imputation when imputeType is "knn", "seq-knn", or "trunc-knn".
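
To show what the number of neighbors controls, here is a deliberately simplified nearest-neighbor fill for a single row. It is not the algorithm used by any of the three methods: the helper `knn_impute_row` is hypothetical, neighbors are restricted to complete rows, and the unweighted average of the k closest rows is an assumption made to keep the sketch short.

```r
# Hypothetical helper: fill the gaps in row i from the k closest complete rows.
knn_impute_row <- function(x, i, k) {
  target <- x[i, ]
  gaps <- is.na(target)
  others <- x[-i, , drop = FALSE]
  # Simplification: only rows without any missing value may act as neighbors.
  complete <- others[stats::complete.cases(others), , drop = FALSE]
  # Root-mean-square distance over the columns observed in the target row.
  diffs <- sweep(complete[, !gaps, drop = FALSE], 2, target[!gaps])
  d <- sqrt(rowMeans(diffs^2))
  nearest <- order(d)[seq_len(min(k, nrow(complete)))]
  # Replace each gap with the mean of the neighbors' values in that column.
  target[gaps] <- colMeans(complete[nearest, gaps, drop = FALSE])
  target
}

x <- rbind(
  c(1.0,  2.0, NA),
  c(1.0,  2.0,  3),
  c(1.1,  2.1,  5),
  c(10.0, 20.0, 30)
)
filled_row <- knn_impute_row(x, i = 1, k = 2)
```

With k = 2 the distant fourth row is ignored and the gap is filled with the mean of 3 and 5; a larger k would pull in less similar rows.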

rowmax

A scalar (default = 0.5) specifying the maximum proportion of missing data allowed in any row when imputeType = "knn". Any rows with more than rowmax*100% missing data are imputed using the overall mean per sample.

colmax

A scalar (default = 0.8) specifying the maximum proportion of missing data allowed in any column when imputeType = "knn". If any column has more than colmax*100% missing data, the program halts and reports an error.
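
The rowmax and colmax checks can be pictured with the base R sketch below. The matrix orientation (samples in columns) and the error message are assumptions for illustration; the actual checks are carried out inside the k-nearest neighbors routine.

```r
# Toy matrix; orientation is assumed for illustration.
x <- matrix(
  c(1, NA, NA, NA,
    2,  3,  4, NA,
    5,  6,  7,  8),
  nrow = 3, byrow = TRUE
)
rowmax <- 0.5
colmax <- 0.8

# Column check: halt if any column is missing more than colmax of its values.
col_missing <- colMeans(is.na(x))
if (any(col_missing > colmax)) {
  stop("a column has more than ", colmax * 100, "% missing data")
}

# Row check: rows above rowmax are filled with the column means instead of by kNN.
row_missing <- rowMeans(is.na(x))
sparse_rows <- which(row_missing > rowmax)
col_means <- colMeans(x, na.rm = TRUE)
for (i in sparse_rows) {
  gaps <- is.na(x[i, ])
  x[i, gaps] <- col_means[gaps]
}
```

Here only the first row (75% missing) exceeds rowmax, so its gaps receive the column means; the single gap in the second row would be left for the neighbor-based step.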

maxp

An integer (default = 1500) indicating the largest block of proteins imputed using the k-nearest neighbors algorithm when imputeType = "knn". Larger blocks are divided by two-means clustering (recursively) prior to imputation.

rng.seed

An integer (default = 362436069) specifying the seed used for the random number generator for reproducibility when imputeType = "knn".

rank.max

An integer specifying the restriction on the rank of the solution for imputeType = "nuc-norm". If NULL (the default), it is set to one less than the minimum dimension of the data set.

lambda

A scalar specifying the nuclear-norm regularization parameter for imputeType = "nuc-norm". If lambda = 0, the algorithm typically converges more slowly. If NULL (the default), it is set to the maximum singular value obtained from the singular value decomposition (SVD) of the data set.
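
The two defaults described for rank.max and lambda can be approximated in base R as follows. How missing entries are treated before the SVD is not stated here, so the zero-fill step is an assumption, as are the variable names.

```r
# Toy 4 x 3 matrix with two missing entries.
x <- matrix(c(1, 2, NA, 4, 5, 6, 7, NA, 9, 10, 11, 12), nrow = 4)

# Default rank restriction: one less than the smaller dimension.
rank_max_default <- min(dim(x)) - 1

# Default lambda: the largest singular value.
# Assumption: missing entries are replaced by zero before the decomposition.
x_filled <- x
x_filled[is.na(x_filled)] <- 0
lambda_default <- svd(x_filled)$d[1]
```

`svd()` returns singular values in decreasing order, so the first element of `d` is the maximum.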

thresh

A scalar (default = 1e-5) specifying the convergence threshold for imputeType = "nuc-norm", measured as the relative change in the Frobenius norm between two successive estimates.
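
One way to read this criterion is sketched below. The helper `rel_change` is hypothetical, and the exact normalization used by the underlying routine (for example squared versus unsquared norms) is an assumption.

```r
# Hypothetical helper: relative change between two successive estimates,
# measured in the Frobenius norm.
rel_change <- function(old, new) {
  norm(new - old, type = "F") / norm(old, type = "F")
}

old_estimate <- matrix(c(1, 2, 3, 4), nrow = 2)
new_estimate <- old_estimate * (1 + 1e-6)  # a very small update

thresh <- 1e-05
converged <- rel_change(old_estimate, new_estimate) < thresh
```

Iteration would stop once `converged` is TRUE or once maxit iterations have been performed, whichever comes first.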

maxit

An integer (default = 100) specifying the maximum number of iterations to perform if convergence is not reached earlier for imputeType = "nuc-norm".

final.svd

A boolean (default = TRUE) specifying whether to perform a one-step unregularized iteration at the final iteration for imputeType = "nuc-norm", followed by soft-thresholding of the singular values, resulting in hard zeros.

reportImputing

A boolean (default = FALSE) specifying whether to provide a shadow data frame with imputed data labels, where 1 indicates the corresponding entries have been imputed, and 0 indicates otherwise. Alters the return structure.

Value

  • If reportImputing = FALSE, the function returns the imputed 2d data frame.

  • If reportImputing = TRUE, the function returns a list containing the imputed 2d data frame and a shadow matrix showing which protein by replicate entries were imputed.
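
The shape of the shadow output can be sketched in base R. The toy data, the placeholder fill, and the list element names `imputed` and `shadow` are assumptions; the names used in the function's actual return value are not documented here.

```r
# Toy data with two missing entries.
x <- data.frame(prot1 = c(10, NA, 12), prot2 = c(NA, 6, 7))

# Shadow labels: 1 where a value was missing (and so will be imputed), 0 otherwise.
shadow <- as.data.frame(1 * is.na(x))

# Placeholder imputation (global minimum), used only to complete the example.
imputed <- x
imputed[is.na(imputed)] <- min(as.matrix(x), na.rm = TRUE)

# Hypothetical return structure when labels are requested.
result <- list(imputed = imputed, shadow = shadow)
```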

References

Hastie T, Mazumder R, Lee JD, Zadeh R (2015). “Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” Journal of Machine Learning Research, 16(104), 3367–3402. http://jmlr.org/papers/v16/hastie15a.html.

Kim K, Kim B, Yi G (2004). “Reuse of Imputed Data in Microarray Analysis Increases Imputation Efficiency.” BMC Bioinformatics, 5, 160. doi:10.1186/1471-2105-5-160.

Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN (2017). “Distribution Based Nearest Neighbor Imputation for Truncated High Dimensional Data with Applications to Pre-Clinical and Clinical Metabolomics Studies.” BMC Bioinformatics, 18, 114. doi:10.1186/s12859-017-1547-6.

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001). “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics, 17(6), 520–525. doi:10.1093/bioinformatics/17.6.520.