Imputation

Preliminary

## load R package
library(msDiaLogue)
## preprocessing
fileName <- "../tests/testData/Toy_Spectronaut_Data.csv"
dataSet <- preprocessing(fileName,
                         filterNaN = TRUE, filterUnique = 2,
                         replaceBlank = TRUE, saveRm = TRUE)
## transformation
dataTran <- transform(dataSet, logFold = 2)
## normalization
dataNorm <- normalize(dataTran, normalizeType = "quant")

Examples

For example, to impute the NA values in dataNorm with impute.min_local(), set reqPercentPresent = 0.51, so that a missing value is imputed only when at least 51% of the values for that protein within a given condition are present.

Note: The field of proteomics has no established rule for filtering on percentage of missingness, just as it has no rule for the number of replicates required to draw a conclusion. However, reproducible observations make conclusions more credible. Setting reqPercentPresent to 0.51 requires that a protein be observed in a majority of the replicates within a condition in order to be considered. With 3 replicates, 2 measurements are needed to allow imputation of the 3rd value; if only 1 measurement is seen, the other values remain NA and are filtered out in a subsequent step. With the 4 replicates per condition in this dataset, 3 measurements are needed, because 2 of 4 (50%) falls below the 51% threshold.
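The replicate arithmetic in the note can be checked with a short sketch. The helper minObserved() is hypothetical (it is not part of msDiaLogue), and it assumes the threshold is applied as "proportion present is at least reqPercentPresent".

```r
## Hypothetical helper, not an msDiaLogue function: the smallest number of
## observed replicates whose proportion reaches the threshold.
## Assumption: a group qualifies when (observed / total) >= reqPercentPresent.
minObserved <- function(nReplicates, reqPercentPresent = 0.51) {
  ceiling(nReplicates * reqPercentPresent)
}

## Required measurements for 3 and 4 replicates per condition
required <- sapply(c(3, 4), minObserved)
required
```

Under that assumption, 3 replicates need 2 measurements and 4 replicates need 3.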

dataImput <- impute.min_local(dataNorm, reportImputing = FALSE,
                              reqPercentPresent = 0.51)

R.Condition	R.Replicate	NUD4B_HUMAN	A0A7P0T808_HUMAN	A0A8I5KU53_HUMAN	ZN840_HUMAN	CC85C_HUMAN	TMC5B_HUMAN	C9JEV0_HUMAN	C9JNU9_HUMAN	ALBU_BOVIN	CYC_BOVIN	TRFE_BOVIN	KRT16_MOUSE	F8W0H2_HUMAN	H0Y7V7_HUMAN	H0YD14_HUMAN	H3BUF6_HUMAN	H7C1W4_HUMAN	H7C3M7_HUMAN	TCPR2_HUMAN	TLR3_HUMAN	LRIG2_HUMAN	RAB3D_HUMAN	ADH1_YEAST	LYSC_CHICK	BGAL_ECOLI	CYTA_HUMAN	KPCB_HUMAN	LIPL_HUMAN	PIP_HUMAN	CO6_HUMAN	BGAL_HUMAN	SYTC_HUMAN	CASPE_HUMAN	DCAF6_HUMAN	DALD3_HUMAN	HGNAT_HUMAN	RFFL_HUMAN	RN185_HUMAN	ZN462_HUMAN	ALKB7_HUMAN	POLK_HUMAN	ACAD8_HUMAN	A0A7I2PK40_HUMAN	NBDY_HUMAN	H0Y5R1_HUMAN
100pmol	1	10.37045	11.406514	10.956950	8.392426	8.710518	8.610420	7.829510	8.023133	16.75777	12.96499	13.97388	10.51096	9.136271	10.231965	10.048461	8.179306	8.279169	9.874410	14.201118	7.001503	8.832972	9.978488	15.16303	13.62766	14.44005	8.964155	9.574185	8.517979	6.420716	6.764393	12.07953	14.76033	6.004586	7.670711	10.129049	10.681337	7.242036	9.727210	9.376507	7.109682	7.393910	7.530379	10.370449	NA	NA
100pmol	2	11.40651	12.964987	9.727210	8.517979	8.832972	8.710518	7.242036	8.023133	16.75777	13.62766	14.20112	NA	8.964155	10.048461	10.231965	7.829510	7.670711	9.574185	10.681337	7.393910	8.610420	10.129049	15.16303	13.97388	14.44005	9.136271	9.978488	8.279169	6.764393	6.420716	12.07953	14.76033	7.109682	8.179306	9.874410	10.956950	7.001503	10.370449	8.392426	6.004586	9.376507	7.530379	10.510962	NA	NA
100pmol	3	10.32522	11.893804	10.851852	7.868171	8.887142	8.610420	6.429596	8.646475	16.75777	12.81909	13.94448	NA	9.027184	9.940284	9.504539	8.074082	7.698334	10.467264	14.178809	7.413486	8.433160	10.022816	15.15272	13.55168	14.42169	8.758911	9.812352	8.213284	NA	6.777937	11.29130	14.74334	6.004586	8.311138	9.649565	10.097660	7.009435	10.195272	8.550204	7.121143	7.262869	7.555404	10.625304	9.244669	NA
100pmol	4	10.51096	12.079525	10.956950	8.279169	8.964155	8.610420	6.004586	8.023133	16.75777	12.96499	13.97388	NA	9.136271	10.048461	9.978488	7.829510	7.670711	9.376507	14.440054	7.829510	8.517979	10.231965	15.16303	13.62766	14.20112	8.710518	10.129049	8.179306	NA	7.109682	11.40651	14.76033	6.420716	7.530379	8.832972	10.681337	6.764393	9.874410	8.392426	7.001503	7.242036	7.393910	10.370449	9.727210	9.574185
200pmol	1	10.27762	12.256403	10.413259	8.356482	7.375266	8.476493	7.098768	8.149770	16.75777	13.10393	14.00189	10.74088	9.255807	10.077232	9.798890	8.142561	7.974610	10.164570	13.700017	7.644405	8.253204	9.927121	15.17286	14.22236	14.77650	8.577167	10.010298	8.782531	6.004586	7.222195	11.51625	14.45755	6.412260	7.506546	9.641587	10.556023	6.751494	8.669939	9.040857	7.792691	6.993948	8.905805	11.057043	NA	9.504539
200pmol	2	10.53171	11.078642	10.729201	8.356482	8.732804	8.476493	7.026553	8.267212	16.75777	12.49496	13.38772	NA	9.012072	10.120238	9.798890	8.149770	7.964140	9.954832	14.130668	7.608925	8.620541	10.229206	15.13045	13.88102	14.70670	8.866513	10.377729	8.386314	NA	7.145874	11.59517	14.38206	6.004586	7.766206	9.827232	9.232358	6.448756	9.504539	8.520402	6.807164	7.455730	7.307825	9.827232	10.036651	9.658384
200pmol	3	11.05704	8.142561	12.256403	8.356482	8.905805	8.782531	6.412260	8.669939	16.75777	14.00189	13.70002	10.74088	9.255807	10.413259	10.164570	7.792691	7.506546	10.010298	NA	7.375266	8.476493	10.277619	15.17286	14.22236	14.77650	8.577167	9.927121	8.253204	NA	6.993948	13.10393	14.45755	6.004586	7.098768	10.077232	11.516245	6.751494	9.798890	9.040857	7.974610	7.222195	7.644405	10.556023	9.641587	9.504539
200pmol	4	10.72920	8.520402	11.595175	8.732804	7.964140	9.012072	6.448756	8.149770	16.75777	13.38772	13.88102	NA	9.504539	9.954832	10.229206	8.267212	6.807164	10.036651	NA	7.455730	8.253204	10.531713	15.13045	14.13067	14.70670	9.232358	10.377729	8.386314	NA	7.026553	12.49496	14.38206	6.004586	7.608925	10.120238	11.078642	7.145874	9.658384	8.866513	7.766206	7.307825	8.620541	9.827232	NA	9.504539
50pmol	1	10.72920	9.232358	7.766206	8.267212	8.732804	NA	9.658384	7.964140	16.75777	12.49496	13.88102	NA	9.504539	10.120238	10.377729	8.520402	7.608925	9.827232	14.130668	7.455730	8.620541	10.531713	14.70670	13.38772	14.38206	11.078642	10.229206	8.386314	10.036651	7.026553	11.59517	15.13045	9.954832	7.307825	8.298269	9.012072	6.807164	8.866513	6.448756	8.149770	6.004586	7.145874	NA	NA	NA
50pmol	2	10.96831	6.004586	10.662903	8.659793	8.785723	NA	8.190682	8.555305	16.75777	12.30540	13.84672	NA	9.753718	9.581714	10.189464	8.429646	7.035806	10.008606	14.686886	7.159242	9.099265	10.482590	14.36063	13.25790	14.10465	10.086066	10.189464	8.926299	7.806291	7.637117	11.47571	15.11842	8.017625	7.480137	8.298269	10.329078	6.459113	9.911682	9.362666	7.332126	6.004586	6.822962	NA	NA	NA
50pmol	3	11.59517	6.004586	10.729201	6.448756	8.732804	9.232358	8.149770	8.386314	16.75777	13.38772	14.13067	11.07864	9.658384	9.362666	10.377729	8.620541	7.026553	9.954832	9.827232	7.035806	8.429646	10.531713	14.70670	13.88102	14.38206	10.036651	10.229206	9.012072	7.608925	7.455730	12.49496	15.13045	7.964140	7.766206	8.520402	10.120238	7.145874	9.504539	8.267212	8.866513	7.307825	6.807164	NA	NA	NA
50pmol	4	10.96831	10.008606	10.662903	8.298269	8.190682	8.785723	7.806291	8.659793	16.75777	12.30540	13.84672	NA	9.504539	9.362666	10.482590	8.555305	6.004586	9.753718	14.360635	7.035806	8.429646	10.329078	14.68689	13.25790	14.10465	9.911682	10.189464	8.386314	7.332126	7.026553	11.47571	15.11842	7.637117	8.017625	9.581714	10.086066	6.459113	9.099265	8.926299	7.480137	7.159242	6.822962	NA	NA	NA

If reportImputing = TRUE, the function instead returns a list that additionally contains a shadow data frame of imputation labels, in which 1 marks an entry that has been imputed and 0 marks one that has not.
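To make the idea of a shadow data frame concrete, the sketch below derives one by comparing toy values before and after imputation. It only illustrates the 1/0 labeling described above; the data are made up, and the structure and element names of the list actually returned by the package are not shown here.

```r
## Toy values before and after imputation (made up for illustration)
before <- data.frame(P1 = c(8.1, NA, 8.4), P2 = c(NA, NA, 7.2))
after  <- data.frame(P1 = c(8.1, 8.1, 8.4), P2 = c(NA, NA, 7.2))

## 1 where an NA was filled in, 0 everywhere else
## (entries that are still NA after imputation are labeled 0)
shadow <- as.data.frame((is.na(before) & !is.na(after)) * 1)
shadow
```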

After the above imputation, any entries that did not pass the percent present threshold will still have NA values and will need to be filtered out.

dataImput <- filterNA(dataImput, saveRm = TRUE)

where saveRm = TRUE indicates that the filtered data will be saved as a .csv file named filtered_NA_data.csv in the current working directory.
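Comparing the tables before and after filtering suggests that every protein column still containing an NA is dropped. A minimal base R sketch of that behavior, using made-up data and hypothetical variable names (this is not the filterNA() source, and it does not write any file), might look like this:

```r
## Toy data: two identifier columns and three protein columns (values made up)
toy <- data.frame(
  R.Condition = c("A", "A", "B", "B"),
  R.Replicate = c(1, 2, 1, 2),
  P1 = c(8.1, 8.3, 7.9, 8.0),
  P2 = c(NA, 9.1, 9.4, 9.0),
  P3 = c(6.5, 6.7, 6.2, 6.9)
)

## Keep only the columns with no missing values
keep <- colSums(is.na(toy)) == 0
filtered <- toy[, keep, drop = FALSE]

## Columns that were removed, kept separately for inspection
removed <- toy[, !keep, drop = FALSE]
names(filtered)
names(removed)
```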

The dataImput is as follows:

R.Condition	R.Replicate	NUD4B_HUMAN	A0A7P0T808_HUMAN	A0A8I5KU53_HUMAN	ZN840_HUMAN	CC85C_HUMAN	C9JEV0_HUMAN	C9JNU9_HUMAN	ALBU_BOVIN	CYC_BOVIN	TRFE_BOVIN	F8W0H2_HUMAN	H0Y7V7_HUMAN	H0YD14_HUMAN	H3BUF6_HUMAN	H7C1W4_HUMAN	H7C3M7_HUMAN	TLR3_HUMAN	LRIG2_HUMAN	RAB3D_HUMAN	ADH1_YEAST	LYSC_CHICK	BGAL_ECOLI	CYTA_HUMAN	KPCB_HUMAN	LIPL_HUMAN	CO6_HUMAN	BGAL_HUMAN	SYTC_HUMAN	CASPE_HUMAN	DCAF6_HUMAN	DALD3_HUMAN	HGNAT_HUMAN	RFFL_HUMAN	RN185_HUMAN	ZN462_HUMAN	ALKB7_HUMAN	POLK_HUMAN	ACAD8_HUMAN
100pmol	1	10.37045	11.406514	10.956950	8.392426	8.710518	7.829510	8.023133	16.75777	12.96499	13.97388	9.136271	10.231965	10.048461	8.179306	8.279169	9.874410	7.001503	8.832972	9.978488	15.16303	13.62766	14.44005	8.964155	9.574185	8.517979	6.764393	12.07953	14.76033	6.004586	7.670711	10.129049	10.681337	7.242036	9.727210	9.376507	7.109682	7.393910	7.530379
100pmol	2	11.40651	12.964987	9.727210	8.517979	8.832972	7.242036	8.023133	16.75777	13.62766	14.20112	8.964155	10.048461	10.231965	7.829510	7.670711	9.574185	7.393910	8.610420	10.129049	15.16303	13.97388	14.44005	9.136271	9.978488	8.279169	6.420716	12.07953	14.76033	7.109682	8.179306	9.874410	10.956950	7.001503	10.370449	8.392426	6.004586	9.376507	7.530379
100pmol	3	10.32522	11.893804	10.851852	7.868171	8.887142	6.429596	8.646475	16.75777	12.81909	13.94448	9.027184	9.940284	9.504539	8.074082	7.698334	10.467264	7.413486	8.433160	10.022816	15.15272	13.55168	14.42169	8.758911	9.812352	8.213284	6.777937	11.29130	14.74334	6.004586	8.311138	9.649565	10.097660	7.009435	10.195272	8.550204	7.121143	7.262869	7.555404
100pmol	4	10.51096	12.079525	10.956950	8.279169	8.964155	6.004586	8.023133	16.75777	12.96499	13.97388	9.136271	10.048461	9.978488	7.829510	7.670711	9.376507	7.829510	8.517979	10.231965	15.16303	13.62766	14.20112	8.710518	10.129049	8.179306	7.109682	11.40651	14.76033	6.420716	7.530379	8.832972	10.681337	6.764393	9.874410	8.392426	7.001503	7.242036	7.393910
200pmol	1	10.27762	12.256403	10.413259	8.356482	7.375266	7.098768	8.149770	16.75777	13.10393	14.00189	9.255807	10.077232	9.798890	8.142561	7.974610	10.164570	7.644405	8.253204	9.927121	15.17286	14.22236	14.77650	8.577167	10.010298	8.782531	7.222195	11.51625	14.45755	6.412260	7.506546	9.641587	10.556023	6.751494	8.669939	9.040857	7.792691	6.993948	8.905805
200pmol	2	10.53171	11.078642	10.729201	8.356482	8.732804	7.026553	8.267212	16.75777	12.49496	13.38772	9.012072	10.120238	9.798890	8.149770	7.964140	9.954832	7.608925	8.620541	10.229206	15.13045	13.88102	14.70670	8.866513	10.377729	8.386314	7.145874	11.59517	14.38206	6.004586	7.766206	9.827232	9.232358	6.448756	9.504539	8.520402	6.807164	7.455730	7.307825
200pmol	3	11.05704	8.142561	12.256403	8.356482	8.905805	6.412260	8.669939	16.75777	14.00189	13.70002	9.255807	10.413259	10.164570	7.792691	7.506546	10.010298	7.375266	8.476493	10.277619	15.17286	14.22236	14.77650	8.577167	9.927121	8.253204	6.993948	13.10393	14.45755	6.004586	7.098768	10.077232	11.516245	6.751494	9.798890	9.040857	7.974610	7.222195	7.644405
200pmol	4	10.72920	8.520402	11.595175	8.732804	7.964140	6.448756	8.149770	16.75777	13.38772	13.88102	9.504539	9.954832	10.229206	8.267212	6.807164	10.036651	7.455730	8.253204	10.531713	15.13045	14.13067	14.70670	9.232358	10.377729	8.386314	7.026553	12.49496	14.38206	6.004586	7.608925	10.120238	11.078642	7.145874	9.658384	8.866513	7.766206	7.307825	8.620541
50pmol	1	10.72920	9.232358	7.766206	8.267212	8.732804	9.658384	7.964140	16.75777	12.49496	13.88102	9.504539	10.120238	10.377729	8.520402	7.608925	9.827232	7.455730	8.620541	10.531713	14.70670	13.38772	14.38206	11.078642	10.229206	8.386314	7.026553	11.59517	15.13045	9.954832	7.307825	8.298269	9.012072	6.807164	8.866513	6.448756	8.149770	6.004586	7.145874
50pmol	2	10.96831	6.004586	10.662903	8.659793	8.785723	8.190682	8.555305	16.75777	12.30540	13.84672	9.753718	9.581714	10.189464	8.429646	7.035806	10.008606	7.159242	9.099265	10.482590	14.36063	13.25790	14.10465	10.086066	10.189464	8.926299	7.637117	11.47571	15.11842	8.017625	7.480137	8.298269	10.329078	6.459113	9.911682	9.362666	7.332126	6.004586	6.822962
50pmol	3	11.59517	6.004586	10.729201	6.448756	8.732804	8.149770	8.386314	16.75777	13.38772	14.13067	9.658384	9.362666	10.377729	8.620541	7.026553	9.954832	7.035806	8.429646	10.531713	14.70670	13.88102	14.38206	10.036651	10.229206	9.012072	7.455730	12.49496	15.13045	7.964140	7.766206	8.520402	10.120238	7.145874	9.504539	8.267212	8.866513	7.307825	6.807164
50pmol	4	10.96831	10.008606	10.662903	8.298269	8.190682	7.806291	8.659793	16.75777	12.30540	13.84672	9.504539	9.362666	10.482590	8.555305	6.004586	9.753718	7.035806	8.429646	10.329078	14.68689	13.25790	14.10465	9.911682	10.189464	8.386314	7.026553	11.47571	15.11842	7.637117	8.017625	9.581714	10.086066	6.459113	9.099265	8.926299	7.480137	7.159242	6.822962

Details

The two primary MS/MS acquisition types used in large-scale MS-based proteomics each have distinct advantages and disadvantages. Traditional Data-Dependent Acquisition (DDA) methods favor specificity in MS/MS sampling over comprehensive proteome coverage. Small peptide isolation windows (<3 m/z) yield MS/MS spectra that ideally contain fragmentation data from only one peptide. This specificity promotes clear peptide identifications but comes at the expense of added scan time. In DDA experiments, the number of peptides that can be selected for MS/MS is limited by instrument scan speed, so selection is prioritized by peptide abundance, highest first. Low-abundance peptides are sampled less frequently for MS/MS, which can result in variable peptide coverage and many missing protein values across large sample datasets.

Data-Independent Acquisition (DIA) methods promote comprehensive peptide coverage over specificity by sampling many peptides for MS/MS simultaneously. Large, sequential mass isolation windows (4-50 m/z) isolate many peptides at once for concurrent MS/MS. This produces complicated fragmentation spectra, but these spectra contain data on every observable peptide. A major disadvantage of this type of acquisition is that DIA MS/MS spectra are highly complex and difficult to deconvolve. Powerful and relatively new software programs such as Spectronaut can determine which fragment ions came from each co-fragmented peptide using custom libraries, machine learning algorithms, and precisely determined retention times or measured ion mobility data. Because all observable ions are sampled for MS/MS, DIA substantially reduces missingness compared to DDA, though it does not eliminate it.

Various imputation methods have been developed to address the missing-value problem by assigning a reasonable estimate of the quantitative value to proteins with missing values. This package currently provides 10 imputation methods:

impute.min_local(): Replaces missing values with the lowest measured value for that protein in that condition.
impute.min_global(): Replaces missing values with the lowest measured value from any protein found within the entire dataset.
impute.knn(): Replaces missing values using the k-nearest neighbors algorithm (Troyanskaya et al. 2001).
impute.knn_seq(): Replaces missing values using the sequential k-nearest neighbors algorithm (Kim, Kim, and Yi 2004).
impute.knn_trunc(): Replaces missing values using the truncated k-nearest neighbors algorithm (Shah et al. 2017).
impute.nuc_norm(): Replaces missing values using nuclear-norm regularization (Hastie et al. 2015).
impute.mice_cart(): Replaces missing values using classification and regression trees (Breiman et al. 1984; Doove, van Buuren, and Dusseldorp 2014; van Buuren 2018).
impute.mice_norm(): Replaces missing values using Bayesian linear regression (Rubin 1987; Schafer 1997; van Buuren and Groothuis-Oudshoorn 2011).
impute.pca_bayes(): Replaces missing values using Bayesian principal components analysis (Oba et al. 2003).
impute.pca_prob(): Replaces missing values using probabilistic principal components analysis (Stacklies et al. 2007).

Additional methods will be added later.
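As a rough illustration of the first two entries in the list above, the following base R sketch imputes the local minimum per protein and condition, subject to a presence threshold, and computes the global minimum for comparison. The function imputeMinLocalSketch(), its argument names other than reqPercentPresent, and the toy data are all hypothetical; this is not the msDiaLogue implementation, and it assumes a group qualifies when its proportion of observed values is at least reqPercentPresent.

```r
## Hypothetical re-implementation for illustration only
imputeMinLocalSketch <- function(df, condition = "R.Condition",
                                 idCols = c("R.Condition", "R.Replicate"),
                                 reqPercentPresent = 0.51) {
  proteinCols <- setdiff(names(df), idCols)
  for (p in proteinCols) {
    for (cond in unique(df[[condition]])) {
      rows <- which(df[[condition]] == cond)
      vals <- df[rows, p]
      present <- mean(!is.na(vals))  # proportion of observed values
      ## Impute only when enough values are observed and some are missing
      if (present >= reqPercentPresent && present < 1) {
        vals[is.na(vals)] <- min(vals, na.rm = TRUE)
        df[rows, p] <- vals
      }
    }
  }
  df
}

## Toy data with 4 replicates per condition (values made up)
toy <- data.frame(
  R.Condition = rep(c("A", "B"), each = 4),
  R.Replicate = rep(1:4, times = 2),
  P1 = c(8.2, NA, 8.0, 8.5, NA, NA, 7.9, 8.1),
  P2 = c(9.0, 9.3, 9.1, 9.2, NA, 8.8, 8.6, 8.7)
)
imputed <- imputeMinLocalSketch(toy)

## Global minimum: lowest measured value across all proteins
globalMin <- min(as.matrix(toy[, c("P1", "P2")]), na.rm = TRUE)
```

In this toy example, P1 in condition A (3 of 4 observed) receives its local minimum of 8.0, whereas P1 in condition B (2 of 4 observed) stays NA because it falls below the threshold. A global-minimum approach would instead fill every missing entry with 7.9.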

References


Breiman, L., J. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. New York, NY, USA: Routledge.

Doove, Lisa L., Stef van Buuren, and Elise Dusseldorp. 2014. “Recursive Partitioning for Missing Data Imputation in the Presence of Interaction Effects.” Computational Statistics & Data Analysis 72: 92–104. https://doi.org/10.1016/j.csda.2013.10.025.

Hastie, Trevor, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. 2015. “Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” Journal of Machine Learning Research 16 (104): 3367–3402. http://jmlr.org/papers/v16/hastie15a.html.

Kim, Ki-Yeol, Byoung-Jin Kim, and Gwan-Su Yi. 2004. “Reuse of Imputed Data in Microarray Analysis Increases Imputation Efficiency.” BMC Bioinformatics 5: 160. https://doi.org/10.1186/1471-2105-5-160.

Oba, Shigeyuki, Masa-aki Sato, Ichiro Takemasa, Morito Monden, Ken-ichi Matsubara, and Shin Ishii. 2003. “A Bayesian Missing Value Estimation Method for Gene Expression Profile Data.” Bioinformatics 19 (16): 2088–96. https://doi.org/10.1093/bioinformatics/btg287.

Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York, NY, USA: John Wiley & Sons.

Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. New York, NY, USA: Chapman & Hall/CRC.

Shah, Jasmit S., Shesh N. Rai, Andrew P. DeFilippis, Bradford G. Hill, Aruni Bhatnagar, and Guy N. Brock. 2017. “Distribution Based Nearest Neighbor Imputation for Truncated High Dimensional Data with Applications to Pre-Clinical and Clinical Metabolomics Studies.” BMC Bioinformatics 18: 114. https://doi.org/10.1186/s12859-017-1547-6.

Stacklies, Wolfram, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. “pcaMethods–a Bioconductor Package Providing PCA Methods for Incomplete Data.” Bioinformatics 23 (9): 1164–67. https://doi.org/10.1093/bioinformatics/btm069.

Troyanskaya, Olga, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics 17 (6): 520–25. https://doi.org/10.1093/bioinformatics/17.6.520.

van Buuren, Stef. 2018. Flexible Imputation of Missing Data. New York, NY, USA: Chapman & Hall/CRC.

van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.

2025-06-11
