Preprocessing

Preliminary

## load R package
library(msDiaLogue)

Example

## if the raw data is in a .csv file
fileName <- "../tests/testData/Toy_Spectronaut_Data.csv"
dataSet <- preprocessing(fileName,
                         filterNaN = TRUE, filterUnique = 2,
                         replaceBlank = TRUE, saveRm = TRUE)

Note: preprocessing() does not perform a transformation on your data. You still need to use the function transform().

## if the raw data is in an .RData file
load("../tests/testData/Toy_Spectronaut_Data.RData")
dataSet <- preprocessing(dataSet = Toy_Spectronaut_Data,
                         filterNaN = TRUE, filterUnique = 2,
                         replaceBlank = TRUE, saveRm = TRUE)
#> Warning: Removed 62 rows containing non-finite outside the scale range
#> (`stat_bin()`).

#> Summary of Full Data Signals (Raw):
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#>     20.93    263.87    669.79   6897.92   1963.53 117803.49

#> Levels of Condition: 100pmol 200pmol 50pmol 
#> Levels of Replicate: 1 2 3 4

R.Condition	R.Replicate	NUD4B_HUMAN	A0A7P0T808_HUMAN	A0A8I5KU53_HUMAN	ZN840_HUMAN	CC85C_HUMAN	TMC5B_HUMAN	C9JEV0_HUMAN	C9JNU9_HUMAN	ALBU_BOVIN	CYC_BOVIN	TRFE_BOVIN	KRT16_MOUSE	F8W0H2_HUMAN	H0Y7V7_HUMAN	H0YD14_HUMAN	H3BUF6_HUMAN	H7C1W4_HUMAN	H7C3M7_HUMAN	TCPR2_HUMAN	TLR3_HUMAN	LRIG2_HUMAN	RAB3D_HUMAN	ADH1_YEAST	LYSC_CHICK	BGAL_ECOLI	CYTA_HUMAN	KPCB_HUMAN	LIPL_HUMAN	PIP_HUMAN	CO6_HUMAN	BGAL_HUMAN	SYTC_HUMAN	CASPE_HUMAN	DCAF6_HUMAN	DALD3_HUMAN	HGNAT_HUMAN	RFFL_HUMAN	RN185_HUMAN	ZN462_HUMAN	ALKB7_HUMAN	POLK_HUMAN	ACAD8_HUMAN	A0A7I2PK40_HUMAN	NBDY_HUMAN	H0Y5R1_HUMAN
100pmol	1	1547.983	3168.32568	2819.7874	318.54376	495.5136	456.3309	213.21727	237.1306	111209.7	10737.953	15097.67	1799.391	630.1937	1311.8127	1279.6390	280.6318	299.51523	1154.5566	16461.2012	179.3190	516.1104	1234.587	27599.42	13798.590	23840.03	614.0895	990.5613	440.0417	132.31737	150.6033	3578.014	26872.50	109.55331	211.6450	1292.5234	1963.5321	189.79155	1106.1482	981.11432	180.6320	199.14555	209.7806	NA	NA	NA
100pmol	2	1680.730	4576.37158	1061.9502	404.25836	556.8611	501.0473	184.89574	314.0320	111659.9	10655.384	15840.28	NA	575.0490	1114.2773	1294.9751	271.8160	248.04329	1032.0381	1460.7496	213.1137	492.3771	1186.433	27221.59	13880.411	23963.31	640.2153	1077.4829	364.5241	128.78983	128.2592	3412.794	26742.22	155.37483	348.6104	1066.3511	1509.1512	153.90802	1303.6520	388.65823	122.7458	751.19849	247.3832	1420.1351	NA	NA
100pmol	3	1414.811	4675.13281	2177.8496	275.09167	559.3206	NA	111.24314	501.2060	105982.9	10663.714	15022.21	NA	613.3968	1224.3837	946.0795	309.7599	270.67770	1808.1924	21555.3555	200.7485	342.1992	1227.435	26587.62	13723.719	22957.35	551.6828	1176.7791	319.0364	NA	118.5104	3499.113	26124.20	91.82145	319.1320	1003.3372	1342.4712	143.12419	1352.7024	430.13318	144.6799	171.13177	221.9161	1889.0665	835.6825	NA
100pmol	4	1620.490	3828.19971	2062.8384	385.05573	558.0967	422.0465	84.27336	334.6389	104442.6	10843.115	15160.49	NA	886.5406	1148.7343	1091.7800	NA	229.40149	901.5703	22937.2500	240.7981	418.1846	1190.952	26168.72	13944.603	22311.30	438.5425	1162.6656	351.5390	NA	137.8860	3481.821	25910.39	88.26187	217.7478	489.8084	1721.8601	99.95578	990.6649	393.55930	134.5238	145.17339	216.3736	1610.2407	950.3087	913.3416
200pmol	1	1512.770	4232.05078	2004.8613	338.27777	156.3478	364.5416	146.80331	NA	109245.3	19524.863	21577.97	2212.190	491.7787	1246.4460	1080.4132	270.1487	252.09808	1454.3271	21113.4512	223.8396	313.7860	1176.982	48693.35	24344.188	41234.67	364.7307	1203.0853	385.5154	65.40555	151.0895	3553.484	26261.47	81.22160	185.4865	939.8899	2149.7632	131.13179	381.0588	429.62201	239.4998	145.04378	424.7914	2337.8496	NA	837.8737
200pmol	2	1480.490	3496.84155	2177.9534	NA	550.4083	NA	135.78349	295.8571	113357.5	20072.297	22968.96	NA	669.7894	1068.2001	NA	285.4891	259.50000	1049.7526	25760.0527	190.3054	452.8294	1220.266	49866.29	24742.227	42899.43	633.5656	1234.5601	414.1271	NA	135.8605	3686.869	27638.89	69.56509	250.4035	1020.4291	725.6615	116.20615	877.0164	438.22589	133.4297	160.92671	155.0986	NA	1053.8444	1000.5491
200pmol	3	1555.834	356.43225	2280.6846	379.62103	564.2863	496.0772	103.30424	473.9141	114321.8	20787.127	20720.13	1451.198	586.7260	1378.0652	1194.8448	291.6754	184.18954	1123.7469	NA	174.5702	432.1681	1216.306	50704.73	24803.633	42904.95	446.4135	1082.7312	357.6343	NA	129.0676	3530.710	27101.22	62.08423	136.7023	1171.5715	1675.6870	109.60301	938.3956	568.89239	315.7039	146.75146	198.4779	1397.9890	837.2197	694.5791
200pmol	4	1529.628	350.70822	2223.3093	410.82349	292.9041	522.1325	95.18819	318.4948	116439.8	19924.240	22153.40	NA	539.0703	923.3237	1115.3848	322.9086	97.65465	957.0436	NA	164.7767	NA	1183.197	53744.70	26381.047	43279.84	527.1628	1121.3438	342.5055	NA	121.3068	3751.769	27545.24	70.39470	199.2453	996.0696	1696.6189	125.31519	611.6407	506.49115	204.4332	161.96100	376.5362	895.9138	NA	NA
50pmol	1	1480.210	561.38837	189.9275	264.24271	308.9420	NA	599.90497	192.3859	117803.5	6758.298	12183.81	NA	594.8999	899.5010	1163.1122	291.4431	176.21545	620.2048	14107.1250	152.5492	292.2440	1186.543	16408.28	7169.955	14728.67	2984.7190	1029.7336	288.4770	891.24725	129.7482	3547.950	25668.78	846.95880	146.3040	NA	461.3821	86.84789	373.6308	49.93938	236.2902	20.92994	142.3466	NA	NA	NA
50pmol	2	1486.144	NA	1462.2559	325.74991	351.2331	NA	254.75084	308.6775	110086.7	6721.135	12521.78	NA	582.8912	531.7106	1119.5256	287.1180	103.58258	849.2368	24912.3613	140.6493	362.3117	1260.574	16444.63	7797.536	14736.71	857.5026	NA	361.4482	179.10303	166.8891	3530.004	26351.25	207.83086	165.6463	265.2173	1184.9562	93.91448	768.2026	489.40918	146.9422	88.41573	101.6087	NA	NA	NA
50pmol	3	1468.554	42.51457	1364.9075	83.99377	296.5147	396.0038	257.78970	279.2477	105640.2	6172.877	11926.22	1373.660	569.8922	NA	1067.0791	294.0919	88.48861	738.7719	666.5015	NA	NA	1175.953	16618.11	7432.793	14160.20	916.4893	992.5451	319.6350	128.63672	120.6974	3458.023	26017.54	203.64948	132.5755	291.4759	932.9668	93.50905	547.0935	263.86734	313.0341	111.88376	85.4563	NA	NA	NA
50pmol	4	1497.531	927.07886	1435.5588	275.60831	242.4643	425.7305	197.71338	382.4084	110446.0	6028.398	12021.50	NA	NA	593.1353	1302.1250	339.3387	30.13688	873.1840	15711.3106	142.4270	291.5121	1150.711	16282.51	7543.633	14758.73	886.7808	1138.6193	NA	152.56187	NA	3575.316	25969.99	190.47060	220.1901	676.8246	996.8993	31.57284	523.4712	450.08408	164.1874	143.96025	135.2896	NA	NA	NA

Details

The function preprocessing() takes a .csv file of summarized protein abundances exported from Spectronaut. The most important columns that this file needs to include are: R.Condition, R.Replicate, PG.ProteinAccessions, PG.ProteinNames, PG.NrOfStrippedSequencesIdentified, and PG.Quantity. The function reformats the data and offers some initial filtering (based on the number of unique peptides). The steps below describe, in order, what preprocessing() does.
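To make the "reformat" step concrete, here is a minimal base R sketch of turning a long report (one row per protein per run) into the wide layout shown in the Example output (one row per condition and replicate, one column per protein). It uses base R's reshape() purely for illustration and is not taken from the msDiaLogue source; the four quantities are copied from the Example table above.

```r
# Toy long-format report using the Spectronaut column names listed above.
long <- data.frame(
  R.Condition     = c("100pmol", "100pmol", "50pmol", "50pmol"),
  R.Replicate     = c(1, 1, 1, 1),
  PG.ProteinNames = c("ALBU_BOVIN", "CYC_BOVIN", "ALBU_BOVIN", "CYC_BOVIN"),
  PG.Quantity     = c(111209.7, 10737.953, 117803.5, 6758.298)
)

# One row per (condition, replicate); one column per protein.
wide <- reshape(
  long,
  idvar     = c("R.Condition", "R.Replicate"),
  timevar   = "PG.ProteinNames",
  v.names   = "PG.Quantity",
  direction = "wide"
)

# reshape() prefixes the new columns with "PG.Quantity."; strip that prefix
# so the columns are named by protein only.
names(wide) <- sub("^PG\\.Quantity\\.", "", names(wide))

print(wide)
```

Any protein that is absent from a given run would end up as NA in the wide table, which is consistent with the NA entries visible in the Example output.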

1. Loads the raw data

If the raw data is in a .csv file (here, Toy_Spectronaut_Data.csv), pass its path as fileName to read the raw data file into R.
If the raw data is stored as an .RData file (here, Toy_Spectronaut_Data.RData), first load the file, then pass the loaded object as dataSet in the function.
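The difference between the two input routes can be sketched with base R alone. The snippet below writes a small made-up data frame to temporary .csv and .RData files and reads each back; the object name toy and the temporary paths are hypothetical and only serve the illustration.

```r
# A tiny stand-in for a raw report (values are illustrative).
toy <- data.frame(
  R.Condition = c("100pmol", "50pmol"),
  R.Replicate = c(1L, 1L),
  PG.Quantity = c(1547.983, 1480.210)
)

# Route 1: a .csv file is read from its path.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# Route 2: an .RData file restores objects under the names they were saved
# with; load() returns those names as a character vector.
rdata_path <- tempfile(fileext = ".RData")
save(toy, file = rdata_path)
env <- new.env()
loaded_names <- load(rdata_path, envir = env)
from_rdata <- env[[loaded_names]]

print(loaded_names)
```

This is why the .RData route in the Example passes an object (Toy_Spectronaut_Data) rather than a file path: load() recreates the object under its saved name, and that object is then handed to dataSet.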

2. Filters out identified proteins that exhibit “NaN” quantitative values

NaN, which stands for ‘Not a Number,’ appears in the PG.Quantity column for proteins that were identified by MS and MS/MS evidence in the raw data but for which every peptide lacks an associated integrated peak area or intensity. This usually happens with low-abundance peptides whose intensities are close to the limit of detection, giving poor signal-to-noise (S/N), and/or when co-eluting peptide ions with very similar or identical m/z values interfere, making it difficult to separate the individual intensity profiles.
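As a rough illustration of this step, the base R sketch below splits a toy long-format report into rows with a usable PG.Quantity and rows where it is NaN. This is an assumption about the general idea, not the package's actual code, and the NaN entry is made up.

```r
# Toy report; the NaN entry is invented for illustration.
report <- data.frame(
  R.Condition     = c("100pmol", "100pmol", "50pmol", "50pmol"),
  R.Replicate     = c(1, 1, 1, 1),
  PG.ProteinNames = c("ALBU_BOVIN", "CYC_BOVIN", "ALBU_BOVIN", "CYC_BOVIN"),
  PG.Quantity     = c(111209.7, NaN, 117803.5, 6758.298)
)

# is.nan() is TRUE only for NaN; an ordinary NA is not flagged by it.
is_nan  <- is.nan(report$PG.Quantity)
kept    <- report[!is_nan, ]
removed <- report[is_nan, ]

print(removed)
```

Keeping the removed rows in their own data frame mirrors the idea of being able to inspect what was filtered out afterwards.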

3. Applies a unique peptides per protein filter

General practice in the proteomics field is to filter out proteins that were identified on the basis of a single peptide. Because approximately 1% of all identified peptides are false-positive matches, a protein ID that rests on 1 peptide is far more likely to be incorrect than one that rests on, for example, 5 peptides, all of which would have to be misidentified for the protein ID to be wrong. We recommend focusing on proteins with 2 or more peptide identifications, as these will be higher confidence. If you have a protein of interest with only 1 peptide identified, contact PMF faculty and we can help you evaluate the evidence from the raw data to determine believability.
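The reasoning above can be put into numbers, followed by a sketch of the filter itself. Two assumptions are made here and are not confirmed by the text: that peptide misidentifications are independent (a simplification used only for the back-of-envelope figure), and that filterUnique = 2 acts as a "keep proteins with at least 2 peptides" threshold. The peptide counts in the toy data are invented.

```r
# Back-of-envelope: chance that ALL peptides supporting a protein are false
# matches, assuming a 1% per-peptide error rate and independence.
p_single <- 0.01
p_five   <- p_single^5   # five peptides all wrong at once
print(c(one_peptide = p_single, five_peptides = p_five))

# Sketch of the filter on a toy report (peptide counts are made up).
report <- data.frame(
  PG.ProteinNames = c("ALBU_BOVIN", "CYC_BOVIN", "TRFE_BOVIN"),
  PG.NrOfStrippedSequencesIdentified = c(12, 1, 2),
  PG.Quantity = c(111209.7, 10737.953, 15097.67)
)

min_unique <- 2  # assumed to play the role of filterUnique = 2
keep    <- report$PG.NrOfStrippedSequencesIdentified >= min_unique
kept    <- report[keep, ]
removed <- report[!keep, ]

print(removed$PG.ProteinNames)
```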

4. Adds accession numbers to identified proteins without informative names

Spectronaut reports contain 4 different columns of identifying information:

PG.Genes, which is the gene name (e.g. CDK1).
PG.ProteinAccessions, which is the UniProt identifier number for a unique entry in the online database (e.g. P06493).
PG.ProteinDescriptions, which is the protein name as provided on UniProt (e.g. cyclin-dependent kinase 1).
PG.ProteinNames, which is a concatenation of an identifier and the species (e.g. CDK1_HUMAN).

Every entry in UniProt has an accession number but may lack some of the other identifiers because of incomplete annotation. Because UniProt includes entries for fragments of proteins, and some protein entries are redundant, a peptide can match multiple entries for the same protein, which generates multiple possible identifiers in Spectronaut. Further, the PG.ProteinNames entry in Spectronaut can switch formats: the preferred form is accession number and species, but it can also be gene name and species.

Setting replaceBlank = TRUE tells msDiaLogue to substitute the accession number for an identifier whenever it tries to pull that identifier from a column with no information.
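A minimal sketch of this fallback in base R, assuming that "no information" means an empty string or a missing value. The column label and the pairing of accessions with blank names are hypothetical and chosen only to show the mechanics.

```r
# Toy identifier table; the second and third names are blanked out on purpose.
ids <- data.frame(
  PG.ProteinAccessions = c("P06493", "A0A7P0T808", "C9JEV0"),
  PG.ProteinNames      = c("CDK1_HUMAN", "", NA),
  stringsAsFactors     = FALSE
)

# Treat NA and empty/whitespace-only strings as "no information".
missing_name <- is.na(ids$PG.ProteinNames) | trimws(ids$PG.ProteinNames) == ""

# Fall back to the accession number where the name is missing.
ids$label <- ifelse(missing_name, ids$PG.ProteinAccessions, ids$PG.ProteinNames)

print(ids$label)
```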

Note: Not all proteins can be identified unambiguously. In many cases, the identified peptides can be found in multiple protein sequences, which yields a protein group or protein cluster rather than a single protein identification. When this happens, the accession numbers for all potential matches are concatenated into one string, separated by periods. When you see long strings of multiple identifiers later in your data processing, this is why. Spectronaut sorts these alphanumerically, so you should not assume that the first protein in the list is the most likely to be correct. (Other search algorithms, such as MaxQuant, which is used in PMF for most Scaffold-based results, do rank protein cluster IDs by likelihood of correctness.)
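If you need the individual members of such a protein group later on, a concatenated identifier can be split on the period separator described above. The group string below is hypothetical, assembled from accessions that appear elsewhere on this page.

```r
# Hypothetical protein-group identifier: accessions joined by periods.
group_id <- "A0A7P0T808.C9JEV0.P06493"

# fixed = TRUE treats "." as a literal period rather than a regex wildcard.
members <- strsplit(group_id, ".", fixed = TRUE)[[1]]

print(members)
print(length(members))
```

Because the order is alphanumeric rather than ranked, the first element of members carries no special weight.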

5. Saves a document to your working directory with all filtered out data, if desired

If saveRm = TRUE, the data removed in step 2 (preprocess_Filtered_Out_NaN.csv) and step 3 (preprocess_Filtered_Out_Unique.csv) will be saved in the current working directory.
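For orientation, the sketch below shows what writing such a file amounts to in base R. It writes to a temporary directory rather than the working directory so that nothing is left behind; the removed row is invented, and this is not the package's own code.

```r
# A made-up row that a unique-peptide filter might have removed.
removed_unique <- data.frame(
  PG.ProteinNames = "CYC_BOVIN",
  PG.NrOfStrippedSequencesIdentified = 1,
  PG.Quantity = 10737.953
)

# The file name follows the one given above; tempdir() stands in for the
# working directory in this sketch.
out_file <- file.path(tempdir(), "preprocess_Filtered_Out_Unique.csv")
write.csv(removed_unique, out_file, row.names = FALSE)

# Read the file back to confirm what was saved.
check <- read.csv(out_file)
print(check)
```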

As part of preprocessing(), a histogram of $\log_2$-transformed protein abundances is provided. This is a helpful way to confirm that the data have been read in correctly and that there are no issues with the numerical values of the protein abundances. Ideally, this histogram will appear fairly symmetrical (bell-shaped) without too much skew towards smaller or larger values.
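A comparable check can be sketched in base R. The warning in the Example output mentions stat_bin(), which suggests the package draws its plot with ggplot2; the snippet below uses base hist() instead, so it is an approximation of the idea rather than the package's plot. The input values are the five summary statistics printed in the Example plus one NA, used here only as stand-in data.

```r
# Stand-in abundances (taken from the printed summary) plus a missing value.
quantity <- c(20.93, 263.87, 669.79, 1963.53, 117803.49, NA)

# log2-transform, then drop non-finite values before binning.
log2_quantity <- log2(quantity)
finite_vals   <- log2_quantity[is.finite(log2_quantity)]

# plot = FALSE returns the bin counts without drawing; call plot(h) to draw.
h <- hist(finite_vals, plot = FALSE)

print(summary(quantity))
print(h$counts)
```

Dropping the NA before binning is analogous to the "Removed ... rows containing non-finite" message in the Example output, where missing values are left out of the histogram.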

2025-06-11
