corresponding author: Prof. Natasa Przulj

**Motivation:** Understanding complex networks of protein-protein interactions (PPIs)
is one of the foremost challenges of the post-genomic era. Due to the recent
advances in experimental bio-technology including yeast-2-hybrid (Y2H),
tandem affinity purification (TAP) and other high-throughput methods
for protein-protein interaction (PPI) detection, huge amounts of PPI net-
work data are becoming available. Of major concern, however, are the
levels of noise and incompleteness. For example, for Y2H screens, it is
thought that the false positive rate could be as high as 64% and the false
negative rate may range from 43% to 71%. TAP experiments are believed
to have comparable levels of noise.

**Results:** We present a novel technique to assess the confidence levels of in-
teractions in PPI networks obtained from experimental studies. We use
it for predicting new interactions and thus for guiding future biological
experiments. This technique is the first to utilize currently the best fit-
ting network model for PPI networks, geometric graphs. Our approach
achieves specificity of 85% and sensitivity of 90%. We use it to assign
confidence scores to physical protein-protein interactions in the human
PPI network downloaded from BioGRID. Using our approach, we pre-
dict 251 interactions in the human PPI network, a statistically significant
fraction of which correspond to protein pairs sharing common GO terms.
Moreover, we validate a statistically significant portion of our predicted
interactions in the HPRD database and the newer release of BioGRID.

The concepts described in this paper were implemented in Matlab. You can download and use the code for free for non-commercial purposes.

The archive "denoising.zip" contains the following files:

- embed_novi.m - implementation of the embedding algorithm.
- assign_scores.m - function that assigns confidence scores to the pairs of nodes in the network.
- EM.m - implementation of the Expectation-Maximization algorithm.

The function “embed_novi.m” accepts 2 parameters: network, represented as an edge list and dimensionality of the target space. The output is the coordinates of the nodes in that space. All node names in the network should be replaced by node numbers. You can use any node numbering in the network; the only requirement is that it should be a consecutive numbering starting from 1. More specifically, “network, represented as an edge list” must be Nx2 matrix of natural numbers, where each row represents an edge. For example, a row “1 2” represent an edge between nodes 1 and 2.

For details about the embedding algorithm see: D.J. Higham, M. Rasajski and N. Przulj in “Fitting a Geometric Graph to a Protein-Protein Interaction Network” Bioinformatics (2008) 24:8, 1093-1099.

**Example usage of assign_scores.m:**

The signature of this function is: [List]=assign_scores(Data,dim,priorEdge,priorNonEdge,delta,d,learnSetSize,stopEps). It can be used like: assign_scores(ChumanBG,5,0.001,0.999,0.4,3,18000,0.01). Where:

- Data=ChumanBG Nx2 matrix. Represents a network as an edge list (see description of embed_novi.m above for details)
- dim=5. Dimensionality of the target space
- priorEdge=0.001. The prior belief about what fraction of all possible pairs of proteins in the network really interact.
- priorNonEdge=0.999
- delta=0.4; The distancs cutoff. That is pairs of nodes with this distance are considered to be non-interactions.
- d=3. Number of Gaussian mixtures to use while learning p(dist|edge) and p(dist|non edge). Recomended value=3.
- learnSetSize=18000. Should be less then or equal to the number of edges in the Data. To big values will slow the learning step significantly.
- stopEps=0.01. When to stop the EM algorithm. Recomended value = 0.01

**NOTE:** Running the code on big networks (more than 20000 edges) will require significant computational and memory resources.