Geometric de-noising of protein-protein interaction networks

by Oleksii Kuchaiev, Marija Rasajski, Desmond J. Higham, and Natasa Przulj

corresponding author: Prof. Natasa Przulj

Abstract

Motivation: Understanding complex networks of protein-protein interactions (PPIs) is one of the foremost challenges of the post-genomic era. Due to the recent advances in experimental bio-technology including yeast-2-hybrid (Y2H), tandem affinity purification (TAP) and other high-throughput methods for protein-protein interaction (PPI) detection, huge amounts of PPI net- work data are becoming available. Of major concern, however, are the levels of noise and incompleteness. For example, for Y2H screens, it is thought that the false positive rate could be as high as 64% and the false negative rate may range from 43% to 71%. TAP experiments are believed to have comparable levels of noise.

Results: We present a novel technique to assess the confidence levels of in- teractions in PPI networks obtained from experimental studies. We use it for predicting new interactions and thus for guiding future biological experiments. This technique is the first to utilize currently the best fit- ting network model for PPI networks, geometric graphs. Our approach achieves specificity of 85% and sensitivity of 90%. We use it to assign confidence scores to physical protein-protein interactions in the human PPI network downloaded from BioGRID. Using our approach, we pre- dict 251 interactions in the human PPI network, a statistically significant fraction of which correspond to protein pairs sharing common GO terms. Moreover, we validate a statistically significant portion of our predicted interactions in the HPRD database and the newer release of BioGRID.

Implementation

The concepts described in this paper were implemented in Matlab. You can download and use the code for free for non-commercial purposes.

Download code!

Readme

The archive "denoising.zip" contains the following files:

The function “embed_novi.m” accepts 2 parameters: network, represented as an edge list and dimensionality of the target space. The output is the coordinates of the nodes in that space. All node names in the network should be replaced by node numbers. You can use any node numbering in the network; the only requirement is that it should be a consecutive numbering starting from 1. More specifically, “network, represented as an edge list” must be Nx2 matrix of natural numbers, where each row represents an edge. For example, a row “1 2” represent an edge between nodes 1 and 2.

For details about the embedding algorithm see: D.J. Higham, M. Rasajski and N. Przulj in “Fitting a Geometric Graph to a Protein-Protein Interaction Network” Bioinformatics (2008) 24:8, 1093-1099.

Example usage of assign_scores.m:

The signature of this function is: [List]=assign_scores(Data,dim,priorEdge,priorNonEdge,delta,d,learnSetSize,stopEps). It can be used like: assign_scores(ChumanBG,5,0.001,0.999,0.4,3,18000,0.01). Where:

NOTE: Running the code on big networks (more than 20000 edges) will require significant computational and memory resources.