For such analyses to be applicable to other biological datasets, we need to understand which properties of the algorithm determine its effectiveness, and design a more general algorithm based on these principles. Both algorithms are based on the idea of detecting correlated mutations between residues in sequence alignments. This is a sound approach: if a phenotype is controlled by a set of residues, members of the set must mutate to change the phenotype, so these residues can be detected by looking for groups of sequence positions whose mutations are correlated. Many statistical measures have been suggested to quantify the degree of correlation between sequence positions in a multiple sequence alignment, and different authors have suggested weighting these raw correlation scores in different ways. In particular, mutual information and SCA use different metrics for measuring the raw correlation score, and in addition these metrics are weighted differently.

This manuscript is organized as follows. We first identify the critical difference that keeps SCA and mutual information from being interchangeable algorithms, which turns out to be the different weights applied to the raw correlation scores. To create an algorithm that works more generally, we propose using biological information about the expected conservation level of the phenotype in question to design context-specific weighting functions. This approach performs well on both original datasets, so we turn to testing it in more general situations. We first demonstrate that the algorithm performs well on artificial sequences generated through simulations of a simple model of molecular evolution, in which the conservation level of the phenotype is systematically varied. We then demonstrate that it performs well on a biological example in which the phenotype-controlling residues have been identified through experiments.
Finally, we make testable predictions by applying our algorithm to Cadherins and Protocadherins for which the phenotype-controlling residues have not yet been probed experimentally.

Comparing the left panel of Fig. 1C with that of Fig. 1A, we see that both algorithms are able to identify the groups of phenotype-controlling residues verified in. Similarly, the right panels of Fig. 1C and Fig. 1A reveal that the hybrid ‘unweighted-SCA’ better identifies the residues shown to control specificity in the HK-RR alignment from, although unweighted SCA clearly performs worse than MI on this alignment. In Fig. S3 in file S1 we further demonstrate that changing the weighting function changes the set of residues that are identified. Thus, to a great extent, the choice of weighting function, rather than the statistical method used, determines identification of the phenotype-controlling residues. Our analysis finds that use of a weighting function specific to the phenotype and sequence set of interest is crucial to successful identification of phenotype-controlling residues. While perhaps surprising, this observation has a natural theoretical basis.
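To make the distinction between a raw correlation score and its weighting concrete, the following is a minimal sketch in Python. It computes mutual information between two alignment columns and then multiplies it by a per-column weight. The toy alignment, the helper names, and in particular `conservation_weight` are assumptions made for illustration only; this is not the weighting used by SCA, nor the context-specific weighting proposed here.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return the residues found at position i across all sequences."""
    return [seq[i] for seq in alignment]


def entropy(symbols):
    """Shannon entropy (in bits) of the observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_a, col_b):
    """Raw correlation score: MI(A, B) = H(A) + H(B) - H(A, B)."""
    joint = list(zip(col_a, col_b))
    return entropy(col_a) + entropy(col_b) - entropy(joint)


def conservation_weight(col):
    """Hypothetical weight: frequency of the most common residue.

    Illustrative assumption only; any function of the column could be
    substituted here to encode the expected conservation level.
    """
    return Counter(col).most_common(1)[0][1] / len(col)


def weighted_score(col_a, col_b, weight):
    """Raw MI scaled by the weights of the two columns involved."""
    return weight(col_a) * weight(col_b) * mutual_information(col_a, col_b)


# Toy alignment (hypothetical): positions 0 and 1 co-vary perfectly,
# position 2 is fully conserved, position 3 varies independently.
alignment = ["AKLV", "AKLI", "GRLV", "GRLI"]

raw_01 = mutual_information(column(alignment, 0), column(alignment, 1))
raw_03 = mutual_information(column(alignment, 0), column(alignment, 3))
weighted_01 = weighted_score(
    column(alignment, 0), column(alignment, 1), conservation_weight
)
```

Passing a different function as `weight` rescales every pair score without touching the underlying statistic, which is the sense in which the weighting and the raw correlation measure can be varied independently.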
The challenge is to identify these residues; the ability to detect correlated mutations in these studies depends on the details of each algorithm.