The authors have declared that no competing interests exist.
Conceived and designed the experiments: KB FSD. Performed the experiments: SS RK. Analyzed the data: KB. Contributed reagents/materials/analysis tools: SS RK. Wrote the paper: KB TL FSD.
The relationship of HIV tropism with disease progression and the recent development of CCR5-blocking drugs underscore the importance of monitoring virus coreceptor usage. As an alternative to costly phenotypic assays, computational methods aim at predicting virus tropism based on the sequence and structure of the V3 loop of the virus gp120 protein. Here we present a numerical descriptor of the V3 loop encoding its physicochemical and structural properties. The descriptor allows for structure-based prediction of HIV tropism and identification of properties of the V3 loop that are crucial for coreceptor usage. Use of the proposed descriptor for prediction results in a statistically significant improvement over the prediction based solely on V3 sequence with 3 percentage points improvement in AUC and 7 percentage points in sensitivity at the specificity of the 11/25 rule (95%). We additionally assessed the predictive power of the new method on clinically derived ‘bulk’ sequence data and obtained a statistically significant improvement in AUC of 3 percentage points over sequence-based prediction. Furthermore, we demonstrated the capacity of our method to predict therapy outcome by applying it to 53 samples from patients undergoing Maraviroc therapy. The analysis of structural features of the loop informative of tropism indicates the importance of two loop regions and their physicochemical properties. The regions are located on opposite strands of the loop stem and the respective features are predominantly charge-, hydrophobicity- and structure-related. These regions are in close proximity in the bound conformation of the loop potentially forming a site determinant for the coreceptor binding. The method is available via server under
Human Immunodeficiency Virus (HIV) requires one of the chemokine coreceptors CCR5 or CXCR4 for entry into the host cell. The capacity of the virus to use one or both of these coreceptors is termed tropism. Monitoring HIV tropism is of high importance due to the relationship of the emergence of CXCR4-tropic virus with the progression of immunodeficiency and for patient treatment with the recently developed CCR5 antagonists. Computational methods for predicting HIV tropism are based on sequence and on structure of the third variable region (V3 loop) of the viral gp120 protein — the major determinant of the HIV tropism. Limitations of the existing methods include the limited insights they provide into the biochemical determinants of coreceptor usage, high computational load of the structure-based methods and low prediction accuracy on clinically derived patient samples. Here we propose a numerical descriptor of the V3 loop encoding the physicochemical and structural properties of the loop. The new descriptor allows for server-based prediction of viral tropism with accuracy comparable to that of established sequence-based methods both on clonal and clinically derived patient data as well as for the interpretation of the properties of the loop relevant for tropism. The server is available under
The entry of the human immunodeficiency virus (HIV) into human cells is initiated by binding of the viral envelope glycoprotein gp120 to the cellular CD4 receptor
It has been shown that in the early, asymptomatic stages of infection mainly R5 viruses are observed, whereas progression towards AIDS is often associated with the emergence of X4 viruses
Computational methods for predicting viral tropism based on the sequence of the V3 loop have been developed
Structures of gp120 including the V3 loop have been determined by x-ray crystallography
The work presented here was motivated by the goal of developing a method for genotypic prediction of viral tropism that is at least as accurate as existing structure-based methods, i.e., more accurate than the widely used sequence-based method
To meet this goal we present a systematic approach to incorporating physicochemical and structural properties of the V3 loop into the prediction of HIV coreceptor usage. We map 54 amino acid indices representing the physicochemical properties of amino acids onto the V3 loop structure and use methods from statistical learning to extract those features that are most informative of coreceptor usage. The extracted set of features represents a small fraction of the initial feature set and models based on this set attain higher prediction accuracy with decreased computational load. Our structural descriptor affords direct interpretation of the features of the V3 loop relevant for viral tropism by pointing to specific physicochemical properties of amino acids in different parts of the loop being predictive of coreceptor usage. We also applied our method to clinically derived (bulk) data and tested its usability for prediction of the MVC therapy outcome.
The structural descriptor of the V3 loop is based on the published structure of the V3 loop
Atoms of the 2B4C V3 loop structure are represented by dots, representative atoms by larger dots. The black line connects representative atoms of the adjacent loop residues. Atoms of each residue are colored according to the “Hydrophobicity factor” amino acid index. Spheres are centered at the representative atoms of the loop residues. The spheres for residues 299, 317 and 323 are shown. The physicochemical features of residues within each sphere were averaged and used as a part of the structural descriptor.
We investigated the average number of residues covered by each sphere and selected the radius of 8 Å based on predefined criteria (see
For comparison we implemented two sequence-based descriptors of the V3 loop. The
In order to reduce the highly redundant feature vector of the structural descriptor of the full model and to investigate which features are informative for coreceptor usage we applied several feature selection procedures: Random Forest (RF)
model | features | AUC | sensitivity |
g2p | 1000 | 0.860 | 0.616 |
aaindex | 2800 | 0.829 | 0.565 |
full | 5544 | 0.847 | 0.587 |
RF(1) | 144 | 0.860 | 0.673 |
RF(5) | 241 | 0.863 | 0.634 |
|
|
|
|
SVM(5) | 362 | 0.879 | 0.674 |
|
|
|
|
RF(1)_SVM(1) | 264 | 0.878 | 0.683 |
RF(5)_SVM(5) | 588 | 0.875 | 0.668 |
RF(1)_Lasso | 226 | 0.883 | 0.652 |
RF(5)_Lasso | 315 | 0.874 | 0.639 |
|
|
|
|
SVM(5)_Lasso | 448 | 0.881 | 0.674 |
RF(1)_SVM(1)_Lasso | 340 | 0.868 | 0.644 |
RF(5)_SVM(5)_Lasso | 653 | 0.883 | 0.685 |
Models are named after the feature selection method with the number in parentheses indicating the percentage cutoff of the ranked features. Sensitivity shown is at specificity of 11/25 rule in the clonal dataset. The clonal model is indicated with a #.
The sets of features selected by the three feature selection methods show a limited overlap. The initial feature set contains small subsets of highly correlated features that pertain to highly correlated amino acid properties in overlapping structure regions (spheres). These features convey the same information to the prediction method and can be therefore selected interchangeably by each method (see
We inspected the predictive performance of models based on combined sets of features selected using different methods (
ROCR of different models are plotted. The legend lists the number of features, AUC and sensitivity at the specificity of the 11/25 rule (6.28%. indicated by the vertical dashed line) in brackets. The clonal model is represented by a black solid line. Vertical segments show the standard deviations of the 10×10-fold cross validation curves of the clonal and g2p models. Comparison of clonal and g2p models via precision-recall curves is shown in the
In our approach we refrain from modelling of the side chains of the V3 loop. There is certain level of imprecision related to modelling of side chains due to the high flexibility and variability of the loop. We use our approach based on spheres as an approximation of the real structure of the loop that is costly to derive computationally and is unreliable. Such approximation of the structure is robust against indels as we observe no relationship of the model performance to the presence of indels in a sequence (
We additionally tested the performance of a model based on a different V3 loop structure (Protein Data Bank (PDB) code 2QAD)
We compared the performance of our method with the performance of previously published structure-based methods for tropism prediction
dataset | model | features | AUC | sensitivity | original method/g2p | |
AUC | sensitivity | |||||
Sander | clonal | 218 | 0.901 | 0.782 | 0.923 | 0.774 |
Dybowski | clonal | 218 | 0.948 | 0.838 | 0.937 | 0.810 |
HOMER | clinical | 59 | 0.774 | 0.463 | 0.743 | 0.451 |
HOMER-clinical | clinical (CD4, VL) | 61 | 0.803 | 0.474 | 0.781 | 0.442 |
The performance of the clonal model on the Sander and Dybowski datasets is compared to the performance of the original methods
We additionally tested the method on clinically derived patient data from the HOMER cohort
Black curves represent the clinical model, red – the g2p method. Dashed curves represent models enhanced with clinical patient information. The vertical dashed line indicates the specificity of the 11/25 on the clinical dataset. Vertical segments show the standard deviations of the 10×10-fold cross validation curves of the clinical and g2p models. The legend presents the results as in
We additionally tested the effect of amino-acid ambiguities on the prediction accuracy of the clinical model and found that the combined information from both types of sequence positions, ambiguous and non-ambiguous is important for tropism prediction (see
As shown by Sing et al.
Finally, we tested the prediction performance of the clinical model on a dataset of sequences collected at therapy start from a German cohort of patients undergoing MVC therapy (
The remaining 48 patients experienced therapy success. 41 of the cases were classified as R5 viruses by the clinical model, which is in accordance with the patient therapy outcome. Out of seven remaining cases that were classified as X4 viruses, two were phenotyped as X4 viruses. For comparison, we predicted tropism of the sequences in this dataset using the g2p model. This sequence-based prediction reported correctly only two therapy failure cases but also 44 therapy success cases which is more than the clinical model predicted. As a measure of the quality of predictions of the MVC dataset we used the Matthews correlation coefficient (MCC), which quantifies the correlation of the observed and predicted binary classification and is suited for datasets with an unbalanced class proportion. Therapy outcome prediction based on structural descriptor showed overall accuracy of MCC = 0.34 comparing favorably with g2p model yielding MCC = 0.29.
Phenotypic characterization was only available for a subset of 28 sequences from the MVC dataset (Trofile dataset). In this subset the phenotype appeared to be the best predictor of the therapy outcome with one correctly predicted therapy failure case out of three and 23 correctly predicted therapy successes out of 25 (MCC = 0.25). The clinical model reported the same number of correctly predicted therapy failure cases and lower number of 20 correctly predicted therapy success cases (MCC = 0.10). The clinical model scored higher than the g2p method that did not report correctly any of the therapy failure cases and predicted correctly 23 therapy successes (MCC = −0.10). Additionally, the clinical model correctly classified all X4 sequences in the Trofile dataset reaching MCC of 0.660 and favorably comparing with the g2p showing MCC of 0.352.
Overall, the phenotype as well as the structural descriptor model and the g2p model trained on clonal data showed a generally lower capacity of detecting therapy outcome compared to the models trained on clinical data. Detailed results of the MVC dataset analysis are provided in
In order to facilitate the interpretation of the large number of selected features we clustered the 56 amino acid indices into four groups (
The vertical line indicates the separation of the tree into four clusters analyzed in this study. Labels of the tree are colored according to the clusters.
By combining amino-acid indices with specific positions on the V3 loop, the proposed features can be interpreted in terms of physicochemical properties along the structure of the loop. The features selected for the clonal model are informative about the coreceptor usage. Their analysis can therefore provide insights into the physicochemical and structural factors of viral tropism.
Features of the clonal model were selected based on two different feature selection methods – Lasso and SVM. Among 218 features in this model seven were selected by both methods. Three of the features describe electrical charge at positions 319–322. Two of the features are structure-related (“Free energy in β-strand region”, “Normalized frequency of turn”) and related to positions 304 and 305. These amino acid indices based on statistical analysis of 3D structures define propensities of amino acids to form β-strands
Both feature selection methods allow for feature ranking based on the feature coefficients in the respective linear models. We inspected the top-scoring features in both rankings (
(A) Distribution of scores of the SVM method. The vertical line indicates the cutoff for the selection of features for the clonal model. The scores of the top-scoring features are listed. (B) Distribution of scores of the Lasso method. Top-scoring features in the distribution are indicated. On both panels, positions of the features mapped on the V3 loop structure are indicated in brackets, labels are colored according to the clusters shown in
Among the top-scoring features selected by both methods we found “Positive charge” at the stem position 322 corresponding to the position 25 in the consensus sequence. Highly ranked features in the SVM scoring include also “Positive charge” at the position 321. Additionally SVM scoring pointed to secondary structure propensities and mutability at the loop stem (“Normalized frequency of coil”, “Normalized frequency of β-sheet unweighted”, “Normalized relative frequency of bend” at positions 307, 319–320 and 324). These indices are based on statistical analyses of secondary structures and statistical models for predicting tertiary structures and define the contributions of different amino acids to the formation of a given structural element
Among the high-ranking features in the Lasso scoring we found predominantly charge indices at the loop stem (“Positive charge”, “Isoelectric point” and “Net charge” at positions 307, 316, 322–323) and at the loop base (“Net charge” at position 300). Additionally we found “Hydrophobicity factor” at the loop base positions 302–303. Two structure-related features based on “Normalized frequency of turn” and “Normalized frequency of β-turn” amino acid indices at the base position 305 were also scored high by the Lasso method. Details of the feature ranking are provided as
Next, we investigated which clusters of indices were significantly overrepresented among selected features. The only cluster significantly overrepresented among the selected features was cluster 2 (p<0.05). Three out of five features of this cluster are also overrepresented individually in the full set of selected features – “Positive charge” (22 features), “Isoelectric point” (14 features) and “Normalized frequency of extended structure” (9 features) that describes the propensity of amino acids to form specific secondary structures
Next, we inspected which of the amino-acid indices most often appear among the selected features of the clonal model and analyzed the distribution of selected features along the V3 loop in a sliding-window approach (
(A) Structure locations of features of the clonal model were mapped on the positions on the reference sequence. Numbers of selected features mapped to adjacent sequence positions were summed and averaged over a sequence window of size three. The resulting distribution of all features is represented by the black line, distributions of features of the four clusters are represented by lines in the relevant colors as defined in
To gain more insight into the two regions observed in the sliding-window test, we mapped the positions of the features of the clonal model on the crystal structure of the V3 loop (
(A) 2B4C V3 structure used in this study. Cα atoms are marked with small black spheres along the loop backbone. Representative atoms are represented by gray spheres with the size proportional to the number of features of the clonal model mapped on the respective V3 position. Positions assigned to core sites are numbered. (B) V3 structure in a bound conformation (2QAD,
Features of the clonal model involving the six amino acid-indices that are significantly overrepresented among the selected features are all found in CS1 or CS2 as well as around residue 324 (see
Physicochemical and structural properties of proteins determine their binding affinities. Prediction methods of HIV-1 coreceptor usage based solely on the V3 sequence do not account for this type of properties nor do they provide the information on loop characteristics that are crucial for the interaction. Prediction models incorporating loop structure can provide such information. However, previously reported structure-based prediction models suffer from limitations in terms of (i) runtime and software complexity – which prevents their accessibility via a tool publicly available online – and (ii) interpretation of the prediction result. Here we present a prediction model of HIV coreceptor usage based on V3 sequence and structure
Our clonal model was developed on a sequence set comprising different HIV-1 subtypes. The limited number of sequences of each subtype and the high variability of the V3 loop sequence which obviates a clear subtype classification advocate use of a common model for all subtypes, an approach also applied by other prediction methods
Our method shows a moderately but significantly higher prediction performance of approximately 3 percentage points over the model based on sequence only
Unlike previously developed structure-based methods
We also assessed the capacity of our method to predict MVC therapy outcome. For the purpose of this validation, we used a cohort of patients treated with MVC. This analysis is limited due to the low number of cases in the MVC dataset. With increasing use of entry inhibitors, therapy outcome data are expected to become more abundant and the capacity to train models predicting therapy outcome will improve. The higher performance of the clinical over the clonal model in predicting therapy outcome suggests that comprehensive datasets appropriate for specific prediction goals can produce more reliable models.
The analysis of features informative of viral tropism points to two critical sites in the loop stem, comprising residues 304, 307 and 319–322, respectively and to position 324 located more closely to the base of the stem. The charge of amino acids at these sites is known to play role in coreceptor binding
The results of other studies of structural features related to HIV tropism are in general accordance with our results. A recently published method
Given the considerable structural flexibility and sequence variability of the V3 loop, individual features of this region distinguishing between the two virus phenotypes are hard to define. We performed a comprehensive analysis of a large number of physicochemical residue characteristics in various locations on the loop and pointed to those that are the most informative of tropism. The resulting method offers higher performance than the standard sequence-based approach with a comparable efficiency and a direct interpretation of structural and physicochemical determinants of tropism. The method has been implemented as a server application within the geno2pheno framework under
To construct the clonal dataset we screened the Los Alamos database
We used the amino-acid indices collected in the AAindex database
The descriptor of the V3 loop was based on the published structure of the V3 loop with PDB
In addition to the set of spheres corresponding to alignment positions additional spheres were positioned at the midpoints of lines connecting centers of each pair of consecutive alignment spheres. This way we obtained a set of 99 spheres – 50 corresponding to alignment positions and 49 positioned in-between consecutive alignment positions. Example spheres are illustrated in
Each V3 sequence position was mapped to a sphere if the corresponding representative atom was located within the given sphere. The details of the selection of the sphere radius and Gaussian smoothing parameter within the spheres are described and illustrated in
The model based on the structural descriptor classifying viruses as R5 or X4 was constructed using a linear SVM
We used two classification methods performing feature ranking: Random Forests (RF)
The HOMER dataset was filtered to contain one randomly chosen sequence per patient, which resulted in a set of 954 sequences out of which 167 comprised X4 viruses. Each sequence in the clinical dataset represents a population of variants genotyped and phenotyped in bulk, an approach used in the routine clinical practice. These sequences contain ambiguous positions with alternative amino acids representing different variants in the population. The ambiguous positions were represented by a balanced average of vectors of indices of all alternative amino acids at a given position. Due to these differences between the clinically and clonally derived data, we repeated the feature selection on this dataset and constructed the
The MVC dataset comprises 53 patient cases under MVC therapy whose therapy outcome can be assessed based on the viral load (VL). We define as therapy success an observed 2log decrease in VL with respect to the level immediately before the therapy start or a VL drop below 50 copies/ml measured three months after the therapy start
dataset | all sequences | X4 sequences | R5 sequences |
clonal | 1188 | 215 | 973 |
Sander | 1357 | 205 | 1152 |
Dybowski | 515 | 151 | 364 |
MVC |
53 | 5 | 48 |
Trofile | 28 | 3 | 25 |
For the MVC dataset in the columns “X4 sequences” and “R5 sequences” the numbers of therapy failures and successes are shown respectively.
Clustering of the 56 amino acid indices was performed in order to facilitate the interpretation of the large number of selected features. As a similarity score among the indices we used the absolute value of their correlation. This way, indices that express the same affinities among amino acids are considered similar. We performed hierarchical clustering of the 56 amino acid indices and computed silhouette values
Choice of sphere radius and Gaussian smoothing parameters. Black histograms represent the distribution of the number of residues included in proximities of a radius indicated on the corresponding plot on the left. Red histograms illustrate the sum of Gaussian normalizing factor per each residue. Mean with variance in brackets of each distribution are indicated in legends.
(TIFF)
Click here for additional data file.
Performance of models based of structural proximities of different radii. ROCR of models based on different radii are plotted. The selected radius of 8 Å is traced with a black solid line. AUC and sensitivity at the specificity of 11/25 rule in brackets are indicated in the legend.
(TIFF)
Click here for additional data file.
Distance on the 3D structure of features selected by different feature selection methods. Plot in D illustrates the overall distance of spheres of the features of the initial feature set. Plots in A–C illustrate the distance of the highly correlated features as defined above. The highly correlated features can be found among features selected by different methods and they pertain to locations in close proximity on the structure, which is the potential reason for the low overlap of features selected by different methods.
(TIFF)
Click here for additional data file.
Correlation of features in the initial feature set. Histograms show distribution of the Pearson correlation of all features of the structural descriptor (left panel), of the features of the clonal model (middle panel) and of the features of the clonal model with the remaining features of the structural descriptor (right panel). Median, percentage of feature pair with correlation >0.5 and >0.75 are indicated in the legend.
(TIFF)
Click here for additional data file.
Comparison of the clonal and g2p models in the precision-recall space. The curves show the relationship between true positive rate (recall) and positive predictive value (precision). Area under the curve shows a higher predictive performance of the clonal model compared to the g2p model.
(TIFF)
Click here for additional data file.
Validation of the clinical model on an external dataset. In order to support the assessment of the performance of the clinical model we used an independent dataset of 760 clinically-derived sequences phenotyped using Enhanced (140 sequences) and standard Trofile (620 sequences). The clinical model shows a visibly better performance compared to the clonal model on the sequences phenotyped using the enhanced Trofile assay (left panel, solid black and red curve, respectively) and outperforms g2p model train on clinical or clonal data (left panel, dashed black and red curve, respectively). These differences in performance between the clinical and clonal models are not observed on the subset of sequences phenotyped with the standard Trofile assay (right panel). Nevertheless, in this subset, the structure-based models outperform the corresponding sequence-based models by <2 percentage points.
(TIFF)
Click here for additional data file.
Effect of indels on the prediction accuracy of the clonal model. The curves illustrate the prediction performance of the clonal model based on a dataset containing only sequences with indels (black curve) and only sequences with indels (red curve). Similar performance of the clonal model based on both datasets suggests there is a limited effect of the presence of indels on the model accuracy.
(TIF)
Click here for additional data file.
Distribution of V3 positions (top panel) and amino acid indices (bottom panel) among the features selected for the clinical model constructed analogous to
(TIFF)
Click here for additional data file.
Side-chains of the V3 loop in the unbound (A, structure 2B4C) and bound (B, structure 2QAD) conformation. In the bound conformation the residues of CS1 (304 and 307) and CS2 (319–321) are closely located and form bonds between two sides of the loop stem.
(TIF)
Click here for additional data file.
Clusters of selected features mapped on the 2B4C structure.
(TIF)
Click here for additional data file.
Significantly overrepresented features mapped on the 2B4C structure.
(TIFF)
Click here for additional data file.
V3 residue numbering. The numbering of V3 residues used in this manuscript is shown on the 2B4C structure. Top numbers indicate residue position within V3 loop, bottom numbers are assigned according to HXBc2, a numbering used also in the 2B4C annotation
(TIFF)
Click here for additional data file.
Hierarchical clustering of amino acid indices. Black dots indicate numbers of clusters, red dots the silhouette values for the consecutive steps of the clustering procedure. Vertical lines indicate the best clustering obtained for 12 clusters and second best with a lower number of clusters (4).
(TIFF)
Click here for additional data file.
Separation of amino acid indices into 12 clusters – the separation that showed the largest silhouette value.
(TIFF)
Click here for additional data file.
Performance of the clinical model and models derived from the clinical dataset by removing sequences with ambiguities (HOMER-filter), removing sequences without ambiguities (HOMER-ambi) and replacing ambiguities with gaps (HOMER-gap).
(PDF)
Click here for additional data file.
Therapy outcome prediction using structure-based model.
(XLS)
Click here for additional data file.
Therapy outcome prediction using g2p model.
(XLS)
Click here for additional data file.
Ranking and selection of features.
(XLS)
Click here for additional data file.
AA indices of the selected features.
(XLS)
Click here for additional data file.
Positions of the V3 loop of the selected features.
(XLS)
Click here for additional data file.
Selection of model parameters.
(PDF)
Click here for additional data file.
Overlap of features selected by different feature selection methods.
(PDF)
Click here for additional data file.
Feature correlation.
(PDF)
Click here for additional data file.
Combining structure and sequence descriptors.
(PDF)
Click here for additional data file.
Validation of the clinical model on external dataset.
(PDF)
Click here for additional data file.
Effect of ambiguities on prediction accuracy.
(PDF)
Click here for additional data file.
Clonal dataset.
(ZIP)
Click here for additional data file.
We would like to thank Alexander Thielen for providing support with additional dataset analysis.