In silico profiling of non-synonymous SNPs in IDS gene for early diagnosis of Hunter syndrome

Single amino acid substitutions in the Iduronate-2-sulfatase enzyme result in destabilization of the protein and cause a genetic disorder called Hunter syndrome. To gain functional insight into the mutations causing Hunter syndrome, various bioinformatics tools were employed, and special significance is given to molecular docking. In-silico tools available online for preliminary analysis including SIFT, PolyPhen 2.0, etc., were primarily employed and have identified 51 Non-synonymous Single Nucleotide Polymorphisms (ns-SNPs) as possibly deleterious. Further, modelling and energy minimization followed by Root Mean Square Deviation (RMSD) calculation has labelled 42 mutations as probably deleterious ns-SNPs. Later, trajectory analysis was performed using online tools like PSIPRED, SRide, etc., and has predicted six ns-SNPs as potentially deleterious. Additionally, docking was performed, and three candidate ns-SNPs were identified. Finally, these three ns-SNPs were confirmed to play a significant role in causing syndrome through root mean square fluctuation (RMSF) calculations. From the observed results, G134E, V503D, and E521D were predicted to be candidate ns-SNPs in comparison with other in-silico tools and confirmed by RMSF calculations. Thus, the identified candidate ns-SNPs can be employed as a potential genetic marker in the early diagnosis of Hunter syndrome after clinical validation.


Introduction
Hunter syndrome, a noxious and uncommon lysosomal storage disease, which is X-linked and caused by the mutation within the IDS gene [1], coding a lysosomal polypeptide of 550 amino acid length (IDS, E.C. 3.1.6.13) [2], located at chromosome Xq27.3-q28, spans 24 Kb, composed of 9 exons and belongs to the sulfatase family [3]. It is also known as type II Mucopolysaccharidosis (OMIM 309900). IDS gene codes for the significant enzyme that prevents the build-up of glycosaminoglycans (GAGs) in the lysosome.
IDS breaks-down heparan sulfate (HS) and dermatan sulfate (DS) by specifically targeting the sulfate group at the O-2 position [4]. This enzyme contains 25 amino acid residues at the amino-terminal, that acts as a signal peptide followed by 8 amino acids. These 33 amino acids are cleaved after processing and thus the functional length of the enzyme is 517 amino acids.
Activity of IDS gene is crucial for cellular maintenance [5]. Hence, defect in this gene leads to the buildup of GAGs -HS and DS in the body. It is estimated that Hunter syndrome occurs in approximately one in 170,000 male live births and prevails in individuals of all  23:53 ethnicities [6,7]. More than 500 disease-causing mutations have been identified in the IDS gene, which include single nucleotide polymorphisms (SNPs) that influence the structure and function of the enzyme greatly. Thus, SNPs can be considered as a diagnostic marker for Hunter syndrome [8].
People with Hunter syndrome show various clinical manifestations such as respiratory infections [9], sleep apnea, joint stiffness, pelvic dysplasia, hepatomegaly, umbilical and inguinal hernias, otitis, gingival hypertrophy, hyperplasia [10], cardiological and ocular manifestations, and also skin infections such as Mangolian spots and lesions [11]. Due to the non-availability of the drug, patients are advised for enzyme replacement therapy (ERT) or stem cell therapy (SCT) [12,13].
The current study utilized a combination of bioinformatic tools to predict the structural and functional modifications of proteins and aims to identify candidate non-synonymous single nucleotide polymorphisms (ns-SNPs) in the IDS gene that may be used for specific and appropriate attuned therapies for Hunter syndrome in near future.

Datasets and retrieval of nsSNPs
National Centre for Biotechnology Information (NCBI) (https:// www. ncbi. nlm. nih. gov/ snp/) is an open-source database [14], from which the non-synonymous (missense) SNPs in the IDS gene of Homo sapiens and protein sequence in FASTA format were retrieved. Recovered SNPs were subjected to look into their deleterious effect on the IDS gene by employing a set of insilico tools.

Combined prediction using preliminary tools
FASTA format of the IDS gene was given as input in tools as shown in Table 1 and ns-SNPs predicted as deleterious by all 5 tools were considered for further analysis.

Modelling and RMSD calculation
The three-dimensional structure has a momentous role in analysing the trajectory of the protein. Since no structure was available for the protein IDS, a 3-D structure was modelled through SWISS-MODEL expasy (http:// swiss model. expasy. org/) [20], validated using PROCHECK [21] and ProSA Webserver (https:// prosa. servi ces. came. sbg. ac. at/ prosa. php) [22]. In due course, all mutant structures were generated using SWISS-PDB Viewer [23]. Once generated, all the structures including native were energy minimized using GROMACS, which works on l-bfgs (imited-memory Broyden-Fletcher-Goldfarb-Shanno quasi-Newtonian minimizer) [24]. Structures obtained from GROMACS were utilized for root mean square deviation (RMSD) calculation [25] using SWISS-PDB Viewer.

Trajectory analysis
IDS gene, either as a sequence or as a structure file was submitted in tools as shown in Table 2 for trajectory analysis. ns-SNPs predicted as deleterious in the maximum number of tools were selected for further analysis.

Molecular docking and visualization
Docking was performed by using Autodock Vina [32]. The results were visualized and analyzed using Lig-plot+ [33].

Effect of SNP on splicing
Splicing is the most significant post-transcriptional modification, which removes the non-coding regions and collate the exons to form the functional protein.
To evaluate the influence of candidate SNPs on the splicing sites, NetGene 2 online server (https:// servi ces. healt htech. dtu. dk/ servi ce. php? NetGe ne2-2. 42) [35] was employed. This server was provided with the DNA sequence in FASTA format as input.

Pathogenicity analysis
At last, all the candidate SNPs were employed for pathogenicity analysis to confirm and further support the prediction. This analysis was performed using DUET online server (http:// biosig. unime lb. edu. au/ duet/) [36].

Datasets and retrieval of ns-SNPs
The IDS gene contains 5549 SNPs in total (As of 22nd September 2020). Out of these 5549 SNPs, only 289 non-synonymous SNPs (missense mutation) were retrieved from the National Centre for Biotechnology Information (NCBI) database. Missense mutation contributes 5% among total SNPs in the IDS gene (289/5549).

Combined analysis using preliminary tools
The combined analysis of five online tools had predicted 51 out of 289 ns-SNPs as deleterious in common and the individual prediction by the tools are listed in Table 3. Since the preliminary tools involve different algorithms, a combined approach may increase the precision in identifying the deleterious ns-SNPs [37]. Only the predicted 51 possibly deleterious ns-SNPs were considered for the subsequent analysis.

Modelling and RMSD calculation
A three-Dimensional structure is required for understanding the structure-function relationship of proteins, since point mutation can drastically affect protein function [38]. The 3D structure of IDS protein modelled using SWISS-MODEL was based on the template 5fql.1.A (99.81% similarity) and with the QMEAN score of -1.21.
The structure was further validated using PROCHECK and ProSA webserver. Ramachandran Plot shows the presence of 86.5% of residues within the most favourable regions, which indicates the structure obtained is in good quality [39] and the overall z-score for the protein IDS was found to be -8.14 using the ProSA server.
Since the structure obtained is a homodimer, only the Chain A was used to generate the 51 mutant structures using SWISS-PDB viewer.  Ultimately, Energy minimization was done for all structures including native using GROMACS, and RMSD values were calculated by super-imposing native and mutant structures in SWISS-PDB Viewer. Based on the RMSD values of 51 SNPs, the threshold was fixed as above 0.15 Å (> 0.15 Å). Over one-third of the mutants have RMSD value greater than 0.15, the higher the RMSD value, the higher the degree of deleterious effect [40]. Based on RMSD values, 42 SNPs were found to be probably deleterious and are enlisted in Table 4.

Trajectory analysis
Only the possibly deleterious ns-SNPs (42 ns-SNPs) were analysed using trajectory tools. The analysis had predicted 6 ns-SNPs (G134E, P358T, R468W, M488I, V503D, and E521D) as potentially deleterious in common by maximum number of tools (Table 5). Multiple approaches were employed for trajectory analysis using various tools, to get more accurate results. Among these ns-SNPs five SNPs (G134E, P358T, M488I, V503D, and E521D) have variations in their secondary structures after mutation and all these 6 SNPs have modified stability residues in comparison with the native structures. The conservation score also varies significantly along with the Relative Solvent Accessibility values. Analogously, a study has been communicated, which also employs these properties for the prediction of most deleterious mutation in HBA1 gene [39].

Molecular docking and visualization
Protein-Ligand Docking Analysis demonstrated that the mutant structures bind to the ligand in a slightly different orientation compared to the native protein. Three ligands were used for docking within which two were the enzyme's own substrates (dermatan sulfate, DS and heparan sulfate, HS) and one from a plant source (Luteolin Sulfate -LS). LS was selected as a ligand because of its abundant availability [41] and its structural similarity with HS and DS. The binding energy of the ligands with mutants were compared with native protein.
InterPro server [42] and literature survey [5] identified D45, D46, C84, R88, K135, F137, H138, H229, D334, H335, and K347 (11 residues) as active site residues, and site-specific docking was performed using Autodock Vina. From the results of docking, it is obvious that binding energy differs drastically in the case of G134E, V503D, and E521D (especially for the ligand-Heparan  Table 6 Tabulation of Binding energies in Kcal/mol found using Autodock vina Sulfate) ( Table 6). Binding energies for remaining residues vary very slightly. Thus, HS was hypothesised to play a crucial role in causing Hunter's syndrome and these three ns-SNPs were considered as candidate ns-SNPs. Binding energy of these mutant proteins with HS in the presence of LS were calculated ( Table 6). From the results, it is observed that binding of LS with the protein alters the protein structure and increases the affinity for the ligand. Thus, LS may act as a potential molecule in treating Hunter Syndrome. The interaction between native and mutant proteins (G134E, V503D, and E521D) with the ligand Heparan Sulfate were visualized and bonds formed in native and mutants were observed using ligplot+ (Fig. 1). Ligplot+ analysis shows the variations in bonds formed by mutant proteins with ligands, compared with native protein. The mutation could also be the rational for these variations [43] and it supports the conclusions obtained from all the previous analyses. These predictions are in accordance with the deviation in RMSD values, as predicted earlier.

RMSF calculation
RMSF calculation done using the CABS-flex server, confirms G134E, V503D, and E521D as candidate ns-SNPs with variations in RMSF values (Table 7). Obtained result revealed that the values for mutants were lower when compared with native structure. This confirms the compressed nature of protein structure which in turn can affect the protein function [44].

Effect of SNP on splicing
Netgene 2 employs an artificial neural network that predicts the splice site location. The DNA sequence in FASTA format for native and all three identified candidate SNPs were given as inputs and splice sites were predicted. The results of the candidate SNPs splice site prediction were compared with the native prediction. As a result, we found that these SNPs do not interfere with the splicing mechanism of the IDS gene.

Pathogenicity analysis
Pathogenicity analysis was performed using DUET web server. It is an online server used to analyse the pathogenic effects of missense mutations in proteins. This server provides the integrative result by combining two independent approaches, namely mCSM and SDM. The results of this analysis are provided in Table 8, which predicts that all the three identified candidate SNPs will result in the destabilization of protein. This further supports the identification of G134E, V503D, and E521D as candidate SNPs.

Conclusion
Analysis using various insilico tools predicts the influence of ns-SNPs on the structure and function of protein IDS. Out of 289 ns-SNPs, G134E, V503D and E521D with SNP IDs-rs193302910, rs398123248, and rs1602725543, respectively, were predicted as candidate ns-SNPs. These ns-SNPs may alter the structure of the protein and interfere with the functions. The result summary of the complete workflow is provided in Fig. 2 and the structure of the mutant is compared with the native structure using PyMol [45] in Fig. 3. The presence of these ns-SNPs may inactivate the enzyme IDS and results in accumulation of GAGs, ultimately leading to Hunter syndrome. These ns-SNPs can be considered for clinical confirmatory studies for understanding the exact mechanism and pathology of mutations, before used for diagnostics.