# Animal-Origin Coronavirus Pandemic Risk Prediction Using Deep Learning Method | Infectious diseases of poverty

### Initial virus data

Coronavirus sequences were retrieved from the Coronavirus Genome Resource Library (https://ngdc.cncb.ac.cn/ncov/) on June 30, 2020, including genome sequences of MERS-CoV, HCoV-OC43, HCoV-NL63, HCoV-229E, HCoV-HKU1, and SARS-CoV (SARS-CoV-1 and SARS-CoV-2 combined), as well as coronaviruses of animal origin [20]. A total of 3,257 genomes were obtained; coronaviruses of human and animal origin were treated as positive and negative samples, respectively (see Supplementary File 1). Using the K-nearest neighbors algorithm (k = 5) with the frequencies of the four nucleotides as features, the negative samples (coronaviruses of animal origin) were assigned to one of the six types of positive samples (MERS-CoV, HCoV-OC43, HCoV-NL63, HCoV-229E, HCoV-HKU1, and SARS-CoV). Each group of negative samples was then combined with the corresponding positive samples to form six coronavirus data sets (see Supplementary File 2).
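The grouping step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Euclidean distance on the four nucleotide frequencies and a simple majority vote, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

def nucleotide_frequencies(seq):
    """Return the relative frequencies of A, C, G, T (U is counted as T)."""
    seq = seq.upper().replace("U", "T")
    counts = [seq.count(base) for base in "ACGT"]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_assign(query_seq, labeled_seqs, k=5):
    """Assign query_seq to the majority group among its k nearest labeled sequences."""
    query = nucleotide_frequencies(query_seq)
    neighbours = sorted(
        (euclidean(query, nucleotide_frequencies(seq)), label)
        for seq, label in labeled_seqs
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy positive samples (hypothetical sequences, far shorter than real genomes).
toy_positives = [
    ("AAAAAAAT", "MERS-CoV"), ("AAAAAATT", "MERS-CoV"), ("AAAAATTT", "MERS-CoV"),
    ("GGGGGGGC", "SARS-CoV"), ("GGGGGGCC", "SARS-CoV"), ("GGGGGCCC", "SARS-CoV"),
]
group = knn_assign("AAAATTTT", toy_positives, k=5)
```

Here the A/T-rich query has three A/T-rich neighbours among its five nearest, so it is grouped with the toy MERS-CoV samples.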

### Artificial negative data

The spike protein is the most important surface membrane protein in coronaviruses and is responsible for binding to the host cell receptor, which plays a key role in transmission efficiency and host range [13]. If the spike protein of a positive virus is replaced by that of a negative virus (see Supplementary File 3), the transmission efficiency of the positive virus should be significantly reduced and its phenotypic label should change [13]. Following this in silico artificial recombination strategy, sequences generated by replacing the spike protein coding region of the initial positive samples were added to the negative sample data set (see Supplementary File 4). During training, the weights of the spike protein were thereby further increased in the model, improving the accuracy and robustness of prediction. Given the synergistic effect of other viral proteins on interspecific infection, this approach is consistent with biological studies of host adaptation [13]. After adding the artificial negative data and balancing the number of samples (by direct duplication), the final data sets for the six viruses are shown in Table 1.
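The replacement itself amounts to a string splice. The sketch below assumes the spike coding regions are already known as 0-based, end-exclusive coordinates; the function name, coordinates, and toy sequences are hypothetical and only illustrate the idea.

```python
def replace_region(positive_genome, pos_region, negative_genome, neg_region):
    """Return positive_genome with pos_region swapped for negative_genome's neg_region.

    Regions are (start, end) tuples in 0-based, end-exclusive coordinates.
    """
    p_start, p_end = pos_region
    n_start, n_end = neg_region
    donor_spike = negative_genome[n_start:n_end]
    return positive_genome[:p_start] + donor_spike + positive_genome[p_end:]

# Toy example: lowercase marks the donor spike region for readability only.
positive = "AAAA" + "CCCCCC" + "TTTT"   # hypothetical spike region at [4, 10)
negative = "GG" + "acgtacgt" + "GG"     # hypothetical spike region at [2, 10)
chimera = replace_region(positive, (4, 10), negative, (2, 10))
# The chimeric sequence would then be labeled negative and added to the negative set.
```

Because the donor region may differ in length from the region it replaces, the chimeric genome need not have the same length as the original positive genome.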

### CCSI-DL Model Framework

The CCSI-DL model consists of five main steps: genome segmentation, sequence embedding, 1D convolution, RNN, and attention mechanism. Figure 1 shows the structure of the proposed model.

### Sequence segmentation and embedding

The coronavirus genome sequence cannot be used directly for model construction. The traditional sequence conversion method encodes DNA sequence fragments as one-hot vectors. Because one-hot vectors are independent of one another, the model cannot capture the associated information hidden in the sequence, which makes this encoding poorly suited to deep learning algorithms. To avoid this problem, we used DNA vectors obtained with the dna2vec method [21] as pre-trained embeddings. The dna2vec method is based on the word2vec word embedding model [22], a classic text representation method in natural language processing (NLP).

The dna2vec method treats DNA sequence fragments of length k (k-mers) as words. Its purpose is to compute a distributed representation of DNA fragments and to capture the associated information in the original sequence. Starting from the pre-trained vectors, the model fine-tunes the embedding layer to improve performance [23]. In this article we chose k = 2 and an embedding dimension of 8 [19].

The original coronavirus genome sequence is approximately 27 to 32 kb long. We took two-base RNA fragments as the basic words and preprocessed the original genome sequence to obtain a sequence of numeric indices. To simplify model input, the resulting index sequence was adjusted to a fixed length of 30 kb and divided into 10 equal segments (3 kb each). The embedding layer of the model embedded the input sequence using the pre-trained vectors. To improve model performance, the weights of the embedding layer were set to be trainable so that the DNA vectors could be refined on the coronavirus genome training data.
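The preprocessing described above can be sketched as follows. Several details are assumptions made for illustration rather than facts from the article: overlapping 2-mers with stride 1, index 0 reserved for padding and for fragments containing ambiguous bases, truncation of sequences longer than 30 kb, and the particular vocabulary ordering.

```python
from itertools import product

K = 2                # k-mer length used in the article
TARGET_LEN = 30000   # fixed index-sequence length (30 kb)
N_SEGMENTS = 10      # 10 segments of 3,000 indices each
PAD = 0              # assumed index for padding and unknown fragments

# Map each of the 16 possible 2-mers to an index 1..16 (ordering is arbitrary).
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K), start=1)}

def genome_to_segments(genome, target_len=TARGET_LEN, n_segments=N_SEGMENTS):
    """Convert a genome string into n_segments equal-length lists of k-mer indices."""
    seq = genome.upper().replace("U", "T")
    # Overlapping k-mers with stride 1 (an assumption); unknown fragments map to PAD.
    indices = [VOCAB.get(seq[i:i + K], PAD) for i in range(len(seq) - K + 1)]
    indices = indices[:target_len]                   # truncate long genomes
    indices += [PAD] * (target_len - len(indices))   # pad short genomes
    seg_len = target_len // n_segments
    return [indices[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# A synthetic 28 kb sequence stands in for a real genome.
segments = genome_to_segments("ACGT" * 7000)
```

Each of the 10 resulting index lists would then be passed through the embedding layer, which looks up an 8-dimensional vector per index.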

### One-dimensional convolution

The CCSI-DL model combines a 1D convolution with an RNN for feature extraction. A 1D convolution is first used to capture local correlation features in the sequence, and its output is then fed into a bidirectional GRU network to extract global correlation features. The 1D convolution slides along the data in one dimension and extracts features from shorter segments of the genome sequence.

For computer vision tasks, a deeper 2D convolutional model can produce more precise classification [24], but increasing network depth does not necessarily improve performance on 1D data [25]. The 1D convolutional layer is used to extract local features of the sequence after embedding. We set the number of convolution kernels (filters) to 4 and the kernel length (kernel_size) to 50, and used the rectified linear unit (ReLU) activation function. ReLU introduces sparsity into the network, and the pooling layer reduces computational complexity. A max pooling layer was connected after the convolutional layer to further reduce the sequence length, increase computational speed, and prevent overfitting.
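To make the convolution, ReLU, and pooling steps concrete, the following is a minimal single-channel, single-filter sketch in plain Python. It is not the Keras layer used in the article: it assumes no padding ("valid" convolution) and non-overlapping pooling windows, and the pool size and toy values are hypothetical.

```python
def conv1d_relu(signal, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation, as in deep learning libraries) plus ReLU."""
    n, k = len(signal), len(kernel)
    out = []
    for start in range(n - k + 1):
        window = signal[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, activation))  # ReLU zeroes negative responses
    return out

def max_pool1d(values, pool_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]
kernel = [-1.0, 0.0, 1.0]   # toy filter that responds to rising values
features = conv1d_relu(signal, kernel)
pooled = max_pool1d(features, pool_size=2)
```

Under the same no-padding assumption, a kernel of length 50 applied to a 3,000-element segment yields 3,000 - 50 + 1 = 2,951 output positions per filter before pooling.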

### Recurrent neural network

The recurrent unit used in the RNN was the gated recurrent unit (GRU), an improved variant of the LSTM network that can likewise address problems such as long-term memory and gradient problems during backpropagation [26]. The GRU removes the cell state of the LSTM network and retains two gates (the update gate and the reset gate), which simplifies the LSTM structure and reduces training complexity while achieving similar experimental performance.
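For reference, one common textbook formulation of the GRU update is given below. It is not taken from the article: the symbols $x_t$ (input at step $t$), $s_t$ (hidden state), $W$, $U$, and $b$ (learned parameters) are introduced here for illustration, $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and some references swap the roles of $z_t$ and $1 - z_t$.

$$z_t = \sigma(W_z x_t + U_z s_{t-1} + b_z)$$

$$r_t = \sigma(W_r x_t + U_r s_{t-1} + b_r)$$

$$\tilde{s}_t = \tanh(W_s x_t + U_s (r_t \odot s_{t-1}) + b_s)$$

$$s_t = z_t \odot s_{t-1} + (1 - z_t) \odot \tilde{s}_t$$

The update gate $z_t$ controls how much of the previous state is carried forward, and the reset gate $r_t$ controls how much of it contributes to the candidate state $\tilde{s}_t$.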

To capture the characteristics of biological sequences properly, a bidirectional GRU architecture was used (two unidirectional GRU layers, one forward and one backward). This architecture concatenates the forward and backward hidden states, so that the output accounts for sequence correlations in both directions simultaneously. The output of a single GRU layer was set to 50 dimensions, and the two opposite outputs were concatenated to give a total output dimension of 100.

### Attention Mechanism

After the bidirectional GRU layer, an attention layer is used to learn feature weights. Inspired by human attention, its purpose is to focus on specific parts of a large amount of information. The attention mechanism was first applied in image processing, was introduced to NLP for machine translation in 2015 [27], and has since been extended to various other NLP tasks, resulting in many improvements [28]. Because the features of the sequence are not assumed to be equally important for prediction, the attention mechanism can increase the contribution of key features. The attention mechanism is described by the following formulas [19]:

$$h_i = \tanh(W_\omega f_i + b_\omega)$$

$$\alpha_i = \frac{\exp(h_i^T h_\omega)}{\sum_i \exp(h_i^T h_\omega)}$$

$$v = \sum_i \alpha_i f_i$$

For the i-th feature $f_i$ output by the previous layer, its hidden representation $h_i$ is calculated. The importance of each feature is measured by the similarity between $h_i$ and the context vector $h_\omega$. The normalized weight $\alpha_i$ of each feature is obtained by taking the product of $h_\omega$ and $h_i$ and then applying the softmax function. Each feature vector $f_i$ is multiplied by its weight $\alpha_i$, and the weighted vectors are summed to give the final output vector $v$ used for prediction. $W_\omega$, $b_\omega$, and $h_\omega$ were initialized randomly and learned during model training.
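The three formulas can be traced step by step in a short plain-Python sketch. The parameter values below are hand-picked toy numbers rather than learned weights, and the function name is hypothetical; subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the softmax result.

```python
import math

def attention_pool(features, W, b, h_w):
    """Attention pooling over a list of feature vectors.

    features: list of n vectors f_i, each of length d
    W: a-by-d weight matrix (list of rows); b: bias vector of length a
    h_w: context vector of length a
    Returns (v, alphas): the pooled vector of length d and the n attention weights.
    """
    # h_i = tanh(W f_i + b)
    hidden = [
        [math.tanh(sum(w * x for w, x in zip(row, f)) + b_j)
         for row, b_j in zip(W, b)]
        for f in features
    ]
    # score_i = h_i^T h_w, normalized with a softmax
    scores = [sum(h * c for h, c in zip(h_i, h_w)) for h_i in hidden]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # v = sum_i alpha_i f_i
    d = len(features[0])
    v = [sum(a * f[j] for a, f in zip(alphas, features)) for j in range(d)]
    return v, alphas

# Toy inputs: three 2-dimensional features, identity W, zero bias.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
v, alphas = attention_pool(toy_features, identity, [0.0, 0.0], [1.0, 1.0])
```

With these toy values the third feature aligns best with the context vector and therefore receives the largest weight, while the first two features receive equal weights.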

### Model training and evaluation

During training, we used a batch size of 24, the cross-entropy loss function, and the Adam optimization algorithm to update the network weights. To avoid overfitting, a batch normalization layer and a dropout layer (dropout probability = 0.5) were added after the merge layer. After amplification of the positive samples, the six viral data sets were randomly divided into a training set (90%) and a test set (10%). The model was trained on the training set for 15 epochs, and the trained model was then evaluated on the test set.

To evaluate the predictive models, we calculated the area under the receiver operating characteristic curve (AUROC) [29] and the area under the precision-recall curve (AUPR) [30], both of which are suitable for unbalanced data sets. The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate. The closer the AUROC is to 1, the better the performance of the model. The ROC curve is not affected by the distribution of positive and negative samples, so AUROC is suitable as an evaluation index for an unbalanced binary classification model. The precision-recall curve plots precision against recall, reflecting the trade-off between the model's precision and its recall in identifying positive examples. The closer the AUPR value is to 1, the better the performance of the model.
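Both metrics can be computed from labels and predicted scores alone. The sketch below is illustrative and not the evaluation code used in the article: AUROC is computed through its pairwise-ranking interpretation, and AUPR is approximated by average precision, which is one common estimator; the function names are hypothetical and the average-precision routine assumes no tied scores.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision: mean of the precision values at each positive, by rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_pos, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
roc_value = auroc(toy_labels, toy_scores)              # 3 of 4 pairs ranked correctly
pr_value = average_precision(toy_labels, toy_scores)   # (1/1 + 2/3) / 2
```

In this toy case one negative (0.4) outranks one positive (0.35), so AUROC is 0.75 and average precision is about 0.83.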

To verify the contribution of the pre-trained DNA vectors and the attention mechanism to the performance of CCSI-DL-specific and CCSI-DL-general, three variant models were proposed: (1) CCSI-DL-nopre, in which the embedding layer does not use pre-trained DNA vectors; (2) CCSI-DL-onehot, which does not use an embedding layer and takes only a one-hot encoded sequence as input; and (3) CCSI-DL-noatt, which does not use the attention layer. The same training procedure was applied to the specific and general models, giving CCSI-DL-spe-nopre, CCSI-DL-spe-onehot, CCSI-DL-spe-noatt, CCSI-DL-gen-nopre, CCSI-DL-gen-onehot, and CCSI-DL-gen-noatt.

### Implementation of the prediction tool

We used Python 3.7.4, TensorFlow 2.1.0, and Keras 2.3.1 to create an easy-to-use tool that implements our predictor. It is freely accessible at https://github.com/kouzheng/CCSI-DL, works end-to-end, and can handle large data sets. Users should prepare the coronavirus genome query sequences in FASTA format, enter the name of the query file, and set the confidence parameter (from 0.0 to 1.0) before running the tool. Setting a smaller confidence parameter results in more sensitive predictions. A predicted pandemic risk phenotype is labeled “H”, while the “N” label indicates no transmission.
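The input and labeling conventions can be illustrated with a short sketch. It does not reproduce the tool's actual code or interface: the function names are hypothetical, and the rule that a score at or above the confidence parameter yields "H" is an assumption chosen to match the statement that smaller values give more sensitive predictions.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {record_name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # keep only the first header token
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def label_prediction(score, confidence=0.5):
    """Return 'H' if the score reaches the confidence threshold, else 'N' (assumed rule)."""
    return "H" if score >= confidence else "N"

toy_fasta = ">seq1 example record\nACGT\nTTAA\n>seq2\nGG\n"
queries = parse_fasta(toy_fasta)
```

Under this assumed rule, a query scored 0.3 would be labeled "N" at a confidence of 0.5 but "H" at a confidence of 0.2.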