Covidex Supplementary data

Nextstrain nomenclature

This file contains the supplementary data from the Covidex app. All graphs and some tables are interactive, so the reader can explore the data.
First, we present basic statistics for the training and testing datasets.

| Classification model | Training date | Sequences | Number of subtypes | Number of trees | mtry | OOB error rate |
|---|---|---|---|---|---|---|
| Nextstrain nomenclature | 2021-03-08 | 1402 | 12 | 1000 | 200 | 0.0171 |

Classes were excluded due to contradictions with data supplied by Rambaut et al.

| Classification model | Sequences | Number of subtypes | Error | Multi-class AUC |
|---|---|---|---|---|
| Nextstrain nomenclature | 706 | 12 | 0.0212 | 0.9869 |
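
As a rough consistency check, the reported error rates can be converted into approximate counts of misclassified sequences. The sketch below assumes that each error rate is simply the fraction of misclassified sequences, which is the usual definition but is not stated in the report. The resulting counts are inferred, not reported by Covidex, and the function name is illustrative.

```python
def misclassified_count(error_rate: float, n_sequences: int) -> int:
    """Approximate number of misclassified sequences implied by an error rate.

    Assumption: error_rate = misclassified / n_sequences.
    """
    return round(error_rate * n_sequences)


# Values taken from the two tables above.
oob_errors = misclassified_count(0.0171, 1402)   # training set, out-of-bag error
test_errors = misclassified_count(0.0212, 706)   # second table, 706 sequences

print(oob_errors, test_errors)  # → 24 15
```

Under this assumption, roughly 24 of 1402 training sequences and 15 of 706 sequences in the second table would be misclassified.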

The following graph plots the classification probability against the number of ambiguous bases for each sequence. As expected, the proportion of wrongly classified sequences (red dots) increases as the probability decreases. The proportion of wrongly classified sequences also tends to increase with the number of ambiguous bases.
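
The number of ambiguous bases on one axis of this graph could be computed along the following lines. This is a minimal sketch: it assumes that every character other than A, C, G or T counts as ambiguous (for example N or other IUPAC codes) and that alignment gaps are ignored. The report does not state how Covidex counts them, and the function name is hypothetical.

```python
UNAMBIGUOUS_BASES = set("ACGT")


def count_ambiguous_bases(sequence: str) -> int:
    """Count bases that are not A, C, G or T (case-insensitive).

    Assumption: gap characters ('-') are not counted as ambiguous bases.
    """
    return sum(
        1
        for base in sequence.upper()
        if base not in UNAMBIGUOUS_BASES and base != "-"
    )


# Toy sequence, not taken from the Covidex datasets.
example_count = count_ambiguous_bases("ACGTNNRY-acgtn")
print(example_count)  # → 5
```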

The following table presents evaluation metrics for each class:
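
Per-class metrics of this kind are typically derived from the expected and predicted labels. The sketch below computes precision, recall and F1 for each class. These three metrics are an assumption, since the report does not list which metrics the table contains. The clade labels are toy data, not Covidex results.

```python
from collections import Counter


def per_class_metrics(expected, predicted):
    """Return precision, recall and F1 for every class seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fn[exp] += 1   # the expected class was missed
            fp[pred] += 1  # the predicted class was wrongly assigned

    metrics = {}
    for cls in sorted(set(expected) | set(predicted)):
        predicted_total = tp[cls] + fp[cls]
        expected_total = tp[cls] + fn[cls]
        precision = tp[cls] / predicted_total if predicted_total else 0.0
        recall = tp[cls] / expected_total if expected_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy labels for illustration only.
expected = ["19A", "19A", "20A", "20A", "20B"]
predicted = ["19A", "20A", "20A", "20A", "20B"]
metrics = per_class_metrics(expected, predicted)
print(metrics["19A"]["recall"])  # → 0.5
```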

The next heatmap shows the correlation between the expected classification and the classification obtained by Covidex for each class. Overall, the correlation is high.
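
One common way to build such a heatmap is a row-normalized confusion matrix, where each row is an expected class and each cell is the fraction of its sequences assigned to a given class. The report does not say how the heatmap values are computed, so the sketch below is an assumption, with hypothetical names and toy labels.

```python
def row_normalized_confusion(expected, predicted):
    """Fraction of each expected class that was assigned to each predicted class."""
    classes = sorted(set(expected) | set(predicted))
    counts = {exp: {pred: 0 for pred in classes} for exp in classes}
    for exp, pred in zip(expected, predicted):
        counts[exp][pred] += 1

    normalized = {}
    for exp in classes:
        row_total = sum(counts[exp].values())
        normalized[exp] = {
            pred: (counts[exp][pred] / row_total if row_total else 0.0)
            for pred in classes
        }
    return normalized


# Toy labels for illustration only.
expected_labels = ["19A", "19A", "20A", "20A", "20B"]
predicted_labels = ["19A", "20A", "20A", "20A", "20B"]
matrix = row_normalized_confusion(expected_labels, predicted_labels)
print(matrix["19A"])  # → {'19A': 0.5, '20A': 0.5, '20B': 0.0}
```

A classifier in close agreement with the expected labels gives values near 1 on the diagonal and near 0 elsewhere.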

The precision-recall curve shows that the method performs well.
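
For reference, the points of a precision-recall curve for a single class (one class against all others) can be obtained by ranking sequences by predicted probability and accumulating true positives. This is a minimal sketch with toy scores. It ignores tied scores, and how Covidex aggregates the curve over the 12 classes is not stated in the report.

```python
def precision_recall_points(scores, is_positive):
    """Return (recall, precision) pairs, one per rank, from highest score down."""
    total_positive = sum(is_positive)
    if total_positive == 0:
        raise ValueError("at least one positive example is required")

    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            true_positives += 1
        recall = true_positives / total_positive
        precision = true_positives / rank
        points.append((recall, precision))
    return points


# Toy probabilities for one class and whether each sequence truly belongs to it.
points = precision_recall_points(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_positive=[True, False, True, True],
)
print(points[-1])  # → (1.0, 0.75)
```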