Covidex Supplementary data

Nextstrain nomenclature

This file contains the supplementary data from the Covidex app. All graphs and some tables are interactive, so the reader can explore the data.
First, we present basic statistics for the training and testing datasets.

Classification model	Training date	Sequences	Number of subtypes	Number of trees	mtry	OOB error rate
Nextstrain nomenclature	2021-03-08	1402	12	1000	200	0.0171

Classes were excluded due to contradictions with data supplied by Rambaut et al.

Classification model	Sequences	Number of subtypes	Error	Multi-class AUC
Nextstrain nomenclature	706	12	0.0212	0.9869
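As a rough sanity check, the error rates reported in the two tables above can be converted back into approximate counts of misclassified sequences. The sketch below assumes that each error rate is simply the fraction of misclassified sequences, which is not stated explicitly here; the helper name `approx_misclassified` is hypothetical and not part of Covidex.

```python
def approx_misclassified(n_sequences: int, error_rate: float) -> int:
    """Nearest whole number of misclassified sequences implied by an error rate.

    Assumption: error_rate = misclassified / n_sequences.
    """
    return round(n_sequences * error_rate)


# Figures taken from the training and testing tables above.
train_wrong = approx_misclassified(1402, 0.0171)  # out-of-bag estimate
test_wrong = approx_misclassified(706, 0.0212)    # held-out test set

# Print each implied count next to the error rate it reproduces.
print(train_wrong, round(train_wrong / 1402, 4))
print(test_wrong, round(test_wrong / 706, 4))
```

Under that assumption, the reported rates are consistent with about 24 out-of-bag misclassifications among the 1402 training sequences and about 15 misclassifications among the 706 test sequences.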

The following graph plots the classification probability against the number of ambiguous bases for each sequence. As expected, the proportion of misclassified sequences (red dots) increases as the probability decreases. The proportion of misclassified sequences also tends to increase with the number of ambiguous bases.
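The two quantities behind this kind of plot can be sketched as follows. This is a minimal illustration, assuming that an ambiguous base is any character other than A, C, G, T or a gap; how Covidex counts ambiguous bases is not stated here. The function names, the 0.6 probability cut-off and the toy records are all hypothetical.

```python
UNAMBIGUOUS = set("ACGT")


def count_ambiguous(seq: str) -> int:
    """Count characters other than A, C, G, T or a gap (assumed definition)."""
    return sum(1 for base in seq.upper() if base not in UNAMBIGUOUS and base != "-")


def misclassified_fraction(records):
    """Fraction of records whose predicted label differs from the expected one."""
    records = list(records)
    if not records:
        return None  # undefined for an empty group
    wrong = sum(1 for r in records if r["predicted"] != r["expected"])
    return wrong / len(records)


# Made-up toy records, for illustration only.
toy = [
    {"seq": "ACGTACGT", "probability": 0.98, "expected": "20A", "predicted": "20A"},
    {"seq": "ACGTNNGT", "probability": 0.55, "expected": "20B", "predicted": "20A"},
    {"seq": "ACGTACGA", "probability": 0.91, "expected": "20B", "predicted": "20B"},
    {"seq": "ANNNRYGT", "probability": 0.40, "expected": "20C", "predicted": "20A"},
]

for r in toy:
    r["n_ambiguous"] = count_ambiguous(r["seq"])

# Compare the misclassified fraction below and above an arbitrary cut-off.
low = [r for r in toy if r["probability"] < 0.6]
high = [r for r in toy if r["probability"] >= 0.6]
print(misclassified_fraction(low), misclassified_fraction(high))
```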

The following table presents the evaluation metrics for each class:
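As an illustration of how per-class metrics of this kind are typically derived, the following minimal sketch computes precision and recall for each class from paired expected and predicted labels. Which metrics Covidex reports, and how it computes them, is not stated here; the function name `per_class_metrics` and the toy labels are hypothetical.

```python
def per_class_metrics(expected, predicted):
    """Return {class: {"precision": ..., "recall": ...}} using one-vs-rest counts."""
    pairs = list(zip(expected, predicted))
    classes = sorted(set(expected) | set(predicted))
    metrics = {}
    for c in classes:
        tp = sum(1 for e, p in pairs if e == c and p == c)  # correctly assigned to c
        fp = sum(1 for e, p in pairs if e != c and p == c)  # wrongly assigned to c
        fn = sum(1 for e, p in pairs if e == c and p != c)  # missed members of c
        metrics[c] = {
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
        }
    return metrics


# Made-up toy labels, for illustration only.
expected = ["20A", "20A", "20B", "20B", "20C"]
predicted = ["20A", "20B", "20B", "20B", "20C"]

for name, values in per_class_metrics(expected, predicted).items():
    print(name, values)
```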

The next heatmap shows the correlation between the expected classification and the classification obtained by Covidex for each class. Overall, the correlation is high.
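One common way to obtain the values behind such a heatmap is a confusion matrix of expected versus obtained classes, optionally normalised per expected class. Whether Covidex builds its heatmap this way is an assumption; the sketch below uses hypothetical function names and made-up labels.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count how often each (expected, predicted) pair of classes occurs."""
    return Counter(zip(expected, predicted))


def row_normalised(counts):
    """Divide each cell by the total for its expected class (row)."""
    totals = Counter()
    for (exp, _), n in counts.items():
        totals[exp] += n
    return {(exp, pred): n / totals[exp] for (exp, pred), n in counts.items()}


# Made-up toy labels, for illustration only.
expected = ["20A", "20A", "20B", "20B", "20C"]
predicted = ["20A", "20B", "20B", "20B", "20C"]

counts = confusion_counts(expected, predicted)
for cell, fraction in sorted(row_normalised(counts).items()):
    print(cell, fraction)
```

Cells on the diagonal (expected equal to predicted) close to 1 indicate good agreement for that class.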

The precision-recall curve illustrates the good performance of the method.
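For readers unfamiliar with this type of curve, the sketch below shows how its points can be computed for a single class treated as positive against all others, by sweeping a threshold over the classification probabilities. It is a generic illustration under that one-vs-rest assumption, not the Covidex implementation; the function name and toy scores are hypothetical.

```python
def precision_recall_points(scores, is_positive):
    """Return (threshold, precision, recall) for each distinct score, highest first.

    A sequence is called positive when its score is >= the threshold.
    """
    total_pos = sum(1 for flag in is_positive if flag)
    if total_pos == 0:
        raise ValueError("at least one positive example is required")
    pairs = sorted(zip(scores, is_positive), key=lambda t: -t[0])
    points = []
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, tp / (tp + fp), tp / total_pos))
    return points


# Made-up toy scores for one class, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4]
is_positive = [True, True, False, True]

for point in precision_recall_points(scores, is_positive):
    print(point)
```

A curve that stays near precision 1 as recall approaches 1 corresponds to a classifier that ranks true members of the class above the others.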