This file contains the supplementary data for the Covidex app. All graphs and some tables are interactive, so the reader can explore the data.
First, we present some basic statistics for the training and testing datasets.
| Classification model | Training date | Sequences | Number of subtypes | Number of trees | mtry | OOB error rate |
|---|---|---|---|---|---|---|
| Rambaut et al. nomenclature | 2021-03-15 | 60362 | 882 | 500 | 350 | 0.0365 |
| Classification model | Sequences | Number of subtypes | Error | Multi-class AUC |
|---|---|---|---|---|
| Rambaut et al. nomenclature | 24411 | 882 | 0.0293 | 0.9472 |
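As an illustration of the model configuration reported in the training table, the sketch below sets up a random forest with comparable hyperparameters (500 trees, an mtry-style limit on features per split) and reads off the out-of-bag (OOB) error. This is not the Covidex training code; the data here are toy values, and the mtry analogue is scaled down to match them.

```python
# Illustrative sketch only, not the Covidex training pipeline.
# Shows how a random forest with the table's hyperparameters (500 trees,
# an mtry-style feature limit) and an OOB error estimate can be set up.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50)).astype(float)  # toy feature matrix
y = rng.integers(0, 3, size=300)                      # toy "subtype" labels

clf = RandomForestClassifier(
    n_estimators=500,   # "Number of trees" column in the table
    max_features=10,    # analogue of mtry, scaled down for the toy data
    oob_score=True,     # enables the out-of-bag error estimate
    random_state=0,
)
clf.fit(X, y)
oob_error = 1.0 - clf.oob_score_  # OOB error rate, as reported in the table
print(round(oob_error, 4))
```

The OOB error is computed from the trees that did not see each sample during bootstrapping, which is why it serves as an internal validation estimate without a held-out set.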
The following graph plots classification probability against the number of ambiguous bases for each sequence. As expected, the proportion of wrongly classified sequences (red dots) increases at lower probability values. We also see a trend toward a larger proportion of wrongly classified sequences as the number of ambiguous bases increases.
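The x-axis quantity of the graph above can be computed directly from a sequence: ambiguous bases are IUPAC codes other than A, C, G, or T (e.g. N, R, Y). A minimal sketch, not taken from the Covidex code:

```python
# Minimal sketch (not Covidex's code): count ambiguous bases in a
# nucleotide sequence, i.e. IUPAC codes other than A, C, G, T.
def count_ambiguous(seq: str) -> int:
    unambiguous = set("ACGT")
    return sum(1 for base in seq.upper() if base not in unambiguous)

print(count_ambiguous("ACGTNNRYACGT"))  # N, N, R, Y -> 4
```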
In the following table evaluation metrics for each class are presented:
Table captions:
Sensitivity (Recall): the proportion of actual positive cases that were correctly identified.
Specificity: the proportion of actual negative cases that were correctly identified.
Positive Predictive Value (PPV): the proportion of predicted positive cases that are true positives.
Negative Predictive Value (NPV): the proportion of predicted negative cases that are true negatives.
Precision: the proportion of predicted positive cases which were correctly identified (equivalent to PPV).
F1: the harmonic mean of precision and recall.
Prevalence: the proportion of the total cases which are actual positive cases.
Detection Rate: the proportion of the total cases which are correctly identified positive cases.
Detection Prevalence: the proportion of the total cases which were predicted as positive cases.
Balanced Accuracy: the arithmetic mean of sensitivity and specificity.
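All of the metrics defined above follow from the four cells of a per-class (one-vs-rest) confusion matrix. The sketch below computes them for illustration; the counts are made up and are not Covidex results.

```python
# Sketch of the per-class (one-vs-rest) metrics defined above, computed
# from a binary confusion matrix. The counts are invented for illustration.
def class_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "F1": f1,
        "Prevalence": (tp + fn) / total,
        "Detection Rate": tp / total,
        "Detection Prevalence": (tp + fp) / total,
        "Balanced Accuracy": (sensitivity + specificity) / 2,
    }

m = class_metrics(tp=90, fp=10, tn=880, fn=20)
print(round(m["F1"], 4))  # -> 0.8571
```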
The next heatmap shows, for each class, the correlation between the expected classification and the classification obtained by Covidex. Overall we find high correlation values.