📈 Quantifying 'The Curve'
A naive attempt at COVID curve fitting
With COVID-19 at the top of everyone’s mind, I decided to poke around with
some of the data floating around the internet. The
Johns Hopkins University data-set seems to be among the most used sources.
This also gave me a chance to acquaint myself with a new and exciting part of science: data science.
I stocked up my toolbox with
pandas and went to work.
All my code can be found on GitHub:
Getting the data
The repository consists of a
covid package that contains methods for grabbing the Johns
Hopkins University data as a neatly formatted
pandas.DataFrame, indexed by
To get metrics per population, a
population column is added using
data from data.worldbank.org via the
wbdata Python package and the
We then just iterate over the three metrics and calculate the
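As a rough illustration of the per-population step, here is a minimal sketch. It is not the code from the repository: the column names (confirmed, deaths, recovered, population), the country/date index and the toy numbers are all assumptions made up for this example.

```python
import pandas as pd

# Toy numbers for illustration only, not real data.
df = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "date": pd.to_datetime(["2020-03-01", "2020-03-02"] * 2),
        "confirmed": [100, 150, 20, 40],
        "deaths": [2, 3, 1, 2],
        "recovered": [10, 20, 0, 5],
        "population": [1_000_000, 1_000_000, 500_000, 500_000],
    }
).set_index(["country", "date"])  # assumed index layout

# Assumed names for the three metrics.
metrics = ["confirmed", "deaths", "recovered"]

# Add one per-million column for each metric.
for metric in metrics:
    df[f"{metric}_per_million"] = df[metric] / df["population"] * 1_000_000
```

Scaling to "per 1 million" keeps the numbers readable for small and large countries alike.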
Processing the data
Since the COVID outbreak did not start at the same time everywhere around the
world, it makes sense to index the data by the number of days
since a threshold amount of a given metric was reached. For example, we may want
to index based on the number of days since 10 deaths were recorded for the
This is done by applying the
covid.utils.get_x_day function country-wise:
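The idea of re-indexing by days since a threshold can be sketched roughly as follows. This is not the actual covid.utils.get_x_day implementation: days_since_threshold is a hypothetical helper, the column names are assumptions, and the sketch assumes one row per country per day, already sorted by date.

```python
import pandas as pd


def days_since_threshold(group, metric="deaths", threshold=10):
    """Hypothetical helper: keep rows at or above the threshold, number days from 0."""
    reached = group[group[metric] >= threshold].copy()
    reached["day"] = range(len(reached))
    return reached


# Toy numbers for illustration only, not real data.
df = pd.DataFrame(
    {
        "country": ["A"] * 4 + ["B"] * 4,
        "date": list(pd.date_range("2020-03-01", periods=4)) * 2,
        "deaths": [2, 8, 12, 20, 0, 10, 15, 30],
    }
)

# Apply the helper country-wise and stitch the pieces back together.
aligned = pd.concat(
    [days_since_threshold(group) for _, group in df.groupby("country")]
)
```

After this step, day 0 means "the first day with at least 10 deaths" for every country, so the curves can be compared on the same x-axis.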
A simple mathematical model that fits our data reasonably well is a logistic
model, as described by
covid.statistics.LogisticMode, which we can fit to each country's data using the
lmfit Python package.
Graphing our data and fit
We can now plot the Johns Hopkins data and our fit using
nice graphs with standard errors like the one below. Feel free to download the
git repository and have a play with the data in the
In the legend we also find the 3 parameters from our logistic fit, where the
value corresponds to the maximum number of deaths per 1 million population
reached according to the crude logistic model.
As a Dane living in the UK, I find it interesting to see how the governmental responses of Sweden, the UK and Denmark seem to correlate with the number of deaths per population. Sweden in particular is very comparable to Denmark in many regards, but has chosen a wildly different strategy for dealing with COVID-19.
For the time being, the predicted number of US deaths per population is quite low, but that is probably explained by the fact that the outbreak is still largely confined to a few major cities. The model is unable to take this into consideration, and given the media coverage surrounding the US response so far, it seems likely that the death toll there will be much larger than the model predicts.
Breaking the curve
In order to see when the curve is “broken”, we can change the y-axis to a log scale to get:
We see that the US has only just broken the curve, while hard-hit countries like Spain and Italy finally seem to be seeing the results of their strict lockdowns.
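Switching an axis to a log scale is a one-liner in most plotting libraries. The sketch below assumes matplotlib, which may not be what the repository uses, and plots a synthetic logistic curve rather than real data.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic logistic curve standing in for one country's data.
days = np.arange(1, 61)
curve = 300.0 / (1.0 + np.exp(-0.2 * (days - 30.0)))

fig, ax = plt.subplots()
ax.plot(days, curve, label="synthetic country")
ax.set_yscale("log")  # exponential growth shows up as a straight line
ax.set_xlabel("days since threshold")
ax.set_ylabel("deaths per million")
ax.legend()
```

On a log axis, the early exponential phase is a straight line, so the point where the data bends away from that line is where the curve starts to break.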
Here is the graph for all countries — it is a bit busy.
It should be noted that the tool also does
deaths was chosen as this is probably the most reliable number across the
For me, this was a really nice way of playing around with data on the rather tragic COVID-19 pandemic. I was quite surprised to see just how well the logistic model fits the data, though it does not account for the fact that societies will eventually have to reopen, which will likely result in a second ‘wave’.
I hope this was at least interesting and hopefully the git repository can serve as a starting point for the next guy looking to poke around with the data.
NOTE: I myself took
as a starting point for using the Johns Hopkins data and
shamelessly stolen from there.