Corona-Chan Project, Part 3: Analyzing the Prediction Function

This article will be building on my research in the previous two articles on the Corona-Chan Project, so if you haven’t read Parts 1 and 2, you might want to go back and do that:

Project for Quarantine Period: Tracking the Coronavirus Outbreak Using Calculus and C
Corona-Chan Project Update: Smoothing the Prediction Function

First order of business: I modified my derivatives program so that it reads and prints dates from the file. Each line of the input file now consists of a date followed by the number for that day. Just thought I’d share that first, since it doesn’t fit anywhere else in this article:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define SMOOTH_ITERATIONS 13

#define smooth( dataset, i )\
((int) (0.8 * (float) dataset[i] + 0.1 * (float) dataset[i+1] + 0.1 * (float) dataset[i-1]))

int main( int argc, char **argv ){
        FILE *fp;            // File pointer
        char buf[30];        // Buffer for reading the file
        char *token;         // Used for tokenizing the buffer
        int *derivatives[4]; // Derivative data points
        char **dates;        // Dates from file
        int *smoothed;       // Array for smoothed data set
        int lines = 0;       // Number of data points in each array
        int i, j, k;         // Loop counters
        // Check for a file name argument:
        if( argc < 2 ){
                fprintf( stderr, "Usage: %s <data file>\n", argv[0] );
                exit( 1 );
        }
        // Open file and handle any errors:
        if( !(fp = fopen( argv[1], "r" )) ){
                fprintf( stderr, "%s: %s: %s\n", argv[0], argv[1], strerror( errno ) );
                exit( errno );
        }
        // Count lines in file (read one character, then the rest of the line):
        while( fgetc( fp ) != EOF ){
                fgets( buf, 30, fp );
                lines++;
        }
        // Set up dates array:
        dates = (char **) malloc( sizeof( char * ) * lines );
        for( i = 0; i < lines; i++ ){
                dates[i] = (char *) calloc( 6, sizeof( char ) );
        }
        // Set up arrays of data points (each derivative has one point fewer):
        for( i = 0; i < 4; i++ ){
                derivatives[i] = (int *) malloc( sizeof( int ) * lines-- );
        }
        lines += 3;
        smoothed = (int *) malloc( sizeof( int ) * lines );
        rewind( fp );
        // Read original data set from file:
        for( i = 0; i < lines; i++ ){
                fgets( buf, 30, fp );
                // First token copies date:
                token = strtok( buf, "\t\r\n" );
                strcpy( dates[i], token );
                // Second token copies number:
                token = strtok( NULL, "\t\r\n" );
                derivatives[0][i] = atoi( token );
        }
        // Smooth original data set:
        for( i = 0; i < SMOOTH_ITERATIONS; i++ ){
                for( j = 1; j < lines-1; j++ ){
                        smoothed[j] = smooth( derivatives[0], j );
                }
                // Copy smoothed data set back to original array:
                for( j = 1; j < lines-1; j++ ){
                        derivatives[0][j] = smoothed[j];
                }
        }
        // Outer loop loops through array of arrays
        for( i = 1; i < 4; i++ ){
                lines--;
                // Calculate derivatives:
                for( j = 0; j < lines; j++ ){
                        derivatives[i][j] = derivatives[i-1][j] - derivatives[i-1][j+1];
                }
                // Smooth data set:
                for( j = 0; j < SMOOTH_ITERATIONS; j++ ){
                        for( k = 1; k < lines-1; k++ ){
                                smoothed[k] = smooth( derivatives[i], k );
                        }
                        // Copy smoothed data set back to original array:
                        for( k = 1; k < lines-1; k++ ){
                                derivatives[i][k] = smoothed[k];
                        }
                }
        }
        // Drop last three data points because
        // it makes programming easier.
        for( i = 0; i < lines; i++ ){
                printf( "%s:\t", dates[i] );
                for( j = 0; j < 4; j++ ){
                        printf( "%8d\t", derivatives[j][i] );
                }
                putchar( '\n' );
        }
        // Free up memory:
        for( i = 0; i < 4; i++ ){
                free( derivatives[i] );
        }
        // lines is now 4 less than the original line count:
        for( i = 0; i < lines+4; i++ ){
                free( dates[i] );
        }
        free( dates );
        free( smoothed );
        fclose( fp );
        return 0;
}

Now that I’ve shared the current iteration of the program I’m using to analyze the data, I can get into modeling the results.

For this article I’m going to be providing a more precise definition of the mathematical functions that I’m using to model and predict the behavior of the Coronavirus spread. The specific class of functions I’m using consists of the zeroth, first, second, and third derivatives of linear transformations of the hyperbolic tangent function. By a linear transformation of a function I mean a second function derived from the first by multiplying x or y by a scalar and/or adding a scalar to a scalar multiple of x or y. (Technically, mathematics defines linear transformations on vector spaces, and there’s no standard concept of a linear transformation of a function, so I’m adapting the definition somewhat for my own use.)

As an example of this concept, the function y = 2tanh(5x-4)+3 is a linear transformation of the hyperbolic tangent. The reason for using transformations like this is so that we can properly fit the curves to the actual data through translation and scaling.
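
As a small illustration, this family of curves can be evaluated directly in C. The function name tanh_model and its parameter names below are my own choices, not anything from the program above; the coefficients come from the example just given:

#include <stdio.h>
#include <math.h>

/* Generic linear transformation of the hyperbolic tangent:
   y = a * tanh( b*x + c ) + d */
double tanh_model( double a, double b, double c, double d, double x ){
        return a * tanh( b * x + c ) + d;
}

int main( void ){
        double x;
        /* Sample the example curve y = 2tanh(5x-4)+3 from the text: */
        for( x = 0.0; x <= 2.0; x += 0.25 ){
                printf( "%.2f\t%f\n", x, tanh_model( 2.0, 5.0, -4.0, 3.0, x ) );
        }
        return 0;
}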

Using the rules of differential calculus, we can find that the derivatives of y = tanh(x) are as follows:

y' = sech^2(x)
y'' = -2sech^2(x)tanh(x)
y''' = 4sech^2(x)tanh^2(x) - 2sech^4(x)
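
These formulas can also be sanity-checked numerically. Here is a minimal C sketch (the names d1, d2, and d3 are mine) that prints the three derivatives and compares the first one against a finite difference of tanh, which is essentially what the program above does with the day-to-day case counts:

#include <stdio.h>
#include <math.h>

/* Closed-form derivatives of tanh(x); the names d1, d2, d3 are mine. */
double d1( double x ){ double s = 1.0 / cosh( x ); return s * s; }
double d2( double x ){ double s = 1.0 / cosh( x ); return -2.0 * s * s * tanh( x ); }
double d3( double x ){ double s = 1.0 / cosh( x ), t = tanh( x );
                       return 4.0 * s * s * t * t - 2.0 * s * s * s * s; }

int main( void ){
        double x, h = 0.001;
        /* Print the three derivatives, plus a finite-difference estimate of
           the first one for comparison: */
        printf( "x\td1\td2\td3\t(tanh(x+h)-tanh(x))/h\n" );
        for( x = -2.0; x <= 2.0; x += 0.5 ){
                printf( "%5.2f\t%8.5f\t%8.5f\t%8.5f\t%8.5f\n",
                        x, d1( x ), d2( x ), d3( x ),
                        ( tanh( x + h ) - tanh( x ) ) / h );
        }
        return 0;
}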

In order to understand how the adjustments made by the smoothing algorithm may cause the resulting curve to deviate from its ideal model, we can explore what happens when we apply those same adjustments to the models themselves. So, for example, we could replace tanh(x) with 0.8tanh(x) + 0.1tanh(x+0.1) + 0.1tanh(x-0.1), per the weights we used in the actual smoothing function.
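
To get a feel for what this replacement does, it can be evaluated directly. A minimal sketch, assuming a single smoothing pass with the neighbors offset by 0.1 (the function name smooth_once is mine):

#include <stdio.h>
#include <math.h>

/* One pass of the 0.8/0.1/0.1 smoothing weights applied to a model
   function f at the point x, with the neighbors a distance h away. */
double smooth_once( double (*f)(double), double x, double h ){
        return 0.8 * f( x ) + 0.1 * f( x + h ) + 0.1 * f( x - h );
}

int main( void ){
        double x;
        /* Compare tanh(x) with its smoothed counterpart around the inflection point: */
        for( x = -1.0; x <= 1.0; x += 0.25 ){
                printf( "%5.2f\t%9.6f\t%9.6f\n", x, tanh( x ), smooth_once( tanh, x, 0.1 ) );
        }
        return 0;
}

Repeating the pass SMOOTH_ITERATIONS times, the way the program does, is easiest to do on a sampled array rather than through a function pointer.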

Let’s look at the data points from the smoothed vs. unsmoothed data sets again:

Coronavirus data before smoothing:

Coronavirus data after smoothing:

If you look at the first column of the first table versus the first column of the second table, you’ll notice that the values in the second table are slightly higher than the corresponding values in the first table. This is due to an unavoidable imbalance in the weighted average. If both neighbors are given the same weight, then, assuming the slope of the curve is not constant, one of them will end up having a greater effect on the new value of the middle point than the other. If the slope is positive and increasing or negative and decreasing, the point to the right will have the greater effect, whereas if the slope is positive and decreasing or negative and increasing, the point on the left will have the greater effect. In other words the neighbors are unbalanced, and the smoothing algorithm will shift the curve at least slightly towards the area of greatest slope.
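
Another way to see this imbalance is to write down how much a single smoothing pass moves the middle point. Using the same weights, with the two neighbors a distance h away on either side:

new value - old value = 0.8f(x) + 0.1f(x+h) + 0.1f(x-h) - f(x) = 0.1[f(x+h) + f(x-h) - 2f(x)]

For small h this is approximately 0.1h^2 f''(x): each point gets pulled in the direction the curve is bending, and the pull is stronger where the curvature, and therefore the imbalance between the two neighbors, is greater.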

This will probably be easier to understand with a visual:

Effects of smoothing as modeled by adjustments to the hyperbolic tangent function

Here Red represents the original sigmoid curve and Blue represents the sigmoid curve after application of the smoothing algorithm. This graph is deliberately distorted so that the difference between the two curves is more clearly visible, but the real situation is analogous. You can look at the figure and mentally trace sample triplets of points going up to and through the inflection point to convince yourself of the phenomenon I have just described. Basically the end result is that the sigmoid curve is collapsed inward vertically towards the inflection point.

You may notice that the inflection point itself doesn’t move. This is very convenient for us if we want to determine the x-value of that point. The reason it doesn’t move is that the graph is rotationally symmetrical about it, which means the pull-up/pull-down effect is also symmetrical about that point, resulting in a net shift of zero. We see a similar effect with relative extrema – a local maximum will be pulled down and a local minimum will be pulled up, and provided a large enough region of the graph is symmetrical about that point (as is the case with the bell curve), the x-value of the extremum won’t change.
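
As a quick sanity check at the inflection point of tanh(x) itself, which sits at x = 0: because tanh is an odd function, tanh(0.1) + tanh(-0.1) = 0, so

0.8tanh(0) + 0.1tanh(0.1) + 0.1tanh(-0.1) = 0 = tanh(0),

and the smoothing step leaves the value at the inflection point exactly where it was.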

Unfortunately this property does not hold for curves that are not symmetrical about their critical points, as we can see in the following figure:

Shifting of the local maximum due to application of a smoothing algorithm

This is a graph of two linear transformations of the second derivative function given in the equations above. The local maximum is in fact the zero point of the third derivative that we were looking for in the previous article. As you can see, when the smoothing algorithm is applied to the first curve, the local maximum is shifted not only downward but also to the left (back in time). The x-value only decreases by 0.003 here, but if this curve is scaled to the Coronavirus data and the smoothing function is applied a few dozen times, many small shifts of several hours each could add up to days or even weeks.
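
This effect is easy to reproduce numerically. The sketch below uses the plain (untransformed) second derivative of tanh rather than the exact curves in the figure, applies the 0.8/0.1/0.1 weights once with a neighbor offset of 0.1, and scans a fine grid to locate the maximum of each curve; the function names are mine:

#include <stdio.h>
#include <math.h>

/* Second derivative of tanh(x); this is the model curve whose local
   maximum we are tracking. */
double d2( double x ){
        double s = 1.0 / cosh( x );  /* sech(x) */
        return -2.0 * s * s * tanh( x );
}

/* One application of the 0.8/0.1/0.1 smoothing weights, with neighbors
   offset by 0.1 as in the text above. */
double d2_smoothed( double x ){
        return 0.8 * d2( x ) + 0.1 * d2( x + 0.1 ) + 0.1 * d2( x - 0.1 );
}

int main( void ){
        double x, best_x1 = 0.0, best_y1 = -1e9, best_x2 = 0.0, best_y2 = -1e9;
        /* Scan a fine grid around the maximum and record where each curve peaks: */
        for( x = -2.0; x <= 0.0; x += 0.0001 ){
                if( d2( x ) > best_y1 ){ best_y1 = d2( x ); best_x1 = x; }
                if( d2_smoothed( x ) > best_y2 ){ best_y2 = d2_smoothed( x ); best_x2 = x; }
        }
        printf( "original maximum at x = %f\n", best_x1 );
        printf( "smoothed maximum at x = %f\n", best_x2 );
        return 0;
}

If the reasoning above is right, the smoothed maximum should come out slightly to the left of, and below, the original one.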

Before we explore ways to correct this problem, let’s examine exactly why this shift happens. This was a question I left open in the previous article in this series, but now I’ve finally figured it out! In this function, the section of the graph to the right of the maximum is decreasing a lot more sharply than the section to the left is increasing. This means that, in accordance with the collapsing effect I described previously, the points to the right will get pulled down further than the points to the left, resulting in a slight shift of the maximum value to the left.

Now that we know why this distortion occurs, let’s try to find a way to correct it. One method that I thought of is what I call the marking method. To illustrate this method, let’s look at some data sets with high levels of smoothing…

Coronavirus data with 20 iterations of smoothing

Here we have 20 iterations of smoothing. Notice that the program now prints the date at the beginning of each line; that will come in handy right now. In this data set the maximum point for the rightmost column is at 3/14 and the zero point is at 3/19. Due to the shape of the curve, the maximum point will always fall shortly before the zero point, so we can roughly predict one from the other.

Because we can say with certainty that the maximum and zero points in this data set are earlier than in any less-smoothed data set, we can mark off these points and discount any earlier points in further data sets as we work our way down to lower and lower levels of smoothing.
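
A rough sketch of how this marking could look in code (none of this is part of the program above; the function name and the sample values exist purely for illustration):

#include <stdio.h>

/* Given the third-derivative column and the index already marked from a
   more heavily smoothed run, return the first zero crossing that is not
   earlier than that mark, ignoring anything before it. */
int find_zero_point( int *third_deriv, int lines, int marked ){
        int i;
        for( i = marked; i < lines; i++ ){
                if( third_deriv[i] <= 0 )
                        return i;
        }
        return -1;  /* no crossing after the mark yet */
}

int main( void ){
        /* Made-up values purely to exercise the function: the early dip below
           zero at index 2 is discounted because the mark starts at index 4. */
        int third_deriv[] = { 5, 3, -1, 2, 4, 1, -2, -3 };
        int marked = 4;
        printf( "zero point at index %d\n",
                find_zero_point( third_deriv, 8, marked ) );
        return 0;
}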

Now we calculate the data set for 10 iterations:

Coronavirus data with 10 iterations of smoothing

The points have moved just a little bit here, which is much different from what I saw the other day. I think this may be because I’ve added new data and the smoothed data set is starting to adjust itself as we move closer to the actual zero point.

Let’s do 5:

Coronavirus data with 5 iterations of smoothing

Here we can see a lot of irregularities in the data, but the numbers are only consistently negative after 3/20.

Now 2:

Coronavirus data with 2 iterations of smoothing

Here we can see that the rightmost column is all over the place and dips into the negative range at several points. However, because we’ve marked off all but the top three rows, we know that the zero point has to be at least that recent.

I hope you can see the rationale behind this method, as it allows us to discount certain places where the unsmoothed function dips below zero and get a better understanding of where the actual zero point is. Based on this data, I would say that the zero point is pretty close to today. Now, the USA, where I live, has the fastest-growing number of Coronavirus cases, and it’s also one of the world’s largest economies, so that might delay the bottoming-out point. But at this point I think we’re getting pretty close to the bottom. I’ve looked at the NASDAQ and the Dow and they both seem to be leveling off somewhat, which further corroborates my hypothesis. I don’t know about you, but I’ve certainly started buying up stocks.
