Corona-Chan Project Update: Smoothing the Prediction Function

In the previous article I talked about how I’m using sophisticated data analysis techniques to predict when the stock market will hit rock bottom due to the Coronavirus recession, pointing to the optimal time to invest in cheap stocks that are guaranteed to go up. In that article I shared a C program I wrote for calculating the first, second, and third derivatives of a data set showing cumulative Coronavirus cases excluding mainland China, and stated that the point where the third derivative hits zero indicates the approximate time when the market will start to bottom out. Specifically, the third derivative will cross the x-axis several days before the market hits rock bottom, giving me plenty of time to build my portfolio, well ahead of the people who wait for the market to start picking up to invest.

There was only one problem with my program. If we look at the output it generates, we can see what that problem is:

Coronavirus derivative data produced by original C program

As we can see, the third derivative bounces all over the place, making it impossible to tell when it will cross zero. In fact it’s negative in nearly half of the data points. We need something better. We need a way to iteratively smooth out these functions so that they behave more like the curves derived from differentiating a sigmoid function. Just to refresh your memory, they look something like this:

A sigmoid function and its first, second, and third order derivatives

It turns out that if we smooth out the sigmoid function, the derivatives automatically smooth out to an even greater degree. For example, the third derivative is literally all over the place in the original output, with no discernible pattern whatsoever, but with enough iterations of a simple smoothing algorithm we can reduce it to a function that behaves more or less like the purple curve shown in the above figure. That is to say, it starts out at zero, steadily increases, reaches a maximum, and then steadily decreases all the way back down to zero.
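
For reference, if we take the standard logistic function as the mathematical model (an assumption on my part; any sigmoid behaves similarly), the four curves in the figure are:

σ(x)    = 1 / (1 + e^(-x))
σ'(x)   = σ(1 - σ)
σ''(x)  = σ(1 - σ)(1 - 2σ)
σ'''(x) = σ(1 - σ)(1 - 6σ + 6σ²)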

To understand why this is possible, first we have to understand why the original functions are so wacky. Because we're drawing data points from real life, we can't have a perfect mathematical function; real life doesn't work that way when you're talking about large populations of people. The data varies ever so slightly from a true sigmoid function, and these anomalies are caused by random variables too numerous to take into account. The anomalies may be very slight, but they are magnified in the first derivative, magnified further in the second derivative, and magnified still more in the third. Thus the degree to which the function deviates numerically from the mathematical model (the error, if you will) grows exponentially with the order of differentiation.
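
To see this magnification concretely, here's a small standalone C sketch (not part of the prediction program; the curve scale and the ±1 noise model are made up purely for illustration). It builds a clean sigmoid, adds noise, then takes successive differences of both arrays and reports how far the noisy differences stray from the clean ones. The worst-case error can double with each order, since each difference subtracts two noisy values:


#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // compile with -lm for exp()

#define N 100

int main( void ){
        double clean[N], noisy[N];
        int i, order;
        srand( 42 );
        // Clean sigmoid scaled to ~1000 cases, plus noise in {-1, 0, 1}:
        for( i = 0; i < N; i++ ){
                clean[i] = 1000.0 / (1.0 + exp( -(i - N/2) / 8.0 ));
                noisy[i] = clean[i] + (rand() % 3) - 1;
        }
        // Difference both arrays in place, one order at a time,
        // tracking the worst gap between noisy and clean:
        for( order = 1; order <= 3; order++ ){
                double maxerr = 0.0;
                for( i = 0; i < N - order; i++ ){
                        clean[i] = clean[i+1] - clean[i];
                        noisy[i] = noisy[i+1] - noisy[i];
                        if( fabs( noisy[i] - clean[i] ) > maxerr )
                                maxerr = fabs( noisy[i] - clean[i] );
                }
                printf( "order %d: worst error %.1f\n", order, maxerr );
        }
        return 0;
}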

The upside is that we can use this magnification property to our advantage. If we smooth the sigmoid function just a little, we see increasingly greater degrees of smoothing in each successive derivative. So all we have to do is smooth the function enough times that the derivative functions converge to their mathematical models. We know more or less what we want, at least during this accelerative phase of the Coronavirus outbreak: a sigmoid curve that is still increasing exponentially, first and second derivatives that are monotonically increasing, and a third derivative that increases monotonically up to a certain point and decreases monotonically after that. We want the third derivative to start dropping first, then the second, then the first. Having implemented a smoothing algorithm and carefully observed the output, I can say that, given enough iterations, this is exactly what we get.
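
For the record, here's the kind of check I do by eye on the third derivative, written out as a hypothetical helper (it's not in the program below, but it captures the rise-then-fall shape I just described). Calling it on the third-derivative array and getting 1 back would mean the curve has converged to the shape we want:


// Returns 1 if the data climbs monotonically to a single peak and
// falls monotonically after it, 0 otherwise (plateaus are allowed).
int rises_then_falls( int *data, int n ){
        int i = 1;
        // Walk the rising phase:
        while( i < n && data[i] >= data[i-1] ) i++;
        // Everything after the peak must be non-increasing:
        while( i < n && data[i] <= data[i-1] ) i++;
        return i == n;
}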

First we need to define our smoothing function. I chose to take a weighted average of the current point and its two neighboring points.


#define smooth( dataset, i )\
((int) (0.8 * (float) dataset[i] + 0.1 * (float) dataset[i+1] + 0.1 * (float) dataset[i-1]))
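
One detail worth noting: the three weights sum to 1.0, so repeated smoothing preserves the overall scale of the data, though the cast back to int truncates a fraction of a case on every pass. Also, because this is a function-like macro, the name doesn't clash with the smooth array used later in the program; the preprocessor only expands smooth when it's immediately followed by an opening parenthesis.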

I chose a rather high weight for the given point and low weights for its neighbors, so that each iteration only smooths the function very gently. My rationale was that each iteration increases the sphere of influence of each data point. In the first iteration, a point is pulled upward or downward by its two neighbors. In the second iteration, the same thing happens, except the two neighbors have also been pulled up or down by their neighbors, so the point is now influenced by its neighbors' neighbors as well. With each successive iteration the values propagate further through the function. Using only small weights for the neighbors allows us to run large numbers of iterations, letting the smoothing algorithm naturally take advantage of this propagation pattern.
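
We can make the propagation concrete by treating the weights as a convolution kernel and repeatedly convolving [0.1, 0.8, 0.1] with itself (a standalone sketch, again purely for illustration). After n passes, each point is effectively a weighted average of its 2n nearest neighbors, with the weights spreading out a little more each time:


#include <stdio.h>

// Show how the smoothing kernel [0.1, 0.8, 0.1] widens with
// each pass by convolving it with itself.
int main( void ){
        double kernel[16] = { 0.1, 0.8, 0.1 };
        double next[16];
        int pass, i, len = 3;
        for( pass = 1; pass <= 3; pass++ ){
                printf( "after %d pass(es):", pass );
                for( i = 0; i < len; i++ ) printf( " %.4f", kernel[i] );
                putchar( '\n' );
                // Convolve the kernel with [0.1, 0.8, 0.1] for the next pass:
                for( i = 0; i < len + 2; i++ ){
                        next[i] = 0.8 * ((i > 0 && i <= len) ? kernel[i-1] : 0.0)
                                + 0.1 * ((i >= 2) ? kernel[i-2] : 0.0)
                                + 0.1 * ((i < len) ? kernel[i] : 0.0);
                }
                len += 2;
                for( i = 0; i < len; i++ ) kernel[i] = next[i];
        }
        return 0;
}


After two passes, for example, the effective kernel is [0.01, 0.16, 0.66, 0.16, 0.01]: each point is already a weighted average of five points.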

Next we need to define a macro for the number of iterations:


#define SMOOTH_ITERATIONS 3

At this iteration level, the smoothing algorithm is not very effective, but it’s a start. Basically we have to zero in on a value that is high enough to make the functions roughly converge to their mathematical models.

The code that uses the iteration algorithm looks something like this:


// Smooth original data set:
for( i = 0; i < SMOOTH_ITERATIONS; i++ ){
        for( j = 1; j < lines-1; j++ ){
                smooth[j] = smooth( derivatives[0], j );
        }
        // Copy smooth data set back to original array:
        for( j = 1; j < lines-1; j++ ){
                derivatives[0][j] = smooth[j];
        }
}


As you can see, this code applies the smoothing algorithm several times, writing the smoothed data points to a separate array and only then copying them back to the original array. The separate array matters: it ensures each pass reads only values from the previous pass, rather than a mix of already-smoothed and not-yet-smoothed neighbors.

Here is the program in its entirety:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <ctype.h>

#define SMOOTH_ITERATIONS 14

#define smooth( dataset, i )\
((int) (0.8 * (float) dataset[i] + 0.1 * (float) dataset[i+1] + 0.1 * (float) dataset[i-1]))

int main( int argc, char **argv ){
        FILE *fp;            // File pointer
        char buf[20];        // Buffer for reading the file
        int *derivatives[4]; // Derivative data points
        int *smooth;         // Array for smoothed data set
        int lines = 0;       // Number of data points in each array
        int i, j, k;         // Loop counters
        int len;             // String length
        // Open file and handle any errors:
        if( !(fp = fopen( argv[1], "r" )) ){
                fprintf( stderr, "%s: %s: %s\n", argv[0], argv[1], strerror( errno ) );
                exit( errno );
        }
        // Count lines in file:
        while( fgets( buf, 20, fp ) ){
                lines++;
        }
        // Set up arrays of data points (each derivative has one
        // fewer point than the array it's derived from):
        for( i = 0; i < 4; i++ ){
                derivatives[i] = (int *) malloc( sizeof( int ) * lines-- );
        }
        lines += 3;
        smooth = (int *) malloc( sizeof( int ) * lines );
        rewind( fp );
        // Read original data set from file:
        for( i = 0; i < lines; i++ ){
                fgets( buf, 20, fp );
                len = strlen( buf );
                for( j = 0; j < len; j++ ){
                        // Strip control characters so they don't
                        // interfere with atoi():
                        if( !isdigit( buf[j] ) ) buf[j] = '\0';
                }
                derivatives[0][i] = atoi( buf );
        }
        // Smooth original data set:
        for( i = 0; i < SMOOTH_ITERATIONS; i++ ){
                for( j = 1; j < lines-1; j++ ){
                        smooth[j] = smooth( derivatives[0], j );
                }
                // Copy smooth data set back to original array:
                for( j = 1; j < lines-1; j++ ){
                        derivatives[0][j] = smooth[j];
                }
        }
        // Outer loop loops through array of arrays:
        for( i = 1; i < 4; i++ ){
                lines--;
                // Calculate derivatives:
                for( j = 0; j < lines; j++ ){
                        derivatives[i][j] = derivatives[i-1][j] - derivatives[i-1][j+1];
                }
                // Smooth data set, reusing the buffer allocated above
                // (it's big enough for the shorter arrays too):
                for( j = 0; j < SMOOTH_ITERATIONS; j++ ){
                        for( k = 1; k < lines-1; k++ ){
                                smooth[k] = smooth( derivatives[i], k );
                        }
                        // Copy smooth data set back to original array:
                        for( k = 1; k < lines-1; k++ ){
                                derivatives[i][k] = smooth[k];
                        }
                }
        }
        // Print all four columns; the last three data points of the
        // longer arrays get dropped because it makes programming easier.
        for( i = 0; i < lines; i++ ){
                for( j = 0; j < 4; j++ ){
                        printf( "%8d\t", derivatives[j][i] );
                }
                putchar( '\n' );
        }
        // Free up memory:
        for( i = 0; i < 4; i++ ){
                free( derivatives[i] );
        }
        free( smooth );
        fclose( fp );
        return 0;
}
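
Assuming the source is saved as smooth.c and the data file holds one cumulative case count per line (both names are just my own conventions here), compiling and running looks like this:

gcc -o smooth smooth.c
./smooth cases.txt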

Here I’ve used the optimum value of 14 for the iteration level. I arrived at this value by simply testing different values and zeroing in on the correct one.

Let’s see what happens when we use 3 iterations:

Smoothing function output with 3 iterations

Okay, the first derivative is now monotonically increasing, and we've eliminated the negatives in the second derivative as well as most of the negatives in the third, so we've definitely made progress. But the third derivative is still bouncing around too much to provide any useful information.

Let’s try 20:

Smoothing function output with 20 iterations

This has eliminated the anomalies, but now the third derivative seems to approach zero too quickly. It turns out there is a trade-off between the shape of the curve and the accuracy of the zero point: the higher the iteration count, the further the zero point gets pulled back in time. So it's impossible to get the curve shape we want without some loss of accuracy in the zero point introduced by the smoothing. I have no idea why this happens. If anyone in the comments has an answer, please let me know, because it's a very interesting phenomenon that I don't understand.

So as you can see, we need a happy medium between the two extremes. How about 12:

Smoothing function output with 12 iterations

Here we’ve still eliminated all negatives, and the third derivative doesn’t seem to approach zero too fast, but it’s still oscillating somewhat. We want it to be monotonic increasing until the maximum and then have it be monotonically decreasing until it hits zero and becomes negative.

Here is the output for 14 iterations:

Smoothing function output with 14 iterations

This data set has, or at least comes close to, the properties we want. We could increase it to 15 to make the curve conform to the model a little more, but at this point it doesn't really matter.

So now we have a convenient function that will predict with some accuracy when the Coronavirus epidemic will hit its peak, at which point the stock market will be at its lowest point. I’ll be keeping an eye on those numbers and keeping some cash handy to buy up stocks when that happens.
