So a few days ago I talked about my recent obsession with data and how I started tabulating, charting, and analyzing SEO and similar data in Microsoft Works. Now I will talk about the subsequent data analyses that I did using the R programming language. Let’s get to it…

What is R? I would describe it as a functional, declarative programming language designed for working with large amounts of statistical data. It is optimized for statistical operations, and isn’t really good for much else, which would qualify it as a domain-specific language. I would also argue that it can be classified with so-called “very high-level” languages like awk. Due to its domain-specific nature it is highly declarative, basing its syntax largely around abstract vector operations, and although it does provide the same decision statements, loop statements, etc. present in other languages, they are used quite a bit less in R. In fact some would debate whether R even counts as a programming language, given that structurally many R scripts more closely resemble batch files in languages like SQL than actual scripts. One thing that’s for certain is that R focuses a lot more on specialized high-level operations on entire data sets than on building algorithms from simpler primitives.

Now that I’ve given my personal take on the taxonomy of R, I’ll move on to my own programming efforts with this language. I’ve studied R a couple times in the past but never really retained the knowledge, partly because I only did a cursory overview of the language, and partly because I just didn’t find much practical use for it at the time. Now I’m teaching myself R again because I want to use it to analyze data from my WordPress analytics page to try to gain insights and make predictions. I will be focusing on linear regression here because at the moment that’s the only machine learning problem I know how to solve in R (I haven’t quite figured out things like KNN and K-Means yet, nor have I found a practical use for those algorithms).

First I want to revisit the data set that I graphed in Microsoft Works last time:

One of the reasons I wanted to relearn R (aside from just the fact that it’s a neat language) was because I found the data analysis capabilities of Microsoft Works wanting. It’s great for visualizing data in a way that allows you to pick out patterns with your eyes, but there’s no real way to automate pattern recognition, which is something I would need if I want to make accurate predictions. I wanted to do a linear regression on this data, because although it goes all over the place, it does approximate to a definite linear increase over time, and I wanted to figure out what this linear increase looked like and what its exact parameters were. To accomplish this, I copied the data (by hand) into a CSV file, like so:

```
"Index","Month","Returns"
1,"19-05",38
2,"19-06",47
3,"19-07",17
4,"19-08",27
5,"19-09",38
6,"19-10",71
7,"19-11",47
8,"19-12",56
9,"20-01",23
10,"20-02",32
11,"20-03",66
12,"20-04",43
13,"20-05",67
14,"20-06",99
15,"20-07",60
16,"20-08",67
```

I then wrote the following R script to plot a line graph of the data and then perform a linear regression on it, plotting the regression line over the raw data:

```
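#!/usr/bin/env Rscript

# Load the hand-entered data: Index (1..n), Month ("YY-MM"), Returns
dup <- read.csv( "dup.csv", header = TRUE, sep = ",", quote = "\"" )
# Fit Returns as a linear function of Index
lm.dup <- lm( Returns ~ Index, dup )
# Set up the axes without drawing anything (type = "n"), then draw the series
plot( dup$Month, dup$Returns, col = "#0000ff", type = "n", pch = 19, xlab = "Month", ylab = "Return Hits" )
lines( dup$Month, dup$Returns, col = "#0000ff", type = "o" )
# Overlay the regression line in red
abline( lm.dup, col="#ff0000" )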
```

I then ran this code from within the R shell. The following is the graphical output of the script, exported to a PNG file:

A few things I want to mention here. First, the format of the CSV file is obviously rather odd. This has to do with a history of attempts to make the graph come out right. I found that when I used the month as the independent variable, R plotted the *x*-coordinates in the wrong order, causing the graph to come out all wonky. I changed that column so that the months would be in numerical order, and then eventually just added an index column and used that as the independent variable. The Month column is there for visual reference, and to label the *x*-axis.
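As an aside, one possible way to keep months in calendar order without a separate index column is to parse them into `Date` values. This is only a minimal sketch, assuming the `YY-MM` strings from the CSV above and assuming every year is in the 2000s; the variable names are made up for illustration.

```r
# Month strings in the "YY-MM" form used in the CSV above
month <- c( "19-05", "19-06", "19-07", "20-01", "20-02" )

# Assumption: all years are 20YY. The day is pinned to the 1st so that
# as.Date() receives a complete "YYYY-MM-DD" string.
month_date <- as.Date( paste0( "20", month, "-01" ) )

# Dates compare chronologically, so sorting and plotting follow calendar order
print( month_date )
print( order( month_date ) )
```

Plotting against `month_date` instead of the raw strings would then give a numeric, chronologically ordered axis.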

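Also, R put small horizontal line segments over each of the points of the line chart. I didn’t know why it did this, and I couldn’t figure out how to fix it in the code. A likely cause is that `read.csv` loaded the Month column as a factor (the default before R 4.0), and plotting a numeric variable against a factor makes R draw a box plot, which collapses to a short horizontal line when each month has only one value. The segments disappeared once I ran the later script from Rscript rather than from the REPL, as you’ll see later; that script also plots against the numeric index column, which fits this explanation.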

The final issue, of course, is that this script is not very flexible. I managed to graph a single data set and its linear regression, but what if I want to apply it to other data sets? I would have to write a separate ad hoc script for each data set I want to analyze. To address this problem, I wrote the following script, which is a generic linear regression solver in R:

```
#!/usr/bin/env Rscript
# Command syntax:
#   regression.r [--noheader] filename.csv
# CSV file must be in one of two formats:
#   independent,dependent
#   index,independent,dependent
# CSV header is optional

# trailingOnly = TRUE keeps only the arguments that follow the script name
args <- commandArgs( trailingOnly = TRUE )
h <- TRUE # CSV file has header?
for( arg in args ){
    if( arg == "--noheader" ){
        h <- FALSE
    } else {
        f <- arg
    }
}
data <- read.csv( f, header = h )
fields <- colnames( data ) # Used for labeling axes
if( ncol( data ) == 2 ){ # No index field
    colnames( data ) <- c( "x", "y" )
}
if( ncol( data ) == 3 ){ # Index field
    colnames( data ) <- c( "x", "q", "y" )
}
lm.data <- lm( y ~ x, data )
plot( data$x, data$y, col = "#0000ff", type = "n", pch = 19,
      # I got an error message if I didn't supply xlim and ylim
      xlim = c( min( data$x ), max( data$x ) ),
      ylim = c( min( data$y ), max( data$y ) ),
      xlab = fields[ncol( data )-1],
      ylab = fields[ncol( data )] )
lines( data$x, data$y, col = "#0000ff", type = "o" )
abline( lm.data, col="#ff0000" )
lm.data # Print slope and intercept
```

I had to compromise on this script of course, because now I get an error message if I don’t supply the `xlim` and `ylim` parameters (for some reason that now results in an infinite domain and range). So I had to scrap the labels and only use the *x* column as a visual reference for those reading and writing the CSV file. Maybe in the future I will modify this script so that it labels the horizontal axis with the *x* values rather than the index, but at the moment I don’t know how to do that (I’m still very much an R rookie).
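For what it’s worth, base R graphics can label the horizontal axis with arbitrary strings by suppressing the default axis and drawing one by hand. The following is a rough sketch rather than a drop-in patch: the small data frame is a hypothetical stand-in for the first rows of `dup.csv`, and `pdf( NULL )` is only there so the example runs without opening a window or writing a file.

```r
# Hypothetical stand-in for the first rows of dup.csv, using the same
# column names the script assigns (x = index, q = month label, y = value)
data <- data.frame( x = 1:4,
                    q = c( "19-05", "19-06", "19-07", "19-08" ),
                    y = c( 38, 47, 17, 27 ) )

pdf( NULL )  # discard the drawing; remove this line to see the plot
plot( data$x, data$y, type = "o", pch = 19, col = "#0000ff",
      xaxt = "n",                       # suppress the default numeric x-axis
      xlab = "Month", ylab = "Return Hits" )
# Draw the x-axis manually: ticks at the index positions, labelled with q
axis( side = 1, at = data$x, labels = as.character( data$q ) )
invisible( dev.off() )
```

The regression itself would still run against the numeric `x` column; only the tick labels change.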

Another thing you’ll notice is that I added a line to print the parameters of the linear model. In R, an expression written on a line by itself at the top level is evaluated and its value is printed automatically, so all you have to do is write the name of the variable and R will dump a summary of it to the console. In this case it prints the call along with the slope and intercept. (The model object itself holds more than these two numbers, including the residuals and fitted values, but the default printout shows only the call and the coefficients.)
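If the slope and intercept are needed as numbers instead of console text, `coef()` extracts them from the fitted model. Here is a small sketch using the values from the CSV listing earlier in the post; it also cross-checks the result against the closed-form least-squares formulas. Note that the coefficients it produces differ from the console output further down, which was generated from updated data.

```r
# Data copied from the CSV listing earlier in the post
dup <- data.frame(
  Index   = 1:16,
  Returns = c( 38, 47, 17, 27, 38, 71, 47, 56,
               23, 32, 66, 43, 67, 99, 60, 67 ) )

fit <- lm( Returns ~ Index, data = dup )

# The fitted object is a list with many components, not just two numbers
print( names( fit ) )

# coef() returns the named coefficient vector
b <- coef( fit )
intercept <- b[["(Intercept)"]]
slope     <- b[["Index"]]

# Cross-check with the closed-form least-squares formulas:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope_manual     <- cov( dup$Index, dup$Returns ) / var( dup$Index )
intercept_manual <- mean( dup$Returns ) - slope_manual * mean( dup$Index )

cat( sprintf( "intercept = %.3f, slope = %.3f\n", intercept, slope ) )
# prints: intercept = 27.400, slope = 2.644
```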

If we run this script on the duplicate hit data from the CSV file, we get the following graphical output:

Along with the following console output:

```
$ Rscript regression.r dup.csv
Call:
lm(formula = y ~ x, data = data)
Coefficients:
(Intercept)            x
     24.551        3.135
$
```

Here `x` stands for the slope and `(Intercept)` obviously stands for the intercept. (Also the shape of the graph is somewhat different than before because it’s been updated for new data.) Graphing this line on my calculator I can predict what my reader retention will be like at various points in the future. For example, the model shows that a year from now I will be getting about 112 duplicate hits per month. That doesn’t seem like much, but that’s because the number only increases by 3.135 hits per month. The next thing I would need to do is figure out ways to increase this slope, possibly by posting content more regularly and putting myself out there more – both things I was doing quite a lot of in the beginning (I got a shit-ton of duplicate hits in the first three months that this blog was up) but have neglected to do recently.
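The same projection can be done in R instead of on a calculator. This is a minimal sketch that plugs in the coefficients printed above; the future index of 28 is an assumption (16 months of data plus 12), chosen because it reproduces the "about 112" figure.

```r
# Coefficients as printed in the console output above
intercept <- 24.551
slope     <- 3.135

# Assumption: 16 months of data so far, so "a year from now" is index 28
future_index <- 16 + 12

predicted <- intercept + slope * future_index
print( predicted )  # 112.331
```

With the fitted model object still in memory, `predict( lm.data, newdata = data.frame( x = 28 ) )` should give the same figure without copying the coefficients by hand.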

One question we can answer with linear regression is whether increased reader retention is a result of the loyalty of individual readers, or just an increase in the sheer number of readers. In other words, are we seeing an increase in quality, or quantity? To find out, we plot the Views Per Visitor for each month, which can easily be found in the WordPress analytics. This is much like the previous data set, except instead of subtracting, we’re dividing. Plotting the data for the last 15 months gives us the following regression:

We don’t need the console output to tell that the slope is negative. Average reader retention is decreasing, even though the overall number of duplicate hits is increasing. This suggests that the growth in duplicate hits comes from having more readers, not from each reader coming back more often. It’s quantity, not quality. This is probably a result of me, as I said before, not doing what I need to do to engage my audience. I should probably post more regularly, and try to interact with the community a little more. That in combination with the existing increase in overall turnout should lead to a larger audience.

It’s also interesting to look at the raw number of hits per month, because this has been increasing in a fairly linear fashion as well. Here’s the graphical output of running this data through the R script:

And here’s the console output:

```
$ Rscript regression.r raw-hits.csv
Call:
lm(formula = y ~ x, data = data)
Coefficients:
(Intercept)            x
      26.55        17.66
$
```

Extrapolating this linear increase into the future, I can predict that a year from now I will be getting just over 500 hits per month. Of course that’s assuming I do nothing in the meantime and just let my site grow on its own.

The last linear regression I want to do is on the search hit data. Now I have one post that is currently increasing exponentially in terms of search hits, but if we factor that one out as an outlier and just look at the other ones – the ones that haven’t taken off yet – we find that the increase is roughly linear.

```
$ Rscript regression.r search-hits.csv
Call:
lm(formula = y ~ x, data = data)
Coefficients:
(Intercept)            x
    -10.826        6.455
$
```

Extrapolating into the future, we can predict that a year from now I will be getting just under 200 hits per month from search engines alone, not counting the post that has already taken off and also assuming none of my other posts take off.

Actually, it’s interesting to note that the curve is mostly above the line for the first five months, then it’s mostly below the line for the next ten months or so, and then it’s mostly above the line after that. This means that the increase is actually slightly more than linear. In other words, if we were to perform a proper regression, we would more likely get a linearithmic curve. I’ve tested a few functions on my calculator and found that the proper regression curve is roughly *y* = 2*x* ln(*x*). This puts the projection for one year from now at about 220 search hits. Of course it’s entirely possible to fit a linearithmic curve with the same least-squares approach that’s used for linear regression, because *y* = *a* *x* ln(*x*) is still linear in the coefficient *a*. (R’s `lm` solves the least-squares problem directly; it does not use gradient descent.) This doesn’t require dropping down to a lower-level language like C either: `lm` accepts transformed predictors in its formula, and `nls` handles models that are non-linear in their parameters.
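For what it’s worth, a curve of this shape can be fitted in R itself, since the model is linear in its single coefficient. The sketch below uses hypothetical, noise-free data generated from *y* = 2*x* ln(*x*) purely to show the mechanics; these are not the actual search-hit numbers, and the future index of 32 is likewise an assumption.

```r
# Hypothetical, noise-free data generated from y = 2 * x * ln(x);
# NOT the blog's real search-hit data
x <- 1:20
y <- 2 * x * log( x )

# y = a * x * ln(x) is linear in a, so lm() can fit it directly.
# I() protects the arithmetic inside the formula; "- 1" drops the intercept.
fit <- lm( y ~ I( x * log( x ) ) - 1 )
a <- unname( coef( fit ) )
print( a )  # recovers a coefficient of (very nearly) 2

# Projection at an assumed future index of 32
print( a * 32 * log( 32 ) )
```

On real, noisy data the recovered coefficient would not be exactly 2, and comparing this fit’s residuals against the straight-line fit would show whether the linearithmic model is actually the better description.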

So basically, the point to take from this is that you can get a lot of relevant SEO information about a WordPress site by just performing linear regressions on the analytics data. I definitely understand my site’s growth a lot better for it, and I’ve hit on some definite ways that I can improve.