Data Analysis: Retro Edition with Microsoft Works 3.0

So recently I’ve started getting into some more data-oriented stuff – statistics, regression analysis, etc. It started with some spreadsheets and graphs that I was making in Microsoft Works and gradually transitioned into data analysis using R and eventually into actual AI programming. To chronicle this recent evolution properly, I think it best to start at the beginning, with the work I was doing this past summer tabulating, charting, and analyzing data from my WordPress site in the Microsoft Works spreadsheet program.

One of the first things I wanted to do was look at the top posts on my WordPress site (in terms of hit count) and see if I could find any patterns in terms of their growth over time. One thing I wanted to figure out was, if I had a particular post that seemed to blow up, could I trace it back to a particular moment in time when that topic may have been trending, or was it just a gradual and steady climb over several months with no definite take-off point? The image above shows the data I tabulated for my top six posts (Microsoft Works will only graph up to six items at a time). Below is the bar chart I made for this data.

As you can see, most of the growth in my top posts is attributable to gradual accumulation of hits over time, which caused them to be viewed increasingly favorably by Google. However, we can see a few other things if we look closely. For one, the Arch Linux battlestation post (ARCH HKR) dropped sharply in October of last year when I removed it from my homepage, but then returned to its previous level over the next couple months, eventually becoming the fourth most successful post on my site. The one on hacking images saw some steady growth in late 2019, then went AWOL for an entire month in early 2020 before shooting up to first place over the next few months. Observing patterns like this can give us insight into Google’s page-ranking behavior, which ultimately determines which pages get the most hits.

Here we can see the next six articles (with QBASIC 1 omitted because it doesn’t tell us anything that we don’t already know from QBASIC 3). The growth of these posts is somewhat more irregular, with some of them actually declining in recent months. This is to be expected, as we are looking at posts with relatively few hits (which describes the vast majority of stuff I post; think of the 80-20 rule). One thing we can see is that the article on OpenVPN got a single hit in the month it was posted, then got completely ignored for a couple of months, then started increasing rather sharply towards the end of 2019. This shows that it is entirely possible for Google to suddenly start paying attention to a page that has had almost no traffic. It also reflects the fact that WordPress posts often take several weeks or even several months to “mature” to the point where Google actually considers them worth sending people to. I haven’t figured out why Google does this, but it does it with pretty much everything I post. This only happens with Google though. Alternative search engines like Bing and DuckDuckGo have no problem sending people to an article that I just posted yesterday. Unfortunately that’s not where the majority of my traffic comes from, so…

Another 100% bar chart. This breaks down the number of hits over four weeks (excluding the one article I had that took off; I didn’t want an outlier skewing the data) so that I can see what topics are trending in Google search. If there’s a sudden uptick in the portion of hits going to a particular topic, then that means that topic is trending. I don’t look at the raw percentage, because that’s mostly dependent on search engine inertia; it doesn’t tell me anything about current trends. It’s the changes that reveal trends in search engines. Here we see an uptick in searches for DOS/retrocomputing-related posts and a downtick in searches for security-related posts. So the natural thing to do would be to post something DOS-related to try to ride that wave. I’m still debating whether this is actually an effective method of SEO, because it’s only led to marginal success so far. Part of this is due to the sheer randomness of search engine data, and part of it is due to the unpredictability of how long these trends will actually last.
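For anyone who wants to reproduce this kind of share-of-hits comparison outside a spreadsheet, here is a minimal Python sketch. The topic names and hit counts are made up purely for illustration (they are not data from my site), and the function names are hypothetical. It converts raw weekly hits into each topic's share of the week's total, then reports the week-over-week change in percentage points, which is the number that would signal an uptick or downtick.

```python
def topic_shares(weekly_hits):
    """Turn raw hit counts per topic into each topic's share of that week's total."""
    shares = []
    for week in weekly_hits:
        total = sum(week.values())
        shares.append(
            {topic: (hits / total if total else 0.0) for topic, hits in week.items()}
        )
    return shares


def share_changes(shares):
    """Week-over-week change in each topic's share, in percentage points."""
    changes = []
    for prev, curr in zip(shares, shares[1:]):
        topics = set(prev) | set(curr)
        changes.append(
            {t: 100 * (curr.get(t, 0.0) - prev.get(t, 0.0)) for t in topics}
        )
    return changes


# Illustrative numbers only (assumed, not real analytics data).
weekly_hits = [
    {"dos": 20, "security": 60, "linux": 20},
    {"dos": 30, "security": 50, "linux": 20},
]

changes = share_changes(topic_shares(weekly_hits))
for topic, delta in sorted(changes[0].items()):
    print(f"{topic}: {delta:+.1f} percentage points")
```

In this toy example the raw share for "security" is still the largest, but the change is what matters: "dos" gains share while "security" loses it.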

One weekend in the second half of August, there was a sizable traffic surge on my site, followed by what seems to be a permanent increase in daily traffic. This is shown in the graph above. Here we can see that the surge affected DOS-related articles far more than anything else. So I decided to post something DOS-related in the wake of that surge. I managed to get the initial hits for that post needed to get it indexed by Google, so I would call it a marginal success. It didn’t skyrocket by any means though.

Here’s a pie chart representation of the breakdown of the traffic surge. This only shows data collected from Saturday and Sunday of that week. And yeah, the colors used for the categories in this chart are different from the ones used in the bar chart, so sue me.

The pie chart format is also useful for breaking down my purchases on any given grocery run.

During this week I found that I ran out of food rather quickly, whereas in previous weeks I had no trouble making it last the entire week. Looking at the pie chart you can see why. The moral of the story: Meal food can be stretched out much longer than snack food. Buy snack food sparingly.

At one point I thought I would look at another dimension of my blog’s health: the diversity of hits. By this I mean the total number of hits to posts that weren’t in the top 13. As you can see, June was a very good month by this metric. I haven’t really figured out what factors influence hit diversity though; the month-to-month variation seems pretty random. I will say, however, that if we consider the fact that diverse hits don’t seem to be increasing over time (at least not by much), it’s safe to say that the vast majority of growth in my site has been concentrated on a small handful of posts. That’s an insight that I think makes the tabulation of this particular data set worthwhile.
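The diversity figure is simple enough to compute by hand, but here is a small Python sketch of it for reference. It assumes the set of top posts is fixed in advance (as with the 13 posts tabulated earlier) rather than re-ranked every month; the post names and counts below are invented for illustration, and the function name is hypothetical.

```python
def diverse_hits(hits_by_post, top_posts):
    """Total hits for the month that went to posts outside the given top set."""
    return sum(hits for post, hits in hits_by_post.items() if post not in top_posts)


# Illustrative data only (assumed): one month's hits, with a "top" set of two posts.
june_hits = {"post-a": 400, "post-b": 250, "post-c": 12, "post-d": 7, "post-e": 3}
top_posts = {"post-a", "post-b"}

print(diverse_hits(june_hits, top_posts))
```

Running this once per month and plotting the results gives the same kind of series as the chart described above.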

Then of course there’s the question of skyscraper content vs. evergreen content. I’ve noticed (though I don’t have solid proof of this) that Google prefers evergreen content, and will judge whether something is skyscraper or evergreen based on the distribution of hits over time. I believe this is why my Arch Linux battlestation post was so successful. It was at the top of my homepage for like four months while I was on vacation last year, and thus accumulated a decent number of hits spread over those months. In the above figure I tabulated hits for my six most recent posts in an attempt to predict whether Google would label them as skyscraper or evergreen. This obviously isn’t a foolproof prediction method, but it’s the best I have at the moment.

I also wanted to look at the number of duplicate hits per month. This figure is obtained by simply subtracting the number of visitors from the number of views. This is an important figure to look at because it’s an indication of the level of reader retention, that is, the extent to which readers are revisiting the site and/or reading multiple articles in one visit, and this in turn is a good way of measuring the loyalty of my readership. Keep in mind I’m talking about reader loyalty, not follower loyalty, here. My actual followers tend to ignore me most of the time, probably because I write articles on a variety of different topics, and only maybe 10% of what I write appeals to any one follower’s specific interests. Most of my readers come from Google, not WordPress, and it’s these readers that I want to focus on snagging.
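As a quick illustration of that subtraction, here is a short Python sketch. The monthly view and visitor numbers are placeholders I made up for the example, not figures from my stats page, and the function name is hypothetical.

```python
def duplicate_hits(views, visitors):
    """Duplicate hits for a period: views that did not come from a new visitor."""
    if visitors > views:
        raise ValueError("visitors cannot exceed views")
    return views - visitors


# (month, views, visitors): illustrative values only (assumed).
monthly_stats = [
    ("2020-06", 1200, 900),
    ("2020-07", 1500, 1000),
]

duplicates = {month: duplicate_hits(v, u) for month, v, u in monthly_stats}
print(duplicates)
```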

As you can see I switched to a line graph rather than a bar graph for this one. At this point I also moved a large portion of my data analysis operation over to R, mostly because Microsoft Works doesn’t let you do linear regression (which this particular graph was just begging for). So this is where I will leave off for now. Later I will talk about data analyses I did in R, as well as an R script that I wrote to perform linear regression analysis on analytics data read from a CSV file. Stay tuned.
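In the meantime, for readers unfamiliar with what linear regression actually computes, here is a minimal ordinary least squares sketch in Python. This is not the R script mentioned above, just a hand-rolled illustration; the function name and the sample points are assumptions chosen so the arithmetic is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points only (assumed): month index vs. some monthly count.
months = [1, 2, 3, 4]
counts = [3, 5, 7, 9]

slope, intercept = fit_line(months, counts)
print(f"trend: {slope:.2f} per month, starting from {intercept:.2f}")
```

The slope is the part of interest for a graph like the one above: it estimates how much the figure grows per month on average.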
