Analyzing Twitter Analytics Data in R

I’ve spent a lot of time trying to figure out how to become popular on Twitter. It’s pretty much impossible if you’re not already a celebrity or public figure, but I figured I’d try to maximize my success anyway. I did some data analysis using CSV data that I exported from the Twitter Analytics page. Analytics is a lesser-known feature of Twitter that is (as far as I can tell) only available on the desktop version, but it can give you some great insights into how your tweets are performing.

The language I used for analyzing the Twitter Analytics data was R. I've written a number of scripts to try to find patterns in my more successful tweets. The workflow goes like this: I run the CSV file with the analytics data through an R script, then pipe the output through a collection of small Unix scripts that convert it into Markdown, which I upload to the Deep Web repository on my GitHub (Deep Web meaning only I have access to it). GitHub renders the Markdown for me, so I can manually go through the list of tweets and count instances of a certain pattern to see which ones are most conducive to success.

There are a number of metrics Twitter Analytics uses to measure tweet success, but I'm primarily interested in profile clicks and link clicks, because my Twitter is basically a conduit to get more people to visit my WordPress site and my GitHub portfolio. I'm trying to siphon traffic from the larger community of Twitter to content of mine that's hosted on smaller platforms like WordPress and GitHub. That's the entire reason I'm on Twitter in the first place.

The data analyses I do typically start with a hypothesis about what makes a tweet successful. Then I use an R script and the analytics data in CSV form to test the hypothesis. For example, one of my hypotheses was that I can get more profile clicks by mentioning content that I have on other sites besides Twitter. I export tweet data from the past 28 days and run it through the following R script:

#!/usr/bin/env Rscript
# Finds the median profile click count of the tweets in a CSV
# file and then prints lists for tweets that are above and
# tweets that are at or below the median.

file <- commandArgs( trailingOnly = TRUE )[1]
tweets <- read.csv( file, header = TRUE )

# Convert the profile.clicks column to a numeric vector
clicks <- as.numeric( tweets[, "profile.clicks"] )

median.clicks <- median( clicks )

# Find subset of table where profile.clicks > median.clicks
print( "Tweets with above-median profile clicks:" )
for( row in 1:nrow( tweets ) ) {
        if( clicks[row] > median.clicks ) {
                print( tweets[row, "Tweet.text"] )
        }
}

# Find subset of table where profile.clicks <= median.clicks
print( "Tweets with below-median profile clicks:" )
for( row in 1:nrow( tweets ) ) {
        if( clicks[row] <= median.clicks ) {
                print( tweets[row, "Tweet.text"] )
        }
}

This R script calculates the median profile click count and then lists the text of all tweets with above-the-median counts followed by a list of all tweets with below-the-median counts. I then run this data through a pipeline of tiny Unix scripts that format it as a Markdown file. There are two versions of this pipeline: one for R scripts like this one that produce multiple discrete lists, and one for R scripts that produce a single list where the tweets are ranked according to some metric. I’m just going to list all the scripts in the pipeline without explanation, partly because I’m lazy and partly because there’s not really much to explain.

#!/usr/bin/env bash
# Use for scripts that output multiple data sets with labels at the top

Rscript "$1" "$2" | sed -f backslash.sed | ./condense | awk -f formatting.awk | ./double-newlines | fmt | ./unix2dos

#!/usr/bin/env bash
# Use for scripts that output numerical data for ranking as the first column

Rscript "$1" "$2" | sed -f backslash.sed | ./condense | sort -rg | sed "s/^[0-9][0-9]*\.*[0-9]* *: *//" | ./double-newlines | fmt | ./unix2dos


#!/usr/bin/sed -f

s/^\[1\] \"//
s/\"$//
s/_/\\_/g
s/#/\\#/g
s/&gt;/>/g
s/&lt;/</g
s/>/\\>/g
s/\*/\\\*/g
s/-/\\-/g
s/\\n/ /g


#include <stdio.h>
#include <stdbool.h>

int main( int argc, char **argv ){
        bool in_quote;
        int c;
        in_quote = false;
        while( (c = fgetc( stdin )) != EOF ){
                if( c == '\n' && in_quote ){
                        /* replace newlines inside quoted tweet text
                           with a literal "\n" sequence */
                        putchar( '\\' );
                        putchar( 'n' );
                }
                else{
                        if( c == '"' )
                                in_quote = !in_quote;
                        putchar( c );
                }
        }
        return 0;
}


#!/usr/bin/awk -f

/:$/ { printf( "---------------------------------------------------------------------------\n" ) }
/:$/ { printf( "### " ) }
     { print $0 }


#include <stdio.h>

int main( int argc, char **argv ){
        int c;
        /* double every newline so each tweet becomes its own
           Markdown paragraph */
        while( (c = fgetc( stdin )) != EOF ){
                putchar( c );
                if( c == '\n' )
                        putchar( c );
        }
        return 0;
}


#include <stdio.h>

int main( int argc, char **argv ){
        int c;
        /* insert a carriage return before every newline (CRLF line endings) */
        while( (c = fgetc( stdin )) != EOF ){
                if( c == '\n' )
                        putchar( '\r' );
                putchar( c );
        }
        return 0;
}

Once I run the output through the formatting pipeline and get the Markdown file, I upload it to my private repo on GitHub, where I typically open it on my phone and count the number of tweets, both above and below the median, that mention external content. If the count above the median is significantly higher than the count below it, that supports the hypothesis.
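That manual counting step could in principle be automated with a short shell script. Here's a sketch: the here-doc stands in for a real report produced by the pipeline, and `github|wordpress` is just an illustrative pattern for "mentions external content" -- both are assumptions, not part of my actual workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count tweets in each section of the report
# that mention external content. The here-doc below stands in for
# a real Markdown file produced by the formatting pipeline.
report=$(mktemp)
cat > "$report" <<'EOF'
---------------------------------------------------------------------------
### Tweets with above-median profile clicks:
New post up on my WordPress blog about awk one-liners
Pushed a new tool to my GitHub portfolio today
Just had a great sandwich
---------------------------------------------------------------------------
### Tweets with below-median profile clicks:
Mondays, am I right
EOF

count_mentions() {
        # print the section starting at the matching header, up to the
        # next divider line, then count lines matching the pattern
        sed -n "/$1/,/^---/p" "$report" | grep -Eic 'github|wordpress'
}

above=$(count_mentions "above-median")
below=$(count_mentions "below-median")
echo "above: $above, below: $below"
rm -f "$report"
```

With the sample data this prints `above: 2, below: 0`, which is the same comparison I do by eye on my phone.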

If I want to have a single ranked list with exact numbers included, as opposed to two unranked lists, I can use the following R script:

#!/usr/bin/env Rscript

file <- commandArgs( trailingOnly = TRUE )[1]
tweets <- read.csv( file, header = TRUE )

for( row in 1:nrow( tweets ) ) {
        print( paste( tweets[row, "profile.clicks"], tweets[row, "Tweet.text"] ) )
}

Then I perform a similar procedure using the second formatting pipeline (the one that sorts the output numerically instead of preserving labeled sections).

Turns out mentioning external content is indeed a good way to get profile clicks on Twitter, because when people see the tweet, they’re more likely to get curious and click on your profile to see if they can find said content. And I have a pinned tweet with links to all my other social media profiles on it, so the more people view my Twitter profile, the more people land on my other profiles and view my external content. So there you go – some practical advice for anyone who uses Twitter and wants to build a following, from someone who’s actually studied Twitter mathematically.

If I want to measure tweets by a different metric, I find it in the header of the CSV file, replace the spaces in that field name with periods (which is how read.csv() renames the columns), and use that string in my R script. For example, to rank tweets by overall engagement rate, I would use the following script:

#!/usr/bin/env Rscript

file <- commandArgs( trailingOnly = TRUE )[1]
tweets <- read.csv( file, header = TRUE )

for( row in 1:nrow( tweets ) ) {
        print( paste( tweets[row, "engagement.rate"], tweets[row, "Tweet.text"] ) )
}
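To double-check which column names are actually available, you can preview the renaming from the shell before writing the R script. This is just a sketch: the printf line stands in for `head -1` on a real Twitter Analytics export, and note that read.csv() actually uses R's make.names(), which does a bit more than swapping spaces for periods.

```shell
#!/usr/bin/env bash
# Sketch: preview how read.csv() will rename CSV header fields.
# The printf stands in for `head -1 by-tweet.csv` on a real export.
fields=$(printf 'Tweet id,Tweet text,impressions,engagement rate,profile clicks' \
        | tr ' ' '.' | tr ',' '\n')
echo "$fields"
```

Running this lists one renamed field per line, e.g. `engagement.rate` and `profile.clicks`, which are exactly the strings to use for indexing in R.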

Sometimes it’s useful to look at the outliers for a particular metric and then count instances of a pattern within these outliers. This can give you better insight into your top-performing tweets and what makes them so special. Here’s an R script that shows only tweets with at least 10 retweets or no retweets at all:

#!/usr/bin/env Rscript
# Syntax: Rscript retweets.r by-tweet.csv

file <- commandArgs( trailingOnly = TRUE )[1]
tweets <- read.csv( file, header = TRUE )

# Indicates which tweets garner the most attention
# for my profile through retweets
print( "Tweets with at least 10 retweets:" )
for( row in 1:nrow( tweets ) ) {
        retweets <- tweets[row, "retweets"]
        if( retweets >= 10 ) {
                print( tweets[row, "Tweet.text"] )
        }
}

print( "Tweets with no retweets:" )
for( row in 1:nrow( tweets ) ) {
        retweets <- tweets[row, "retweets"]
        if( retweets == 0 ) {
                print( tweets[row, "Tweet.text"] )
        }
}

There’s some good insight to be gained from this one too. Tweets that take advantage of retweet bots on Twitter are retweeted a lot more and thus reach anyone who is following those bots. You take advantage of retweet bots by using a hashtag that one of those bots is programmed to retweet. Of course not all Twitter hashtags are created equal. Some have more retweet bots associated with them than others, and not all retweet bots associated with a hashtag will pick it up. For example, tweets using #Linux will get picked up by two or three retweet bots, while tweets using #100DaysOfCode will get picked up by anywhere from five to ten bots. Once I discovered the magic of retweet bots, my reach and my following on Twitter started to grow exponentially.

So there you have it, guys – a few short but useful scripts for finding patterns in successful tweets, and a couple freebie tips that I discovered for being more successful on Twitter. That’s all for today, so farewell and happy coding.
