Doing Statistics in awk

One coding exercise that offers a lot of interesting challenges is using a programming language for an application it was never intended for. Writing a Fortran compiler in TECO and solving the Towers of Hanoi problem in the troff text formatter are two examples from the legendary Hacker Purity Test. There have also been Turing machine simulations written in languages like sed. The more limited and domain-specific the language, the better.

While writing compilers and Turing machines in small scripting languages is somewhat beyond my abilities, I have explored this theme in other ways, like trying to use the awk utility for tasks typically done with a statistical language like R. This involves getting awk to read statistical databases in the CSV format, as well as implementing some of the statistics functions found in the R language.

There are a couple of steps to getting awk to read databases that are written in CSV. CSV fields are separated by commas, and they can be either quoted or unquoted. So we start by telling awk to use a comma as the field separator character:


BEGIN { FS = "," }

Next we create an awk function for stripping the quote characters from a quoted string:


function strip_quote( str ){
        return substr( str, 2, length( str ) - 2 )
}
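
Applied to a field like "Crypto" (quotes included), this returns Crypto. Note that it blindly chops off the first and last characters, so it should only be called on fields that are actually quoted.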

Since awk scripts are typically written for one specific database file, we will not be writing a generalized CSV-reading script here. Instead we will write for a specific field layout. One thing we can do is divide records into categories and sum the values in the different categories. This is accomplished by the following awk script:


#!/usr/bin/awk -f
# Total values in categories from a CSV database
# Database structure: Item,Category,Value

function strip_quote( str ){
        return substr( str, 2, length( str ) - 2 )
}

BEGIN { FS = "," }
{ item = strip_quote( $1 )
  cat = strip_quote( $2 )
  value = $3
  values[cat,item] = value }
END { for( i in values ){
        split( i, subscr, SUBSEP ) # separate out first index
        cat = subscr[1]
        if( !total[cat] ) total[cat] = 0 # initialize if doesn't exist
        total[cat] += values[i] }
      for( cat in total ) printf( "%s: %.2f\n", cat, total[cat] ) }

This script takes a database of items with numerical values and sums the values for each category. We can run it on the following sample CSV file, which is a database of assets using numbers I pulled out of the air:


"Bitcoin","Crypto",1000.00
"Gold","Future",700.00
"Ethereum","Crypto",500.00
"Silver","Future",300.00
"Stocks","Equity",1500.00
"Cash","Cash",500.0

This will produce the following output (the order of the categories may vary, since awk does not traverse arrays in any guaranteed order):


Cash: 500.00
Crypto: 1500.00
Future: 1000.00
Equity: 1500.00

Of course this isn't a perfect script, because it will still trip over commas and escaped quotes inside a quoted string. There are ways around this, but they would involve scanning each field character by character and pasting together fields that turn out to belong to the same quoted string. In most cases that makes things unnecessarily complicated, so it's only worth doing if we specifically need to process strings containing those characters. One possible shortcut would be to simply use different delimiter and quote characters, which is an option in a lot of CSV readers, as sketched below.
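For example, if we control how the database is exported, we could write it out pipe-delimited with no quoting at all. The layout below is just an assumption for illustration, but it shows how much simpler the awk side becomes:


#!/usr/bin/awk -f
# Category totals for a pipe-delimited, unquoted database
# Assumed record layout: Item|Category|Value
BEGIN { FS = "|" }
{ total[$2] += $3 }
END { for( cat in total ) printf( "%s: %.2f\n", cat, total[cat] ) }

The quote stripping and the explicit initialization both disappear, since awk treats an uninitialized array element as zero.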

We can also calculate basic statistical measurements like mean, median, and mode. Here’s how we would calculate the mean of a field over several records:


#!/usr/bin/awk -f
# Find the means of categories in a CSV database
# Database structure: Item,Category,Value

function strip_quote( str ){
        return substr( str, 2, length( str ) - 2 )
}

BEGIN { FS = "," }
{ item = strip_quote( $1 )
  cat = strip_quote( $2 )
  value = $3
  if( !count[cat] ) count[cat] = 0 # Initialize if doesn't exist
  count[cat]++
  values[cat,item] = value }
END { for( i in values ){
        split( i, subscr, SUBSEP ) # separate out first index
        cat = subscr[1]
        if( !mean[cat] ) mean[cat] = 0
        mean[cat] += values[i] }
      for( cat in count ) mean[cat] /= count[cat]
      for( cat in mean ) printf( "%s: %.2f\n", cat, mean[cat] ) }
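Run against the sample database above, this should print something like the following, though again the category order may differ:


Cash: 500.00
Crypto: 750.00
Equity: 1500.00
Future: 500.00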

To find the median of a list, you first sort the list and then pick the middle element. There is no sort() function in standard awk (gawk has a built-in asort(), but it isn't portable), so I've implemented a simple insertion sort myself:


function sort( a ){
        # Get array size:
        size = 0
        for( i in a ) size++
        # Insertion sort:
        for( i = 2; i <= size; i++ ){
                aux = a[i]
                l = i - 1
                while( l >= 1 && a[l] > aux ){
                        a[l+1] = a[l]
                        --l
                }
                a[l+1] = aux
        }
}
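
Here is a minimal standalone test of the sort function, using throwaway values (the array contents are arbitrary):


#!/usr/bin/awk -f
function sort( a ){
        # Get array size:
        size = 0
        for( i in a ) size++
        # Insertion sort:
        for( i = 2; i <= size; i++ ){
                aux = a[i]
                l = i - 1
                while( l >= 1 && a[l] > aux ){
                        a[l+1] = a[l]
                        --l
                }
                a[l+1] = aux
        }
}
BEGIN { n = split( "5 1 4 2 3", a, " " )
        sort( a )
        for( j = 1; j <= n; j++ ) printf( "%s ", a[j] )
        print "" } # prints: 1 2 3 4 5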

Notice that since standard awk also has no function for getting the length of an array, I had to count the elements myself. We can adapt this sort function to a multidimensional array and use it in an awk script that computes the median of each category:


#!/usr/bin/awk -f
# Find the medians of categories in a CSV database
# Database structure: Item,Category,Value

function strip_quote( str ){
        return substr( str, 2, length( str ) - 2 )
}

# Sort function only works for multidimensional
# arrays with a numerical second index
function sort( a, cat ){
        # Get array size:
        size = 0
        for( i in a ){
                split( i, subscr, SUBSEP ) # Separate out first index
                if( subscr[1] == cat ) size++
        }
        # Insertion sort:
        for( i = 2; i <= size; i++ ){
                aux = a[cat,i]
                l = i - 1
                while( l >= 1 && a[cat,l] > aux ){
                        a[cat,l+1] = a[cat,l]
                        --l
                }
                a[cat,l+1] = aux
        }
}

BEGIN { FS = "," }
{ cat = strip_quote( $2 )
  value = $3
  if( !count[cat] ) count[cat] = 0 # Initialize if doesn't exist
  count[cat]++
  values[cat,count[cat]] = value }
END { for( cat in count ){
        sort( values, cat )
        i = int( count[cat] / 2 ) + 1
        median[cat] = values[cat,i] }
      for( cat in median ) printf( "%s: %.2f\n", cat, median[cat] ) }
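On the sample database this should report the following, in whatever order for-in produces. Note that for the two-element Crypto and Future categories the script picks the upper of the two middle values rather than averaging them, so these are not true medians in the textbook sense:


Cash: 500.00
Crypto: 1000.00
Equity: 1500.00
Future: 700.00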

The awk script for computing the mode of each category looks like this; when several values tie for the highest count, whichever one the (unordered) for-in loop visits first wins:


#!/usr/bin/awk -f
# Find the modes of categories in a CSV database
# Database structure: Item,Category,Value

function strip_quote( str ){
        return substr( str, 2, length( str ) - 2 )
}

BEGIN { FS = "," }
{ item = strip_quote( $1 )
  cat = strip_quote( $2 )
  value = $3
  cats[cat] = 1
  if( !count[cat,value] ) count[cat,value] = 0 # Initialize if doesn't exist
  count[cat,value]++ }
END { for( cat in cats ){
        maxcount = 0
        for( i in count ){
          split( i, subscr, SUBSEP ) # Separate out indexes
          if( cat == subscr[1] && count[i] > maxcount ){
            maxcount = count[i]
            mode[cat] = subscr[2] } } }
      for( cat in mode ) printf( "%s: %.2f\n", cat, mode[cat] ) }
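One caveat: in the sample database every value occurs exactly once within its category, so maxcount never rises above 1 and the reported "mode" is just whichever value the for-in loop happens to visit first. The script only gives a meaningful answer on data with genuine repeats.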

I’m now speculating that it may even be possible to do more advanced statistics work such as symbolic machine learning in awk, though you would have to implement all the algorithms from the ground up and that would take a lot of code. It would also be slow as fuck, because you’d be implementing it in a high-level interpreted language. But at least for basic statistical operations, awk appears to be a suitable tool. Anyway, I greatly enjoyed doing this, and I’m glad I had the opportunity to share it with everyone. Farewell and happy coding!
