Forensics Tool to Detect Encrypted Files

Well shit, looks like I haven’t posted in about two weeks. I really didn’t mean to do that. This absence was entirely unplanned and served no real purpose. But here I am again, and I’ll try to update on a semi-daily basis for the foreseeable future. Oh, and it looks like I’m getting regular hits from Google, Bing, and Yahoo! Search now. My SEO goals were realized while I was gone, without me having to do anything. Interesting how things work out.

Anyway, I just thought I’d share a simple digital forensics program that I wrote just now. This program analyzes files to detect encryption. Encrypted files are of special interest in any kind of legal investigation, because although you can’t always figure out what was encrypted, you can still see that something was encrypted, and various context clues sprinkled throughout the filesystem, in log files, in the registry, in the file metadata, etc. can lead to an idea of what the person encrypting the file was trying to hide. It is thus desirable if you’re a digital forensics examiner to be able to quickly find files that have been encrypted.

The system I intend to eventually build will have two main parts. The first part will be a file signature analysis. This will look at the first few bytes of the file in question (the magic number) and compare them to a database of known file signatures to see whether they match a known file type. If the signature matches a known unencrypted format, the file itself is probably also unencrypted, so we can skip it. I have not yet implemented this part.
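Since this part is not implemented yet, here is only a minimal sketch of what a signature check could look like. The names `struct signature`, `SIGNATURES`, and `match_signature` are made up for illustration, and the table holds just a handful of well-known magic numbers rather than a real signature database.

```c
#include <stddef.h>
#include <string.h>

/* One entry of a deliberately tiny, hypothetical signature table. */
struct signature {
    const char *name;            /* human-readable file type */
    const unsigned char *magic;  /* leading bytes that identify it */
    size_t len;                  /* number of magic bytes */
};

static const struct signature SIGNATURES[] = {
    { "PNG",  (const unsigned char *) "\x89PNG\r\n\x1a\n", 8 },
    { "JPEG", (const unsigned char *) "\xff\xd8\xff",      3 },
    { "GIF",  (const unsigned char *) "GIF8",              4 },
    { "PDF",  (const unsigned char *) "%PDF-",             5 },
    { "ZIP",  (const unsigned char *) "PK\x03\x04",        4 },
    { "ELF",  (const unsigned char *) "\x7f" "ELF",        4 },
    { "MZ executable", (const unsigned char *) "MZ",       2 },
};

/* Returns the name of the first signature that matches the start of buf,
 * or NULL if none of the known signatures match. */
const char *match_signature( const unsigned char *buf, size_t len ){
    size_t count = sizeof SIGNATURES / sizeof SIGNATURES[0];
    for( size_t i = 0; i < count; i++ ){
        const struct signature *s = &SIGNATURES[i];
        if( len >= s->len && memcmp( buf, s->magic, s->len ) == 0 )
            return s->name;
    }
    return NULL;
}
```

A caller would read the first eight or so bytes of a file with `fread` and pass them in; a non-NULL result means the file can probably be skipped. Note that a matching signature is only a hint, since anyone can prepend a known header to encrypted data.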

The second part, which I have implemented, is an algorithm that scans the file, counts the occurrences of each byte value and stores the counts in an array, then calculates the mean of those counts and, from that, a mean variance. For those of you who don’t know statistics, variance is a measure of how far data points stray from the mean value. (Strictly speaking, what my program computes is the mean absolute deviation, a simpler cousin of the textbook variance, but I’ll keep calling it the mean variance here.) Averaging the deviations of all the data points gives an indication of how much the function varies overall. If the function is flat, the result is precisely zero. If the function is all over the place, the result is high. In this case, we are looking for byte values that are much more or less frequent than the average. These indicate low entropy. An encrypted file, ideally, will have very high entropy, which translates to a low mean variance. Thus if the mean variance is above a certain threshold, we can rule the file out.
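For readers who prefer symbols, this is one way to write down the statistic as the listing further down computes it. The symbols are introduced here for illustration and do not appear in the program: $f_i$ is the number of times byte value $i$ occurs, $N$ is the file length in bytes, $\mu$ is the mean count, and $S$ is the set of byte values that occur at least once.

```latex
\mu = \frac{1}{256}\sum_{i=0}^{255} f_i = \frac{N}{256},
\qquad
S = \{\, i : f_i > 0 \,\},
\qquad
D = \frac{100}{N\,\lvert S\rvert}\sum_{i \in S}\lvert f_i - \mu\rvert
```

Two extremes show the scale. A perfectly flat histogram has $f_i = \mu$ for every $i$, so $D = 0$. A file made of one repeated byte has a single count equal to $N$, so $D = 100 \cdot (N - N/256)/N = 100 \cdot 255/256 \approx 99.6$.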

I have two code files: The first is a C program that examines a single file and determines whether it is encrypted. The second is a frontend shell script that runs this program over all the files in a directory and lists all the files that are encrypted. Remember, these are both rough drafts and the system is very much in the alpha stage, so it’s not at all fit for real-world use at this point.

crypto-detect.c:


/***********************************************
 * Crypto-Detect v. 0.1 (Alpha)                *
 * Description: A digital forensics tool that  *
 * calculates a file's rough entropy to detect *
 * encryption.                                 *
 * Author: Michael Warren                      *
 * Date: Sun, May 12, 2019                     *
 * License: Michael Warren FSL                 *
 ***********************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define SIGNIFICANT 0.05 /* cutoff for the mean variance, in percent */

int f[256]; /* f[i] = number of times byte value i occurs */

int main( int argc, char **argv ){
        if( argc < 2 ){
                fprintf( stderr, "Usage: %s <file>\n", argv[0] );
                return 255;
        }
        for( int i = 0; i < 256; i++ ){
                f[i] = 0;
        }
        FILE *fp;
        if( !(fp = fopen( argv[1], "rb" )) ){
                int err = errno; /* save it before fprintf can change it */
                fprintf( stderr, "%s: %s: %s\n", argv[0], argv[1], strerror( err ) );
                return err;
        }
        fseek( fp, 0, SEEK_END );
        long flen = ftell( fp );
        rewind( fp );
        // Count how often each byte value occurs:
        int c;
        while( (c = fgetc( fp )) != EOF ){
                f[c]++;
        }
        fclose( fp );
        // Part 1: Compute mean frequency:
        float mean = 0;
        for( int i = 0; i < 256; i++ ){
                mean += f[i];
        }
        mean /= 256;
        // Part 2: Compute mean variance:
        float variance = 0;
        int nzcount = 0;
        for( int i = 0; i < 256; i++ ){
                if( f[i] ){
                        // abs() takes an int and would truncate the
                        // float, so take the absolute value by hand:
                        float d = f[i] - mean;
                        variance += d < 0 ? -d : d;
                        nzcount++;
                }
        }
        variance /= nzcount;
        variance /= flen;
        variance *= 100;
        if( variance < SIGNIFICANT )
                printf( "Encrypted file detected.\nMean variance: %f%%\n", variance );
        else
                printf( "File not encrypted.\nMean variance: %f%%\n", variance );
        /* The return value allows a script
         * to read the variance value from
         * this program during a scan of the
         * entire filesystem.
         */
        int rval = (int) (variance * 100);
        if( rval > 255 ) rval = 255;
        return rval;
}

crypto-scan.sh:


#!/usr/bin/env bash
# Uses the Crypto-Detect program to
# scan a directory for encrypted files

declare -i variance=0
C_PROGRAM="./crypto-detect"

for file in *
do
        # The exit status is the mean variance in
        # hundredths of a percent, capped at 255
        "$C_PROGRAM" "$file" 1>/dev/null
        variance=$?
        if [[ $variance -lt 5 ]]
        then
                echo "$file"
        fi
done

unset variance file

Let’s do a few test runs. When we run the Crypto-Detect program on a regular ASCII text file we get the following output:


File not encrypted.
Mean variance: 1.072634%

One of the problems I encountered when trying to get an accurate measure of file entropy was that for ASCII files, most of the byte values have counts of zero. The mean count is only 1/256 of the file size, so all of those zero counts sit fairly close to the mean. Averaging over all 256 values therefore drove the mean variance way down and made it disproportionately low for ASCII files compared to binary files. To address this anomaly, I added nzcount, which stands for “not zero count”, so that only nonzero counts are included in the average. This lets the program focus on the byte values that actually occur, rather than skewing the results and thinking that ASCII text files are encrypted.
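To make the effect of nzcount concrete, here is a small sketch that computes the statistic both ways from a histogram. The function `deviation_percent` and its `nonzero_only` flag are hypothetical names for this illustration; it uses `double` instead of the program’s `float`, so the last digits may differ slightly from the real tool.

```c
/* Mean absolute deviation of the byte counts from their mean, as a
 * percentage of the total byte count. If nonzero_only is set, byte values
 * that never occur are left out of the average (the nzcount approach);
 * otherwise all 256 values are averaged. */
double deviation_percent( const long counts[256], int nonzero_only ){
    long total = 0;
    for( int i = 0; i < 256; i++ ) total += counts[i];
    if( total == 0 ) return 0.0;   /* empty input: nothing to measure */

    double mean = (double) total / 256.0;
    double sum = 0.0;
    int included = 0;
    for( int i = 0; i < 256; i++ ){
        if( nonzero_only && counts[i] == 0 ) continue;
        double d = (double) counts[i] - mean;
        sum += d < 0 ? -d : d;
        included++;
    }
    /* total > 0 guarantees at least one nonzero count, so included > 0 */
    return sum / included / (double) total * 100.0;
}
```

For a made-up text-like histogram in which 64 byte values occur 100 times each and the other 192 never occur, averaging over all 256 values gives about 0.59%, while averaging over the nonzero counts only gives about 1.17%. The fewer distinct byte values a file uses, the bigger that gap becomes.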

Now let’s run this program on a binary executable file:


File not encrypted.
Mean variance: 0.483920%

The mean variance is lower this time, which makes sense because this is a binary file and the byte values are all over the place, rather than being concentrated on the printable ASCII characters. This means that they exhibit a higher degree of randomness, and thus higher entropy. Remember that a high variance in the actual byte values translates to a low variance in the byte counts.

Now let’s run this program on a file encrypted with AES:


Encrypted file detected.
Mean variance: 0.011587%

Notice that the mean variance is much lower this time, being less than 1/40 of what was derived for the unencrypted binary file. In fact it’s low enough that it passes the threshold for detecting an encrypted file. Let’s run it on a different encrypted file:


Encrypted file detected.
Mean variance: 0.029049%

Again, the threshold is passed. I noticed that for every encrypted file I ran this program on, the value I got was under 0.03%, while for most unencrypted files the value was at least 0.4%, which leaves a very comfortable window between the two ranges. I chose 0.05% as the cutoff point. Of course there were a few exceptions, which I’ll get to later, but in most cases this was the behavior I observed.

Now let’s run the shell script on my Programming/security directory, which has a lot of test files and other detritus in it.


a.encrypted
asinstall.exe
encrypted.bin
encrypted1.bin
encrypted2.bin
encrypted3.bin
encrypted4.bin
ika-musume-arch-linux-169.png
test.png
Wallpapers

As we can see here, the program successfully detected all of the encrypted files that were in that directory. However, it also picked up a few false positives, mostly image files. That is not surprising: PNG image data is compressed, and well-compressed data looks statistically a lot like random data. This is where the file signature analysis (which I haven’t implemented yet) comes in. Simply checking the magic number on those files would eliminate all of the false positives, with the only exception being the directory (and testing a file to see if it’s a directory is a trivial task in Unix shell programming).
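The directory check is just as short in C as it is in the shell. The sketch below uses the POSIX `stat` call and the `S_ISREG` macro; the helper name `is_regular_file` is made up for this example and is not part of the program above.

```c
#include <sys/stat.h>

/* Returns 1 if path names a regular file, and 0 for directories, other
 * special files, or paths that cannot be examined at all. */
int is_regular_file( const char *path ){
    struct stat st;
    if( stat( path, &st ) != 0 ) return 0;
    return S_ISREG( st.st_mode ) ? 1 : 0;
}
```

Calling this before opening the file would keep entries like the Wallpapers directory out of the results. Since `stat` follows symbolic links, a link to a regular file counts as a regular file here.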

The other problem with this system is that it’s neither thorough nor efficient. To be useful for a real investigation, it would have to scan a directory recursively, rather than just enumerating the files immediately inside the current directory. Like I said, the code I’ve written so far is only a quick-and-dirty solution. I can implement recursive scanning in later versions. I’ve also noticed that the scan is often rather slow, especially on large files, since every byte of every file gets read. I don’t know yet how much faster it can be made, or if the slowness is just inevitable. I suppose one thing I could do is scan just a small section of the file, and if the test comes back negative for that section, move on to the next file. These are all changes to incorporate in future versions of this program. What I have at the moment is very primitive, but it could evolve into something much more sophisticated.
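As a rough sketch of the “scan a small section” idea, the function below counts byte values in only the first `limit` bytes of a file, reading in 64 KiB chunks with `fread` instead of one byte at a time. The name `count_prefix`, the `CHUNK` size, and the choice to sample the start of the file are all assumptions for illustration; whether a prefix is representative of the whole file is untested here.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 65536 /* read buffer size in bytes (an arbitrary choice) */

/* Counts byte values in at most the first `limit` bytes of the file at
 * `path`. Fills counts[0..255] and returns the number of bytes counted,
 * or -1 if the file cannot be opened. */
long count_prefix( const char *path, long limit, long counts[256] ){
    unsigned char buf[CHUNK];
    memset( counts, 0, 256 * sizeof counts[0] );

    FILE *fp = fopen( path, "rb" );
    if( !fp ) return -1;

    long total = 0;
    while( total < limit ){
        /* never ask for more than the remaining budget */
        size_t want = sizeof buf;
        if( (long) want > limit - total ) want = (size_t) (limit - total);

        size_t got = fread( buf, 1, want, fp );
        if( got == 0 ) break;   /* end of file or read error */

        for( size_t i = 0; i < got; i++ ) counts[buf[i]]++;
        total += (long) got;
    }
    fclose( fp );
    return total;
}
```

The returned total would stand in for the file length when normalizing the mean variance, so the 0.05% cutoff would likely need to be re-tuned for whatever sample size is chosen.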
