Data Science in C: Combining CSV with SQL

In the last episode of Data Science in C I talked about the parser portion of my CSV library, which I implemented using automata. Now that we’ve parsed the CSV code and converted it into an abstract table structure, it’s time to implement some operations on the table data so that we can actually use CSV tables in our algorithms. And because a CSV file is essentially a simplified database, I thought why not use SQL and the relational database model as the framework for these table operations?

The goal of the parser was to get the data into a form where it can be more easily processed. The goal of the SQL component is to provide the actual primitives for reading and modifying data in the database. To get a better idea of what exactly we need to do, let’s look at a sample machine learning application: the K-Nearest-Neighbor or KNN algorithm. I designed a flowchart for a program based on this algorithm some weeks back, but haven’t been able to implement it because I didn’t have the CSV underpinnings yet. Now that the CSV library is at least partially implemented, I’m going to revisit this program. Here’s the flowchart I came up with:

Flowchart for program implementing the KNN machine learning algorithm

The KNN algorithm is fairly simple in concept. It basically involves classifying a data point in one of several categories by looking at the K other data points nearest to that point and seeing what categories those points are in. The category with the most points among the K nearest is the category that is selected for the point in question. My implementation of the KNN algorithm determines the K nearest neighbors by computing the Pythagorean distance for each point, sorting the points by this metric, and then tallying the first K elements of the sorted array.

Now let’s say the data points for this algorithm are provided by a CSV file, which we can convert into a table structure using the algorithms shown in the previous article. What operations do we need to be able to carry out? Well, first of all, if we look at the end of the program, we can see that it appends the new data points to the end of the table, so we need a method for adding and removing records. We also need a method for reading the coordinates of each data point from the table, as well as updating them when new information becomes available. We also need to be able to generate subsets of the CSV records for the training set and the testing set.

The operations listed above correspond to the SQL commands INSERT, DELETE, UPDATE, and SELECT. Because the table is represented as a linked list, we also need a way of navigating it. So with that end in mind, I want to create a function for moving the cur pointer to the next record, and a function for rewinding the cur pointer to the beginning of the table. It would also be prudent to have a way to create and destroy table structures, so I will be implementing functions for the CREATE TABLE and DROP TABLE commands as well.

First of all, let’s create some typedefs for our data types as well as a namespace for all the functions in the CSV library. This can be done in the header file like so:


// Metadata field for the table header
typedef struct {
        char *name;
        enum types type;
} csv_field;

// Single record in linked list table structure
struct _csv_record {
        void **record;
        struct _csv_record *next;
};

typedef struct _csv_record csv_record;

// Handle for abstract table structure
typedef struct {
        int rlen;           // # of fields in each record
        csv_field **header; // Table metadata
        csv_record *start;  // Pointer to first record
        csv_record *cur;    // Pointer to current recordcsv_table;


__BEGIN_DECLS
bool csv_validate_fileFILE *, bool );
csv_table *csv_read_tableFILE *, bool );
void csv_write_tableFILE *, csv_table *, bool );
csv_table *csv_create_tableintcsv_field ** );
void csv_drop_tablecsv_table * );
csv_record *csv_next_recordcsv_table * );
void csv_rewindcsv_table * );
void csv_insert_recordcsv_table *, void ** );
void csv_insert_new_recordcsv_table * );
void csv_delete_current_recordcsv_table * );
dfloat64_t *csv_get_number_field_by_namecsv_table *, char * );
dfloat64_t *csv_get_number_field_by_indexcsv_table *, int );
char *csv_get_string_field_by_namecsv_table *, char * );
char *csv_get_string_field_by_indexcsv_table *, int );
void csv_set_number_field_by_namecsv_table *, char *, dfloat64_t * );
void csv_set_number_field_by_indexcsv_table *, intdfloat64_t * );
void csv_set_string_field_by_namecsv_table *, char *, char * );
void csv_set_string_field_by_indexcsv_table *, intchar * );
__END_DECLS

The simplest functions to implement are the ones for moving the cur pointer, which I will be calling the Current Record Pointer or CRP from now on:


// Select next record
// Return NULL if end of table is reached
csv_record *csv_next_recordcsv_table *table ){
        if( table->cur->next ){
                table->cur = table->cur->next;
                return table->cur;
        }
        return NULL;
}

// Rewind to the beginning of the table
void csv_rewindcsv_table *table ){
        table->cur = table->start;
}

Notice that the csv_next_record() function returns a pointer to the next record, or NULL if there is no record after the current one. This is so that we can tell when the end of the table has been reached when traversing the records in the table. This makes it easy to loop through an entire table with code like this:


whilecsv_next_record( table ) ){
        // Do stuff with the current record
}

Theoretically, you could also use the syntax table->cur = csv_next_record( table ); to move the CRP if you think it’s more readable, though this is entirely a stylistic preference.

The function for implementing SQL’s CREATE TABLE command takes a record length and a vector of field structs as parameters and initializes a blank table with the appropriate heading:


// Equivalent to CREATE TABLE in SQL
csv_table *csv_create_tableint rlen, csv_field **field_vector ){
        csv_table *table;
        int f;
        table = (csv_table *) mallocsizeofcsv_table ) );
        table->rlen = rlen;
        table->header = (csv_field **) calloc( rlen, sizeofcsv_field * ) );
        for( f = 0; f < rlen; f++ ){
                table->header[f] = (csv_field *) mallocsizeofcsv_field ) );
                memcpy( table->header[f], field_vector[f], sizeofcsv_field ) );
        }
        table->start = (csv_record *) mallocsizeofcsv_record ) );
        table->cur = table->start;
        return table;
}

The function for implementing DROP TABLE is somewhat longer:


// Equivelent to DROP TABLE in SQL
void csv_drop_tablecsv_table *table ){
        csv_record *tmp;
        int f;

        free( table->header );

        // Free table records:
        csv_rewind( table );
        whilecsv_next_record( table ) ){
                for( f = 0; f < table->rlen; f++ ){
                        free( table->cur->record[f] );
                }
                tmp = table->cur;
                table->cur = table->cur->next;
                free( tmp );
        }

        // Next part prevents dangling pointer problems.
        table->start = NULL;
        table->cur = NULL;
        table->header = NULL;
        free( table );
        table = NULL;
}

This function works by traversing every data structure in the table, starting with the fields of each record, then the record structs, and then finally the table itself, and freeing every last one of them. This operation is vital if we are working with many different tables (for example subset tables created through SELECT commands) and we don’t want our data segment getting filled up with malloc‘d data structures that we are no longer using.

Next we have the functions for implementing the INSERT and DELETE commands. There are three in total – two for inserting a new record at the end of the table, and one for deleting the current record. There are two versions of the INSERT function because I first implemented csv_insert_record() with the idea that someone would malloc a void pointer array for that record and then use it to initalize a new record in the table. I then realized that this would be rather tedious in practice, so I created another function called csv_insert_new_record() that just inserts a blank record, thinking that a programmer would use this function and then use the UPDATE functions shown later to populate the values of that record. This is a much more streamlined interface that doesn’t involve such heavy use of malloc, calloc, strncpy, and memcpy (not to mention freeing the initialization array when done using it), but the original interface is still available if someone wants to use that method (who knows, there may be a reason for wanting to initialize a new record all at once that I just haven’t thought of).

The two INSERT functions are as follows:


// Equivalent to INSERT INTO in SQL
void csv_insert_recordcsv_table *table, void **record ){
        csv_record *save;
        int f;
        int len;
        save = table->cur;
        whilecsv_next_record( table ) );
        table->cur->next = (csv_record *) calloc1sizeofcsv_record ) );
        csv_next_record( table );
        table->cur->record = (void **) calloc( table->rlen, sizeofvoid * ) );
        for( f = 0; f < table->rlen; f++ ){
                if( table->header[f]->type == csv_number ){
                        table->cur->record[f] = mallocsizeofdfloat64_t ) );
                        memcpy( table->cur->record[f], record[f], sizeofdfloat64_t ) );
                }
                else if( table->header[f]->type == csv_string ){
                        len = strlen( (char *) record[f] );
                        table->cur->record[f] = malloc( len + 1 );
                        strncpy( table->cur->record[f], record[f], len + 1 );
                }
        }
        table->cur->next = NULL;
        table->cur = save;
}

// INSERT a blank record and move to that record
void csv_insert_new_recordcsv_table *table ){
        int f;
        whilecsv_next_record( table ) );
        table->cur->next = (csv_record *) calloc1sizeofcsv_record ) );
        csv_next_record( table );
        table->cur->record = (void **) calloc( table->rlen, sizeofvoid * ) );
        for( f = 0; f < table->rlen; f++ ){
                if( table->header[f]->type == csv_number )
                        table->cur->record[f] = mallocsizeofdfloat64_t ) );
                else if( table->header[f]->type == csv_string )
                        table->cur->record[f] = malloc1 );
        }
        table->cur->next = NULL;
}

As for the DELETE function, it is simply an implementation of the standard method for removing a node from a linked list. I figured a function that simply deletes the current record and then moves to the one before it would be easier to implement as well as use than one that follows the SQL DELETE command to the letter.


// Equivalent to DELETE FROM in SQL except it deletes the
// current record rather than those matching a condition
void csv_delete_current_recordcsv_table *table ){
        csv_record *tmp1;
        csv_record *tmp2;
        int f;
        tmp1 = table->start;
        tmp2 = NULL;
        while( tmp1->next != table->cur )
                tmp1 = tmp1->next;
        if( table->cur->next )
                tmp2 = table->cur->next;
        for( f = 0; f < table->rlen; f++ )
                free( table->cur->record[f] );
        free( table->cur );
        if( tmp2 )
                tmp1->next = tmp2;
        else
                tmp1->next = NULL;
        table->cur = tmp1;
}

Finally we have the setter and getter functions. These can be thought of as being equivalent to SQL’s UPDATE and SELECT commands provided we’re only selecting a single field from a single record. Selecting an entire subset of a table like the actual SELECT command does requires the implementation of an entire new set data type, and is thus beyond the scope of this article (though I am currently working on implementing this functionality). With that said, the setter and getter functions are as follows:


// Used to retrieve numeric values from the table
dfloat64_t *csv_get_number_field_by_namecsv_table *table, char *name ){
        dfloat64_t *df;
        int f;
        for( f = 0; f < table->rlen; f++ ){
                if( !strcmp( table->header[f]->name, name ) )
                        break;
        }
        if( f == table->rlen )
        // Error: Name not found
                return NULL;
        if( table->header[f]->type != csv_number )
        // Error: Type mismatch
                return NULL;
        df = (dfloat64_t *) mallocsizeofdfloat64_t ) );
        memcpy( df, table->cur->record[f], sizeofdfloat64_t ) );
        return df;
}

// Used to retrieve numeric values from the table
dfloat64_t *csv_get_number_field_by_indexcsv_table *table, int index ){
        dfloat64_t *df;
        if( index < 0 || index >= table->rlen )
        // Error: Out-of-bounds
                return NULL;
        if( table->header[index]->type != csv_number )
        // Error: Type mismatch
                return NULL;
        df = (dfloat64_t *) mallocsizeofdfloat64_t ) );
        memcpy( df, table->cur->record[index], sizeofdfloat64_t ) );
        return df;
}

// Used to retrieve string values from the table 
char *csv_get_string_field_by_namecsv_table *table, char *name ){
        char *str;
        int f;
        int len;
        for( f = 0; f < table->rlen; f++ ){
                if( !strcmp( table->header[f]->name, name ) )
                        break;
        }
        if( f == table->rlen )
        // Error: Name not found
                return NULL;
        if( table->header[f]->type != csv_string )
        // Error: Type mismatch
                return NULL;
        len = strlen( table->cur->record[f] );
        str = (char *) malloc( len + 1 );
        strncpy( str, table->cur->record[f], len + 1 );
        return str;
}

// Used to retrieve string values from the table 
char *csv_get_string_field_by_indexcsv_table *table, int index ){
        char *str;
        int len;
        if( index < 0 || index >= table->rlen )
        // Error: Out-of-bounds
                return NULL;
        if( table->header[index]->type != csv_string )
        // Error: Type mismatch
                return NULL;
        len = strlen( table->cur->record[index] );
        str = (char *) malloc( len + 1 );
        strncpy( str, table->cur->record[index], len + 1 );
        return str;
}

// Used to set the value of a number field
void csv_set_number_field_by_namecsv_table *table, char *name, dfloat64_t *val ){
        int f;
        for( f = 0; f < table->rlen; f++ ){
                if( !strcmp( table->header[f]->name, name ) )
                        break;
        }
        if( f == table->rlen )
        // Error: Name not found
                return;
        if( table->header[f]->type != csv_number )
        // Error: Type mismatch
                return;
        memcpy( table->cur->record[f], val, sizeofdfloat64_t ) );
}

// Used to set the value of a number field
void csv_set_number_field_by_indexcsv_table *table, int index, dfloat64_t *val ){
        if( index < 0 || index >= table->rlen )
        // Error: Out-of-bounds
                return;
        if( table->header[index]->type != csv_number )
        // Error: Type mismatch
                return;
        memcpy( table->cur->record[index], val, sizeofdfloat64_t ) );
}

// Used to set the value of a string field
void csv_set_string_field_by_namecsv_table *table, char *name, char *val ){
        int f;
        int len;
        for( f = 0; f < table->rlen; f++ ){
                if( !strcmp( table->header[f]->name, name ) )
                        break;
        }
        if( f == table->rlen )
        // Error: Name not found
                return;
        if( table->header[f]->type != csv_string )
        // Error: Type mismatch
                return;
        len = strlen( val );
        table->cur->record[f] = realloc( table->cur->record[f], len + 1 );
        strncpy( table->cur->record[f], val, len + 1 );
}

// Used to set the value of a string field
void csv_set_string_field_by_indexcsv_table *table, int index, char *val ){
        int len;
        if( index < 0 || index >= table->rlen )
        // Error: Out-of-bounds
                return;
        if( table->header[index]->type != csv_string )
        // Error: Type mismatch
                return;
        len = strlen( val );
        table->cur->record[index] = realloc( table->cur->record[index], len + 1 );
        strncpy( table->cur->record[index], val, len + 1 );
}

What you have seen in this article constitutes csv_table.c, which is one of the C modules in my CSV library (called libcsv for obvious reasons). You can see this module, as well as all the other files comprising libcsv, in my libcsv repository on GitHub.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s