Data Science in C: Programming a Turing Machine to Parse CSV Code

Okay, so maybe I ripped my featured image from the Hack-A-Day site, but that image of a personification of a Turing machine as an intelligent robot was too relevant to the topic of this post to pass up, so sue me. 😛 I want to talk about the first step to building data science or machine learning programs in C, which is to have a way to read the serialized data from a CSV file and convert it into a form that is usable by the algorithm. All AI programming is based on data – that’s literally all AI is: finding patterns in data. So the conversion of a textual representation of the data into actual data structures that a machine learning algorithm can process is a crucial first step for any data science application.

In the C program I’m about to describe, we are going to be employing a Turing machine – a classic model of computation invented by Alan Turing and described in the famous Church-Turing thesis, which was pretty much the foundation of modern computer science. The Turing machine is used to validate the CSV code before we actually interpret it. So there are two steps here – first validate the CSV to make sure it’s well-formed, and then convert the CSV into a data structure that we can use in our program. Validating the input file is a very important step because it allows us to make certain assumptions about the code, which makes the actual interpretation step a hell of a lot easier.

Before we begin, I should probably give some background – a quick primer on the theory of computation for those of you who aren’t familiar with the concepts I will be using. Since my blog is geared towards computer people in general and not necessarily those with an academic background, I feel this would be prudent. (If you’re already familiar with the theory of computation, feel free to skip this and the next three paragraphs.) There are two concepts that are pretty central to the interpretation of any sort of code – formal languages and automata. Formal languages are theoretical models of the languages understood by computers. Through formal grammars, the formal language model provides very rigid grammatical rules that specify a language precisely enough that a dumb computer can understand it. The way we determine if a program or file written in a language is valid is by running it through an automaton. Automata (also called state machines) are basically mathematical computers. There are two broad classes of automata – acceptors, which simply validate code or strings in a given language, and transducers, which actually produce their own output strings. At any given time, an automaton can be in one of several predefined states. An automaton consumes characters from the input string one at a time and may transition into one of several other states depending on the input character.

If all this seems confusing, don’t be discouraged. It will become clearer as I lead into some concrete examples from my own CSV parser. I will also provide some additional reading materials at the end, for those who want to educate themselves on some of the theory behind what I’m doing in order to understand it better.

One more thing I need to explain is that there is a fairly rigid mathematical hierarchy of languages and automata (this is commonly known as the Chomsky hierarchy). This hierarchy ranks languages and automata in terms of their power and complexity. The simplest languages in our discussion are regular languages. Examples of regular languages are the set of all valid floating point literals and the set of all valid (in terms of structure) email addresses. Higher-level programming languages like Javascript and PHP typically parse regular languages using regular expressions, which most programmers are probably familiar with. Internally, a regular expression uses a rather simple form of automaton called a deterministic finite acceptor, or DFA (also called a finite state machine or FSM). DFAs are the most basic, bare-bones automata, consisting of a set of states and state transitions and not much else.

The next level of complexity after regular languages is context-free languages, which are recognized by pushdown automata. Pushdown automata are like DFAs, except in addition to the basic states and state transitions, they also have their own stack. Context-free languages are specified using context-free grammars, usually expressed in the Backus-Naur Form (BNF), which you will see an example of later. Most programming languages are context-free languages if we ignore things like type-checking and other semantic rules. If we want to enforce these semantic rules, we have to move on to the next class of languages – context-sensitive languages, so called because we now have to pay attention to the context of different language elements. To validate a context-sensitive language, we would need a Turing machine, which I will describe in more detail later.

Because CSV records consist of discrete numbers and text strings separated by commas, the automaton I’m using can easily be broken down into smaller parts that are simpler to diagram and work with. It turns out both the number fields and string fields in CSV are amenable to their own finite automata, which can be seen as smaller computational units within the larger automaton that parses the entire CSV file. I have designed two DFAs, one for parsing string literals and the other for parsing numeric literals. Here is the state diagram for the string DFA:

Finite automaton for parsing a CSV string

And here is the state diagram for the number DFA:

Finite automaton for parsing CSV numbers

As I said, the automaton consumes one character or input symbol at a time and transitions to another state depending on that input. If it is in one of the final states when the end of the string is reached (these are the states represented by double circles) the string is accepted. If it is in a nonfinal state, the string is rejected. In the state diagrams, I used my own metacharacters to stand for different sets of characters: the # character stands for any numerical digit, and the * character stands for any character not included in the other state transitions. (I had to do these state diagrams directly in SVG BTW, because no program I could find would do them in the format I wanted. They all used rectangular boxes for the states for some reason, which confuses it with an entity-relationship diagram. Go figure.)

It’s worth pointing out here that the start states in both these automata actually translate to the same state in the larger Turing machine. This is the Master state; it is in fact the start state for the Turing machine, and it branches into either the number DFA or the string DFA depending on the input parameters. The final states of these two sub-automata have transitions back to both the Master state (if between two fields) or to the final state for the entire Turing machine (if the end of a valid record is reached). You’ll be able to see exactly how this all comes together when I show the C code that implements the Turing machine later on in this article.

For my CSV parser, I want to enforce certain rules about the structure of the CSV file, so that I know it will be in a form that can easily be turned into a data structure by the CSV interpreter. One of these rules is that all records must have the same number of fields. Another is that corresponding fields in different records should all have the same data type. So if the first field in one record is a number, all other records must have a number for their first fields as well (the exception to this is the optional header, which I will get to in a bit). The two rules I just listed are context-sensitive in nature. They require the automaton to remember structural and semantic details of earlier parts of the CSV file to determine if the current symbol or record is valid. Thus, CSV is a context-sensitive language, and this is why a Turing machine is required to parse it.

If we want to know how to parse any language, we need a precise enumeration of all the rules we want to enforce. The rules for CSV are as follows:

  • CSV files consist of one or more records, separated by line breaks.
  • CSV records consist of one or more fields, separated by commas.
  • There are two types of fields: Number and String.
  • Number is a sequence of decimal digits with an optional minus sign at the beginning and an optional decimal point in the middle.
  • String is any sequence of non-quote, non-newline ASCII characters inside double quote symbols.
  • A header is optional; if present it is the first record of the CSV file and must contain all String fields.
  • All records must include the same number of fields.
  • All records except for the optional header must have the same types for corresponding fields.
  • Blank lines are not allowed.

We can express the context-free rules for CSV in Backus-Naur form as follows:


<CSV-file> ::= [<header>]<newline><record>(<newline><record>)*[<newline>]
<newline> ::= [\r]\n
<header> ::= <String>[,<String>]*
<record> ::= <field>[,<field>]*
<field> ::= <Number> | <String>
<Number> ::= [-]<digit><digit>*[.<digit><digit>*]
<String> ::= "<ASCII>*"
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<ASCII> ::= <All ASCII characters minus " and \n>

(See the links at the end of the article for more information about BNF.)

If we simply wanted to enforce these context-free grammatical rules, we could just use a pushdown automaton to parse the CSV file. But we also want to enforce the two additional rules mentioned above, that is, consistency in the number and types of fields for all CSV records. And this of course leads to the long-awaited discussion of the Turing machine model.

Conceptually, a Turing machine consists of a read/write head moving up and down an infinitely long tape with symbols written on each square of the tape. The symbols on the tape are from the tape alphabet, which may be different from the alphabet of input symbols. A Turing machine has states like a DFA, and these states allow it to validate context-free aspects of the input string, but the tape serves as an additional form of memory that helps the Turing machine keep track of important details about earlier parts of the string, such as, in this case, how many fields there are supposed to be in a record and what data type each field should be. The Turing machine I’ve written gets this information when it reads the optional header and the first non-header record. It therefore writes symbols to the tape to remember what the first record looked like, and then reads the tape when reading subsequent records to make sure they have the same format.

In my Turing machine implementation, the tape is represented as a doubly linked list, with each node being a struct that represents a single square on the tape. This struct contains an enumeration type for the symbol written to that square (the enum declaration for this type is just a list of all the symbols in the tape alphabet) as well as pointers to the nodes immediately to the right and to the left of it on the list. The read/write head is represented by a pointer that moves up and down this linked list the same way the head moves up and down the tape. I have defined the states for this Turing machine, as well as macros for the mechanical operations of the Turing machine, which include moving right by one unit, moving left by one unit, reading the current tape symbol below the head, and writing a symbol to the current location on the tape. The tape structure, tape alphabet, mechanical operations, and complete set of states for my Turing machine are specified in the following code:


// Tape alphabet for Turing machine
enum symbols { STRINGNUMBERBLANKSTART1START2STOPSTRING_STOPNUMBER_STOP };
// STRING and NUMBER indicate String and Number fields
// STRING_STOP and NUMBER_STOP indicate last field in a record
// BLANK is a temporary placeholder symbol written when parsing the header
// START1 and START2 are written to the left end of the tape
// START1 indicates that the first non-header record has not yet been read
// STOP is written to the unit corresponding to the last field when parsing the header

// Unit for tape used by Turing machine
struct tape {
        enum symbols symbol;
        struct tape *right;
        struct tape *left;
};

// States for Turing machine
#define MASTER   0 // Start state for whole machine, also entered from final states of Number and String DFAs when a comma is encountered
#define FINAL    1 // Final state for whole machine, entered from final states of Number and String DFAs when end of line is reached
#define TRAP     2 // Trap state, entered when an invalid character is encountered
#define END_HDR  3 // State for when the end of the optional header is reached
#define N_MINUS  4 // Accounts for optional minus sign at the beginning of a number
#define N_BEFORE 5 // Final state for Number, entered from MASTER state when digit is encountered and Number field is indicated
#define N_POINT  6 // Entered when decimal point is encountered
#define N_AFTER  7 // Final state for Number, entered from N_POINT when a digit is encountered
#define S_TEXT   8 // Nonfinal state for string, entered from MASTER state when start quote is encountered and String field is indicated
#define S_FINAL  9 // Final state for String, entered when end quote is encountered

// Turing machine actions
#define move_head_left( head ) head = head->left
#define move_head_right( head ) head = head->right
#define head_read( head ) head->symbol
#define head_write( head, s ) head->symbol = s

The only thing missing now is a function that implements the actual execution of the Turing machine program – a function that reads characters one-by-one from the input file and performs all the necessary state transitions, head movements, and tape reads/writes needed to validate CSV code. The function that follows is a rather long one, but it is clearly commented and thus is fairly easy to read.


bool validateFILE *fp, bool has_header ){
        long pos;
        int c;
        int state;
        struct tape *head;
        pos = ftell( fp );
        rewind( fp );

        // Turing machine setup:
        state = MASTER;
        head = (struct tape *) calloc1sizeofstruct tape ) );
        head->symbol = START1;

        // Turing machine main loop:
        if( debug ) puts("-------------------------------------------------------------------------------" );
        while( (c = fgetc( fp )) != EOF ){
                if( debug ){
                        printf"Current state: %s\n", state_strings[state] );
                        printf"Input symbol: %c,0x%s%x\n", c, (c<0x10)?"0":"", c );
                        printf"Current tape symbol: %s\n", symbol_strings[head->symbol] );
                }
                if( state == TRAP ); // Do nothing
                else if( c == '\r' ); // Ignore carriage returns
                else if( head->right == NULL && head->symbol != STOP && head->symbol != STRING_STOP && head->symbol != NUMBER_STOP ){
                // True if reading the first record
                        if( has_header ){
                        // True if reading the header
                                if( state == MASTER ){
                                        if( c == '\"' ){
                                                head->right = (struct tape *) calloc1sizeofstruct tape ) );
                                                head->right->left = head;
                                                move_head_right( head );
                                                head_write( head, BLANK );
                                                state = S_TEXT;
                                        }
                                        else state = TRAP;
                                }
                                else if( state == S_TEXT ){
                                        if( c == '\"' ) state = S_FINAL;
                                        else if( c == '\n' ) state = TRAP;
                                }
                                else if( state == S_FINAL ){
                                        if( c == ',' ) state = MASTER;
                                        else if( c == '\n' ){
                                                head_write( head, STOP );
                                                state = END_HDR;
                                        }
                                        else state = TRAP;
                                }
                        }
                        else{
                        // True if reading the first record in a CSV file with no header
                                ifhead_read( head ) == START1 ){
                                // True if at the beginning of the first record
                                        head_write( head, START2 );
                                }
                                if( state == MASTER ){
                                // True if at the beginning of a field
                                        head->right = (struct tape *) calloc1sizeofstruct tape ) );
                                        head->right->left = head;
                                        move_head_right( head );
                                        ifisdigit( c ) ){
                                                head->symbol = NUMBER;
                                                state = N_BEFORE;
                                        }
                                        else if( c == '-' ){
                                                head->symbol = NUMBER;
                                                state = N_MINUS;
                                        }
                                        else if( c == '\"' ){
                                                head->symbol = STRING;
                                                state = S_TEXT;
                                        }
                                        else state = TRAP;
                                }
                                else if( c == '\n' && (state == N_BEFORE || state == N_AFTER || state == S_FINAL ) ){
                                // True if newline is encountered at the end of a valid field
                                        ifhead_read( head ) == NUMBER ) head_write( head, NUMBER_STOP );
                                        else ifhead_read( head ) == STRING ) head_write( head, STRING_STOP );
                                        state = FINAL;
                                }
                                else if( c == ',' && (state == N_BEFORE || state == N_AFTER || state == S_FINAL ) ){
                                // True if comma is encountered at the end of a valid field
                                        state = MASTER;
                                }
                                else if( state == N_AFTER ){
                                // True if in a number field after the decimal point but not at the end
                                        if( !isdigit( c ) ) state = TRAP;
                                }
                                else if( state == N_POINT ){
                                // True if just after the decimal point in a number field
                                        ifisdigit( c ) ) state = N_AFTER;
                                        else state = TRAP;
                                }
                                else if( state == N_BEFORE ){
                                // True if inside a number field where neither the decimal point nor the end of the field has been encountered
                                        if( c == '.' ) state = N_POINT;
                                        else if( !isdigit( c ) ) state = TRAP;
                                }
                                else if( state == N_MINUS ){
                                // True if at the beginning of a number field with a negative value
                                        ifisdigit( c ) ) state = N_BEFORE;
                                        else state = TRAP;
                                }
                                else if( state == S_FINAL ) state = TRAP;
                                // Any character other than comma or newline after the end quote is invalid
                                else if( state == S_TEXT ){
                                        if( c == '\"' ) state = S_FINAL;
                                        else if( c == '\n' ) state = TRAP;
                                }
                        }
                }
                else if( state == END_HDR && head_read( head ) == STOP ){
                // True if end of header has just been reached
                        whilehead_read( head ) != START1 ){
                                 move_head_left( head );
                        }
                        head_write( head, START2 );
                        move_head_right( head );
                        ifisdigit( c ) || c == '-' ){
                                ifhead_read( head ) == BLANK )
                                        head_write( head, NUMBER );
                                else ifhead_read( head ) == STOP )
                                        head_write( head, NUMBER_STOP );
                                ifisdigit( c ) ) state = N_BEFORE;
                                else state = N_MINUS;
                        }
                        else if( c == '\"' ){
                                ifhead_read( head ) == BLANK )
                                        head_write( head, STRING );
                                else ifhead_read( head ) == STOP )
                                        head_write( head, STRING_STOP );
                                state = S_TEXT;
                        }
                        else state = TRAP;
                }
                else if( state == MASTER && (head_read( head ) == BLANK || head_read( head ) == STOP) ){
                // True if entering a new field in the first non-header record in a CSV file with a header
                        ifisdigit( c ) || c == '-' ){
                                ifhead_read( head ) == BLANK )
                                        head_write( head, NUMBER );
                                else ifhead_read( head ) == STOP )
                                        head_write( head, NUMBER_STOP );
                                ifisdigit( c ) ) state = N_BEFORE;
                                else state = N_MINUS;
                        }
                        else if( c == '\"' ){
                                ifhead_read( head ) == BLANK )
                                        head_write( head, STRING );
                                else ifhead_read( head ) == STOP )
                                        head_write( head, STRING_STOP );
                                state = S_TEXT;
                        }
                        else state = TRAP;
                }
                else if( c == ',' && (state == N_BEFORE || state == N_AFTER || state == S_FINAL) ){
                // True if at the end of a valid field other than the last one in any non-header record
                        ifhead_read( head ) == STOP || head_read( head ) == NUMBER_STOP || head_read( head ) == STRING_STOP ){
                        // True if the record has too many fields
                                state = TRAP;
                        }
                        else{
                                move_head_right( head );
                                state = MASTER;
                        }
                }
                else if( c == '\n' && (state == N_BEFORE || state == N_AFTER || state == S_FINAL) ){
                // True if at the end of a valid last field in any non-header record
                        ifhead_read( head ) == NUMBER_STOP || head_read( head ) == STRING_STOP )
                                state = FINAL;
                        else state = TRAP;
                }
                else if( state == N_AFTER ){
                // True if in a number field after the decimal point but not at the end
                        if( !isdigit( c ) ) state = TRAP;
                }
                else if( state == N_POINT ){
                // True if just after the decimal point in a number field
                        ifisdigit( c ) ) state = N_AFTER;
                        else state = TRAP;
                }
                else if( state == N_BEFORE ){
                // True if inside a number field where neither the decimal point nor the end of the field has been encountered
                        if( c == '.' ) state = N_POINT;
                        else if( !isdigit( c ) ) state = TRAP;
                }
                else if( state == N_MINUS ){
                        ifisdigit( c ) ) state = N_BEFORE;
                        else state = TRAP;
                }
                else if( state == S_FINAL ) state = TRAP;
                // Any character other than comma or newline after the end quote is invalid
                else if( state == S_TEXT ){
                // True if inside quotes
                        if( c == '\"' ) state = S_FINAL;
                        else if( c == '\n' ) state = TRAP;
                }
                else if( state == MASTER ){
                // True if entering any field in any record other than the header or the first non-header record
                        ifhead_read( head ) == NUMBER || head_read( head ) == NUMBER_STOP ){
                                ifisdigit( c ) ) state = N_BEFORE;
                                else if( c == '-' ) state = N_MINUS;
                                else state = TRAP;
                        }
                        else ifhead_read( head ) == STRING || head_read( head ) == STRING_STOP ){
                                if( c == '\"' ) state = S_TEXT;
                                else state = TRAP;
                        }
                }
                else if( state == FINAL ){
                // True if encountering another character after the end of a valid non-header record has been reached
                        whilehead_read( head ) != START2 ){
                                move_head_left( head );
                        }
                        move_head_right( head );
                        ifhead_read( head ) == NUMBER || head_read( head ) == NUMBER_STOP ){
                                ifisdigit( c ) ) state = N_BEFORE;
                                else if( c == '-' ) state = N_MINUS;
                                else state = TRAP;
                        }
                        else ifhead_read( head ) == STRING || head_read( head ) == STRING_STOP ){
                                if( c == '\"' ) state = S_TEXT;
                                else state = TRAP;
                        }
                }
                else{
                        fprintfstderr"Error: Turing machine fall-though.\nIf you are seeing this message, it means that there is a possible set\nof Turing machine parameters that the programmer failed to account for.\nPlease notify Michael Warren a.k.a. Psycho Cod3r using the email address\nlisted on his GitHub.\n" );
                        exit( -1 );
                }
                if( debug ){
                        printf"Next state: %s\n", state_strings[state] );
                        printf"Next tape symbol: %s\n", symbol_strings[head->symbol] );
                        putchar'\n' );
                        if( state == FINAL ){
                                puts"--------------------------------------------------------------------------------" );
                        }
                        else if( state == MASTER ){
                                puts"----------------------------------------" );
                        }
                        else{
                                puts"--------------------" );
                        }
                }
        }

        // Section to account for non-newline-terminated last line:
        if( (state == N_BEFORE || state == N_AFTER || state == S_FINAL) && (head_read( head ) == NUMBER_STOP || head_read( head ) == STRING_STOP) ){
                state = FINAL;
        }

        // Free tape structs:
        whilehead_read( head ) != START1 && head_read( head ) != START2 ){
                move_head_left( head );
        }
        while( head->right ){
                move_head_right( head );
                free( head->left );
        }
        free( head );

        fseek( fp, pos, SEEK_SET );
        return (state == FINAL);
}

This function takes two arguments – the file pointer for the CSV file, and a Boolean option that tells it whether the CSV file has a header. This is very important, because the program for parsing a CSV file with a header is different from the program for parsing a CSV file with no header. You might even say that CSV is two different languages, because the validity of a given string depends on whether the has_header option is set. The section of the Turing machine code that reads the first record addresses this disparity with separate blocks of code for a  header vs. a first record that isn’t a header.

The main part of this function is just a giant sequence of if/else-if/else statements that account for all the different combinations of input symbol, tape symbol, and current state. The Turing machine returns true if it’s in the final state when it gets to the end of the file. If it encounters an error anywhere, it immediately goes into the trap state, so called because there is no way to get out of this state. If it goes into this state at any point, it simply does nothing for the rest of the file. If the Turing machine is in any state other than the final state when it finishes parsing the string, that means the file is not valid CSV, so it returns false.

I wanted this validation function to have essentially no side effects, much like functions in purely functional languages like Haskell. Therefore I added a preamble section at the beginning of the function that saves the current file position before rewinding the file pointer. The function then restores the original file position before it exits.

You might notice that there are two other sections – one at the beginning of the loop and one at the end – that print debugging information for the Turing machine. This allows a user to trace the entire operation of the Turing machine and see exactly what it’s doing as it’s reading the CSV file. There’s actually quite a lot of preprocessing code at the beginning of the C file that sets up this optional debugging step, and this is code that I didn’t show previously. The preprocessing code is interleaved with the Turing machine definition like so:


bool debug =
#ifdef _DEBUG
        true;
#else
        false;
#endif

// Tape alphabet for Turing machine
enum symbols { STRINGNUMBERBLANKSTART1START2STOPSTRING_STOPNUMBER_STOP };
// STRING and NUMBER indicate String and Number fields
// STRING_STOP and NUMBER_STOP indicate last field in a record
// BLANK is a temporary placeholder symbol written when parsing the header
// START1 and START2 are written to the left end of the tape
// START1 indicates that the first non-header record has not yet been read
// STOP is written to the unit corresponding to the last field when parsing the header

char *symbol_strings[] =
#ifdef _DEBUG
        { "STRING""NUMBER""BLANK""START1""START2""STOP""STRING_STOP""NUMBER_STOP" };
#else
        { NULL };
#endif

// Unit for tape used by Turing machine
struct tape {
        enum symbols symbol;
        struct tape *right;
        struct tape *left;
};

// States for Turing machine
#define MASTER   0 // Start state for whole machine, also entered from final states of Number and String DFAs when a comma is encountered
#define FINAL    1 // Final state for whole machine, entered from final states of Number and String DFAs when end of line is reached
#define TRAP     2 // Trap state, entered when an invalid character is encountered
#define END_HDR  3 // State for when the end of the optional header is reached
#define N_MINUS  4 // Accounts for optional minus sign at the beginning of a number
#define N_BEFORE 5 // Final state for Number, entered from MASTER state when digit is encountered and Number field is indicated
#define N_POINT  6 // Entered when decimal point is encountered
#define N_AFTER  7 // Final state for Number, entered from N_POINT when a digit is encountered
#define S_TEXT   8 // Nonfinal state for string, entered from MASTER state when start quote is encountered and String field is indicated
#define S_FINAL  9 // Final state for String, entered when end quote is encountered

char *state_strings[] =
#ifdef _DEBUG
        { "MASTER""FINAL""TRAP""END_HDR""N_MINUS""N_BEFORE""N_POINT""N_AFTER""S_TEXT""S_FINAL" };
#else
        { NULL };
#endif

// Turing machine actions
#define move_head_left( head ) head = head->left
#define move_head_right( head ) head = head->right
#define head_read( head ) head->symbol
#define head_write( head, s ) head->symbol = s

To compile the program in debugging mode, you would define the _DEBUG macro at the command line. This triggers the optional debugging setup shown above. In both the Open Watcom compiler that I’m using, as well as the more common gcc tool for the Linux platform, you define optional macros with the -D option. So to do this in gcc, you would use a command like this:


$ gcc -o csv-parser csv-parser.c -D_DEBUG

There is one more section of code I want to share with respect to the CSV validator itself, and that is the code that calls validate() from the main() function. It looks like this:


if( !validate( fp, has_header ) ){
        fprintfstderr"Invalid CSV code.\n" );
        exit( -1 );
}
printf"CSV code is valid.\n" );

Now that we have all the code we need, let’s compile the CSV parser with the _DEBUG option set, and see what happens when we run a sample CSV file through the Turing machine. Here is the sample CSV file we will be using:


"F1","F2"
1,"Hi"
2,"Bye"

Feeding this into the Turing machine, we get the following output:


-------------------------------------------------------------------------------
Current state: MASTER
Input symbol: ",0x22
Current tape symbol: START1
Next state: S_TEXT
Next tape symbol: BLANK

--------------------
Current state: S_TEXT
Input symbol: F,0x46
Current tape symbol: BLANK
Next state: S_TEXT
Next tape symbol: BLANK

--------------------
Current state: S_TEXT
Input symbol: 1,0x31
Current tape symbol: BLANK
Next state: S_TEXT
Next tape symbol: BLANK

--------------------
Current state: S_TEXT
Input symbol: ",0x22
Current tape symbol: BLANK
Next state: S_FINAL
Next tape symbol: BLANK

--------------------
Current state: S_FINAL
Input symbol: ,,0x2c
Current tape symbol: BLANK
Next state: MASTER
Next tape symbol: BLANK

----------------------------------------
Current state: MASTER
Input symbol: ",0x22
Current tape symbol: BLANK
Next state: S_TEXT
Next tape symbol: BLANK

--------------------
Current state: S_TEXT
Input symbol: F,0x46
Current tape symbol: BLANK
Next state: S_TEXT
Next tape symbol: BLANK

--------------------
Current state: S_TEXT
Input symbol: 2,0x32
Current tape symbol: BLANK
Next state: S_TEXT
Next tape symbol: BLANK

--------------------
Current state: S_TEXT
Input symbol: ",0x22
Current tape symbol: BLANK
Next state: S_FINAL
Next tape symbol: BLANK

--------------------
Current state: S_FINAL
Input symbol: 
,0x0a
Current tape symbol: BLANK
Next state: END_HDR
Next tape symbol: STOP

--------------------
Current state: END_HDR
Input symbol: 1,0x31
Current tape symbol: STOP
Next state: N_BEFORE
Next tape symbol: NUMBER

--------------------
Current state: N_BEFORE
Input symbol: ,,0x2c
Current tape symbol: NUMBER
Next state: MASTER
Next tape symbol: STOP

----------------------------------------
Current state: MASTER
Input symbol: ",0x22
Current tape symbol: STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: H,0x48
Current tape symbol: STRING_STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: i,0x69
Current tape symbol: STRING_STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: ",0x22
Current tape symbol: STRING_STOP
Next state: S_FINAL
Next tape symbol: STRING_STOP

--------------------
Current state: S_FINAL
Input symbol: 
,0x0a
Current tape symbol: STRING_STOP
Next state: FINAL
Next tape symbol: STRING_STOP

--------------------------------------------------------------------------------
Current state: FINAL
Input symbol: 2,0x32
Current tape symbol: STRING_STOP
Next state: N_BEFORE
Next tape symbol: NUMBER

--------------------
Current state: N_BEFORE
Input symbol: ,,0x2c
Current tape symbol: NUMBER
Next state: MASTER
Next tape symbol: STRING_STOP

----------------------------------------
Current state: MASTER
Input symbol: ",0x22
Current tape symbol: STRING_STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: B,0x42
Current tape symbol: STRING_STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: y,0x79
Current tape symbol: STRING_STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: e,0x65
Current tape symbol: STRING_STOP
Next state: S_TEXT
Next tape symbol: STRING_STOP

--------------------
Current state: S_TEXT
Input symbol: ",0x22
Current tape symbol: STRING_STOP
Next state: S_FINAL
Next tape symbol: STRING_STOP

--------------------
Current state: S_FINAL
Input symbol: 
,0x0a
Current tape symbol: STRING_STOP
Next state: FINAL
Next tape symbol: STRING_STOP

--------------------------------------------------------------------------------
CSV code is valid.

Hopefully this example will help you understand what the Turing machine is doing under the hood. If you want to see more, I do plan on creating a GitHub and uploading this project there sometime soon, so you’ll be able to download the CSV parser I’ve written and experiment with it yourself.

Now that we’ve enforced a rigid structure for the CSV file, we can much more easily interpret it and deserialize it into an abstract data structure readable by a statistical algorithm. First we need to define a few more data types to keep track of all the information we need. The first type encapsulates all the metadata information for each field, the second type encapsulates a single record, and the third type is a handle through which we access the entire table structure. The table is implemented as a linked list of record structs.


enum types { numberstring };

// Metadata field for the table header
struct csv_field {
        char *name;
        enum types type;
};

// Single record in linked list table structure
struct csv_record {
        void **record;
        struct csv_record *next;
};

// Handle for abstract table structure
struct csv_table {
        int rlen;                  // # of fields in each record
        struct csv_field **header; // Table metadata
        struct csv_record *start;  // Pointer to first record
        struct csv_record *cur;    // Pointer to current record
};

The CSV interpreter uses a three-step process to build the metadata for the table. This metadata is very important, because without it, we have no way of knowing the data types of the fields we are accessing, among other vital information. The steps are as follows: 1. determine the number of fields in each record, 2. determine the name of each field (using generic names if there is no header), and 3. determine the type of each field. These steps are performed using finite automata, as before, except this time we are using transducers rather than acceptors. The specific output symbols generated by these transducers give us the metadata information we need. There are three automata used in this function: a field counter, a type determiner, and a field separator.

The field counter has two states: IN and OUT, which simply keep track of whether it’s inside or outside a string. If it encounters a comma while in the OUT state, it increments the line count. Theoretically, this can be thought of as counting using a unary notation, where the output alphabet consists of only a tally mark, and the transducer outputs another tally mark every time it encounters a comma while in the OUT state.

The type determiner also has an IN state and an OUT state, but in addition to those two it has a START state, which is the state it’s in when reading the first character in a field. Upon reading the first character (which it uses to determine the data type), it transitions immediately into either IN or OUT depending on whether the data type is string or number respectively, and it outputs a symbol from its output alphabet, which really only has two symbols: string and number. The output for this transducer of course goes to the field struct corresponding to the current field.

The field separator is used to isolate the field strings so they can be read more easily. This automaton is used in two places: once for the header and once for the non-header records. Like the other two automata it transitions between the IN and OUT states, except the output alphabet is the same as the input alphabet, plus an additional null character, and it produces an output symbol with every state transition. This output symbol is either the same symbol that was input, or a null character in the case of a quote or newline or a comma encountered outside a string field, and it is written back to the same string that the input symbols are read from. By overwriting the quote, comma, and newline characters with nulls, we can effectively separate the fields into their own tokens. This is in fact how the strtok() function in string.h works, and I could have used that function here if not for the fact that it would trip over any string that has a comma inside of it.

Now that I’ve described the automata for building the metadata, we can move on to the code. First we need to define the states for these three automata:


// States for field counter
#define START 0
#define IN    1
#define OUT   2

The interpreter function goes through the three steps described above, then goes to the beginning of the first non-header record in the CSV file and starts reading one line at a time. After reading a line, it employs the field separator automaton to isolate the fields of that CSV record. It then dynamically allocates memory for a new record structure and loops through the newly separated strings, reading each one into another dynamically allocated area of memory corresponding to that field using either atof() or strncpy() depending on the type read from the corresponding field of the header (I used memcpy() to copy the number value from an auxiliary double variable since you can’t directly perform numerical operations on a variable that’s accessed through a void pointer). Once it reaches the end of the CSV file, the interpreter rewinds the cur pointer to the beginning of the table and returns the handle for the table. All of this is accomplished with the following code:


struct csv_table *read_tableFILE *fp, bool has_header ){
        long pos;
        double tmpf;
        struct csv_table *table;
        int i, f, c, x, state, len, size;
        char *buf = NULL;
        char *ptr = NULL;
        pos = ftell( fp );
        rewind( fp );
        table = (struct csv_table *) mallocsizeofstruct csv_table ) );

        // Field-counting automaton:
        table->rlen = 1;
        state = OUT;
        read_line( buf, size, fp );
        len = strlen( buf );
        for( i = 0; i < len; i++ ){
                if( state == OUT ){
                        if( buf[i] == '\"' ) state = IN;
                        else if( buf[i] == ',' ) (table->rlen)++;
                }
                else if( state == IN && buf[i] == '\"' ) state = OUT;
        }

        table->header = (struct csv_field **) calloc( table->rlen, sizeofstruct csv_field * ) );
        for( f = 0; f < table->rlen; f++ ){
                table->header[f] = (struct csv_field *) mallocsizeofstruct csv_field ) );
        }

        // Determine field names:
        if( has_header ){
                // Isolate field strings:
                state = OUT;
                for( i = 0; i < len; i++ ){
                        if( state == OUT ){
                                if( buf[i] == '\"' ) state = IN;
                                else if( buf[i] == ',' ) buf[i] = '\0';
                        }
                        else if( state == IN && buf[i] == '\"' ) state = OUT;
                        if( buf[i] == '\"' || buf[i] == '\r' || buf[i] == '\n' ) buf[i] = '\0';
                }
                // Read field strings:
                ptr = buf + 1;
                for( f = 0; f < table->rlen; f++ ){
                        len = strlen( ptr );
                        table->header[f]->name = (char *) malloc( len + 1 );
                        strncpy( table->header[f]->name, ptr, len + 1 );
                        ptr += len;
                        while( ptr[0] == '\0' ) ptr++;
                }
        }
        else{
                for( f = 0; f < table->rlen; f++ ){
                        // Format for field names will be "x<number>"
                        x = ceillog( (double) i ) / log10.0 ) ) + 2;
                        table->header[f]->name = (char *) malloc( x );
                        snprintf( table->header[f]->name, x, "x%d\0", f );
                }
        }

        // Automaton to determine types of fields:
        if( !has_header ) rewind( fp );
        state = START;
        read_line( buf, size, fp );
        len = strlen( buf );
        f = 0;
        for( i = 0; i < len; i++ ){
                if( state == START ){
                        if( buf[i] == '\"' ){
                                state = IN;
                                table->header[f]->type = string;
                        }
                        else{
                                state = OUT;
                                table->header[f]->type = number;
                        }
                        f++;
                }
                if( state == IN ){
                        if( buf[i] == '\"' )
                                state = OUT;
                }
                if( state == OUT ){
                        if( buf[i] == ',' )
                                state = START;  
                }
        }

        // Code to build the table structure:
        rewind( fp );
        table->start = (struct csv_record *) calloc1sizeofstruct csv_record ) );
        table->cur = table->start;
        if( has_header ) read_line( buf, size, fp );
        while( (c = fgetc( fp )) != EOF ){
                table->cur->next = (struct csv_record *) calloc1sizeofstruct csv_record ) );
                table->cur = table->cur->next;
                table->cur->record = (void **) calloc( table->rlen, sizeofvoid * ) );
                ungetc( c, fp );
                read_line( buf, size, fp );
                len = strlen( buf );
                // Isolate field strings:
                state = OUT;
                for( i = 0; i < len; i++ ){
                        if( state == OUT ){
                                if( buf[i] == '\"' ) state = IN;
                                else if( buf[i] == ',' ) buf[i] = '\0';
                        }
                        else if( state == IN && buf[i] == '\"' ) state = OUT;
                        if( buf[i] == '\"' || buf[i] == '\r' || buf[i] == '\n' ) buf[i] = '\0';
                }
                // Read field strings:
                ptr = buf;
                while( ptr[0] == '\0' ) ptr++;
                for( f = 0; f < table->rlen; f++ ){
                        len = strlen( ptr );
                        if( table->header[f]->type == string ){
                                table->cur->record[f] = malloc( len + 1 );
                                strncpy( table->cur->record[f], ptr, len + 1 );
                        }
                        else if( table->header[f]->type == number ){
                                table->cur->record[f] = mallocsizeofdouble ) );
                                tmpf = atof( ptr );
                                memcpy( table->cur->record[f], &tmpf, sizeofdouble ) );
                        }
                        ptr += len;
                        while( ptr[0] == '\0' ) ptr++;
                }
        }
        table->cur = table->start;

        fseek( fp, pos, SEEK_SET );
        return table;
}

Although this function is still fairly complex, it is only half the size of the validator function, and it’s also significantly smaller than it would have been had we not validated the CSV file beforehand. As an added bonus it’s fairly modular and easily lends itself to being debugged in smaller, more manageable chunks. This relative simplicity and elegance comes from the fact that validating the input file gives us license to assume that the code is well-formed and conforms to a certain predictable structure. If it doesn’t, the program won’t make it this far, so it’s an absolute guarantee that the CSV interpreter will run successfully. If we combined the validator and interpreter into one step, the code would be a lot messier, longer, and more error-prone, and it would be much harder to read, understand, modify, maintain, and debug.

Oh, and that read_line() function is actually a macro that I defined towards the beginning of the program. You might remember it from my previous post on debugging with a disassembler. Its code looks like this:


// Used in case the last line of the
// file is not newline-terminated
bool is_eofFILE *fp ){
        int c;
        bool end;
        end = ((c = fgetc( fp )) == EOF );
        ungetc( c, fp );
        return end;
}

// Read a single line from the file
#define read_line( buf, size, fp )\
        if( buf ) free( buf );\
        size = 64;\
        buf = (char *) malloc( size );\
        fgets( buf, size, fp );\
        while( buf[strlen( buf ) - 1] != '\n' && !is_eof( fp ) ){\
                size <<= 1;\
                buf = (char *) realloc( buf, size );\
                fgets( buf + (size >> 1) - 1, size >> 1, fp );\
        }\
        size = 64

Well, that’s about all for now. Obviously there’s tremendous potential for what I could do with this CSV parser in the future. I plan to write an entire C library for working with CSV files, which I will publish on my GitHub at some point in the near future. This will also serve as a basis for the general machine learning and deep learning libraries I’m writing. Overall, I’m very excited about this and can’t wait to get these projects off the ground. For now, farewell and happy coding!

Also, references:

Formal languages – Wikipedia
Finite state machines – Wikipedia
State diagrams – Wikipedia
Turing machines – Wikipedia
Chomsky hierarchy – Wikipedia
Backus-Naur form – Wikipedia

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s