libdfloat: A C Library for Exact Representation of Decimal Floating Point Numbers

Guys, something really awesome just happened!… I actually finished one of my large-scale coding projects! I’d like to introduce you to libdfloat, a C library for representing decimal numbers without any rounding errors.

I started this project as an offshoot of my CSV library when I realized that there was a need for a mechanism for reading data from a CSV file, processing it, and then writing it back as it was before. To illustrate this need, let’s say you have the number 1.2 written to a CSV record. A data analysis program reads the CSV table into a table structure using atof() and strcpy(), as was shown in this article from last week, and it processes the data in the table without modifying that one field. So when it writes the data back, it should write back 1.2, right? Problem is, 1.2 can’t be represented exactly in binary notation, so what will happen is the number will get truncated when it is converted from decimal to binary and then it will get truncated again when it is converted back to decimal. We want a guarantee that we won’t lose precision when we read the numbers from and write the numbers back to the CSV file, and if we are converting between binary and decimal formats, we don’t have this guarantee. So we have to implement a new type that allows decimal numbers to be represented exactly within the program.

The way I see it, there are basically two ways to implement a decimal floating point type: One is to use binary-coded decimal format (BCD) where each decimal digit is represented by four bits of a byte. I realized this would be rather slow and cumbersome as digits would have to be packed and unpacked from the containing bytes, and I would also have to completely re-implement all arithmetic operations from the ground up. So I discarded this idea and instead used a solution that is halfway between the extremes of BCD and pure binary: I used a mantissa and exponent that are both technically represented in pure binary, but synthesized the decimal operations on these magnitudes using precise calculations with exponents and logarithms.

I have specified four types for use in decimal floating point arithmetic, based on four different widths. They are defined as follows:


#include <stdint.h>

typedef struct {
        int8_t mantissa;
        int8_t exponent;
} dfloat16_t;

typedef struct {
        int16_t mantissa;
        int16_t exponent;
} dfloat32_t;

typedef struct {
        int32_t mantissa;
        int32_t exponent;
} dfloat64_t;

typedef struct {
        int64_t mantissa;
        int64_t exponent;
} dfloat128_t;

This decimal floating point type has basically the same internal implementation as binary floating point types in the IEEE-754 standard, which specifies four components: mantissa, exponent, sign of the mantissa, and sign of the exponent. In my implementation, the mantissa and mantissa sign are combined into one value, as are the exponent and exponent sign, simply by using signed values for the mantissa and exponent fields. The mantissa is just a binary representation of the number represented by all the decimal digits of the number from the most significant to the least significant figures, multiplied by such an exponent that there are no nonzero digits after the decimal point as well as no trailing zeroes before the decimal point. Put more succinctly, the exponent is such that the mantissa is an integer that is not a multiple of 10. The exponent gives the power of 10 you would have to multiply the mantissa by to get the number represented by the dfloat type. This may all seem rather convoluted on paper, but it’s actually the simplest way to express a decimal floating point number exactly, since all we need are two signed integers of the same width.

Since the goal here is to be able to read and write floating point numbers to/from text files, the initialization of the corresponding variables should be done from a text string, and there is little need to declare these variables directly. In fact doing so wouldn’t make sense because there is no way to specify a new literal format for a user-defined type except by using strings. So the first thing to do is create a decimal equivalent of the atof() function.

The function definition that follows looks rather odd because it’s actually a macro for four function definitions (one for each width/type). It uses the arguments small and big, representing the size of the two dfloat components and the size of the dfloat itself, respectively. These arguments are used to generate the symbols for the corresponding dfloat types and fixed-width integer types through token-pasting – a C preprocessor feature that is not used that much, but that I’ve now found a good niche use for, employing it heavily in the implementation of this library. As for the function itself, there’s nothing particularly remarkable about its implementation, as it simply reads the number in the most common-sense fashion that occurred to me. (From now on I will simply refer to a group of functions defined by one of these macros as a “function”, with the understanding that this “function” in fact encompasses several functions.)


#define dfloatN_atof( small, big )\
dfloat ## big ## _t *dfloat ## big ## _atof( char *src ){\
        int i, len;\
        int point = 0;\
        char *cpy;\
        dfloat ## big ## _t *dst;\
        len = strlen( src );\
        cpy = (char *) malloc( len+1 );\
        strncpy( cpy, src, len+1 );\
        dst = (dfloat ## big ## _t *) malloc( sizeof( dfloat ## big ## _t ) );\
        dst->exponent = 0;\
\
        /* Search for decimal point: */\
        for( i = 0; i < len; i++ ){\
                if( cpy[i] == '.' ) point = i;\
        }\
\
        /* Determine exponent: */\
        i = len;\
        while( cpy[--i] == '0' );\
        if( !point ){\
                dst->exponent = len - i - 1;\
        }\
        else if( i == point ){\
                while( cpy[--i] == '0' )\
                        dst->exponent++;\
        }\
        else{\
                dst->exponent = point - i;\
        }\
\
        /* Determine mantissa: */\
        if( point ){\
                for( i = point; i < len; i++ )\
                        cpy[i] = cpy[i+1];\
                len--;\
        }\
        i = len;\
        while( cpy[--i] == '0' )\
                cpy[i] = '\0';\
        dst->mantissa = (int ## small ## _t) atoi( cpy );\
\
        free( cpy );\
        return dst;\
}

dfloatN_atof( 8, 16 )
dfloatN_atof( 16, 32 )
dfloatN_atof( 32, 64 )
dfloatN_atof( 64, 128 )

The function for writing a dfloat back to a string is a lot more convoluted and difficult to understand and implement. This is due to the litany of different possibilities for combinations of mantissa and exponent out there. There are a number of different classes of such combinations, each demanding a different method for printing the value.


#define dfloatN_ftoa( small, big )\
char *dfloat ## big ## _ftoa( dfloat ## big ## _t *src ){\
        int ## small ## _t size1, size2;\
        int ## small ## _t whole_part, frac_part;\
        double shift_factor;\
        int ## small ## _t whole_magnitude, frac_magnitude;\
        int i;\
        int zeros, zeros_end;\
        char *buf;\
        size1 = ceil( log10( abs( src->mantissa ) ) );\
        if( abs( src->mantissa ) == 1 )\
        /* Accounts for exact powers of 10 */\
                size1++;\
        if( src->mantissa == 0 ){\
                buf = (char *) malloc( 2 );\
                sprintf( buf, "0\0" );\
        }\
        else if( src->exponent == 0 ){\
                /* This code simply prints the mantissa. */\
                size2 = size1 + 2;\
                buf = (char *) malloc( size2 );\
                sprintf( buf, "%d\0", src->mantissa );\
        }\
        else if( src->exponent > 0 ){\
                /* This code prints the mantissa and then a number of zeros */\
                /* equal to the exponent.                                   */\
                size2 = size1 + src->exponent;\
                buf = (char *) malloc( size2 + 2 );\
                sprintf( buf, "%d", src->mantissa );\
                if( src->mantissa < 0 ){\
                /* Accounts for a minus sign at the beginning */\
                        size1++;\
                        size2++;\
                }\
                for( i = size1; i < size2; i++ ){\
                        buf[i] = '0';\
                }\
                buf[size2] = '\0';\
        }\
        else if( src->exponent < 0 ){\
                /* This code shifts the fractional part out of the mantissa */\
                /* to get the whole part, then shifts in zeros and subtracts*/\
                /* from the original mantissa to get the fractional part.   */\
                shift_factor = pow( 10, src->exponent );\
                whole_part = abs( src->mantissa ) * shift_factor;\
                frac_part = abs( src->mantissa ) - whole_part / shift_factor;\
\
                whole_magnitude = whole_part?(ceil( log10( whole_part ) )):1;\
                frac_magnitude = ceil( log10( frac_part ) );\
\
                /* This code accounts for exact powers of 10. */\
                if( whole_magnitude == log10( whole_part ) )\
                        whole_magnitude++;\
                if( frac_magnitude == log10( frac_part ) )\
                        frac_magnitude++;\
\
                /* This code accounts for fractional parts less than 0.1 */\
                zeros = -(src->exponent) - frac_magnitude;\
\
                if( src->mantissa < 0 )\
                /* Add one character for the minus sign */\
                        whole_magnitude++;\
                size2 = whole_magnitude + frac_magnitude + 2;\
                buf = (char *) malloc( size2 );\
                sprintf( buf, "%s%d.", (src->mantissa < 0)?"-":"", whole_part );\
                i = whole_magnitude + 1;\
                if( zeros > 0 ){\
                        zeros_end = zeros + whole_magnitude + 1;\
                        for( i = whole_magnitude+ 1; i < zeros_end; i++ )\
                                buf[i] = '0';\
                }\
                sprintf( buf + i, "%d\0", frac_part );\
        }\
        return buf;\
}

dfloatN_ftoa( 8, 16 )
dfloatN_ftoa( 16, 32 )
dfloatN_ftoa( 32, 64 )
dfloatN_ftoa( 64, 128 )

Basically this function computes the number of decimal digits it needs for the number’s textual representation by taking its logarithm, then it goes into one of four branches depending on the value of the exponent. If the mantissa is zero, that means the number is zero, so it can just print “0” and skip over everything else. If not, it has a different branch for a zero exponent, positive exponent, or negative exponent. A zero exponent means you can simply print the mantissa as it is, because the value of the number is literally equal to its mantissa. A positive exponent means there are zeroes at the end, so the mantissa is printed first, followed by a number of zeroes equal to the exponent. The case of a negative exponent (which simply means there are digits past the decimal point) is the trickiest of all – it involves breaking the number down into its whole and fractional components and then printing each in turn, separated by a decimal point and however many leading zeros there are past the decimal point. I had a hell of a time getting this function to work properly, again, because there were so many different possibilities to account for.

The only other function that took any great effort to implement was the addition function. This was due to the need to shift the mantissas so that the exponents are equal, thus allowing you to add the mantissas properly. Like the former function, this function took quite a bit of trial and error to get right.


#define dfloatN_add( small, big )\
void dfloat ## big ## _add( dfloat ## big ## _t *dst, dfloat ## big ## _t *src ){\
        int i;\
        long log_max, log_diff;\
        /* log_max gives the maximum number of times  */\
        /* you can multiply by 10 before overflowing. */\
        int ## small ## _t src_mantissa, dst_mantissa;\
        int ## small ## _t src_magnitude, dst_magnitude;\
        int ## small ## _t smaller_exponent;\
        int ## small ## _t larger_magnitude;\
        int ## small ## _t target_mantissa, target_exponent;\
        int ## small ## _t larger_mag_exponent;\
        src_mantissa = src->mantissa;\
        dst_mantissa = dst->mantissa;\
        src_magnitude = ceil( log10( abs( src->mantissa ) * pow( 10, src->exponent ) ) );\
        dst_magnitude = ceil( log10( abs( dst->mantissa ) * pow( 10, dst->exponent ) ) );\
        /* magnitude is the number of digits before the decimal point */\
        /* or the number of zeros after the decimal point if negative */\
        if( abs( src_mantissa ) == 1 )\
                src_magnitude++;\
        if( abs( dst_mantissa ) == 1 )\
                dst_magnitude++;\
        /* Increment accounts for exact powers of 10 */\
        smaller_exponent = (src->exponent < dst->exponent)?src->exponent:dst->exponent;\
        /* A lower exponent indicates a higher degree of precision for a number. */\
        larger_magnitude = (src_magnitude > dst_magnitude)?src_magnitude:dst_magnitude;\
        larger_mag_exponent = (src_magnitude > dst_magnitude)?src->exponent:dst->exponent;\
\
        /* Next part figures out the target exponent for each number to    */\
        /* be shifted to so they can be added.                             */\
        /* If log_diff <= log_max then the exponent should be log_diff.    */\
        /* Otherwise (if there's overflow and some digits need to be cut   */\
        /* off) the exponent should be equal to the exponent of the number */\
        /* with the larger magnitude, minus abs(log_diff-log_max).         */\
        log_max = ceil( log10( ((int ## small ## _t) 1) << (small-3) ) );\
        log_diff = larger_magnitude - smaller_exponent;\
        target_exponent = (log_diff <= log_max)?smaller_exponent:larger_mag_exponent-abs(log_diff-log_max)+1;\
\
        /* Shift both src and dst until they have the same exponent:       */\
        /* Changes to mantissa and exponent should cancel each other out.  */\
        src_mantissa *= pow( 10, src->exponent - target_exponent );\
        dst_mantissa *= pow( 10, dst->exponent - target_exponent );\
\
        target_mantissa = src_mantissa + dst_mantissa;\
\
        /* Count number of trailing zeros and adjust */\
        /* mantissa and exponent accordingly:        */\
        for( i = 0; i < log_max; i++ ){\
                if( target_mantissa % 10 ) break;\
                target_mantissa /= 10;\
                target_exponent++;\
        }\
        dst->mantissa = target_mantissa;\
        dst->exponent = target_exponent;\
}

dfloatN_add( 8, 16 )
dfloatN_add( 16, 32 )
dfloatN_add( 32, 64 )
dfloatN_add( 64, 128 )

Like the number printing function, the addition function uses logarithms to calculate the sizes of the decimal representations. Basically we want to know the distance between the most significant figure and the least significant figure among the two numbers, because this tells us how much space we need for the result. This distance is represented by the difference between the larger magnitude and the smaller exponent. If the difference between these two numbers is less than the maximum number of decimal digits representable by that width, then we know we can add them without any data loss, simply by shifting the larger number until its exponent equals that of the smaller number. We do this by multiplying by a positive power of 10 while subtracting the equivalent amount from the exponent (note that the changes in mantissa and exponent must cancel each other out if the value of the number is to be preserved). If the difference is greater than the maximum size, then this means the result is too big to fit into that width, so some of the least significant decimal digits have to be shifted out. This is done by dividing the smaller number’s mantissa by a certain amount after multiplying the larger number’s mantissa until it reaches its size limit. The two shift amounts (meaning their absolute values) must add up to the difference between the two exponents for this to work.

Once we have implemented addition, the other three arithmetic operations, as well as the one comparison operation, are fairly easy to implement. The subtraction function simply calls the addition function with the second mantissa negated. The multiplication function multiplies the mantissas and adds the exponents. The division function divides the mantissas and subtracts the exponents. The division function also artificially inflates the numerator beforehand so it doesn’t get truncated to zero, and it has an additional precision argument that determines how many decimal places the result will come to. The comparison function subtracts the second operand from the first and looks at the sign of the result.


#define dfloatN_sub( small, big )\
void dfloat ## big ## _sub( dfloat ## big ## _t *dst, dfloat ## big ## _t *src ){\
        dfloat ## big ## _t *tmp;\
        tmp = (dfloat ## big ## _t *) malloc( sizeof( dfloat ## big ## _t ) );\
        dfloat ## big ## _cpy( tmp, src );\
        tmp->mantissa *= -1;\
        dfloat ## big ## _add( dst, tmp );\
        free( tmp );\
}

dfloatN_sub( 8, 16 )
dfloatN_sub( 16, 32 )
dfloatN_sub( 32, 64 )
dfloatN_sub( 64, 128 )

#define dfloatN_mul( small, big )\
void dfloat ## big ## _mul( dfloat ## big ## _t *dst, dfloat ## big ## _t *src ){\
        int i;\
        int ## small ## _t log_max;\
        log_max = ceil( log10( ((int ## small ## _t) 1) << (small-3) ) );\
        dst->mantissa *= src->mantissa;\
        dst->exponent += src->exponent;\
\
        /* Count number of trailing zeros and adjust */\
        /* mantissa and exponent accordingly:        */\
        for( i = 0; i < log_max; i++ ){\
                if( dst->mantissa % 10 ) break;\
                dst->mantissa /= 10;\
                dst->exponent++;\
        }\
}

dfloatN_mul( 8, 16 )
dfloatN_mul( 16, 32 )
dfloatN_mul( 32, 64 )
dfloatN_mul( 64, 128 )

#define dfloatN_div( small, big )\
void dfloat ## big ## _div( dfloat ## big ## _t *dst, dfloat ## big ## _t *src, int precision ){\
        int i;\
        /* First section shifts the destination mantissa */\
        /* so it doesn't become zero when divided.       */\
        long log_max;\
        int ## small ## _t dst_magnitude;\
        int ## small ## _t shift_factor;\
        log_max = ceil( log10( ((int ## small ## _t) 1) << (small-3) ) );\
        dst_magnitude = ceil( log10( dst->mantissa * pow( 10, dst->exponent ) ) );\
        shift_factor = pow( 10, log_max - dst_magnitude );\
        dst->mantissa *= shift_factor;\
\
        dst->mantissa /= src->mantissa;\
        dst->exponent -= (src->exponent+precision);\
        dst->mantissa /= (shift_factor/pow( 10, precision ));\
\
        /* Count number of trailing zeros and adjust */\
        /* mantissa and exponent accordingly:        */\
        for( i = 0; i < log_max; i++ ){\
                if( dst->mantissa % 10 ) break;\
                dst->mantissa /= 10;\
                dst->exponent++;\
        }\
}

dfloatN_div( 8, 16 )
dfloatN_div( 16, 32 )
dfloatN_div( 32, 64 )
dfloatN_div( 64, 128 )

#define dfloatN_cmp( small, big )\
int dfloat ## big ## _cmp( dfloat ## big ## _t *df1, dfloat ## big ## _t *df2 ){\
        dfloat ## big ## _t *cpy;\
        int result;\
        cpy = (dfloat ## big ## _t *) malloc( sizeof( dfloat ## big ## _t ) );\
        dfloat ## big ## _cpy( cpy, df1 );\
        dfloat ## big ## _sub( cpy, df2 );\
        if( cpy->mantissa > 0 )\
                return 1;\
        if( cpy->mantissa < 0 )\
                return -1;\
        return 0;\
}

dfloatN_cmp( 8, 16 )
dfloatN_cmp( 16, 32 )
dfloatN_cmp( 32, 64 )
dfloatN_cmp( 64, 128 )

There are two other functions in the library, and they include one for copying a source operand to a destination operand, and one for casting a dfloat to another size. They are as follows:


#define dfloatN_cpy( small, big )\
void dfloat ## big ## _cpy( dfloat ## big ## _t *dst, dfloat ## big ## _t *src ){\
        dst->mantissa = src->mantissa;\
        dst->exponent = src->exponent;\
}

dfloatN_cpy( 8, 16 )
dfloatN_cpy( 16, 32 )
dfloatN_cpy( 32, 64 )
dfloatN_cpy( 64, 128 )

#define dfloatM_castN( smallM, bigM, smallN, bigN )\
dfloat ## bigN ## _t *dfloat ## bigM ## _cast ## bigN ( dfloat ## bigM ##_t *src ){\
        dfloat ## bigN ## _t *dst;\
        dst = (dfloat ## bigN ## _t *) malloc( sizeof( dfloat ## bigN ## _t ) );\
        dst->mantissa = (int ## smallN ## _t) src->mantissa;\
        dst->exponent = (int ## smallN ## _t) src->exponent;\
        return dst;\
}

dfloatM_castN( 8, 16, 16, 32 )
dfloatM_castN( 8, 16, 32, 64 )
dfloatM_castN( 8, 16, 64, 128 )

dfloatM_castN( 16, 32, 8, 16 )
dfloatM_castN( 16, 32, 32, 64 )
dfloatM_castN( 16, 32, 64, 128 )

dfloatM_castN( 32, 64, 8, 16 )
dfloatM_castN( 32, 64, 16, 32 )
dfloatM_castN( 32, 64, 64, 128 )

dfloatM_castN( 64, 128, 8, 16 )
dfloatM_castN( 64, 128, 16, 32 )
dfloatM_castN( 64, 128, 32, 64 )

Aaaand that’s pretty much the whole library. If you want to see the full code, along with the license and installation instructions, I have posted this project to my GitHub, and you can access its repository here. This is the first complete project I’ve posted there, so needless to say I’m very excited about this. See you next time and happy coding!

5 thoughts on “libdfloat: A C Library for Exact Representation of Decimal Floating Point Numbers”

Mina (@ConchitaMendez2) says:

May 3, 2021 at 6:59 am

Very clever implementation, indeed!

This nicely solves issues for importing from, say, CSV files.
I still wonder: Maybe it tries to solve a non-existent problem?

There is still the problem with number representations in general: 1/3 or exp(2) can’t be expressed exactly, neither as a binary number nor as a decimal. Therefore, as long as we’re doing numerical work, we will always have to deal with rounding errors. This library deals with a small subset of the issues that derive from this problem: Import data from sources where numbers are represented as decimals. In most cases this will be financial and/or statistical data.

With this library, you can import your figures as exact numbers. Still, almost any mathematical operation (besides addition and multiplication) will convert it into a number with no exact representation as a p-adic number.

However, when we are dealing with financial or statistical data with 2,3 or 6 decimals and represent our data as, say, 64 or 128 bit floats, rounding errors are simply not an issue when we convert them back into the original format.

Just my 2¢. Let me know what you think.

LikeLiked by 1 person

psychocod3r says:

May 3, 2021 at 1:29 pm

Yeah, I’m not sure how much of an issue it will be practically. I’m just reasoning that since floating point numbers are truncated when they are converted to a base that doesn’t have an exact representation, reading and writing decimal numbers in CSV files is likely to result in weird repeating decimals where you would have originally had something like 1.7. I guess this is more of an annoyance than anything else, and this library was largely created out of OCD/autism. Still it’s annoying enough to me that I felt the need to create a library to combat the problem.

LikeLike

Mina (@ConchitaMendez2) says:

May 3, 2021 at 2:04 pm

My personal solution to this issue is: Whenever I write numbers which are supposed to be looked at by humans, I always round them to the desired precision, so 1.7000000000012 becomes 1.70.
As long as only computers deal with them, I let them in their native format.

LikeLiked by 1 person

1. psychocod3r says:
  
  May 3, 2021 at 2:30 pm
  
  I guess that works too. But I wanted to make sure that if you read a number from a file and don’t modify it, you will write it back to the file exactly as it was. Unless you keep track of how many digits it had past the decimal point, you don’t always know what precision to use when rounding. Again, I’m being anal-retentive and autistic about this for the most part.
  
  LikeLike
  
Mina (@ConchitaMendez2) says:

May 3, 2021 at 2:24 pm

Dealing with numbers is always tricky. I remember, when I was teaching maths to engineering students, and they turned in their homework, they would write at the end of a calculation:

The result is log(3) = 1.09861228867

It drove me near crazy…

LikeLiked by 1 person

libdfloat: A C Library for Exact Representation of Decimal Floating Point Numbers

Published by psychocod3r

5 thoughts on “libdfloat: A C Library for Exact Representation of Decimal Floating Point Numbers”

Leave a comment Cancel reply

SHARE THE LOVE:

Related

Published by psychocod3r

5 thoughts on “libdfloat: A C Library for Exact Representation of Decimal Floating Point Numbers”

Leave a comment Cancel reply