Unix vs. Windows: How to Check and Convert Between the Two File Formats

Different operating systems use different conventions for line breaks, and this trips up a lot of newcomers. In this article I will explain the difference between the Unix file format used by Linux and macOS and the DOS format used by Windows. I will also show you how to check which format a file is in and how to convert between the two formats using a simple C program. All of this is very easy once you understand how text files are structured on these two platforms.

Before I explain the difference, it is necessary to understand ASCII control codes. These are characters 0x00 through 0x1F of the ASCII table, along with 0x7F, and they originated with early terminals as a way to send commands between a terminal and a mainframe on a time-sharing network. These codes are called control characters, and although most of them are no longer used, several still have applications in today’s computers. The control characters used for ending lines are the carriage return (CR), with ASCII code 0x0D, and the line feed (LF), with ASCII code 0x0A. On old teletype terminals, the carriage return would return the teletype’s print head (the carriage) to the left side of the paper, and the line feed would feed the paper through by one line. Together they would start a new line of text.

When teletypes were replaced with video terminals, the distinction between carriage return and line feed became somewhat moot, but some operating systems preserved it. In DOS, and later in Windows, every newline is a carriage return followed by a line feed (CR LF). In Unix-based systems like macOS and Linux, only a line feed is used. Classic Macintosh operating systems used only a carriage return. Many Internet text protocols, such as Telnet, SMTP and the ASCII mode of FTP, also use CR LF, the same as the DOS format. This is why, if you’re using Linux, a text file sent or received through one of these protocols has to be converted between the two newline formats. The difference also creates problems if you’re sending a text file prepared in Linux or macOS to a Windows computer. Older versions of Notepad need both the carriage return and the line feed to start a new line, so the text will be all mashed together when you try to open it.
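To make the three conventions concrete, here is a minimal sketch that shows the raw bytes behind a short line of text. The macro names and the hex_bytes helper are my own illustrative choices, not anything standard.

```c
#include <stdio.h>
#include <string.h>

/* Line terminators for the three conventions (illustrative names). */
#define EOL_UNIX        "\n"    /* LF:    0x0A      */
#define EOL_DOS         "\r\n"  /* CR LF: 0x0D 0x0A */
#define EOL_CLASSIC_MAC "\r"    /* CR:    0x0D      */

/*
 * Hypothetical helper: write the bytes of s into out as
 * space-separated hex pairs. Returns 0 on success, -1 if out
 * is too small.
 */
int hex_bytes( const char *s, char *out, size_t cap ){
        size_t used = 0;
        if( cap == 0 ) return -1;
        out[0] = '\0';
        for( ; *s != '\0'; s++ ){
                /* each byte needs at most " XX" plus the closing NUL */
                if( used + 4 > cap ) return -1;
                used += (size_t)sprintf( out + used, "%s%02X",
                                         used ? " " : "", (unsigned char)*s );
        }
        return 0;
}
```

Calling hex_bytes on "Hi" followed by EOL_DOS fills the buffer with 48 69 0D 0A, while the same text with EOL_UNIX gives 48 69 0A.
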

So we need a way of knowing which format a text file is in, and we need a way of converting between the two. You can’t always determine the format of a file just by looking at it in a text editor, because many editors quietly display either format correctly. For this you will need to look at the actual hex codes in the file. So open the file in a hex editor, or if you’re on Linux, just use the hexdump command (hexdump -C adds a character column on the right), and look at the ends of the lines.
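If you would rather not read a dump by eye, the same check can be automated. The following is a rough sketch, assuming all we want is a tally of each kind of line ending; the struct and function names are my own.

```c
#include <stdio.h>

/* Tally of each kind of line terminator (hypothetical type). */
struct eol_counts {
        long crlf;  /* CR LF pairs (DOS/Windows)                 */
        long lf;    /* LF not preceded by CR (Unix)              */
        long cr;    /* CR not followed by LF (classic Macintosh) */
};

/* Read fp to end of file and count its line terminators. */
struct eol_counts count_eols( FILE *fp ){
        struct eol_counts n = { 0, 0, 0 };
        int c, prev_cr = 0;
        while( (c = fgetc( fp )) != EOF ){
                if( c == '\n' ){
                        if( prev_cr ) n.crlf++; else n.lf++;
                } else if( prev_cr ){
                        n.cr++;         /* the previous CR stood alone */
                }
                prev_cr = (c == '\r');
        }
        if( prev_cr ) n.cr++;           /* input ended on a lone CR */
        return n;
}
```

A file with only crlf counts is in DOS format, one with only lf counts is in Unix format, and a mix of both means the file has inconsistent line endings. On Windows the file should be opened with fopen( path, "rb" ) so that the C runtime does not translate the line endings before they are counted.
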

In the hex output, find the end of a line, guided by the character column on the right. If you see only a 0A for line feed, the file is in the Unix format. If it were in the DOS format, you would see 0D 0A in that location.

It is fairly easy to write a C program to convert between these two file formats. Each program below takes a file name as its argument and writes the converted text to standard output. Here is the program for converting from Unix format to DOS format:

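#include <stdio.h>

/* Convert Unix (LF) line endings to DOS (CR LF); output goes to stdout. */
int main( int argc, char **argv ){
        FILE *fp;
        int c;
        if( argc < 2 || (fp = fopen( argv[1], "r" )) == NULL ){
                fprintf( stderr, "usage: %s file\n", argv[0] );
                return 1;
        }
        while( (c = fgetc( fp )) != EOF ){
                if( c == '\n' ) putchar( '\r' );  /* put a CR before every LF */
                putchar( c );
        }
        fclose( fp );
        return 0;
}
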
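One caveat: the program above assumes its input really is in Unix format. If a line already ends in CR LF, it comes out as CR CR LF. Here is a sketch of a variant that remembers the previous character so it can be run safely on either format. The function name unix_to_dos and the stream parameters are my own choices.

```c
#include <stdio.h>

/*
 * Copy in to out, writing CR LF for every line ending.
 * A LF that already follows a CR is left alone, so converting
 * twice gives the same result as converting once.
 */
void unix_to_dos( FILE *in, FILE *out ){
        int c, prev = EOF;
        while( (c = fgetc( in )) != EOF ){
                if( c == '\n' && prev != '\r' ) fputc( '\r', out );
                fputc( c, out );
                prev = c;
        }
}
```

To reproduce the command-line behavior of the original program, pass the opened input file and stdout.
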
Here is the program for converting DOS format to Unix format:

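#include <stdio.h>

/* Convert DOS (CR LF) line endings to Unix (LF); output goes to stdout. */
int main( int argc, char **argv ){
        FILE *fp;
        int c;
        if( argc < 2 || (fp = fopen( argv[1], "r" )) == NULL ){
                fprintf( stderr, "usage: %s file\n", argv[0] );
                return 1;
        }
        while( (c = fgetc( fp )) != EOF ){
                if( c != '\r' ) putchar( c );  /* drop every CR, keep the rest */
        }
        fclose( fp );
        return 0;
}
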
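Note that the program above drops every carriage return, including any that are not part of a CR LF pair. If that matters for your files, a slightly more careful version can hold each CR back until it sees what follows. This is only a sketch, and the function name dos_to_unix is my own.

```c
#include <stdio.h>

/*
 * Copy in to out, dropping a CR only when a LF follows it.
 * A CR anywhere else is passed through unchanged.
 */
void dos_to_unix( FILE *in, FILE *out ){
        int c, pending_cr = 0;
        while( (c = fgetc( in )) != EOF ){
                if( pending_cr && c != '\n' )
                        fputc( '\r', out );  /* the held CR was not part of CR LF */
                pending_cr = (c == '\r');
                if( !pending_cr ) fputc( c, out );
        }
        if( pending_cr ) fputc( '\r', out );  /* input ended on a lone CR */
}
```

With this version, the input a CR LF b CR c becomes a LF b CR c: the pair is reduced to a line feed and the stray carriage return survives.
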
If you want to write scripts that work with files across different operating systems (say you’re running Unix scripts in Windows through Cygwin), you can account for the difference in file formats with a regular expression that accommodates both. The following awk script prints a random non-blank line from its input. Its regular expression treats a line as blank whether it is completely empty (Unix format) or contains only a carriage return (DOS format), so it works on both file formats:

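#!/usr/bin/awk -f

# Print one randomly chosen non-blank line from the input.
BEGIN { idx = 0 }
# Keep lines that are neither empty nor just carriage returns.
$0 !~ /^\r*$/ { array[++idx] = $0 }
END { srand()
      r = int( rand() * idx ) + 1   # array is indexed from 1 to idx
      print array[r] }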
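
The same tolerance is easy to build into a C program that reads text line by line. Here is a minimal sketch, assuming each line arrives as a NUL-terminated string (for example from fgets); the helper name strip_eol is my own.

```c
#include <string.h>

/*
 * Remove a trailing LF and then a trailing CR, if present, so
 * that lines ending in LF, CR LF or a bare CR all come out the
 * same. Returns the new length of the string.
 */
size_t strip_eol( char *line ){
        size_t len = strlen( line );
        if( len > 0 && line[len - 1] == '\n' ) line[--len] = '\0';
        if( len > 0 && line[len - 1] == '\r' ) line[--len] = '\0';
        return len;
}
```

After this call, a line read from a DOS file and the same line read from a Unix file compare equal with strcmp.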
