File format reverse engineering tips

29 June 2007, Anthony Dunk.

Introduction

This page is aimed at programmers, and is concerned with the process of understanding proprietary binary file formats with a view to writing code to read and/or write those formats in your own programs.

Why would you do this? Often there is no free library available for a proprietary format, and if the owner of the format provides a library it is often prohibitively expensive, especially if your application needs to support a whole raft of different file formats, not just this one. So you often have no choice but to write the library to read and write the file format yourself, after the laborious process of reverse engineering it, that is!

Another good reason for reverse engineering file formats, which I think will become increasingly important over time, is the need to read data from "legacy" file formats for which the original application is no longer available. Sometimes useful data is stored in archaic formats and needs to be recovered so that it can be stored in a more open format for use in the future.

Getting started

The first things you need are obviously some examples of the files you want to reverse engineer. The more examples the better. Ideally you should have access to the actual software that creates and reads these files so you can fully experiment with all possible parameters, but if not, you can get started with just a few example files.

The next thing you need is some software which can display a binary file in various ways, including raw hex data, and character representations. A good program for this sort of work is my own HexToolkit program.

Now that you have some example files and software to view the binary data, open up some files and scroll through the data to get an idea of the layout of the file.

[Figure: Example file header]

Components of a binary file format

Almost all binary file formats have a header of some sort at the start. This header often begins with some identifying bytes that confirm to the software reading this file that it is of the correct format and/or version.
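Checking the identifying bytes is usually the first thing your own reader will do, so it is worth sketching early. The following is a minimal illustration only: the function name has_signature() and the idea that the signature sits at offset 0 are my assumptions, and the actual bytes are whatever you observe in your example files.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer starts with the expected identifying bytes,
   otherwise 0. "magic" is whatever byte sequence you have observed at the
   start of your example files; nothing here is specific to a real format. */
int has_signature(const unsigned char *buf, size_t buf_len,
                  const unsigned char *magic, size_t magic_len)
{
    if (buf_len < magic_len)
        return 0;               /* too short to contain the signature */
    return memcmp(buf, magic, magic_len) == 0;
}
```

You would typically read the first few bytes of the file into a buffer and call this before attempting to interpret anything else.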

Also included in the header is usually information relating to the size of the other parts of the file, lengths of various tables stored in the file, and the like.

Finding string information in the header is usually pretty straightforward if you are using a tool like HexToolkit which displays both the hex data and the character representations at the same time. The strings stand out easily from the other seemingly random characters.
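If you want to list the strings in a file without scrolling through it by eye, a small scanner can do the job. This is only a sketch under my own assumptions: the name find_strings() is hypothetical, it only recognises printable ASCII (not wide or multi-byte strings), and min_len should be at least 1.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Print every run of at least min_len printable ASCII characters together
   with the offset where the run starts. Returns the number of runs found. */
size_t find_strings(const unsigned char *buf, size_t len, size_t min_len)
{
    size_t count = 0;
    size_t run_start = 0;

    for (size_t i = 0; i <= len; i++) {
        /* Treat the end of the buffer as a non-printable byte so that a
           run touching the end is still reported. */
        int printable = (i < len) && isprint(buf[i]);
        if (printable)
            continue;
        if (i - run_start >= min_len) {
            printf("0x%04zx: %.*s\n", run_start, (int)(i - run_start),
                   (const char *)(buf + run_start));
            count++;
        }
        run_start = i + 1;      /* next run can start after this byte */
    }
    return count;
}
```

A minimum length of 4 or so tends to filter out most of the accidental matches in binary data, though the right value depends on the file.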

Finding various numbers contained in the header is trickier because they can be in a number of different formats and have different byte alignments and byte orders. The most common formats are integer (2-byte or 4-byte), and floating point (4-byte or 8-byte). To find these numbers using HexToolkit you can swap the interpretation information in the right pane between the supported types such as character (the default), short, integer, float, and double. Sometimes it also helps if you increment the file offset by 1 several times when viewing each interpretation to view values that are not aligned correctly.

For doubles you need to increment the offset 7 times to check for all possible alignments! But as you become more familiar with what floating point numbers look like, these things become easier to spot in the raw hex data. HexToolkit lets you copy hex values from the data view (using Ctrl-C) and paste them into the "Bytes" field (using Ctrl-V) so that you can convert the bytes to any supported format.
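The same alignment search can be done in code. The sketch below is not how HexToolkit works internally (I am only illustrating the idea); the function names are hypothetical, and it assumes a 4-byte float, an 8-byte double, and that the file uses the same byte order as the machine running the code.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Interpret the bytes at buf + offset as a float. memcpy is used instead of
   a pointer cast so that unaligned offsets are safe. */
float float_at(const unsigned char *buf, size_t offset)
{
    float value;
    memcpy(&value, buf + offset, sizeof value);
    return value;
}

/* Same idea for a double. */
double double_at(const unsigned char *buf, size_t offset)
{
    double value;
    memcpy(&value, buf + offset, sizeof value);
    return value;
}

/* Print the float and double interpretation at each of the 8 possible
   alignments starting from base. The caller must make sure at least
   base + 15 bytes are available in buf. */
void dump_alignments(const unsigned char *buf, size_t base)
{
    for (size_t shift = 0; shift < 8; shift++) {
        size_t off = base + shift;
        printf("0x%04zx  float=%g  double=%g\n",
               off, (double)float_at(buf, off), double_at(buf, off));
    }
}
```

Most of the printed values will be nonsense (huge exponents, tiny denormals, NaNs); the alignment that produces "sensible" numbers is the one worth investigating.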

If all this is not enough, some files use big-endian byte order and others use little-endian. Which one they use usually depends on the machine architecture for which the software was originally designed. On Intel x86 machines (the Windows/DOS world) the least significant byte comes first (e.g. the 2-byte integer value 0x4105 is stored as 0x05 0x41 in the file). On Motorola 68000 machines and many traditional Unix workstations, the most significant byte comes first (so 0x4105 is stored as 0x41 0x05). HexToolkit has a button which allows the byte order for the interpretation to be easily swapped.
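In your own code, one way to stay independent of the byte order of the machine you are running on is to assemble integers from individual bytes. The helper names below are my own invention, offered as a sketch rather than a recommended API.

```c
#include <stdint.h>

/* Read a 16-bit value stored least significant byte first (little-endian). */
uint16_t read_u16_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 16-bit value stored most significant byte first (big-endian). */
uint16_t read_u16_be(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a 32-bit little-endian value. */
uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a 32-bit big-endian value. */
uint32_t read_u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

With the bytes 0x05 0x41 from the example above, read_u16_le() gives 0x4105 and read_u16_be() gives 0x0541, so trying both quickly shows which byte order produces plausible values.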

What next

Once you've extracted some of the numbers from the header, you need to try and figure out how those numbers correspond to the data in the file. Sometimes this is obvious when you view the file in the original application because the numbers represent concepts which are in the user data (e.g. the number of channels or fields). But a lot of the numbers represent the internal structure of the file, so you next need to look at the layout of the file as a whole.

What you really have to do is browse through all the hex data in the file looking for patterns which stand out. If there is a block of data containing what seems to be a series of regular 4-byte hex values, then maybe these represent 4-byte integers, or maybe 4-byte floating point numbers, or data in some other format.

The process of working out where all the different sections of the file start and stop can be time consuming - especially if the structure of the file is complex.

Once you feel you have some idea for the different parts of the file and what they might represent, you need to go back to the header and see if any of the numbers there relate to the sizes of the various different parts of the file. At this stage it becomes helpful to open up several of your example files and see if you can see commonalities and differences between them - particularly in the header section at the start of the file. Looking at multiple example files will also help you determine how long the header section is - which is not always obvious.
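Comparing example files byte by byte is easy to automate. The sketch below assumes you have already read the start of two example files into memory; the name diff_bytes() is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Compare the first len bytes of two buffers and print each offset where
   they differ. Returns the number of differing bytes. */
size_t diff_bytes(const unsigned char *a, const unsigned char *b, size_t len)
{
    size_t differences = 0;

    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            printf("0x%04zx: %02x vs %02x\n",
                   i, (unsigned)a[i], (unsigned)b[i]);
            differences++;
        }
    }
    return differences;
}
```

Bytes that are identical across all your examples are likely to be signatures, version numbers or padding, while bytes that change are candidates for counts, sizes and offsets.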

Other complications

So far, we've only considered data stored in the relatively common formats - character, integer, and floating point - but this is certainly not the only sort of data binary files can contain. There are more obscure floating point and fixed point formats (e.g. for VAX machines), date formats, and things like bit flags. Some of these can be very frustrating to try and understand. Often an internet search may be needed to find the structure of these various data types, and even then you can't find some of them and need to try and work them out yourself by trial and error.

Even more difficult than obscure data formats is data compression. Often, to save space in binary files, the integer, floating point or string data will be compressed using a library such as zlib or something more obscure. Finding out which compression algorithm was used can be challenging, so again an internet search may provide a useful clue.
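For zlib in particular there is a clue you can test for. A zlib stream (as defined in RFC 1950) starts with a two-byte header in which the low four bits of the first byte are 8, the high four bits are at most 7, and the two bytes read as a big-endian 16-bit number are a multiple of 31. The common byte pairs 0x78 0x01, 0x78 0x9C and 0x78 0xDA all satisfy this. The sketch below only checks those header rules; the function names are my own, and a match is a hint rather than proof, since random data will pass the test occasionally.

```c
#include <stddef.h>

/* Returns 1 if the two bytes at p look like a zlib stream header:
   compression method 8 (deflate), window size field at most 7, and the
   header check value divisible by 31. */
int looks_like_zlib_header(const unsigned char *p)
{
    unsigned cmf = p[0];
    unsigned flg = p[1];

    if ((cmf & 0x0F) != 8)
        return 0;               /* compression method is not deflate */
    if ((cmf >> 4) > 7)
        return 0;               /* window size field out of range */
    return ((cmf << 8) | flg) % 31 == 0;
}

/* Return the offset of the first candidate zlib header at or after start,
   or len if there is none. */
size_t find_zlib_candidate(const unsigned char *buf, size_t len, size_t start)
{
    for (size_t i = start; i + 1 < len; i++) {
        if (looks_like_zlib_header(buf + i))
            return i;
    }
    return len;
}
```

To confirm a candidate you would then try to decompress from that offset with the zlib library itself and see whether it succeeds.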

Documenting the results of your research on the format

I usually find it useful to create a word processor document on a new file format and update it as I find out more information. It helps to create tables showing the layout and byte positions for various useful values. Such tables have three columns - file offset, length in bytes, and a description. For example:

Offset   Size   Description
0x0000   8      File ID string
0x0008   4      Number of fields

It can take a long time to fully reverse engineer a file format (weeks, months, or years!) so it's handy to keep documents like these so that when you find out something new about the format you can come back to the document, refine it and update it.

Putting your hard work to good use

Once you've gained confidence that your understanding of a file format is reasonably good you can start to write some code to read and/or write files in this format.

A good way to start (if you're a C/C++ programmer at least) is to create a header file which contains structs for each of the table structures you've identified in the file. The first struct you will usually create describes the byte layout for the file header.

e.g.

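#include <stdint.h>

#pragma pack(1)                 /* no padding between members */
typedef struct
{
    char    sIDString[8];       /* offset 0x0000: file ID string */
    int32_t nNofFields;         /* offset 0x0008: number of fields (4 bytes) */
} MyHeaderStruct;
#pragma pack()                  /* restore the default packing */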

Once you've got the structs declared you can create a test program that uses these to read or write the format. I find the easiest way to do this is using the low-level fopen(), fread(), fwrite() and fseek() functions in C. However, keep in mind that if you want to support very large files (>2 GB) you may need to use functions which can handle 64-bit seeks instead.
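As a rough sketch of what such a test program's core might look like, the functions below read and write the example header with fseek(), fread() and fwrite(). The names read_header() and write_header() are hypothetical, the struct is repeated here (with a fixed-width int32_t) only so the example is complete, and the code assumes the file stores the field count in the same byte order as the machine running it. For files over 2 GB, fseeko() on POSIX systems or _fseeki64() with Microsoft's compiler are the usual 64-bit replacements for fseek().

```c
#include <stdint.h>
#include <stdio.h>

#pragma pack(1)                 /* no padding between members */
typedef struct
{
    char    sIDString[8];       /* file ID string */
    int32_t nNofFields;         /* number of fields */
} MyHeaderStruct;
#pragma pack()

/* Read the header from the start of an open file. Returns 1 on success,
   0 if the seek fails or the file is too short. */
int read_header(FILE *fp, MyHeaderStruct *header)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return 0;
    return fread(header, sizeof *header, 1, fp) == 1;
}

/* Write the header at the start of an open file. Returns 1 on success. */
int write_header(FILE *fp, const MyHeaderStruct *header)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return 0;
    return fwrite(header, sizeof *header, 1, fp) == 1;
}
```

Remember to open the file in binary mode ("rb" or "wb"), otherwise some platforms will translate line-ending bytes and corrupt the data.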

A couple of useful C++ functions are ByteSwap() and HexDump().
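In case it helps to see the idea, here is a minimal C sketch of what helpers like these might do. These are not the ByteSwap() and HexDump() implementations mentioned above; the names byte_swap() and hex_dump() and the output layout are my own assumptions.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Reverse the order of size bytes in place, converting a value between
   big-endian and little-endian representation. */
void byte_swap(void *value, size_t size)
{
    unsigned char *p = value;

    for (size_t i = 0; i < size / 2; i++) {
        unsigned char tmp = p[i];
        p[i] = p[size - 1 - i];
        p[size - 1 - i] = tmp;
    }
}

/* Print len bytes, 16 per row: the offset, the hex values, then the
   character representation with '.' for non-printable bytes. */
void hex_dump(const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t row = 0; row < len; row += 16) {
        printf("%08zx  ", row);
        for (size_t i = 0; i < 16; i++) {
            if (row + i < len)
                printf("%02x ", (unsigned)p[row + i]);
            else
                printf("   ");  /* pad a short final row */
        }
        printf(" ");
        for (size_t i = 0; i < 16 && row + i < len; i++)
            putchar(isprint(p[row + i]) ? p[row + i] : '.');
        putchar('\n');
    }
}
```

Calling hex_dump() on a struct you have just read is a quick way to check that the bytes in memory match what you see in the hex viewer.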

Now you've got some code that reads or writes the file format you're interested in, the next thing to do is to throw as many examples at it as possible to make sure your implementation handles all cases. Each time your code fails to work correctly you can further refine your understanding of the format.

That's about all I will say for now... But just be warned that reverse engineering is like solving a difficult puzzle and can result in late nights, and lack of sleep. Good luck!
