General Tips for Reverse Engineering Files
The Backstory (Tl;dr: Statewide stay-at-home order causes intense boredom)
I am a big fan of video games. Old (and typically poorly received) games, especially. In fact, there was a point in my life where I wanted to be a character artist and animator. I was even in art school for a time! Obviously, that didn't happen. I dropped out of art school, and I stopped making art altogether until last year, when I picked up acrylic painting.
Backstory aside, I have had some free time lately (as many people have), and as my break neared its end, I picked up Primal Prey, a video game I absolutely loved as a kid, and decided to try my hand at reverse engineering the file format that the game uses to store meshes and animations.
I'm still not quite done -- there are still some things that I haven't completely figured out about how the data is stored. But I did learn enough about them to be able to render the models and their animations, and I call that an overall success.
But this post isn't going to be about how I did that, or what the format looks like (though I will get into that in a future post, if there is any interest). Instead, I want to provide some general tips that might be useful to anyone attempting something similar.
Getting Started
Getting the ball rolling on something like this is probably the most daunting part of the entire project -- at least it is for me. You open up a file in a hex editor, and you are instantly in a world whose rules you may know nothing about. That feeling set in when I first looked at these files about two years ago, and it set in again when I looked at them last week.
It helped that I knew what the file was used for, because it gave me an idea of what sort of data it needed to contain, even if I didn't yet know how it was stored. The first thing that you should try to determine is where the file header is, how big it is, and what is stored within it.
The File Header
Almost every binary file you will ever encounter has a header: a section dedicated to identifying the file as a particular format, and to describing the data that the file contains. In most cases, that header will be the first piece of data in the file.
Identifying what is and isn't part of the file header can be challenging, and it is helpful here to have multiple files to examine. In general, you are looking for something that is constant in size and constant in structure, even though the data contained within it can change. Your initial determination of the header size will probably change as you learn more about the files you are working with -- and that's okay. Having an assumption invalidated is still part of the process of discovery.
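One rough way to start is to line up several files and see which byte positions never change. The sketch below does that for the first few bytes; the function name `constant_offsets` and the sample data (including the `MDL0` signature) are made up for illustration and are not taken from any real format. Bytes that stay fixed are candidates for a signature or version field, while positions that vary may be header fields such as counts or sizes.

```python
def constant_offsets(blobs, limit=64):
    """Return offsets (below `limit`) whose byte value is identical in every blob."""
    span = min(limit, min(len(b) for b in blobs))
    return [i for i in range(span) if len({b[i] for b in blobs}) == 1]


# Hypothetical sample files: a 4-byte signature, a 2-byte count, then payload.
samples = [
    b"MDL0" + bytes([3, 0]) + b"\x10\x20\x30",
    b"MDL0" + bytes([7, 0]) + b"\x55\x66\x77",
    b"MDL0" + bytes([9, 0]) + b"\x01\x02\x03",
]

print(constant_offsets(samples))  # [0, 1, 2, 3, 5]
```

Here offset 4 changes from file to file while offset 5 stays at zero, which is what a small 16-bit count would look like. That is only a guess until more files confirm it.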
The File Data
Again, determining the meaning of data will be significantly easier if you have multiple files to reference. The name of the game here is pattern matching -- look for data that is structurally similar, even if the data itself varies wildly. Some patterns are easier to spot than others: many binary files separate sections of data with empty space; some have data that increases sequentially to identify either sections or the data within them; many have data that follows repeating patterns within a section. If you can identify these patterns, you can cross-reference them with the information stored in the file header in order to start unraveling some mysteries about the format.
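Empty space is the easiest of these patterns to search for automatically. As a minimal sketch, assuming that "empty" means runs of zero bytes (a format could just as well pad with some other value), the hypothetical helper below lists where such runs start and how long they are:

```python
def zero_runs(data, min_len=4):
    """Return (offset, length) for each run of at least `min_len` zero bytes."""
    runs = []
    start = None
    for i, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = i  # a new run of zeros begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Handle a run that reaches the end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs


sample = b"\x01\x02" + b"\x00" * 6 + b"\x03\x00\x04" + b"\x00" * 4
print(zero_runs(sample))  # [(2, 6), (11, 4)]
```

The offsets where runs end are worth comparing against any sizes or offsets you suspect are stored in the header.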
It is also beneficial to keep the context of the file in mind: What is the file used for? What kind of data needs to be stored in order to perform that function? When was this file format used?
Yes, even the period during which the format was in use can be an important piece of contextual information here. In my particular case, the files I was working with were from a game released in 2001 -- shortly after the advent of the first GPUs, and when storage restrictions were not quite as tight as they were in the 1990s. It was unlikely that I would encounter a 64-bit (or even a 32-bit) integer, unless it was really necessary for something like the storage of color information. As a result, I knew that I was mostly looking for bytes, 16-bit words, and single-precision floats.
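Once the list of likely types is that short, it is cheap to decode the same offset every possible way and see which reading looks sensible. The sketch below uses Python's `struct` module; the helper name `interpret` is made up, and little-endian byte order is an assumption that you would need to check against your own files.

```python
import struct


def interpret(data, offset):
    """Decode the bytes at `offset` as several little-endian types (assumed order)."""
    return {
        "u8": data[offset],
        "u16": struct.unpack_from("<H", data, offset)[0],
        "i16": struct.unpack_from("<h", data, offset)[0],
        "f32": struct.unpack_from("<f", data, offset)[0],
    }


# Hypothetical data: a 16-bit count of 300 followed by the float 0.5.
sample = struct.pack("<Hf", 300, 0.5)

print(interpret(sample, 0)["u16"])  # 300
print(interpret(sample, 2)["f32"])  # 0.5
```

A value that decodes to a tidy float such as 0.5 or 1.0 but to a huge, arbitrary-looking integer is usually a float, and the reverse is also a useful hint.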
Compression
File compression can be important. A lot of formats will compress their data using one technique or another. If you're lucky, the format will use a known compression algorithm that you can easily reverse. If you're not, its authors may have rolled their own compression algorithm, in which case reversing it will require a greater amount of work.
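A quick check that can hint at compression is the byte entropy of a region: compressed (or encrypted) data tends to sit close to 8 bits per byte, while plain structured data is usually well below that. The sketch below is a heuristic only, not something I needed for the files discussed here; `zlib` and the generated text are used purely to produce a compressed example.

```python
import math
import zlib
from collections import Counter


def entropy(data):
    """Shannon entropy in bits per byte, from 0.0 (uniform) to 8.0 (random-looking)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical plain data: a few thousand short text records.
plain = "".join(f"v {i} {i * 7 % 13}\n" for i in range(2000)).encode("ascii")
packed = zlib.compress(plain)

print(round(entropy(plain), 2), round(entropy(packed), 2))
```

Running this over fixed-size windows of a file, rather than the whole file at once, makes it easier to see where a compressed section might begin and end.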
If data that you expected to find is missing, it is entirely possible that it is being calculated from data that is present. As an example, suppose that a file stores bones and meshes for a game -- if a vertex can only be weighted against a maximum of 4 bones for animation purposes, we need only store the weights associating a vertex with 3 bones. The weight of the 4th bone can be calculated by subtracting the sum of the 3 stored weights from the maximum weight value (typically 1.0).
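The bone weight example works out to a one-line calculation. This is a minimal sketch of the arithmetic only, with a made-up function name and sample weights; it does not describe how any particular format lays out its weights.

```python
def fourth_weight(w0, w1, w2, total=1.0):
    """Recover the weight that was left out of the file."""
    return total - (w0 + w1 + w2)


# Hypothetical stored weights for one vertex.
stored = (0.5, 0.25, 0.125)
weights = stored + (fourth_weight(*stored),)

print(weights)  # (0.5, 0.25, 0.125, 0.125)
```

If the recovered value comes out negative or far larger than expected for most vertices, the assumption about what is being omitted is probably wrong.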
The files I most recently reverse engineered did not do a whole lot of compression -- but when they did compress things, it was pretty weird (and hardly beneficial). For example, the number of meshes stored in a file was encoded as the index of the first entry in a table of triangle indices. All subsequent entries had an index no greater than the value in the initial entry. I verified this as being a mesh index by writing a simple renderer, and observing that the meshes would not display without visual artifacts unless I grouped them according to the index value.
Summary
- Be prepared to make educated guesses or assumptions
- Be prepared to have those guesses and assumptions invalidated as you learn more
- Try to identify the size and structure of the file header
- Look for patterns in the data by referencing as many files as you can
- Empty space and repetition are both quite common
- Cross-reference patterns across different files and against the file header
- Be prepared for some expected data to be missing sometimes
- If it can be calculated instead of stored, it's probably being calculated
Conclusion
While this is not the most in-depth post, I do hope that it is useful to people taking their first steps in reverse engineering something. Reverse engineering is not my forte, and I mostly just apply it to data from old games that I enjoyed growing up. In any case, I do find it enjoyable when I try my hand at it.