The Arz format consists of 5 blocks of binary data: Header (HDR), Record Table (RT), Record Data (RD), String Table (ST) and the Footer (FTR).
The binary format is LittleEndian with SinglePrecision Floats.
The HDR consists of 6 32-bit ints.
- A version number (HDR_VER)
- The offset to the Record Table (HDR_RDPOS)
- The size of the Record Table (HDR_RDSIZE)
- The record count (HDR_RDCNT)
- The offset to the ST (HDR_STPOS)
- The size of the ST (HDR_STSIZE)
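As a rough illustration of the header layout listed above, here is a minimal Python sketch. The ArzHeader class, its field names, and parse_header() are names I made up for this example, and treating the six values as signed ints is an assumption, since the description only says "32-bit ints".

```python
import struct
from typing import NamedTuple


class ArzHeader(NamedTuple):
    """Hypothetical container for the six header values."""
    version: int    # HDR_VER
    rt_pos: int     # HDR_RDPOS (offset to the Record Table)
    rt_size: int    # HDR_RDSIZE (size of the Record Table)
    rec_count: int  # HDR_RDCNT
    st_pos: int     # HDR_STPOS
    st_size: int    # HDR_STSIZE


# "<" = little-endian without padding, "6i" = six 32-bit ints.
# Signed vs. unsigned is an assumption of this sketch.
HEADER_FORMAT = "<6i"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def parse_header(data: bytes) -> ArzHeader:
    """Unpack the header from the first 24 bytes of the file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for an Arz header")
    return ArzHeader(*struct.unpack_from(HEADER_FORMAT, data, 0))
```

Typical use would be parse_header(open(path, "rb").read()), after which rt_pos and rec_count tell you where to start reading Record Table entries and how many to read.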
The RD directly follows the RT and has the same count of records, so the header does not need separate offset/size information for it. I probably should refactor the code to call the HDR values RTPOS and RTSIZE, but whatever.
Strings are stored as int32 (size) + char* (string), referred to as CStrings. In most places, though, a string is stored as a 32-bit integer giving its index within the String Table (STidx: see below).
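To make the CString layout concrete, here is a small Python sketch of a reader. read_cstring() is a name invented for this example, and decoding the bytes as latin-1 is an assumption; the description does not say which character encoding the files use.

```python
import struct
from typing import Tuple


def read_cstring(data: bytes, offset: int) -> Tuple[str, int]:
    """Read one CString at offset; return (text, offset just past it)."""
    # 32-bit little-endian length prefix
    (length,) = struct.unpack_from("<i", data, offset)
    start = offset + 4
    end = start + length
    if length < 0 or end > len(data):
        raise ValueError("CString length runs past the buffer")
    # latin-1 is an assumption; it maps every byte to a character,
    # so decoding never fails.
    return data[start:end].decode("latin-1"), end
```

Returning the next offset along with the text makes it easy to read CString after CString in a loop.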
Beginning at HDR_RDPOS you can read in HDR_RDCNT Record Table entries with the following format of 32-bit ints plus a CString.
- The record name as an STidx
- The record Class as a CString
- The compressed data offset
- The compressed data size
- The uncompressed data size
- int32 - A, which I call fileLastWriteTimeLow
- int32 - B, which I call fileLastWriteTimeHigh
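The entry layout above could be read along these lines. This is only a sketch: RecordEntry, read_record_entry() and read_record_table() are hypothetical names, the ints are treated as signed, and the class string is decoded as latin-1, all of which are assumptions of the example and not something the format description states.

```python
import struct
from typing import List, NamedTuple, Tuple


class RecordEntry(NamedTuple):
    """Hypothetical container for one Record Table entry."""
    name_idx: int           # record name as an STidx
    record_class: str       # record Class (stored as a CString)
    data_offset: int        # compressed data offset
    compressed_size: int
    uncompressed_size: int
    write_time_low: int     # "A" / fileLastWriteTimeLow
    write_time_high: int    # "B" / fileLastWriteTimeHigh


def read_record_entry(data: bytes, offset: int) -> Tuple[RecordEntry, int]:
    """Read one entry; return it with the offset of the next entry."""
    # The name STidx is followed directly by the CString length prefix.
    name_idx, class_len = struct.unpack_from("<ii", data, offset)
    offset += 8
    record_class = data[offset:offset + class_len].decode("latin-1")
    offset += class_len
    # Five more 32-bit ints: offset, two sizes, and the two time words.
    rest = struct.unpack_from("<5i", data, offset)
    offset += 20
    return RecordEntry(name_idx, record_class, *rest), offset


def read_record_table(data: bytes, rt_pos: int, rec_count: int) -> List[RecordEntry]:
    """Read rec_count entries starting at rt_pos (HDR_RDPOS / HDR_RDCNT)."""
    entries = []
    offset = rt_pos
    for _ in range(rec_count):
        entry, offset = read_record_entry(data, offset)
        entries.append(entry)
    return entries
```

Because the class CString has a variable length, the entries are not fixed size and have to be read one after another.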
Once you have read all of these Record Table entries, you have the location of the compressed data for each record. Using the LZ4 function LZ4_decompress_fast(), which takes the uncompressed size as its size argument, you can unpack each block of compressed data you read.
Records are basically just a list of fields or name->value pairs. While each fieldName is stored as STidx, the fieldValues can be of various data types and can be singular values or arrays of these data types.
The uncompressed record data contains a series of fields with the format:
- dataType - a 16-bit integer for an enum of Int,String,Bool,Float
- valueCount - 16-bit integer for the number of values
- nameIndex - 32-bit STidx for the fieldName
This triplet is then followed by valueCount 32-bit integers or single-precision floats. Int,String and Bool are all stored as 32-bit integers. If the dataType is String, then the integer is an STidx. If it is boolean, it is merely 0/1, and if it is integer, then it has the actual value. If the dataType is float, then valueCount floats are stored instead of integers.
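Here is a minimal Python sketch of walking the uncompressed record data as described above. read_fields() is a made-up name, and since this description does not give the numeric values of the dataType enum, the code for Float is passed in by the caller as float_type instead of being hard-coded. Treating the two 16-bit values as unsigned is also an assumption.

```python
import struct
from typing import List, Tuple, Union

Field = Tuple[int, int, List[Union[int, float]]]


def read_fields(record: bytes, float_type: int) -> List[Field]:
    """Split uncompressed record data into (dataType, nameIndex, values).

    float_type is whichever numeric dataType code means Float; the
    caller must supply it because the text above does not list the
    enum numbers.
    """
    fields = []
    offset = 0
    while offset < len(record):
        # dataType (16-bit), valueCount (16-bit), nameIndex (32-bit STidx)
        data_type, value_count, name_idx = struct.unpack_from("<HHi", record, offset)
        offset += 8
        # Floats are single precision; Int, String and Bool are all int32.
        kind = "f" if data_type == float_type else "i"
        values = list(struct.unpack_from("<%d%s" % (value_count, kind), record, offset))
        offset += 4 * value_count
        fields.append((data_type, name_idx, values))
    return fields
```

String values come back as STidx integers here; turning them into text means looking each one up in the String Table afterwards.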
The String Table starts with a 32-bit int that is the number of strings in the table, followed by CString after CString for all the strings. When you read them all into an array, the index of each string becomes its STidx. This is the way most strings are stored: once in the ST, and everywhere else as the STidx of that ST entry.
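Reading the String Table into a list might look like the sketch below. read_string_table() is a hypothetical name, st_pos stands for the HDR_STPOS value from the header, and latin-1 decoding is again an assumption about the encoding.

```python
import struct
from typing import List


def read_string_table(data: bytes, st_pos: int) -> List[str]:
    """Return all strings; the list index of each one is its STidx."""
    # The table begins with the number of strings it holds.
    (count,) = struct.unpack_from("<i", data, st_pos)
    offset = st_pos + 4
    strings = []
    for _ in range(count):
        # Each entry is a CString: int32 length, then that many bytes.
        (length,) = struct.unpack_from("<i", data, offset)
        offset += 4
        strings.append(data[offset:offset + length].decode("latin-1"))
        offset += length
    return strings
```

With the list in hand, resolving any STidx is just strings[idx].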
The Footer is four 32-bit integers from adler32() (zlib) calls on buffers containing various blocks of the data in binary format. Think of it as a bunch of checksums to ensure data integrity.
- adler32 for the HDR+RT+RD+ST
- adler32 for the ST
- adler32 for the RT
- adler32 for the RD
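A sketch of how the four footer values could be computed and checked with Python's zlib.adler32(). build_footer() and verify_footer() are invented names, and the sketch assumes that the first checksum is one adler32 over the four blocks concatenated in file order, and that the caller has already sliced the file into its HDR, RT, RD and ST buffers.

```python
import struct
import zlib


def build_footer(hdr: bytes, rt: bytes, rd: bytes, st: bytes) -> bytes:
    """Pack the four checksums in the order listed above."""
    checksums = (
        zlib.adler32(hdr + rt + rd + st),  # HDR+RT+RD+ST (assumed concatenated)
        zlib.adler32(st),                  # String Table
        zlib.adler32(rt),                  # Record Table
        zlib.adler32(rd),                  # Record Data
    )
    # adler32 returns an unsigned 32-bit value in Python 3, hence "I".
    return struct.pack("<4I", *checksums)


def verify_footer(hdr: bytes, rt: bytes, rd: bytes, st: bytes, footer: bytes) -> bool:
    """True if the 16-byte footer matches the recomputed checksums."""
    return footer == build_footer(hdr, rt, rd, st)
```

If verify_footer() returns False, comparing the four values one by one shows which block is damaged or was sliced at the wrong offsets.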
I determined all this from the source code of TQVault, from a modding forum at the time zlib was switched to LZ4 for RD compression, and from Rhis. So do not get your hopes up and think I am some kind of master reverse engineer. I am posting all this here in plain English so that, if I am wrong about something, the bugs in my code can maybe be fixed without anyone needing to read the code.