The Dink Network

All about .d script files and why you should never use them

December 14th 2014, 05:08 AM
yeoldetoast
Peasant They/Them Australia
LOOK UPON MY DEFORMED FACE! 
This is the second part of my little series about Dink data files that were never included in WC's "ultimate" file format guide. For this segment we'll be covering .d files, or "compiled scripts" as they're commonly (and wrongly) called.

These files use a compression algorithm called Byte Pair Encoding (BPE), published by Phil Gage in 1994. The Wikipedia page explains it quite well and even provides a worked example. Frequently occurring pairs of characters are each replaced with a single substitute byte to reduce file size.

For this to make sense, the most useful way to think of a script is as a series of numbers in a file. Every character in a script is simply a numerical value that maps to a letter when opened in a text editor. These numbers may have a value from 0 to 127, with "A" corresponding to 65, "B" to 66 and so on. In the case of BPE, numbers above 127 are used to substitute the pairs, so if 65 and 66 occur together often they may be replaced with 128. The replaced pair is then recorded in a table at the beginning of the file so that it can be decoded later on. The compressor looks for the most common pairs to substitute and keeps making passes until it reaches the maximum number of pairs (127) or there aren't any more common pairs in the text.
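To make the pair substitution concrete, here's a rough Python sketch of the compression pass described above. It is not the code Dink's compressor actually uses; the function name `bpe_compress` and the "stop once the best pair occurs fewer than twice" cutoff are my own assumptions for illustration.

```python
from collections import Counter

def bpe_compress(text: bytes, max_pairs: int = 127):
    """Return (pair_table, compressed_bytes) for input made of values 0-127.

    Hypothetical helper for illustration only.
    """
    if any(b > 127 for b in text):
        raise ValueError("input may only contain values 0-127")
    data = list(text)
    pairs = []
    while len(pairs) < max_pairs:
        # Count every adjacent pair of values in the current data.
        counts = Counter(zip(data, data[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # assumed cutoff: nothing left worth replacing
            break
        code = 128 + len(pairs)  # first pair -> 128, second -> 129, ...
        pairs.append(pair)
        # Replace occurrences of the pair from left to right.
        out = []
        i = 0
        while i < len(data):
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                out.append(code)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return pairs, bytes(data)
```

Running it on `b"ABABAB"` records (65, 66) as the first pair, so every "AB" becomes 128. A later pass can then pair up the 128s themselves, so substitutes may end up nested inside other substitutes.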

After this, you're left with a file containing the compressed text, the table of pairs, and a counter indicating how many pairs have been replaced. Here is a crappy diagram illustrating what .d files actually contain:
<1 byte indicator: number of pairs + 128>
<pair table: (indicator - 128) pairs of 2 bytes each>
<compressed text>
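The layout above can be sketched in Python as a pair of helpers that build and split the three parts. The names `pack_d` and `split_d` are made up for this post, and treating a first byte below 128 as an error is my own assumption rather than documented engine behaviour.

```python
def pack_d(pairs, compressed: bytes) -> bytes:
    """Assemble indicator byte + pair table + compressed text."""
    if len(pairs) > 127:
        raise ValueError("at most 127 pairs fit in the one-byte indicator")
    indicator = bytes([128 + len(pairs)])
    table = bytes(value for pair in pairs for value in pair)
    return indicator + table + compressed

def split_d(raw: bytes):
    """Split raw .d contents into (pair_table, compressed_text)."""
    if not raw or raw[0] < 128:
        # Assumption: this sketch only handles files that start with an indicator.
        raise ValueError("no pair-count indicator found")
    count = raw[0] - 128
    table_end = 1 + 2 * count
    if len(raw) < table_end:
        raise ValueError("file is too short for its pair table")
    pairs = [(raw[i], raw[i + 1]) for i in range(1, table_end, 2)]
    return pairs, raw[table_end:]
```

With a single pair (65, 66) the file starts with 129, followed by 65 and 66, and then the compressed text.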

In order to decompress, you must get the first byte's value, subtract 128, and then read as many two-byte pairs as that result indicates. The first pair will map onto the number 128, the next onto 129, and so on. From there you must go through the compressed text, find values greater than 127, and replace them with the corresponding pair. Because a pair can itself contain substitute values, this process may need to be repeated several times before there are no values left to change.
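Here's a minimal Python sketch of those decompression steps, using the repeated-pass approach described above. The function name `decompress_d` is made up, and the pass limit and error handling are my own additions to keep a malformed file from looping forever; the game's own decoder may well work differently.

```python
def decompress_d(raw: bytes) -> bytes:
    """Expand the contents of a .d file back into plain script text."""
    if not raw or raw[0] < 128:
        # Assumption: this sketch only handles files that start with an indicator.
        raise ValueError("no pair-count indicator found")
    count = raw[0] - 128
    table_end = 1 + 2 * count
    if len(raw) < table_end:
        raise ValueError("file is too short for its pair table")
    # Pair 0 stands for value 128, pair 1 for 129, and so on.
    pairs = [(raw[i], raw[i + 1]) for i in range(1, table_end, 2)]
    data = list(raw[table_end:])
    # A well-formed table never needs more passes than it has pairs.
    for _ in range(count):
        if all(value < 128 for value in data):
            break
        expanded = []
        for value in data:
            if value > 127:
                if value - 128 >= count:
                    raise ValueError("substitute value has no pair in the table")
                expanded.extend(pairs[value - 128])
            else:
                expanded.append(value)
        data = expanded
    if any(value > 127 for value in data):
        raise ValueError("substitute values left over after expansion")
    return bytes(data)
```

For example, `decompress_d(bytes([130, 65, 66, 128, 128, 129, 128]))` takes two passes: 129 expands to two 128s, and then each 128 expands to "AB", giving `b"ABABAB"`.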

This is all well and good, but there's a rather large problem with this implementation that you may already be aware of. If you check an extended ASCII table such as Windows-1252 (look in the DEC column) you'll notice there are tons of values above 127 that map to special characters and foreign letters. If you decide to use any such character in a script, its value clashes with the ones reserved for pair substitutes and the compressor will fail. You can open up a .d script in a hex editor or Notepad and see characters like € and ƒ in use as substitutes. This is why you should never make .d scripts.
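If you want to check whether a plain-text script would run into this, a quick Python sketch like the one below lists every byte above 127. The helper name `find_high_bytes` is made up, and the example assumes the script was saved as Windows-1252, which may not match your editor's settings.

```python
def find_high_bytes(script: bytes):
    """Return (offset, value) for every byte that clashes with a substitute code."""
    return [(offset, value) for offset, value in enumerate(script) if value > 127]

# Assumed encoding: Windows-1252, where "é" is 233 and "€" is 128.
sample = "h\u00e9llo".encode("cp1252")
problems = find_high_bytes(sample)
```

Here `problems` ends up as `[(1, 233)]`, meaning the second character of the script would be mistaken for a pair substitute. An empty list means the script only uses values from 0 to 127.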