TextSleuth
TextSleuth is a brute-force search utility to identify non-standard text encoding formats. It supports multi-threading for enhanced performance, as well as a host of flexible options.
Its primary target users are hackers and reverse-engineers developing video game translation patches, especially for games considered "retro", where custom text encoding formats were often used (rather than standards like ASCII or Shift-JIS).
Current Version
TextSleuth is currently at version 1.0.
Changelog
- Version 1.0 (2025-04-19)
- Initial release.
Benchmarks
With support for multi-threading, TextSleuth can be scaled as desired. By default, it will consume one fewer thread than the total logical processor count of the host computer on which it's executed.
On an AMD Ryzen 5 4600H running at 3.0 GHz with six cores and 12 logical processors (threads), where TextSleuth is consuming 11 threads, approximately 20 MB of data can be searched per minute.
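As a rough illustration of the two figures above, the following Python sketch computes the default thread count and estimates scan time from the quoted throughput. The function names are hypothetical and not part of TextSleuth, and the 20 MB per minute rate applies only to the benchmark machine described here, so treat the estimate as a ballpark.

```python
import os


def default_thread_count(logical_processors=None):
    """One fewer thread than the logical processor count, never below one."""
    if logical_processors is None:
        # os.cpu_count() may return None when the count cannot be determined.
        logical_processors = os.cpu_count() or 1
    return max(1, logical_processors - 1)


def estimated_minutes(total_mb, mb_per_minute=20.0):
    """Ballpark scan time, assuming the benchmark throughput quoted above."""
    return total_mb / mb_per_minute


print(default_thread_count(12))   # the 12-thread benchmark machine
print(estimated_minutes(100))     # minutes for 100 MB at 20 MB per minute
```

On hardware with fewer or slower cores, the throughput argument would need to be measured rather than assumed.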
Usage
TextSleuth is a command-line utility to be invoked as follows.
Long option format:
text_sleuth --parameter <value>
Short option format:
text_sleuth -p <value>
Below is a list of all available options, both required and optional.
Required:
-l, --length NUM - Encoded character byte length (e.g., 1, 2)
-p, --pattern FILE - Path of pattern file
-s, --source DIR or FILE - Path of folder to recursively scan (or single file)
Optional:
-w, --wildcard NUM - Number of wildcard bytes in between encoded characters (e.g., 1, 2)
-i, --ignore STR - Comma-separated list of file extensions to ignore (e.g., sfd,adx,pvr)
-c, --thread-count NUM - Number of threads to use (default is CPU core count minus one)
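To make the interaction between --source and --ignore concrete, here is a minimal Python sketch of how a recursive scan with an extension ignore list could be assembled. This is not TextSleuth's own code: the function name collect_files is hypothetical, and the case-insensitive extension comparison is an assumption.

```python
from pathlib import Path


def collect_files(source, ignore=""):
    """Return the files to scan under `source`, skipping ignored extensions.

    `source` may be a folder (scanned recursively) or a single file.
    `ignore` is a comma-separated extension list such as "sfd,adx,pvr".
    """
    ignored = {
        ext.strip().lower().lstrip(".")
        for ext in ignore.split(",")
        if ext.strip()
    }
    root = Path(source)
    if root.is_file():
        candidates = [root]
    else:
        candidates = sorted(p for p in root.rglob("*") if p.is_file())
    # Assumption: extensions are compared without regard to case.
    return [p for p in candidates if p.suffix.lower().lstrip(".") not in ignored]
```

Files without an extension, such as SNRP in the example scenario, are always kept because an empty extension never appears in the ignore set.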
Example Scenario
Consider the following example for the FM Towns game "Phobos".
After some analysis of the game data, it was discovered that in-game dialogue text was not compressed, but was definitely stored using a non-standard text encoding format (e.g., not Shift-JIS).
To uncover the custom character encoding format leveraged by the game, the user finds a chunk of text containing a sufficient number of repeating characters. Since TextSleuth will perform a brute-force search, the user wants to eliminate as many false positives as possible by identifying sequences of characters that are likely to be unique.
Below is one such example, where たアンドロイド『アーマロイド』を contains 16 characters, five of which are not unique (i.e., they are repeated).
After identifying such a text chunk, the user must transcribe a pattern using any ASCII characters of their choice. For example, one can assign a given Japanese character to the letter A, or to the number 1.
Below is an example of translating the string of text into a valid pattern.

Once the pattern has been identified, it's then to be written to a text file.
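Transcribing by hand becomes error-prone for longer strings, so the step can be automated. The following Python sketch assigns ASCII symbols in order of first appearance. The symbol alphabet and the name make_pattern are assumptions chosen for illustration, and the exact layout TextSleuth expects inside the pattern file is not covered here.

```python
import string

# Assumed symbol alphabet: any distinct ASCII characters would do.
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def make_pattern(text, symbols=SYMBOLS):
    """Map each distinct character to one symbol, reusing it on repeats."""
    assigned = {}
    out = []
    for ch in text:
        if ch not in assigned:
            if len(assigned) >= len(symbols):
                raise ValueError("not enough symbols for the distinct characters")
            assigned[ch] = symbols[len(assigned)]
        out.append(assigned[ch])
    return "".join(out)


text = "たアンドロイド『アーマロイド』を"
pattern = make_pattern(text)
print(pattern)                       # ABCDEFDGBHIEFDJK
print(len(text) - len(set(text)))    # 5 repeated characters
```

Repeated symbols (B, D, E, and F here) are what give the pattern its discriminating power during the search.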

With this pattern saved as phobos.txt, and the extracted game data stored in a folder named inp, it's time to construct the first search command.
For the initial attempt, the user assumes a two-byte format with no wildcards in between.
text_sleuth.exe --length 2 --pattern phobos.txt --source inp\

As seen above, a match was found on the first attempt, and in a total of five seconds! TextSleuth is reporting that an array of bytes matching the defined search criteria pattern was found at offset 0x892 inside the file SNRP.
Consider the matched byte array. It appears to be potentially valid, as a discernible format begins to take shape for a proposed custom text encoding format.
14ed 1c0a 1c9c 1c43 1c6c 1c1a 1c43 0cbb 1c0a 0cda 1ceb 1c6c 1c1a 1c43 0cc3 1487
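To see why this byte array satisfies the pattern, here is a minimal single-threaded Python sketch of a brute-force pattern search. It is an illustration under stated assumptions rather than TextSleuth's actual implementation: the pattern string is a hypothetical transcription with one symbol per character, distinct symbols are assumed to require distinct byte values, and wildcard bytes are assumed to sit between consecutive encoded characters.

```python
def matches_at(data, start, pattern, length, wildcard=0):
    """Check whether `pattern` fits the bytes beginning at `start`."""
    stride = length + wildcard
    # The final character has no trailing wildcard gap.
    span = stride * (len(pattern) - 1) + length
    if start + span > len(data):
        return False
    symbol_to_code = {}
    code_to_symbol = {}
    for i, symbol in enumerate(pattern):
        offset = start + i * stride
        code = bytes(data[offset:offset + length])
        # Same symbol must always map to the same bytes.
        if symbol_to_code.setdefault(symbol, code) != code:
            return False
        # Assumption: different symbols must map to different bytes.
        if code_to_symbol.setdefault(code, symbol) != symbol:
            return False
    return True


def find_matches(data, pattern, length, wildcard=0):
    """Try every byte offset and return those where the pattern fits."""
    return [i for i in range(len(data))
            if matches_at(data, i, pattern, length, wildcard)]


# Hypothetical transcription of the 16-character example string.
PATTERN = "ABCDEFDGBHIEFDJK"
matched = bytes.fromhex(
    "14ed1c0a1c9c1c431c6c1c1a1c430cbb"
    "1c0a0cda1ceb1c6c1c1a1c430cc31487"
)
print(find_matches(matched, PATTERN, length=2))  # [0]
```

Every offset is tried, not only those aligned to the character length, because the start of a text block inside a data file is not known in advance.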
As an initial test, the user will repeat the first two-byte sequence (14 ed) a total of ten times to see if the change is reflected in the game itself.
As seen below, the first character in the text chunk, た, is indeed repeated ten times!
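The test edit can be sketched in Python as an in-place overwrite that keeps the data the same size. The helper name is hypothetical, and the example works on the matched bytes in memory. Applying it to a copy of the real file at the reported offset is left to the reader.

```python
def repeat_first_character(data, offset, length, count):
    """Overwrite `count` character slots with the character found at `offset`.

    The data length is unchanged, so later offsets in the file stay valid.
    """
    end = offset + length * count
    if end > len(data):
        raise ValueError("patch would run past the end of the data")
    patched = bytearray(data)
    patched[offset:end] = bytes(data[offset:offset + length]) * count
    return bytes(patched)


matched = bytes.fromhex(
    "14ed1c0a1c9c1c431c6c1c1a1c430cbb"
    "1c0a0cda1ceb1c6c1c1a1c430cc31487"
)
patched = repeat_first_character(matched, offset=0, length=2, count=10)
print(patched.hex(" ", 2))
```

Overwriting rather than inserting avoids shifting any data that follows the text.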
It's at this point that the user undergoes the process of mapping out the table of all characters supported by the game, after which text extraction and additional hacking efforts can take place.
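A starting point for that table can be derived directly from the confirmed match. The Python sketch below pairs each two-byte code from the matched array with the corresponding character of the known string, then uses the partial table to decode bytes. The function names and the "?" placeholder for unmapped codes are assumptions made for illustration.

```python
def build_table(matched, text, length):
    """Pair each `length`-byte code in `matched` with the character it encodes."""
    if len(matched) != len(text) * length:
        raise ValueError("matched bytes and text disagree on character count")
    table = {}
    for i, ch in enumerate(text):
        code = bytes(matched[i * length:(i + 1) * length])
        if table.setdefault(code, ch) != ch:
            raise ValueError(f"code {code.hex()} maps to two different characters")
    return table


def decode(data, table, length, unknown="?"):
    """Decode fixed-width codes, marking anything not yet in the table."""
    return "".join(
        table.get(bytes(data[i:i + length]), unknown)
        for i in range(0, len(data) - length + 1, length)
    )


text = "たアンドロイド『アーマロイド』を"
matched = bytes.fromhex(
    "14ed1c0a1c9c1c431c6c1c1a1c430cbb"
    "1c0a0cda1ceb1c6c1c1a1c430cc31487"
)
table = build_table(matched, text, length=2)
for code, ch in sorted(table.items()):
    print(code.hex(), ch)
```

This yields 11 entries, one per distinct character. The remaining characters still have to be mapped by further searches or in-game testing.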