Whitespace steganography
Source code embedding
Whitespace programs can be embedded in text files by modifying spaces in the
target to match the Whitespace program. Edwin’s embed.hs did this by simply
substituting spaces, but the result may not always be semantically equivalent
to the original. For source code, a more sophisticated embedder should instead
tokenize the target file and change only the whitespace between tokens. Strings
and code blocks should typically be left as-is.
Considerations for various programming and markup languages:
- JSON: Strings should be preserved.
- HTML: HTML whitespace handling is defined by its tokenization state machine
and dynamically affected by CSS.
- Valid locations for space, tab, and LF (incomplete):
- Invalid locations for space, tab, and LF (incomplete):
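As a rough illustration of the tokenizing approach for JSON, the sketch below splits a document into tokens and spreads a whitespace-only payload over the gaps between them, leaving strings untouched. The names `embed_in_json` and `extract_from_json` are hypothetical, and this is not how embed.hs works. It also ignores one real problem: spaces inside strings are still read by a Whitespace interpreter, so a complete embedder would have to line the program up with them.

```python
import json
import re

# JSON tokens: strings, structural punctuation, and bare literals
# (numbers, true, false, null). Everything else is inter-token whitespace.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[{}\[\],:]|[^\s{}\[\],:"]+')


def embed_in_json(cover: str, program: str) -> str:
    """Rebuild `cover` with `program` spread over the gaps between tokens."""
    if set(program) - set(" \t\n"):
        raise ValueError("program must contain only space, tab, and LF")
    tokens = TOKEN.findall(cover)
    gaps = len(tokens) + 1  # before the first token, between tokens, after the last
    size = max(1, -(-len(program) // gaps))  # ceiling division
    chunks = [program[i:i + size] for i in range(0, len(program), size)]
    chunks += [""] * (gaps - len(chunks))
    out = [chunks[0]]
    for token, chunk in zip(tokens, chunks[1:]):
        out.append(token)
        out.append(chunk)
    return "".join(out)


def extract_from_json(text: str) -> str:
    """Recover the embedded characters by deleting every JSON token."""
    return TOKEN.sub("", text)


cover = '{"a": [1, 2, {"b": "x y"}], "c": null}'
program = "   \t\n\t\n \t\n\n\n"  # illustrative whitespace-only payload
stego = embed_in_json(cover, program)
assert json.loads(stego) == json.loads(cover)  # same JSON value
assert extract_from_json(stego) == program     # payload survives outside strings
```

Because JSON permits space, tab, and LF between any two tokens, the rewritten document parses to the same value as the original.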
Binary file embedding
Whitespace programs can be embedded in file formats that are parsed starting
from a trailer at the end of the file or that have comment fields at the start.
Formats that would work:
- PDF: Parsing starts with the cross-reference table at the end of the file.
Files start with a comment containing the version, like %PDF-1.7, followed by
LF, so a nop instruction sequence starting with LF would need to be prepended
to the program.
- ZIP: Parsing starts with the central directory, located at the end of the
archive, which specifies the offset of each file, so arbitrary data can precede
the files. Some tools, like gzip, don’t process archives that don’t start with
a file entry at offset 0.
- APK is a ZIP archive.
- EPUB is packaged in a ZIP archive.
- JAR, EAR, and WAR are ZIP archives.
- Office Open XML files (.docx, .docm, .pptx, .pptm, .xlsx, and .xlsm) are ZIP
archives of XML files.
- OpenDocument Format files (.odt, .fodt, .ods, .fods, .odp, .fodp, .odg, .fodg,
and .odf) are XML files, sometimes in ZIP archives.
- XPI (Mozilla extension) is a ZIP archive.
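The ZIP case can be tried out with a small sketch using Python's standard `zipfile` module, which locates the central directory from the end of the file and tolerates leading data. The helper names `make_zip` and `read_zip` are hypothetical, and the payload is just the three-LF end instruction. Other ZIP readers may behave differently, so treat this as a check of one implementation rather than of the format in general.

```python
import io
import zipfile


def make_zip(files: dict) -> bytes:
    """Build an in-memory ZIP archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_zip(blob: bytes) -> dict:
    """Read every member of an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


program = b"\n\n\n"  # minimal Whitespace program: the end instruction
archive = make_zip({"hello.txt": b"hello"})
polyglot = program + archive  # program first, untouched archive after it

# The archive is still readable even though it no longer starts at offset 0.
assert read_zip(polyglot) == {"hello.txt": b"hello"}
```

Note that the stored offsets in the central directory are not rewritten here; the reader has to compensate for the prepended bytes, which is why stricter tools may reject such a file.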
Formats that may possibly work:
- BMP: The offset of the pixel
array is specified in the header, so arbitrary data could potentially be
included between the headers and data. However, the headers would need to be
valid UTF-8.
- FLAC: A program could be stored in the optional VORBIS_COMMENT block, but the
STREAMINFO block before it would need to be valid UTF-8.
Formats that won’t work:
- 7z: Although data may occur between
headers and it has a comment property, the magic number is not valid UTF-8.
- GIF: The contents of the header are
invalid as UTF-8.
- gzip: Although it has an optional
comment field, the sequence for the magic number and compression method,
1F 8B 08, is invalid as UTF-8.
- JPEG: Although it
has a COM comment marker, the start of image marker, FF D8, is invalid as
UTF-8.
- TAR: It starts with a file
entry and has no mechanism for comments.
- PNG: The magic number is invalid as
UTF-8.
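The UTF-8 objections above can be checked mechanically. The sketch below tests a few leading byte sequences with a strict UTF-8 decode. The gzip and JPEG bytes are the ones quoted above; the PNG and 7z signatures are filled in from their published magic numbers, and `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Leading bytes of each format.
MAGIC = {
    "gzip": bytes.fromhex("1f8b08"),            # magic number + compression method
    "JPEG": bytes.fromhex("ffd8"),              # start of image marker
    "PNG": bytes.fromhex("89504e470d0a1a0a"),   # 8-byte signature
    "7z": bytes.fromhex("377abcaf271c"),        # 6-byte signature
    "PDF": b"%PDF-1.7\n",                       # version comment followed by LF
}

for name, magic in MAGIC.items():
    verdict = "valid" if is_valid_utf8(magic) else "invalid"
    print(f"{name}: {verdict} as UTF-8")
```

Only the PDF header passes; each of the others contains a byte that cannot begin a UTF-8 sequence (8B, FF, 89, and BC respectively).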