Summary
Today, code for input data validation is typically written by hand in an ad hoc manner.
For commonly used electronic data formats, input validation is, at a minimum, a problem of scale: the specifications of these formats run to hundreds or thousands of pages.
Input validation thus translates into thousands of conditions or more that must be checked against the input data before the data can be safely processed. Manually writing the code to parse and validate input, and then manually auditing whether that code implements all the necessary checks completely and correctly, does not scale.
Moreover, manual parser coding and auditing typically fails even for electronic data formats specifically designed to make such tasks easier, e.g., JSON and XML. A variety of critical vulnerabilities have been found in major parser implementations for these formats.
Widely deployed mitigations against crafted-input attacks include (a) trying to prevent the flow of untrusted data to vulnerable software, and (b) testing software with randomized inputs to find and patch flaws that maliciously crafted inputs could trigger. Unfortunately, neither approach offers security assurance guarantees.
Mitigations that try to prevent the flow of untrusted data to vulnerable software, implemented via network- or host-based measures such as firewalls, application proxies, and antivirus scanners, neither remove the underlying vulnerability from the target nor encode complete knowledge of document or message format internals.
Attackers bypass such mitigations by exploiting this incomplete understanding of the data format, and then compromise the still-vulnerable targets.
The Safe Documents (SafeDocs) program will develop novel verified programming methodologies for building high assurance parsers for extant electronic data formats, and novel methodologies for comprehending, simplifying, and reducing these formats to their safe, unambiguous, verification-friendly subsets (“safe sub-setting”).
SafeDocs will address the obstacles of ambiguity and complexity that extant electronic data formats pose to verified programming. SafeDocs’ multi-pronged approach will combine:
- extracting the extant formats’ de facto syntax (including any non-compliant syntax deliberately accepted and substantially used in the wild);
- identifying a syntactically simpler subset of this syntax that lends itself to verified programming while preserving the format's essential functionality; and
- creating software construction kits for building secure, verified parsers for this syntactically simpler subset, and high-assurance translators for converting extant instances of the format to this subset (a toy sketch of safe sub-setting follows this list).
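To make "safe sub-setting" concrete, here is a toy sketch in plain Python. It uses no SafeDocs tooling; the record syntax and every name in it are invented for illustration. The parser, built from small combinators, accepts only a deliberately restricted, unambiguous syntax and rejects anything outside it, including trailing bytes.

```python
# Toy illustration of safe sub-setting: a parser, built from small combinators,
# that accepts only a restricted, unambiguous record syntax ("LEN:" digits ";")
# and rejects everything else. The format and names are invented for this sketch.

def tag(token):
    """Match an exact byte string at the current position."""
    def parse(data, pos):
        if data.startswith(token, pos):
            return token, pos + len(token)
        return None
    return parse

def digits(data, pos):
    """Match an unsigned decimal integer with no sign, exponent, or leading zero."""
    end = pos
    while end < len(data) and data[end:end + 1].isdigit():
        end += 1
    if end == pos or (data[pos:pos + 1] == b"0" and end - pos > 1):
        return None
    return int(data[pos:end]), end

def sequence(*parsers):
    """Run sub-parsers in order; the whole parse fails if any piece fails."""
    def parse(data, pos):
        values = []
        for p in parsers:
            result = p(data, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

record = sequence(tag(b"LEN:"), digits, tag(b";"))

def parse_record(data: bytes) -> int:
    result = record(data, 0)
    if result is None or result[1] != len(data):  # trailing bytes are rejected too
        raise ValueError("input is outside the safe subset")
    return result[0][1]

assert parse_record(b"LEN:42;") == 42
```

The point of the restriction is that the accepted language is small enough to reason about exhaustively, which is what makes verification of the parser tractable.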
The parser construction kits developed by SafeDocs will be usable by industry programmers who understand the syntax of electronic data formats but lack a theoretical background in verified programming. These tools will enable developers to construct verifiable parsers for new electronic data formats as well as extant ones. The tools will also guide the syntactic design of new formats by making verification-friendly syntax easy to express and, conversely, hard-to-verify syntax awkward.
Tools
The following tools can help software developers and cybersecurity/privacy researchers improve their organization’s security posture in handling electronic documents. They range in functionality and specificity to suit a variety of uses.
- Arlington PDF Model
Document object model for PDF to enhance security/drive new modalities in specifications
- Digital Corpora Project Corpus
8 million real-world PDFs
- SPARCLUR
Collection of wrappers for extant PDF parsers/renderers, plus output comparison/analysis tools (see the sketch below)
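As a rough illustration of what SPARCLUR-style comparison buys, the sketch below runs two widely available PDF tools over the same file and flags disagreement about whether the file is acceptable. It is plain Python and does not use SPARCLUR's API; poppler's pdftotext and qpdf --check are assumptions about what is installed, standing in for whichever parsers you actually wrap.

```python
# Illustrative differential harness in the spirit of SPARCLUR: run several
# extant PDF tools over one file and compare whether they accept it.
# pdftotext (poppler) and "qpdf --check" are assumed to be installed; they
# stand in for whichever parsers/renderers you actually wrap.
import subprocess

PARSERS = {
    "pdftotext": lambda path: ["pdftotext", path, "-"],
    "qpdf":      lambda path: ["qpdf", "--check", path],
}

def accepts(cmd):
    """Treat a zero exit status as 'this parser accepted the file'."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def differential_check(path):
    verdicts = {name: accepts(build(path)) for name, build in PARSERS.items()}
    if len(set(verdicts.values())) > 1:
        print(f"{path}: parsers disagree: {verdicts}")  # a classic crafted-input smell
    return verdicts
```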
Programmer resources for describing data formats and auto-generating parsing code
- DaeDaLus
Data description language for defining data formats and generating memory-safe parsers in a variety of languages
- Hammer
Declarative secure parser/scanner construction kit in C
- Parsley
Data description language to capture data formats (a sketch of the shared data-description idea follows this group)
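What the three tools above share is the data-description idea: write the format once, declaratively, and derive the parsing and validation code from that description rather than hand-coding it. The sketch below imitates the idea in plain Python with an invented fixed-layout header; it is not DaeDaLus, Hammer, or Parsley syntax.

```python
# Sketch of the data-description idea behind DaeDaLus, Hammer, and Parsley:
# the format lives in one declarative description, and a generic engine turns
# it into a parser plus validator. The header layout here is invented.
import struct

# Hypothetical fixed-layout header: magic bytes, version, payload length.
HEADER_SPEC = [
    ("magic",   "4s", lambda v: v == b"SAFE"),
    ("version", "B",  lambda v: 1 <= v <= 2),
    ("length",  ">I", lambda v: v <= 1_000_000),
]

def parse_header(data: bytes):
    """Interpret the description: unpack each field, then run its validity check."""
    offset, fields = 0, {}
    for name, fmt, check in HEADER_SPEC:
        size = struct.calcsize(fmt)
        (value,) = struct.unpack_from(fmt, data, offset)
        if not check(value):
            raise ValueError(f"field {name!r} failed validation: {value!r}")
        fields[name] = value
        offset += size
    return fields, offset

header, consumed = parse_header(b"SAFE\x01\x00\x00\x00\x10" + b"\x00" * 16)
assert header == {"magic": b"SAFE", "version": 1, "length": 16}
```

Because the description is data rather than hand-written control flow, the same specification can drive multiple back-ends, which is roughly how such kits emit parsers in more than one language.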
Document collections and format rules
- File Observatory
System to enable visualization, search, discovery of complex file format patterns/data
- Format Analysis Workbench
Runs/analyzes output from any number of parsers dealing with a single file or streaming format
- Dowker tools
Tools for statistical inference of file format behaviors
- PolyFile
Identifies/maps the semantic structure of files
Behavior of existing parser code
- PolyTracker
Data/control-flow analysis for programs
- GraphTage
Command-line utility/underlying library for semantically comparing/merging tree-like structures
- Fickling
Decompiler, static analyzer, and bytecode rewriter for Python pickle object serializations (see the sketch below)
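For context on what Fickling inspects, the snippet below uses only the standard-library pickle and pickletools modules (not Fickling's own API) to dump a pickle's opcode stream and apply a crude version of the "does this pickle call anything at load time?" check that Fickling formalizes.

```python
# What Fickling-style analysis looks at: the pickle opcode stream.
# Standard library only; this is not Fickling's API.
import pickle
import pickletools

payload = pickle.dumps({"page_count": 3, "title": "report"})
pickletools.dis(payload)  # prints one line per opcode: PROTO, FRAME, ..., STOP

# Crude red flag: a GLOBAL/STACK_GLOBAL plus REDUCE means the pickle will
# import and call something when it is loaded.
call_opcodes = {"GLOBAL", "STACK_GLOBAL", "REDUCE"}
seen = {opcode.name for opcode, arg, pos in pickletools.genops(payload)}
print("needs review" if seen & call_opcodes else "no call opcodes seen")
```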