PEP 330 -- Python Bytecode Verification

PEP:	330
Title:	Python Bytecode Verification
Version:	$Revision: 2090 $
Last-Modified:	$Date: 2005-06-28 01:14:46 -0700 (Tue, 28 Jun 2005) $
Author:	Michel Pelletier <michel at users.sourceforge.net>
Status:	Rejected
Type:	Standards Track
Content-Type:	text/plain
Created:	17-Jun-2004
Python-Version:	2.6?
Post-History:

Abstract

    If Python Virtual Machine (PVM) bytecode is not "well-formed" it
    is possible to crash or exploit the PVM by causing various errors
    such as under/overflowing the value stack or reading/writing into
    arbitrary areas of the PVM program space.  Most of these kinds of
    errors can be eliminated by verifying that PVM bytecode does not
    violate a set of simple constraints before execution.

    This PEP proposes a set of constraints on the format and structure
    of Python Virtual Machine (PVM) bytecode and provides an
    implementation in Python of this verification process.

Pronouncement

    Guido believes that a verification tool has some value.  If
    someone wants to add it to Tools/scripts, no PEP is required.

    Such a tool may have value for validating the output from
    "bytecodehacks" or from direct edits of PYC files.  As a security
    measure, its value is somewhat limited because perfectly valid
    bytecode can still do horrible things.  That situation could
    change if the concept of restricted execution were to be
    successfully resurrected.

Motivation

    The Python Virtual Machine executes Python programs that have been
    compiled from the Python language into a bytecode representation.
    The PVM assumes that any bytecode being executed is "well-formed"
    with regard to a number of implicit constraints.  Some of these
    constraints are checked at run-time, but most are not, because of
    the overhead such checks would add.

    When running in debug mode the PVM performs several run-time
    checks to ensure that bytecode does not violate these constraints;
    to a degree, these checks prevent bytecode from crashing or
    exploiting the interpreter.  They add a measurable overhead to
    the interpreter, and are typically turned off in common use.

    Bytecode that is not well-formed and executed by a PVM not running
    in debug mode may create a variety of fatal and non-fatal errors.
    Typically, ill-formed code will cause the PVM to seg-fault and
    cause the OS to immediately and abruptly terminate the
    interpreter.

    Conceivably, ill-formed bytecode could exploit the interpreter and
    allow Python bytecode to execute arbitrary C-level machine
    instructions or to modify private, internal data structures in the
    interpreter.  If used cleverly this could subvert any form of
    security policy an application may want to apply to its objects.

    Practically, it would be difficult for a malicious user to
    "inject" invalid bytecode into a PVM for the purposes of
    exploitation, but not impossible.  Buffer overflow and memory
    overwrite attacks are commonly understood, particularly when the
    exploit payload is transmitted unencrypted over a network or when
    a file or network security permission weakness is used as a
    foothold for further attacks.

    Ideally, no bytecode should ever be allowed to read or write
    underlying C-level data structures to subvert the operation of the
    PVM, whether the bytecode was maliciously crafted or not.  A
    simple pre-execution verification step could ensure that bytecode
    cannot over/underflow the value stack or access other sensitive
    areas of PVM program space at run-time.

    This PEP proposes several validation steps that should be taken on
    Python bytecode before it is executed by the PVM so that it
    complies with static and structural constraints on its instructions
    and their operands.  These steps are simple and catch a large
    class of invalid bytecode that can cause crashes.  There is also
    some possibility that some run-time checks can be eliminated up
    front by a verification pass.

    There is, of course, no way to verify that bytecode is "completely
    safe", for every definition of complete and safe.  Even with
    bytecode verification, Python programs can, and most likely will,
    continue to seg-fault for a variety of reasons and to cause many
    different classes of run-time errors, fatal or not.
    The verification step proposed here simply plugs an easy hole that
    can cause a large class of fatal and subtle errors at the bytecode
    level.

    Currently, the Java Virtual Machine (JVM) verifies Java bytecode
    in a way very similar to what is proposed here.  The JVM
    Specification version 2 [1], Sections 4.8 and 4.9, were therefore
    used as a basis for some of the constraints explained below.  Any
    Python bytecode verification implementation must, at a minimum,
    enforce these constraints, but need not be limited to them.

Static Constraints on Bytecode Instructions

    1. The bytecode string must not be empty (len(co_code) > 0).

    2. The bytecode string cannot exceed a maximum size
       (len(co_code) < sizeof(unsigned char) - 1).

    3. The first instruction in the bytecode string begins at index 0.

    4. Only valid byte-codes with the correct number of operands can
       be in the bytecode string.
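
    As an illustration only, constraints 1, 3, and 4 can be sketched
    with the standard `dis` module of a current CPython 3 interpreter,
    whose bytecode layout differs from the one this PEP was written
    against.  The function name `check_static_instruction_constraints`
    and its list-of-strings return value are assumptions of this
    sketch, not part of the proposed `verify` module.  Constraint 2 is
    not covered.

```python
import dis


def check_static_instruction_constraints(code):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper, not part of the PEP's reference implementation.
    """
    problems = []

    # Constraint 1: the bytecode string must not be empty.
    if len(code.co_code) == 0:
        problems.append("bytecode string is empty")
        return problems

    instructions = list(dis.get_instructions(code))

    # Constraint 3: the first instruction begins at index 0.
    if instructions[0].offset != 0:
        problems.append("first instruction does not start at index 0")

    # Constraint 4: only known opcodes may appear.  The dis module names
    # unassigned opcode numbers with a placeholder of the form "<N>".
    for ins in instructions:
        if ins.opname.startswith("<"):
            problems.append(
                f"unknown opcode {ins.opcode} at offset {ins.offset}"
            )
    return problems
```

    Returning a list rather than raising on the first problem is a
    design choice of the sketch; it lets a caller report every
    violation in one pass.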

Static Constraints on Bytecode Instruction Operands

    1. The target of a jump instruction must be within the code
       boundaries and must fall on an instruction, never between an
       instruction and its operands.

    2. The operand of a LOAD_* instruction must be a valid index into
       its corresponding data structure.

    3. The operand of a STORE_* instruction must be a valid index
       into its corresponding data structure.
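
    A minimal sketch of the operand checks, again assuming a current
    CPython 3 interpreter and its `dis` module rather than the
    bytecode format of the time.  It covers jump targets and, for the
    index constraints, only instructions that index `co_consts`; the
    encoding of other LOAD_* and STORE_* operands varies between
    CPython versions and is left out.  The name
    `check_operand_constraints` is hypothetical.

```python
import dis


def check_operand_constraints(code):
    """Return a list of problems found in jump targets and constant indexes.

    Hypothetical helper, not part of the PEP's reference implementation.
    """
    problems = []
    instructions = list(dis.get_instructions(code))
    starts = {ins.offset for ins in instructions}

    # Constraint 1: every jump target lies inside the code and falls on
    # the start of an instruction, never between an opcode and its operand.
    for target in dis.findlabels(code.co_code):
        if not 0 <= target < len(code.co_code):
            problems.append(f"jump target {target} is outside the code")
        elif target not in starts:
            problems.append(
                f"jump target {target} is not the start of an instruction"
            )

    # Constraint 2, for constants only: the operand must be a valid
    # index into co_consts.
    for ins in instructions:
        if ins.opcode in dis.hasconst:
            if not 0 <= ins.arg < len(code.co_consts):
                problems.append(
                    f"constant index {ins.arg} out of range "
                    f"at offset {ins.offset}"
                )
    return problems
```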

Structural Constraints between Bytecode Instructions

    1. Each instruction must only be executed with the appropriate
       number of arguments in the value stack, regardless of the
       execution path that leads to its invocation.

    2. If an instruction can be executed along several different
       execution paths, the value stack must have the same depth prior
       to the execution of the instruction, regardless of the path
       taken.

    3. At no point during execution can the value stack grow to a
       depth greater than that implied by co_stacksize.

    4. Execution never falls off the bottom of co_code.
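
    The four structural constraints can be illustrated with a worklist
    pass over a deliberately tiny, hypothetical instruction set.  The
    opcode names, the `EFFECTS` table, and the `check_stack` function
    below are all invented for this sketch; it is not the algorithm
    used by the reference implementation and does not model real PVM
    bytecode.

```python
# Hypothetical toy instruction set: name -> (values popped, values pushed).
EFFECTS = {
    "PUSH": (0, 1),
    "POP": (1, 0),
    "ADD": (2, 1),
    "JUMP": (0, 0),      # unconditional jump to instruction index `arg`
    "JUMP_IF": (1, 0),   # pops a value, then jumps to `arg` or falls through
    "RETURN": (1, 0),    # pops the result and ends execution
}


class VerificationError(Exception):
    pass


def check_stack(program, stacksize):
    """Check a list of (name, arg) pairs; return the depth on entry to each
    reachable instruction, or raise VerificationError."""
    if not program:
        raise VerificationError("program is empty")
    depth_at = {0: 0}    # stack depth on entry, keyed by instruction index
    worklist = [0]
    while worklist:
        index = worklist.pop()
        name, arg = program[index]
        pops, pushes = EFFECTS[name]
        depth = depth_at[index]
        # Constraint 1: enough values on the stack for this instruction.
        if depth < pops:
            raise VerificationError(f"stack underflow at {index}")
        depth = depth - pops + pushes
        # Constraint 3: never deeper than the declared stack size.
        if depth > stacksize:
            raise VerificationError(f"stack overflow at {index}")
        if name == "RETURN":
            successors = []
        elif name == "JUMP":
            successors = [arg]
        elif name == "JUMP_IF":
            successors = [arg, index + 1]
        else:
            successors = [index + 1]
        for nxt in successors:
            # Constraint 4: execution never leaves the code.
            if not 0 <= nxt < len(program):
                raise VerificationError(f"execution leaves the code after {index}")
            if nxt not in depth_at:
                depth_at[nxt] = depth
                worklist.append(nxt)
            elif depth_at[nxt] != depth:
                # Constraint 2: same depth along every path.
                raise VerificationError(f"inconsistent stack depth at {nxt}")
    return depth_at
```

    Because each instruction is given exactly one entry depth, it is
    processed at most once, so the pass terminates even when the
    program contains loops.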

Implementation

    This PEP is the working document for a Python bytecode
    verification implementation written in Python.  This
    implementation is not used implicitly by the PVM before executing
    any bytecode, but is to be used explicitly by users concerned
    about possibly invalid bytecode with the following snippet:

        import verify
        verify.verify(object)

    The `verify` module provides a `verify` function which accepts the
    same kind of arguments as `dis.dis`: classes, methods, functions,
    or code objects.  It verifies that the object's bytecode is
    well-formed according to the specifications of this PEP.

    If the code is well-formed the call to `verify` returns silently
    without error.  If an error is encountered, it raises a
    `VerificationError` whose argument indicates the cause of the
    failure.  It is up to the programmer whether or not to handle the
    error in some way or execute the invalid code regardless.
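
    The sketch below shows one way a caller might act on the result.
    Since the `verify` module is not available here, the sketch
    defines hypothetical stand-ins for `verify` and
    `VerificationError`; the stand-in checks only that each bytecode
    string is non-empty.  The helpers `code_objects` and
    `run_if_verified` are likewise assumptions of this example.

```python
import types


class VerificationError(Exception):
    """Stand-in for the exception the proposed verify module would raise."""


def code_objects(obj):
    """Yield the code objects of a class, method, function, or code object."""
    if isinstance(obj, types.CodeType):
        yield obj
    elif isinstance(obj, types.MethodType):
        yield obj.__func__.__code__
    elif isinstance(obj, types.FunctionType):
        yield obj.__code__
    elif isinstance(obj, type):
        # Only plain functions defined directly on the class are visited.
        for member in vars(obj).values():
            if isinstance(member, types.FunctionType):
                yield member.__code__
    else:
        raise TypeError(f"cannot verify {type(obj).__name__} objects")


def verify(obj):
    """Hypothetical stand-in: rejects only empty bytecode strings."""
    for code in code_objects(obj):
        if len(code.co_code) == 0:
            raise VerificationError(f"{code.co_name}: bytecode string is empty")


def run_if_verified(func, *args):
    """Call func only when verification succeeds; otherwise return None."""
    try:
        verify(func)
    except VerificationError as exc:
        print(f"refusing to run {func.__name__}: {exc}")
        return None
    return func(*args)
```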

    Phillip Eby has proposed a pseudo-code algorithm for bytecode
    stack depth verification used by the reference implementation.

Verification Issues

    This PEP describes only a small number of verifications.  While
    discussion and analysis will lead to many more, it is quite
    possible that further verifications, including custom,
    project-specific ones, will be needed in the future.  For this
    reason, it might be desirable to add a registration interface to
    the test implementation so that future verifiers can be
    registered.  The need for this is minimal since custom verifiers
    can subclass and extend the current implementation for added
    behavior.
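
    Extension by subclassing might look like the following sketch.
    The `Verifier` base class, its `check_` method-naming convention,
    and the `NoNamesVerifier` rule are all hypothetical; the PEP does
    not specify the class layout of the implementation.

```python
class VerificationError(Exception):
    pass


class Verifier:
    """Hypothetical base class: runs every method whose name starts
    with "check_" against a code object."""

    def verify(self, code):
        for name in sorted(dir(self)):
            if name.startswith("check_"):
                getattr(self, name)(code)

    def check_not_empty(self, code):
        if len(code.co_code) == 0:
            raise VerificationError("bytecode string is empty")


class NoNamesVerifier(Verifier):
    """Example project-specific rule: the code may not refer to any
    global or attribute names (co_names must be empty)."""

    def check_no_names(self, code):
        if code.co_names:
            raise VerificationError(
                f"code refers to names: {', '.join(code.co_names)}"
            )
```

    A subclass adds a rule simply by defining another `check_` method,
    so no separate registration step is needed in this design.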

Required Changes

    Armin Rigo noted that several byte-codes will need modification in
    order for their stack effect to be statically analyzed.  These are
    END_FINALLY, POP_BLOCK, and MAKE_CLOSURE.  Armin and Guido have
    already agreed on how to correct the instructions.  Currently the
    Python implementation punts on these instructions.

    This PEP does not propose to add the verification step to the
    interpreter, but only to provide the Python implementation in the
    standard library for optional use.  Whether or not this
    verification procedure is translated into C, included with the PVM
    or enforced in any way is left for future discussion.

References

    [1] The Java Virtual Machine Specification 2nd Edition
        http://java.sun.com/docs/books/vmspec/2nd-edition/html/ClassFile.doc.html

Copyright

    This document has been placed in the public domain.