Twelve Thousand Test Cases and Counting:
a Critique of Lightweight Methodologies in Python Program Development

Phil Pfeiffer
phil@etsu.edu
East TN State Univ.

PyCon 2004 / March 2004
George Washington University
I. Basis for talk: case study in program development
- starting problem: a complex, unstable specification for a DB
  - context: resource allocation and tracking system for ORNL CCS (22+ pp. of specs)
- goal: limit risk of instability with compiler
- vision: combine best of two approaches:
  - clarity, concision of type definition-based data model
  - performance of relational DB
- plan: compile data model to SQL
  - Part 1: type defn grammar → create table queries
    - grammar: dest-type, enum, union, tuple, map, set, include, reference
    - limitation: maps, sets not nestable
  - Part 2: object defn grammar → insert into table queries
    - object grammar: comparable to values grammar
  - Part 3: queries over objects → queries over compiled tables
- progress so far
  - coding started April 2003
  - part 1 done November 2003
  - part 2 to resume in May 2004
II. Issue: strategy for managing development?
- classic methodologies: pessimistic, system-centric
  - vision: "Build it so you can trust it. Then don't trust it." [Meyer]
  - emphases:
    - defensive design: "I take care of abnormal cases right away."
    - static analysis (including static type checking)
    - assertions (pre- and post-conditions)
    - systematic testing (all equivalent inputs / all program points / etc.)
  - [cf. Meyer, Bertrand, "Practice to Perfect," IEEE Computer, May 1997]
- lightweight methodologies: optimistic, developer-centric
  - vision: shoot from hip with care; code will work well enough, in time
  - emphases:
    - simple design
      - do simplest thing that works
      - tolerate simplicity until intolerable, then revise
    - reasonable care in coding
    - reasonable set of test cases
      - just write test cases, and you're done
      - add more as needed
  - [cf. Spring D.C. 2002 Python conference tutorial on agile methods]
III. Initial choice: go lightweight
- strategy
  - use Python, SAX
  - code bottom up, testing code as written
  - use reasonable number of reasonably simple tests
  - hope to finish in four months
- rationale
  - praise for Python, XML, SAX
    - Python portrayed as easy to use, easy to learn
    - XML portrayed as making language definition simple
    - SAX portrayed as making parsing simple
  - apparent fit for lightweight methods
  - lots of developer experience
    - 30+ years in computing: assembler, OS coding, cross-platform development, network programming, class library development
    - advanced work in programming languages, including compilers
    - experience with functional programming
  - no communication overhead (solo development)
IV. One month later: what hit me?
- more code than expected (1,800 SLOC, no end in sight)
- coding slower, code far buggier than expected
- analysis
  - Python harder than expected
    - interpreted OO different from compiled OO
      - dynamic attribute instantiation
      - overloading as arglist-driven interpretation
      - templating as metaprogramming
      - unfamiliar idioms: self.f(); class.f(self); self.__class__; super(); etc.
    - unexpected irregularities in Python: e.g.,
      - types without eval-able __repr__ methods (e.g., functions, classes, types)
      - uneven support for introspection (e.g., getting fn hash from _getframe()?)
      - one-item tuples vs. one-item lists
    - interpreted ≠ freeform
    - same old structural tedium: copy, deepcopy, eq, ne, __repr__, etc.
  - SAX less helpful than expected
    - sorting out parsers took time [why all the non-validating parsers?]
    - expat documentation uneven, even misleading
    - for feature-rich languages, code generation (not parsing) still the key challenge
  - hit-and-miss testing failing to eliminate errors
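Two of the irregularities above are easy to demonstrate; the snippet below is an illustrative sketch, not code from the talk's compiler:

```python
# 1. One-item tuples need a trailing comma; one-item lists do not.
t = (1)       # an int, not a tuple: parentheses alone don't make a tuple
t1 = (1,)     # the actual one-item tuple
l = [1]       # a one-item list: no comma quirk
assert type(t) is int
assert type(t1) is tuple and len(t1) == 1

# 2. Some types lack eval-able __repr__ methods.
assert eval(repr([1, 2])) == [1, 2]   # lists round-trip through repr
assert repr(len).startswith('<')      # functions do not: '<built-in ...>'
```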
V. Rethinking initial choice of methodology
- starting point: Cockburn, Agile Methods
  - methodologies should focus on
    - developing working software to a reasonable standard, and
    - enabling future development
  - overhead determined by project size, team size, working environment, quality of communication, quality of tool set, and application criticality
- beyond Cockburn: need to carefully consider effect of
  - developer expectations for code quality
  - absence of static analysis, starting with typing
  - possible lack of continuity between current, future development team
VI. Importance of automated analysis: two views
- classic view: analysis :: logic ⇔ ECC :: communication
  - types, declarations create redundancy in logic
  - redundancy important for automated, static validity checking
  - exceptions exist (e.g., ML), but Python isn't one
  - static checking finds errors
    - simple (but common) errors, like misspellings, misordered parameters, use-before-def errors, incomplete revisions
    - deeper errors
      - "[Compiling] supplies type errors, which in many cases reflect deeper oversights." [Meyer]
- lightweight view: lack of typing, etc. as freedom
  - August 2003 C++ Users Journal article on Boost.Python
    - typing a hindrance to rapid code development
    - Boost.Python useful for avoiding overhead of type checking
  - initial reception to work described in this talk 😐
VII. Critique of static analysis as overhead
- if your development tools don't check your code, how do you manage error?
  1. ask clients to accept error, permanently
  2. hope you can get things right, immediately
  3. hope someone gets things right, eventually
  4. go heavyweight: improve quality of checking by hand, with
     - test cases, and
     - dynamically evaluated assertions
⇒ consider each in turn.
VIII. #1. Quality too important to slough off
- reasons for not tolerating error: a personal view
  - ethical concerns
    - welfare of student participants
    - welfare of clients
    - (cf.: ACM/IEEE Code of Ethics; silver rule)
  - personal concerns
    - reputation
    - personal standards
  - practical concerns
    - wasted time (using buggy code to debug bugs)
    - lessons of history: American auto industry, mid-1970s
IX. #2. Errors are too easy to make, and miss
- reasons for distrusting developers: cynical view
  - Goodkind, Wizard's First Rule
  - Ellison's two most common substances
- reasons, cont.: humane view
  - coding under adversity
    - out of sorts? distracted? listening to music? confused? hurried?
  - revision dilemma
    - revisions, or no revisions: both create problems
  - optimism dilemma
    - self-confidence required for success as a programmer
    - positive attitude runs counter to ability to critique code
    - "And I am always surprised (even though by now I should know better) when the violated assertion turns out to be one that I had added for goodness' sake, so convinced was I that it could never fail." [Meyer]
⇒ been there, done that, too often
X. #3. "We'll find it eventually" as source of risk
- reasons for distrusting "feature first, quality later":
  - perception: point of view assumes one of three givens:
    - "eventually" comes quickly, because code is small enough
    - the mañana assumption:
      - important errors will emerge, over time
      - someone will be there to fix them
  - concern: all assumptions problematic
    - without static checking, just how small is small enough? 200 lines? 300 lines?
    - mañana risky for
      - walkaway projects like Model-T (I'm off duty May '04)
      - any project done solo, or in understaffed organization
      - any project where critical developers aren't immortal 😐
XI. Rethinking lightweight methods, concluded
- reasons for rejecting strategies 1, 2, 3
  - shoot-from-hip quality control unsuitable for Model-T
  - novice language difficulties and SAX issues, but
  - lack of static checking judged primary risk, relative to concerns about quality, project size
- needed: new development strategy
  - idea: keep Python, but restore confidence by restoring what compiler brings to development
  - what remains: embrace care (strategy 4)
    - strategy 4.1: introduce all-points checking, using
      - carefully crafted gray box test suites, with
      - simpler test tools than Python library provides
    - strategy 4.2: introduce type checking, using
      - hand-coded assertions at key points in methods
      - supporting library that strengthens built-in support for typing
XII. #4.1 Achieving compiler-strength checks with gray box testing
- ideal: all methods / all program points / all effects testing
  - nothing less yields compiler-like coverage
- strategy: do the hard work of coding test cases ☹, but as simply as we can 😐
- key tricks:
  - define constructors that optionally init all private attrs
    - benefit: simplifies specifying assertions for test results
    - designing usable gray box constructors:
      - move private attrs to right, or treat as named params
      - supply intelligent default values
  - use declarative testing tool, for concision
    ⇒ developed one, having found none for Python
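The gray box constructor trick can be sketched as follows; SymbolTable and its attributes are hypothetical stand-ins, not the talk's actual compiler classes:

```python
class SymbolTable:
    """Gray box constructor: every private attr can optionally be
    set from the constructor, so tests can build objects in known
    states and state expected results as constructor calls."""
    def __init__(self, entries=None, _scope_depth=0, _parent=None):
        # private attrs appear to the right, as named params with
        # intelligent defaults
        self._entries = dict(entries or {})
        self._scope_depth = _scope_depth
        self._parent = _parent

    def define(self, name, info):
        self._entries[name] = info

    def __eq__(self, other):
        return (isinstance(other, SymbolTable)
                and self._entries == other._entries
                and self._scope_depth == other._scope_depth)

# A test can now express its expected result declaratively:
table = SymbolTable()
table.define('x', 'int')
assert table == SymbolTable(entries={'x': 'int'})
```

The payoff is the final assertion: without the optional private-attr parameters, the test would have to poke at `_entries` directly to state the expected post-state.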
XIII. ADEPT (A Data-driven, Eval-based Program Tester)
- supports tuple-driven unit testing w. multi-level test stack, with support for test logging
  - idea: more concise than class-based testing (cf. unittest doc)
- package description
  - ADEPT proper: one .py file (adept.py)
  - support: summary doc, user manual, validation suite (in ADEPT)
- supported languages
  - Python: from adept import *
  - C/C++: using Boost.Python (up to void *, which Boost doesn't support)
- supported test types
  - Get, Set, Get+Set, Null, Erroneous
  - properties tests (Get, Set, Erroneous Get, Erroneous Set)
- test predicates
  - provided with package: eqValue, eqRepr, eqContents, containsRE, lacksRE, isInstanceOf, evalsTo
  - supports multi-level and/or predicate trees for complex requirements
- extensibility
  - designed for user-defined predicate types, tests
- http://csciwww.etsu.edu/phil (check freeware link)
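The flavor of tuple-driven, declarative testing can be sketched in a few lines; this is a minimal illustration in the spirit of ADEPT, not its actual API (the predicate names below are stand-ins for predicates like eqValue and containsRE):

```python
import re

def eq_value(expected):
    """Predicate factory: does the actual result equal expected?"""
    return lambda actual: actual == expected

def contains_re(pattern):
    """Predicate factory: does str(result) match the regex?"""
    return lambda actual: re.search(pattern, str(actual)) is not None

def run_tests(cases):
    """Each case is a (label, thunk, predicate) tuple; returns the
    labels of failing cases. An exception counts as a failure."""
    failures = []
    for label, thunk, predicate in cases:
        try:
            if not predicate(thunk()):
                failures.append(label)
        except Exception:
            failures.append(label)
    return failures

cases = [
    ('upper', lambda: 'abc'.upper(), eq_value('ABC')),
    ('regex', lambda: [1, 2, 3], contains_re(r'2')),
]
assert run_tests(cases) == []
```

The point of the style: a test is data (a tuple), so adding the hundredth case costs one line, not one method of a TestCase subclass.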
XIV. #4.2 Adding compiler-strength type checks
- extend Python typing predicates for subtyping: e.g.,
  - int is supertype of posInt ⇒ 3 ∈ posInt
  - int is supertype of posInt ⇏ -3 ∈ posInt
  ⇒ use success of posInt(3), failure of posInt(-3) to infer difference
- derive self-typing subclasses of int, str, tuple, list, dict: e.g.,
  - isOfType( HomgenList( int, [1, 2, 3] )) == true
  - isOfType( HomgenList( int, [1, 'a', 3] )) == false
  - isOfType( HomgenList( (int, str), [1, 2, 3] )) == true
- support typing with partially instantiated types: e.g.,
  - intListType = AsType( (HomgenList, int), "list of integer" )
  - isOfType( intListType, [1, 2, 3] ) == true
  - isOfType( intListType, [1, 'a', 3] ) == false
- insert type checks into code in two places:
  - on entrance to method calls
  - on return from method calls that return complex results
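The "infer subtype membership from constructor success or failure" idea can be sketched as below; PosInt and is_of_type are illustrative stand-ins, not PyRite's actual classes or signatures:

```python
class PosInt(int):
    """Self-typing subclass of int: construction fails for values
    outside the subtype, so the constructor doubles as a membership
    test."""
    def __new__(cls, value):
        if int(value) <= 0:
            raise ValueError('PosInt requires a positive value')
        return super(PosInt, cls).__new__(cls, value)

def is_of_type(t, v):
    """Is v a member of subtype t? Inferred from whether t's
    constructor accepts v."""
    try:
        t(v)
        return True
    except (ValueError, TypeError):
        return False

assert is_of_type(PosInt, 3)        # 3 is in posInt
assert not is_of_type(PosInt, -3)   # -3 is not, though -3 is an int

def scale(n, factor):
    # entrance check, as per strategy 4.2
    assert is_of_type(PosInt, n), 'n must be a positive int'
    return n * factor
```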
XV. PyRite type library (key features)
- predicates for subtype checking
  - isOfType(t, v): is item v an instance of a subtype of t (including t)?
- classes for parameter-based subtyping of Py built-ins
  - IntSubrangeValue, StrSubrangeValue; HomgenTuple, HetgenTuple:
    - constructed by friend functions, w. attrs that define relevant constraints
  - HomgenList, HetgenList:
    - constructors accept constraints for indices, values, index/value pairs
  - HomgenDict, HetgenDict:
    - constructors accept constraints for keys, values, key/value pairs
- classes for parameter-based subtyping of other classes:
  - HomgenSet, HetgenSet:
    - constructors accept constraints for keys, values, key/value pairs
    - set code leftover from v2.2, which had no built-in set classes
  - ManyOneHomgenDict, ManyOneHetgenDict:
    - many-one dict: dict that supports aliasing among keys
    - constructors accept constraints for keys, values, key/value pairs
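A homogeneous-list check of the HomgenList kind, paired with a partially instantiated type in the AsType style, might look like the sketch below; the real library's constructors and constraint parameters are richer than this stand-in:

```python
class HomgenList(list):
    """Illustrative homogeneous list: construction fails unless
    every item is an instance of item_type."""
    def __init__(self, item_type, items):
        for i, item in enumerate(items):
            if not isinstance(item, item_type):
                raise TypeError(
                    'item %d (%r) is not %s' % (i, item, item_type))
        super(HomgenList, self).__init__(items)
        self.item_type = item_type   # constraint kept for later checks

def is_of_type(partial_type, v):
    """partial_type is (class, constraint): a partially instantiated
    type. Membership is inferred from constructor success."""
    cls, constraint = partial_type
    try:
        cls(constraint, v)
        return True
    except TypeError:
        return False

int_list_type = (HomgenList, int)   # cf. AsType((HomgenList, int), ...)
assert is_of_type(int_list_type, [1, 2, 3])
assert not is_of_type(int_list_type, [1, 'a', 3])
```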
XVI. In the final analysis: how the strategy worked
- cost
  - compiler on hold, late April to early July (ADEPT / PyRite)
  - test cases difficult to generate, even with tools
    - simple cases (e.g., __eq__, __ne__) mind-numbingly dull
    - complex cases (e.g., symbol table classes) mind-bending
    - peaked at 12,000+ before mid-November revision (see below)
- benefit
  - full-featured phase 1 compiler done by early Nov.
    - allows out-of-order type definitions
    - thorough anomaly checking, including unknown mySQL types; missing and circular defns; dependencies on bad defns
    - compiles through errors: compiles all types flagged as sound
    - supports retargetable back end
  - test cases vital for two major post-Nov. overhauls
    - goal: use metaprogramming to eliminate repetitious code
    - outcome:
      - 6,000 SLOC (exclusive of ADEPT, PyRite) ⇒ 3,000 lines
      - 9,000 test cases (exclusive of ADEPT, PyRite) ⇒ 6,000 test cases
XVII. Python: I'd use it again, gladly
- negatives of losing static analysis are real
  - "Why would anyone want to use an untyped or dynamically typed language? '[W]e'll develop faster that way' makes no sense to me..." [Meyer]
- but static analysis isn't complete, anyway
  - all-points testing important for QC, regardless of language
  - typing doesn't catch everything
    - testing type assertions a significant, but small, part of testing
    - __eq__ / __ne__ tests, __repr__ tests, etc., don't go away
    - complex tests don't go away
- and (Python) interpretation has its pluses
  - fast feedback a major plus
  - metaprogramming wonderful for trimming duplicate code
    - subject-oriented metaprogramming improves code quality
      - upper-layer classes create expected features in lower classes at load time
      - example: codegen layer adds mySQL codegen methods to AST classes
      - benefits: simplifies design, testing of lower-layer logic
  - you don't need to fight with the JVM ☺
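The layered metaprogramming idea above can be sketched in miniature; TypeDefNode and to_sql are hypothetical names for illustration, not the talk's actual compiler code:

```python
class TypeDefNode:
    """Lower-layer AST class: structure only, no code generation."""
    def __init__(self, name, columns):
        self.name = name
        self.columns = columns   # list of (column_name, sql_type)

# --- codegen layer, in a separate module loaded later ---
def _typedef_to_sql(self):
    """Generate a create table statement from the node's structure."""
    cols = ', '.join('%s %s' % (n, t) for n, t in self.columns)
    return 'create table %s (%s);' % (self.name, cols)

# At load time, the codegen layer attaches its method to the AST
# class; the AST module itself stays free of codegen logic, so each
# layer can be designed and tested separately.
TypeDefNode.to_sql = _typedef_to_sql

node = TypeDefNode('users', [('id', 'int'), ('name', 'varchar(80)')])
assert node.to_sql() == 'create table users (id int, name varchar(80));'
```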
XVIII. ...but it would be good to have:
- solid documentation on how to metaprogram in Python (what happens to on-the-fly class and function creation now that new is deprecated?)
- a tool for semi-automated all-points test case generation, driven by static analysis of Python code (even if imperfect!)
- a standard, declarative-style test suite driver (improved ADEPT?)
- standard, self-typing versions of standard classes (improved PyRite?)
- a decent library subtyping predicate
  - (assumption: derived classes defined from base classes via narrowing)
- object.__ne__(self, *args, **dict) ≜ not object.__eq__(self, *args, **dict)
  - (might affect tetralemma-based theorem provers ☺ -- but asymmetry would simplify careful testing)
- int.__init__(self, *args, **dict) [and similarly for all immutables]
  - (would simplify dynamic instantiation of subtypes for immutable types, by supporting creation of curried constructors that capture dynamic constraints, via additional constructor params)
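The __ne__-from-__eq__ default requested above is easy to approximate with a mixin; this is a sketch of the workaround (Python of that era did not derive __ne__ from __eq__ automatically), not a library feature:

```python
class NeFromEqMixin(object):
    """Derive __ne__ from __eq__, so classes need only define one."""
    def __ne__(self, other):
        return not self.__eq__(other)

class Point(NeFromEqMixin):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return (isinstance(other, Point)
                and (self.x, self.y) == (other.x, other.y))

# __ne__ now agrees with __eq__ for free
assert Point(1, 2) == Point(1, 2)
assert not (Point(1, 2) != Point(1, 2))
assert Point(1, 2) != Point(3, 4)
```

Without this, a class that defines __eq__ but forgets __ne__ can report a == b and a != b simultaneously, which is exactly the kind of asymmetry that complicates careful testing.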
XIX. Afterword: thanks to
- the organizers of PyCon 2004
- you who are here (for hanging around for the last talk on a Thursday afternoon)
- Alex Martelli (for Python Cookbook, Python in a Nutshell, and polite answers to early, stupid questions about Python)
- Smitha Chennu (for help with testing Model-T under mySQL)
- Dr. Stephen Scott / Dr. Al Geist of ORNL (for being patient while I worked all this out)
- my wife, Linda (for being really patient while I worked this out)
XX. Selected References
- Cockburn, Alistair, Agile Methods
- Halberstam, David, The Reckoning (Ford vs. Nissan in the '70s: a parable for contemporary American software development)
- Harrison, Wm., and Ossher, Harold, "Subject-Oriented Programming (A Critique of Pure Objects)," OOPSLA '93
- Meyer, Bertrand, "Practice to Perfect: The Quality First Model," IEEE Computer, May 1997
- Yourdon, Edward, Decline and Fall of the American Programmer