(compiled by Ka-Ping Yee)

Compilers and Code Analysis
===========================

Jeremy: possible future goal: build Python in Python

currently three compilers are distributed:

- CPython:
    - parser via pgen
    - compiler in C produces Python VM bytecode
- JPython:
    - JavaCC parser (perhaps should/will be ANTLR?)
    - interpreter compiles to the JVM
    - jpythonc has a separate compiler

other compilation projects:

- Python2C (Bill Tutt, Greg Stein)
    - keep most of Python's flexibility; avoid breaking code
    - handles most of Python (incl. lambdas, classes)
    - translated pystone is about 30% faster
    - current Python2C is not very good with OO programs because method
      dispatch isn't optimized yet
    - optimizes the simple case of an integer for-loop
    - future work:
        - use a limited constant-propagation technique to reduce some
          of the overhead of integer arithmetic overflow checking
        - try to reduce the cost of calling an instance method
        - try to avoid PyArg_Parse overhead
            - one Python function -> one C function (extension-method
              style, so it uses PyObjects and PyArg_Parse a lot)
        - try to use optional static type information
    - would it be easier to do for C++?  probably.
    - Guido: sounds like PyFront.  how does the quality of the C code
      compare?
        - global function calls just call the C function directly, so
          they are faster (but still have to do PyArg_ParseTuple)
        - turns the parse tree into a more useful AST
        - the C code isn't that bad; the hairiest part is dynamically
          binding default arguments for keyword-argument support (need
          to make a callable object containing pointers to the default
          args)
        - exception handling is also pretty bizarre
        - when executed on itself (100k) it becomes 90k lines of C code
    - maybe optional type info + type inferencing could turn a Python
      class into a C struct or a C++ class?
    - no data-flow analysis at all; just a straight translation right now
    - ESR: the LISP people have been working on this for a long time
        - ref. Dylan: reduce LISP to something which can be compiled
          (e.g. "sealing" in Dylan is like declaring classes final)
    - Guido: this is an important example; glad it has been carried far
      enough to provide hard data
- Paths (Jonathan Riehl)
    - GRAD -> GRAD/Paths -> Basil -> PyFront
    - GRAD: Grammar-based Rapid Application Development
        - non-intrusive way of binding to other languages
        - acquire other languages' grammars
        - have the parser emit an intermediate representation that can
          be translated into Python bindings (like SWIG with no .i file)
    - created coverage/testing tools to instrument Python -> Paths
        - generate data-flow graphs with data from GRAD; examine code
          slices to get full test coverage
    - realized that GRAD wasn't just about applications; more about a
      philosophy of "language-transcendent development": mix and match
      front-ends and back-ends
    - Basil: my version of a space potato (vapour with a good name)
    - PyFront: take the data-flow graph from Python, generate C
        - source -> ast -> sdg (control-flow graph of Python bytecodes)
          -> vdg (value dependency graph)
        - went sdg -> C code instead of vdg -> C
        - see the paper in the LANL repository
        - modest improvement: control flow is in C, but still calling
          the Python API and manipulating PyObjects
        - lots of C code, quite redundant; would be better if it were
          generated from the vdg
    - language-transcendent programming
        - ye olde programming map
        - front-end development environment: create a language grammar
            - integrate with Bison/YACC: LALR(1); pgen: LL(1);
              ANTLR: LL(k); Spark
            - what's the intermediate format?
              XML
        - back-end development environment: metamodel, semantic bindings
            - walks the abstract syntax tree, generates a linear stream
              of data
            - can then generate: object models (like GRAD), test
              vectors, optimized code, unoptimized code, fries, etc.
    - "this is all theoretical... unlike the vapourware over there"
    - "keep the impossible from being too daunting"
    - Martijn: what is this for?  i just don't get it
- Rattlesnake (Skip Montanaro)
    - experimental optimizer
    - tools for manipulating bytecode blocks
    - modify the VM to use 3-address opcodes (register-based); see the
      stack-vs-register sketch at the end of these notes
- Viper (Max Skaller)
    - Python compiler & interpreter in OCaml
    - emphasis on optimization
    - semantics are up for grabs; highly experimental

new compiler requirements:

- the types-sig has a serious implementation challenge
    - changes to the language grammar
    - needs to implement a type checker (some checking at runtime)
- optimization efforts
    - e.g. faster global name lookup
- generate native code (target the JVM?)
- want a warnings framework (optional lint-like checking)

do the interpreter in Python for experimentation too?

- Guido: more effort than building a compiler
- ESR: possibly enrich the bytecode to make it more amenable to
  compilation to native code

Guido, Greg, Ping, etc.: use abstract syntax trees as the intermediate
level

- actually, bytecode does partition easily into basic blocks; but
  indeed, ASTs are better for global analysis

Jeremy: stuff like peephole optimization is best done after generating
bytecode

Guido: stuff like register assignment, CSE, and choosing where a local
variable should live is best done before bytecode

ESR: suggest it's good to go source -> ast -> bytecode -> native rather
than source -> ast -> native; the ast -> bytecode step can then do
platform-independent optimization

strawman approach:

- keep the current parser approach
    - JPython may move to ANTLR
- develop an AST to insulate the compiler from changes to the grammar
    - Guido: would be nice to have a standard Python AST
- common compiler framework for all Python distributions
    - appropriate intermediate representation?  bytecode like now?

John Aycock:

1. have tools that may be relevant
    - the package "spark" supports the development of languages
    - can be applied to Python
    - can handle a grammar as big as Python's; a little slow
2. starting a project
    - type-inferencing work showed that it's very hard
    - seems even more painful to add runtime checks on top
    - much easier to do type inference dynamically in the interpreter
    - a bit like Self, although Self was more aimed at optimizing
      dynamically dispatched method calls & attempting to inline

ESR: have a hard time seeing how optional type declarations can handle
the hard cases

Moshe: want to optimize the common case; may be able to optimize 90% of
the code even though the other 10% runs just like today

Guido: tools may help speed freaks, who can check out the bytecode and
add type information appropriately

Guido: like John, would like to aim for a point far off (programmers
would love not having to express type declarations)

Greg: a point in favour of optional type declarations:
readability/documentation

Ping: doesn't Guido want it for the warnings rather than the speed?

Guido: i want it all!

----

Guido: what's the goal?

Jeremy: go to a SIG?  what's the real usage?

Jon: JIT compilation of JVM bytecode can run 2-3 orders of magnitude
faster than CPython; JPython is currently about 5-10x slower, so the
net result would still be faster

ESR: we're all working way too hard; no strategy yet sounds better than
translating to Scheme and letting the Scheme compiler do the work

Guido: more interested in the front-end and program-analysis stuff than
in just speed; recently experimented with a parser for Python that has
very different properties

- the existing parser starts at the top and parses as quickly as it can
  to give you a parse tree to work with
- imagine modifying a function in IDLE: the parser would incrementally
  parse a line, tell you whether there was an obvious grammar error in
  the line, and highlight the error
- need a parser that's incremental; also need a parser that is
  resilient in the face of incidental typos; e.g. if function #2 of 4
  has a minor grammar error, that should not stop the parser from
  recovering and parsing the rest of the functions
- want to go in phases, looking at the code the way a human reads it
    - first look at the indentation structure and block out the code
      (begin with tokenizing to separate out docstrings)
    - do parenthesis matching to catch newlines that continue
      statements
        - can recover by noticing statement keywords inside parentheses
    - then look at logical lines (a toy logical-line splitter appears
      at the end of these notes)
    - can gradually build the parse tree until all the expressions are
      parsed
        - look at each line: it starts with a keyword, an assignment,
          or an expression
- this method of parsing makes it easier to parse incrementally
    - most of the time you only have to parse one or two lines
- would be nice to have a standard form of parse tree (tuple of tuples?
  instances of node classes?  nodes of all the same class with an
  attribute specifying the parse-node type?  see the node-class sketch
  at the end of these notes)

Jeremy:
- possible first goal: determine why version-0 ASTs are not sufficient
- possible second goal: write a simple compiler

Guido: if your only goal is to generate bytecode, you may throw away
too much information

ESR: want to look seriously into Scheme translation

Jeremy: try mzScheme at Rice -- it already has a class representation

Ken: in older LISPs you inserted directives into the code to help

ESR: by the mid-80s you stopped having to do this because data-flow
analysis got good enough

Jeremy: could we use Jonathan's code to do analysis?

Jon: generating something like Scheme would be low-hanging fruit, since
the representation is pretty functional

Jeremy: we should keep an eye on work in OCaml and Moby
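
----

to make Rattlesnake's register idea concrete: a quick way to see the
stack code CPython emits today, with a hypothetical 3-address encoding
in the comments.  the register opcode name (ADD) is invented for
illustration; it is not part of any shipped VM.

    import dis

    def f(a, b):
        c = a + b
        return c

    # CPython's stack machine for "c = a + b" (the add opcode is
    # BINARY_ADD on older interpreters, BINARY_OP on recent ones):
    #     LOAD_FAST   a
    #     LOAD_FAST   b
    #     BINARY_ADD
    #     STORE_FAST  c
    dis.dis(f)

    # a hypothetical 3-address, register-based form of the same
    # statement, in the spirit of Rattlesnake:
    #     ADD  c, a, b    # result and both operands named in one opcode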
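
a toy version of the parenthesis-matching phase Guido describes: group
physical lines into logical lines by counting brackets.  this is only a
sketch of the idea; a real implementation would tokenize first so that
brackets inside strings and comments don't fool it, and would do the
keyword-based error recovery mentioned above.

    def logical_lines(source):
        """Group physical lines into logical lines: a newline inside
        (), [], or {} continues the current statement."""
        depth, buf = 0, []
        for line in source.splitlines():
            depth += sum(line.count(c) for c in "([{")
            depth -= sum(line.count(c) for c in ")]}")
            buf.append(line)
            if depth <= 0:
                yield "\n".join(buf)
                depth, buf = 0, []
        if buf:  # unbalanced brackets at EOF: emit what we have
            yield "\n".join(buf)

    src = "x = f(1,\n      2)\nprint(x)"
    print(list(logical_lines(src)))
    # -> ['x = f(1,\n      2)', 'print(x)']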
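
one minimal sketch of the last parse-tree option above (nodes all of
the same class, with an attribute giving the parse-node type).  the
class and attribute names are made up for illustration, not a proposed
standard; the tuple-of-tuples form would trade the walk method below
for plain pattern matching on t[0].

    class Node:
        """A parse-tree node; `type` is a string naming the construct,
        and leaves carry their text in `value`."""
        def __init__(self, type, children=(), value=None):
            self.type = type
            self.children = list(children)
            self.value = value

        def walk(self):
            """Yield this node and all its descendants, depth-first."""
            yield self
            for child in self.children:
                yield from child.walk()

    # the expression "a + b" as such a tree:
    tree = Node('binop', [Node('NAME', value='a'),
                          Node('op', value='+'),
                          Node('NAME', value='b')])
    print([n.type for n in tree.walk()])
    # -> ['binop', 'NAME', 'op', 'NAME']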