Sunday, December 21, 2008

I want my Unicode keyboard

"You see the typographical limitations we're still saddled with? The mock-wounded :( and the actually-wounded :( aren't slated to have distinct code points until Unicode 17."
     -- Jon Zeppieri

Sign o' the times

"Who would have thought a discussion about lambda syntax in JavaScript
would go over 120 posts while a simultaneous thread about class syntax
has had little attention outside a handful of posters?

Would this have been reversed 10 years ago?

Sign of the paradigm shift? Maybe folks want immutable cons cells too?"
     -- Peter Michaux

Friday, December 19, 2008

A small peeve

Whenever I write a function with an external wrapper and an internal recursive function (say, to add an accumulator), I can never come up with a name for the internal function that I find satisfying. This is totally picking nits, but it bugs me for all the recursive calls to be to foo/aux or foo/helper or something like that. Come to think of it, somehow the prime symbol in Haskell and ML feels more natural.
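For example, here's the shape I mean (a throwaway sketch): the wrapper just seeds the accumulator, and every recursive call has to go through the awkwardly named helper.
(define (sum-list lst)
  (define (sum-list/aux lst acc)        ; "/aux"? "/helper"? "loop"? nothing feels right
    (if (null? lst)
        acc
        (sum-list/aux (cdr lst) (+ (car lst) acc))))
  (sum-list/aux lst 0))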

Saturday, December 13, 2008

Hats off to Duff

From an LtU thread:
When I see the phrase "Wiki type systems", I immediately wonder what a typed Wiki could possibly be. (Monadic pages? Contribution combinators? Co-recursive spam?) Of course you mean "Wiki-type systems", systems that are Wiki-like. But I was excited for a minute.
     -- Tom Duff (yes, the Tom Duff)

Thursday, December 11, 2008

Author's summary

I spent a few days visiting my alma mater last week and among the many engaging conversations I had were several with my dear friend Gregg Whitworth about incorporating research into a liberal arts undergraduate education. It's very tricky, but Gregg pointed out several aspects of the field of biology that help. Gregg mentioned that many biology papers include authors' summaries. This just sounds like an all-around great idea, and one that's easily implemented: simply add a brief summary to your publications list on your web site.

In some sense an abstract may be useful, but an author's summary could also be used to put work into broader context. Since it's not subject to the restrictions of the paper itself, it also gives authors more freedom to write whatever they like. And the benefit to students could be enormous: one of the hardest parts of breaking into a research field is understanding the larger conversation each paper participates in.

Thursday, October 30, 2008

Maybe vs. exception vs. failure thunk in Scheme

It's pretty easy to switch back and forth between representing a function with a "maybe" result type as:
  1. returning X or #f;
  2. returning X or raising an exception; or
  3. returning X and taking a failure thunk
In Scheme, sometimes X might include the value #f. In that case I think representation #3 is the best one, since raising an exception is clunky. You can also combine #2 and #3 and take an optional thunk that by default raises an exception.
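Here's representation #3 in miniature (a sketch over association lists), along with the #2/#3 combination where the thunk defaults to raising:
;; lookup : key alist failure-thunk -> value    (representation #3)
(define (lookup key table fail)
  (cond [(assq key table) => cdr]
        [else (fail)]))

;; the #2/#3 combination: default the thunk to raising an exception
(define (lookup/exn key table)
  (lookup key table (lambda () (error 'lookup "no binding for ~a" key))))

;; #f can be a legitimate bound value, so representation #1 would be ambiguous:
(lookup 'x '((x . #f) (y . 2)) (lambda () 'none))   ; => #f, the actual binding
(lookup 'z '((x . #f) (y . 2)) (lambda () 'none))   ; => none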

This is one case where true unions do cause some problems, so maybe I should have tempered my earlier enthusiasm a little. Still, it's not insurmountable, and this cost needs to be weighed against the price of strictly disjoint unions.

Sweet



A Super Mario Bros. clone, implemented in Haskell.

Friday, October 24, 2008

Fun programming exercise

Off and on I work on my implementation of JavaScript in PLT Scheme. Implementing the standard library means translating the spec's low-level, imperative pseudocode (think GW-BASIC) to Scheme. It's a fun and mostly mindless exercise to take code with assignments and goto, translate that into mutually tail-calling basic blocks, eliminate the assignments by parameterization, and then finally polish it into a more idiomatic style. Sort of my version of a Sudoku puzzle or something.
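For a taste of the exercise (a made-up fragment, not actual spec pseudocode), take a loop like i = 0; L: if i >= n goto DONE; call f(i); i = i + 1; goto L; DONE: result is i. The gotos become mutually tail-calling blocks, and then the assignments disappear by parameterizing the blocks over i:
;; step 1: basic blocks as mutually tail-calling thunks over a mutable i
(define (run-blocks n f)
  (define i #f)
  (define (init) (set! i 0) (loop))
  (define (loop) (if (>= i n) (done) (begin (f i) (set! i (+ i 1)) (loop))))
  (define (done) i)
  (init))

;; step 2: eliminate the assignments by passing i as a parameter
(define (run-blocks* n f)
  (define (loop i) (if (>= i n) (done i) (begin (f i) (loop (+ i 1)))))
  (define (done i) i)
  (loop 0))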

Once you've already implemented it...

A common phenomenon in API design is to expose something simply because you've already put in all the effort required to implement it. For example, if your language runtime implements unbounded recursion, you've already gone a long way towards representing continuations as data. In that case, it's not much more work to offer first-class continuations as a data structure in the language. Similarly, languages often optimize their startup time by dumping the heap into a persistent data structure so they can cache the initialization of the standard library; once they've done this, heap-dumping is an easy feature to offer to users.

On a semantic level, first-class continuations and first-class stores are obviously similar kinds of constructs. But I'd never noticed till now how they can both just sort of crop up as natural consequences of incidental needs in language implementations.

Monday, October 20, 2008

Updated javascript.plt

I've just updated my implementation of JavaScript in PLT Scheme. The latest version fixes some of the subtle ways I'd deviated from the spec with respect to variable binding. Back in the day, I was wooed into thinking that JavaScript was so similar to Scheme that I could pretty easily just compile JavaScript variable bindings directly to similar Scheme variable bindings. For the most part, I got variable hoisting (i.e., the scope of var-declarations being hoisted to the top of their nearest enclosing function body) right. But the real subtlety comes in with the with (dynamically add a computed object value to the runtime environment as a new environment frame) and eval constructs.

I almost had with right: as soon as I detect that the programmer is using with, I enter a "stupid" mode, where I synthesize an explicit runtime environment and populate it with whatever's currently in the static environment. Within the body of the with, variable lookup is performed dynamically in this runtime environment. So outside of a with, you get normal, efficient lexical scope; inside a with, you get stupid, slow, dynamic scope. My only mistake there was forgetting to make the runtime environment entries aliases to the existing variables so that mutations persist after the with-block returns.
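The fix is for the dynamic frame to share locations with the compiled bindings rather than copy their values. A sketch of the idea with boxes and an association list (made-up names, not the actual javascript.plt code):
;; Compiled variables become boxes; the with-frame maps names to those same boxes,
;; so assignments made through the frame persist after the with-block returns.
(define x-loc (box 10))                        ; the compiled binding for x
(define with-frame (list (cons 'x x-loc)))     ; an alias, not a copy

(define (frame-set! frame name v)
  (set-box! (cdr (assq name frame)) v))

(frame-set! with-frame 'x 42)   ; an assignment "inside the with"
(unbox x-loc)                   ; => 42, still visible after the with-block returns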

But eval is hard: Scheme's eval doesn't have access to the lexical environment where it is called, whereas JavaScript's does. And JavaScript's eval also inherits the ambient "variable object" (ECMA-262 10.1.3), which means that eval code can create variable bindings in the caller's scope. Getting all of this right means simulating the variable object similarly to the way I simulate the runtime environment, and switching into "stupid" mode whenever I detect a potential application of eval. (The standard allows you to refuse to run eval in situations other than a "direct" call to a variable reference spelled e-v-a-l, so you don't have to treat every single function call as a potential call to eval. It's messy, but it's better than forcing you to implement a dynamic environment everywhere.)

Wednesday, October 15, 2008

Unordered multi-arity binders

Cool: GHC knows that "∀" doesn't care what order you bind its type variables:
{-# LANGUAGE RankNTypes, TypeOperators #-}

f1 :: ∀ a b . a → b → b
f1 x y = y

f2 :: ∀ b a . a → b → b
f2 x y = y

h :: (∀ a b . a → b → b) → c → d → d
h f x y = f x y

a = h f1
b = h f2
Think about what this means for implementing a type-checker: to match one ∀-type against another, you have to find the right permutation of bound type variables. (A naive approach requires searching through all n! permutations!) Presumably the actual implementation incrementally unifies the type variables as it recurs into the bodies of the ∀-types.
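Here's roughly the incremental idea (a sketch over prenex, s-expression types; certainly not how GHC does it): rather than guessing a permutation up front, discover the pairing of bound variables lazily while walking the two bodies.
;; Are two types (forall (a ...) body) alpha-equivalent up to reordering the binders?
(define (forall-equiv? t1 t2)
  (let ([vs1 (cadr t1)] [vs2 (cadr t2)] [mapping '()])
    (define (walk s1 s2)
      (cond [(and (pair? s1) (pair? s2))
             (and (= (length s1) (length s2)) (andmap walk s1 s2))]
            [(and (memq s1 vs1) (memq s2 vs2))
             (cond [(assq s1 mapping) => (lambda (p) (eq? (cdr p) s2))]
                   [(memq s2 (map cdr mapping)) #f]   ; s2 is already spoken for
                   [else (set! mapping (cons (cons s1 s2) mapping)) #t])]
            [else (equal? s1 s2)]))
    (and (= (length vs1) (length vs2))
         (walk (caddr t1) (caddr t2)))))

(forall-equiv? '(forall (a b) (-> a (-> b b)))
               '(forall (b a) (-> a (-> b b))))   ; => #t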

Does anyone know of a paper that describes this aspect of implementing higher-rank polymorphism?

Tuesday, October 14, 2008

Today's JavaScript kookiness

One of the weirdest aspects of JavaScript's eval is that it not only has complete access to the lexical scope of its caller, but it even gets to add variable declarations to its caller's lexical frame. So I can write
(function() {
  eval("var x = 10")
  return x
})()
and get 10. Now, bearing in mind that var declares a binding for the entire containing function, eval combines in a truly spectacular way with lexical bindings. What does the following test function produce?
function test() {
  var x = 'outer'
  return (function() {
    {
      let x = 'inner'
      eval("var x = 'eval'")
    }
    return x
  })()
}
Clearly the let-binding of x seems to be in scope for the eval code. So we might expect the function to return the string 'outer', with the assumption that the eval code simply modified the inner binding. Then again, the eval code is creating a var-binding, so we might expect the function to return the string 'eval', with the assumption that the eval code is declaring a new variable x to be local to the inner function. So what does it really return? undefined.

What?

If I've got this right, what's happening is that the eval code is creating a var-binding for x in its containing function, but its initializer is mutating the current x in scope, which happens to be the let-bound x. So the eval code is actually creating a variable binding two lexical frames out. The evaled var-declaration and its initializer are referring to two completely different bindings of x.

Lest you think this is strictly an unfortunate interaction between the legacy var and eval semantics and the new (post-ES3) let declaration form, there are a couple of other local declaration forms in pure ES3, including catch. Consider:
function test() {
  var x = 'outer'
  return (function() {
    try { throw 'inner' }
    catch(x) { eval("var x = 'eval'") }
    return x
  })()
}

Monday, October 06, 2008

That's not supposed to happen...



When smoke comes out of your laptop, that's...bad, right?

Update: IDE->USB adapters are wonderful things.

Friday, October 03, 2008

Clojure metadata

I watched the video of Rich Hickey's talk about Clojure at the Boston Lisp group this week and was very impressed by his design. One of the features I've been mulling is Clojure's notion of metadata: every value can be associated with extra associative data. That data does not affect the value's behavior under the standard equality predicate, so in some sense it can be seen as "invisible." This is kind of like JavaScript objects' arbitrarily extensible property tables, except for the very important difference that it's immutable. So you can create a value that's more or less equivalent to another (it is distinguishable by identical?, but use of that predicate is discouraged) but with some extra information on the side.

The first use I immediately thought of for this is when you're writing a compiler or interpreter that has AST nodes and you want to save source location information. This information shouldn't affect AST equality. Clojure metadata sounds like it could be a good way to separate that concern.

I think it's confusing, at least for me, that Rich describes metadata as not being "part of the value"--where I come from, a value is the result of evaluating an expression. When you evaluate (with-meta v m) you do get a new value that is associated with metadata m. What Rich is saying is that the metadata is not contained within v. At first when he said it wasn't part of the value, I thought "that sounds brittle." But if I've understood it, a more accurate description might be that values in Clojure may carry with them an extra associative map that doesn't affect their printed representation or their equality.

Thursday, October 02, 2008

Quote of the day

"As goofy as Java is, the JVM is stunning technology."
-- Rich Hickey

Web inching towards actors?

I learned about the new HTML 5 postMessage API recently, and it struck me that we are ever so slowly inching towards actors on the web: now all windows in a browser (including separate tabs and iframes) are communicating sequential processes.

Now, if Alan Kay had had his way, every node in the DOM would be separately addressable (i.e., have its own URI) and would probably itself be an actor.

Monday, September 29, 2008

Representing the cost of control

Every once in a while I play around with this idea for the design of a web language (server-side).

The nice thing about using first-class continuations to implement a web library is that you can write your code in direct style, but a big down-side is that you still have to figure out how to manage the storage of those continuations. But the alternative of essentially representing the control with program data structures, i.e. with explicit representations of the state of the computation, is really painful.

So I've thought of taking a page from Haskell's playbook: what if you separated a web language into two portions? One language would be for writing the "driver" of the program, where you can call send/suspend and computations may be suspended and restarted an arbitrary number of times. The other language would be "pure" (well, let's say full Scheme, but no send/suspend). In other words, replace Haskell's IO monad with the Web monad. As with Haskell, code in the Web monad can call pure code, but not vice versa.

The benefit of this would be that you could still write in direct style, but you've made explicit the portions of the program where control will cost you. Then you could play with different ways of managing that cost--expiration of continuations, back-off for expiration, etc.--by playing with the design of the Web monad, which wouldn't affect the pure language.

JavaScript completion value: pros and cons

In JavaScript the completion value is the result value of a computation, which may result even from evaluating a statement. Simple statements like
{
  2 + 2;
}
have result values (in this case 4), but so do complex statements like
for (var i = 0; i < n; i++)
  f(i);
Notice that if n is less than [edit: or equal to] 0 in this example, there won't be any computation at all, in which case the statement can't really produce any reasonable value.

I've just figured out how to articulate an intuition I've long felt about the completion value. The great thing is that it mitigates some of the awfulness of statements: even imperative code can still produce values! This means you can have some of the syntactic convenience of statements for writing effectful code, while still producing results. But the thing that always made me uncomfortable was how code like the for-loop above screws with the idea of a tail call. Because the for-loop may or may not produce a value, there's a sort of implicit return happening after the loop.

There might be tricks you could pull using something like continuation marks to allow tail position inside of loops, but I think a simpler answer in a JavaScript with proper tail calls would be to say that loops do not preserve tail position. So essentially tail position would include the last statement in a block, the arms of an if statement, and the arms of a conditional expression, and that's pretty much it. Inside a try-block, a loop, or a switch (because of the reliance on break) there simply wouldn't be any tail positions.

Note that a) JavaScript doesn't have tail calls, and b) the only places where you can even observe the completion value of a statement are at a REPL or with eval. But I think JavaScript could have proper tail calls (potentially), and I've always wondered whether the completion value should remain a hack isolated to eval or actually see its way into the design of other features for JavaScript.

Friday, September 26, 2008

Most productive conference trip ever

I can confidently say I've never accomplished so much actual work at a conference before. I actually solved a research problem during downtime while at ICFP this week. As a fledgling PhD student I used to take advantage of my boundless free time to think about problems for hours on end, often working into the night. As time goes by I have less and less uninterrupted free time, so I guess that's making me more efficient at using my spare cycles in very short bursts. I guess Butler Lampson would classify that under speculative execution.

Monday, September 22, 2008

Debugger as library

The great thing about printf-debugging is that it's got about as low a barrier to entry as possible (unless you're using Haskell--don't get me started). You have implicit access to the program's control flow and I/O and the function is essentially always there. When you want to use a debugger, you usually have to search for a good one, download and install it, learn how it works, figure out how to set breakpoints, learn the language of breakpoint conditions, figure out how to inspect locals and evaluate expressions, etc. etc.

What if debugging was just available as a library? What power would you need? I think PLT Scheme is uniquely suited to making this extremely practical. It's easy to implement a programmatic breakpoint: just call/cc and raise an exception with the captured continuation. You immediately have access to a nice language of boolean predicates for determining breakpoint conditions (Scheme), as well as a language to evaluate expressions with access to the locals (also Scheme). But PLT Scheme also provides continuation marks, which allow you to leave little bread-crumbs through the control flow of a program. What I'd really love to do is use this to allow the programmer to annotate interesting points in the control and connect this up to the DrScheme GUI libraries so that the programmer can do something like:
> (define bp (debug <expression>))
> bp
#<breakpoint>
> (show-control bp)
and have a window pop open that displays a stack trace of the suspended program.
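The core trick is tiny. A sketch of the mechanism (made-up names, not the actual PLaneT interface):
;; A breakpoint is just an exception value that carries the captured continuation.
(define-struct breakpoint (resume))

(define (break!)
  (call/cc (lambda (k) (raise (make-breakpoint k)))))

;; debug : thunk -> result or breakpoint
(define (debug thunk)
  (with-handlers ([breakpoint? (lambda (bp) bp)])
    (thunk)))

;; > (define bp (debug (lambda () (+ 1 (break!)))))
;; > (breakpoint? bp)
;; #t
;; > ((breakpoint-resume bp) 41)   ; resume the suspended computation with 41
;; 42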

I'm working on a PLaneT library to make this just about as low-tech and seamless as printf-debugging. You'll just need to throw in one extra line somewhere in the module:
(require (planet dherman/debug))
and this will allow you to throw in breakpoints and debug programs much like you could with a full debugging IDE, only with the directness and intra-linguistic style of printf.

Sunday, September 21, 2008

Units of measure in F#

I'm at ICFP and just watched Andrew Kennedy give a delightful invited talk about the support for units of measure in F#. It was a very impressive example of work that's well-designed all the way through, from theory to practice. In essence, it allows you to say things like
let gravity = 9.8<m/s^2>
and the type system tracks not only the type float but also the unit of measure m/s^2 and prevents unit-mismatch errors in the program.

One of the questions I had was based on my brief experience several years ago with statistical natural language processing, where a lot of the computation we did was in logspace. In a situation like that you would like to keep track of both the ordinary units of measure as well as the fact that everything is in logspace.

Dan Grossman took this a step further and pointed out that you want your arithmetic operations to have different types when you're working in logspace. For example, while the + operator usually has a type like:
+: float<'u> -> float<'u> -> float<'u>
in logspace it should instead have the type:
+: float<log 'u> -> float<log 'v> -> float<log ('u · 'v)>
Now, I don't know how much support there is for "unit constructors," but the F# unit system does allow you to use units with custom abstract data types. According to Andrew, F# also supports operator overloading and ties it in with the unit system. So it might in fact be possible to express the logspace version of the + operator!

This integration between overloading and units could be a nice solution to the issue of different representations of numbers. Libraries like Java's BigInteger are a usability disaster, but we only have one syntax for numerals and operators. Tying a type-and-unit system in with overloading could allow you to use various machine representations (fixnums, flonums, IEEE754r decimal, bignums, exact rationals, ...) by annotating literals and variables, and type reconstruction could help with the operators. I'm not a big fan of unification-based inference, but in this case I wonder if simple structurally recursive reconstruction would be enough in practice.

Update: Ken Shan points out that only the implementation would use +, but you really would want to overload the * operator. Conceptually you're doing a multiplication.

Monday, September 15, 2008

True unions, revisited

I've spoken about true unions (or ad-hoc unions, or non-disjoint unions) sort of abstractly before, but Typed Scheme has 'em, and they're great.

In my recent experiences with Haskell and ML, the amount of unnecessary and distracting injection and projection that goes on starts to bloat code to the point of illegibility. Disjoint unions in ML and Haskell provide two orthogonal pieces of functionality in one feature. Typed Scheme separates these two: struct types give you disjointness, and unions give you, well, unions. It's up to you to guarantee that you don't create unions with overlap, but so far I haven't seen that cause any problems.

Then you can build disjoint unions on top of that, as I have in my new types.plt PLaneT package:
(define-datatype Expr
  [Var ([name : Symbol])]
  [Abs ([var : Symbol] [body : Expr])]
  [App ([rator : Expr] [rand : Expr])])
macro-expands to:
(begin
  (define-struct: Var ([name : Symbol]))
  (define-struct: Abs ([var : Symbol] [body : Expr]))
  (define-struct: App ([rator : Expr] [rand : Expr]))
  (define-type-alias Expr (U Var Abs App)))

Scheme record syntax

I think if I were designing my own Scheme I'd tighten up some of the aspects of PLT's structs into a smaller syntactic footprint. In PLT, when you say
(define-struct thing (foo bar))
it binds thing to an identifier carrying static information about the struct type (which is needed for other macros such as match), struct:thing to a runtime value containing introspective information about the struct type, and make-thing to a constructor for the struct type.

I think I would just bind the single identifier thing for all three purposes: statically, it would carry the static information for match and friends; dynamically, it would be the constructor. (I never liked the imperative-sounding "make-" prefix.) For dynamic introspection, I would probably include a library form where you could say (struct foo) to reflect the struct type information into a dynamic value.

Wednesday, August 27, 2008

The death of namespaces

As promised in my post on ECMAScript Harmony, I want to talk about the problems with the proposed namespaces feature and why I'm glad it's gone.

In ES3, one of the problems for information hiding is that the language's primary datatype is a mutable table mapping transparent strings to values. As a result, sharing an object creates abstraction hazards: anyone can view -- or even modify! -- an object's internals. Namespaces were an attempt to facilitate hidden properties by generalizing objects in a backwards-compatible way.

So objects became mutable tables mapping namespace/string pairs to values. This has precedent in Common Lisp, and it seems natural. But here's where it started going awry: JavaScript has an ill-conceived history of specifying variable scope by way of a specification construct known as the variable object: a fictional JavaScript object that serves as a rib in the lexical environment. So JavaScript variables are conceptualized as being the same thing as properties. (I'm sure I know where this came from: implementors could use the same internal mechanism to lookup properties in an object inheritance chain as for looking up variables in the environment.) So now with namespaces, variable names were not just strings but also these (namespace × string) pairs.

But namespaces were first-class values. There were even forms for qualifying variable references with a dynamically computed namespace! Lexical scope was slipping away from JavaScript as an almost-was. Even more troubling was the concept of "open namespaces." For convenience and backwards-compatibility, there has to be a reasonable default namespace for looking up unqualified variable and field references. The proposed spec allowed multiple namespaces to be given as candidates for defaults, with a particular search order. Now, variable lookup is tough enough in JavaScript. With bizarre dynamic constructs like with, lexical eval, and the global object, there can be two dimensions of lookup: up the environment and up the inheritance chain of an object. Now with default namespaces there was a third dimension of lookup. For implementors, this creates all sorts of efficiency nightmares. For language users, it would likely be a usability nightmare.

There are ways to simplify namespaces; even Common Lisp's approach is simpler than what was originally on the table for ECMAScript. But when I heard that namespaces, which had been a part of the proposed standard since before I got involved, were on the chopping block, I was thrilled. It had never even occurred to me that they might be cut! I proposed a simpler solution: if we just add gensym to the operations on property names, then we can create private, unguessable properties. No property name pairs, no extra dimension of search, just opaque names. In an OO world, this is more likely to look like new Name() than gensym(), but same difference.
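In Scheme terms the whole idea fits in a couple of lines (a sketch of the principle, not the actual ES proposal):
;; An "object" keyed by either strings or gensym'd names.
(define secret (gensym 'counter))   ; unguessable: no one else can spell this name

(define obj
  (list (cons "describe" "a counter object")
        (cons secret 0)))

(assoc "describe" obj)   ; anyone can get at the public property
(assq secret obj)        ; only code that holds secret can find this one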

ECMAScript Harmony

There's been a lot of buzz about recent events in Ecma TC39 since Brendan Eich made his announcement of ECMAScript Harmony resulting from the seminal Oslo meeting last month. Some of the public discussion has been so misleading I won't even link to it. I've been getting a number of questions from friends and colleagues, asking if it's true that we've stopped working on ECMAScript Edition 4. I'm giving them all the same answer: don't believe the hype!

As an invited expert, I try to avoid politics. But from my perspective, the Harmony effort is promising to be a great development for the technical quality of the ECMAScript standard. Both sides of the split in the committee have good technical points to make and I've longed for more cooperation. Now we have much more cooperation, and the language design is improving for it.

As for the future of ECMAScript Edition 4: whatever name gets attached to it, the work on improving and standardizing ECMAScript continues apace. We'd already deferred static types for potential future work well before the Oslo meeting, and as part of the Harmony effort we dropped some of the more questionable aspects of the proposed ES4, most notably namespaces. These were not pure political concessions but in fact good technical decisions. (I'll blog about this separately.) The proposed language is changing all the time, but reports of ECMAScript's death have been grossly exaggerated.

Friday, August 08, 2008

Nirvana

I have achieved hygienic macro enlightenment.

Tuesday, June 24, 2008

Implicit homomorphism

One of the really nice features of macros and their mathematical first cousin notational definitions is that they leave their straightforward, recursive expansion implicit. When we write in math:
A ⇔ B = (A ⇒ B) ∧ (B ⇒ A)
what we really mean is that all occurrences of ⇔ should be recursively expanded within a term. But there's essentially an implicit structural recursion through all propositions in the logic (which is left open "until" all notational definitions have been provided) which expands any occurrences of notational definitions homomorphically through sub-propositions. We don't require anyone to go through the painstaking and virtually information-free process of explicitly marking all points of recursive expansion.

I would love a similar notational style for defining annotations. I often find myself defining an instrumentation function that inserts one or two kinds of extra pieces of information into a tree. Maybe term-rewriting would work pretty well for this. Say we're writing a function that annotates certain λ-expressions with their height. We could write
λ x . e → λ [ |e| ] x . e
and then leave the compatible, reflexive, and transitive closure of this rewriting rule implicit, since they're obvious.
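To make concrete what the notation is sparing us, here's the explicit version over s-expression λ-terms (a sketch): only the λ case does anything interesting; the rest is the dull homomorphic recursion I'd rather leave implicit.
(define (height e)
  (if (pair? e) (+ 1 (apply max (map height e))) 0))

(define (annotate e)
  (cond [(and (pair? e) (eq? (car e) 'lambda))
         (let ([x (cadr e)] [body (caddr e)])
           `(lambda ,(height body) ,x ,(annotate body)))]
        [(pair? e) (map annotate e)]
        [else e]))

(annotate '(lambda x (f (lambda y y))))
;; => (lambda 2 x (f (lambda 0 y y)))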

Then I would really like some way to make this style of definition composable with other definitions, so for example I could define a type-checking-cum-instrumentation algorithm
Γ ⊢ e : τ → e'
where the instrumentation portion (→ e') is mostly left implicit.

Tuesday, June 17, 2008

How to impair a declarative

Declarative programming is a vague term but could loosely be interpreted to mean not imperative programming. It's also about defining a program by breaking it into separate, modular units of partial specification.

Of course, the language's semantics has to have some algorithmic means to combine these declarations into a complete program definition. And if the language admits recursive definitions, the combined program may contain cycles. So the implementation of a declarative language will often have an imperative flavor, combining recursive elements by incrementally updating the set of definitions.

The question is whether it is still possible for the user to understand the semantics without resorting to reasoning about its possibly-imperative implementation. Specifically, the user shouldn't have to worry about time. If the order in which definitions are combined matters, then the semantics becomes imperative. Worse, if the programming interface allows definitions in multiple files, the order of definition might not even be under the user's control--or worse, it might only be controllable through extra-linguistic means like compiler command-line arguments.

Take as an example a language with open recursion, the ability to define a recursive function in multiple, disparate pieces. If an open recursive function has overlapping cases in separate declarations, the language can handle this in one of several ways:
  1. Let rules defined "later" take precedence over rules defined "earlier."
  2. Define a total order on rule specificity, and let more specific rules take precedence over less specific rules.
  3. Disallow overlapping cases statically.
Option #1 reintroduces the notion of time. The resulting semantics is less modular, because it requires understanding not just the individual modules, but subtle aspects of their composition. (If the algorithm for combining definitions is sophisticated--unification, for example--this order may even be a complicated dynamic notion, not just textual order.) The other two options eliminate the imperative flavor, defining time out of the semantics. This frees the programmer to reorder definitions without affecting the program's meaning, and it makes that meaning more modular.

ADTs in JavaScript

Sjoerd Visscher has written a neat post about implementing algebraic data types in JavaScript. There are lots of ways to do this, but this one looks interesting. I don't quite understand it but I thought I'd point it out.

Note the use of expression closures! Good stuff.

Thursday, June 12, 2008

Clumsy existential

This might be a dumb question: I often want to prove something like this:
(∃ v . e ⇓ v) ⇔ (∃ v′ ~ v . [[e]] ⇒ v′)
except that the scope doesn't work out: v is only in scope in the first parenthesized proposition, so it can't be referred to in the second one. Of course, it's generally sufficient to prove co-termination and just leave out the relationship between the final values, since that's usually easily derivable. But it's an important part of the intuition behind correctness, so I want the relationship between answers to be in the statement of the theorem.

An awkward way to make this work is to write big clunky conjunctions like:
(e⇓ ⇔ [[e]]⇓) ∧ (∀ v . e ⇓ v . ∃ v′ ~ v . [[e]] ⇒ v′)
But it would be nice if there were some way to write it more compactly like the first pseudo-proposition I wrote. I like that it reads almost like you'd say in English: "e and [[e]] always get the same answers."

Anyone have any suggestions?

Tuesday, May 20, 2008

The bat-signal operator

Starting now, I propose that every programming language strive to incorporate the bat-signal operator: ~@~

Friday, May 16, 2008

Fail-soft

Automatically repairing errors and continuing execution goes against the grain of software engineering, for good reason, but I've come to see that it's what makes the web go 'round. Robert O'Callahan has an interesting market-oriented take on error recovery:
...the economics of error recovery --- the fact that recovering from malformed input, instead of hard failure, is a competitive advantage for client software so in a competitive market it's sure to develop and you might as well put it in specifications...
In other words: ignore the errors, users will love you, and you win. We may not like it, but you fight economics at your own peril.

Friday, May 09, 2008

N-ary infix predicates

In languages with operator overloading, it's possible to define binary infix predicates such as (using Haskell here):
infix <:
(<:) :: Ty -> Ty -> Bool
x <: y = x `subtype` y -- for some auxiliary subtype function
Arithmetic operators continue to return elements of the same type (does this make their types endofunctors? I'm no category theorist), so as long as you declare an associativity, the n-ary case is easy to parse. Predicates aren't closed over their input type, though, so in Haskell I can't take the above definition and write:
t1 <: t2 <: t3
But in math, we overload binary predicates to the n-ary case by assuming an implicit "and" between each pair. It would just take a little kludge (but one with the full weight of historical precedent) to let the operator overloading and type system conspire to determine that any binary predicate (i.e., with result type Bool) can be extended to the n-ary case using &&.

(You could maybe also extend this to any type Monad m => m a using >>= instead of &&?)
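For what it's worth, Scheme already bakes this convention into its numeric comparisons: (< 1 2 3) means 1 < 2 and 2 < 3. The explicit version of the kludge for an arbitrary binary predicate is just a fold (a sketch; subtype? and the ti are hypothetical):
(define (chain pred? x . rest)
  (or (null? rest)
      (and (pred? x (car rest))
           (apply chain pred? rest))))

(chain subtype? t1 t2 t3)   ; t1 <: t2 and t2 <: t3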

Thursday, May 08, 2008

Solving the wrong problem

I've been having a depressing couple of weeks trying to solve a really hard problem in my research. Occasionally I thought I had some promising leads, but usually they just led to more despair.

Then I discovered today that I've been tackling the entirely wrong problem all along.

I never knew feeling so stupid could feel so good.

Monday, May 05, 2008

Sigh

> (eqv? +NaN.0 +NaN.0)
#t
> (eq? +NaN.0 +NaN.0)
#t
> (= +NaN.0 +NaN.0)
#f
> (equal? +NaN.0 +NaN.0)
#t

A functional visitor pattern

When your data definitions get big enough, it becomes Tedious and Error-Prone (oh my!) to write multiple operations over the same datatype, so it's useful to define a generic traversal. Here's a pattern I used recently for the ECMAScript Edition 4 reference implementation.

Lots of languages have mutually recursive AST definitions. For example, take a language with mutually recursive expressions Ast.expr, statements Ast.stmt, and declarations Ast.decl. Define an interface VISITOR:
signature VISITOR = sig
  datatype cmd = Stop | Skip | Cont

  type ('a, 'z) method = 'a * 'z -> 'z * cmd

  type 'z visitor = { visitExpr : (Ast.expr, 'z) method,
                      visitStmt : (Ast.stmt, 'z) method,
                      visitDecl : (Ast.decl, 'z) method }

  val foldExpr : 'z visitor * Ast.expr * 'z -> 'z
  val foldStmt : 'z visitor * Ast.stmt * 'z -> 'z
  val foldDecl : 'z visitor * Ast.decl * 'z -> 'z
end;
A visitor is a record of operations, one for each variant of AST nodes. (Alternatively, we could have defined a single type datatype ast = Expr of ... | Stmt of ... | Decl of ...;. But there's no need for the extra indirection.) Each operation takes a node and an intermediate result, and produces a new intermediate result along with a "traversal command" cmd.

This simple traversal language instructs the fold either to continue (Cont), to skip all descendants of the current node (Skip), or to abort the traversal entirely (Stop).

The implementation of the fold uses a few auxiliary functions to compute a list of child nodes for each node:
type children = Ast.expr list * Ast.stmt list * Ast.decl list

fun exprChildren (e : Ast.expr) : children = ...
fun stmtChildren (e : Ast.stmt) : children = ...
fun declChildren (e : Ast.decl) : children = ...
The fold itself is (internally) in CPS. At each node, there are three potential continuations: the empty continuation for aborting, the given continuation for skipping the current node's children, or a new continuation that first traverses the current node's children and then uses the given continuation.
type 'z cont = 'z -> 'z

fun cont (result : 'z, cmd : cmd)
         (skipk : 'z cont)
         (contk : 'z cont)
         : 'z =
    case cmd of
        Stop => result
      | Skip => skipk result
      | Cont => contk result

fun foldExprK (v as { visitExpr, ... }, expr, init) k =
    cont (visitExpr (expr, init))
         (fn x => foldAllK v (exprChildren expr) x k)
         k

and foldStmtK (v as { visitStmt, ... }, stmt, init) k =
    cont (visitStmt (stmt, init))
         (fn x => foldAllK v (stmtChildren stmt) x k)
         k

and foldDeclK (v as { visitDecl, ... }, decl, init) k =
    cont (visitDecl (decl, init))
         (fn x => foldAllK v (declChildren decl) x k)
         k

and foldAllK (v : 'z visitor)
             ((exprs, stmts, decls) : children)
             (init : 'z)
             (k : 'z cont)
             : 'z =
    let
        fun foldList f (v, asts, init) k =
            case asts of
                [] => k init
              | [ast] => f (v, ast, init) k
              | ast::asts => f (v, ast, init)
                               (fn x => foldList f (v, asts, x) k)
    in
        foldList foldExprK (v, exprs, init) (fn result =>
        foldList foldStmtK (v, stmts, result) (fn result =>
        foldList foldDeclK (v, decls, result) k))
    end
I originally had foldList defined at the top-level, in the same mutual recursion group with foldExprK et al, and got smacked by SML because that would require polymorphic recursion. Luckily I was able to work around it by placing its definition inside the body of foldAllK, but since I'd never run into SML's lack of polymorphic recursion before, it took quite a while to decipher the type error (with help from #sml).

Finally, the top-level functions seed the CPS functions with an initial continuation:
fun id x = x

fun foldExpr x = foldExprK x id
fun foldStmt x = foldStmtK x id
fun foldDecl x = foldDeclK x id
(Note that because of the value restriction, there's no possibility of refactoring these into point-free val-declarations.)

Aren't auto-curried functions non-expansive?

The value restriction in ML is so limited. If I have an auto-curried function
> fun foo (k : int -> 'a) (x : int) = ...;
it should be the case that partial application of foo is always non-expansive, right? It's really annoying that partial applications of auto-curried functions often can't be given general types.
> fun raisek (x : int) = raise (Fail (Int.toString x));
> val fooRaise = foo raisek;
Warning: type vars not generalized because of value restriction...
I hate when the type system interferes with standard, semantics-preserving transformations.

Wednesday, April 30, 2008

Dynamic languages need modules

A dynamic language starts as an implementation. And that implementation almost always includes some variation on a REPL. Look at Lisp, Scheme, Python, Ruby, JavaScript, Lua, etc. One of the things that makes these languages "dynamic" is that they're attached to some long-running process: an IDE, a user prompt, an embedding application, a web page. So these languages are simply implemented with some global table of definitions that gets updated through the lifetime of the host process. Easy enough to understand from the implementor's perspective, but what happens to the user of the embedded language?

The problem with the "top-level" (that ever-changing table of program definitions) is the creeping specter of dynamic scope. While static vs. dynamic typing may be a debate that civilization takes to its grave, the dust seems to have more or less settled on dynamic scoping. The fact is, even though many decisions a program makes must depend on its context, it's very hard to understand a program if you can't nail down its definitions.

To some degree, if a dynamic language has lambda, you can escape the top-level with a design pattern that simulates a poor man's module. The module pattern is almost always a variation of what Schemers call "left-left-lambda"--the immediate application of a function literal. Bindings inside the lambda are no longer floating in that airy and transient stratosphere of the top-level; they're nailed to the function that is being applied. And you know that function is being applied only once, because once you've created it, it's applied and discarded.
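In Scheme the pattern looks something like this (a sketch):
;; A poor man's module: the "exports" are whatever the immediately applied
;; function literal returns; everything else stays sealed inside the closure.
(define counter-module
  ((lambda ()
     (define count 0)                                     ; private
     (define (next!) (set! count (+ count 1)) count)
     (define (reset!) (set! count 0))
     (list (cons 'next! next!) (cons 'reset! reset!)))))  ; exports

((cdr (assq 'next! counter-module)))   ; => 1
((cdr (assq 'next! counter-module)))   ; => 2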

This pattern goes a long way, and if you have macros, you can create sugar to give the pattern linguistic status. But a module system it ain't.

Linking: Nothing in this pattern deals with the relationships between modules. There's no way to declare what a module's imports and exports are. In fact, if you want a module to communicate with any other modules, the top-level's poison tends to seep back in. To export a value, a lambda-module can mutate some global variable, or it can return a value--but where does the caller save the value? You can always nest these solutions within more lambda-modules, but ultimately there's a problem of infinite regress: in the end you have to have at least one special top-most module surrounding your program.

Separate development: And that's only if you have control over the whole program. If you want to create a library and share it, there needs to be some shared common framework in which people and organizations can share code without stomping on each other's invariants or polluting each other's global environments. To be sure, a built-in module system doesn't eliminate all such issues (you still need conventions for naming and registering modules within a common framework), but modules help to standardize on these issues, and they can provide helpful errors when modules step on each other's toes, rather than silently overwriting one another.

Loading: There's not much flexibility in the loading of an immediately applied function. If your language involves multiple stages of loading, the implementation may be able to be smarter about loading and linking multiple modules at once.

Scoping conveniences: Lexical scope is a widget in the programmer's UI toolkit, and for different scenarios, there are different appropriate designs. The tree shape of expressions makes the lexical scoping rule ("inner trumps outer") appropriate; it favors the local over the global. But, ignoring nested modules for the moment, modules aren't tree shaped; they're more like a global table. In a sense, all modules are peers. So when you import the same name from two different modules, which one should win? You could say that whichever you import later wins, but this is much more subtle than the obvious nesting structure of ordinary variable bindings. I find it's more helpful for the module system to give me an error if I import the same name from different sources (unless it's a diamond import). Other useful facilities are selective import, import with renaming, or import with common prefixes. These are subtle usability designs where modules differ from lambda.

Extensibility: In PLT Scheme, we've used the module system as the point for language design and extension. By allowing modules to be parameterized over their "language", we have a natural way for introducing modalities into PLT Scheme. As languages grow, these modalities are an inevitability (cf. "use strict" in Perl and ECMAScript Edition 4). Buried within pragmas or nested expressions, this makes the design of the language much harder. But within a module, bindings are sacrosanct and interactions with other modules are limited to imports and exports. This significantly cuts down on the space of possible interactions and interferences between the language's modalities. As an example, Sam Tobin-Hochstadt has made good use of this for the design of Typed Scheme, a statically typed modality of PLT Scheme that can still interact reliably with dynamically typed modules.

The unrestricted mutation of the top-level environment is a good thing for many purposes: interactive development, self-adjusting execution environments, etc. But it's terrible for nailing down program definitions. Modules are a way of circumscribing a portion of code and declaring it "finished". It can still be useful to have an environment where modules can be dynamically loaded and possibly even replaced, but it's critical for the language to provide the programmer with basic invariants about the definitions within a module.

All of this is stuff I wish I'd had a clearer head about earlier in the ES4 process. But I hope that, down the road, we'll consider a module system for ECMAScript.

Monday, April 28, 2008

Literate code

I used to find the notion of literate code attractive, but right now my dissertation prototype implementation is undergoing some serious redevelopment and the horribly out-of-date docs are dragging me down. Now I find myself wishing I'd kept them separate.

Tuesday, April 22, 2008

How to spell StopIteration

Some of the pseudo-constants in JavaScript are actually mutable global variables: undefined is a notorious example. Luckily, there's generally some reliable way to generate such a value without having to hope that no one has reassigned a global variable.

For undefined, it's the void operator, which, given any argument, produces the value that undefined is initially bound to. Conventionally, people write void(0).

In JavaScript 1.7 and later, there's a special constant called StopIteration that's thrown after an iterator runs out of values. This too is a mutable global. But this value is a little trickier to get at reliably. Here's the simplest expression I could come up with that produces the StopIteration value:
(function(){try{(function(){true||(yield)})().next()}catch(e){return e}})()
The inner function is a generator by virtue of textually containing the yield keyword, but it returns immediately, yielding no values, because of the short-circuited logical or. (Note that because of yield's finicky grammar, it has to be parenthesized.) So by calling next() on it, it immediately raises StopIteration, which the outer function catches and returns.

Sunday, April 20, 2008

Compilation, obfuscation, encryption

It's interesting that we tend to use compilation as a proxy for encryption. It seems to me it would help clarify the design of a language's semantics and architecture if you separated the concerns of code ownership from those of performance. Compilation is primarily used to cache the transformation of one semantics into another--a partial evaluation of an interpreter. Using this as a weak encryption mechanism is pretty silly. If you want to encrypt program source, why not use real encryption? That way you can use whatever representation of the code you want: source, byte-code, microcode, whatever.

Friday, April 11, 2008

A poem for the weekend

Listen to Mustn'ts, child, listen to the Don'ts.
Listen to the Shouldn'ts, the Impossibles, the Won'ts.
Listen to the Never Haves, then listen close to me.
Anything can happen, child, Anything can be.

--Shel Silverstein

Tuesday, April 08, 2008

Different perspectives on parameterized types

To the PL theorist, parameterized types are a necessity for the expressiveness of a statically typed language. Imagine Hindley-Milner without them: you'd have to duplicate definitions all over the place. But in practice, statically typed languages have loopholes--long before generics, Java always had the type Object so that you could essentially "turn off" type-checking when you couldn't type things precisely. This meant that you could always write generic operations and generic collections, but without type-checking.

So in practice, people don't look at parameterized types as increasing the expressiveness of Java at all; it just looks like a way to increase the amount of type-checking in the language, to make it stricter. And they're right.

So you have two correct perspectives on parameterized types: roughly, that they make the language less restrictive and more restrictive. It's not actually a contradiction, but it's enough to have caused me confusion in some conversations.

Sunday, April 06, 2008

Keep your /tmp clean

For a couple years my laptop's cygwin installation has been dramatically, inexplicably slow to start up the first time after rebooting. I mean, like, 5 to 10 minutes to start up. (You can imagine how much more this makes me love Windows Update for forcing a reboot every time they patch the latest buffer overrun that authorizes Solitaire to initiate a nuclear first strike against North Dakota.)

Well, today I found the culprit. A program I haven't used in quite a long time had left thousands of orphaned temp files in /tmp. Don't ask me why, but this was enough to obliterate the performance of cygwin across the board, and especially on startup. Maybe /tmp is memory-mapped? No idea. Anyway, I wish I'd known to keep my /tmp clean. Now you know.

ESOP in a heartbeat

I presented my latest paper, A Theory of Hygienic Macros, co-authored with Mitch, at ESOP in Budapest this past week. Because I'm a genius, I managed to let my passport expire. Conveniently, there's an expiration notification service at every airport known as "not letting you on the plane."

It's a long story, but it involves emergency passport renewal, an extra weekend in San Francisco, and very merciful conference organizers and attendees.

All of this means I got to spend a total of about 48 hours in lovely Budapest--about long enough to see the Danube and a few of its glorious monuments, experience cheap beer, try delicious Tokaji, present a paper, chat with a few friends, and get right back on a plane again.

What a shame I couldn't have stayed longer. But the talk went great, and I'm still very glad I got to go.

For the record, the San Francisco passport office is staffed exclusively with miracle workers.

Case exhaustiveness for interactivity

In functional programming we're very good at exhaustive case analysis of our data definitions. But I've found I'm pretty bad at reasoning about cases in interactive programs. (Embarrassingly, I recently lost a week's worth of data in an Ajax application I wrote due to a bad response to an unpredicted server failure.)

What are all the possible actions the user could perform? What are all the failure cases? What are all the possible interleavings of interactions? These are harder questions than enumerating variants of a disjoint union.

I think this is where dynamic safety (e.g., runtime tag checking in dynamically typed languages or case-dispatch catchalls in statically typed languages) is so critical: if you don't have a sound model for enumerating the space of possible behaviors, you have to make sure that there's some reasonable recovery programmed into the system for when you encounter an event you didn't predict. This reminds me of recovery-oriented computing and Erlang's "let it fail" philosophy.

Wednesday, April 02, 2008

Monkey-patching evolves

Do you like the ability to dynamically mutate the definition of your program, but keep getting thwarted by sophisticated static analyses such as grep? Your prayers have been answered: ninja-patching is here.

Sunday, March 23, 2008

Makefile for HUnit tests

Here's a nice idiom for running HUnit tests from a Makefile. Jesse Tov gave me the idea that you can use make's dependencies to test only modules that have changed since the last time you ran the tests. Pick a standard name for each module's tests--tests, say--and make sure it's defined in the first column (since we'll be searching for it with a simple grep). Every time we test a module, we'll touch a dummy file of the same name in some hidden directory--call it .test. So we have two lists of files, the modules of the application and the test dummy files:
SOURCES := $(wildcard *.hs) $(wildcard *.lhs)
TESTS := $(foreach src,$(SOURCES),.test/$(src))
To run the tests, first we collect the list of modified modules and save the names of their tests (Foo.tests, Bar.tests, etc.) in a temporary file .test/args.
.test/%hs: %hs
	@if grep '^tests ' $< > /dev/null 2>&1 ; then \
	  touch $@ ; \
	  echo $* | sed -e 's/\..*/.tests/' >> .test/args ; \
	fi
Now every time we run the tests, we first make a fresh .test/args file, and then we run GHC with a command-line option to evaluate an expression that runs those tests:
test: teststart $(TESTS)
	@echo ghc Test -e "\"Test.run [ `xargs -a .test/args | tr ' ' ','` ]\""
	@ghc Test -e "Test.run [ `xargs -a .test/args | tr ' ' ','` ]"

teststart:
	@mkdir -p .test
	@$(RM) -f .test/args
	@touch .test/args
It's also useful to have a target that always runs all the tests:
retest: testclean test

testclean:
	$(RM) -rf .test
This assumes the existence of a simple test harness module:
module Test (run) where
import Test.HUnit
run :: [Test] -> IO ()
run ts = (runTestTT $ TestList ts) >> (return ())

Some tools just don't like each other

brit·tle, adj.
  1. Multi-line string literals in Haskell require so-called "string gaps": one '\' character to terminate a line and another '\' to start the next line.
  2. With GHC, Haskell programs may be preprocessed with CPP, which coincidentally strips the "\ ... \" characters from the source, resulting in an illegal Haskell string literal.
  3. Mercifully, it also happens that CPP doesn't strip the characters if the first '\' character is followed by a space before the newline.
  4. But of course, a commonly used feature of emacs is to silently strip trailing whitespace at the end of lines on every save.
  5. Not that you can see the difference, given the well-known human limitations at visually distinguishing whitespace.
(Sensitive types should take this neither as a criticism of GHC nor as implicit condonement of using CPP for Haskell programs--I know it's shunned. You've gotta admit this is beautiful, though.)

Friday, March 21, 2008

When to use point-free style

I've often struggled with the question of when/whether to use point-free style. It can be dangerously addictive, especially in languages with syntactic support for it like Haskell. But it's notorious for creating dense and impenetrable code. Yesterday I linked to advice from the GHC community on when not to create needless abstractions, advice which could be applied when considering a point-free abstraction (man, does it ever take self-restraint not to go wild with the puns here).

The reason why pointful style can be so helpful is that it allows us to think about the definition of a computation in terms of particular examples. For any number, let's call it x, ... This is also why set comprehensions in math, and consequently list comprehensions in programming languages, are so approachable: they describe a set by appeal to representative examples.

I think the key to successfully using point-free style is when you want to treat a function itself as a single data point. For example, if you're sorting a list with some comparison function, you don't want to have to drop down a level of abstraction to think about the individual pairs of elements being compared; you just want to think about the comparison function (like "numeric" or "reverse alphabetical") as the object of your attention.
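For example (in Scheme rather than Haskell, but it's the same point): passing the comparison by name keeps the comparison function itself as the object of attention, while the pointful version drags in individual elements x and y for no benefit.
(sort '("banana" "Apple" "cherry") string-ci<?)

(sort '("banana" "Apple" "cherry")
      (lambda (x y) (string-ci<? x y)))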

Wednesday, March 19, 2008

Well said

From the GHC wiki:
It's much better to write code that is transparent than to write code that is short.

Often it's better to write out the code longhand than to reuse a generic abstraction (not always, of course). Sometimes it's better to duplicate some similar code than to try to construct an elaborate generalisation with only two instances. Remember: other people have to be able to quickly understand what you've done, and overuse of abstractions just serves to obscure the really tricky stuff.

Friday, March 14, 2008

Not fade away

From the old-languages-never-die dept.:
Obsolete programming languages such as FORTRAN and COBOL are still widely used.

-- J.J. Duby, 1971

Monday, March 10, 2008

Another cute Haskell function

join :: Eq a => [(a,b)] -> [(a,c)] -> [(a,(b,c))]
join a1 a2 = [ (a,(b,c)) | (a,b) <- a1, c <- [ c | (a',c) <- a2, a == a' ] ]

infixr 5 ⋈
(⋈) :: Eq a => [(a,b)] -> [(a,c)] -> [(a,(b,c))]
(⋈) = join

-- for example:
[("a",1),("b",2),("c",3)] ⋈ [("a",'a'),("b",'b'),("c",'c')]

Friday, March 07, 2008

Crazy debugging feature idea

Here's a crazy idea. Closures are like objects with private fields, right? Specifically, outside of the function body, you're not allowed to refer to the variables in the closed environment. But one of the most annoying things that happen when I'm debugging is that I want to call some local function, but it's not available from the REPL:
(define (foo x y)
  (define (helper z)
    ... x ... y ...)
  ...)
What if, at the REPL (i.e., in debug mode only), closures were openable? You could use dot-notation to refer to them, and all you'd have to do is somehow provide the extra bindings needed to fill in their environment. Keyword arguments, maybe?
> ((close foo.helper [#:x 42] [#:y 'blah]) "value of z")
'result-value-ftw!

Wednesday, March 05, 2008

Xor

While I'm at it, let's continue the series of logical Maybe operators with xor:
infixr 2 ^^^

(^^^) :: Maybe a -> Maybe a -> Maybe a
Nothing ^^^ r@(Just _) = r
l@(Just _) ^^^ Nothing = l
_ ^^^ _ = Nothing
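A few sanity checks, assuming the definition above:
Just 1  ^^^ Nothing   -- evaluates to Just 1
Nothing ^^^ Just 2    -- evaluates to Just 2
Just 1  ^^^ Just 2    -- evaluates to Nothing
Nothing ^^^ Nothing   -- evaluates to Nothing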

My current favorite little Haskell function

Here's my pet Haskell combinator du jour:
infixr 2 |||

(|||) :: Maybe a -> Maybe a -> Maybe a
Nothing ||| x = x
x ||| _ = x
This gives you a concise notation for branching in the Maybe monad, so instead of having to write:
case x `lookup` env1 of
  Just v  -> Just $ f1 v
  Nothing -> case x `lookup` env2 of
               Just v  -> Just $ f2 v
               Nothing -> Nothing
you can write:
do { v <- x `lookup` env1; return $ f1 v } |||
do { v <- x `lookup` env2; return $ f2 v }
This gives you the same kind of idiom that Schemers use or for. Come to think of it, I guess you could also write the analog of and:
infixr 3 &&&

(&&&) :: Maybe a -> Maybe a -> Maybe a
Just _ &&& r@(Just _) = r
_ &&& _ = Nothing
I haven't used this one yet, but it seems natural.
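Going back to |||, here's a tiny usage sketch (env1, env2, and lookupEither are made-up names, just for illustration):
env1, env2 :: [(String, Int)]
env1 = [("x", 1)]
env2 = [("y", 2)]

lookupEither :: String -> Maybe Int
lookupEither k =
  do { v <- k `lookup` env1; return (v + 10) } |||
  do { v <- k `lookup` env2; return (v + 20) }

-- lookupEither "x" evaluates to Just 11
-- lookupEither "y" evaluates to Just 22
-- lookupEither "z" evaluates to Nothing
If I remember right, ||| coincides with mplus (and Control.Applicative's <|>) at the Maybe type, so it lines up with what the standard library already provides.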

Tuesday, March 04, 2008

Intentional capture

Procedural hygienic macro systems like the syntax-case system make it possible to write capturing macros--macros which, depending on your philosophy, you might call "non-hygienic." The classic example is the "anaphoric" conditional form if-it, which implicitly binds a variable it to the result of the test expression:
(if-it 42 (+ it 1) #f) ; => 43
The difficulty in getting such a macro right comes when you try to write another macro that expands into if-it. To quote the mzscheme manual's section on macros, "macros that expand into non-hygienic macros rarely work as intended."

Andre van Tonder's SRFI 72 document contains a perfect and concise example, due to Kent Dybvig, of two different ways a macro might expand into a capturing macro. On the one hand, we might want to write when-it, a simple "one-armed" conditional that implicitly binds it in the same way as if-it:
(when-it 42 (+ it 1)) ; => 43
On the other hand, we might want to use if-it to implement the hygienic or macro, which shouldn't capture any variables.
(let ([it 10]) (or #f it)) ; => 10
First, here's the implementation of if-it: we create an identifier for it with the same lexical context as the operator of the expression:
(define-syntax (if-it stx)
  (syntax-case stx ()
    [(op e1 e2 e3)
     (with-syntax ([it (datum->syntax #'op 'it)])
       #'(let ([it e1])
           (if it e2 e3)))]))
The references that will be captured by the introduced binding of it are the ones that were introduced into the program in the same expansion step as the occurrence of if-it in the macro call. In particular, if the occurrence of if-it was in the original program (i.e., written explicitly by the programmer), it captures references to it that were in the original program; if the occurrence of if-it is the result of a macro expansion, it captures only those references to it that were generated in that same expansion step.

This means that a hygienic macro that expands into if-it will work as expected:
(define-syntax or
  (syntax-rules ()
    [(op e1 e2)
     (if-it e1 it e2)]))
Since the reference to it appears in the same expansion step as the occurrence of if-it, that reference is captured, but no references to it within subexpressions e1 or e2 (which had to have already been there before this expansion step) are captured.

If you want to write another capturing macro that expands into if-it, it's a little more work. Essentially, you have to capture it all over again. The moral of the story is that you always have to ask explicitly for a macro to capture an introduced identifier.
(define-syntax (when-it stx)
  (syntax-case stx ()
    [(op e1 e2)
     (with-syntax ([it* (datum->syntax #'op 'it)])
       #'(if-it e1 (let ([it* it]) e2) (void)))]))
Here we once again create an identifier with the same lexical context as the operator, and we bind it to the occurrence of it introduced by if-it.

These are good defaults for a hygienic macro system: it's easier to write hygienic macros but still possible (albeit a little harder) to write macros that capture. This is even true when you abstract over capturing macros: macros that expand into capturing macros are hygienic by default, but with a little more work again, you can create capturing macros that abstract over other capturing macros.

Friday, February 22, 2008

True unions

I am a Super Big Fan of disjoint union datatypes in programming languages, but there are places where they are really inconvenient. Having to inject and project data between related types can be prohibitively cumbersome, especially when dealing with large data definitions such as the AST definitions for realistic programming languages. I know of a couple of languages where "true" unions are used instead, including Typed Scheme and the draft ECMAScript Edition 4. In both cases, unions are being added to a language where the runtime is already paying the price for runtime type tags, so keeping the variants distinct where they don't overlap doesn't introduce any extra burden.

But I was thinking this morning, what would happen if you tried to add true unions to Hindley-Milner languages? For concreteness, let's imagine extending Haskell. So instead of writing
data Foo = I Int
         | S String
you could instead write
type Foo = Int | String
Now if you want to pattern match on such a construct, you have to explicitly mention the type; there's no other way to distinguish the variants. So you could imagine type-annotating the patterns in a match expression:
showFoo :: Foo -> String
showFoo foo = case foo of
  n :: Int    -> show n
  s :: String -> s
Note that because there's nothing preventing overlap in the variants of a true union, the order in which you match the patterns is significant. Alternatively, you could write the definition of showFoo in pattern-definition style:
showFoo :: Foo -> String
showFoo (n :: Int) = show n
showFoo (s :: String) = s
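For comparison, with the ordinary disjoint-union Foo from the top of this post, the same function has to name the constructors, and every caller has to inject explicitly--exactly the injection/projection overhead mentioned above:
showFoo :: Foo -> String
showFoo (I n) = show n
showFoo (S s) = s

-- callers write showFoo (I 42) or showFoo (S "hi"), never showFoo 42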
Consider the possible drawbacks to such a feature:

Cost of tagging:

One of the benefits of (sound) static typing is the ability to compile to efficient runtimes that avoid tagging runtime objects with their datatype. The Foo type, by contrast, would require its Ints and Strings to be tagged. But this is the wrong way of looking at it; the language requires you to tag them anyway if you use the disjoint union type, so there's no additional cost over what you would already have been paying. For non-union types, you can still do the same amount of erasure.

Possible overlap:

You can express overlapping types like (Int | (Int | String)) with true unions; this makes the order of patterns significant, could result in surprising (order-dependent) logical relationships between case dispatches, and could generally lead to messy type definitions. Maybe a more principled way of looking at it is that a disjoint union can be thought of as a type abstraction, whereas with a true union you might have to know its full definition to use it effectively. But hey, the same is true of type definitions, and they're still useful; and besides, nothing's preventing you from using disjoint unions when they're more appropriate.

Case exhaustiveness:

Standard ML's pattern matching facility is carefully designed to allow the compiler to prove or disprove case exhaustiveness. I tend to think this is a bad trade-off; the language is significantly crippled to enable a very weak theorem prover to prove a theorem that's of limited utility. Sure, it has its uses, but when you know things the theorem prover doesn't, you have to abandon pattern matching entirely. Other language designers seem to agree with me, since Haskell and (I think) Ocaml don't require case exhaustiveness either.

Inference problems:

I have no expertise in Hindley-Milner inference, but I'm sure true unions are not inferrable. But one thing recent Haskell research has demonstrated is that convenient but uninferrable extensions to Hindley-Milner type systems can often be safely added with the requirement that they be explicitly annotated. I bet you could add true unions, require explicit type annotations when they're used, and get along just fine.

Function types and polymorphic types:

This is where it gets tougher. How do you tag these kinds of types? Do you do runtime subtyping checks for things like the more-polymorphic-than relation? I don't have an answer. A simple one is to restrict unions to types with simple tags, although those kinds of arbitrary restrictions lead to lack of compositionality in language design. But perhaps it's possible to make this work with the entire type system.

Some of this design space has been explored with Ocaml's polymorphic variants. I should take a look at the literature on that.

Sunday, January 06, 2008

Thank heavens for PLT v4

I've just finally installed my first pre-release of PLT v4 (actually v3.99.0.9 at present), and there are so many little conveniences that are already qualitatively improving my life. Here are a few of the ways v4 is making me happy:

automatic opt-lambda: As I understand it, the underlying low-level #%plain-lambda is like the original lambda, but in the full scheme language, the automatically available, high-level lambda has support for the functionality of opt-lambda from the old (lib "etc.ss") library. More often than not, that was the only reason I needed to require that library. It's so nice just to be able to reach for opt-args right away, without having to jump through any extra hoops.

keyword integration: Moreover, the high-level lambda also has automatic support for keyword arguments! This means there's a very smooth evolution path for functions: 1) positional when you just need a few arguments; 2) optional when one or two of them have reasonable defaults; and 3) keywords when there are enough arguments with enough defaults that it's too hard for the user to remember their order.

built-in pattern matching: No need to require (lib "plt-match.ss") -- it's already in the full scheme language by default. I stumbled on that by accident!

generalized define-struct: There used to be a huge gap between what you could do with define-struct and what you could do with the very general but hard-to-use make-struct-type. But thanks to the expressivity of keywords, the new define-struct form is much more powerful. Plus it's so much easier to understand what
(define-struct posn (x y) #:transparent)
means, as opposed to
(define-struct posn (x y) #f)

built-in list libraries: Many more commonplace list operations are just automatically there when you use the full scheme language, such as foldl, foldr, last, andmap, ormap, build-list, etc. It used to be confusing trying to remember what was in (lib "list.ss") and what was in (lib "list.ss" "srfi" "1"). Now, if I'm not mistaken, I just don't ever need (lib "list.ss") anymore.

my first two columns are mine again: Last but not least, the new #lang module declaration form means I no longer have to indent the entire body of a module.

It looks like the PLT crew have really worked hard on fixing lots of these little inconveniences (some of which I only noticed consciously now that I no longer have to deal with them!).

Thursday, January 03, 2008

The Culpepper Method of Good Writing

Ryan Culpepper's algorithm for good writing:

Blather. Condense. Repeat.