Now:
- Use tree-structured pattern for scalar IndexReduce/IndexScan
- infer which arguments don't need to be copied back to the GPU
 

Next:
- Indexing by boolean masks
- Support 'output' parameter of ufuncs 
- PreallocArrays to move locally used allocation of arrays out functions into their calling scope
- Garbage collection (or, at least, statically inferred deallocations)

On pause:
- Adverb semantics for conv
- Code generation for conv

Maybe never?
- Adverb-level vectorization 

Old:
- Only run tiling on perfectly nested code
