Brief Description
Disk space, memory, and time are increasingly common constraints in applied statistics. Datasets are
growing quickly, yet the computational tools needed to analyze them are not keeping pace. This
document is an attempt to describe how three ideas could be combined to improve the situation:
1) having a standard language for describing a statistical model, as in Ch. 2 of “Statistical Models
in S”; 2) having a fast implementation (e.g., compiled rather than interactive) of the S language;
and 3) having that implementation use ScaLAPACK so that problem size is transparent to the user.
Thus, a user should only need to be concerned with creating a statistical model and analyzing its
output, not with the computational platform and the details underlying its implementation.
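To make "transparent to problem size" a bit more concrete, here is a minimal single-machine sketch in Python. It is not ScaLAPACK and not the design described above; it only illustrates that the fitting code need not care how many rows there are. It assumes the design matrix for a formula such as y ~ x has already been built (an intercept column plus x), and the names fit_least_squares and solve are hypothetical. It accumulates the normal equations X'X and X'y one chunk at a time, which is simple but less numerically stable than a QR-based fit.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(chunks, n_features):
    """Fit y = X b from an iterable of chunks, each an iterable of (x_row, y).

    Only the p-by-p matrix X'X and the length-p vector X'y are kept in
    memory, so the number of rows can exceed what fits in RAM.
    """
    p = n_features
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for chunk in chunks:
        for x, y in chunk:
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return solve(xtx, xty)


# Hypothetical usage: rows for y ~ x, i.e. an intercept column plus x.
rows = [([1.0, float(x)], 2.0 + 3.0 * x) for x in range(6)]
coef_all_at_once = fit_least_squares([rows], 2)
coef_in_chunks = fit_least_squares([rows[:2], rows[2:]], 2)
```

The caller writes the same call whether the data arrive as one chunk or many, and the two results should agree up to rounding.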
Specific Objective
The objective was to see whether there was any interest in brute-force tools for large datasets.
Product/Results
Status
Out, presumed dead. See postmortem.
Postmortem
Well, at the time I didn’t have much experience with big databases. There was pretty much no interest. From my own perspective, I came to see this as a brute-force approach to the issue, and found the bootstrap to be much more elegant and applicable here. Further, I had several failures trying to write SQL to do matrix operations. It’s possible that there are some OLAP solutions which are more elegant, but I don’t know much about that. I did find some research about doing statistics on massive SQL datasets that seemed a little more promising.
Also, a lot of these tools were developed for other communities (see the PDE folks), who don’t currently have any elegant tricks for getting around things like Navier-Stokes. I think. Bummer for them. But at least they actually have a valid excuse for working on big parallel machines.
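For a sense of what doing matrix operations in SQL involves, here is a minimal sketch of a matrix product written as a join. It is not a reconstruction of the attempts mentioned above. It assumes each matrix is stored as (row, column, value) triples; the table names a and b and the function matmul_sql are hypothetical, and it uses Python's built-in sqlite3 module with an in-memory database.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two dense matrices (lists of rows) using a SQL join."""
    conn = sqlite3.connect(":memory:")
    # Each matrix becomes a table of (row index, column index, value) triples.
    conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, v REAL)")
    conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, v REAL)")
    conn.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )
    # C[i][j] = sum over k of A[i][k] * B[k][j]: join on the shared index k,
    # then sum the products within each (i, j) group.
    rows = conn.execute(
        "SELECT a.i, b.j, SUM(a.v * b.v) "
        "FROM a JOIN b ON a.k = b.k "
        "GROUP BY a.i, b.j"
    ).fetchall()
    conn.close()
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, v in rows:
        out[i][j] = v
    return out


# Hypothetical usage with two small 2-by-2 matrices.
product = matmul_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

A product fits the join-and-aggregate pattern fairly naturally. Operations such as inversion or factorization do not map onto a single query in the same way.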




