Thanks Lars - I would really like to clarify the problems with my
suggestion, in particular if/how a CLI interface would break the
scikit-learn interface. You can obviously identify the problems immediately.
The kind of thing I would like to do is run vowpal-wabbit from within
scikit-learn. There are lots of programs out there implementing a single
algorithm. What would be nice is to have an easy way of investigating them
(doing preprocessing, cross-validation, metrics, etc. in scikit-learn).
I am just suggesting that a CLI might be a useful additional interface to
enable quick incorporation of new algorithms. The number of algos/bugs
grows each year; the number of scikit-learn developers doesn't! This is
different from R, where essentially each algorithm is maintained by its own
developer. Clearly a CLI interface will be less polished, but it is better
Post by Sean Violante
b) most ML algos are available from the command line with text file input.
Python is a great tool for controlling external programs, but it's
still a hard problem because usually the CLI interfaces to those
programs are poorly defined. Error handling in particular can be very
difficult, and installation, deployment, and testing code must be
rewritten for each program.
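
To make the error-handling point concrete, here is a minimal sketch of the kind of helper a wrapper would need. The names `run_tool` and `ExternalToolError` are hypothetical, and the sketch assumes the wrapped program signals failure through a non-zero exit code and writes its diagnostics to stderr, which many real tools do not do reliably.

```python
import subprocess
import sys


class ExternalToolError(RuntimeError):
    """Raised when the wrapped command is missing, times out, or fails."""


def run_tool(argv, timeout=60):
    """Run an external command and return its stdout as text."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=False
        )
    except FileNotFoundError as exc:
        # The executable is not installed or not on PATH.
        raise ExternalToolError(f"executable not found: {argv[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise ExternalToolError(f"timed out after {timeout}s: {argv}") from exc
    if result.returncode != 0:
        # Assumption: a non-zero exit code means failure and stderr explains it.
        raise ExternalToolError(
            f"exit code {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # The Python interpreter stands in for a real command-line learner.
    print(run_tool([sys.executable, "-c", "print('ok')"]))
```

Even this only covers the easy cases; a program that exits with 0 after printing a warning, or that writes a partial model file before crashing, would need handling specific to that program.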
I agree the CLIs are not going to be consistent. What I imagined was
scikit-learn developers providing some generic utility functions/interface.
Then anyone wanting to use some new algo would write the relevant scripts
mapping parameters to the CLI/text file. How hard could it be? ;)
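
A rough sketch of what such a generic utility might look like, to show how parameters and data could be mapped onto a command line and a text file. Everything here is an assumption for illustration: `CommandLineEstimator`, `write_rows`, the argv templates, and the `scale` parameter are hypothetical, the text format is made up, and two one-line Python scripts stand in for a real external learner. It is not part of scikit-learn.

```python
import os
import subprocess
import sys
import tempfile


def write_rows(path, X, y=None):
    """Write one example per line: optional label, then space-separated features."""
    with open(path, "w") as fh:
        for i, row in enumerate(X):
            fields = [] if y is None else [str(y[i])]
            fields.extend(str(v) for v in row)
            fh.write(" ".join(fields) + "\n")


class CommandLineEstimator:
    """Hypothetical fit/predict wrapper around an external program.

    train_cmd and predict_cmd are argv templates. "{data}", "{model}",
    "{out}" and any keyword parameter are substituted with str.format.
    """

    def __init__(self, train_cmd, predict_cmd, **params):
        self.train_cmd = train_cmd
        self.predict_cmd = predict_cmd
        self.params = params

    def _run(self, template, **paths):
        argv = [arg.format(**paths, **self.params) for arg in template]
        subprocess.run(argv, check=True, capture_output=True, text=True)

    def fit(self, X, y):
        # The temporary directory is not cleaned up in this sketch.
        self.workdir_ = tempfile.mkdtemp()
        self.model_path_ = os.path.join(self.workdir_, "model")
        data = os.path.join(self.workdir_, "train.txt")
        write_rows(data, X, y)
        self._run(self.train_cmd, data=data, model=self.model_path_)
        return self

    def predict(self, X):
        data = os.path.join(self.workdir_, "test.txt")
        out = os.path.join(self.workdir_, "pred.txt")
        write_rows(data, X)
        self._run(self.predict_cmd, data=data, model=self.model_path_, out=out)
        with open(out) as fh:
            return [float(line) for line in fh if line.strip()]


# Stand-in "learner": the model is the mean label times a scale argument.
# These sources contain no curly braces, so str.format leaves them unchanged.
TRAIN_SRC = (
    "import sys\n"
    "labels = [float(line.split()[0]) for line in open(sys.argv[1])]\n"
    "open(sys.argv[2], 'w').write(str(sum(labels) / len(labels) * float(sys.argv[3])))\n"
)
PREDICT_SRC = (
    "import sys\n"
    "value = open(sys.argv[2]).read()\n"
    "n = sum(1 for _ in open(sys.argv[1]))\n"
    "open(sys.argv[3], 'w').write((value + '\\n') * n)\n"
)

if __name__ == "__main__":
    est = CommandLineEstimator(
        train_cmd=[sys.executable, "-c", TRAIN_SRC, "{data}", "{model}", "{scale}"],
        predict_cmd=[sys.executable, "-c", PREDICT_SRC, "{data}", "{model}", "{out}"],
        scale=0.5,
    )
    est.fit([[0, 1], [1, 0]], [1.0, 3.0])
    print(est.predict([[0, 0], [1, 1], [2, 2]]))
```

The per-program work is then confined to the two templates and the file writer, but the sketch sidesteps exactly the hard parts raised above: error reporting, installation, and the fact that every row is written to disk and parsed again.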
a) no problem with data copy: the executable loads data from file (you don't
need to keep it in sklearn)
Quite the contrary. What if only raw data (text files, JSON, etc.) is
on disk, and you still need to do feature extraction on it? Then you
need a pipeline of a feature extraction script and a learner, so
you're copying the raw data from disk into the feature extraction
script, then into kernel buffers, and finally into the learning
program. What about feature selection, is that an extra script with
two additional copies?
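
The copying described above can be sketched as follows. This is only an illustration under stated assumptions: `extract_features` and `stream_to_learner` are hypothetical names, the features are toy values, and a one-line Python script that counts its input lines stands in for the learning program.

```python
import subprocess
import sys

# Stand-in "learner": reports how many examples it received on stdin.
LEARNER_SRC = "import sys; print(sum(1 for line in sys.stdin if line.strip()))"


def extract_features(raw_lines):
    """Hypothetical feature extraction: (token count, character count) per line."""
    for line in raw_lines:
        text = line.strip()
        yield len(text.split()), len(text)


def stream_to_learner(raw_lines, argv):
    """Feed extracted features to an external program through a pipe."""
    proc = subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    for features in extract_features(raw_lines):
        # Each row is serialized here, copied into the kernel pipe buffer,
        # and parsed again inside the external program.
        proc.stdin.write(" ".join(str(v) for v in features) + "\n")
    # communicate() closes stdin and collects stdout. Writing everything
    # before reading is only safe because the stand-in prints a single line
    # at the end; a chattier program could fill its output pipe and deadlock.
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"learner failed with exit code {proc.returncode}")
    return out.strip()


if __name__ == "__main__":
    raw = ["a b c", "hello world"]
    print(stream_to_learner(raw, [sys.executable, "-c", LEARNER_SRC]))
```

Adding a feature selection step as a separate program would mean another pipe, and so another serialize/copy/parse round for every row.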
Whenever you package external algos you are likely to have the problem that
the internal data format is not the same as a numpy array. My issue with data
copy is one of memory limits: i.e. if you are keeping two copies in memory
you are halving the maximum amount of data you can handle.
For me this seems to be the main problem with my suggestion: clearly the
.fit(X,y) interface doesn't allow me to clear the training data within