Discussion:
API design
Gael Varoquaux
2010-03-01 20:21:43 UTC
Hi there,

Fabian has drafted a small API design discussion for the basic objects:

https://sourceforge.net/apps/trac/scikit-learn/wiki/ApiDiscussion

The idea is not to set the API in stone. We all know that it will be
limited. The idea is more to have a first set of conventions that we can
play with and use to share code, so that in a few months we are able to
turn around and analyse its limitations based on examples.

We tried to stay quite close to other packages we were aware of (pymvpa,
mdp, scikits.statsmodels).

Gaël
j***@gmail.com
2010-03-01 23:51:23 UTC
On Mon, Mar 1, 2010 at 3:21 PM, Gael Varoquaux
Post by Gael Varoquaux
Hi there,
https://sourceforge.net/apps/trac/scikit-learn/wiki/ApiDiscussion
The idea is not to set in stone the API. We all know that it will be
limited. The idea is more to have a first set of conventions that we can
play with and share code, so that in a few months we are able to turn
around and analyse its limitations based on examples.
We tried to stay quite close to other packages we were aware of (pymvpa,
mdp, scikit.statsmodel).
just 2 comments

"Matrices and vectors
Matrices should be written in uppercase letters, whereas vectors
should be lowercase"

here you don't mean matrices in the numpy sense, just 2d ndarrays? I
don't think capitalization by ndim is useful. In statsmodels for
example X (exog) can be 1d, 2d or now also None. Similar for systems
of regression equations we have a 2d endog (y), I think.

For unsupervised learning you drop y and keep X; in statsmodels we drop X and
keep y (endog), since endog is what we want to explain (however, I just
started to use this more extensively for univariate time series).

In general, did you want to have an update method, or a reuse of the
existing design matrix X, e.g. fit the same X to a new y?
I think mdp does this but I haven't looked at it in a long time.

Trailing underscores are a bit ugly as a coding for estimated values. On
the other hand, we have, for example, the t-statistic as a function
t(), which always throws me off, because I think of it as an attribute.

Josef
------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2010-03-01 23:56:32 UTC
Post by j***@gmail.com
"Matrices and vectors
Matrices should be written in uppercase letters, whereas vectors
should be lowercase"
here you don't mean matrices in the numpy sense, just 2d ndarrays? I
don't think capitalization by ndim is useful. In statsmodels for
example X (exog) can be 1d, 2d or now also None. Similar for systems
of regression equations we have a 2d endog (y), I think.
I definitely agree that you hit a weak point of the proposal: a variable
may sometimes be 1D, sometimes 2D. I am not sure whether this convention
will be useful, but we'll try to apply it, and see where it goes.
Post by j***@gmail.com
For unsupervised you drop y and keep X, in statsmodels we drop X and
keep y (endog), since endog is what we want to explain, however I just
started to use this more extensively for univariate time series)
I have no opinions here.
Post by j***@gmail.com
In general, did you want to have an update method, or a reuse of the
existing design matrix X, e.g fit same X to a new y?
I think mdp does this but I haven't looked at it in a long time.
Yes, this is absolutely something that we are planning to add in the long
run, for online estimation. However, as we have no use case right now, we
decided to leave it undetermined.
Post by j***@gmail.com
trailing underscore are a bit ugly as coding for estimated values. On
the other hand, we have, for example, the t-statistic as a function
t(), which always throws me off, because I think of it as an attribute
I use trailing underscores in my research code quite often, and I think
it has helped me. Think of them as the 'hat' you would put on an
estimated variable.
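To illustrate the convention, here is a toy estimator sketch (the class and its attributes are hypothetical, not part of the scikit):

```python
import numpy as np

class MeanEstimator:
    """Toy estimator: attributes ending in '_' are estimated from
    data -- the trailing underscore is the 'hat' on the estimate."""

    def fit(self, X):
        # mean_ only exists after fit(); the underscore signals that
        # it is an estimated quantity, not a constructor parameter
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

est = MeanEstimator()
est.fit([[1.0, 2.0], [3.0, 4.0]])
print(est.mean_)  # [2. 3.]
```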

Thanks for your feedback, Josef,

Gaël
Yaroslav Halchenko
2010-03-02 02:53:54 UTC
Post by Gael Varoquaux
Post by j***@gmail.com
trailing underscore are a bit ugly as coding for estimated values. On
the other hand, we have, for example, the t-statistic as a function
t(), which always throws me off, because I think of it as an attribute
I use trailing underscores in my research code quite often, and I think
it has helped me. Think of them as the 'hat' you would put on an
estimated variable.
smart... I think I might adhere to the same convention whenever
appropriate ;-)

alternatively, maybe you could group all estimates under some
simple container (collection), e.g.

svm.e.support_vectors
svm.e.alphas
...

that leads to the 2nd question of how to resolve Python's lack of method
overloading, whenever someone would like to create a classifier object
with some precomputed estimates (we lack that in PyMVPA too atm)...
making .e a rw property, which upon fset would do the necessary
checking/assignment of estimates, might be a solution...? e.g.

svm1.train(x, y)
svm2 = SVM()
svm2.e = svm1.e

or maybe I am missing something more elegant?
also, whenever you come to the issue of copying/storing fitted models, it would be
easy to come up with a base class __reduce__ which would simply provide the .e
of a given instance...
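A minimal sketch of such an '.e' collection with a validating setter (all names are hypothetical, not PyMVPA's or the scikit's actual API):

```python
class Estimates(object):
    """Hypothetical container grouping fitted quantities."""
    pass

class TinyRegressor(object):
    """Sketch of the '.e' idea: estimates live in one collection,
    and assigning to .e validates precomputed estimates."""

    def __init__(self):
        self._e = None

    def train(self, x, y):
        e = Estimates()
        # least-squares slope through the origin
        e.coef = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        self._e = e
        return self

    @property
    def e(self):
        return self._e

    @e.setter
    def e(self, value):
        # minimal check before accepting precomputed estimates
        if not hasattr(value, 'coef'):
            raise ValueError("estimates must define 'coef'")
        self._e = value

m1 = TinyRegressor().train([1.0, 2.0], [2.0, 4.0])
m2 = TinyRegressor()
m2.e = m1.e  # reuse fitted estimates without retraining
print(m2.e.coef)  # 2.0
```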

Also, not to impose anything, but just to share -- in PyMVPA, at many
places, we decided to proceed with such 1st-level groups (which we call
collections). E.g. in classifiers/regressions we have .params, which
encapsulates all parameters of the model. So it makes it easy, for an
existing instance, to see what parameters it had without any visual or
mechanical filtering of the methods/properties/attributes in its
interface.
Another example is our Dataset -- we have

.a -- generic attributes (no restrictions imposed)
.sa -- samples attributes -- len() of each attribute should match len of
dataset
.fa -- feature attributes -- len() of each attribute should match
shape[1] of dataset (i.e. number of features, which might be
multidimensional)

Also, you might decide right away on a set of exceptions you might
describe as a part of API relevant to the models (e.g.
DegenerateDataError, FailedToConvergeError, FailedToFitError)
which might provide people using the toolbox (e.g. us ;)) better
granularity over handling corner cases.
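For illustration, such an exception hierarchy might be sketched as follows (all class names are hypothetical, taken from the suggestion above; the base class name is invented):

```python
class LearnError(Exception):
    """Hypothetical base class for toolbox-specific errors."""

class DegenerateDataError(LearnError):
    """The data cannot support estimation (e.g. empty or constant)."""

class FailedToConvergeError(LearnError):
    """An iterative fit exhausted its iteration budget."""

def check_data(X):
    # a caller can catch the specific error, or just the base class
    if len(X) == 0:
        raise DegenerateDataError("need at least one sample")
    return True

try:
    check_data([])
except LearnError as err:
    print("caught:", err)
```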

P.S. you might like to mention in 'Coding Style' that use of pylint is
advised ;) emacs users can enjoy pylint being run transparently in the
background and then having problematic lines highlighted... see
http://github.com/yarikoptic/PyMVPA/blob/yoh/master/doc/misc/emacs
for more information about the setup I use
--
.-.
=------------------------------ /v\ ----------------------------=
Keep in touch // \\ (yoh@|www.)onerussian.com
Yaroslav Halchenko /( )\ ICQ#: 60653192
Linux User ^^-^^ [175555]
Gael Varoquaux
2010-03-02 06:42:25 UTC
Post by Yaroslav Halchenko
alternatively, may be, you could group all estimates under some
simple container (collection), e.g.
svm.e.support_vectors
svm.e.alphas
...
We have thought about this, but decided not to go for the option right
now. We might change our minds in the long run. We need examples to see
how things go. I don't like the second level of indirection: it makes
easy things harder. I would like the objects to be usable in an 'R' way.
Post by Yaroslav Halchenko
that leads to the 2nd question on how to resolve Python's lack of method
overloading, whenever someone would like to create a classifier object
with some precomputed estimates (we lack that in PyMVPA too atm)...
making an .e a rw property, which upon fset would do necessary
checking/assignment of estimates might be a solution...? e.g.
svm1.train(x,y)
svm2 = SVM()
svm.e = svm1.e
or may be I am missing more elegant?
I am not sure what your use case is. I hate properties. I believe they are
an expression of wacky design and come in to do impedance matching. I'd
rather not have impedance matching. If I understood what you are talking
about, we shouldn't need properties, because the 'fit' or 'update' methods
('update' would be for online fitting) would always set the estimated
parameters.
Post by Yaroslav Halchenko
also, whenever you come to the issue of copying/storing fitted models,
it would be easy to come up with base class __reduce__ which would
simply provide .e of a given instance...
To clone an object, one would need to clone the meta parameters, and the
estimated parameters. However, I am not sure why we would need a special
__reduce__. The default one should work, here.
Post by Yaroslav Halchenko
Also, not to impose anything, but just to share -- in PyMVPA, at many
places, we decided to proceed with such 1st-level groups (which we call
collections). E.g. in classifiers/regressions we have .params which
encapsulates all parameters of the model. So it makes easy, for an
existing instance, to see what parameters it had without any visual or
mechanical filtering of the methods/properties/attributes in its
interface.
Another example is our Dataset -- we have
.a -- generic attributes (no restrictions imposed)
.sa -- samples attributes -- len() of each attribute should match len of
dataset
.fa -- feature attributes -- len() of each attribute should match
shape[1] of dataset (i.e. number of features, which might be
multidimensional)
Well, the code then looks like this (I am purposely not using the
train/test vocabulary, as it is confusing for someone who does not know
machine learning):

svm = SVM()
svm.fit(known_pts, labels)
svm.predict(unknown_pts)
new_labels = svm.fa.classes_

I don't really like the last line. It took me a few seconds to decide
whether I had to put 'fa' or 'sa' in. It doesn't help when
tab-completing, and more importantly, it makes us move further into the
world of object-oriented programming, whereas I would like to stay at the
same level of complexity as R.
Post by Yaroslav Halchenko
Also, you might decide right away on a set of exceptions you might
describe as a part of API relevant to the models (e.g.
DegenerateDataError, FailedToConvergeError, FailedToFitError)
which might provide people using the toolbox (e.g. us ;)) better
granularity over handling corner cases.
That's a good idea. Let's leave it for later, though.
Post by Yaroslav Halchenko
P.S. you might like to mention in 'Coding Style' that use of pylint is
advised ;) emacs users can enjoy pylint being ran transparently in
background and then having problematic lines highlighted... see
http://github.com/yarikoptic/PyMVPA/blob/yoh/master/doc/misc/emacs
for more information about setup I use
Yes, suggesting pylint is a good idea.

See you,

Gaël
Yaroslav Halchenko
2010-03-02 14:56:32 UTC
Post by Gael Varoquaux
how things go. I don't like the second level of indirection: it makes
easy things harder. I would like the objects to be usable in an 'R' way.
I am ignorant in R, but I've burnt myself a few times with the 'dot' in R's
function arguments and variable names while interfacing to R through
RPy2... and on the other hand using '$' to access "named components",
which sounds to me like attributes... so, I am not sure what the 'R' way
was ;)
Post by Gael Varoquaux
Post by Yaroslav Halchenko
svm1.train(x,y)
svm2 = SVM()
svm.e = svm1.e
or may be I am missing more elegant?
I am not sure what is your usecase.
not sure (yet)... in any case we could escape by just using copy.(deep)copy
to achieve 'copy construction' to some degree
Post by Gael Varoquaux
I hate properties. I believe they are
an expression of wacky design and come in to do impedance matching.
as many people, as many opinions ;)
but what is "impedance matching" in this context?
Post by Gael Varoquaux
I'd rather not have impedance matching. If I understood what you are
talking about, we shouldn't need properties because the 'fit' or
'update' methods ('update' would be for online fitting) would always set
the estimated parameters.
sure! my 'property' argument was indeed a bit 'aside the topic' in this
case ;-) but the original suggestion for an alternative to marking
estimates with a trailing '_' by grouping them into 'e.' is still in
place ;)
Post by Gael Varoquaux
Post by Yaroslav Halchenko
also, whenever you come to the issue of copying/storing fitted models,
it would be easy to come up with base class __reduce__ which would
simply provide .e of a given instance...
To clone an object, one would need to clone the meta parameters, and the
estimated parameters.
right!
Post by Gael Varoquaux
However, I am not sure why we would need a special
__reduce__. The default one should work, here.
right, for pure Python-based beasties... for swig-interfaced ones (read
libsvm) it might puke ;)
Post by Gael Varoquaux
Post by Yaroslav Halchenko
Also, not to impose anything, but just to share -- in PyMVPA, at many
places, we decided to proceed with such 1st-level groups (which we call
collections). E.g. in classifiers/regressions we have .params which
...<
Another example is our Dataset -- we have
.a -- generic attributes (no restrictions imposed)
...<
Well, the code then looks like this (I am purposely not using the
train/test vocabulary, as it is confusing for someone who does not know
svm = SVM()
svm.fit(known_pts, labels)
svm.predict(unknown_pts)
new_labels = svm.fa.classes_
I don't really like the last line.
me too ;)
I blurbed about the dataset's attributes just to follow the 'Dataset' discussion
part. In your example of course it should be

new_labels = svm.predict(unknown_pts)

BTW -- I quite liked the
if svm.fit(...):

construct idea... just be aware that by default the repr of the result would
get printed if people use the module interactively and just do

svm.fit(...)

so it would be good to ensure a sane __repr__
P.S. I understand that in most cases it would be sane, since only a
few parameters would be present... but some clfs might take some
weight vectors, which might get lengthy in real-world examples
Fabian Pedregosa
2010-03-02 10:06:43 UTC
Post by Yaroslav Halchenko
Also, you might decide right away on a set of exceptions you might
describe as a part of API relevant to the models (e.g.
DegenerateDataError, FailedToConvergeError, FailedToFitError)
which might provide people using the toolbox (e.g. us ;)) better
granularity over handling corner cases.
Indeed, that's a good idea. For now I'd just throw generic exceptions if
parameters are obviously bogus; I think raising ValueError on a
dimension/type mismatch is a reasonable compromise.

Also, I like the idea that .fit can return None in case it fails to
converge, so that you can do

"""
if clf.fit(X, Y): do_something...
"""

But for now I haven't yet worked on any iterative algorithm that can
fail to converge, so I'll postpone that decision until I have enough
use cases.
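A sketch of what that convention could look like: fit returns self on convergence and None otherwise, so the `if clf.fit(...):` construct reads naturally (the class and the toy fixed-point problem are invented for illustration):

```python
import math

class CosineRoot(object):
    """Hypothetical iterative estimator: fit() returns self when it
    converges and None when it does not."""

    def __init__(self, max_iter=100, tol=1e-6):
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, x0):
        # toy fixed-point iteration x <- cos(x)
        x = x0
        for _ in range(self.max_iter):
            x_new = math.cos(x)
            if abs(x_new - x) < self.tol:
                self.x_ = x_new  # converged estimate
                return self
            x = x_new
        return None  # failed to converge

clf = CosineRoot(max_iter=200)
if clf.fit(1.0):
    print("converged:", round(clf.x_, 4))
```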
Post by Yaroslav Halchenko
P.S. you might like to mention in 'Coding Style' that use of pylint is
advised ;) emacs users can enjoy pylint being ran transparently in
background and then having problematic lines highlighted... see
http://github.com/yarikoptic/PyMVPA/blob/yoh/master/doc/misc/emacs
for more information about setup I use
done, cool! I suppose I should start by using it myself ...
David Cournapeau
2010-03-02 01:12:40 UTC
Post by j***@gmail.com
here you don't mean matrices in the numpy sense, just 2d ndarrays? I
don't think capitalization by ndim is useful. In statsmodels for
example X (exog) can be 1d, 2d or now also None. Similar for systems
of regression equations we have a 2d endog (y), I think.
Yes, I agree as well. I don't think different conventions for
different ndim are that useful, and using upper case also has the
drawback of clashing with the "upper case = global variable"
convention. What is important is to agree on whether one uses rows or
columns to differentiate feature vs instance, and what to do when
having a 1d array, especially when/if scikits.learn grows a set of
online procedures with classifiers trained one sample at a time.

In general, a good example to follow for ML is spider: they manage to
have a reasonably good API with matlab, which is quite an achievement
in itself :)

Another thing which may be useful is to finish the (3-year-old!)
dataset proposal (in scikits/trunk/learn/datasets). I have always
found the convention of one dataset format per package annoying - it is
hard to overestimate the advantage of having plain arrays as input to
algorithms IMHO, but this means a set of conventions to easily deal
with datasets is important.

cheers,

David
Gael Varoquaux
2010-03-02 06:29:04 UTC
Post by David Cournapeau
Post by j***@gmail.com
here you don't mean matrices in the numpy sense, just 2d ndarrays? I
don't think capitalization by ndim is useful. In statsmodels for
example X (exog) can be 1d, 2d or now also None. Similar for systems
of regression equations we have a 2d endog (y), I think.
Yes, I agree as well. I don't think different conventions for
different ndim is that useful, and using upper-case has also the
drawback of clashing with the "upper case = global variable"
convention. What is important is to agree on whereas one uses row or
column to differentiate feature vs instance, and what to do when
having 1d array, especially when/if scikits.learn will grow a set of
online procedures with classifiers trained one sample at a time.
Yes, we forgot to choose which axis holds features and which one holds
samples! I guess this is because we already have a convention at the lab,
but this is very important. It seems to me that statsmodels uses the
convention (n_samples, n_features). Josef, could you please confirm?
Yaroslav, could you enlighten us on your conventions (even though you use
dataset objects)? Anybody else: what convention do you expect?
Post by David Cournapeau
In general, a good example to follow for ML is spider: they manage to
have a reasonably good API with matlab, which is quite an achievement
in itself :)
OK, let's have a look!
Post by David Cournapeau
Another thing which may be useful is to finish the (3 year old !)
dataset proposal (in scikits/trunk/learn/datasets). I have always
found the convention of one dataset format / package annoying - it is
hard to overestimate the advantage of having plain arrays as input to
algorithms IMHO, but this means a set of conventions to easily deal
with datasets is important.
Yes, I think we agree with you here. Fabian has started looking at this,
but the reason we got an API proposal out first is that people are
starting to implement estimators for the scikit, here at NeuroSpin. So we
decided we had to agree on some conventions.

Gaël
David Warde-Farley
2010-03-02 16:34:53 UTC
Post by Gael Varoquaux
Yes, we forgot to choose which axis holds features and which one holds
samples! I guess this is because we already have a convention at the lab,
but this is very important. It seems to me that statsmodels uses the
convention (n_samples, n_features). Josef, could you please confirm?
Yaroslav, could you enlighten us on your conventions (even though you use
dataset objects)? Anybody else: what convention do you expect?
On one hand I usually think of points in n-dimensional space as
columns, and thus a collection of points as a matrix of columns, BUT
most tools and even one-off software I see uses rows for examples and
columns for features. This is what I see in most of the statistics
literature too, "n x p" where p is the number of predictors.

In general the API document on Trac looks good to me, though I might
add that labels may not be appropriate in, e.g., unsupervised settings.
(Getting WAY ahead of ourselves, in a semi-supervised setting you
might have three arguments to fit, where you have a collection of
labeled data, the corresponding labels, and a third argument that is a
collection of unlabeled data.)

I guess that there'd be all sorts of situations where the fit() method
of a particular object would take various keyword arguments modifying
its operation. A more interesting question is whether this API fully
accommodates situations where you don't have all the data available at
once, either because you have too much of it to fit in memory at once
or because it's a setting where you are constantly getting new data
and have to do online updates to the model parameters.

Another interesting question is what methods an unsupervised estimator
should provide in lieu of 'predict'. I think it will heavily depend on
the method. In a k-means estimator, a 'predict' analogue could return
a MAP estimate of the cluster assignment of test points. For a mixture
of Gaussians, you might want that, but you might also want the
posterior probability under each mixture component, or perhaps the
marginal likelihood of a test point under the entire model. I don't
know if there's a one-size-fits-all solution here, but it's something
to think about. :)

David
j***@gmail.com
2010-03-02 18:25:32 UTC
Post by David Warde-Farley
Post by Gael Varoquaux
Yes, we forgot to choose which axis holds features and which one holds
samples! I guess this is because we already have a convention at the lab,
but this is very important. It seems to me that statsmodels uses the
convention (n_samples, n_features). Josef, could you please confirm?
Yaroslav, could you enlighten us on your conventions (even though you use
dataset objects)? Anybody else: what convention do you expect?
On one hand I usually think of points in n-dimensional space as
columns, and thus a collection of points as a matrix of columns, BUT
most tools and even one-off software I see uses rows for examples and
columns for features. This is what I see in most of the statistics
literature too, "n x p" where p is the number of predictors.
statsmodels uses the convention (n_samples, n_features).
Yes, I think that's the dominant convention in all econometrics
textbooks, and of course matlab and gauss (also the fortran
tradition). I briefly had the discussion with Ariel on one of the
mailing lists about the default axis=0 in scipy.stats.
I think one of the arguments for this is that this is also the
ordering of structured and record arrays.
The story gets a little bit messier in systems of equations
estimators, where the axis layout convention in the literature is not
always the same (and Skipper uses row observations internally in
sysreg; for vector regression VAR it's also a bit ambiguous).
But from the data, we expect observations in rows.
Post by David Warde-Farley
In general the API document on track looks good to me, though I might
add that labels may not be appropriate in e.g. unsupervised settings.
(Getting WAY ahead of ourselves, in a semi-supervised setting you
might have three arguments to fit, where you have a collection of
labeled data, the corresponding labels, and a third argument that is a
collection of unlabeled data.)
I guess that there'd be all sorts of situations where the fit() method
of a particular object would take various keyword arguments modifying
its operation.  A more interesting question is whether this API fully
accommodates situations where you don't have all the data available at
once, either because you have too much of it to fit in memory at once
or because it's a setting where you are constantly getting new data
and have to do online updates to the model parameters.
the way I understand it is that the API and number of methods will
grow, adding update_fit or other methods as the need for specific
models arises.
For example, one new method that we will add in statsmodels, when we
have more time series analysis, will be forecast (instead of predict),
which is like predict but assumes that the predicted observations
directly follow the current sample (e.g. because of autocorrelation)
Post by David Warde-Farley
Another interesting question is what methods an unsupervised estimator
should provide in lieu of 'predict'. I think it will heavily depend on
the method. In a k-means estimator, a 'predict' analogue could return
a MAP estimate of the cluster assignment of test points. For a mixture
of Gaussians, you might want that, but you might also want the
posterior probability under each mixture component, or perhaps the
marginal likelihood of a test point under the entire model. I don't
know if there's a one-size-fits-all solution here, but it's something
to think about. :)
What does the predict method return for a classifier, a best guess
point estimate or the posterior distribution? (I don't even know what
the new discrete models in statsmodels do)

I don't think a one-size-fits-all solution will exist across all
possible models, but it should be possible to work with just a few
basic patterns.

Similar to Yaroslav's proposal of collecting results in .someresults.xxx:
one problem is that if you want to reuse an estimator instance, then
it should be easy to (partially) wipe previous results. If all
estimation results are attached to the main model instance, then this
requires more attention and work.
statsmodels went to the other extreme in that most results are
attached to a separate results instance. The main advantage is code
reuse for result statistics across models, but I was never very happy
about where to draw the boundary between model and result instance.

Josef
Yaroslav Halchenko
2010-03-02 18:24:28 UTC
Post by Gael Varoquaux
Yaroslav, could you enlighten us on your conventions (even though you use
dataset objects)? Anybody else: what convention do you expect?
nsamples x nfeatures
i.e.

.shape[0] x .shape[1]

a somewhat more specific convention we have is that if our .samples
(actual data array) is > 2D, then dataset.nfeatures is still just
.shape[1] (not prod(.shape[1:])), i.e. each 'feature' is simply considered
to be multidimensional... but in most of the cases, e.g. when a dataset
arrives at a classifier, .samples are 2D (they were flattened and possibly
subselected)
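In other words, assuming the (n_samples, n_features) layout with NumPy (the numbers are made up for illustration):

```python
import numpy as np

# (n_samples, n_features): rows are samples, columns are features
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

n_samples, n_features = X.shape
print(n_samples, n_features)   # 3 2
print(X[0])                    # first sample (one row)
print(X[:, 0])                 # first feature across all samples
```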
Skipper Seabold
2010-03-02 18:58:21 UTC
On Tue, Mar 2, 2010 at 1:29 AM, Gael Varoquaux
Post by Gael Varoquaux
Post by David Cournapeau
Another thing which may be useful is to finish the (3 year old !)
dataset proposal (in scikits/trunk/learn/datasets). I have always
found the convention of one dataset format / package annoying - it is
hard to overestimate the advantage of having plain arrays as input to
algorithms IMHO, but this means a set of conventions to easily deal
with datasets is important.
Yes, I think we agree with you here. Fabian has started looking at this,
but the reason we got an API proposal out first is that people are
starting to implement estimators for the scikit, here at NeuroSpin. So we
decided we had to agree on some conventions.
I am coming to this discussion a bit too late to contribute much,
maybe, but I have taken David's dataset proposal and used it for
statsmodels.

http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/files/head%3A/scikits/statsmodels/datasets/

Basically, we have

import scikits.statsmodels as sm

data = sm.datasets.longley.Load()

then data has two attributes that hold the data, endog and exog, which
are (observations x regressors), because that's how the linear
parametric models are set up. I'm not happy with this, as I don't
think it's general enough (i.e., there's not only one parametric model
that fits each dataset; the 'names' of regressors are not readily
available), and I've gotten away from it with newer additions, but it
makes it easy to write tests and examples. I also tried to make it
easy to add datasets to approach something like
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
without me being the only one who does it, so I provided some
templates and convenience functions to add new datasets. I think
having datasets is really important in writing tutorial/cookbook
examples and hope that we can agree on a general structure across
scipy and related packages.

The other thing to consider is that our data is not always numerical
(there could be a column of strings), so it's not as simple as just
having an array.

Skipper
Gael Varoquaux
2010-03-02 20:27:22 UTC
Permalink
Post by Skipper Seabold
then data has two attributes that hold the data, endog and exog that
are (observations x regressors), because that's how the linear
parametric models are set up. I'm not happy with this, as I don't
think it's general enough, (ie., there's not only one parametric model
that fits with each dataset; the 'names' of regressors are not readily
available), and I've gotten away from it with newer additions, but it
makes it easy to write tests and examples. I also tried to make it
easy to add datasets to approach something like
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
without me being the only one who does it, so I provided some
templates and convenience functions to add new datasets. I think
having datasets is really important in writing tutorial/cookbook
examples and hope that we can agree on a general structure across
scipy and related packages.
I think it would be a nice place for what I call the 'bunch' pattern:

class Bunch(dict):
    def __init__(self, **kw):
        dict.__init__(self, kw)
        self.__dict__ = self

http://code.activestate.com/recipes/52308-the-simple-but-handy-collector-of-a-bunch-of-named/
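A minimal usage sketch of the bunch pattern (field names here are illustrative, not a fixed convention):

```python
class Bunch(dict):
    """Dict whose keys double as attributes (the 'bunch' pattern)."""
    def __init__(self, **kw):
        dict.__init__(self, kw)
        self.__dict__ = self

# Hypothetical dataset container: attribute and key access are equivalent.
dataset = Bunch(data=[[1.0, 2.0], [3.0, 4.0]], label=[0, 1])
assert dataset.label is dataset['label']   # same object either way
dataset.names = ('x1', 'x2')               # new attributes become keys too
assert dataset['names'] == ('x1', 'x2')
```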
Post by Skipper Seabold
The other thing to consider is that our data is not always numerical
(there could be a column of strings), so it's not as simple as just
having an array.
You can have an array of arbitrary objects.

Gaël
Fabian Pedregosa
2010-03-11 15:32:45 UTC
Permalink
Post by Gael Varoquaux
Post by Skipper Seabold
then data has two attributes that hold the data, endog and exog that
are (observations x regressors), because that's how the linear
parametric models are set up. I'm not happy with this, as I don't
think it's general enough, (ie., there's not only one parametric model
that fits with each dataset; the 'names' of regressors are not readily
available), and I've gotten away from it with newer additions, but it
makes it easy to write tests and examples. I also tried to make it
easy to add datasets to approach something like
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
without me being the only one who does it, so I provided some
templates and convenience functions to add new datasets. I think
having datasets is really important in writing tutorial/cookbook
examples and hope that we can agree on a general structure across
scipy and related packages.
dict.__init__(self,kw)
self.__dict__ = self
http://code.activestate.com/recipes/52308-the-simple-but-handy-collector-of-a-bunch-of-named/
I like this. It seems that scikits.statsmodels also implements a bunch
pattern for datasets [1]. We should agree on how the fields of the bunch
should be named.

In this aspect, the original dataset proposal [2] suggests these names
that are already implemented in most datasets and I believe we should
maintain:

- 'data': this value should be a record array containing the
actual data.

- 'label': this value should be a rank 1 array of integers,
contains the label index for each sample, that is label[i]
should be the label index of data[i]. If it contains float
values, it is used for regression instead.

- 'class': a record array such that class[i] is the class name. In
  other words, this makes the correspondence label name -> label
  index.


except that I would prefer not to use record arrays; I find them a rather
fragile structure (they won't transpose correctly, you can't know the shape
along axis 1, etc.).
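Put together, the proposed field names could look like the following sketch, using plain arrays rather than record arrays (the values are made up; 'class' being a Python keyword, it is only reachable through dict-style access):

```python
import numpy as np

class Bunch(dict):
    def __init__(self, **kw):
        dict.__init__(self, kw)
        self.__dict__ = self

# Made-up values; 'label' holds the label index of each sample and
# 'class' maps label index -> class name, as in the proposal.
dataset = Bunch(
    data=np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3]]),
    label=np.array([0, 1, 2]),
    **{'class': np.array(['setosa', 'versicolor', 'virginica'])},
)

# Recover the class name of every sample via fancy indexing.
names = dataset['class'][dataset.label]
assert list(names) == ['setosa', 'versicolor', 'virginica']
```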

I also made a wiki page [3] to compare the different approaches through
example.

Cheers,

fabian

[1]
http://bazaar.launchpad.net/%7Escipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/datasets/copper/data.py

[2]
http://bazaar.launchpad.net/~jsseabold/statsmodels/statsmodels-skipper/annotate/head%3A/scikits/statsmodels/datasets/DATASET_PROPOSAL.txt

[3] http://sourceforge.net/apps/trac/scikit-learn/wiki/DatasetProposal
Post by Gael Varoquaux
Post by Skipper Seabold
The other thing to consider is that our data is not always numerical
(there could be a column of strings), so it's not as simple as just
having an array.
You can have an array of arbitrary objects.
Gaël
------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Skipper Seabold
2010-03-11 16:08:43 UTC
Permalink
On Thu, Mar 11, 2010 at 10:32 AM, Fabian Pedregosa
Post by Fabian Pedregosa
Post by Skipper Seabold
then data has two attributes that hold the data, endog and exog that
are (observations x regressors), because that's how the linear
parametric models are set up.  I'm not happy with this, as I don't
think it's general enough, (ie., there's not only one parametric model
that fits with each dataset; the 'names' of regressors are not readily
available), and I've gotten away from it with newer additions, but it
makes it easy to write tests and examples.  I also tried to make it
easy to add datasets to approach something like
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
without me being the only one who does it, so I provided some
templates and convenience functions to add new datasets.  I think
having datasets is really important in writing tutorial/cookbook
examples and hope that we can agree on a general structure across
scipy and related packages.
        dict.__init__(self,kw)
        self.__dict__ = self
http://code.activestate.com/recipes/52308-the-simple-but-handy-collector-of-a-bunch-of-named/
I like this. It seems that also scikits.statsmodel implement a buch
pattern for datasets [1]. We should agree on how fields of the bunch
should be named.
In this aspect, the original dataset proposal [2] suggests these names
that are already implemented in most datasets and I believe we should
I got away from this a bit. See below comments. It might just be a
misunderstanding on my part, if this is already convention somewhere.
Post by Fabian Pedregosa
    - 'data': this value should be a record array containing the
      actual data.
Record or structured array?
Post by Fabian Pedregosa
    - 'label': this value should be a rank 1 array of integers,
      contains the label index for each sample, that is label[i]
      should be the label index of data[i]. If it contains float
      values, it is used for regression instead.
I'm not sure what is meant here. If we use structured arrays we have
this already in dtype.names? [Returns after reading it again...] I
think this is what I am suggesting about having string data. Do we
really need to attach a representation of it as numerical data or can
we just create it on the fly when we need it? That's the way we are
probably headed in statsmodels. I can't think of a statistical
package that doesn't handle factors or categorical variables "behind
the scenes." See below.

I *really* like the idea of labels, though differently from what is
suggested here. So much so that I subclass arrays and provide a label
attribute for my work that provides lengthier descriptions of variable
names.
Post by Fabian Pedregosa
    - 'class': a record array such as class[i] is the class name. In
      other words, this makes the correspondance label name > label
      index.
Isn't this just what the names attribute is for?

In [4]: arr = np.array([(0,'obs1',4.5),(2.,'obs2',3.)],
dtype=[('var1',float),('var2','a4'),('var3',float)])

In [5]: arr
Out[5]:
array([(0.0, 'obs1', 4.5), (2.0, 'obs2', 3.0)],
dtype=[('var1', '<f8'), ('var2', '|S4'), ('var3', '<f8')])

In [6]: arr.dtype.names
Out[6]: ('var1', 'var2', 'var3')

In [7]: arr['var2']
Out[7]:
array(['obs1', 'obs2'],
dtype='|S4')

In [8]: arr[list(arr.dtype.names[1:])]
Out[8]:
array([('obs1', 4.5), ('obs2', 3.0)],
dtype=[('var2', '|S4'), ('var3', '<f8')])
Post by Fabian Pedregosa
except that I would prefer not to use record arrays, I find it a rather
fragile structure (won't transpose correctly, can't know the shape in
axis 1, etc).
This is rather key and what I was referring to when I said that we
quite often have data as strings. If we use structured arrays then
it's pretty flexible as far as keeping up with what's what, but when
you want to actually *do* something with them, then you have to take a
view and convert string data to a float (dummy variable)
representation. Some of the latter datasets I have included, I've
switched to using structured arrays, but then I find for all of my
examples and tests I have to do

In [9]: arr[['var1','var3']].view((float,2))
Out[9]:
array([[ 0. , 4.5],
[ 2. , 3. ]])

Which might be a small price to pay.

I also have a helper function in statsmodels to convert string data to
categorical and append it to a nd, structured, or record array.
http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/tools.py#L21

In [10]: import scikits.statsmodels as sm

In [11]: sm.tools.categorical(arr['var2'])
Out[11]:
array([['obs1', '1.0', '0.0'],
['obs2', '0.0', '1.0']],
dtype='|S8')

In [12]: sm.tools.categorical(arr['var2'], drop=True)
Out[12]:
array([[ 1., 0.],
[ 0., 1.]])

In [13]: sm.tools.categorical(arr, col=1)
Out[13]:
array([(0.0, 'obs1', 4.5, 1.0, 0.0), (2.0, 'obs2', 3.0, 0.0, 1.0)],
dtype=[('var1', '<f8'), ('var2', '|S4'), ('var3', '<f8'),
('obs1', '<f8'), ('obs2', '<f8')])


It's pretty robust, but needs some more name checking when making a
name and returning a structured array.

For the shape of axis 1, I usually do

In [14]: len(arr.dtype.names)
Out[14]: 3
Post by Fabian Pedregosa
I also made a wiki page [3] to compare the different approaches through
example.
I am all for changing the datasets in statsmodels to be consistent
with whatever y'all decide to do. I knew I would have to at some
point anyway. I'm glad this is being discussed though, as I think
having a datasets package would greatly enhance the experience of new
users to python/scipy. I, for one, like concrete examples rather than
generating random data when I'm trying to learn something.

Skipper
Post by Fabian Pedregosa
Cheers,
fabian
[1]
http://bazaar.launchpad.net/%7Escipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/datasets/copper/data.py
[2]
http://bazaar.launchpad.net/~jsseabold/statsmodels/statsmodels-skipper/annotate/head%3A/scikits/statsmodels/datasets/DATASET_PROPOSAL.txt
[3] http://sourceforge.net/apps/trac/scikit-learn/wiki/DatasetProposal
j***@gmail.com
2010-03-11 16:43:53 UTC
Permalink
Post by Skipper Seabold
On Thu, Mar 11, 2010 at 10:32 AM, Fabian Pedregosa
Post by Fabian Pedregosa
Post by Skipper Seabold
then data has two attributes that hold the data, endog and exog that
are (observations x regressors), because that's how the linear
parametric models are set up.  I'm not happy with this, as I don't
think it's general enough, (ie., there's not only one parametric model
that fits with each dataset; the 'names' of regressors are not readily
available), and I've gotten away from it with newer additions, but it
makes it easy to write tests and examples.  I also tried to make it
easy to add datasets to approach something like
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
without me being the only one who does it, so I provided some
templates and convenience functions to add new datasets.  I think
having datasets is really important in writing tutorial/cookbook
examples and hope that we can agree on a general structure across
scipy and related packages.
        dict.__init__(self,kw)
        self.__dict__ = self
http://code.activestate.com/recipes/52308-the-simple-but-handy-collector-of-a-bunch-of-named/
I like this. It seems that also scikits.statsmodel implement a buch
pattern for datasets [1]. We should agree on how fields of the bunch
should be named.
In this aspect, the original dataset proposal [2] suggests these names
that are already implemented in most datasets and I believe we should
I got away from this a bit.  See below comments.  It might just be a
misunderstanding on my part, if this is already convention somewhere.
Post by Fabian Pedregosa
    - 'data': this value should be a record array containing the
      actual data.
Record or structured array?
Post by Fabian Pedregosa
    - 'label': this value should be a rank 1 array of integers,
      contains the label index for each sample, that is label[i]
      should be the label index of data[i]. If it contains float
      values, it is used for regression instead.
I'm not sure what is meant here.  If we use structured arrays we have
this already in dtype.names?  [Returns after reading it again...] I
think this is what I am suggesting about having string data.  Do we
really need to attach a representation of it as numerical data or can
we just create it on the fly when we need it?  That's the way we are
probably headed in statsmodels.  I can't think of a statistical
package that doesn't handle factors or categorical variables "behind
the scenes."  See below.
I *really* like the idea of labels, though differently from what is
suggested here.  So much so that I subclass arrays and provide a label
attribute for my work that provides lengthier descriptions of variable
names.
Post by Fabian Pedregosa
    - 'class': a record array such as class[i] is the class name. In
      other words, this makes the correspondance label name > label
      index.
Isn't this just what the names attribute is for?
I think this is more additional information for the other axis similar
to categorical below.

I didn't look at the implementation in detail, but the two
problems/use-cases I have are usually with dates (getting dates from
strings into a usable form) and with labels of categorical variables
that are encoded.
For industry/macro modelling, for example, NAICS codes and the verbal
industry names, or commodity codes and commodity names.
E.g. for NIPA tables: "Government consumption expenditures and gross
investment" is too long as a variable/column name to be useful, so it
would be useful to have some structure to switch between full names
and encoded or shortcut names.
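A lightweight way to get that switching, sketched with plain dicts (the short codes below are invented for illustration, not actual NAICS/NIPA identifiers):

```python
# Invented short codes for illustration only.
FULL_NAME = {
    'gov_inv': 'Government consumption expenditures and gross investment',
    'pce': 'Personal consumption expenditures',
}
# Reverse mapping, full description -> short code.
SHORT_NAME = {full: short for short, full in FULL_NAME.items()}

assert SHORT_NAME[FULL_NAME['gov_inv']] == 'gov_inv'
```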

My other problem is that it still takes too much time and effort to
get the information from a csv file into a pandas.DataMatrix or
la.larry or other format that has full labels for all axes, not just
one as in structured arrays. Using structured arrays needs a good
encoding for the row description as one of the columns.

I now like csv files as the main storage medium for data, because then it
is also relatively easy to load the data into matlab or R for
comparison.

Just some thoughts from a user, I find the example data in statsmodels
very convenient, but wish it were cheaper to add more.

Josef
Post by Skipper Seabold
In [4]: arr = np.array([(0,'obs1',4.5),(2.,'obs2',3.)],
dtype=[('var1',float),('var2','a4'),('var3',float)])
In [5]: arr
array([(0.0, 'obs1', 4.5), (2.0, 'obs2', 3.0)],
     dtype=[('var1', '<f8'), ('var2', '|S4'), ('var3', '<f8')])
In [6]: arr.dtype.names
Out[6]: ('var1', 'var2', 'var3')
In [7]: arr['var2']
array(['obs1', 'obs2'],
     dtype='|S4')
In [8]: arr[list(arr.dtype.names[1:])]
array([('obs1', 4.5), ('obs2', 3.0)],
     dtype=[('var2', '|S4'), ('var3', '<f8')])
Post by Fabian Pedregosa
except that I would prefer not to use record arrays, I find it a rather
fragile structure (won't transpose correctly, can't know the shape in
axis 1, etc).
This is rather key and what I was referring to when I said that we
quite often have data as strings.  If we use structured arrays then
it's pretty flexible as far as keeping up with what's what, but when
you want to actually *do* something with them, then you have to take a
view and convert string data to a float (dummy variable)
representation.  Some of the latter datasets I have included, I've
switched to using structured arrays, but then I find for all of my
examples and tests I have to do
In [9]: arr[['var1','var3']].view((float,2))
array([[ 0. ,  4.5],
      [ 2. ,  3. ]])
Which might be a small price to pay.
I also have a helper function in statsmodels to convert string data to
categorical and append it to a nd, structured, or record array.
http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/tools.py#L21
In [10]: import scikits.statsmodels as sm
In [11]: sm.tools.categorical(arr['var2'])
array([['obs1', '1.0', '0.0'],
      ['obs2', '0.0', '1.0']],
     dtype='|S8')
In [12]: sm.tools.categorical(arr['var2'], drop=True)
array([[ 1.,  0.],
      [ 0.,  1.]])
In [13]: sm.tools.categorical(arr, col=1)
array([(0.0, 'obs1', 4.5, 1.0, 0.0), (2.0, 'obs2', 3.0, 0.0, 1.0)],
     dtype=[('var1', '<f8'), ('var2', '|S4'), ('var3', '<f8'),
('obs1', '<f8'), ('obs2', '<f8')])
It's pretty robust, but needs some more name checking when making a
name and returning a structured array.
For the shape of axis 1, I usually do
In [14]: len(arr.dtype.names)
Out[14]: 3
Post by Fabian Pedregosa
I also made a wiki page [3] to compare the different approaches through
example.
I am all for changing the datasets in statsmodels to be consistent
with whatever y'all decide to do.  I knew I would have to at some
point anyway.  I'm glad this is being discussed though, as I think
having a datasets package would greatly enhance the experience of new
users to python/scipy.  I, for one, like concrete examples rather than
generating random data when I'm trying to learn something.
Skipper
Post by Fabian Pedregosa
Cheers,
fabian
[1]
http://bazaar.launchpad.net/%7Escipystats/statsmodels/trunk/annotate/head%3A/scikits/statsmodels/datasets/copper/data.py
[2]
http://bazaar.launchpad.net/~jsseabold/statsmodels/statsmodels-skipper/annotate/head%3A/scikits/statsmodels/datasets/DATASET_PROPOSAL.txt
[3] http://sourceforge.net/apps/trac/scikit-learn/wiki/DatasetProposal
Fabian Pedregosa
2010-03-16 10:05:53 UTC
Permalink
Post by j***@gmail.com
Post by Skipper Seabold
On Thu, Mar 11, 2010 at 10:32 AM, Fabian Pedregosa
Post by Fabian Pedregosa
Post by Gael Varoquaux
Post by Skipper Seabold
then data has two attributes that hold the data, endog and exog that
are (observations x regressors), because that's how the linear
parametric models are set up. I'm not happy with this, as I don't
think it's general enough, (ie., there's not only one parametric model
that fits with each dataset; the 'names' of regressors are not readily
available), and I've gotten away from it with newer additions, but it
makes it easy to write tests and examples. I also tried to make it
easy to add datasets to approach something like
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
without me being the only one who does it, so I provided some
templates and convenience functions to add new datasets. I think
having datasets is really important in writing tutorial/cookbook
examples and hope that we can agree on a general structure across
scipy and related packages.
dict.__init__(self,kw)
self.__dict__ = self
http://code.activestate.com/recipes/52308-the-simple-but-handy-collector-of-a-bunch-of-named/
I like this. It seems that also scikits.statsmodel implement a buch
pattern for datasets [1]. We should agree on how fields of the bunch
should be named.
In this aspect, the original dataset proposal [2] suggests these names
that are already implemented in most datasets and I believe we should
I got away from this a bit. See below comments. It might just be a
misunderstanding on my part, if this is already convention somewhere.
Post by Fabian Pedregosa
- 'data': this value should be a record array containing the
actual data.
Record or structured array?
Post by Fabian Pedregosa
- 'label': this value should be a rank 1 array of integers,
contains the label index for each sample, that is label[i]
should be the label index of data[i]. If it contains float
values, it is used for regression instead.
I'm not sure what is meant here. If we use structured arrays we have
this already in dtype.names? [Returns after reading it again...] I
think this is what I am suggesting about having string data. Do we
really need to attach a representation of it as numerical data or can
we just create it on the fly when we need it? That's the way we are
probably headed in statsmodels. I can't think of a statistical
package that doesn't handle factors or categorical variables "behind
the scenes." See below.
I *really* like the idea of labels, though differently from what is
suggested here. So much so that I subclass arrays and provide a label
attribute for my work that provides lengthier descriptions of variable
names.
Post by Fabian Pedregosa
- 'class': a record array such as class[i] is the class name. In
other words, this makes the correspondance label name > label
index.
Isn't this just what the names attribute is for?
I think this is more additional information for the other axis similar
to categorical below.
I didn't look at the implementation in details, but they two
problems/use-cases I have are usually with dates, getting dates from
strings into a usable form and labels of categorical variables that
are encoded.
for industry/macro modelling, for example NAICS codes and the verbal
industry names, or commodity codes and commodity names.
e.g for NIPA tabels: "Government consumption expenditures and gross
investment" is too long as a variable/column name to be useful, so it
would be useful to have some structure to switch between full names
and encoded or shortcut names.
My other problem is that it still takes too much time and effort to
get the information from a csv file into a pandas.DataMatrix or
la.larry or other format that has full labels for all axis not just
one as in structured arrays. Using structured arrays needs a good
encoding for the row description as one of the columns.
I like now csv files as main storage medium for data, because then it
is also relatively easy to load the data into matlab or R for
comparison.
Thanks for pointing this out. Storage is very important indeed. I added
to the wiki [1] that data should be stored in a csv file. I added that
the first row should contain the number of samples, the number of
features, and the names of the classes. This is not very pretty, but it
is convenient to know the size of the array before we start parsing.
Also, I found no other place to store the class names. See [2] for an
example of how the csv would look.
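A sketch of how such a file might be parsed (the exact header layout is an assumption on my part: here the first row is "n_samples, n_features, class names...", and the last column of each data row is the label index):

```python
import csv
import io

import numpy as np

# Made-up csv in the proposed layout: a metadata row, then data rows
# whose last column is the label index.
text = """3,2,setosa,versicolor
5.1,3.5,0
7.0,3.2,1
6.3,3.3,0
"""

rows = list(csv.reader(io.StringIO(text)))
n_samples, n_features = int(rows[0][0]), int(rows[0][1])
class_names = rows[0][2:]

# Knowing the shape up front lets us preallocate instead of growing lists.
data = np.empty((n_samples, n_features))
label = np.empty(n_samples, dtype=int)
for i, row in enumerate(rows[1:]):
    data[i] = [float(v) for v in row[:n_features]]
    label[i] = int(row[n_features])

assert data.shape == (n_samples, n_features)
assert class_names == ['setosa', 'versicolor']
```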

I made a prototype implementation based on this design that I find quite
convenient. This approach makes it extremely easy to add new datasets
from [3]: you only have to rename the files accordingly and add the
'metadata' row to the csv.

If you want to play with it and send comments, a minimal example would
look like:

In [1]: from scikits.learn import datasets

In [2]: iris = datasets.load('iris')

In [3]: iris.label
Out[3]:
array([ 0., 0., ....])

Please do not consider this as anything definitive; it's just an
experiment to see what weaknesses this approach has in some real
examples.

~fabian

[1] http://sourceforge.net/apps/trac/scikit-learn/wiki/DatasetProposal
[2]
http://github.com/fseoane/scikit-learn/tree/master/scikits/learn/datasets/data/
[3] http://archive.ics.uci.edu/ml/
Fabian Pedregosa
2010-03-02 09:44:35 UTC
Permalink
Post by David Cournapeau
Post by j***@gmail.com
here you don't mean matrices in the numpy sense, just 2d ndarrays? I
don't think capitalization by ndim is useful. In statsmodels for
example X (exog) can be 1d, 2d or now also None. Similar for systems
of regression equations we have a 2d endog (y), I think.
Yes, I agree as well. I don't think different conventions for
different ndim is that useful, and using upper-case has also the
drawback of clashing with the "upper case = global variable"
convention. What is important is to agree on whereas one uses row or
column to differentiate feature vs instance, and what to do when
having 1d array, especially when/if scikits.learn will grow a set of
online procedures with classifiers trained one sample at a time.
yeah, maybe that was not a great idea. I've changed that and adopted the
convention "upper case = global variable".

I've also added a description of parameters (see section
Fitting.Parameters) that will hopefully address the ambiguity in the use
of row/columns to denote features/instances [1]
Post by David Cournapeau
In general, a good example to follow for ML is spider: they manage to
have a reasonably good API with matlab, which is quite an achievement
in itself :)
Another thing which may be useful is to finish the (3 year old !)
dataset proposal (in scikits/trunk/learn/datasets). I have always
found the convention of one dataset format / package annoying - it is
hard to overestimate the advantage of having plain arrays as input to
algorithms IMHO, but this means a set of conventions to easily deal
with datasets is important.
Using record arrays, as is done in some datasets, is in theory a good
idea, but unfortunately I'm having some issues when plugging those
datasets into some algorithms [2].

When I finish my work with the svm module, I'll concentrate on this, but
to be frank I do not yet have a clear idea of how this should be done.

[1]
https://sourceforge.net/apps/trac/scikit-learn/wiki/ApiDiscussion#Parameters
[2] https://sourceforge.net/apps/trac/scikit-learn/ticket/19


Cheers,

fabian
Post by David Cournapeau
cheers,
David
Yaroslav Halchenko
2010-03-02 15:23:34 UTC
Permalink
Post by Fabian Pedregosa
yeah, maybe that was not a great idea. I've changed that and adopted the
convention "upper case = global variable".
hm... citing PEP 8:

Constants

Constants are usually declared on a module level and written in all
capital letters with underscores separating words. Examples include
MAX_OVERFLOW and TOTAL.

so it is not per se about globals but about constants, which are usually denoted in UPPER CASE.
--
.-.
=------------------------------ /v\ ----------------------------=
Keep in touch // \\ (yoh@|www.)onerussian.com
Yaroslav Halchenko /( )\ ICQ#: 60653192
Linux User ^^-^^ [175555]