Discussion: shouldn't models be reusable?
Yaroslav Halchenko
2010-05-31 18:08:10 UTC
Hi,

I've decided to ask this basic, fundamental question about the target
design first, before disclosing/fixing a possibly existing specific issue:

if I create a model (classifier) and fit it to one data set, should I be
able to "refit" it later on another data set without any side effects
(e.g. crashes, incorrect results, etc.)?
Gael Varoquaux
2010-05-31 18:15:56 UTC
Post by Yaroslav Halchenko
if I create a model (classifier) and fit it to one data set,
should I be able to "refit" it later on on another data without any
side-effects (e.g. crash, incorrect results, etc)?
I believe that this is a goal we had set ourselves. There have been a few
challenges to satisfying this goal, but I'd like to keep to it as much as
possible.

At some point, we will probably add a 'refit' function (or a similarly
named one) for online fitting, recursive estimation, or warm restarts, as
all of these functionalities do not satisfy the above criterion.

Gaël
Yaroslav Halchenko
2010-05-31 18:25:47 UTC
Post by Gael Varoquaux
Post by Yaroslav Halchenko
if I create a model (classifier) and fit it to one data set,
should I be able to "refit" it later on on another data without any
side-effects (e.g. crash, incorrect results, etc)?
I believe that this is a goal we had set ourselves. There have been a
few challenges to satisfying this goal, but I'd like to keep to it as
much as possible.
Worth a unit test then. I've hit an issue with LDA -- see the attached
("git am"-friendly) tentative patch (I decided to check with you first,
even though it is quite trivial, whether it agrees with the notion of
estimates you carry through).

I hit it (I believe) by training first on a binary data set and then
trying to fit the same instance on data with 3 labels.
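To make this concrete, here is a minimal sketch of the kind of unit test
this calls for; the import path and LDA usage follow the scikits.learn
layout of the time and may need adjusting, and the data and names are only
illustrative:

import numpy as np
from scikits.learn.lda import LDA

def test_lda_refit():
    rng = np.random.RandomState(0)
    X2, y2 = rng.randn(20, 3), np.repeat([0, 1], 10)      # binary problem
    X3, y3 = rng.randn(30, 3), np.repeat([0, 1, 2], 10)   # 3-class problem

    clf = LDA()
    clf.fit(X2, y2)
    clf.fit(X3, y3)                    # refit the same instance on more labels
    pred = clf.predict(X3)
    assert set(pred) <= set(y3)        # nothing left over from the binary fit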
Fabian Pedregosa
2010-06-01 07:17:13 UTC
Post by Yaroslav Halchenko
Post by Gael Varoquaux
Post by Yaroslav Halchenko
if I create a model (classifier) and fit it to one data set,
should I be able to "refit" it later on on another data without any
side-effects (e.g. crash, incorrect results, etc)?
I believe that this is a goal we had set ourselves. There have been a
few challenges to satisfying this goal, but I'd like to keep to it as
much as possible.
worth a unittest then. I've hit an issue with LDA -- see attached
("git am" - friendly) tentative patch (decide to check with you
first even though it is quite trivial, either it confirms the notion of
estimates you carry through)
Thanks Yaroslav. Could you please push it along with a test case?

BTW, this made me spot other bugs:

http://sourceforge.net/apps/trac/scikit-learn/ticket/60

Thanks,

fabian
Post by Yaroslav Halchenko
I hit it (I believe) by training first on binary data set and then
trying to fit the same instance on data with 3 labels.
Yaroslav Halchenko
2010-06-01 15:05:07 UTC
Post by Fabian Pedregosa
Post by Yaroslav Halchenko
worth a unittest then. I've hit an issue with LDA -- see attached
("git am" - friendly) tentative patch (decide to check with you
first even though it is quite trivial, either it confirms the notion of
estimates you carry through)
Thanks Yaroslav. Could you please push it along with a test case?
ok -- will do
Post by Fabian Pedregosa
http://sourceforge.net/apps/trac/scikit-learn/ticket/60
this one is what I was talking about in terms of "meaningful exceptions"
-- a division by zero is unavoidable there by the definition of LDA, but
it is better to spit out something more informative.
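A rough sketch of what a more informative failure could look like --
purely illustrative, not the actual scikit-learn code -- checking the
labels up front instead of letting the division by zero surface from deep
inside the computation:

import numpy as np

def _check_targets(y):
    """Illustrative pre-check that fails early with a clear message."""
    classes = np.unique(y)
    if classes.size < 2:
        raise ValueError("LDA needs at least 2 classes; got %d" % classes.size)
    for c in classes:
        if np.sum(y == c) < 2:
            raise ValueError("class %r has fewer than 2 samples; the "
                             "within-class covariance is undefined" % c)
    return classes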
Yaroslav Halchenko
2010-06-01 15:38:00 UTC
Post by Fabian Pedregosa
Thanks Yaroslav. Could you please push it along with a test case?
done -- I also pushed one little beautification patch (so my emacs screen
is not all blue from pylint warnings ;) )
j***@gmail.com
2010-05-31 18:26:50 UTC
On Mon, May 31, 2010 at 2:15 PM, Gael Varoquaux wrote:
Post by Gael Varoquaux
Post by Yaroslav Halchenko
if I create a model (classifier) and fit it to one data set,
should I be able to "refit" it later on on another data without any
side-effects (e.g. crash, incorrect results, etc)?
I believe that this is a goal we had set ourselves. There have been a few
challenges to satisfying this goal, but I'd like to keep to it as much as
possible.
Mostly out of curiosity, because I haven't seen a strong case for this
in econometrics yet.

When you refit a model to a new dataset, what are you actually reusing?
Post by Gael Varoquaux
At some point, we will probably add a 'refit' function (or similarily
named) for online fitting, recursive estimation, or warm restart. As all
these functionnality do not satisfy the above criterion.
this I understand, but that is not another dataset.

Josef
Yaroslav Halchenko
2010-05-31 19:31:58 UTC
Post by j***@gmail.com
Post by Gael Varoquaux
Post by Yaroslav Halchenko
if I create a model (classifier) and fit it to one data set,
should I be able to "refit" it later on on another data without any
side-effects (e.g. crash, incorrect results, etc)?
I believe that this is a goal we had set ourselves. There have been a few
challenges to satisfying this goal, but I'd like to keep to it as much as
possible.
Mostly out of curiosity, because I haven't seen a strong case for this
in econometrics yet.
When you refit a model to a new dataset, what are you actually reusing?
shouldn't the answer be "nothing"? ;)
Post by j***@gmail.com
Post by Gael Varoquaux
At some point, we will probably add a 'refit' function (or similarily
named) for online fitting, recursive estimation, or warm restart. As all
these functionnality do not satisfy the above criterion.
this I understand, but is not a another dataset.
for "warm restart" (if I got it correctly) it could be really a "new"
dataset. E.g. if algorithm guarantees convergence (e.g. SVM?), i.e.
starting point for optimization shouldn't matter, it might be
beneficial to start from a location found on a previous dataset to
achieve faster convergence (e.g. during iterative cross-validation or
any other statistics bootstrapping).
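As a sketch of that idea only (not an actual scikit-learn API -- assume a
hypothetical estimator whose fit() accepts an initial guess and exposes
coef_ afterwards):

import numpy as np

def warm_cv_scores(clf, X, y, splits):
    scores = []
    coef = None
    for train, test in splits:
        # start each fold's optimization from the previous fold's solution
        clf.fit(X[train], y[train], coef_init=coef)
        coef = clf.coef_
        scores.append(np.mean(clf.predict(X[test]) == y[test]))
    return scores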
Alexandre Gramfort
2010-05-31 19:35:32 UTC
Post by Yaroslav Halchenko
for "warm restart" (if I got it correctly) it could be really a "new"
dataset.  E.g. if algorithm guarantees convergence (e.g. SVM?), i.e.
starting point for optimization shouldn't matter, it might be
beneficial to start from a location found on a previous dataset to
achieve faster convergence (e.g. during iterative cross-validation or
any other statistics bootstrapping).
It can also be the same dataset with a different regularization parameter.

That is how the lasso path is currently computed.
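Roughly, the idea is (a sketch only; coordinate_descent here stands in for
the actual solver and is not a real function name):

import numpy as np

def lasso_path(X, y, alphas):
    coef = np.zeros(X.shape[1])
    path = []
    for alpha in sorted(alphas, reverse=True):   # strongest penalty first
        # warm start: reuse the previous solution as the starting point
        coef = coordinate_descent(X, y, alpha, coef_init=coef)
        path.append(coef.copy())
    return path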

Alex
j***@gmail.com
2010-05-31 20:07:39 UTC
Post by Yaroslav Halchenko
Post by j***@gmail.com
Post by Gael Varoquaux
Post by Yaroslav Halchenko
if I create a model (classifier) and fit it to one data set,
should I be able to "refit" it later on on another data without any
side-effects (e.g. crash, incorrect results, etc)?
I believe that this is a goal we had set ourselves. There have been a few
challenges to satisfying this goal, but I'd like to keep to it as much as
possible.
Mostly out of curiosity, because I haven't seen a strong case for this
in econometrics yet.
When you refit a model to a new dataset, what are you actually reusing?
shouldn't the answer be "nothing"? ;)
If you don't reuse anything, then why don't you just create a new
instance for the new data sets? This has been our practice so far in
statsmodels, and we thought it's easier to do than to (selectively)
wipe the memory of an existing instance.

The only case for us where this might imply some extra cost is if, in
"supervised learning", the same design matrix is fit to different
endogenous data sets and there are costly internal transformations of
the design matrix.
Post by Yaroslav Halchenko
Post by j***@gmail.com
Post by Gael Varoquaux
At some point, we will probably add a 'refit' function (or similarily
named) for online fitting, recursive estimation, or warm restart. As all
these functionnality do not satisfy the above criterion.
this I understand, but is not a another dataset.
for "warm restart" (if I got it correctly) it could be really a "new"
dataset.  E.g. if algorithm guarantees convergence (e.g. SVM?), i.e.
starting point for optimization shouldn't matter, it might be
beneficial to start from a location found on a previous dataset to
achieve faster convergence (e.g. during iterative cross-validation or
any other statistics bootstrapping).
For non-linear estimators, we have the starting value as an option to fit.

With cross-validation and bootstrapping we are not far enough along and
haven't formalized any structure yet, but I assume that if creating a new
instance is too costly, then we will write model-specific
bootstrapping and cross-validation code.

Since we have less experience with larger data sets, I keep looking
over some shoulders.

Josef
Yaroslav Halchenko
2010-06-01 01:30:13 UTC
Post by j***@gmail.com
Post by Yaroslav Halchenko
Post by j***@gmail.com
Post by Gael Varoquaux
I believe that this is a goal we had set ourselves. There have been a few
challenges to satisfying this goal, but I'd like to keep to it as much as
possible.
Mostly out of curiosity, because I haven't seen a strong case for this
in econometrics yet.
When you refit a model to a new dataset, what are you actually reusing?
shouldn't the answer be "nothing"? ;)
If you don't reuse anything, then why don't you just create a new
instance for the new data sets.
what about cross-validation and bootstrapping? E.g., the user provides an
existing instance (possibly not yet trained, of course) and I need to
evaluate some measure on that model based on some sampling/splitting of
the dataset at hand. It would be somewhat of an overkill to ask the user
to provide instances for each split, or to deepcopy the original instance
for each split. Naturally, imho, it should be the same instance used over
and over in a loop.
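A minimal sketch of that reuse pattern (placeholder names, not a specific
library's API): the caller hands in one configured, possibly untrained
estimator, and the loop refits that same instance on every split:

import numpy as np

def cross_val_scores(clf, X, y, splits):
    scores = []
    for train, test in splits:
        clf.fit(X[train], y[train])     # refit the very same instance
        scores.append(np.mean(clf.predict(X[test]) == y[test]))
    return scores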
Post by j***@gmail.com
With cross-validation and bootstrapping we are not far enough and
haven't formalized any structure yet, but I assume if creating a new
instance is too costly, then we will write model specific
bootstrapping and cross-validation code.
hm... if I got it right, that sounds like overkill.
j***@gmail.com
2010-06-01 06:36:10 UTC
Post by Yaroslav Halchenko
Post by j***@gmail.com
Post by Yaroslav Halchenko
Post by j***@gmail.com
Post by Gael Varoquaux
I believe that this is a goal we had set ourselves. There have been a few
challenges to satisfying this goal, but I'd like to keep to it as much as
possible.
Mostly out of curiosity, because I haven't seen a strong case for this
in econometrics yet.
When you refit a model to a new dataset, what are you actually reusing?
shouldn't the answer be "nothing"? ;)
If you don't reuse anything, then why don't you just create a new
instance for the new data sets.
what about cross-validation and bootstrapping? e.g. user provides
existing instance (could be not yet trained of cause) and I need to
evaluate some measure on that model based on some sampling/splitting of
dataset at hands.  It would be somewhat overkill to ask user to provide
instances for each split, or to do deepcopying of original instance for
each split.  Natively, imho, it should be the same instance used over in
a loop.
My draft version of cross-validation, which I wrote for example for
principal component regression, creates a new instance, and in this case
there is nothing to reuse, e.g. for leave-P-out.

For OLS cross-validation or bootstrap, it would be easy:

def ols_bootstrap(endog, exog, n):
    params = []
    for idx in bootstrap_iterator(n):   # resampling indices; or (train, test) from a crossval iterator
        params.append(sm.OLS(endog[idx], exog[idx, :]).fit().params)
    return params

sm.OLS(..) creates a new instance instead of wiping and reusing an
existing instance.

In statsmodels, most results except for the minimal ones are by now lazy
and are not calculated until requested. I haven't tried yet to figure
out what "state" in pymvpa really does, and I don't know what your
overhead for instance creation is. That's one of the reasons for my
initial question.

and our internal code for generalized linear models, which works by
iteratively updating the weights matrix, roughly follows a pattern like
this (from memory):

while not finished:          # IRLS loop, sketched from memory
    ...
    newres = sm.WLS(endog, exog, weights=new_weights)

which creates a new WLS (weighted least squares) instance in each iteration.
I was initially reluctant last year, when Skipper and Alan proposed the
design change, about whether a new instance is useful, but I finally
agreed, since there is little additional cost besides instance creation
and it makes for a cleaner design.
Post by Yaroslav Halchenko
Post by j***@gmail.com
With cross-validation and bootstrapping we are not far enough and
haven't formalized any structure yet, but I assume if creating a new
instance is too costly, then we will write model specific
bootstrapping and cross-validation code.
hm... if I got it right, sounds like overkill.
Somewhat related: a while ago I wrote a standard CUSUM test for
structural breaks, which is still standalone. This case requires
recursive residuals.

This would roughly be:

resid = []
for i in range(start, nobs):
    fit_i = sm.OLS(y[:i], x[:i, :]).fit()          # fit on all observations before i
    resid.append(y[i] - fit_i.model.predict(x[i, :]))

which is very inefficient in this form, so I wrote an online
estimator for the params that updates the inverse of X'X, which is
the standard approach for this case.
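For reference, a minimal sketch of that kind of online update (recursive
least squares via the Sherman-Morrison identity; illustrative, not the
code referred to above):

import numpy as np

def rls_update(XtX_inv, beta, x_new, y_new):
    """Fold one new observation (x_new, y_new) into (X'X)^-1 and the coefficients."""
    x = x_new.reshape(-1, 1)
    gain = XtX_inv.dot(x) / (1.0 + x.T.dot(XtX_inv).dot(x))
    beta = beta + gain.ravel() * (y_new - x.ravel().dot(beta))
    XtX_inv = XtX_inv - gain.dot(x.T.dot(XtX_inv))
    return XtX_inv, beta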

Josef
Yaroslav Halchenko
2010-06-01 15:51:33 UTC
Post by j***@gmail.com
In statsmodels, most except for the minimum results are by now lazy,
and are not calculated until requested. I haven't tried yet to figure
out what "state" in pymvpa really does and I don't know what your
overhead for instance creation is. That's one of the reasons for my
initial question.
;) "state" is no longer there -- welcome 'conditional attribute', so
collection of those is .ca now (in 0.5 which is under development).

Easy example of what it would be useful for, is scikit-learn's LDA:

def predict(self, X):
    probas = self.predict_proba(X)
    y_pred = self.classes[probas.argmax(1)]
    return y_pred

so, if I want to know both the predictions (output of predict) and the
probabilities for each class (output of predict_proba), I would need to
either

* call predict and predict_proba separately, thus invoking predict_proba twice, or
* call predict_proba and manually deduce the classes with argmax.

In PyMVPA I would enable something like a 'probabilities' conditional
attribute, which would be set in predict, e.g. something along the lines of:


def predict(self, X):
    self.ca.probabilities = probas = self.predict_proba(X)
    y_pred = self.classes[probas.argmax(1)]
    return y_pred

so, if it is enabled, self.ca.probabilities will maintain the values of
the probabilities.
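Hypothetical usage from the user's side (the exact enabling call and names
may differ in PyMVPA; illustrative only):

clf.ca.enable('probabilities')    # opt in to storing this conditional attribute
y_pred = clf.predict(X_test)      # one pass through predict...
probas = clf.ca.probabilities     # ...gives both predictions and per-class probabilities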


But the overall question is not about instance creation overhead -- it is,
imho, about API and abstraction. I see no objective reason why an instance
cannot be retrained, or why cross-validation should be model-specific in
most cases. Relying on creating a new instance within cross-validation
precludes clean propagation of the parametrization the user gave the model
(unless you use the suggested factory method, or have the cross-validation
function swallow explicit arguments to pass on to the constructor). But
sure thing -- in your case it might be different, and such instance
creation may be the more natural thing to do. In your scenarios models
seem to be more lightweight, throw-away things... but I still see no
reason why they should not be reliably refittable ;)
j***@gmail.com
2010-06-01 16:37:45 UTC
Post by Yaroslav Halchenko
Post by j***@gmail.com
In statsmodels, most except for the minimum results are by now lazy,
and are not calculated until requested. I haven't tried yet to figure
out what "state" in pymvpa really does and I don't know what your
overhead for instance creation is. That's one of the reasons for my
initial question.
;) "state" is no longer there -- welcome 'conditional attribute', so
collection of those is .ca now (in 0.5 which is under development).
       probas = self.predict_proba(X)
       y_pred = self.classes[probas.argmax(1)]
       return y_pred
so, if I want to know both predictions (output of predict) and probabilities
per each class (output of predict_proba), I would need to either
* call predict and predict_proba separately, thus invoking predict_proba twice
* call predict_proba and manually deduce classes with argmax
in PyMVPA I would enable something like 'probabilities' conditional attribute which would be set in predict, e.g. smth along the
       self.ca.probabilities = probas = self.predict_proba(X)
       y_pred = self.classes[probas.argmax(1)]
       return y_pred
so, it is enabled -- self.ca.probabilities will maintain values of
probabilities.
But overall question is not about instance creation overhead -- it is imho
about API and abstraction.  I see no objective reason why an instance cannot be
retrained, and why cross-validation should be model-specific in most of the
cases.  Relying on creating a new instance within cross-validation precludes
clean propagation (unless suggested factory method or explicit 'swallow'
arguments of cross-validation function to be passed to the constructor) of
parametrization for a model given by a user.  But sure thing -- in your case it
might be different and such instances creations are more native thing to do.
In your scenarios models seems to be more lightweight throw out things... but I
still see no reason why they should not be reliable refit ;)
Thanks for the explanation, I will look at the internal design
differences more carefully again. I also realized I hadn't read your
previous message carefully enough; maybe going through a round of emails
in the middle of the night is not the best approach.

Josef
Gael Varoquaux
2010-06-01 23:03:52 UTC
Post by Yaroslav Halchenko
;) "state" is no longer there -- welcome 'conditional attribute', so
collection of those is .ca now (in 0.5 which is under development).
probas = self.predict_proba(X)
y_pred = self.classes[probas.argmax(1)]
return y_pred
so, if I want to know both predictions (output of predict) and probabilities
per each class (output of predict_proba), I would need to either
* call predict and predict_proba separately, thus invoking predict_proba twice
* call predict_proba and manually deduce classes with argmax
in PyMVPA I would enable something like 'probabilities' conditional attribute which would be set in predict, e.g. smth along the
self.ca.probabilities = probas = self.predict_proba(X)
y_pred = self.classes[probas.argmax(1)]
return y_pred
so, it is enabled -- self.ca.probabilities will maintain values of
probabilities.
I must admit that I am uneasy with such code. I have used a fair amount
of code that does this, and I have been out-smarted more than once.

The problem that you are trying to address here is that of
lazy re-evaluation: if an operation is to be done several times with the
same input, there is no need to perform the computation more than once;
the result can simply be stored.

Now, given a stored result, how do I know that it corresponds to the
input that I have in mind? Once a library starts having these behaviors,
I am worried each time I call a method (will it have a side effect I
should be aware of?) or pass my object to a function (will it fit/predict
on data, and thus modify my object?). This makes my reading of the code
much more tedious, because I can't think of functions as black boxes.

I am not saying that I have a good solution to propose, I am just
pointing out why I try to minimize such behavior.
Post by Yaroslav Halchenko
But overall question is not about instance creation overhead -- it is
imho about API and abstraction. I see no objective reason why an
instance cannot be retrained, and why cross-validation should be
model-specific in most of the cases. Relying on creating a new
instance within cross-validation precludes clean propagation (unless
suggested factory method or explicit 'swallow' arguments of
cross-validation function to be passed to the constructor) of
parametrization for a model given by a user. But sure thing -- in your
case it might be different and such instances creations are more native
thing to do. In your scenarios models seems to be more lightweight
throw out things... but I still see no reason why they should not be
reliable refit ;)
+1 on that. I believe that having to throw away an instance is going to
confuse the user. Thus, if we want to do that, we need to raise an error
when a model is fitted a second time.

My 2 tired cents,

Gaël
Yaroslav Halchenko
2010-06-01 23:35:01 UTC
Post by Gael Varoquaux
Now, given a result stored, how do I know that the result correspond to
the input that have in mind?
it corresponds to the most recent evaluation -- upon a subsequent call to
train() (your fit()), for instance, all conditional attributes get reset
to the 'not set' state.

lazy evaluation is indeed something in line with this, BUT it might be
more cumbersome to control. But sure thing -- it all depends ;)
Gael Varoquaux
2010-06-02 05:57:41 UTC
Post by Yaroslav Halchenko
Post by Gael Varoquaux
Now, given a result stored, how do I know that the result correspond to
the input that have in mind?
it corresponds to the most recent evaluation -- upon subsequent call to
train() (yours fit()) for instance all conditional attributes get reset
to 'not set' state.
Right, but my point is: if your code is not trivial, how do you know that
'predict' has not been called a second time between the moment you did
your predict and the moment you are grabbing the probabilities?
Post by Yaroslav Halchenko
lazy evaluation is indeed something in line with this, BUT might be more
cumbersome to control. But sure thing -- everything depends ;)
I was just using the term 'lazy re-evaluation' as a general name for
this pattern.

Gaël

Alexandre Gramfort
2010-06-01 09:57:06 UTC
Post by Yaroslav Halchenko
what about cross-validation and bootstrapping? e.g. user provides
existing instance (could be not yet trained of cause) and I need to
evaluate some measure on that model based on some sampling/splitting of
dataset at hands.  It would be somewhat overkill to ask user to provide
instances for each split, or to do deepcopying of original instance for
each split.  Natively, imho, it should be the same instance used over in
a loop.
For example, a Lasso instance does a warm restart by default.
If you want to have a different instance for every fold inside a
function, the solution I've adopted is to pass the function a factory
that returns a new instance when called.
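A minimal sketch of that factory pattern (placeholder names, not a
specific library's API):

import numpy as np

def cross_val_scores(make_clf, X, y, splits):
    scores = []
    for train, test in splits:
        clf = make_clf()                # fresh, identically parametrized instance per fold
        clf.fit(X[train], y[train])
        scores.append(np.mean(clf.predict(X[test]) == y[test]))
    return scores

# usage: the factory captures the user's parametrization, e.g.
# cross_val_scores(lambda: Lasso(alpha=0.1), X, y, splits)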

Alex