Discussion:
"MemoryError() in 'sklearn.tree._tree.Tree._resize' ignored"
c TAKES
2014-09-20 19:29:47 UTC
Permalink
Any help someone might be able to provide with this error would be greatly
appreciated.

I was trying to familiarize myself with sklearn's decision tree source code
by playing around with some of the lower-level classes and functions on a
portion of the iris data set. I'm trying to build a single tree using
BestFirstTreeBuilder, with code I borrowed from tree.py. The last
line in my code triggers the error:

builder.build(tree_, x, y)

The full error is:

Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize'
ignored


I've attached the full code I'm using (one page) and also pasted it below.
The version of sklearn being used is the pip-installed version rather than
the most recent version on GitHub. I'm also running this on Windows.

Anyone recognize this error?

Best, and thanks so much!
Ken



import numpy as np

from sklearn import datasets
from sklearn.tree import _tree
from sklearn.tree._tree import BestFirstTreeBuilder, Tree
from sklearn.utils import check_random_state

iris = datasets.load_iris()
x = iris.data[:, :2]  # only the first two features
y = iris.target

CRITERIA_CLF = {"gini": _tree.Gini, "entropy": _tree.Entropy}
CRITERIA_REG = {"mse": _tree.MSE, "friedman_mse": _tree.FriedmanMSE}
SPLITTERS = {"best": _tree.BestSplitter,
             "presort-best": _tree.PresortBestSplitter,
             "random": _tree.RandomSplitter}

criterion = "gini"
splitter = "best"
min_samples_split = 2
min_samples_leaf = 1
#min_weight_leaf = 0
max_depth = (2 ** 31) - 1
max_leaf_nodes = (2 ** 31) - 1

random_state = 0
random_state = check_random_state(random_state)

y = np.reshape(y, (-1, 1))

n_samples, n_features_ = x.shape
max_features_ = n_features_

n_outputs_ = y.shape[1]
n_classes_ = [1] * n_outputs_
n_classes_ = np.array(n_classes_, dtype=np.intp)

criterion = CRITERIA_CLF[criterion](n_outputs_, n_classes_)

splitter = SPLITTERS[splitter](criterion, max_features_,
                               min_samples_leaf, random_state)

# Note: the current pip version of sklearn omits min_weight_leaf
builder = BestFirstTreeBuilder(splitter, min_samples_split,
                               min_samples_leaf,
                               #min_weight_leaf,
                               max_depth, max_leaf_nodes)

tree_ = Tree(n_features_, n_classes_, n_outputs_)

builder.build(tree_, x, y)
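For intuition about what BestFirstTreeBuilder does (as opposed to the default
depth-first builder), here is a small, self-contained sketch of best-first
expansion. This is my own toy illustration, not scikit-learn's Cython code:
the `candidates` dict is a hypothetical stand-in for real split search, and
frontier nodes sit in a priority queue keyed by impurity improvement, expanded
greedily until the max_leaf_nodes budget is spent.

```python
import heapq

def best_first_growth(candidates, max_leaf_nodes):
    """Greedy best-first expansion: repeatedly split the frontier node with
    the largest impurity improvement until the leaf budget is exhausted.

    `candidates` maps node id -> (improvement, child ids); a node with no
    entry is terminal. Returns the node ids in the order they were split."""
    heap = [(-candidates[0][0], 0)]  # max-heap via negated improvement; root = 0
    leaves, expanded = 1, []
    while heap and leaves < max_leaf_nodes:
        neg_imp, node = heapq.heappop(heap)
        expanded.append(node)
        leaves += 1  # splitting one leaf yields two children, net +1 leaf
        for child in candidates[node][1]:
            if child in candidates:  # child can itself be split later
                heapq.heappush(heap, (-candidates[child][0], child))
    return expanded

# Toy tree: node -> (impurity improvement if split, children)
cands = {0: (10.0, (1, 2)), 1: (1.0, (3, 4)), 2: (5.0, (5, 6))}
print(best_first_growth(cands, max_leaf_nodes=3))  # -> [0, 2]
```

Node 2 is expanded before node 1 because its improvement (5.0) is larger,
which is the whole point of best-first growth under a leaf budget.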
I read the RGF paper. Interesting method but definitely too early to
include it in scikit-learn (we focus on mature algorithms). This also looks
more complicated to implement than gradient boosting, since tree induction
and boosting are interleaved.
The paper also clarified what "fully corrective" means, thanks. To
summarize, here are different strategies for setting the estimator weights
in gradient boosting:
1. using the same constant value (`learning_rate`) for all estimators
2. setting the weight of the last estimator by line search (other weights
are kept fixed)
3. setting one separate weight per leaf node of the last estimator, by
line search
4. refitting all estimator weights (including the past ones)
5. refitting all leaf nodes of all estimators?
Some authors [1] argued that option 1 is sufficient in practice to get
good performance, since our goal is to do well on test data, not training
data.
Option 3 is what scikit-learn implements. Option 4 is the fully corrective
case. It basically amounts to least squares for the square loss or to
logistic regression for the log loss (using each estimator's predictions as
features). One thing, though, is that our implementation doesn't store the
estimator weights explicitly (leaves are updated directly). Setting a
penalty (l1 or l2) on the estimator weights (i.e., on the functional)
should prevent overfitting. Option 5 sounds pretty computationally
expensive, although the update doesn't need to be done at every iteration.
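To make options 2 and 4 concrete for the square loss, here is a small numpy
sketch (my own illustration, not scikit-learn code). The data, `H`, and
`w_prev` are made up for the example; each column of `H` holds one
estimator's predictions.

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.randn(50)              # regression targets (square loss)
H = rng.randn(50, 3)           # column j = predictions of estimator j

# Option 2: line-search weight for the *last* estimator only, holding the
# previous weights fixed. For the square loss the 1-d minimizer is closed
# form: w_last = <residual, h_last> / <h_last, h_last>
w_prev = np.array([0.3, 0.7])  # illustrative weights of the first two estimators
residual = y - H[:, :2] @ w_prev
h_last = H[:, 2]
w_last = residual @ h_last / (h_last @ h_last)

# Option 4 (fully corrective): refit *all* weights jointly. For the square
# loss this is exactly least squares with estimator predictions as features.
w_full, *_ = np.linalg.lstsq(H, y, rcond=None)

# The fully corrective fit can only lower the training loss, since it
# minimizes over a superset of the option-2 search space:
loss_partial = np.sum((y - H @ np.append(w_prev, w_last)) ** 2)
loss_full = np.sum((y - H @ w_full) ** 2)
assert loss_full <= loss_partial + 1e-12
```

For the log loss, the same refit would be a logistic regression on the
columns of `H`, as described above.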
I recently re-implemented gradient boosting [2]. One difference in my
implementation is that it is possible to use any base learner (not just
trees). So my implementation stores estimator weights explicitly and uses
option 2 above. The fully corrective updates (option 4) might be easier to
add to my implementation.
BTW, the notion of fully corrective updates is also present in the
matching pursuit (-> orthogonal matching pursuit) and Frank-Wolfe
literature.
Mathieu
[1] "Boosting Algorithms: Regularization, Prediction and Model Fitting",
Peter Bühlmann and Torsten Hothorn (thanks to Peter for telling me about
this paper)
[2]
https://github.com/mblondel/ivalice/blob/master/ivalice/impl/gradient_boosting.py
Yes - in fact my real goal is ultimately to implement RGF, though I had
considered building/forking off the current GradientBoostingRegressor
as a starting point, (a) because I'm new to developing for scikit-learn
and (b) to maintain as much consistency as possible with the rest of the
package.
Upon further review, though, I don't think there's much point in adding
fully corrective updates to GBR, because without the regularization (the
rest of RGF) it is probably useless and leads to overfitting, as per
http://stat.rutgers.edu/home/tzhang/software/rgf/tpami14-rgf.pdf.
So it would likely be more useful to go ahead and create RGF as an
entirely separate module. But that could take some time, of course.
Is anyone working on RGF for sklearn?
Arnaud, thanks for directing me to your sparse implementation, I will
have a look! It would certainly be excellent to have this work for all
decision tree algorithms.
Ken
On Tue, Sep 16, 2014 at 11:16 AM, Peter Prettenhofer wrote:
The only reference I know is the Regularized Greedy Forest paper by
Johnson and Zhang [1].
I haven't read the primary source (by Zhang as well).
[1] http://arxiv.org/abs/1109.0887
Could you give a reference for gradient boosting with fully corrective
updates?
Since the philosophy of gradient boosting is to fit each tree against
the residuals (or negative gradient) so far, I am wondering how such a
fully corrective update would work...
Mathieu
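The stage-wise residual-fitting philosophy mentioned above can be sketched
as a toy numpy implementation (my own illustration with a hypothetical
`fit_stump` helper, not scikit-learn's code): each stage fits a weak learner
to the current residuals and adds it with a constant learning rate (option 1
in Mathieu's list).

```python
import numpy as np

def fit_stump(x, r):
    """Best single-threshold stump for residuals r under the square loss:
    predict the mean residual on each side of the variance-minimizing split."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x: np.where(x <= t, lv, rv)

def gradient_boost(x, y, n_stages=20, learning_rate=0.5):
    pred = np.zeros_like(y)
    for _ in range(n_stages):
        stump = fit_stump(x, y - pred)          # fit the current residuals
        pred = pred + learning_rate * stump(x)  # stage-wise additive update
    return pred

x = np.linspace(0, 1, 40)
y = np.sin(4 * x)
pred = gradient_boost(x, y)
print(np.mean((y - pred) ** 2))  # training MSE shrinks as stages are added
```

A fully corrective variant would, after each stage, jointly refit the
weights of all stumps fitted so far instead of freezing the past ones,
which is exactly where the question above comes from.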
Is anyone working on making GradientBoostingRegressor work with
sparse matrices?
Or is anyone working on adding an option for fully corrective gradient
boosting, i.e. all trees in the ensemble are re-weighted at each iteration?
These are things I would like to see and may be able to help with if
no one is currently working on them.
------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce.
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer
Lars Buitinck
2014-09-20 22:06:05 UTC
Permalink
Post by c TAKES
Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize'
ignored
Anyone recognize this error?
All too well, but I thought it was fixed for good last time we went
through the tree growing code. Which version exactly are you using?
(sklearn.__version__ holds the version number.)
c TAKES
2014-09-21 04:42:35 UTC
Permalink
looks like it is version 0.15.2

I figured it must have been something I did wrong rather than a bug because
I think this is the same version as the last time I used sklearn decision
trees (random forest) and don't remember having a problem.
Post by Lars Buitinck
Post by c TAKES
Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize'
ignored
Anyone recognize this error?
All too well, but I thought it was fixed for good last time we went
through the tree growing code. Which version exactly are you using?
(sklearn.__version__ holds the version number.)
Lars Buitinck
2014-09-22 09:30:29 UTC
Permalink
Post by c TAKES
looks like it is version 0.15.2
Confirmed.
Post by c TAKES
I figured it must have been something I did wrong rather than a bug because
I think this is the same version as the last time I used sklearn decision
trees (random forest) and don't remember having a problem.
No, it's a bug, but it's one I thought we fixed (painstakingly).
Lars Buitinck
2014-09-22 09:35:13 UTC
Permalink
Post by Lars Buitinck
Post by c TAKES
looks like it is version 0.15.2
Confirmed.
https://github.com/scikit-learn/scikit-learn/issues/3684
c TAKES
2014-09-23 04:38:00 UTC
Permalink
Thanks! What should be the proper behavior when I run the script I wrote?

I tried uninstalling and reinstalling from pip, but no success (same
error), so I assume I have to build to get the bleeding edge update.

Probably serves me right for using Windows, but I've been having trouble
building scikit-learn from source. I followed all the steps here
to build on Windows, including installing SDK 7.0 (I already had 7.1, and
I'm using Python 2):

http://scikit-learn.org/stable/install.html#id1

but get the following error:

Building msvcr library: "C:\**HIDEUSERPATH**\Anaconda\libs\libmsvcr90d.a"
(from
:\Windows\winsxs\amd64_microsoft.vc90.debugcrt_1fc8b3b9a1e18e3b_9.0.21022.8_none_4ec74c6b3093419c\msvcr90d.dll)
error: The system cannot find the file specified

I assume the file it can't find is libmsvcr90d.a, because there isn't one in
that directory (though the other file doesn't exist in the specified
directory either).

I do have versions of both libmsvcr90d.a and msvcr90d.dll, but in different
locations. Is my problem that those files aren't where they should be or
that the setup routine is looking in the wrong place?

Here are the locations of those files on my computer:

libmsvcr90d.a:

"C:\MinGW\libmsvcr90d.a"

"C:\MinGW\lib\libmsvcr90d.a"

"C:\**HIDEUSERPATH**\Anaconda\MinGW\x86_64-w64-mingw32\lib\libmsvcr90d.a"

msvcr90d.dll:

"C:\Program Files (x86)\Microsoft Visual Studio
9.0\VC\redist\Debug_NonRedist\x86\Microsoft.VC90.DebugCRT\msvcr90d.dll"

"C:\Program Files (x86)\Microsoft Visual Studio
9.0\VC\redist\Debug_NonRedist\amd64\Microsoft.VC90.DebugCRT\msvcr90d.dll"
Post by Lars Buitinck
Post by Lars Buitinck
Post by c TAKES
looks like it is version 0.15.2
Confirmed.
https://github.com/scikit-learn/scikit-learn/issues/3684
Olivier Grisel
2014-09-23 09:37:50 UTC
Permalink
Post by c TAKES
Thanks! What should be the proper behavior when I run the script I wrote?
I tried uninstalling and reinstalling from pip, but no success (same error),
so I assume I have to build to get the bleeding edge update.
Probably serves me right for using Windows, but I've been having trouble
building scikit-learn from the source code. I followed all the steps here
to build on windows, including installing the sdk 7.0 (already had 7.1, and
http://scikit-learn.org/stable/install.html#id1
Building msvcr library: "C:\**HIDEUSERPATH**\Anaconda\libs\libmsvcr90d.a"
(from
:\Windows\winsxs\amd64_microsoft.vc90.debugcrt_1fc8b3b9a1e18e3b_9.0.21022.8_none_4ec74c6b3093419c\msvcr90d.dll)
error: The system cannot find the file specified
The build instructions on scikit-learn.org are only valid for a Python
installed from the official python.org website, which is built with
MSVC.

Apparently Anaconda is built with its own embedded version of the
MinGW compiler. I don't think that a scikit-learn binary built with
MSVC can be used within the anaconda distribution.

For building packages on an Anaconda distribution I think one is
supposed to use the `conda pipbuild` command but I do not have much
experience with that.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lars Buitinck
2014-09-23 10:27:19 UTC
Permalink
Post by c TAKES
Thanks! What should be the proper behavior when I run the script I wrote?
A MemoryError. The bug is already fixed in master.
