Discussion: covertype benchmark and unexpected extra trees and random forest results
Peter Prettenhofer
2012-03-27 09:34:37 UTC
Paolo,

I noticed that too - maybe @glouppe can comment on this - I think the
reason was a change in the ``n_features`` heuristic but I might be
mistaken.

Concerning the GaussianNB - there's a PR [1] addressing a critical bug
in the estimator - it should be merged ASAP. Furthermore, the test time is
quite low - this might be due to memory layout issues - SGDClassifier
converts ``coef_`` to fortran-style for increased test-time
performance.
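For the curious, a small sketch of the layout point (tiny synthetic data,
assuming a reasonably recent scikit-learn; it only inspects the memory
layout of ``coef_``, the predictions themselves don't depend on it):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Tiny synthetic problem, purely to look at the layout of the fitted weights.
rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = SGDClassifier(random_state=0).fit(X, y)
print(clf.coef_.flags['F_CONTIGUOUS'])    # is the weight matrix stored fortran-style?

# np.asfortranarray is the kind of conversion meant above; the values are untouched.
w_f = np.asfortranarray(clf.coef_)
assert np.allclose(w_f, clf.coef_)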

best,
Peter

[1] https://github.com/scikit-learn/scikit-learn/pull/731
Hi all,
I've just run bench_covertype on today's master.
I needed to uncomment the ExtraTrees and RandomForest benchmarks.
Classifier   train-time test-time error-rate
--------------------------------------------
Liblinear     13.5609s   0.0683s     0.2307
GaussianNB    3.6565s    0.1753s     0.6367
SGD           0.4522s    0.0170s     0.2300
CART          35.3378s   0.0375s     0.0476
RandomForest 246.8737s   0.6908s     0.0807
Extra-Trees  182.0412s   0.6269s     0.1986
Classifier   train-time test-time error-rate
--------------------------------------------
Liblinear      11.8977s   0.0285s     0.2305
GaussianNB      3.5931s   0.6645s     0.3633
SGD             0.2924s   0.0114s     0.2300
CART           39.9829s   0.0345s     0.0476
RandomForest  794.6232s   1.0526s     0.0249
Extra-Trees  1401.7051s   1.1181s     0.0230
Unless I'm missing something obvious I'll open a ticket
and try to give git bisect a run ...
Thanks!
Paolo
PS: I just noticed that the GaussianNB results are also worse...
--
Peter Prettenhofer
Paolo Losi
2012-03-27 09:56:29 UTC
Thanks Peter,

On Tue, Mar 27, 2012 at 11:34 AM, Peter Prettenhofer wrote:
Post by Peter Prettenhofer
Paolo,
I noticed that too - maybe @glouppe can comment on this - I think the
reason was a change in the ``n_features`` heuristic but I might be
mistaken.
Gilles, can you take a quick look at it? If it's not anything obvious, just
ping back and I'll try to git bisect the issue...
Post by Peter Prettenhofer
Concerning the GaussianNB - there's a PR [1] addressing a critical bug
in the estimator - it should be merged ASAP.
Thanks. I've commented on the PR (the performance regression doesn't
seem to be connected with the PR).
Post by Peter Prettenhofer
Furthermore, test time is
quite low - this might be due to memory layout issues - SGDClassifier
converts ``coef_`` to fortran-style for increased test-time
performance.
Clear.

Thanks again

Paolo
Gilles Louppe
2012-03-27 11:19:21 UTC
Hi,

I am running the tests again, but indeed I think the difference in the
results comes from the fact that max_features=sqrt(n_features) is now the
default, whereas it was max_features=n_features before.

Gilles
Peter Prettenhofer
2012-03-27 11:38:26 UTC
Interesting - covtype involves a number of categorical attributes
which are represented via a one-hot encoding - do you think that such
a representation has a significant effect on feature sampling and thus
the performance of random forests?
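A rough back-of-the-envelope check (using the usual covtype layout of 10
numeric plus 44 one-hot columns; the interpretation is of course only a
guess):

import math

n_numeric = 10           # elevation, slopes, distances, hillshade, ...
n_one_hot = 4 + 40       # wilderness-area + soil-type indicator columns
n_features = n_numeric + n_one_hot          # 54 columns in total

max_features = int(math.sqrt(n_features))   # the new classification default
print(n_features, max_features)             # 54 features -> 7 candidates per split

# With 44 of the 54 columns being binary indicators, a random draw of 7
# candidate features frequently contains no numeric column at all:
p_no_numeric = math.comb(44, 7) / math.comb(54, 7)
print(round(p_no_numeric, 3))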
--
Peter Prettenhofer
Gilles Louppe
2012-03-27 12:13:21 UTC
Hi,

Using max_features="auto" (default setting) indeed yields the results
that Paolo reports.

When setting max_features=None (i.e., using all features as in our
earlier code), I got the following on my machine:

RandomForest 778.1471s 1.2830s 0.0248
Extra-Trees 1325.2397s 1.3544s 0.0199

which is consistent with what is mentioned in the doc.

@pprett: Since max_features=sqrt(n_features) is now the default on
classification problems, the trees are usually more randomized, hence
have a higher bias. To compensate for that, more trees usually need to
be built, whereas we only use 20 trees in the benchmark (which is low
in my opinion). The effect of max_features is very dataset specific
though. On some problems, decreasing max_features does not impair
performance as much as it does here on covertype. I am not sure whether
one-hot encoding is causing this.
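To see the effect without the full covertype run, here is a quick sketch
with synthetic data (illustrative only, and assuming a recent scikit-learn
where the sqrt default is spelled "sqrt" rather than "auto"):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for covertype: many features, few of them informative.
X, y = make_classification(n_samples=5000, n_features=54, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for max_features in (None, "sqrt"):   # None = all features (old behaviour)
    clf = RandomForestClassifier(n_estimators=20, max_features=max_features,
                                 random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    print(max_features, round(1.0 - clf.score(X_test, y_test), 4))   # error rate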

Best,

Gilles
Paolo Losi
2012-03-27 12:50:43 UTC
Gilles,

thank you very much for having checked.

If everyone agrees I'll:

- uncomment extratrees and randomforest benchmark (@pprett is there
any valid reason to leave them out?)
- explicitly configure max_features=None for RandomForest and ExtraTrees
(see the sketch below)
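i.e., something along these lines (just a sketch; the actual
bench_covertype.py layout may differ):

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Run the tree ensembles with all features considered at each split,
# matching the behaviour of the earlier default.
ESTIMATORS = {
    "RandomForest": RandomForestClassifier(n_estimators=20, max_features=None),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=20, max_features=None),
}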

Thanks again

Paolo
--
Paolo Losi
e-mail: ***@enuan.com
mob: +39 348 7705261

ENUAN Srl
Via XX Settembre, 12 - 29100 Piacenza
Olivier Grisel
2012-03-27 12:53:03 UTC
Post by Paolo Losi
Gilles,
thank you very much for having checked.
- uncomment extratrees and randomforest benchmark (@pprett is there
any valid reason to leave them out?)
They are far slower to run than the others.

Ideally, a command-line switch to select the class names of the
estimators to benchmark would be great, so the bench can be launched
only on a few specific models.
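Something like this, perhaps (the flag name and the ESTIMATORS mapping are
just a sketch, not the actual bench script):

import argparse

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# Hypothetical registry of the benchmarked estimators, keyed by name.
ESTIMATORS = {
    "SGD": SGDClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=20),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=20),
}

parser = argparse.ArgumentParser(description="covertype benchmark")
parser.add_argument("--classifiers", nargs="+", default=sorted(ESTIMATORS),
                    choices=sorted(ESTIMATORS),
                    help="names of the estimators to benchmark")
args = parser.parse_args()

for name in args.classifiers:
    print("benchmarking:", name, ESTIMATORS[name])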
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Peter Prettenhofer
2012-03-27 12:57:11 UTC
Post by Paolo Losi
Gilles,
thank you very much for having checked.
- uncomment extratrees and randomforest benchmark (@pprett is there
any valid reason to leave them out?)
no, absolutely not - I just forgot to uncomment them - thx
Post by Paolo Losi
- explicitly config max_features=None for RandomForest and ExtraTrees
+1
--
Peter Prettenhofer
Dimitrios Pritsos
2012-03-27 16:31:32 UTC
Hello,

While I am using svm.sparse.SVC with a scipy.sparse.csr_matrix, the
following error occurs:

  File "/home/dimitrios/Development_Workspace/webgenreidentification/src/experiments_lowbow.py", line 115, in evaluate
    csvm.fit(train_X, train_Y)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/svm/sparse/base.py", line 22, in fit
    return super(SparseBaseLibSVM, self).fit(X, y, sample_weight)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/svm/base.py", line 145, in fit
    fit(X, y, sample_weight)
  File "/usr/local/lib/python2.6/dist-packages/sklearn/svm/base.py", line 294, in _sparse_fit
    int(self.shrinking), int(self.probability))
  File "libsvm_sparse.pyx", line 158, in sklearn.svm.libsvm_sparse.libsvm_sparse_train (svm/libsvm_sparse.c:1927)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.py", line 75, in __init__
    self.shape = shape   # spmatrix will check for errors
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 71, in set_shape
    raise ValueError('invalid shape')
ValueError: invalid shape
However, the .shape of my matrix train_X is 18 x 7500 and
scipy.sparse.isspmatrix_csr(train_X) returns True. train_Y is a list of
18 int objects.

So, what might be the problem?

I have been fighting with this for a few hours but I cannot find the
cause of this error.

Regards,

Dimitrios
Vlad Niculae
2012-03-27 17:07:16 UTC
Hello Dimitrios

You only have 18 samples? What is the shape of your train_Y?

Best,
Vlad
Dimitrios Pritsos
2012-03-27 17:13:44 UTC
Hello Vlad,

Yes, 18 is just for debugging, because I have implemented a Locally
Weighted Bag of Words that requires several Gaussian PDFs to smooth out
the data, and it is quite a time-consuming process. So 18 is just enough
for debugging; later I will use about 800 or so.

train_Y is a list of size 18 (with int(1) objects).

Best Regards,

Dimitrios
Peter Prettenhofer
2012-03-27 17:08:21 UTC
Dimitrios,

please provide an example script so that we can reproduce the error.

BTW: gist [1] is a handy tool to distribute scripts.

[1] https://gist.github.com/

best,
Peter
--
Peter Prettenhofer
Dimitrios Pritsos
2012-03-27 17:20:11 UTC
Hello Peter,

Yes, I can do that, but the code is using a lib I have implemented for
raw HTML to vector conversion.

So should I send the whole thing, or just the parts that are creating
the matrix?

Regards,

Dimitrios
Gael Varoquaux
2012-03-27 17:29:26 UTC
Post by Dimitrios Pritsos
So should I send the whole thing, or just the parts that are creating the matrix?
Just save X and y and create a gist that can reproduce the problem
without the external dependencies.
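For example (the file names and the random matrix below are just
placeholders for your real data):

import numpy as np
from scipy import io, sparse

# Stand-ins for the real data: an 18 x 7500 CSR matrix and 18 labels.
train_X = sparse.random(18, 7500, density=0.01, format="csr", random_state=0)
train_Y = [1] * 18

io.mmwrite("train_X.mtx", train_X)            # portable sparse format
np.savetxt("train_Y.txt", train_Y, fmt="%d")

# The gist then only needs numpy/scipy to reload them:
X = io.mmread("train_X.mtx").tocsr()
y = np.loadtxt("train_Y.txt", dtype=int)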

G
Dimitrios Pritsos
2012-03-27 17:36:33 UTC
I think I found it - but I have to test it again with the whole data set
and will let you know.

So when I am using only one label in Y, for example [1, 1, 1, 1, 1, 1,
1, 1], it returns the error I mentioned in my first post.

But when I have something like [1, 1, 1, 1, 1, 1, 1, 2], it seems OK;
no error is returned.

I will test it with the real data set and let you know. However, the
lines of sklearn.svm.sparse.SVC code that returned the <invalid shape>
error only checked the shape of the data, not its content.
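A minimal reproduction of what I am seeing (assuming a recent
scikit-learn, where SVC accepts sparse input directly and a single-class
y raises a clearer ValueError than the <invalid shape> one above):

import numpy as np
from scipy import sparse
from sklearn.svm import SVC

X = sparse.random(18, 7500, density=0.01, format="csr", random_state=0)

y_one_class = np.ones(18, dtype=int)                   # only the label 1
y_two_classes = np.r_[np.ones(17, dtype=int), [2]]     # 17 ones and a single 2

SVC().fit(X, y_two_classes)     # fine: two classes are present

try:
    SVC().fit(X, y_one_class)   # a single class cannot be fitted
except ValueError as exc:
    print(exc)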

Regards,

Dimitrios
Paolo Losi
2012-03-27 09:29:27 UTC
Hi all,

I've just run bench_covertype on today's master.
I needed to uncomment the ExtraTrees and RandomForest benchmarks.

The results are quite unexpected:

Classifier train-time test-time error-rate
--------------------------------------------
Liblinear 13.5609s 0.0683s 0.2307
GaussianNB 3.6565s 0.1753s 0.6367
SGD 0.4522s 0.0170s 0.2300
CART 35.3378s 0.0375s 0.0476
RandomForest 246.8737s 0.6908s 0.0807
Extra-Trees 182.0412s 0.6269s 0.1986


with respect to what is reported in bench_covertype.py:

Classifier   train-time test-time error-rate
--------------------------------------------
Liblinear      11.8977s   0.0285s     0.2305
GaussianNB      3.5931s   0.6645s     0.3633
SGD             0.2924s   0.0114s     0.2300
CART           39.9829s   0.0345s     0.0476
RandomForest  794.6232s   1.0526s     0.0249
Extra-Trees  1401.7051s   1.1181s     0.0230

Unless I'm missing something obvious I'll open a ticket
and try to give git bisect a run ...

Thanks!
Paolo

PS: I just noticed that the GaussianNB results are also worse...
