Discussion:
Panda / Tree and Random Forest
Didier Vila
2012-10-24 14:34:56 UTC
Good morning all,

I just downloaded pandas, on the recommendation of friends, for my data
manipulation in Python.

I want to test it with the decision trees and random forests from
scikit-learn. My data are in a CSV file with 150 000 rows and 8 variables.

It looks like the process has finished, yet I still have 5 GB of memory
blocked (my PC has 16 GB of memory!).

* Is it normal that scikit-learn cannot handle data of this size
  (150 000 rows)?

* I don't get any feedback from Python. It just stops and that's it.
  Has anybody had the same issue?

* How can I free the PC's memory from Python?

Thanks for your help.

from settlementmatrix import SettlementMatrix
from pandas import *
from sklearn import hmm
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

import pandas

training_data = read_csv("data.csv")  # Read the file
feature = training_data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']]
label = training_data['X9']
clf = RandomForestClassifier(n_estimators=5)
clf.fit(feature, label)

Didier Vila, PhD | Risk | CapQuest Group Ltd | Fleet 27 | Rye Close |
Fleet | Hampshire | GU51 2QQ | Fax: 0871 574 2992 | Email:
***@capquestco.com <mailto:***@capquestco.com>



This e-mail is intended solely for the addressee, is strictly confidential and may also be legally privileged. If you are not the addressee please do not read, print, re-transmit, store or act in reliance on it or any attachments. Instead, please email it back to the sender and then immediately permanently delete it. E-mail communications cannot be guaranteed to be secure or error free, as information could be intercepted, corrupted, amended, lost, destroyed, arrive late or incomplete, or contain viruses. We do not accept liability for any such matters or their consequences. Anyone who communicates with us by e-mail is taken to accept the risks in doing so. Opinions, conclusions and other information in this e-mail and any attachments are solely those of the author and do not represent those of CapQuest Group Limited or any of its subsidiaries unless otherwise stated. CapQuest Group Limited (registered number 4936030), CapQuest Debt Recovery Limited (registered number 3772278), CapQuest Investments Limited (registered number 5245825), CapQuest Asset Management Limited (registered number 5245829) and CapQuest Mortgage Servicing Limited (registered number 05821008) are all limited companies registered in England and Wales with their registered offices at Fleet 27, Rye Close, Fleet, Hampshire, GU51 2QQ. Each company is a separate and independent legal entity. None of the companies have any liability for each other's acts or omissions. This communication is from the company named in the sender's details above.
Andreas Mueller
2012-10-24 14:39:59 UTC
Hi Didier.
What do you mean by "no feedback from python"? What did you expect?
If it ran through, it handled the data, so why do you say it does not?
What do you mean by cleaning the memory? Once the Python process has
ended, its memory is freed.
Cheers,
Andy
Didier Vila
2012-10-24 15:21:30 UTC
Andy,

Post by Andreas Mueller
What do you mean by "no feedback from python". What did you expect?
In my PyDev environment, I wrote:

print(datetime.now())
print(clf.predict(feature))
print(datetime.now())

The first timestamp is printed but the second is not, because the Python
process ends in between.

Post by Andreas Mueller
If it ran through, it handled the data, so why do you say it does not?
Is it because I have 150 000 rows?

Post by Andreas Mueller
What do you mean by cleaning the memory? If python ended, the memory
will be freed.
This is where I am confused: my memory is not freed.

Andreas Mueller
2012-10-24 14:43:40 UTC
In addition, it might be good for you to have a look at the
tutorial:
http://scikit-learn.org/dev/tutorial/basic/tutorial.html
Didier Vila
2012-10-24 15:21:59 UTC
Thanks, I will have a look.
Peter Prettenhofer
2012-10-24 15:36:26 UTC
Didier,

what type is ``feature`` (simply print ``type(feature)``)? Considering
your first email I suspect it's a pandas.DataFrame; scikit-learn
estimators require array-like inputs, so please do
``clf.fit(feature.values, label.values.ravel())`` instead of
``clf.fit(feature, label)``.

150 000 samples is quite a lot, but if you just want to fit 5 trees it
should run in under 15 seconds (I tested using random data and binary
classification).

best,
Peter
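
A minimal sketch of the conversion suggested above. It assumes a small
synthetic DataFrame in place of the poster's data.csv: the random values and
the 1000-row size are assumptions for illustration, while the column names
X1..X9 follow the original post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv (assumption: 1000 random rows, columns X1..X9).
rng = np.random.RandomState(0)
columns = ["X%d" % i for i in range(1, 10)]
training_data = pd.DataFrame(rng.rand(1000, 9), columns=columns)

feature = training_data[columns[:8]]  # DataFrame holding X1..X8
label = training_data["X9"]           # Series holding the target

# .values exposes the underlying NumPy array; ravel() guarantees a 1-D target.
X = feature.values
y = label.values.ravel()

print(type(feature).__name__, "->", type(X).__name__, X.shape)
print(type(label).__name__, "->", type(y).__name__, y.shape)
```

The resulting X and y are plain NumPy arrays and can be passed to
``clf.fit(X, y)``.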
Didier Vila
2012-10-24 15:53:01 UTC
Peter,

Thanks for the email.

I just started to use pandas this morning.

The features are integers (binary, or 0-1-2-3) or reals.

Note that my target variable is continuous, between 0 and 1.

I just ran your code below and I still have the same issue:

clf.fit(feature.values, label.values.ravel())

Regards

Didier

PS: the initial code worked for 100 samples.

Peter Prettenhofer
2012-10-24 16:02:44 UTC
Post by Didier Vila
Note that my target variable is continuous between 0 and 1.
Ok - then that's the problem: for regression problems you have to use
RandomForestRegressor instead of RandomForestClassifier.

best,
Peter
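
A minimal sketch of the diagnosis above, on synthetic data (the 2000-row
size and the random values are assumptions, not the poster's data). With a
continuous target nearly every value is distinct, so a classifier would
presumably treat each one as its own class; a regressor fits the same data
directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the poster's data (assumption): 8 features and a
# continuous target between 0 and 1.
rng = np.random.RandomState(0)
X = rng.rand(2000, 8)
y = rng.rand(2000)

# Number of distinct target values, i.e. the number of "classes" a
# classifier would see if it were given this target.
n_distinct_targets = np.unique(y).size
print("distinct target values:", n_distinct_targets, "of", y.shape[0])

# The regressor handles the continuous target as intended.
reg = RandomForestRegressor(n_estimators=5, random_state=0)
reg.fit(X, y)
pred = reg.predict(X[:5])
print(pred)
```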
Gael Varoquaux
2012-10-24 16:06:38 UTC
Post by Peter Prettenhofer
Ok - then that's the problem - for regression problems you have to use
RandomForestRegressor instead of RandomForestClassifier.
Maybe we could try to capture this with a standard test: if the data is
float, how many unique entries does it have? If it has more than
.5*n_samples, raise a warning or an error.

I would vote to perform such a test only if the data is float, as computing
the unique values is not a free operation on big arrays.

We could (should?) add this to all classifiers, and add a test in the
common tests.

G
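
The rule proposed above could be sketched as follows. This is only an
illustration, not scikit-learn code: the helper name
``targets_look_continuous`` and its ``ratio`` parameter are hypothetical.

```python
import warnings

import numpy as np


def targets_look_continuous(y, ratio=0.5):
    """Hypothetical helper: flag float targets with many unique values."""
    y = np.asarray(y)
    if y.dtype.kind != "f":
        # Only float targets are checked, as suggested in the message above.
        return False
    n_samples = y.shape[0]
    n_unique = np.unique(y).size
    if n_unique > ratio * n_samples:
        warnings.warn(
            "%d unique float targets for %d samples: is this a regression "
            "problem?" % (n_unique, n_samples),
            UserWarning,
        )
        return True
    return False


rng = np.random.RandomState(0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    continuous = targets_look_continuous(rng.rand(1000))
float_classes = targets_look_continuous(np.array([0.0, 1.0] * 500))
int_classes = targets_look_continuous(np.array([0, 1, 2] * 100))
print(continuous, float_classes, int_classes)
```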
Andreas Mueller
2012-10-24 16:09:32 UTC
Post by Gael Varoquaux
Maybe we could try to capture this: have a standard test, if the data is
float, how many unique entries does it have? If it has more than
.5*n_samples, raise a warning or an error.
Last time I suggested this, someone replied "garbage in - garbage out"
and someone else replied "that's not stable".

I'm for it, though.

Btw, the classifier computes "unique" anyway.
Gael Varoquaux
2012-10-24 16:13:31 UTC
Post by Andreas Mueller
Last time I suggested this, someone replied "garbage in - garbage out"
and someone else replied "that's not stable".
I am probably one of these someones :$. I hadn't realized that it would
lead to an explosion of the memory usage, and not just GIGO.

That said, there is only so much that we can do for users who don't read
the basic docs, and I am not too much in favor of going out of our way
for them. They'll just endlessly generate more work and we have to decide
how to allocate our resources.

Let's say that in the case of the random forest, I am in favor of raising
an error, because otherwise we have a memory explosion.

For the rest, GIGO.

My 2 cents,

Gaël
Mathieu Blondel
2012-10-24 16:38:22 UTC
Post by Gael Varoquaux
Maybe we could try to capture this: have a standard test, if the data is
float, how many unique entries does it have? If it has more than
.5*n_samples, raise a warning or an error.
This kind of rule will generate false positives. For example:
X = [[0, 1], [1, 0]]
y = [1, -1]

In this case, len(np.unique(y)) > 0.5 * n_samples.

I think it's hard to come up with a reliable rule.

Mathieu
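
The counterexample above can be checked directly. One assumption here: the
labels are written as floats (1.0 and -1.0), since the proposed rule would
only inspect float targets.

```python
import numpy as np

# Two-sample, two-class problem from the message above, with float labels
# (assumption) so that the float-only rule applies to it.
y = np.array([1.0, -1.0])
n_samples = y.shape[0]
n_unique = np.unique(y).size

# The proposed rule flags the target when more than half the values are unique.
flagged = n_unique > 0.5 * n_samples
print(n_unique, n_samples, flagged)
```

Here a valid binary classification target is flagged, which is the false
positive being described.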
Gael Varoquaux
2012-10-24 16:39:45 UTC
Post by Mathieu Blondel
X = [[0, 1], [1, 0]]
y = [1, -1]
In this case, len(np.unique(y)) > 0.5 * n_samples.
Indeed, however would random forests learn anything useful on such data?
Post by Mathieu Blondel
I think it's hard to come up with a reliable rule.
Granted.

G
Mathieu Blondel
2012-10-24 16:53:41 UTC
Post by Gael Varoquaux
Indeed, however would random forests learn anything useful on such data?
In my opinion, any rule with false positives is bad.
Post by Mathieu Blondel
I think it's hard to come up with a reliable rule.
Maybe checking for the decimal part of numbers could work. For example,
1.33333 should not be allowed.
Rather than doing an exhaustive search, one could randomly check
sqrt(n_samples) values. This rule would have false negatives but false
negatives are acceptable.

Mathieu
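
A rough sketch of the sampled check described above, under stated
assumptions: the helper name ``sample_has_fractional_targets`` and its
``random_state`` parameter are hypothetical, and sampling without
replacement is one possible reading of "randomly check".

```python
import numpy as np


def sample_has_fractional_targets(y, random_state=0):
    """Hypothetical helper: inspect about sqrt(n_samples) random targets and
    report whether any of them has a non-zero decimal part."""
    y = np.asarray(y, dtype=float)
    n_samples = y.shape[0]
    if n_samples == 0:
        return False
    n_checked = max(1, int(np.sqrt(n_samples)))
    rng = np.random.RandomState(random_state)
    idx = rng.choice(n_samples, size=n_checked, replace=False)
    sample = y[idx]
    # A value such as 1.33333 differs from its floor; 1.0 or -1.0 does not.
    return bool(np.any(sample != np.floor(sample)))


rng = np.random.RandomState(0)
regression_like = sample_has_fractional_targets(rng.rand(10000))
binary_labels = sample_has_fractional_targets(np.array([1.0, -1.0] * 50))
print(regression_like, binary_labels)
```

Because only a sample is inspected, a target with a few fractional values
can slip through (a false negative), while integer-valued labels such as
1 and -1 are never flagged.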
Andreas Mueller
2012-10-24 16:58:43 UTC
Post by Mathieu Blondel
Maybe checking for the decimal part of numbers could work. For
example, 1.33333 should not be allowed.
Rather than doing an exhaustive search, one could randomly check
sqrt(n_samples) values. This rule would have false negatives but false
negatives are acceptable.
You mean check the labels themselves?
They are passed to unique.... hum....
Brian Holt
2012-10-24 17:02:16 UTC
I'm with GIGO. The name of the model (classifier or regressor) should be
enough of a clue to users as to which one they should use for their problem.
Peter Prettenhofer
2012-10-24 17:09:27 UTC
Permalink
Post by Brian Holt
I'm with GIGO. The name of the model (classifier or regressor) should be
enough clue to the user which they should use for their problem.

+1 - I totally agree
Mathieu Blondel
2012-10-24 17:16:22 UTC
Permalink
On Thu, Oct 25, 2012 at 1:58 AM, Andreas Mueller
Post by Andreas Mueller
You mean check the labels themselves?
They are passed to unique.... hum....
Yes, in order to check whether a classifier is wrongly called with
regression targets.
Doing the check before the call to np.unique that many classifiers have
would be more useful :)
I think my suggestion based on non-exhaustive search should be light enough.

Mathieu
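
The sampling idea discussed above could look roughly like the sketch below. It is not code from scikit-learn: the function name `looks_like_regression_target` and the use of `random.sample` are assumptions made for illustration, and it assumes purely numeric labels. It draws about sqrt(n_samples) labels and reports whether any of them has a fractional part. Integer-valued labels never trigger it, while a fractional label that is not sampled goes unnoticed (a false negative).

```python
import math
import random


def looks_like_regression_target(labels, rng=None):
    """Heuristic sketch: inspect ~sqrt(n) random labels for fractional parts.

    Hypothetical helper, not part of scikit-learn. Assumes numeric labels.
    """
    rng = rng or random.Random()
    labels = list(labels)
    n = len(labels)
    if n == 0:
        return False
    # Non-exhaustive check: only about sqrt(n) values are examined.
    k = max(1, int(math.sqrt(n)))
    sample = rng.sample(labels, k)
    # A value such as 1.33333 has a fractional part; 2 or 2.0 does not.
    return any(not float(v).is_integer() for v in sample)


# Usage with made-up data.
continuous = [i + 0.5 for i in range(100)]   # every value is fractional
discrete = [0, 1, 2, 3] * 25                 # integer class labels
flag_continuous = looks_like_regression_target(continuous, random.Random(0))
flag_discrete = looks_like_regression_target(discrete, random.Random(0))
```

Because only a sample is inspected, the cost stays small even for large label arrays, which is the point of running such a check before the labels are passed to `np.unique`.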

Didier Vila
2012-10-24 16:18:50 UTC
Permalink
Post by Peter Prettenhofer
Ok - then that's the problem - for regression problems you have to use RandomForestRegressor instead of RandomForestClassifier.
best,
Peter
All, I just changed my code to

clf = RandomForestRegressor(n_estimators=5)

It works properly and takes 15 seconds to run!

Thanks again to all!



Two supplementary questions:

1) When using pandas, do I always need to call .values to convert the data, as in:

fit(feature.values, label.values.ravel())

2) How can I extract the rules for all the nodes of my tree/random forest? I expected something like:

( NODE 1: tree 5 : if and and and then )



Didier Vila, PhD | Risk | CapQuest Group Ltd | Fleet 27 | Rye Close | Fleet | Hampshire | GU51 2QQ | Fax: 0871 574 2992 | Email: ***@capquestco.com


-----Original Message-----
From: Peter Prettenhofer [mailto:***@gmail.com]
Sent: 24 October 2012 17:03
To: scikit-learn-***@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Panda / Tree and Random Forest
Post by Didier Vila
Peter,
Thanks for the email.
I just started using pandas this morning.
The features are integers (binary or 0-1-2-3) or reals.
Note that my target variable is continuous between 0 and 1.
Ok - then that's the problem - for regression problems you have to use
RandomForestRegressor instead of RandomForestClassifier.

best,
Peter
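
As a rough illustration of why the estimator choice matters: a classifier treats every distinct target value as a separate class (the thread notes that many scikit-learn classifiers pass the labels through `np.unique`). The plain-Python sketch below uses made-up data and a hypothetical helper, `count_classes`, to compare how many "classes" a continuous target in [0, 1] and a small integer target would produce. The large class count is a plausible, though unconfirmed, reason for the slow fit and memory use reported at the start of the thread.

```python
import random


def count_classes(labels):
    """Number of distinct target values, i.e. the classes a classifier would see.

    Hypothetical helper for illustration only.
    """
    return len(set(labels))


rng = random.Random(0)
# Made-up targets: one continuous in [0, 1), one with labels 0-1-2-3.
continuous_target = [rng.random() for _ in range(1000)]
discrete_target = [rng.randint(0, 3) for _ in range(1000)]

n_continuous = count_classes(continuous_target)  # close to the number of samples
n_discrete = count_classes(discrete_target)      # at most 4
print(n_continuous, n_discrete)
```

With a continuous target almost every sample becomes its own class, which is not what a classifier is designed for; a regressor predicts the value directly.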
Post by Didier Vila
I just ran your code below and I still have the same issue.
clf.fit(feature.values, label.values.ravel())
Regards
Didier
PS: the initial code worked for 100 samples.
-----Original Message-----
Sent: 24 October 2012 16:36
Subject: Re: [Scikit-learn-general] Panda / Tree and Random Forest
Didier,
what type is ``feature`` (simply print ``type(feature)``)? Considering
your first email I suspect it's a pandas.DataFrame; scikit-learn
estimators require array-like inputs - so please do
``clf.fit(features.values, labels.values.ravel())`` instead of
``clf.fit(features, labels)``.
150000 is quite a lot; but if you just want to fit 5 trees it should
run in under 15 seconds (I tested using random data and binary
classification).
best,
Peter
Post by Didier Vila
Thanks, I will have a look.
-----Original Message-----
Sent: 24 October 2012 15:44
Subject: Re: [Scikit-learn-general] Panda / Tree and Random Forest
As an addition, maybe it would be good for you to have a look into the
http://scikit-learn.org/dev/tutorial/basic/tutorial.html
Brian Holt
2012-10-24 16:37:41 UTC
Permalink
If you want rules, you can create an exporter similar to the graphviz one.
But just to be clear, this tree implementation is CART, not C4.5, so you
shouldn't expect the tree to store rules in your format.

Brian
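
A minimal sketch of such an exporter is shown below. It is an illustration, not the graphviz exporter or any scikit-learn function: `extract_rules` and the hand-built arrays are hypothetical. The arrays follow the layout that a fitted scikit-learn tree exposes on `estimator.tree_` (`children_left`, `children_right`, `feature`, `threshold`), where a child index of -1 marks a leaf and samples with `x[feature] <= threshold` go to the left child. Each leaf yields one "if ... and ... then ..." rule built from the path to it.

```python
def extract_rules(children_left, children_right, feature, threshold, value,
                  feature_names, node=0, conditions=()):
    """Return one 'if ... then ...' string per leaf of a binary tree.

    Hypothetical helper. `value` is a flat sequence holding one prediction
    per node; entries of `feature` and `threshold` at leaves are ignored.
    """
    left, right = children_left[node], children_right[node]
    if left == -1:  # leaf: turn the accumulated path into a rule
        path = " and ".join(conditions) if conditions else "True"
        return ["if {} then predict {}".format(path, value[node])]
    name = feature_names[feature[node]]
    cut = threshold[node]
    arrays = (children_left, children_right, feature, threshold, value,
              feature_names)
    # Recurse into both children, extending the path with the split test.
    return (
        extract_rules(*arrays, node=left,
                      conditions=conditions + ("{} <= {}".format(name, cut),))
        + extract_rules(*arrays, node=right,
                        conditions=conditions + ("{} > {}".format(name, cut),))
    )


# Hand-built example tree with 5 nodes (made-up numbers).
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
feature = [0, -2, 1, -2, -2]
threshold = [0.5, -2.0, 2.5, -2.0, -2.0]
value = [0.4, 0.1, 0.6, 0.5, 0.9]

rules = extract_rules(children_left, children_right, feature, threshold,
                      value, ["X1", "X2"])
for rule in rules:
    print(rule)
```

For a fitted forest, the individual trees are available as `forest.estimators_`, and for a single-output regressor the per-node predictions can be taken as `tree.tree_.value[:, 0, 0]`, so the function could be called once per tree. Newer scikit-learn releases also provide `sklearn.tree.export_text`, which prints a similar textual view of a single tree.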