Discussion:
[Scikit-learn-general] class label hashing
Pagliari, Roberto
2015-05-01 03:02:29 UTC
Permalink
Suppose I train a classifier with dataset1, which contains labels

0
3
4
6
7

and then predict over dataset2 with labels

0
3
4
8
10

Will the hashing be the same for labels 0, 3 and 4? And will scikit-learn get confused by seeing new labels such as 8 and 10?

Thank you,
Sebastian Raschka
2015-05-01 03:08:49 UTC
Permalink
Roberto, I am not sure whether this causes problems in the implementation, but in any case I'd recommend using the LabelEncoder to map your classes to a fixed range, e.g., 0, 1, 2, 3, 4, 5. Having different class labels in the training and test sets that refer to the same class is not good practice and could cause all kinds of problems. I just wouldn't risk it, even if it works.
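
A minimal sketch of the LabelEncoder suggestion, assuming the label values from the question (the array names are made up for illustration). It fits the encoder on the training labels only, so the label-to-code mapping is fixed by the training set, and it shows that labels unseen during fitting are rejected rather than given a new code:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Label values taken from the question; the repeats are only for illustration.
train_labels = np.array([0, 3, 4, 6, 7, 3, 0])
test_labels = np.array([0, 3, 4, 8, 10])

encoder = LabelEncoder()
encoder.fit(train_labels)

# classes_ holds the sorted unique training labels;
# a label's position in classes_ is its integer code.
print(encoder.classes_)
print(encoder.transform([0, 3, 4]))

# Labels never seen during fit raise a ValueError
# instead of silently receiving a code.
try:
    encoder.transform(test_labels)
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print("rejected:", err)
```

Because the encoder is fitted once on the training labels, 0, 3 and 4 always map to the same codes, whatever else appears in the second dataset.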

Pagliari, Roberto
2015-05-01 15:07:23 UTC
Permalink
Hi Sebastian,
if the classes/labels are the same for both training and test, that should not be a problem. I've done that and never seen any issues. As far as I can see, scikit-learn automatically maps classes to integers from 0 to (number of classes - 1), which is something Spark, for example, does not do.
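
As a rough illustration of that mapping (a sketch with made-up toy data and an arbitrarily chosen DecisionTreeClassifier, not a statement about every estimator), a fitted classifier exposes the sorted training labels in `classes_`, and column j of `predict_proba` corresponds to `classes_[j]`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical toy data: one tight 2-D cluster per class label.
y_train = np.repeat([0, 3, 4, 6, 7], 20)
X_train = y_train[:, None] + 0.1 * rng.normal(size=(y_train.size, 2))

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Sorted unique training labels; the position in this array is the internal index.
print(clf.classes_)

# predict_proba has one column per entry of classes_.
proba = clf.predict_proba(X_train[:5])

# Mapping the arg-max column back through classes_ recovers the original labels.
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
pred = clf.predict(X_train[:5])
```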

With different sets of classes, the simplest thing is to remove the test samples whose labels do not appear in the training set, to avoid messing up the confusion matrix (in my case, different label numbers really are different classes).
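
One way that filtering could look, as a sketch only: the toy data, the `make_data` helper and the choice of classifier are assumptions for illustration. Test samples whose true label was never seen in training are dropped with a boolean mask, and the `labels` argument of `confusion_matrix` pins the row and column order to the training classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

def make_data(labels, n_per_class=10):
    """Hypothetical toy data: one tight 2-D cluster per class label."""
    y = np.repeat(labels, n_per_class)
    X = y[:, None] + 0.1 * rng.normal(size=(y.size, 2))
    return X, y

X_train, y_train = make_data([0, 3, 4, 6, 7])
X_test, y_test = make_data([0, 3, 4, 8, 10])

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Keep only test samples whose true label was seen during training.
known = np.isin(y_test, clf.classes_)
X_known, y_known = X_test[known], y_test[known]

# labels= fixes the matrix to one row and one column per training class.
cm = confusion_matrix(y_known, clf.predict(X_known), labels=clf.classes_)
print(cm.shape)
```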


Michael Eickenberg
2015-05-01 15:13:11 UTC
Permalink
What do you expect a classifier to predict on a label that it has never seen
during training? If there were structure in the target, such as an order,
then an appropriate regression may be able to infer unseen targets due to
this structure. But in classification this information is entirely absent.

Michael
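
A small sketch of this point, with made-up one-dimensional data and a 1-nearest-neighbour classifier chosen only for simplicity. Samples that "should" belong to the unseen classes 8 and 10 still receive one of the training labels, because the classifier can only choose among `classes_`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One hypothetical training sample per class, placed at its label value.
X_train = np.array([[0.0], [3.0], [4.0], [6.0], [7.0]])
y_train = np.array([0, 3, 4, 6, 7])

# Hypothetical samples whose true labels would be 8 and 10.
X_new = np.array([[8.0], [10.0]])

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
pred = clf.predict(X_new)

# Both samples get the label of the nearest training point, which is 7.
print(pred)
```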

Pagliari, Roberto
2015-05-01 15:16:39 UTC
Permalink
I agree with you.
I'm just not sure whether scikit-learn would handle that or not.

Thank you,


Andreas Mueller
2015-05-01 16:51:13 UTC
Permalink
It should. If not, please report a bug.
