Discussion:
scikit learn classification issue
Karimkhan Pathan
2014-09-03 10:20:22 UTC
Permalink
I have trained my classifier on 20 domain datasets using MultinomialNB,
and it is working fine for these 20 domains.

The issue is: if I make a query containing text that does not belong to any
of these 20 domains, it still gives a classification result.

Is it possible that if the query does not belong to any of the 20 domains,
it gets probability value 0?
Sebastian Raschka
2014-09-03 14:31:28 UTC
Permalink
This is due to the Laplace smoothing. If I understand correctly, you want the classification to fail if there is a new feature value (e.g., a word that is not in the vocabulary when you are doing text classification)?

You can set the alpha parameter to 0 (see http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB), which would disable the Laplace smoothing.

Best,
Sebastian Raschka
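For illustration, a minimal sketch of the alpha parameter on toy documents (the documents and labels below are purely hypothetical). One caveat: recent scikit-learn releases warn when alpha is exactly 0 and clip it internally, so a tiny positive value is the practical way to express "no smoothing":

```python
# Sketch: controlling Laplace smoothing via MultinomialNB's alpha parameter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels
train_docs = ["buy cheap laptops", "new phone released",
              "laptop battery life", "phone camera review"]
labels = ["laptops", "mobile_phones", "laptops", "mobile_phones"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# alpha=1.0 is the default Laplace smoothing; a value near 0 effectively
# disables it (a tiny positive value avoids the alpha=0 warning/clipping).
clf = MultinomialNB(alpha=1e-10).fit(X, labels)

probs = clf.predict_proba(vec.transform(["cheap laptop battery"]))
print(probs)  # one row, one column per class, summing to 1
```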
------------------------------------------------------------------------------
Slashdot TV.
Video for Nerds. Stuff that matters.
http://tv.slashdot.org/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Patrick Short
2014-09-03 22:24:36 UTC
Permalink
Hi Karimkhan,

If I am understanding your question correctly, you are asking to classify
test data in a class that is not specified in your training set.

For instance, if you have three classes of news articles specified in your
training data (e.g. politics, sports, and food) and you try to classify an
article that 'truly' belongs in a 'business' category, you are out of
luck. Your classification can only be as good as the training data, and your
classifier will put the article in the closest match it can find (if the
article was about McDonald's stock price, it might be classified as food,
for instance).

Hope that helps!
--
Patrick Short
------------------------------

University of North Carolina at Chapel Hill, 2014

Applied Mathematics and Quantitative Biology

***@gmail.com | 919-455-7045 C
Karimkhan Pathan
2014-09-04 06:56:55 UTC
Permalink
Mohamed-Rafik Bouguelia
2014-09-04 09:01:44 UTC
Permalink
Gael Varoquaux
2014-09-04 11:29:38 UTC
Permalink
Post by Mohamed-Rafik Bouguelia
An example of this is the paper that can be found here:
http://www.loria.fr/~mbouguel/papers/BougueliaICPR.pdf
Mohamed-Rafik Bouguelia, Yoland Belaid and Abdel Belaid. Efficient active novel
class detection for data stream classification. In the IEEE International
Conference on Pattern Recognition - ICPR, Stockholm (Sweden), August 2014.
It would be nice if some of these methods could be implemented in scikit-learn.
There are guidelines on what can be included in scikit-learn in our FAQ
(necessary but not sufficient conditions):
http://scikit-learn.org/dev/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published

Cheers,

Gaël
Karimkhan Pathan
2014-09-04 11:52:02 UTC
Permalink
Hey Gaël,
Happy to see you on this thread. Actually, just today I was listening to
your scikit-learn IPython notebook tutorial.

Could you please shed some light on my classification issue? I guess you
might know whether a helpful class/method exists in scikit-learn that can
solve it.


Gael Varoquaux
2014-09-04 13:30:04 UTC
Permalink
Post by Karimkhan Pathan
Well could you please throw light on my classification issue? I guess
you might be knowing well whether something helpful class/method exists
in scikit which can solve this issue. 
I don't know. I would naively try predict_proba and conclude that it's
none of the known classes if none of the probabilities is reasonably
confident. But I have no prior experience doing that, so I cannot give
good advice.

G
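Gael's suggestion could be sketched like this (toy corpus and the 0.7 threshold are purely illustrative; this only demonstrates the idea of rejecting unconfident predictions, not a validated recipe):

```python
# Sketch: reject the prediction when no class probability is confident enough.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-domain toy corpus
train_docs = ["film review oscar", "laptop ram upgrade",
              "movie actor cast", "laptop screen size"]
labels = ["films", "laptops", "films", "laptops"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_docs), labels)

def predict_or_reject(text, threshold=0.7):
    probs = clf.predict_proba(vec.transform([text]))[0]
    best = np.argmax(probs)
    if probs[best] < threshold:
        return None  # no known class is confident enough
    return clf.classes_[best]

print(predict_or_reject("movie actor oscar"))  # in-domain, should pass
print(predict_or_reject("what is where"))      # all OOV words, should reject
```

Note that an all-out-of-vocabulary query reduces to the class priors, which is exactly the near-uniform behavior reported in this thread, so a threshold catches that case but not every out-of-domain input.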
Karimkhan Pathan
2014-09-04 13:45:09 UTC
Permalink
Oh okay, well I tried predict_proba. But if the query is out of domain,
the classifier spreads the probability over all learned domains. For
example, in the case of 4 domains:
(0.333123570669, 0.333073654046, 0.166936800591, 0.166865974694)


Lars Buitinck
2014-09-04 14:11:34 UTC
Permalink
Post by Karimkhan Pathan
Oh okay, well I tried with predict_proba. But if query is out of domain then
classifier uniformly divide probability to all learned domains. Like in case
of 4 domains (0.333123570669, 0.333073654046, 0.166936800591,
0.166865974694)
Naive Bayes returns highly distorted probabilities. It's a good
classifier, but a lousy probability model. predict_proba is really
only useful for ensemble algorithms.

What you could do is phrase the problem as multi-label classification
with sklearn.multiclass.OneVsRestClassifier, and then predict the
class with the highest probability under this model iff it exceeds .5.
If none of the k classifiers predicts positive, return the null class.
This is just an idea, no guarantee that it will work. You'll need to
convert the targets using sklearn.preprocessing.MultiLabelBinarizer.
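Lars's idea might be sketched as follows (toy data; the 0.5 cutoff follows his suggestion, everything else is illustrative and untested on real data):

```python
# Sketch: multi-label phrasing with a null class when no per-class
# classifier is confident.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy corpus; each document has one label, wrapped in a list
# so MultiLabelBinarizer can produce the indicator matrix.
train_docs = ["film review oscar", "laptop ram upgrade",
              "movie actor cast", "laptop screen size"]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([["films"], ["laptops"], ["films"], ["laptops"]])

vec = CountVectorizer()
ovr = OneVsRestClassifier(MultinomialNB()).fit(
    vec.fit_transform(train_docs), Y)

def classify(text):
    # per-class positive probabilities from the k binary classifiers
    probs = ovr.predict_proba(vec.transform([text]))[0]
    best = np.argmax(probs)
    if probs[best] <= 0.5:
        return None  # none of the k classifiers predicts positive
    return mlb.classes_[best]
```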
Mohamed-Rafik Bouguelia
2014-09-04 14:24:35 UTC
Permalink
Karimkhan,

Two possible naive methods that you can use directly with sklearn are:

(1) use predict_proba and check whether the probability of the most
probable class (p1) is below a threshold. Or you can use the entropy of
the probability distribution instead of p1. However, an instance with a
low prediction probability is not always an instance of an unknown class.

(2) use the one-class SVM available in sklearn (see
http://scikit-learn.org/stable/modules/outlier_detection.html ) and
build one one-class SVM for each of your known classes (e.g. if you have 4
known classes, you build 4 one-class SVM models). If a new test point
is classified as an outlier by all those models, then it is possibly a
novel-class instance. However, it is a bit difficult to tune the
"nu" and "gamma" parameters of the one-class SVM.

Another possibly more efficient (but less straightforward) way is to extend
(2) to detect the instances of the test set that are flagged as outliers
and are close to each other.
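Method (2) could be sketched like this (synthetic 2-D point clouds instead of text for brevity; the nu/gamma values are illustrative and, as noted, would need tuning on real data):

```python
# Sketch: one OneClassSVM per known class; a point rejected by every
# model is a candidate novel-class instance.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Two hypothetical known classes as synthetic 2-D point clouds
class_data = {
    "A": rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    "B": rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
}

# nu/gamma chosen for this toy geometry, not a general recommendation
models = {name: OneClassSVM(nu=0.1, gamma=0.5).fit(X)
          for name, X in class_data.items()}

def is_novel(point):
    # OneClassSVM.predict returns +1 for inliers, -1 for outliers
    return all(m.predict([point])[0] == -1 for m in models.values())

print(is_novel([0.1, 0.0]))    # near class A, expected inlier somewhere
print(is_novel([50.0, 50.0]))  # far from both classes, expected novel
```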
--
Mohamed-Rafik BOUGUELIA
PhD Student
INRIA Nancy Grand Est - LORIA - READ Team
Nancy University - France.
Olivier Grisel
2014-09-04 16:44:49 UTC
Permalink
Another possible strategy:

Add a new class named "random garbage" to your training set, with random
text collected from Wikipedia, social network messages, or both.
--
Olivier
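This could be sketched as follows (the "garbage" documents here are tiny placeholders; in practice they would be a large sample of random text, as suggested):

```python
# Sketch: an explicit "garbage" class absorbs out-of-domain queries.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the garbage documents stand in for random
# text scraped from Wikipedia / social networks.
train_docs = ["film review oscar", "laptop ram upgrade",
              "what is where", "lorem ipsum dolor"]
labels = ["films", "laptops", "garbage", "garbage"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_docs), labels)

pred = clf.predict(vec.transform(["what is where anyway"]))[0]
print(pred)  # an out-of-domain query should fall into the garbage bucket
```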
Karimkhan Pathan
2014-09-04 06:52:18 UTC
Permalink
Dear Sebastian,
Thanks for the reply. My alpha value was actually 0.1; I changed it to 0
and tested the code, but it behaves the same.
Karimkhan Pathan
2014-09-04 06:56:55 UTC
Permalink
Hi Patrick,
Yeah, you might be correct. But when I input the testing query 'what is
where', which is all stopwords, and I have a filter for stopwords, it still
classifies it as:

'what is what' => films => 0.333333333333
'what is what' => laptops => 0.333333333333
'what is what' => medicine => 0.166666666667
'what is what' => mobile_phones => 0.166666666667

This behavior surprises me.
Mohamed-Rafik Bouguelia
2014-09-04 09:01:44 UTC
Permalink
Hi Patrick,

Just for information, there are some existing techniques to detect test
instances whose class was not provided during training. Instead of letting
the classifier put those instances in the closest match it can (the most
probable known class), we detect that they belong to a novel class that
was unknown during training.

An example of this is the paper that can be found here:
http://www.loria.fr/~mbouguel/papers/BougueliaICPR.pdf
Mohamed-Rafik Bouguelia, Yoland Belaid and Abdel Belaid. Efficient active
novel class detection for data stream classification. In the IEEE
International Conference on Pattern Recognition - ICPR, Stockholm (Sweden),
August 2014.

It would be nice if some of these methods could be implemented in
scikit-learn.
--
Mohamed-Rafik BOUGUELIA
