Discussion:
SVM: select the training set randomly
Gianni Iannelli
2013-06-21 00:31:22 UTC
Dear All,
I have one question. I have a dataset of 100 vectors, each with some features, and I already know the class of every one of them. What I want to do is randomly select a subset of these 100 to use as a training set, with the rest as a test set. Is there something already implemented in scikit-learn that does this automatically, or do I have to use an index method? By index method I mean separating the two classes: for example, with 40 vectors of class A and 60 of class B, I select 10 indices at random from each class and use those 20 vectors as the training set. After that I select the other 80 vectors (again via indices into the main matrix) and classify them.
Do you think this is too convoluted, and is there something simpler? Is there also a validation of the result that could tell me how good the classification is? I know this is not a real use case, because I already know the classification of everything, but I just want to see what happens as I change the number of features, the number of training samples, and so on.
Thanks All!!!
Joel Nothman
2013-06-21 00:59:13 UTC
Please see
http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html


Bilal Dadanlar
2013-06-21 06:19:36 UTC
You can have a look at sklearn.cross_validation.train_test_split() and some other methods here:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation
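
For reference, a minimal sketch of that routine; the data here are toy stand-ins matching the shapes in the question, not anything from this thread:

import numpy
from sklearn.cross_validation import train_test_split

# Toy data: 100 labelled vectors, 40 of class A (0) and 60 of class B (1)
X = numpy.random.rand(100, 5)
y = numpy.array([0] * 40 + [1] * 60)

# Hold out a random 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print X_train.shape  # (80, 5)
print X_test.shape   # (20, 5)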



--
Bilal Dadanlar
cimri.com | Software Engineer
Gianni Iannelli
2013-06-21 14:20:22 UTC
Thank you very much for the link!! It does almost exactly what I want!
In my case I have two classes, say 0 and 1. I want to keep the distribution between them similar in the training set (and therefore also in the test set), and I need the samples to be chosen randomly; I don't care if in one case I get the same indices for the training and test sets. To select randomly, I think sklearn.cross_validation.ShuffleSplit() does what I want, and I will look into it. To keep the distribution equal between the two classes I was thinking to (see the sketch after this list):
split the two classes
apply the training/test separation to each of them using ShuffleSplit()
concatenate the two classes again (each will have the same size as before the split)
add the size of one class to the index vector of the other (depending on how I concatenate the two)
apply my SVM classification
What do you think? Does it seem OK?
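
A minimal sketch of this plan, with hypothetical array names, assuming the two class matrices are already separated:

import numpy
from sklearn.cross_validation import ShuffleSplit

# Hypothetical per-class matrices standing in for class 0 and class 1
X_A = numpy.random.rand(40, 5)
X_B = numpy.random.rand(60, 5)

def split_one_class(X, test_size=0.20):
    # ShuffleSplit draws random train/test index pairs over range(len(X))
    train_idx, test_idx = next(iter(ShuffleSplit(len(X), n_iter=1, test_size=test_size)))
    return X[train_idx], X[test_idx]

X_train_A, X_test_A = split_one_class(X_A)
X_train_B, X_test_B = split_one_class(X_B)

# Concatenate, rebuilding the labels to follow the concatenation order
X_train = numpy.concatenate((X_train_A, X_train_B))
y_train = numpy.array([0] * len(X_train_A) + [1] * len(X_train_B))
X_test = numpy.concatenate((X_test_A, X_test_B))
y_test = numpy.array([0] * len(X_test_A) + [1] * len(X_test_B))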
I have another question. How does score work? What does it compute? I searched around and found this:
sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None)
Maybe that could give me back a confusion matrix from which I could compute (I'm just guessing) an overall accuracy and a kappa coefficient.
Is that correct?
Thank You Very Much!!!
Roban Kramer
2013-06-21 14:57:58 UTC
StratifiedKFold will keep the class distribution the same for you:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold

There are lots of metrics (score functions, etc.) available:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation

See the docs for a particular estimator to find out what the score method
returns (which is generally the score function used in optimizing the
model). For instance

http://jaquesgrobler.github.io/Online-Scikit-Learn-stat-tut/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.score
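
To illustrate with toy data (not from this thread): for a classifier such as SVC, score returns the mean accuracy on the given test set, which you can cross-check against the metrics module. classification_report, for what it's worth, returns a formatted text summary of per-class precision, recall and F1 rather than a confusion matrix; metrics.confusion_matrix gives the matrix itself, and overall accuracy is its trace divided by its sum.

from sklearn import svm, metrics

X_train = [[0.], [1.], [2.], [3.]]
y_train = [0, 0, 1, 1]
clf = svm.SVC().fit(X_train, y_train)

X_test = [[0.5], [2.5]]
y_test = [0, 1]
print clf.score(X_test, y_test)                            # mean accuracy
print metrics.accuracy_score(y_test, clf.predict(X_test))  # the same number
print metrics.confusion_matrix(y_test, clf.predict(X_test))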


-Roban


Gianni Iannelli
2013-06-21 15:47:40 UTC
> StratifiedKFold will keep the class distribution the same for you:
> http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold

I was looking at this; it says:

> This cross-validation object is a variation of KFold, which returns stratified folds. The folds are made by preserving the percentage of samples for each class.

But I don't see how it can manage that, given that I pass it just the training set, and I also don't know how to set this percentage for each class. Am I missing something?
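
For reference, a minimal sketch (with toy labels) of how the stratification works: the class percentages come from the label vector y that the object is constructed with, so there is nothing to set by hand:

import numpy
from sklearn.cross_validation import StratifiedKFold

# Toy labels: the 40/60 class proportions live in y itself
y = numpy.array([0] * 40 + [1] * 60)

for train_idx, test_idx in StratifiedKFold(y, n_folds=5):
    print numpy.bincount(y[train_idx])  # roughly [32 48] in every fold
    print numpy.bincount(y[test_idx])   # roughly [ 8 12]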
I have written a simple test (see the code below) with my two datasets (class A and class B). I added a for loop in which I select 20% of each class as test and the other 80% as training. I concatenate the training and test sets, scale them, find the best C and gamma for my RBF SVM, train the SVM, and apply it to the test set. The resulting score values go into a list. I think I'm doing something wrong, because I always get a score of 0.5 (so far I have only tried with range(3)).
I will take a look at the metrics you pointed me to, thanks for that!! Do you think StratifiedKFold is better than train_test_split? Can you see any conceptual mistake in the code below?
#TEST
X_noscaled_A = X_noscaled_A[0:100,:]
y_A = y_A[0:100,:]
X_noscaled_B = X_noscaled_B[0:100,:]
y_B = y_B[0:100,:]

#Define a list for the results
scores = list()

for i in range(3):
    #Split keeping the ratio
    X_train_noscal_A, X_test_noscal_A, y_train_A, y_test_A = train_test_split(X_noscaled_A, y_A, test_size=0.20)
    X_train_noscal_B, X_test_noscal_B, y_train_B, y_test_B = train_test_split(X_noscaled_B, y_B, test_size=0.20)

    #Concatenate in order to have just one vector for train and one vector for test
    X_train_noscal = numpy.concatenate((X_train_noscal_A, X_train_noscal_B))
    y_train = numpy.concatenate((y_train_A, y_train_B))
    X_test_noscal = numpy.concatenate((X_test_noscal_A, X_test_noscal_B))
    y_test = numpy.concatenate((y_test_A, y_test_B))

    #Scale the training set
    scaler = preprocessing.StandardScaler().fit(X_train_noscal)
    X_train = scaler.transform(X_train_noscal)

    #Scale the test set using the values obtained from the training set
    X_test = scaler.transform(X_test_noscal)

    #Optimization of C and gamma
    C_range = 10.0 ** numpy.arange(-3, 7)
    gamma_range = 10.0 ** numpy.arange(-5, 3)
    param_grid = dict(gamma=gamma_range, C=C_range)
    svr = svm.SVC()
    clfopt = grid_search.GridSearchCV(svr, param_grid)
    clfopt.fit(X_train, y_train)

    print clfopt.best_estimator_.C
    print clfopt.best_estimator_.gamma

    #Define a SVM using the best parameters C and gamma
    clf = svm.SVC(gamma=clfopt.best_estimator_.gamma, C=clfopt.best_estimator_.C)
    clf.fit(X_train, y_train)

    #Write the result in the list
    scores.append(clf.score(X_test, y_test))

#See the results
print scores
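
One hedged observation, since the data's shapes are not shown: the slicing y_A = y_A[0:100,:] suggests the label arrays are 2-D column vectors, while scikit-learn expects 1-D label arrays, and a constant 0.5 score is consistent with that. A minimal sketch of the assumed fix, flattening the labels after concatenation:

    y_train = numpy.concatenate((y_train_A, y_train_B)).ravel()
    y_test = numpy.concatenate((y_test_A, y_test_B)).ravel()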
Roban Kramer
2013-06-21 16:11:00 UTC
Oh sorry, I was thinking of balanced sets for cross validation, rather than
a training and testing split. I don't know of a convenience routine
specifically for producing stratified training and testing sets. If both
your classes have decent support and the training and testing set sizes
aren't too small then you should end up with pretty representative samples
anyway. You could check the class balance to make sure they're not too far
off. Arguably a slightly different class balance is reasonable anyway if
you are trying to check out-of-sample performance.
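
A quick sketch of that balance check, reusing the array names from the code above (and assuming 1-D integer labels):

import numpy

# Class proportions in the train and test labels should come out close
print numpy.bincount(y_train.ravel().astype(int)) / float(len(y_train))
print numpy.bincount(y_test.ravel().astype(int)) / float(len(y_test))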

-Roban



Gianni Iannelli
2013-06-21 16:23:49 UTC

Ah OK! Yeah, I was thinking that having the two classes 50/50 (or even 40/60) in my dataset would not be a problem, but since the ratio is 1/3 I would prefer to keep the same distribution for both, hence my choice of the train_test_split method. I don't know if there is something better, but this seems to work :) !!
Now I was trying to understand how to get the confusion matrix, but I ran into a problem. Below is the code (which follows on from the code posted in the previous message) and the error it raises:

    ...
    #Define a SVM using the best parameters C and gamma
    clf = svm.SVC(gamma=clfopt.best_estimator_.gamma, C=clfopt.best_estimator_.C)
    clf.fit(X_train, y_train)

    result = clf.predict(X_test)
    metrics.confusion_matrix(y_test, result)
    ...

Traceback (most recent call last):
  File "<pyshell#101>", line 1, in <module>
    metrics.confusion_matrix(y_test,result)
  File "C:\Python27_32\lib\site-packages\sklearn\metrics\metrics.py", line 610, in confusion_matrix
    y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
TypeError: unhashable type: 'numpy.ndarray'

Thanks for your Precious Support!
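
A hedged reading of that traceback: confusion_matrix iterates over y_true, and iterating over a 2-D array yields rows (arrays), which are unhashable; that matches the column-vector slicing y_A = y_A[0:100,:] in the earlier code. A minimal sketch of the assumed fix, passing 1-D label arrays:

result = clf.predict(X_test)
print metrics.confusion_matrix(y_test.ravel(), result)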

&gt; <a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>

&gt;<br></div></div>
<br>------------------------------------------------------------------------------
This SF.net email is sponsored by Windows:

Build for Windows Store.

<a href="http://p.sf.net/sfu/windows-dev2dev" target="_blank">http://p.sf.net/sfu/windows-dev2dev</a><br>_______________________________________________
Scikit-learn-general mailing list
<a href="mailto:Scikit-learn-***@lists.sourceforge.net" target="_blank">Scikit-learn-***@lists.sourceforge.net</a>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a></div></div></div> </div></div>
<br>------------------------------------------------------------------------------<br>
This SF.net email is sponsored by Windows:<br>
<br>
Build for Windows Store.<br>
<br>
<a href="http://p.sf.net/sfu/windows-dev2dev" target="_blank">http://p.sf.net/sfu/windows-dev2dev</a><br>_______________________________________________<br>
Scikit-learn-general mailing list<br>
<a href="mailto:Scikit-learn-***@lists.sourceforge.net">Scikit-learn-***@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br></blockquote></div><br></div>
<br>------------------------------------------------------------------------------
This SF.net email is sponsored by Windows:

Build for Windows Store.

http://p.sf.net/sfu/windows-dev2dev<br>_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</div></div> </div></body>
</html>
--_aedb7bf4-5018-442a-99e8-19763712f8a9_--
g***@msn.com
1970-01-01 00:00:00 UTC
Permalink
Found the error! I post the fix below. The problem is that metrics.confusion_matrix expects flat (1-D) sequences of labels: my y_test was a 2-D column array, so iterating over it yielded sub-arrays rather than labels (hence the "unhashable type: 'numpy.ndarray'" error). So I flattened everything into plain lists:
    #Compute the confusion matrix
    y_testlist_tmp = y_test.transpose().tolist()
    y_testlist = y_testlist_tmp[0]
    resultlist = result.tolist()
    cfmat = metrics.confusion_matrix(y_testlist, resultlist)
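For what it's worth, a shorter equivalent (a sketch, assuming y_test really is an (n, 1) NumPy column array) is to flatten it with ravel() rather than transposing, converting and indexing:

    #Hypothetical alternative: flatten the column vector to 1-D before the call
    cfmat = metrics.confusion_matrix(y_test.ravel(), result)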
Thanks, all!! If you have any suggestions, I'm happy to listen!
From: ***@msn.com
To: scikit-learn-***@lists.sourceforge.net
Date: Fri, 21 Jun 2013 18:23:49 +0200
Subject: Re: [Scikit-learn-general] SVM: select the training set randomly




Ah, ok! Yeah, I was thinking that a 50/50 (or even 40/60) split between the two classes in my dataset would not be a problem, but since the actual ratio is about 1/3, I would prefer to keep the same distribution for both, hence my choice of the train_test_split method. I don't know if there is something better, but this seems to work :) !!
Now I am trying to work out how to get the confusion matrix, but I ran into a problem. Below is the code (which follows on from the code posted in the previous message) and the error it raises:
    ...
    #Define an SVM using the best parameters C and gamma
    clf = svm.SVC(gamma=clfopt.best_estimator_.gamma, C=clfopt.best_estimator_.C)
    clf.fit(X_train, y_train)

    result = clf.predict(X_test)
    metrics.confusion_matrix(y_test, result)
    ...

Traceback (most recent call last):
  File "<pyshell#101>", line 1, in <module>
    metrics.confusion_matrix(y_test,result)
  File "C:\Python27_32\lib\site-packages\sklearn\metrics\metrics.py", line 610, in confusion_matrix
    y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
TypeError: unhashable type: 'numpy.ndarray'
Thanks for your precious support!
From: ***@gmail.com
Date: Fri, 21 Jun 2013 12:11:00 -0400
To: scikit-learn-***@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] SVM: select the training set randomly

Oh sorry, I was thinking of balanced sets for cross-validation, rather than a training and testing split. I don't know of a convenience routine specifically for producing stratified training and testing sets. If both your classes have decent support and the training and testing set sizes aren't too small, then you should end up with pretty representative samples anyway. You could check the class balance to make sure they're not too far off. Arguably, a slightly different class balance is reasonable anyway if you are trying to check out-of-sample performance.
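A minimal way to run that check (a sketch, assuming y_train and y_test are 1-D arrays of non-negative integer class labels):

    import numpy
    #Class proportions in each split; the two printed vectors should be close
    print numpy.bincount(y_train) / float(len(y_train))
    print numpy.bincount(y_test) / float(len(y_test))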


-Roban


On Fri, Jun 21, 2013 at 11:47 AM, Gianni Iannelli <***@msn.com> wrote:





StratifiedKFold will keep the class distribution the same for you:


http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold


I was looking at this; the documentation says:

This cross-validation object is a variation of KFold, which returns stratified folds. The folds are made by preserving the percentage of samples for each class.

But I don't understand how it can manage that, since I pass it just the training set; I also don't see how to set this percentage for each class. Am I missing something?
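For context, a minimal sketch of how StratifiedKFold is typically driven (assuming data X, a 1-D label array y, and the sklearn.cross_validation module of a 2013-era scikit-learn): you hand it the labels themselves, and it derives the per-class percentages from them, so there is no percentage parameter to set.

    from sklearn.cross_validation import StratifiedKFold
    #Each fold preserves (approximately) the class proportions found in y
    skf = StratifiedKFold(y, n_folds=5)
    for train_index, test_index in skf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]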


I have written a simple test script (see below) with my two datasets (class A and class B). I added a for loop in which I select 20% of each class as test and the other 80% as training. I concatenate the training parts and the test parts, scale both, find the best C and gamma for my RBF SVM, train the SVM, and apply it to the test set. The score values are collected in a list. I think I'm doing something wrong, because I always get a score of 0.5 (so far I have only tried with range(3)).


I will take a look at the metrics you pointed me to, thanks for that!! Do you think StratifiedKFold is better than train_test_split? Can you spot any conceptual mistake in the code below?


#TEST
import numpy
from sklearn import svm, grid_search, preprocessing
from sklearn.cross_validation import train_test_split

X_noscaled_A = X_noscaled_A[0:100,:]
y_A = y_A[0:100,:]
X_noscaled_B = X_noscaled_B[0:100,:]
y_B = y_B[0:100,:]

#Define a list for the results
scores = list()

for i in range(3):
    #Split each class separately, keeping the 80/20 ratio
    X_train_noscal_A, X_test_noscal_A, y_train_A, y_test_A = train_test_split(X_noscaled_A, y_A, test_size=0.20)
    X_train_noscal_B, X_test_noscal_B, y_train_B, y_test_B = train_test_split(X_noscaled_B, y_B, test_size=0.20)

    #Concatenate in order to have just one array for train and one array for test
    X_train_noscal = numpy.concatenate((X_train_noscal_A, X_train_noscal_B))
    y_train = numpy.concatenate((y_train_A, y_train_B))
    X_test_noscal = numpy.concatenate((X_test_noscal_A, X_test_noscal_B))
    y_test = numpy.concatenate((y_test_A, y_test_B))

    #Scale the training set
    scaler = preprocessing.StandardScaler().fit(X_train_noscal)
    X_train = scaler.transform(X_train_noscal)

    #Scale the test set using the values obtained from the training set
    X_test = scaler.transform(X_test_noscal)

    #Optimization of C and gamma
    C_range = 10.0 ** numpy.arange(-3, 7)
    gamma_range = 10.0 ** numpy.arange(-5, 3)
    param_grid = dict(gamma=gamma_range, C=C_range)
    svr = svm.SVC()
    clfopt = grid_search.GridSearchCV(svr, param_grid)
    clfopt.fit(X_train, y_train)

    print clfopt.best_estimator_.C
    print clfopt.best_estimator_.gamma

    #Define an SVM using the best parameters C and gamma
    clf = svm.SVC(gamma=clfopt.best_estimator_.gamma, C=clfopt.best_estimator_.C)
    clf.fit(X_train, y_train)

    #Write the result in the list
    scores.append(clf.score(X_test, y_test))

#See the results
print scores
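For what it's worth, the per-class split-and-concatenate dance above can be sketched more directly with StratifiedShuffleSplit (a sketch, assuming a 2013-era scikit-learn where it lives in sklearn.cross_validation with an n_iter parameter, and labels flattened to 1-D):

from sklearn.cross_validation import StratifiedShuffleSplit

X = numpy.concatenate((X_noscaled_A, X_noscaled_B))
y = numpy.concatenate((y_A, y_B)).ravel()

#Three random 80/20 splits, each preserving the class proportions of y
sss = StratifiedShuffleSplit(y, n_iter=3, test_size=0.20)
for train_index, test_index in sss:
    X_train_noscal, X_test_noscal = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #...then scale, grid-search and score exactly as above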
From: ***@gmail.com


Date: Fri, 21 Jun 2013 10:57:58 -0400
To: scikit-learn-***@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] SVM: select the training set randomly



StratifiedKFold will keep the class distribution the same for you:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold




There are lots of metrics (score functions, etc.) available:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
http://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation





See the docs for a particular estimator to find out what the score method returns (which is generally the score function used in optimizing the model). For instance
http://jaquesgrobler.github.io/Online-Scikit-Learn-stat-tut/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.score
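As a tiny illustration (a sketch, assuming a fitted classifier clf, test arrays X_test and y_test, and a scikit-learn recent enough to ship metrics.accuracy_score): for classifiers such as SVC, score returns the mean accuracy on the given data, so these two numbers should match:

    from sklearn import metrics
    print clf.score(X_test, y_test)
    print metrics.accuracy_score(y_test, clf.predict(X_test))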
-Roban
