Ludovico Coletta
2016-04-20 12:46:47 UTC
Hi guys,

I am new to Python and the scikit-learn package, so I hope someone can help me. For my master's thesis I am analyzing a neuroimaging dataset: I have 24 subjects divided into two classes (12 subjects each) that I would like to classify.
My idea is to use SelectKBest to select the best features, run a grid search over the C parameter, filter the held-out test data with the features chosen by SelectKBest, take the best C from the grid search, and use it to classify the held-out samples. To do this I have to implement two cross-validations on the same dataset: one "outer" CV to define the test sample, and a nested CV to find the best features and the best C.
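In other words, the pattern I am aiming for looks something like this toy sketch (random data and a made-up parameter grid, just to show the structure of the two loops; I am assuming the `sklearn.model_selection` module here, in older versions the same classes live in `sklearn.cross_validation` and `sklearn.grid_search` with a slightly different `StratifiedKFold` signature):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(24, 200)             # 24 subjects, toy feature count
y = np.array([0] * 12 + [1] * 12)  # two classes, 12 subjects each

pipeline = Pipeline([('sel', SelectKBest(f_classif)),
                     ('clf', SVC(kernel='linear'))])
param_grid = {'sel__k': [10, 50, 100], 'clf__C': [0.1, 1, 10]}

outer_cv = StratifiedKFold(n_splits=12)
scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # the inner (nested) CV sees only the outer training subjects
    grid = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
    grid.fit(X[train_idx], y[train_idx])
    # best_estimator_ is refit on the outer training subjects only,
    # so the outer test subjects stay unseen until this score
    scores.append(grid.score(X[test_idx], y[test_idx]))
print(np.mean(scores))
```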
As cross-validation I would like to use the stratified one. So, if I understand correctly, I have to do the following (example of the first fold of the two cross-validations): subjects 1 (subject 1 of class 1) and 13 (subject 1 of class 2) are the test set of the outer cross-validation; subjects 2 and 14 are the test sample of the nested cross-validation (for testing the best C); and subjects 3-12 and 15-24 are used for selecting the best 20000 features and the best C. I think I have done everything right up to the point where I have to filter the held-out data with the selected features. Somewhere there I am making a mistake, because I reach 100% accuracy. I also tried a different modality (other features), but I keep getting 100% accuracy.
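For reference, here is how I sanity-check the outer fold layout (a small sketch with 0-based indices, so my subjects 1 and 13 above correspond to indices 0 and 12; again written against the newer `sklearn.model_selection` API):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 12 + [1] * 12)  # 12 subjects per class
outer_cv = StratifiedKFold(n_splits=12)

# first outer fold: two held-out subjects, one per class
train_idx, test_idx = next(iter(outer_cv.split(np.zeros((24, 1)), y)))
print(test_idx)
```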
Here is the code for the first fold. Any help would be greatly appreciated.

Thank you,
Ludo
The numbers refer to the indices of the array in which I stored the data; y are the labels. Indices 0:11 --> subjects of class 1; indices 12:23 --> subjects of class 2. I did the same for every fold.

```python
import numpy as np
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

# outer cv
cv_outer = StratifiedKFold(y, 12)

# nested cv: first fold only, indices written out by hand
train_nested1 = [[2,3,4,5,6,7,8,9,10,11,14,15,16,17,18,19,20,21,22,23]]
test_nested1 = [[1,13]]
cv_nested1 = zip(train_nested1, test_nested1)

# Classifier, feature selection, hyperparameter optimization and pipeline
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

pipeline = Pipeline([('sel', SelectKBest()), ('clf', SVC(kernel='linear'))])
param_grid = [{'sel__k': [80000, 40000, 20000, 10000, 5000, 2500],
               'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
               'clf__kernel': ['linear']}]

# FOLD 1
grid_search1 = GridSearchCV(pipeline, param_grid=param_grid, verbose=1,
                            cv=cv_nested1, scoring='accuracy', n_jobs=1)
grid_search1.fit(X, y)
print(grid_search1.best_estimator_)
print(grid_search1.best_score_)
clf_final1 = grid_search1.best_estimator_  # best pipeline refit by the grid search

# Now we test the held-out data. Example of the first fold
# FOLD 1
cv_scores1 = []
a_1 = clf_final1.named_steps['sel']           # extract the selector object
b_1 = a_1.transform(X[list(cv_outer)[0][1]])  # transform the corresponding held-out data
c_1 = clf_final1.named_steps['clf']           # extract the classifier object (best C parameter)
labels_pred1 = c_1.predict(b_1)               # predict
cv_scores1.append(np.sum(labels_pred1 == y[list(cv_outer)[0][1]]))
```