Discussion:
[Scikit-learn-general] Univariate feature selection with hyperparameter estimation on a neuroimaging dataset with scikit
Ludovico Coletta
2016-04-20 12:46:47 UTC
Permalink
Hi guys,

I am new to Python and the scikit-learn package, so I hope someone can help me. For my master's thesis I am analyzing a neuroimaging dataset. I have 24 subjects divided into two classes (12 subjects each) that I would like to classify.

My idea is to use SelectKBest to select the best features, run a GridSearch for the C parameter, filter the held-out test data with the results of SelectKBest, select the best C from the GridSearch, and use it to classify the held-out samples. To do this I have to implement two cross-validations on the same dataset: one "outer" CV for defining the test sample, and a nested CV for finding the best features and the best C.

As cross-validation I would like to use the stratified one. Therefore, if I got things right, I have to do the following (example of the first fold of the two cross-validations):

subjects 1 (subject 1 of class 1) and 13 (subject 1 of class 2) as the test set of the outer cross-validation, subjects 2 and 14 as the test sample of the nested cross-validation (for testing the best C), and subjects 3-12 and 15-24 for selecting the best 20000 features and the best C. I think I have done everything right up to the point where I have to filter the held-out data with the selected features. Here I am making a mistake, because I reach 100% accuracy. I also tried to change modality (other features), but I keep getting 100% accuracy.
Here is the code for the first fold. Any help would be greatly appreciated.
Thank you,
Ludo
The numbers refer to the indices of the array in which I stored the data; y holds the labels.
0:11 --> subjects of class 1
12:23 --> subjects of class 2
I did the same for every fold.

import numpy as np
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

# outer cv
cv_outer = StratifiedKFold(y, 12)
train_nested1 = [[2,3,4,5,6,7,8,9,10,11,14,15,16,17,18,19,20,21,22,23]]
test_nested1 = [[1,13]]
cv_nested1 = zip(train_nested1, test_nested1)

# Classifiers, feature selection, hyperparameter optimization and pipeline
# import the classifier
from sklearn.svm import SVC
# Pipeline
from sklearn.pipeline import Pipeline
# import and define a feature reduction technique
from sklearn.feature_selection import SelectKBest, f_classif
pipeline = Pipeline([('sel', SelectKBest()), ('clf', SVC(kernel='linear'))])
param_grid = [{'sel__k': [80000, 40000, 20000, 10000, 5000, 2500],
               'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
               'clf__kernel': ['linear']}]

# FOLD 1
grid_search1 = GridSearchCV(pipeline, param_grid=param_grid, verbose=1,
                            cv=cv_nested1, scoring='accuracy', n_jobs=1)
grid_search1.fit(X, y)
print(grid_search1.best_estimator_)
print(grid_search1.best_score_)

# Now we test the held-out data. Example of the first fold
# FOLD 1
# (clf_final1 is the pipeline obtained from the grid search; defined earlier, not shown here)
cv_scores1 = []
a_1 = clf_final1.named_steps['sel']            # extract the selector object
b_1 = a_1.transform(X[list(cv_outer)[0][1]])   # transform the corresponding held-out data
c_1 = clf_final1.named_steps['clf']            # extract the classifier object (best C parameter)
labels_pred1 = c_1.predict(b_1)                # predict
cv_scores1.append(np.sum(labels_pred1 == y[list(cv_outer)[0][1]]))
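For reference, here is a minimal sketch of the nested scheme described above, in which each outer fold's grid search is fitted only on that fold's training subjects. It assumes X, y, pipeline and param_grid are defined as in the snippet above; names such as outer_cv and outer_scores are purely illustrative.

# Minimal sketch (not the original code): nested CV where the grid search only
# sees the outer training subjects of each fold.
# Assumes X, y, pipeline and param_grid as defined in the snippet above.
import numpy as np
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

outer_cv = StratifiedKFold(y, n_folds=12)   # one subject per class held out per fold
outer_scores = []
for train_idx, test_idx in outer_cv:
    # inner (nested) CV: stratified 3-fold on the outer training subjects
    grid = GridSearchCV(pipeline, param_grid=param_grid, scoring='accuracy', cv=3)
    grid.fit(X[train_idx], y[train_idx])
    # the refitted best pipeline already contains the fitted SelectKBest and SVC,
    # so it both filters and classifies the held-out subjects
    outer_scores.append(grid.best_estimator_.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))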
Ludovico Coletta
2016-04-22 13:15:15 UTC
Permalink
Hi everybody,
In the end, I think I (partially) solved my own problem by doing the following (maybe it can help somebody else):

# import the classifier
from sklearn.svm import SVC
# Pipeline
from sklearn.pipeline import Pipeline
# import and define a feature reduction technique
from sklearn.feature_selection import SelectKBest, f_classif
# Define the pipeline
pipeline = Pipeline([('sel', SelectKBest()), ('clf', SVC(kernel='linear'))])
param_grid = [{'sel__k': [410628, 200000, 100000, 50000, 20000, 10000, 5000, 2500],
               'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000],
               'clf__kernel': ['linear']}]
# This works. However, it would be better to pass your own cv to GridSearchCV (cv=...)
from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(pipeline, param_grid=param_grid, verbose=1,
                   scoring='accuracy', n_jobs=1)

from sklearn import cross_validation
cv = cross_validation.ShuffleSplit(len(X), n_iter=10, test_size=0.2, random_state=0)
clf.fit(X, y)  # I need to call fit because then I have to display the weights on a brain image
scores = cross_validation.cross_val_score(clf, X, y, cv=cv)
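For completeness, the array returned by cross_val_score can be summarized along these lines (purely illustrative):

# summarize the ten outer-fold accuracies returned by cross_val_score
print(scores)
print("Mean accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))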

However, I would like to ask you a couple of things:
1) My biggest concern is to avoid double dipping. Do you think that what I did above is right? Is it possible to somehow retrieve the indices of the nested samples for each of the 10 outer folds?
2) How big should the biggest C in the grid search be?

3) Is there a way to retrieve the weights of the "outer" fold(s)? I tried with the documentation, but I was unsuccessful.
Best,
Ludovico
