[Scikit-learn-general] Reproducible results of parallel cross-validation

Robert Pollak

2016-03-03 12:43:51 UTC

Hello list!

I want to use parallel cross-validation and still get reproducible results. In my code, I do:

if __name__ == '__main__':  # This is necessary to use n_jobs > 1.
    [...]
    clf = DecisionTreeClassifier(max_depth=5)
    cross_validation = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=0)
    cross_val_prediction = cross_val_predict(clf, X, y, cv=cross_validation, n_jobs=6)

However, this gives different results than with n_jobs=1!

Could it be that there is a race condition between the jobs for access to the RNG?
I noticed that when I set shuffle=False, the number of jobs does not matter.

But isn't the RNG only used for the shuffling?
And doesn't the shuffling happen _before_ launching the parallel jobs?

So: How can I get reproducible results with shuffling and parallel processing?
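
One possibility worth checking is that the shuffling is not the only consumer of random numbers. scikit-learn's decision trees randomly permute the features at each split, so a DecisionTreeClassifier created without random_state may break ties between equally good splits differently from one fit to the next. The sketch below seeds the classifier as well as the splitter and compares serial and parallel predictions. It rests on some assumptions: it uses the newer sklearn.model_selection API (in 0.17 the same names live in sklearn.cross_validation, with the StratifiedKFold(y, n_folds=...) signature shown above), the data comes from make_classification as a synthetic stand-in for the real X and y, and predict_with_jobs is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def predict_with_jobs(n_jobs):
    """Hypothetical helper: cross-validated predictions for a given n_jobs."""
    # Synthetic stand-in for the real X and y (an assumption for illustration).
    X, y = make_classification(n_samples=300, n_features=20, random_state=0)
    # Seed the tree as well as the splitter, so that each fold's fit is
    # deterministic regardless of which worker process runs it.
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_predict(clf, X, y, cv=cv, n_jobs=n_jobs)


if __name__ == '__main__':  # Needed on Windows when n_jobs > 1.
    serial = predict_with_jobs(1)
    parallel = predict_with_jobs(2)
    # Reports whether the two runs produced identical predictions.
    print(np.array_equal(serial, parallel))
```

If the two arrays match once the classifier is seeded, the difference seen earlier most likely came from the unseeded tree rather than from a race on the shuffling RNG.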

Best regards,
Robert

P.S.:
I am using:
Windows-7-6.1.7601-SP1
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)]
NumPy 1.10.4
SciPy 0.17.0
Scikit-Learn 0.17
(all from WinPython-64bit-3.5.1.2).