Discussion:
[Scikit-learn-general] optimal n_jobs in GridSearchCV
Sheila the angel
2014-08-21 10:32:08 UTC
Permalink
Hi,
Using GridSearchCV, I am trying to optimize two parameter values.
In total, I have 8 parameter combinations and am doing 4-fold cross-validation.
I want to run it in parallel environment.
My questions are:
1. What should be the n_jobs value, 8 or (8*4=) 32 ?
(I know I can specify n_jobs=-1 but due to some technical reasons, I want
to know how many jobs GridSearchCV will start.)

2. If I use a classifier such as RandomForestClassifier, where 'n_jobs'
can be specified, will it make any difference if I specify n_jobs at the
classifier level as well?
clf = RandomForestClassifier(n_jobs=-1)
grid_search = GridSearchCV(clf, param_grid, n_jobs=-1)
Will this be faster compared to GridSearchCV(RandomForestClassifier())?


Thanks

--

Sheila
Mr Samuel Hames
2014-08-21 10:46:54 UTC
Permalink
Hi,


1. The n_jobs parameter controls the number of worker processes started in
parallel. It should be set according to the number of CPU cores available on
your machine, independent of the type or size of the CV search you are trying
to run. On a typical desktop machine with four cores this might be n_jobs = 4.

2. n_jobs should really only be set in one of those places. If you were to set
(for example) n_jobs = 4 in both GridSearchCV and RandomForestClassifier, you
would end up with 16 distinct processes competing for a much smaller number of
physical cores, potentially making it slower rather than faster as all of the
processes compete with each other.
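As a minimal sketch of point 1 (the toy data and grid values are invented for illustration; note that in the 0.15-era releases GridSearchCV lived in sklearn.grid_search rather than sklearn.model_selection), setting n_jobs once at the search level might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.15

# Toy data; the grid below (4 x 2 = 8 combinations) is illustrative only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
param_grid = {"n_estimators": [10, 20, 30, 40], "max_depth": [2, 4]}

# n_jobs set once, at the search level, matched to the machine's cores (here: 4).
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=4, n_jobs=4)
search.fit(X, y)
print(search.best_params_)
```

The estimator itself is left at its default (serial) n_jobs, so only GridSearchCV spawns workers.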


Hope this helps,

Sam


Lars Buitinck
2014-08-21 11:03:13 UTC
Permalink
Post by Sheila the angel
1. What should be the n_jobs value, 8 or (8*4=) 32 ?
n_jobs is the number of CPUs you want to use, not the amount of work.
(It's a misnomer because the number of jobs/work items is variable;
the parameter determines the number of workers performing the jobs.)
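To make the distinction concrete, a small sketch (grid values invented for illustration) counting the work items that the n_jobs workers share:

```python
from itertools import product

# A hypothetical grid with 8 candidate combinations, evaluated with 4-fold CV.
param_grid = {"max_depth": [2, 4, 8, 16], "n_estimators": [10, 50]}
n_candidates = len(list(product(*param_grid.values())))  # 4 * 2 = 8 candidates
n_folds = 4

# 32 independent (candidate, fold) fits; n_jobs workers share this queue.
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)  # → 8 32
```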
Sheila the angel
2014-08-21 11:34:52 UTC
Permalink
First, thanks for the reply.

@Hames: I understand that n_jobs "should be set depending on the number of
CPU cores available on your machine". But I am running the code in a Grid
computing environment where I have to specify the number of CPUs in advance.

Does this mean that if I (reserve 64 cores and) specify n_jobs=64, the job
will take (approximately) half the time compared to n_jobs=32?

And again, just to be sure:
What if I (reserve and) specify n_jobs=35? Will all 35 cores be used
(while the maximum possible number of job combinations is 32)?

Of course, I want to avoid the situation where a core is reserved but
not used.
Post by Lars Buitinck
Post by Sheila the angel
1. What should be the n_jobs value, 8 or (8*4=) 32 ?
n_jobs is the number of CPUs you want to use, not the amount of work.
(It's a misnomer because the number of jobs/work items is variable;
the parameter determines the number of workers performing the jobs.)
------------------------------------------------------------------------------
Slashdot TV.
Video for Nerds. Stuff that matters.
http://tv.slashdot.org/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Sheila the angel
2014-08-21 12:32:26 UTC
Permalink
I still have the following doubt:

I understand that n_jobs "should be set depending on the number of CPU cores
available on your machine". But I am running the code in a Grid computing
environment where I have to specify the number of CPU cores in advance.

Does this mean that if I (reserve 64 cores and) specify n_jobs=64, the job
will take (approximately) half the time compared to n_jobs=32?

And again, just to be sure:
What if I (reserve and) specify n_jobs=35? Will all 35 cores be used
(while the maximum possible number of job combinations is 32)?

*I want to avoid the situation where a CPU core is reserved but not used.*
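A back-of-the-envelope sketch of that last question, with the numbers from this thread: joblib cannot keep more workers busy than there are independent fit tasks, so cores beyond 32 would sit idle.

```python
n_candidates, n_folds = 8, 4
n_tasks = n_candidates * n_folds   # 32 independent (candidate, fold) fits

n_jobs = 35
busy = min(n_jobs, n_tasks)        # at most 32 workers ever hold a task
idle = n_jobs - busy               # 3 reserved cores would go unused
print(busy, idle)                  # → 32 3
```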
Arnaud Joly
2014-08-21 12:44:41 UTC
Permalink
If you set n_jobs to XXX, it will spawn XXX threads or processes. Thus, you
will need to ask for XXX cores. Note that it's often possible to retrieve XXX
in your script through os.environ.

If you use fewer than the XXX cores, then you won't use all the available
CPUs. If you ask for more than XXX cores, you will start draining computing
resources from the other jobs (SLURM jobs / SGE jobs / ...) that share the
same node as your job.

Setting an appropriate XXX depends on the policy of the cluster.
Does this mean if I (reserve 64 cores and) specify n_jobs=64 the job will take (approximately) half the time compared to n_jobs=32?
With joblib, you often get close to linear scaling. But it depends upon the algorithm.
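As a sketch of the os.environ suggestion above, a hypothetical helper (reserved_cores is not a real API; which environment variable applies is scheduler-specific, and cluster policies vary):

```python
import os

def reserved_cores(default=1):
    """Best-effort guess of the core count reserved by the batch scheduler.

    Checks a few common scheduler environment variables
    (SLURM, SGE, PBS) and falls back to `default` if none is set.
    """
    for var in ("SLURM_CPUS_PER_TASK", "NSLOTS", "PBS_NP"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return default

# n_jobs = reserved_cores()  # then pass this to GridSearchCV
```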


Best regards,
Arnaud
Gael Varoquaux
2014-08-21 11:39:50 UTC
Permalink
Post by Sheila the angel
2. If I use the classifier such as RandomForestClassifier where
'n_jobs' can be specified, will it make any difference if I specify
"n_jobs" at the classifier level also-
We don't support nested parallelism, unfortunately.

G
Joel Nothman
2014-08-21 11:44:37 UTC
Permalink
Post by Gael Varoquaux
Post by Sheila the angel
2. If I use the classifier such as RandomForestClassifier where
'n_jobs' can be specified, will it make any difference if I specify
"n_jobs" at the classifier level also-
We don't support nested parallelism, unfortunately.
I think RandomForestClassifier, using multithreading in version 0.15,
should work nested in multiprocessing.
Gael Varoquaux
2014-08-21 11:46:08 UTC
Permalink
I think RandomForestClassifier, using multithreading in version 0.15, should
work nested in multiprocessing.
Good point, as it uses threading. Thus, for version 0.15, what I just
said was irrelevant.

G
Joel Nothman
2014-08-21 11:47:31 UTC
Permalink
Post by Gael Varoquaux
Post by Joel Nothman
I think RandomForestClassifier, using multithreading in version 0.15,
should work nested in multiprocessing.
Good point, as it uses threading. Thus, for version 0.15, what I just
said was irrelevant.
But it is the exception, rather than the rule!
Lars Buitinck
2014-08-21 11:52:07 UTC
Permalink
I think RandomForestClassifier, using multithreading in version 0.15, should
work nested in multiprocessing.
It would work, but the p * n threads from p processes using n threads
each would still compete for the cores, right?
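A quick sketch of that oversubscription arithmetic (the values are illustrative, echoing the 4 x 4 example from earlier in the thread):

```python
n_processes = 4          # p: n_jobs at the GridSearchCV level
threads_per_process = 4  # n: n_jobs inside each RandomForestClassifier
physical_cores = 4

workers = n_processes * threads_per_process   # p * n = 16 workers in flight
oversubscription = workers / physical_cores   # 4x more workers than cores
print(workers, oversubscription)              # → 16 4.0
```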