Discussion:
[Scikit-learn-general] using joblib.dump function in hadoop stream mode
最后的魔杰
2015-06-18 12:12:54 UTC
Permalink
Hi Experts,


Very glad to discover this mailing list, where scikit-learn users discuss interesting topics.
I'm currently working on pipelining a logistic regression model in Hadoop streaming mode.


$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar -D mapred.map.tasks=1 -file ${program_path}/sckl_LR_train_mapper.py -mapper "python sckl_LR_train_mapper.py xx 10" -file ${program_path}/sckl_LR_train_reducer.py -reducer "python sckl_LR_train_reducer.py xx 10" -input /user/hive/warehouse/Classification/Logistic_Regression/input/20150615/horseColicTraining.txt -output /output/LR/06181913


I ran into a problem when trying to dump the trained model using joblib.dump() in sckl_LR_train_mapper.py:

from sklearn import linear_model
from sklearn.externals import joblib

logisticRegression = linear_model.LogisticRegression()
model = logisticRegression.fit(train_features, train_targets)
joblib.dump(model, "/home/models/model.pkl", compress=9)


However, the job failed with the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
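One workaround in this situation is to avoid local file paths entirely and emit the serialized model through the mapper's stdout, base64-encoded so the binary pickle survives Hadoop streaming's line-oriented text transport. A minimal sketch (the function names and the "model" key are illustrative, not from the original post; pickle stands in for joblib here):

```python
import base64
import pickle
import sys

def emit_model(model, stream=sys.stdout):
    # Pickle the model, then base64-encode it so the payload contains
    # no tabs or newlines that would confuse Hadoop streaming's
    # line-oriented key/value format.
    blob = base64.b64encode(pickle.dumps(model)).decode("ascii")
    stream.write("model\t%s\n" % blob)

def load_model(line):
    # Consumer side: split the tab-delimited key/value pair and
    # decode the payload back into a Python object.
    _, blob = line.rstrip("\n").split("\t", 1)
    return pickle.loads(base64.b64decode(blob))
```

The job's -output directory then holds the encoded model as ordinary text, and a downstream step can read it back with load_model().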



At first I thought that in Hadoop streaming mode the system can't resolve the local path /home/models/model.pkl and only accepts HDFS locations, so I tried
joblib.dump(model, sys.stdout, compress=9)
but the same error was reported.
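If the goal is a persistent model file, another approach worth trying is to dump to a task-local temp file (a streaming task's working directory is writable, whereas an absolute path like /home/models may not exist on the worker nodes) and then copy it into HDFS with `hadoop fs -put`. A rough sketch, assuming pickle in place of joblib and a `hadoop` binary on the task's PATH:

```python
import os
import pickle
import subprocess
import tempfile

def hdfs_put_command(local_path, hdfs_path, hadoop_bin="hadoop"):
    # Build the HDFS copy command; -f overwrites any existing file.
    return [hadoop_bin, "fs", "-put", "-f", local_path, hdfs_path]

def save_model_to_hdfs(model, hdfs_path):
    # Dump to a task-local temp file first; streaming tasks run in a
    # scratch directory on the worker node, so only the HDFS path
    # needs to be shared across the cluster.
    fd, local_path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(model, fh)
    subprocess.check_call(hdfs_put_command(local_path, hdfs_path))
    os.remove(local_path)
```

The copied file then outlives the task container and can be fetched by the prediction job.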
I'm wondering how to chain model training, model saving, and prediction with the trained model in this setting.
Could anyone help out please?


Jackie
最后的魔杰
2015-06-18 13:23:22 UTC
Permalink
A follow-up question:
Instead of the pickle/joblib.dump approach, is it possible to export the model.coef_ values and use them to predict on new, unlabeled data?
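That is doable: a fitted LogisticRegression is fully described by model.coef_ and model.intercept_, and prediction is just the sigmoid of the linear score. A stdlib-only sketch (the coefficient values in the usage are illustrative):

```python
import math

def predict_proba(x, coef, intercept):
    # Logistic regression scores a row with sigmoid(w . x + b).
    z = sum(w * xi for w, xi in zip(coef, x)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, coef, intercept, threshold=0.5):
    # Predict class 1 when the estimated probability reaches the
    # threshold, matching scikit-learn's default 0.5 cutoff.
    return 1 if predict_proba(x, coef, intercept) >= threshold else 0
```

Export coef_ (a single row in the binary case) and intercept_ as plain numbers, e.g. to a text file on HDFS, and new rows can be scored with these two functions with no pickling involved.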



