Discussion:
[Scikit-learn-general] using joblib.dump function in hadoop stream mode
最后的魔杰
2015-06-18 12:12:54 UTC
Permalink
Hi Experts,


Very glad to discover this mailing list, where scikit-learn users discuss interesting topics.
I'm currently working on pipelining a logistic regression model in Hadoop streaming mode.


$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar -D mapred.map.tasks=1 -file ${program_path}/sckl_LR_train_mapper.py -mapper "python sckl_LR_train_mapper.py xx 10" -file ${program_path}/sckl_LR_train_reducer.py -reducer "python sckl_LR_train_reducer.py xx 10" -input /user/hive/warehouse/Classification/Logistic_Regression/input/20150615/horseColicTraining.txt -output /output/LR/06181913


I ran into a problem when trying to dump the trained model using joblib.dump() in sckl_LR_train_mapper.py:

from sklearn import linear_model
from sklearn.externals import joblib

logisticRegression = linear_model.LogisticRegression()
model = logisticRegression.fit(train_features, train_targets)
joblib.dump(model, "/home/models/model.pkl", compress=9)


However, the job failed with the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
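One workaround in this situation is to avoid local file paths entirely and emit the serialized model through the mapper's stdout, base64-encoded so the binary pickle survives Hadoop streaming's line-oriented text transport. A minimal sketch (the function names and the "model" key are illustrative, not from the original post; pickle stands in for joblib here):

```python
import base64
import pickle
import sys

def emit_model(model, stream=sys.stdout):
    # Pickle the model, then base64-encode it so the payload contains
    # no tabs or newlines that would confuse Hadoop streaming's
    # line-oriented key/value format.
    blob = base64.b64encode(pickle.dumps(model)).decode("ascii")
    stream.write("model\t%s\n" % blob)

def load_model(line):
    # Consumer side: split the tab-delimited key/value pair and
    # decode the payload back into a Python object.
    _, blob = line.rstrip("\n").split("\t", 1)
    return pickle.loads(base64.b64decode(blob))
```

The job's -output directory then holds the encoded model as ordinary text, and a downstream step can read it back with load_model().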



At first I thought that in Hadoop streaming mode the system can't resolve the local path /home/models/model.pkl and only accepts HDFS locations, so I tried
joblib.dump(model, sys.stdout, compress=9)
but the same error was reported.
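If the goal is a persistent model file, another approach worth trying is to dump to a task-local temp file (a streaming task's working directory is writable, whereas an absolute path like /home/models may not exist on the worker nodes) and then copy it into HDFS with `hadoop fs -put`. A rough sketch, assuming pickle in place of joblib and a `hadoop` binary on the task's PATH:

```python
import os
import pickle
import subprocess
import tempfile

def hdfs_put_command(local_path, hdfs_path, hadoop_bin="hadoop"):
    # Build the HDFS copy command; -f overwrites any existing file.
    return [hadoop_bin, "fs", "-put", "-f", local_path, hdfs_path]

def save_model_to_hdfs(model, hdfs_path):
    # Dump to a task-local temp file first; streaming tasks run in a
    # scratch directory on the worker node, so only the HDFS path
    # needs to be shared across the cluster.
    fd, local_path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(model, fh)
    subprocess.check_call(hdfs_put_command(local_path, hdfs_path))
    os.remove(local_path)
```

The copied file then outlives the task container and can be fetched by the prediction job.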
I'm wondering how to chain model training, model saving, and prediction with the trained model in this setting.
Could anyone help out please?


Jackie
最后的魔杰
2015-06-18 13:23:22 UTC
Permalink
A follow-up question:
Instead of the pickle/joblib.dump approach, is it possible to export the model.coef_ values and use them to predict on new, unlabeled data?
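That is doable: a fitted LogisticRegression is fully described by model.coef_ and model.intercept_, and prediction is just the sigmoid of the linear score. A stdlib-only sketch (the coefficient values in the usage are illustrative):

```python
import math

def predict_proba(x, coef, intercept):
    # Logistic regression scores a row with sigmoid(w . x + b).
    z = sum(w * xi for w, xi in zip(coef, x)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, coef, intercept, threshold=0.5):
    # Predict class 1 when the estimated probability reaches the
    # threshold, matching scikit-learn's default 0.5 cutoff.
    return 1 if predict_proba(x, coef, intercept) >= threshold else 0
```

Export coef_ (a single row in the binary case) and intercept_ as plain numbers, e.g. to a text file on HDFS, and new rows can be scored with these two functions with no pickling involved.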



