最后的魔杰
2015-06-18 12:12:54 UTC
Hi Experts,
I'm very glad to have found this mailing list, where scikit-learn users discuss interesting topics.
I'm currently working on building a pipeline for a logistic regression model in Hadoop streaming mode.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar -D mapred.map.tasks=1 -file ${program_path}/sckl_LR_train_mapper.py -mapper "python sckl_LR_train_mapper.py xx 10" -file ${program_path}/sckl_LR_train_mapper.py -reducer "python sckl_LR_train_reducer.py xx 10" -input /user/hive/warehouse/Classification/Logistic_Regression/input/20150615/horseColicTraining.txt -output /output/LR/06181913
I ran into a problem when trying to dump a trained model with joblib.dump() in sckl_LR_train_mapper.py:
logisticRegression = linear_model.LogisticRegression()
model = logisticRegression.fit(train_features, train_targets)
joblib.dump(model,"/home/models/model.pkl",compress=9)
However, the job failed with the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
At first I thought that in Hadoop streaming mode the system cannot see the local path /home/models/model.pkl and can only refer to HDFS locations, so I tried
joblib.dump(model,sys.stdout,compress=9)
But the same error was reported.
I'm wondering how to chain model training, model saving, and prediction with the trained model in this setup.
Could anyone help out please?
Jackie
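For what it's worth, here is a minimal sketch of one possible direction, not a confirmed fix for the error above. Hadoop streaming reads the mapper's stdout as line-oriented key/value text (tab-separated by default), so a compressed binary dump written there would not form a valid record. The sketch pickles an object and base64-encodes it so that it fits on a single text line. The names encode_model, decode_model and emit_model, the "model" key, and the dictionary standing in for the fitted estimator are all hypothetical; a fitted LogisticRegression could be passed to pickle in the same way.

```python
import base64
import pickle
import sys


def encode_model(model):
    """Pickle an object and base64-encode it into one ASCII line."""
    raw = pickle.dumps(model, protocol=2)
    return base64.b64encode(raw).decode("ascii")


def decode_model(line):
    """Invert encode_model: base64-decode one line and unpickle it."""
    return pickle.loads(base64.b64decode(line.strip()))


def emit_model(model, stream=sys.stdout):
    """Write the model as a single tab-separated key/value record."""
    stream.write("model\t" + encode_model(model) + "\n")


if __name__ == "__main__":
    # Stand-in for the fitted estimator (hypothetical values).
    fitted = {"coef": [0.5, -1.25], "intercept": 0.1}
    emit_model(fitted)
```

A reducer, or a later prediction job, could then split each input line on the first tab and call decode_model on the value. Separately, "subprocess failed with code 1" only says that the Python process exited with a non-zero status; the actual Python traceback is normally written to stderr and should appear in the task's stderr log.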