[Scikit-learn-general] Scikit-learn standards for serializing/saving objects

Discussion:

Keith Lehman

2016-03-23 16:47:21 UTC

Hi:

I'm fairly new to scikit-learn, Python, and machine learning. This community has built a great set of libraries, though, and that is actually a large part of the reason why my company has selected Python to experiment with ML.

As we are developing our product, however, we keep running into trouble saving various objects. When possible, we use pickle to save the objects, but this can cause problems in development: objects saved during a debug session cannot be loaded outside of the debugger. The reason appears to be that even when pickling a "pickleable" object (such as a trained LinearRegression), pickle finds and saves more primitive objects that have been instantiated within the debug environment. dill and cPickle have the same issue. My question is, does the scikit-learn community plan to add standard load/save or dump/dumps and load/loads methods that would not create these dependencies?

If there is a better forum for posting questions like these, please let me know and I'll be happy to post there instead.

Thanks!

Keith Lehman
Cell: 617-834-2863
Skype: k.lehman
e-mail: ***@intercapenergy.com

Sebastian Raschka

2016-03-23 20:05:16 UTC

I also had some issues with pickle in the past and have to admit that I actually don't trust pickle files ;). Maybe I am too paranoid, but I am always afraid of corrupting or losing the data.
Probably not the most elegant solution, but I typically store estimator settings and model parameters as JSON files (they stay human-readable in the worst-case scenario, which matters with "reproducible research" in mind ;)).

For example:

# Model fitting and saving params to JSON

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
regr = LinearRegression()
regr.fit(X, y)

import json

with open('./params.json', 'w', encoding='utf-8') as outfile:
    json.dump(regr.get_params(), outfile)

with open('./weights.json', 'w', encoding='utf-8') as outfile:
    json.dump(regr.coef_.tolist(), outfile, separators=(',', ':'), sort_keys=True, indent=4)

with open('./intercept.json', 'w', encoding='utf-8') as outfile:
    json.dump(regr.intercept_, outfile)

# In a new session: load the params from the JSON files

import json
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
import numpy as np

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Read each file inside a with block so the handle is closed afterwards
with open('./params.json', 'r', encoding='utf-8') as infile:
    params = json.load(infile)

with open('./weights.json', 'r', encoding='utf-8') as infile:
    weights = json.load(infile)

with open('./intercept.json', 'r', encoding='utf-8') as infile:
    intercept = json.load(infile)

regr = LinearRegression()
regr.set_params(**params)
regr.intercept_, regr.coef_ = intercept, np.array(weights)

regr.predict(X[:10])

array([ 206.11706979, 68.07234761, 176.88406035, 166.91796559,
128.45984241, 106.34908972, 73.89417947, 118.85378669,
158.81033076, 213.58408893])

In any case, I know that this isn't pretty, and I would also be looking forward to a better solution!

Best,
Sebastian Raschka

Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Keith Lehman

2016-03-24 01:45:51 UTC

Thanks Sebastian.

This is basically what we are doing too. The hard, time-consuming part is determining which attributes of each scikit-learn object need to be saved and how best to extract them.

- Keith
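
On the question of which attributes to save: scikit-learn's naming convention is that values learned during fit end with a trailing underscore (coef_, intercept_), while constructor settings do not. The sketch below uses that convention to collect the fitted state generically. The names fitted_state and TinyModel are made up for illustration, and TinyModel is only a stand-in, not a scikit-learn class. This is an assumption-laden starting point: some estimators hold fitted state that is not JSON-friendly (nested estimators, tree structures) or keep it in private attributes, so the result needs checking per estimator.

```python
import json


def fitted_state(estimator):
    """Collect attributes that follow the trailing-underscore convention."""
    state = {}
    for name, value in vars(estimator).items():
        # Fitted attributes end with "_"; private ones start with "_".
        if name.endswith("_") and not name.startswith("_"):
            # NumPy arrays and scalars expose tolist(); other values pass through.
            state[name] = value.tolist() if hasattr(value, "tolist") else value
    return state


class TinyModel:
    """Hypothetical stand-in for a fitted estimator (not a scikit-learn class)."""

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept  # hyperparameter: no trailing underscore

    def fit(self):
        self.coef_ = [0.5, -1.25]  # learned values: trailing underscore
        self.intercept_ = 3.0
        return self


model = TinyModel().fit()
state = fitted_state(model)
document = json.dumps(state, sort_keys=True)
print(document)
```

Arrays come back from JSON as plain lists, so they would have to be converted again (for example with np.array) before being assigned to a fresh estimator, as in Sebastian's example.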

Chris Hausler

2016-03-24 02:04:03 UTC

We also have similar issues. It'd be great to hear any cool solutions :-)

Joel Nothman

2016-03-24 03:30:27 UTC

I think all the scikit-learn devs know that the serialisation available in
scikit-learn is inadequate, and recommend storing training data and model
parameters.

Designing a serialisation format that is robust to future changes is a huge
engineering effort, and is likely to result in one of: (a) a framework that
has all the power and hence faults of pickling; (b) an implementation that
is limited to only some parameter values on some estimators; or (c) a
specialised, over-engineered monolith that we can't afford to maintain.

One approach mooted time and again is supporting export to a
framework-independent model description language, like PMML. For this see
the work begun at https://github.com/alex-pirozhenko/sklearn-pmml. The
intention here, however, is not especially to re-load the models in
scikit-learn, but to perform prediction with scikit-learn-fitted models in
other frameworks.
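
The recommendation to store training data and model parameters, rather than the fitted object, can be sketched roughly as follows. Everything named here (MeanRegressor, save_recipe, refit_from_recipe, recipe.json) is hypothetical; MeanRegressor is a toy stand-in that only mimics the get_params/fit/predict interface. The approach assumes that fitting is deterministic (for example, a fixed random_state) and that the data fits comfortably in a JSON file.

```python
import json
import os
import tempfile


class MeanRegressor:
    """Hypothetical stand-in exposing get_params/fit/predict like an estimator."""

    def __init__(self, shift=0.0):
        self.shift = shift

    def get_params(self):
        return {"shift": self.shift}

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y) + self.shift
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def save_recipe(path, estimator, X, y):
    """Write hyperparameters and training data, not the fitted object."""
    recipe = {"params": estimator.get_params(), "X": X, "y": y}
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(recipe, outfile)


def refit_from_recipe(path, estimator_cls):
    """Rebuild the model by refitting on the stored data."""
    with open(path, "r", encoding="utf-8") as infile:
        recipe = json.load(infile)
    return estimator_cls(**recipe["params"]).fit(recipe["X"], recipe["y"])


X, y = [[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0]
original = MeanRegressor(shift=1.0).fit(X, y)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "recipe.json")
    save_recipe(path, original, X, y)
    restored = refit_from_recipe(path, MeanRegressor)

print(restored.predict([[10.0]]) == original.predict([[10.0]]))
```

The trade-off is that loading costs a full refit, but nothing in the stored file depends on the internal layout of the fitted object.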

Andreas Mueller

2016-03-24 21:40:34 UTC

Can you give a simple example for reproducing this problem?
I haven't heard of this particular issue.
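
No reproduction was posted in the thread, so the following is not Keith's failure case. It only illustrates one documented property of pickle that could plausibly be involved: classes are stored by reference (module name plus class name), not by value. If a debug session makes objects resolve to module names that do not exist in a normal run, loading would fail there; whether that is what happened here is an assumption.

```python
import pickle
from fractions import Fraction

payload = pickle.dumps(Fraction(1, 3))

# The payload names the module and class to import at load time;
# it does not contain the class definition itself.
has_module = b"fractions" in payload
has_class = b"Fraction" in payload
print(has_module, has_class)

# Loading works only because "fractions.Fraction" is importable here.
restored = pickle.loads(payload)
print(restored == Fraction(1, 3))
```

Running pickletools.dis(payload) on a problematic file is one way to see which module names it expects to import.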
