[Scikit-learn-general] Using logistic regression on a continuous target variable

Discussion:

George Bezerra

2015-10-04 01:58:08 UTC

Hi there,

I would like to train a logistic regression model on a continuous (i.e.,
not categorical) target variable. The target is a probability, which is why
I am using a logistic regression for this problem. However, the sklearn
function tries to find the class labels by running a unique() on the target
values, which is disastrous if y is continuous.

Is there a way to train logistic regression on a continuous target variable
in sklearn?

Any help is highly appreciated.

Best,

George.

--
George Bezerra
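
To illustrate the problem described above, here is a minimal NumPy sketch of why collecting class labels with unique() breaks down on a continuous target. The arrays and variable names are made up for illustration; this only mimics the label-discovery step and is not scikit-learn's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# A genuine classification target: only two distinct values.
y_binary = rng.integers(0, 2, size=1000)

# A probability-valued target: essentially every value is distinct.
y_continuous = rng.uniform(0.0, 1.0, size=1000)

# unique() finds a handful of labels in the first case and
# (practically) one "class" per sample in the second.
n_classes_binary = np.unique(y_binary).size
n_classes_continuous = np.unique(y_continuous).size

print(n_classes_binary, n_classes_continuous)
```

A classifier that treats each distinct value as its own class would end up with about as many classes as training samples, which is why a continuous y cannot simply be passed through as labels.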

Sebastian Raschka

2015-10-04 02:50:46 UTC

Hi, George,
Logistic regression is a binary classifier by nature (class labels 0 and 1). Scikit-learn supports multi-class classification via One-vs-One or One-vs-All, though, and there is a generalization (softmax) that gives you meaningful probabilities for multiple classes (i.e., class probabilities sum to 1). In any case, logistic regression works with nominal class labels: categorical class labels with no order implied.

To keep a long story short: logistic regression is a classifier, not a regressor; the name is misleading, I agree. I think you may want to look into regression analysis for your continuous target variable.

Best,
Sebastian

George Bezerra

2015-10-04 03:07:44 UTC

Thanks Sebastian.

I am trying to follow this paper:
http://research.microsoft.com/en-us/um/people/mattri/papers/www2007/predictingclicks.pdf
(check out section 6.2). They use logistic regression as a regression model
to predict the click through rate (which is continuous).

A linear regression model will violate the assumption that probabilities
vary between 0 and 1 (it will give me values outside this range in some
cases). I would think it is in principle possible to solve the logistic
regression for a continuous value, although scikit doesn't support it.
Perhaps I'm wrong.

Thanks again,

George.

--
George Bezerra
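
The range problem raised in this post can be shown with a small NumPy sketch. The data are simulated, and fitting ordinary least squares to the logit of the target is just one simple way to keep predictions inside (0, 1); it is an assumption for illustration, not necessarily what the linked paper does.

```python
import numpy as np

# Simulated probabilities that follow a logistic curve in x.
x = np.linspace(-4.0, 4.0, 200)
p = 1.0 / (1.0 + np.exp(-1.5 * x))
X = np.column_stack([np.ones_like(x), x])  # intercept + feature

# Plain linear regression on the probabilities themselves.
coef_lin, *_ = np.linalg.lstsq(X, p, rcond=None)
pred_lin = X @ coef_lin

# Linear regression on the logit of the target, mapped back
# through the sigmoid so predictions stay strictly inside (0, 1).
eps = 1e-6
p_clip = np.clip(p, eps, 1.0 - eps)
z = np.log(p_clip / (1.0 - p_clip))
coef_logit, *_ = np.linalg.lstsq(X, z, rcond=None)
pred_logit = 1.0 / (1.0 + np.exp(-(X @ coef_logit)))

print("linear fit range:", pred_lin.min(), pred_lin.max())
print("logit fit range: ", pred_logit.min(), pred_logit.max())
```

The straight line leaves the unit interval at both ends of x, while anything passed through the sigmoid cannot.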

George Bezerra

2015-10-04 03:09:50 UTC

*I meant section 5.

--
George Bezerra

Sturla Molden

2015-10-05 22:15:15 UTC

Post by George Bezerra
http://research.microsoft.com/en-us/um/people/mattri/papers/www2007/predictingclicks.pdf
(check out section 6.2). They use logistic regression as a regression
model to predict the click through rate (which is continuous).

I am not sure what to think about this, though I don't have time to look
at it in detail. But modelling rates is usually a case for Poisson
regression rather than logistic regression. Rate and probability are not
the same.

Post by George Bezerra
A linear regression model will violate the assumption that probabilities
vary between 0 and 1 (it will give me values outside this range in some
cases). I would think it is in principle possible to solve the logistic
regression for a continuous value, although scikit doesn't support it.

The word you are looking for is 'generalized linear model'.

Sturla

j***@gmail.com

2015-10-05 22:35:46 UTC

Post by George Bezerra
http://research.microsoft.com/en-us/um/people/mattri/papers/www2007/predictingclicks.pdf
(check out section 6.2). They use logistic regression as a regression
model to predict the click through rate (which is continuous).

Rate in the sense of proportion is between 0 and 1: y percent of all
users who are at this stage click or buy.
Any continuous response on a known interval can be mapped to [0, 1] and
modeled with logistic regression (or GLM Binomial in general).
Poisson is for non-negative numbers (integer or float) without a (known)
upper bound.

One distribution that is defined for continuous
proportions/rates/probabilities would be Beta, and BetaRegression would be
the two parameter regression model.

Josef

Sturla Molden

2015-10-06 02:05:42 UTC

Post by j***@gmail.com
rate in the sense of proportion is between zero and 1.

Rate usually refers to "events per unit of time or exposure", so we can
either count events in intervals or record time-stamps as our dependent
variable. If the stochastic counting process is memoryless we have a
Poisson process.

Poisson regression can often be used to model this type of data.

Rate in the sense of proportion between 0 and 1 is not really a rate.
But sure, there are many ways to model such data, including assuming a
beta distribution for the proportion.

Sturla
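
The Poisson view of a rate as "events per unit of exposure" can be sketched as a regression of click counts with the number of impressions as exposure. The data are simulated, and the names fit_poisson_rate, clicks and impressions are hypothetical; the fit is plain Newton-Raphson in NumPy, not a scikit-learn or statsmodels routine.

```python
import numpy as np

def fit_poisson_rate(X, counts, exposure, n_iter=25):
    """Poisson regression with log link and log(exposure) as offset.

    Assumes the first column of X is an intercept.
    """
    offset = np.log(exposure)
    beta = np.zeros(X.shape[1])
    # Start the intercept at the log of the overall event rate.
    beta[0] = np.log(counts.sum() / exposure.sum())
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)      # expected counts
        score = X.T @ (counts - mu)         # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])      # Fisher information
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
impressions = rng.integers(100, 1000, size=n).astype(float)

# Simulated click-through rate of roughly 3% at x = 0.
beta_true = np.array([np.log(0.03), 0.5])
clicks = rng.poisson(impressions * np.exp(X @ beta_true))

beta_hat = fit_poisson_rate(X, clicks, impressions)
predicted_rate = np.exp(X @ beta_hat)  # clicks per impression
print(beta_hat)
```

Nothing in this model caps the predicted rate at 1, so it is only a reasonable stand-in for a proportion when the rates are small.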

j***@gmail.com

2015-10-06 02:38:49 UTC

Post by Sturla Molden

Post by j***@gmail.com
rate in the sense of proportion is between zero and 1.

Rate usually refers to "events per unit of time or exposure", so we can
either count events in intervals or record time-stamps as our dependent
variable. If the stochastic counting process is memoryless we have a
Poisson process.
Poisson regression can often be used to model this type of data.
Rate in the sense of proportion between 0 and 1 is not really a rate.
But sure, there are many ways to model such data, including assuming a
beta distribution for the proportion.

I have seen lots of variation on the various terms across fields.
In the current context click-through-rate is a conditional probability,
AFAICS from a quick browsing of the article.

Since the rate/probability is pretty low (2-5%), I guess the constraint < 1
won't be relevant, and any regression method for a non-negative response
should work, including Poisson. What the local nonlinearity should be
(which link function, in GLM terms) might be more relevant.
For classification it sounds like a very unbalanced case.

Josef

j***@gmail.com

2015-10-04 03:20:16 UTC

Just to come in here as an econometrician and statsmodels maintainer.

statsmodels intentionally doesn't enforce binary data for Logit or similar
models, any data between 0 and 1 is fine.

Logistic Regression/Logit or similar Binomial/Bernoulli models can
consistently estimate the expected value (predicted mean) for a continuous
variable that is between 0 and 1 like a proportion. (Binomial belongs to
the exponential family where quasi-maximum likelihood method works well.)
Inference has to be adjusted because a logit model cannot be "true" if the
data is not binary.

I have references and examples for this use case somewhere.

statsmodels doesn't do "classification", i.e. hard thresholding; users can
do it themselves if they need to.
That means we leave classification to scikit-learn and only do regression,
even for funny data, and statsmodels doesn't have methods that take
advantage of the classification structure of a model.

Josef
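
A minimal NumPy sketch of the idea described above: maximizing the Bernoulli (logit) quasi-likelihood on a target that is a proportion rather than 0/1, with a sandwich ("robust") covariance as one common way to adjust inference. This is not statsmodels code; fit_fractional_logit and the simulated Beta-distributed data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_fractional_logit(X, y, n_iter=25):
    """Newton-Raphson on the logit quasi-log-likelihood; y may lie anywhere in [0, 1]."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)
        score = X.T @ (y - mu)
        hessian = X.T @ (X * (mu * (1.0 - mu))[:, None])
        beta = beta + np.linalg.solve(hessian, score)

    # Sandwich covariance: the model-based variance mu * (1 - mu) is not
    # trusted, because y is not actually Bernoulli distributed.
    mu = sigmoid(X @ beta)
    bread = np.linalg.inv(X.T @ (X * (mu * (1.0 - mu))[:, None]))
    meat = X.T @ (X * ((y - mu) ** 2)[:, None])
    cov_robust = bread @ meat @ bread
    return beta, cov_robust

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-1.0, 0.8])
mu_true = sigmoid(X @ beta_true)

# Continuous proportions whose conditional mean follows the logit model.
precision = 20.0
y = rng.beta(mu_true * precision, (1.0 - mu_true) * precision)

beta_hat, cov_robust = fit_fractional_logit(X, y)
se_robust = np.sqrt(np.diag(cov_robust))
print(beta_hat, se_robust)
```

The coefficients of the mean function are recovered even though the data are Beta distributed, which is the consistency property mentioned above; only the standard errors need the robust treatment.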

George Bezerra

2015-10-04 03:54:19 UTC

Thanks a lot Josef. I guess it is possible to do what I wanted, though
maybe not in scikit. Does the statsmodels version allow l1 or l2
regularization? I'm planning to use a lot of features and let the model
decide what is good.

Thanks again.

--
George Bezerra

j***@gmail.com

2015-10-04 04:19:34 UTC

Post by George Bezerra
Thanks a lot Josef. I guess it is possible to do what I wanted, though
maybe not in scikit. Does the statsmodels version allow l1 or l2
regularization? I'm planning to use a lot of features and let the model
decide what is good.

statsmodels has had L1 regularization for discrete models including Logit
for a while. But I don't have much experience with it, and it uses an
interior point algorithm.
Elastic net for maximum likelihood models using coordinate descent, and
other penalized maximum likelihood methods like SCAD and structured L2, are
in PRs and will be merged over the next months.

statsmodels, in contrast to scikit-learn, doesn't have much support for
large sparse features.

Josef

Michael Eickenberg

2015-10-04 06:11:53 UTC

Hi George,

completely agreed that np.unique on continuous targets is messy - I have
run into the same problem.

If I remember correctly, you can work around this by using sample_weight to
inject the continuous target into the cross entropy loss:

If p_i are the targets, duplicate each sample: give the original label 1
with sample weight p_i, and give the duplicate label 0 with sample weight
1 - p_i.

There is a stackoverflow comment or answer by larsmans pertaining to this,
but I can't find it right now.

Hope this helps!
Michael
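
The workaround above can be sketched as follows, assuming a scikit-learn version whose LogisticRegression.fit accepts sample_weight. The helper name fit_on_probabilities and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_on_probabilities(X, p, C=1.0):
    """Fit LogisticRegression to targets p in [0, 1] by duplicating samples."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    n = X.shape[0]

    # Each sample appears twice: once as class 1 with weight p_i,
    # once as class 0 with weight 1 - p_i.
    X_dup = np.vstack([X, X])
    y_dup = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    w_dup = np.concatenate([p, 1.0 - p])

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_dup, y_dup, sample_weight=w_dup)
    return model

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -2.0]) + 0.5)))

# A large C keeps the L2 penalty weak so the fit can track the targets.
model = fit_on_probabilities(X, p, C=1e4)
p_hat = model.predict_proba(X)[:, 1]  # column 1 is P(class == 1)
print(np.abs(p_hat - p).max())
```

The weighted log loss on the duplicated data equals the cross entropy between p_i and the predicted probability, so this fits the same model as a logistic regression on soft targets while still passing only 0/1 labels to the classifier.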

Mathieu Blondel

2015-10-04 08:20:48 UTC

I've seen logistic regression used in a regression setting in a few papers
as well. A nice thing is that the predictions are mapped to [0, 1].

The correct way to add this to scikit-learn would be to add a regression
class `LogisticRegressor` and rename the existing class to
`LogisticClassifier`. The np.unique check would be only in the classifier.

We can also add it to SGDRegressor.

Mathieu

Andreas Mueller

2015-10-05 18:18:09 UTC

Post by Michael Eickenberg
Hi George,
completely agreed that np.unique on continuous targets is messy - I
have run into the same problem.

It's fixed here:
https://github.com/scikit-learn/scikit-learn/pull/5084
