[Scikit-learn-general] GSoC suggestions : work on various stalled PRs and issues

Discussion:

Maniteja Nandana

2016-03-21 18:35:44 UTC

Hello everyone,

My name is Maniteja, a senior-year computer science student from India (
github <https://github.com/maniteja123>).
It has been a wonderful learning opportunity contributing to the library
for the past few months, and I would like to thank everyone for their support
and for patiently answering my questions. I am really eager to contribute more
to the best of my abilities. Since it was proposed to work on existing PRs, I have
also added a more detailed version here
<https://github.com/maniteja123/scikit-learn/wiki/Various-enhancements-to-scikit-learn>.

I wanted to seek feedback on the following issues and PRs. If any of the
authors of the following PRs are interested in working on their PRs, please let
me know. I am sorry for not asking prior permission; I couldn't
contact each of you and also didn't want to create noise by commenting on
all the PRs. I hope you understand. If it is okay for me to try working on
these, please let me know your opinions and suggestions.

Semi-supervised Naive Bayes using Expectation Maximization #430
<https://github.com/scikit-learn/scikit-learn/pull/430>
Meta estimator for self trained model #1243
<https://github.com/scikit-learn/scikit-learn/issues/1243>
Use Bayesian priors in Nearest Neighbors classifier #399
<https://github.com/scikit-learn/scikit-learn/issues/399> #970
<https://github.com/scikit-learn/scikit-learn/pull/970>
Classifier Chain for multi-label problems PRs: #3727
<https://github.com/scikit-learn/scikit-learn/pull/3727> #4759
<https://github.com/scikit-learn/scikit-learn/issues/4759>
Label power set multilabel classification strategy PRs: #2461
<https://github.com/scikit-learn/scikit-learn/pull/2461>
Multioutput bagging #4848
<https://github.com/scikit-learn/scikit-learn/pull/4848>
Added 'average' option to passive aggressive classifier/regressor. #4939
<https://github.com/scikit-learn/scikit-learn/pull/4939>
Add "grouped" option to Scaler classes: #4963
<https://github.com/scikit-learn/scikit-learn/pull/4963>
Metric precision at k score #4975
<https://github.com/scikit-learn/scikit-learn/4975>
Implement haversine metric in pairwise #4458
<https://github.com/scikit-learn/scikit-learn/pull/4458> #4453
<https://github.com/scikit-learn/scikit-learn/issues/4453>
Add KNN strategy for imputation #4844
<https://github.com/scikit-learn/scikit-learn/pull/4844>
Add resample to preprocessing. #1454
<https://github.com/scikit-learn/scikit-learn/pull/1454> #6568
<https://github.com/scikit-learn/scikit-learn/issues/6568>
Added metrics support for multiclass-multioutput classification #3681
<https://github.com/scikit-learn/scikit-learn/pull/3681>
random neural network algorithm #4703
<https://github.com/scikit-learn/scikit-learn/pull/4703>

Thank you for your time. I look forward to hearing back from you!

Yours sincerely,
Maniteja.

Raghav R V

2016-03-23 11:55:59 UTC

Hey Maniteja,

Having taken a quick look at the list... my thoughts -

* The KNN imputation is an important addition that got stalled.
* The semi-supervised NB with EM seems like a good addition; Olivier,
Larsmans (and Joel?) would have to comment on whether it should be a priority.
* The haversine metric is tagged "easy".
* "Meta-estimator for semi-supervised learning" is not hard, but I believe it
is API heavy and would involve devoting a considerable amount of time to API
discussions...
* "Label power set multilabel classification strategy" doesn't look like a
priority.
* I am not very sure if infomax ICA had good interest among core devs.
* *I think* people were pretty interested in Metric Learning NCA and Matrix
completion with missing values, but I believe they are math heavy. Make
sure you can handle that! Ping Olivier if you need more information.

Also please note that the proposal needs to have a central theme like
"Improvements in linear models" or "Improvements in tree models" and you
should propose to complete the stalled PRs under that theme...

Thanks for the mail! Good luck on your proposal! Please note that the
deadline is the 25th of this month!

Raghav

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Maniteja Nandana

2016-03-23 16:46:48 UTC

Hi Raghav,

Thanks a lot for your reply. That helps so much.

I understand that the proposal should be specific to a module, but right now
I am not sure which of these implementations are the most sought-after. I
will update the proposal based on the inputs.

I have also looked at the stalled PRs on Metric Learning NCA and Matrix
Completion for missing values, but they are heavy on math. If they are of
utmost importance, I would gladly spend time reading through the reference
papers.

I would really appreciate any other feedback on this proposal.

Thank you again for your time!

Best regards,
Maniteja.

Raghav R V

2016-03-25 15:11:25 UTC

Hey Maniteja,

I took a look at your proposal. As I said before, I feel it is a bit broad
and you should try to narrow it down to a good theme.

Since you have chosen more than one PR related to missing values, I
have a suggestion for a theme -

"Better Missing Value Handling"

You could group the KNN imputation, matrix factorization with missing
values, and *outputting dummy one-hot encoded features for the imputer to
specify whether each feature value is imputed or not. Implementing these properly
and merging them should be sufficient for a GSoC, I feel. As an optional thing,
you could add another imputation strategy.

*I'll raise an issue so you understand that better.

Thanks,

Raghav R V
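
As a rough illustration of the KNN imputation idea named above, here is a minimal NumPy sketch. It is not the implementation from the stalled PR. The function name `knn_impute`, the `n_neighbors` parameter, and the choice of distance (mean squared difference over the features observed in both rows) are all assumptions made for this example.

```python
import numpy as np


def knn_impute(X, n_neighbors=2):
    """Fill each NaN with the mean of that feature over the nearest rows.

    Hypothetical helper for illustration only; not scikit-learn API.
    Distance between two rows is the mean squared difference over the
    features that are observed in both rows (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_filled = X.copy()

    for i in np.flatnonzero(missing.any(axis=1)):
        # Features observed both in row i and in each candidate row.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = (diff ** 2).sum(axis=1) / n_shared
        dist[n_shared == 0] = np.inf  # no common features: not comparable
        dist[i] = np.inf              # a row is never its own neighbor
        order = np.argsort(dist)

        for j in np.flatnonzero(missing[i]):
            # Nearest comparable rows that actually observe feature j.
            donors = [k for k in order
                      if not missing[k, j] and np.isfinite(dist[k])]
            donors = donors[:n_neighbors]
            if donors:  # leave the NaN in place if no donor exists
                X_filled[i, j] = X[donors, j].mean()
    return X_filled


X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [1.1, 2.1, 5.0],
              [10.0, 10.0, 100.0]])
X_knn = knn_impute(X, n_neighbors=2)
```

In this toy input the missing entry in the first row is filled from the two closest rows (the second and third), while the distant fourth row is ignored.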

Andreas Mueller

2016-03-25 15:55:34 UTC

Post by Raghav R V
Hey Maniteja,
I took a look at your proposal. As I said before, I feel it is a bit
broad and you should try to narrow it down to a good theme.
Since you have chosen more than one PR related to missing values,
I have a suggestion for a theme -
"Better Missing Value Handling"
You could group the KNN imputation, matrix factorization with missing
values, and *outputting dummy one-hot encoded features for the imputer to
specify whether each feature value is imputed or not. Implementing these
properly and merging them should be sufficient for a GSoC, I feel. As an
optional thing, you could add another imputation strategy.
*I'll raise an issue so you understand that better.

Maniteja Nandana

2016-03-25 17:21:29 UTC

Hi Raghav,

Thanks a lot for the idea. I would be glad to work on it. Along with the
"output dummy one-hot encoder features for imputer to specify if the feature
value is imputed or not" item, would the idea of adding a "binary indicator
feature (for each possibly missing feature) that indicate feature
was imputed", as suggested here
<https://github.com/scikit-learn/scikit-learn/issues/6556>, be a
nice and easy addition?

Thanks,
Maniteja.
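
To make the "binary indicator feature" idea above concrete, here is a minimal NumPy sketch. It assumes mean imputation and assumes that every column has at least one observed value. The helper name `impute_with_indicator` is hypothetical and is not taken from scikit-learn or from the linked issue.

```python
import numpy as np


def impute_with_indicator(X):
    """Mean-impute NaNs and append one 0/1 column per feature that had NaNs.

    Hypothetical helper for illustration only; not scikit-learn API.
    Assumes every column contains at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Column means computed over the observed entries only.
    col_means = np.nanmean(X, axis=0)
    X_imputed = np.where(missing, col_means, X)

    # One indicator column for each feature that was missing somewhere.
    has_missing = missing.any(axis=0)
    indicators = missing[:, has_missing].astype(float)
    return np.hstack([X_imputed, indicators])


X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [5.0, 6.0]])
X_out = impute_with_indicator(X_demo)
```

Here the second feature is the only one with a missing value, so the output gains a single extra column that is 1 for the first row and 0 elsewhere. A downstream model can then tell an imputed value from an observed one.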

Post by Maniteja Nandana

+1

Raghav R V

2016-03-25 18:45:54 UTC

Yes! Exactly the same!

Maniteja Nandana

2016-03-29 19:24:04 UTC

Hi everyone,

Thanks for the inputs. I have created a wiki page here
<https://github.com/maniteja123/scikit-learn/wiki/Better-Missing-Value-Handling-in-scikit-learn>
for
the work planned on better handling of missing data, including
working on the stalled PRs on Matrix Factorization and KNN imputation, and also
on some additional features as suggested above. Please do have a look at it.
I would be really grateful for any input or suggestions on the
proposal; please also correct me in case I have missed something.

Thanks for your time.

Best regards,
Maniteja.

Maniteja Nandana

2016-04-06 12:52:59 UTC

Hi Andreas, Raghav and Jacob,

Thank you for your inputs. I have attached the links to the final draft of
the proposal. I would be really grateful for any other
suggestions and would be happy to incorporate them. Thanks for your time.

Wiki proposal
<https://github.com/scikit-learn/scikit-learn/wiki/%5BGSoc-2016%5D-Better-Missing-Value-Handling-in-scikit-learn>
PDF Proposal
<https://drive.google.com/file/d/0BzDDRCWPRL5Zd0FJVWlPX3FQVE0/view>

Regards,
Maniteja