Able to run at scale: handle larger datasets #2
Comments
I'm glad you have been able to resolve this temporarily for yourself.
While you think about it and share your thoughts, I will take a look at it and try to improve this aspect of the library. Thanks for nudging me about it.
@strivedi02 I'm working on an implementation to improve this via the branch https://github.com/neomatrix369/nlp_profiler/tree/scale-when-applied-to-larger-datasets. If you can test it out separately, that would be cool. Also have a look at this conversation for more context: https://www.kaggle.com/viratkothari/nlp-profiler-profiling-of-textual-dataset/comments#1015859
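For anyone wanting to try the branch, one possible install command is sketched below. This is an assumption about the setup (a standard pip and git toolchain, and the canonical repository living at github.com/neomatrix369/nlp_profiler), not a documented instruction from the thread:

```shell
# Hypothetical: install the library directly from the feature branch to test it.
# Assumes the repo is reachable at github.com/neomatrix369/nlp_profiler.
pip install git+https://github.com/neomatrix369/nlp_profiler.git@scale-when-applied-to-larger-datasets
```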
Some metrics gathered during implementation of this feature, comparing before and after the implementation:
@strivedi02 Can you please share your metrics for the above (#2 (comment))? Please provide info for every column you can.
Closed by PR #9
@neomatrix369 For me, on the scale-when-applied-to-larger-datasets branch, it takes 4 minutes 37 seconds (output of %%time).
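Outside a notebook, the same wall-clock measurement can be sketched with the standard library. This is a generic timing helper, not part of nlp_profiler's API:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds): a plain-Python
    stand-in for the notebook %%time magic."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: time any callable, e.g. a profiling run over a dataset.
total, seconds = timed(sum, range(1_000_000))
```

The same wrapper can be pointed at the library's profiling call to compare the main branch against the feature branch.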
@kurianbenoy Can you please provide the other before-and-after details, such as the commit ids of the branch you used to install the library? They should not be hard to find; if you look at the logs they should be there.
@neomatrix369 I was running this in Kaggle. The previous experiment with its associated time can be found here. I was probably using version 21 of your NLP Profiler Class notebook. The recent version can be found here. I hope it helps you find the exact version.
@strivedi02 @kurianbenoy 🙇 Thanks both for the references; the updated table of approximate speed-ups can be seen above.
@strivedi02 Thanks for raising the initial discussion in #1 and for the pointers about the different issues. This and other issues have been resolved (we still have pending ones, but that is fine) as a result of user/community feedback and interactions. With regard to the performance of the library, it's an ongoing effort to keep in mind, but adding new NLP features would usually take precedence over such issues.
@neomatrix369 I always struggled to keep all my scripts in one place, or I would have to remember which code was where; now, thanks to you, we won't have to remember all that. Through this package a lot of things will become easy, and I think it will keep growing in terms of usage by the community.
That's really good to know. Glad it helps everyone. It is also what I observed: everyone was using their own recipes; now you can share, contribute to, and extend a central recipe. @strivedi02 Does the library have most, if not all, of the things you use or would need when dealing with text? I think there is room for a lot more. Feel free to open issues/pull requests to extend the existing functionality with additional relevant features useful for NLP practitioners.
@loopyme I'll be happy to hear your feedback on the work done via this issue; please let me know how I can answer your questions and clarify any doubts. I have tried to build this library from the ground up, paying attention to cohesive modules and the structure of the library as a whole.
At the moment the library runs slowly and takes a long time to handle large datasets, due to the processing required per record. This could be optimised and improved in small steps so that it can handle larger datasets.
Opened on the back of discussions in #1. Partially related to #3, although independent of that issue.