are independent, each containing about 10 million rows on average. X is common and is used in joins with A, B, and C.
Steps:
Join A and X, do processing
Join B and X, do processing
Join C and X, do processing
As part of processing, I am doing calculations on columns.
Will multithreading be beneficial here?
No, you probably want multiprocessing.
That depends on the processing you need to do, since computers are fast. I'd recommend implementing the whole thing without any parallelization first and seeing how long it takes to process maybe 10k rows. There's a good chance everything is quick enough that you don't need anything else. Even if the process takes a few hours, if you don't have to run it regularly you'll probably save time by just letting it run and doing something else instead of optimizing your code. And should it end up too slow, you'll need the parts you've already written anyway.
If you do end up needing parallelization, as someone else pointed out, I think you'd need multiprocessing, not multithreading. Python has the Global Interpreter Lock (GIL), which in short means that only one thread can execute Python code at a time within one process (with one interpreter). Multiprocessing runs multiple interpreters, which can actually spread the work across multiple CPU cores in parallel.
There's a lot of nuance here. For example, the Python wiki says that some kinds of work that do not run in Python directly (like disk access or calculations in NumPy) are not affected by the GIL, and so can benefit from multithreading. Also, your storage or even the network may turn out to be the bottleneck that takes up the most time, in which case parallelization won't do much.
TLDR: Do a simple, non-parallelized implementation first. If it's too slow, you'll probably benefit more from multiprocessing than multithreading, but both can be worth a try.
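To make the "measure first" advice concrete, here is a minimal sketch that times a plain, single-process join on a 10k-row sample and extrapolates to 10 million rows. The data layout (rows as `(key, value)` tuples, X as a dict), the function name `join_and_process`, and the multiplication used as the "calculation" are all assumptions for illustration, not the asker's actual schema or processing.

```python
import time


def join_and_process(table, x_by_key):
    """Inner-join `table` with X on a shared key, then compute a new column.

    Hypothetical stand-in for the real join and column calculations.
    """
    out = []
    for key, value in table:
        x_value = x_by_key.get(key)
        if x_value is not None:  # inner join: keep only keys present in X
            out.append((key, value * x_value))  # placeholder calculation
    return out


# Hypothetical sample: 10k rows of (key, value); X maps key -> factor.
sample = [(i, float(i)) for i in range(10_000)]
x_by_key = {i: 2.0 for i in range(0, 10_000, 2)}  # only even keys exist in X

start = time.perf_counter()
result = join_and_process(sample, x_by_key)
elapsed = time.perf_counter() - start

# Linear extrapolation is only a rough guide; I/O and memory effects
# at full scale can change the picture.
total_rows = 10_000_000
estimate = elapsed * total_rows / len(sample)
print(f"{len(result)} joined rows in {elapsed:.4f}s; "
      f"rough estimate for {total_rows:,} rows: {estimate:.1f}s")
```

If the estimate is already acceptable, there may be no need for parallelization at all.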
Could you please share a tutorial on this?
Thanks for the detailed answer.
https://docs.python.org/3/library/multiprocessing.html
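As a starting point alongside the documentation, here is a minimal sketch of how the three independent joins could be handed to separate worker processes with `multiprocessing.Pool`. The table contents, the helper names `join_and_process` and `run_parallel`, and the sum used as the "processing" are hypothetical placeholders; real code would load and process the actual tables inside the worker.

```python
from multiprocessing import Pool


def join_and_process(job):
    """Join one table with X and reduce it to a single number.

    `job` is a (name, rows, x_by_key) tuple; the calculation is a
    hypothetical placeholder for the real column processing.
    """
    name, rows, x_by_key = job
    total = 0.0
    for key, value in rows:
        factor = x_by_key.get(key)
        if factor is not None:  # inner join on the shared key
            total += value * factor
    return name, total


def run_parallel(tables, x_by_key, workers=3):
    """Process each table in its own worker process."""
    jobs = [(name, rows, x_by_key) for name, rows in tables.items()]
    with Pool(processes=workers) as pool:
        return dict(pool.map(join_and_process, jobs))


# The guard is required so worker processes can import this module
# without re-running the pool setup.
if __name__ == "__main__":
    x_by_key = {k: 2.0 for k in range(100)}
    tables = {
        "A": [(k, 1.0) for k in range(100)],
        "B": [(k, 2.0) for k in range(50)],
        "C": [(k, 3.0) for k in range(200)],
    }
    print(run_parallel(tables, x_by_key))
```

Note that each job is pickled and copied to its worker, including X. With tables of around 10 million rows that copying can be costly, so it may be cheaper to pass file paths or query parameters and have each worker load its own data.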