Python multiprocessing.Pool slower than sequential execution

I'm trying to write a program that operates on a long list of elements (called training_set in the code example). Each row of the list contains two numbers that have to be looked up in another list called IDs: my program iterates over the rows of training_set and, for each row, finds the two corresponding numbers in IDs and then performs some more computation (not shown in the code).
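To make the lookup step concrete, here is a minimal sketch of finding the positions of both numbers of each row. It assumes a dictionary built once from IDs instead of the repeated list search used in the post; build_index, locate_pairs and the sample data are hypothetical names and values chosen for illustration only.

```python
def build_index(ids):
    """Map each ID to the position of its first occurrence in ids."""
    index = {}
    for position, identifier in enumerate(ids):
        # setdefault keeps the first position, matching what list.index returns
        index.setdefault(identifier, position)
    return index


def locate_pairs(pairs, index):
    """Return (source_position, target_position) for every row in pairs."""
    return [(index[source], index[target]) for source, target in pairs]


# Hypothetical sample data standing in for IDs and training_set
ids = ["1001", "1002", "1003"]
pairs = [["1003", "1001"], ["1002", "1002"]]

id_index = build_index(ids)
print(locate_pairs(pairs, id_index))
```

A dictionary lookup takes roughly constant time per ID, whereas list.index scans the list from the start on every call, so the difference may matter when IDs is long.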

With sequential execution, this takes about 300 s. Since each row of training_set is independent of the others, I thought of parallelizing the computation by splitting the input into as many batches as there are CPU cores and using multiprocessing.Pool.
However, the parallelized version is slower than the sequential one.

import copy
import csv
import multiprocessing
from multiprocessing import Pool

num_procs = int(multiprocessing.cpu_count())

with open("training_set.txt", "r") as f:
    reader = csv.reader(f)
    training_set = list(reader)
training_set = [element[0].split(" ") for element in training_set]

with open("node_information.csv", "r") as f:
    reader = csv.reader(f)
    node_info = list(reader)
IDs = [element[0] for element in node_info]

batch_size = int(len(training_set) / num_procs)
inputs = []

# split list into batches to feed to the different worker processes
for i in range(num_procs):
    if i == (num_procs - 1):
        # the last batch takes every remaining row, including the final one
        inputs.append(list(training_set[int(i * batch_size):]))
    else:
        inputs.append(list(training_set[int(i * batch_size): int((i + 1) * batch_size)]))

def init(IDs):
    # runs once in each worker process and stores its own copy of the ID list
    global identities
    identities = copy.deepcopy(IDs)

def analyze_pairs(partialList):
    pairsSet = copy.deepcopy(partialList)
    for i in range(len(pairsSet)):
        source = pairsSet[i][0]  # an ID of edges
        target = pairsSet[i][1]  # an ID of edges
        # find the indices matching the source and target IDs
        index_source = identities.index(source)
        index_target = identities.index(target)
        # ... additional computation (not shown) ...

if __name__ == '__main__':
    pool = Pool(num_procs, initializer=init, initargs=(IDs,))
    training_features = pool.map(analyze_pairs, inputs)

I'm not showing the rest of the loop body (at the end of analyze_pairs()) because the problem persists even if I remove that code, so the problem does not lie there.
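For reference, the batching step can be sketched as a small helper that covers every row, including leftover rows when the list length is not a multiple of the number of batches. split_batches is a hypothetical name, not part of the posted code, and the example list is made up.

```python
def split_batches(items, num_batches):
    """Split items into num_batches contiguous slices of near-equal size."""
    base, extra = divmod(len(items), num_batches)
    batches = []
    start = 0
    for i in range(num_batches):
        # the first `extra` batches get one additional element each
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


# Hypothetical example: 10 rows over 4 batches
example_batches = split_batches(list(range(10)), 4)
print(example_batches)
```

Concatenating the batches in order gives back the original list, which is an easy check that no row was dropped.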

I know that there are already many questions on this topic, but I couldn't find a solution for my case.
I don't think that parallelism introduces more overhead than speedup here, because the input of each worker is large (on an 8-thread CPU, each worker should take at least 35 s) and there is no explicit message passing. I also tried using copy.deepcopy to make sure that each worker operates on a separate list (although that shouldn't be a problem, since each worker only reads from the list), but it didn't help.
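Even without explicit message passing, Pool.map pickles each batch to send it to a worker process, so some transfer cost is always present. The sketch below is one way to estimate that cost for a single batch; measure_pickle_cost and the synthetic sample_batch are hypothetical and only stand in for a real slice of training_set.

```python
import pickle
import time


def measure_pickle_cost(batch):
    """Return (payload size in bytes, round-trip seconds, restored copy)."""
    start = time.perf_counter()
    payload = pickle.dumps(batch)      # what the parent does before sending
    restored = pickle.loads(payload)   # what the worker does on receipt
    elapsed = time.perf_counter() - start
    return len(payload), elapsed, restored


# Synthetic batch of string pairs, shaped like rows of training_set
sample_batch = [[str(i), str(i + 1)] for i in range(100_000)]

size_bytes, seconds, restored = measure_pickle_cost(sample_batch)
print(f"{size_bytes} bytes, {seconds:.4f} s round trip")
```

If this round trip turns out to be small compared with the per-batch runtime, the transfer is probably not the bottleneck and the time is being spent elsewhere.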

What could the problem be? Thanks in advance.

asked Nov 25 '18 at 22:16

gnigni

you might be loading the files multiple times which could take a long time; maybe put the file loading code into the __main__?

– Sam Mason
Nov 25 '18 at 23:27

@SamMason i tried but it didn't help, anyway thank you for your answer ;)

– gnigni
Nov 25 '18 at 23:41

my guess is still around file loading time; have you tried adding print statements at various points in the code to figure out where your time is going?

– Sam Mason
Nov 26 '18 at 21:51


python multiprocessing threadpool

