Spark - combine filter results from all executors



























I have 3 executors in my Spark Streaming job, which consumes from Kafka; the executor count depends on the number of partitions in the topic. When a message is consumed from the topic, I start a query on Hazelcast. Every executor runs the same filtering operation on Hazelcast and returns duplicated results, because the data statuses are not yet updated when one executor returns the data, so another executor finds the same data.



My question is: is there a way to combine the results found by all executors into a single list during streaming?
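For context, a minimal sketch of the setup described above, assuming spark-streaming-kafka-0-10 and a Hazelcast 3.x client; the topic name, map name, broker address, and the `status` attribute are illustrative assumptions, not the actual code:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicates;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.*;

public class HazelcastFilterJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("hazelcast-filter-job");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "hazelcast-filter-job");

        // 3 topic partitions -> 3 parallel tasks, one per executor
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("events"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
            // Each partition/executor opens its own Hazelcast client and runs the same query,
            // so the same still-unprocessed entries can be returned by several executors at once.
            HazelcastInstance client = HazelcastClient.newHazelcastClient();
            IMap<String, Object> data = client.getMap("data"); // assumed map name
            while (records.hasNext()) {
                records.next();
                Collection<Object> matches = data.values(Predicates.equal("status", "PENDING"));
                // process matches; statuses are not updated yet, hence the duplicates
            }
            client.shutdown();
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```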










java apache-spark hazelcast






asked Nov 23 '18 at 9:00 by masay
edited Nov 23 '18 at 11:21













  • Use accumulators... please share your code.

    – Taha Naqvi
    Nov 23 '18 at 10:23













  • Thanks for your comment. I have added more detail to my question. Accumulators are still on the table and I am reading about them.

    – masay
    Nov 23 '18 at 11:22



















2 Answers
Spark executors are distributed across the cluster, so deduplicating data across the cluster is difficult. You have the following options:




  1. Use accumulators. The problem here is that accumulators are not consistent while the job is running, so you may end up reading stale data (see the sketch after this answer).

  2. Offload this work to an external system: store your output in some external storage that can deduplicate it (probably HBase). The efficiency of that storage system becomes the key factor here.


I hope this helps.

answered Nov 28 '18 at 6:47 by Harjeet Kumar
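A minimal sketch of option 1, reusing `jssc`, `stream`, and the Hazelcast names from the sketch in the question; the accumulator name and predicate are assumptions for illustration. Note the caveat from the answer still applies: task retries can add the same entry twice, so the driver-side list may still need deduplication.

```java
// Sketch of option 1: gather every executor's matches into one list on the driver
// via a CollectionAccumulator (org.apache.spark.util.CollectionAccumulator).
CollectionAccumulator<String> matched =
        jssc.sparkContext().sc().collectionAccumulator("matchedEntries");

stream.foreachRDD(rdd -> {
    rdd.foreachPartition(records -> {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, Object> data = client.getMap("data");
        while (records.hasNext()) {
            records.next();
            for (Object match : data.values(Predicates.equal("status", "PENDING"))) {
                matched.add(match.toString()); // each executor adds its own matches
            }
        }
        client.shutdown();
    });

    // Back on the driver, after the action above has finished:
    // one combined list with the matches from all executors for this batch.
    List<String> combined = matched.value();
    // deduplicate / process `combined` here, then clear it for the next batch
    matched.reset();
});
```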






To avoid reading duplicate data, you need to maintain the offsets somewhere, preferably in HBase. Every time you consume data from Kafka, read the stored offsets from HBase first, check which offsets have already been consumed for each topic, and then start reading and writing from that point. After each successful write, you must update the stored offsets.



Do you think that would solve the issue?

answered Nov 29 '18 at 14:24 by H Roy
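A minimal sketch of that pattern with the spark-streaming-kafka-0-10 integration (`org.apache.spark.streaming.kafka010`), reusing `jssc` and `kafkaParams` from the sketch in the question; `loadOffsetsFromHBase` and `saveOffsetsToHBase` are hypothetical helpers standing in for whatever storage you use:

```java
// Manual offset management: resume from stored offsets, commit them only after
// the batch has been written successfully (helpers below are hypothetical).
Map<TopicPartition, Long> fromOffsets = loadOffsetsFromHBase("events");

JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Assign(fromOffsets.keySet(), kafkaParams, fromOffsets));

stream.foreachRDD(rdd -> {
    // Offset ranges are only available on the RDD produced by the direct stream.
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    rdd.foreachPartition(records -> {
        // ... filter against Hazelcast and write the results out ...
    });

    // Persist the new offsets only after the batch succeeds,
    // so a restart resumes from the last fully processed position.
    saveOffsetsToHBase("events", ranges);
});
```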





