Stream processing architecture












I am designing a system in which there is a main stream of objects and multiple workers, each of which produces some result from an object. Finally, there is a special, unique worker (a "sink", in graph-theory terms) that takes all the results and processes them into a final object, which is written to a DB.

A worker may depend on the results of other workers (and hence wait for them).

I'm facing several problems:

  1. One worker may be much slower than another. How do you deal with that? By adding more workers of the slower type (i.e. scaling), perhaps dynamically?

  2. Suppose W_B depends on W_A. If W_B is down for some reason, the flow stops and the whole system stops working. I'd like the system to bypass this worker somehow.

  3. How does the final worker decide when to operate on the set of results? Suppose it has the results of A and B but lacks the result of C. C may be down, or it may just be very slow at the moment. How can the final worker decide?

It is worth mentioning that this is not a real-time application but rather an offline processing system (i.e. you may access the DB and alter a record); at the same time, it has to deal with a relatively large number of objects at a high pace.

Regarding technologies: I'm developing the system in Java, but I'm not bound to a specific technology.

I'd be glad if you could help me with the general design of the system.

Thanks a lot!







java bigdata system-design stream-processing event-stream-processing







asked Nov 22 '18 at 8:10 by yaseco













  • What you need to do depends on your use case. You either need certain components or you don't. It's pretty rare for there to be "nice to have" components. Is there any advantage to having less than your full set of processes, some of the time?

    – Peter Lawrey
    Nov 22 '18 at 8:18













  • You probably have a very specific (possibly company secret) problem, and you write it down in general terms. However as outsiders we cannot fully grasp your problem from these general terms because it's too ambiguous (which is not obvious to you since you're in it knee-deep). Maybe translate it into a concrete problem which we can work with?

    – Mark Jeronimus
    Nov 22 '18 at 8:39






















2 Answers
As Peter said, it really depends on the use case. Some general remarks, though:

  1. If one worker is slower than the others, consider creating more instances of that type; e.g. Kubernetes can scale the number of running instances dynamically, and Kafka lets you partition a topic so that more than one instance can read from it and process it.

  2. If B depends on A and A is down, B can't work, and that's it. Maybe restart A? You could run a regular health check on it.

  3. If the final worker needs the results of A, B and C, how would it proceed without C being available? If it can, it can store the results of A and B and start a timer; if the timer goes off before C's result has arrived, it continues without it.
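The timer idea in point 3 could be sketched roughly as follows. This is only a minimal illustration, assuming each worker's result is exposed to the sink as a `CompletableFuture<String>`; the class name `TimedSink` and the method `collect` are hypothetical and not part of any framework. A real system would more likely keep the partial results in durable storage than in memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sink helper: waits for worker results up to a shared deadline. */
final class TimedSink {

    /**
     * Returns the results that arrived before the deadline, keyed by worker name.
     * Workers that are too slow or that failed are simply absent from the map,
     * so the caller can decide whether a partial set is good enough.
     */
    static Map<String, String> collect(Map<String, CompletableFuture<String>> pending,
                                       long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        Map<String, String> arrived = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<String>> entry : pending.entrySet()) {
            // All workers share one deadline, so each wait only gets the time that is left.
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                arrived.put(entry.getKey(), entry.getValue().get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException e) {
                // Assumption: a slow or failed worker is skipped rather than retried.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                break;
            }
        }
        return arrived;
    }
}
```

The sink would call `TimedSink.collect(...)` once per input object and then build the final record from whichever results are present; because this is an offline system, a late result for C could still be merged into the DB record afterwards.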








answered Nov 22 '18 at 8:33 by daniu













  • Thanks for those remarks. Can you elaborate on how to dynamically adjust the number of workers (of some type) with respect to their performance (= rate of processing)?

    – yaseco
    Nov 22 '18 at 9:56



















Some additional thoughts:

  1. If you mean that some subtasks of the overall application are quicker to execute than others, it can be a good idea to slice up the application so that each worker does a bit of everything: a share of the quick work and a share of the slow work. If instead you mean that some machines are slower than others, you could run fewer workers on the slow machines and more on the fast ones, so that each worker has roughly the same resources.

  2. You might want to decouple your architecture with some sort of durable queueing between the workers.

  3. It's common to use heartbeats with timeouts and restarts.

Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top of a stream processing framework that provides high availability and exactly-once semantics out of the box.
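To make the heartbeat-with-timeout idea from point 3 a little more concrete, here is a minimal in-memory sketch. The class `HeartbeatMonitor` and its methods are hypothetical names chosen for illustration, and timestamps are passed in explicitly so the logic is easy to test; a framework or orchestrator would normally provide this for you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical heartbeat registry: workers report in, a supervisor asks who went silent. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that the given worker was alive at the given time. */
    void beat(String workerId, long nowMillis) {
        lastSeen.put(workerId, nowMillis);
    }

    /** Returns the workers whose last heartbeat is older than the timeout. */
    List<String> expired(long nowMillis) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                silent.add(entry.getKey()); // candidate for a restart
            }
        }
        return silent;
    }
}
```

A supervisor thread could call `expired(System.currentTimeMillis())` periodically and restart whatever it returns; how a restart is actually performed depends on the deployment and is left out here.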







answered Nov 24 '18 at 10:18 by David Anderson





























