Distributed Task Scheduling with Akka, Kafka, Cassandra


(David van Geest) #1

(Kumar) #2

Hi thanks sharing, any sample application available.

(David van Geest) #3

Hi Kumar, as we discussed in this GitHub issue, we do not currently have a sample application available, sorry!

(Gatikrushna Sahu) #4

Hi David
Vary nice topic.I have one query if database has 1 million row how scheduler is filtering efficiently and how many instance of scheduler is running.

(David van Geest) #5

Hi Gatikrushna,

Thanks for your questions.

The Cassandra database uses a wide row format, meaning that as tasks are persisted that are appended to the end of a row, ordered in time. These rows are partitioned in two different ways. First, we use the Kafka partition number of the task to break down the rows. Secondly, we each row only holds an hour’s worth of tasks - when the hour changes, a new row is started. These two partitioning methods prevent the rows from becoming too large, even in the face of millions of tasks. You can see the row key construction here

For a given service, many instances of scheduler may be running. One of the services that uses it is our on-call handoff notification service. There may be anywhere between 3 and 8 instances of this service (and thus 3 to 8 instances of scheduler) running at the same time. Since each instance is assigned Kafka partitions to work (via Kafka consumer groups), each instance will work different tasks. Dynamic partition assignment via consumer groups allows us to add or remove service/scheduler instances at will.

Depending on the nature of your tasks (how long they take to run, how CPU-intensive they are, etc.), you may need to have many or few instances of scheduler. Essentially this is a decision best left up to the user of the library.