Simple Algorithm For Scheduling List Of Tasks
Scheduling necessity in the operational divisions of the current IT
infrastructure scales from single task execution to complicated workflow
management platforms including BPM and SOA enabled applications.
Some of the approaches miss the target by focusing on impossible or
unnecessary giant sized integration projects usually never ends or ends with
few percent of the goals achieved because of disregarding the building blocks
of the structural automation approach which has to include even the single
batch or crone enabled task. This is something like building space shuttle
while still not even having highways between your big cities. At the end you
may have it, but it will not solve your problems.
In this paper, a simple but powerful approach is described that can
be used in job management applications.
Description
Nowadays, especially the big data fashion polishes the very old
story of the grid enabled scheduling and workload management with resource
optimization on big data processing through Hadoop.
At the beginning, mass data processing was only the difficulty of
the scientists but for today, it became the subject of the business life as the
end user data including social media grows exponentially.
There are popular algorithms implemented in hadoop implementations
to process the data in valuable time. Yarn documentation and Tez implementation
documentations are great resources for the information and also the most common
scheduling algorithms are discussed in details in Workflow
Scheduling Algorithms for Grid Computing [1].
Our solution is combination of Heuristics [2] based List scheduling
[[1], Figure 5.2] with some exceptions and simplifications. This algorithm is implemented in TlosLite
1.7.x [4]
There are two types of AWM, Abstract Workflow
Model, deterministic and non-deterministic [1] and we will go
through deterministic AWM.
The problem rose while researching a solution for a real life
problem, the ETL process management. Because of the native structure of ETL
process, we have to deal with really big amount of jobs in number having
complex dependencies. Most of these kind of jobs can be defined as operating
system level scripts like batch or shell.
Lets simplify and define the problem;
“To execute the group of
tasks at the defined execution time regarding the dependencies if existed”
Solution
The group may contain discrete tasks or DAG, Directed Acyclic Graph groups. The
algorithm will work on group, that are conceptually the part of the same
business logic. There is no priority consideration at all. The only restriction
beside time and dependency for the concurrent execution is the upper and lower
pool thresholds values defined for the group. No other performance or resource
optimization is regarded during the development of the algorithm.
Here is the definition of the AWM
[1]:
Let Γ be the finite set of tasks Ti(1 ≤ i ≤ n).
Let Λ be the set of directed edges. Each edge is denoted by (Ti, Tj ),
corresponding to the data communication between task Ti and Tj , where Ti is called an
immediate parent task of Tj , and Tj the immediate child task of Ti. We assume that a child task cannot
be executed until all of its parent tasks are completed. Then, the workflow
application can be described as a tuple Ω(Γ, Λ).
To implement our own algorithm, we will make small changes;
Let
Γ be the finite set of tasks Ti(1 ≤ i ≤ n). Let Λ be the set of directed edges which
can have 0 member. Each edge is denoted by (Ti, Tj ), corresponding to the dependency
between task Ti
and Tj,
where Ti is
called an immediate parent task of Tj , and Tj the immediate child task of Ti. We assume
that a child task cannot be executed until all of its parent tasks are
completed. Then, the workflow application can be described as a tuple Ω(Γ, Λ).
The major difference in our approach is that we assume there may be
individual tasks that are not member of any DAG, and there may be many of DAGs
in our sample group.
Assumptions :
• Tasks in Γ has the same
execution period, daily execution scenario selected for our case
•
Λ may be empty set
•
All tasks
in Γ has to be executed and
completed successfully to be iterated for the next execution time.
According to our algorithm, the tasks are loaded
in to memory map ant the iteration base execution mechanism is started after
loading.
Lets check the following Algorithm 4 [1];
Algorithm 4[1] Myopic scheduling algorithm [3]
1: while ∃t ∈ Γ is not completed do
2: task ← get a ready task whose parent tasks have been completed
3: r ← for t ∈ task, get a resource which can complete t at the
earliest time
4: schedule t on r
5: end while
This algorithm assumes that all the tasks in list are in same DAG
and the resource availability should be considered.
Removing the resource dependency and the DAG restriction we may
define our own;
Algorithm 1
1: while ∃t ∈ Γ is not completed do
2: if not over concurrency threshold
3: task ← get a ready
task whose execution time is reached and parent tasks – if any - have been
completed
4: execute t
end if
5: end while
The iteration goes through the list and after the end of the list is
reached, before re-track, waits for the predefined delay time to be elapsed. If
all the tasks are completed successfully re-scheduling procedure sets the new
execution times of the tasks in group. If any of the tasks fails, manual
operation is required to decide what to do for the next step.
Gap Analyses
As we call it simple, this approach has some pitfalls that should be
enhanced for the future releases.
• The major difficulty of the algorithm is, it has only single
execution interval for the whole group. It is better to define a solution for
varying interval support.
• All the jobs in the group has to be completed successfully to
schedule the group for next time but it is better to resolve the dependency
graphs and manage the graphs saperately
• As there is no priority or sort consideration, the iteration goes
over a queue, which is constructed randomly.
Conclusion
Scheduling is the core of the workload automation, and it is not a
trivial problem. This approach maybe just an entry point for more generic and
complete solutions.
We hope that our proposal above helps the developers and researchers
to design a system for managing a group of tasks that has to be executed
according to dependencies in simple and trivial way.
References
[1]Jia Yu, Rajkumar Buyya and Kotagiri Ramamohanarao, Workflow
Scheduling Algorithms for Grid Computing
http://www.cloudbus.org/papers/MHS-Springer-Jia2008.pdf
[2] Y.
K. Kwok and I.Ahmad, Static Scheduling Algorithms for Allocating Directed Task
Graphs to Multiprocessors, ACM Computing Surveys, 31(4):406-471, Dec. 1999.
[3] M. Wieczorek, R. Prodan, and T. Fahringer, Scheduling of Scientific Workflows
in the ASKALON Grid Enviornment, ACM SIGMOD Record,
34(3):56-62, Sept.
2005.
Comments
Post a Comment