[xray] raylet scheduling mechanism with a simple spillback policy by atumanov · Pull Request #2749 · ray-project/ray · GitHub

[xray] raylet scheduling mechanism with a simple spillback policy #2749


Merged 44 commits on Aug 28, 2018

Commits
c28452d
raylet scheduling mechanism, load accounting/distribution, policy
atumanov Aug 16, 2018
abe0346
fix upstream/master rebase
atumanov Aug 16, 2018
1af4188
invoke scheduling policy on local resources prior to dispatch when th…
atumanov Aug 17, 2018
2e424bb
addressing comments
atumanov Aug 18, 2018
c719949
schedule before dispatch
atumanov Aug 18, 2018
ad2206e
enable all ActorsWithGPUs tests
atumanov Aug 20, 2018
f0683d0
remove excessive loggingj
atumanov Aug 20, 2018
ac51981
addressing comments: minor resource set API refactor
atumanov Aug 20, 2018
5b6bba1
Call DispatchTasks after ScheduleTasks.
robertnishihara Aug 20, 2018
d3d0ca9
Another DispatchTasks after ScheduleTasks.
robertnishihara Aug 20, 2018
dac1ad8
addressing comments
atumanov Aug 20, 2018
e9b4202
monitor heartbeat handler updated to decode clientid from binary form…
atumanov Aug 21, 2018
4f82d6a
making load balancing tasks with dependencies sleep for 10ms
atumanov Aug 21, 2018
96fc26f
Make tests slightly more verbose so we know which test is hanging.
robertnishihara Aug 21, 2018
afa4a88
Fix linting.
robertnishihara Aug 21, 2018
0a46231
Dispatch tasks when new worker becomes available.
robertnishihara Aug 21, 2018
7bdf6c7
Add code for converting scheduling queues to human-readable string.
robertnishihara Aug 21, 2018
cd4a3bd
Remove outdated Java test UIDTest.
robertnishihara Aug 21, 2018
c3e70f6
Fix linting.
robertnishihara Aug 21, 2018
3610946
handle placeable -> waiting with remaining placeable
atumanov Aug 22, 2018
6812180
linting
atumanov Aug 22, 2018
053c5e5
scheduling all placeable tasks; modified heartbeat to spillover ready
atumanov Aug 22, 2018
0adb158
eliminate do/while around scheduling policy invocation
atumanov Aug 23, 2018
df6d71d
remove some calls to the scheduling policy
atumanov Aug 23, 2018
b40c7f9
don't spill over actor tasks from the ready queue
atumanov Aug 23, 2018
fc2c788
handle infeasible tasks
atumanov Aug 23, 2018
d5b8df9
handle infeasible tasks
atumanov Aug 23, 2018
3be0c65
move infeasible tasks from placeable to infeasible
atumanov Aug 23, 2018
6bd29d2
hastask should check infeasible
atumanov Aug 24, 2018
ab7f67a
Dispatch tasks if we fail to write assign a task to a worker.
robertnishihara Aug 24, 2018
d071572
check dependency manager invariant
atumanov Aug 24, 2018
9f5efa3
Fix, unsubscribe from dependencies when forwarding in spillback.
robertnishihara Aug 27, 2018
a830719
Linting and remove extra debug statements.
robertnishihara Aug 27, 2018
0033be0
Remove check that fails during unit tests.
robertnishihara Aug 27, 2018
afc4940
Linting.
robertnishihara Aug 27, 2018
290aebf
Doc and linting.
robertnishihara Aug 27, 2018
430d389
check for hard constraints when spilling ready tasks
atumanov Aug 27, 2018
c6134b3
Merge branch 'raylet-scheduling-simple' of github.com:atumanov/ray in…
robertnishihara Aug 27, 2018
1ab3fad
Linting
robertnishihara Aug 27, 2018
faa279b
Skip testActorMultipleGPUsFromMultipleTasks in actor_test.py.
robertnishihara Aug 27, 2018
d8078b6
Remove multinode test_driver_put_errors because it doesn't actually g…
robertnishihara Aug 28, 2018
26ae1ae
Minor improvements
robertnishihara Aug 28, 2018
aec71d5
Update ascii art scheduling state diagram.
robertnishihara Aug 28, 2018
079c723
Update doc
robertnishihara Aug 28, 2018
108 changes: 54 additions & 54 deletions .travis.yml
@@ -125,46 +125,46 @@ matrix:
# module is only found if the test directory is in the PYTHONPATH.
- export PYTHONPATH="$PYTHONPATH:./test/"

- python -m pytest python/ray/common/test/test.py
- python -m pytest python/ray/common/redis_module/runtest.py
- python -m pytest python/ray/plasma/test/test.py
# - python -m pytest python/ray/local_scheduler/test/test.py
# - python -m pytest python/ray/global_scheduler/test/test.py
- python -m pytest -v python/ray/common/test/test.py
- python -m pytest -v python/ray/common/redis_module/runtest.py
- python -m pytest -v python/ray/plasma/test/test.py
# - python -m pytest -v python/ray/local_scheduler/test/test.py
# - python -m pytest -v python/ray/global_scheduler/test/test.py

- python -m pytest python/ray/test/test_queue.py
- python -m pytest test/xray_test.py
- python -m pytest -v python/ray/test/test_queue.py
- python -m pytest -v test/xray_test.py

# The --assert=plain here is because pytest's assertion
# rewriting mechanism seems to mess up on this file,
# see https://github.com/ray-project/ray/issues/2514
- python -m pytest -v --assert=plain test/runtest.py
- python -m pytest test/array_test.py
- python -m pytest test/actor_test.py
- python -m pytest test/autoscaler_test.py
- python -m pytest test/tensorflow_test.py
- python -m pytest test/failure_test.py
- python -m pytest test/microbenchmarks.py
- python -m pytest test/stress_tests.py
- python -m pytest -v test/array_test.py
- python -m pytest -v test/actor_test.py
- python -m pytest -v test/autoscaler_test.py
- python -m pytest -v test/tensorflow_test.py
- python -m pytest -v test/failure_test.py
- python -m pytest -v test/microbenchmarks.py
- python -m pytest -v test/stress_tests.py
- pytest test/component_failures_test.py
- python test/multi_node_test.py
- python -m pytest test/recursion_test.py
- python -m pytest -v test/recursion_test.py
- pytest test/monitor_test.py
- python -m pytest test/cython_test.py
- python -m pytest test/credis_test.py
- python -m pytest -v test/cython_test.py
- python -m pytest -v test/credis_test.py

# ray tune tests
- python python/ray/tune/test/dependency_test.py
- python -m pytest python/ray/tune/test/trial_runner_test.py
- python -m pytest python/ray/tune/test/trial_scheduler_test.py
- python -m pytest python/ray/tune/test/experiment_test.py
- python -m pytest python/ray/tune/test/tune_server_test.py
- python -m pytest python/ray/tune/test/ray_trial_executor_test.py
- python -m pytest -v python/ray/tune/test/trial_runner_test.py
- python -m pytest -v python/ray/tune/test/trial_scheduler_test.py
- python -m pytest -v python/ray/tune/test/experiment_test.py
- python -m pytest -v python/ray/tune/test/tune_server_test.py
- python -m pytest -v python/ray/tune/test/ray_trial_executor_test.py

# ray rllib tests
- python -m pytest python/ray/rllib/test/test_catalog.py
- python -m pytest python/ray/rllib/test/test_filters.py
- python -m pytest python/ray/rllib/test/test_optimizers.py
- python -m pytest python/ray/rllib/test/test_evaluators.py
- python -m pytest -v python/ray/rllib/test/test_catalog.py
- python -m pytest -v python/ray/rllib/test/test_filters.py
- python -m pytest -v python/ray/rllib/test/test_optimizers.py
- python -m pytest -v python/ray/rllib/test/test_evaluators.py


install:
@@ -197,46 +197,46 @@ script:
# module is only found if the test directory is in the PYTHONPATH.
- export PYTHONPATH="$PYTHONPATH:./test/"

- python -m pytest python/ray/common/test/test.py
- python -m pytest python/ray/common/redis_module/runtest.py
- python -m pytest python/ray/plasma/test/test.py
- python -m pytest python/ray/local_scheduler/test/test.py
- python -m pytest python/ray/global_scheduler/test/test.py
- python -m pytest -v python/ray/common/test/test.py
- python -m pytest -v python/ray/common/redis_module/runtest.py
- python -m pytest -v python/ray/plasma/test/test.py
- python -m pytest -v python/ray/local_scheduler/test/test.py
- python -m pytest -v python/ray/global_scheduler/test/test.py

- python -m pytest python/ray/test/test_queue.py
- python -m pytest test/xray_test.py
- python -m pytest -v python/ray/test/test_queue.py
- python -m pytest -v test/xray_test.py

# The --assert=plain here is because pytest's assertion
# rewriting mechanism seems to mess up on this file,
# see https://github.com/ray-project/ray/issues/2514
- python -m pytest --assert=plain -v test/runtest.py
- python -m pytest test/array_test.py
- python -m pytest test/actor_test.py
- python -m pytest test/autoscaler_test.py
- python -m pytest test/tensorflow_test.py
- python -m pytest test/failure_test.py
- python -m pytest test/microbenchmarks.py
- python -m pytest test/stress_tests.py
- python -m pytest test/component_failures_test.py
- python -m pytest -v test/array_test.py
- python -m pytest -v test/actor_test.py
- python -m pytest -v test/autoscaler_test.py
- python -m pytest -v test/tensorflow_test.py
- python -m pytest -v test/failure_test.py
- python -m pytest -v test/microbenchmarks.py
- python -m pytest -v test/stress_tests.py
- python -m pytest -v test/component_failures_test.py
- python test/multi_node_test.py
- python -m pytest test/recursion_test.py
- python -m pytest test/monitor_test.py
- python -m pytest test/cython_test.py
- python -m pytest test/credis_test.py
- python -m pytest -v test/recursion_test.py
- python -m pytest -v test/monitor_test.py
- python -m pytest -v test/cython_test.py
- python -m pytest -v test/credis_test.py

# ray tune tests
- python python/ray/tune/test/dependency_test.py
- python -m pytest python/ray/tune/test/trial_runner_test.py
- python -m pytest python/ray/tune/test/trial_scheduler_test.py
- python -m pytest python/ray/tune/test/experiment_test.py
- python -m pytest python/ray/tune/test/tune_server_test.py
- python -m pytest python/ray/tune/test/ray_trial_executor_test.py
- python -m pytest -v python/ray/tune/test/trial_runner_test.py
- python -m pytest -v python/ray/tune/test/trial_scheduler_test.py
- python -m pytest -v python/ray/tune/test/experiment_test.py
- python -m pytest -v python/ray/tune/test/tune_server_test.py
- python -m pytest -v python/ray/tune/test/ray_trial_executor_test.py

# ray rllib tests
- python -m pytest python/ray/rllib/test/test_catalog.py
- python -m pytest python/ray/rllib/test/test_filters.py
- python -m pytest python/ray/rllib/test/test_optimizers.py
- python -m pytest python/ray/rllib/test/test_evaluators.py
- python -m pytest -v python/ray/rllib/test/test_catalog.py
- python -m pytest -v python/ray/rllib/test/test_filters.py
- python -m pytest -v python/ray/rllib/test/test_optimizers.py
- python -m pytest -v python/ray/rllib/test/test_evaluators.py

deploy:
- provider: s3
39 changes: 0 additions & 39 deletions java/test/src/main/java/org/ray/api/test/UIdTest.java

This file was deleted.

2 changes: 1 addition & 1 deletion python/ray/monitor.py
@@ -337,7 +337,7 @@ def xray_heartbeat_handler(self, unused_channel, data):
static_resources[static] = message.ResourcesTotalCapacity(i)

# Update the load metrics for this local scheduler.
client_id = message.ClientId().decode("utf-8")
client_id = ray.utils.binary_to_hex(message.ClientId())
ip = self.local_scheduler_id_to_ip_map.get(client_id)
if ip:
self.load_metrics.update(ip, static_resources, dynamic_resources)
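
The hunk above swaps a UTF-8 decode of the client ID for a hex conversion. A minimal sketch of why that matters, assuming the client ID is an arbitrary byte string: raw bytes are not guaranteed to be valid UTF-8, whereas a hex encoding always succeeds and gives a stable string key. The helper `binary_to_hex_sketch` is a hypothetical stand-in built on the standard library's `binascii.hexlify`; the body of `ray.utils.binary_to_hex` is not shown in this diff, and the 20-byte ID below is made up.

```python
import binascii


def binary_to_hex_sketch(identifier: bytes) -> str:
    """Hypothetical stand-in for ray.utils.binary_to_hex: bytes -> hex string."""
    return binascii.hexlify(identifier).decode("ascii")


# A made-up 20-byte client ID that starts with bytes that are invalid in UTF-8.
client_id = bytes([0xFF, 0xFE]) + bytes(range(18))

# Decoding arbitrary bytes as UTF-8 can fail.
try:
    client_id.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Hex conversion works for any byte string and doubles the length.
hex_id = binary_to_hex_sketch(client_id)
print(decoded_ok, hex_id)
```

If the keys of `local_scheduler_id_to_ip_map` are hex strings (an assumption; the map's construction is outside this hunk), the hex form is also what the `.get(client_id)` lookup needs.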
3 changes: 3 additions & 0 deletions src/ray/gcs/format/gcs.fbs
@@ -194,6 +194,9 @@ table HeartbeatTableData {
// Total resource capacity configured for this node manager.
resources_total_label: [string];
resources_total_capacity: [double];
// Aggregate outstanding resource load on this node manager.
resource_load_label: [string];
resource_load_capacity: [double];
}

// Data for a lease on task execution.
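
The heartbeat table stores each resource map as two parallel arrays, one of labels and one of quantities. A rough sketch of how a consumer might pair them up, assuming the i-th label corresponds to the i-th quantity (the `monitor.py` hunk reads `ResourcesTotalCapacity(i)` this way; that the new `resource_load_*` fields follow the same convention is an assumption). The function name and the sample values are hypothetical.

```python
from typing import Dict, List


def labels_to_dict(labels: List[str], quantities: List[float]) -> Dict[str, float]:
    """Zip parallel label/quantity arrays into a resource map."""
    if len(labels) != len(quantities):
        raise ValueError("label and quantity arrays must have the same length")
    return dict(zip(labels, quantities))


# Made-up heartbeat contents for one node manager.
resources_total = labels_to_dict(["CPU", "GPU"], [4.0, 1.0])
resource_load = labels_to_dict(["CPU", "GPU"], [6.0, 1.0])

# Resources whose outstanding load exceeds the node's total capacity.
overloaded = sorted(
    name for name, load in resource_load.items()
    if load > resources_total.get(name, 0.0)
)
print(overloaded)
```

A comparison like this is one way a load-aware policy could decide that a node has more queued demand than it can serve; the policy actually used in this PR is not part of the hunks shown here.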
52 changes: 32 additions & 20 deletions src/ray/raylet/design_docs/task_states.rst
@@ -3,16 +3,16 @@ Task State: Definitions & Transition Diagram

A task can be in one of the following states:

- **Placeable**: the task is ready to be placed at the node where is going to be
executed. This can be either local or a remote node. The decision is based on
resource availability (the location and size of the task's arguments are
ignore). If the local node has enough resources to satisfy task's demand, then
the task is placed locally, otherwise is forwarded to another node.
- **Placeable**: the task is ready to be assigned to a node (either a local or a
remote node). The decision is based on resource availability (the location and
size of the task's arguments are currently ignored). If the local node has
enough resources to satisfy task's demand, then the task is placed locally,
otherwise it is forwarded to another node. This placement decision is not
final. The task can later be spilled over to another node.

- **WaitForActorCreation**: an actor method (task) is waiting for its actor to get
instantiated. Once the actor is created, the task transitions into the
waiting state, if the actor is local, or it is forwarded to the remote machine
running the actor.
instantiated. Once the actor is created, the task will be forwarded to the
remote machine running the actor.

- **Waiting**: the task is waiting for its argument dependencies to be satisfied,
i.e., for its arguments to be transferred to the local object store.
@@ -24,18 +24,30 @@ A task can be in one of the following states:
worker/actor.

- **Blocked**: the task is being blocked as some data objects it depends on are not
available, e.g., because the task has launched another task and it waits
for the results, ore because of failures.
available, e.g., because the task has launched another task and is waiting
for the results.

- **Infeasible:** the task has resource requirements that are not satisfied by
any machine.
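
The Placeable and Infeasible definitions above describe a resource-based placement decision. It can be sketched roughly as follows. This is an illustration under stated assumptions, not the raylet's C++ policy: the function names, the use of total capacity to choose a forwarding target, and the fallback of keeping a task local when only the local node could ever run it are all hypothetical.

```python
from typing import Dict, Optional

Resources = Dict[str, float]


def fits(demand: Resources, supply: Resources) -> bool:
    """Return True if every requested quantity is covered by the supply."""
    return all(supply.get(name, 0.0) >= amount for name, amount in demand.items())


def place(demand: Resources, local: str,
          available: Dict[str, Resources],
          totals: Dict[str, Resources]) -> Optional[str]:
    """Pick a node for a placeable task, or None if the task is infeasible."""
    # 1. Prefer the local node when it has enough free resources right now.
    if fits(demand, available[local]):
        return local
    # 2. Otherwise forward to a remote node whose total capacity fits the task.
    for node, capacity in totals.items():
        if node != local and fits(demand, capacity):
            return node
    # 3. Assumed fallback: only the local node can ever run it, so keep it here.
    if fits(demand, totals[local]):
        return local
    # 4. No machine in the cluster satisfies the demand.
    return None


# Made-up two-node cluster; node "A" is the local node and is mostly busy.
totals = {"A": {"CPU": 4.0}, "B": {"CPU": 8.0, "GPU": 1.0}}
available = {"A": {"CPU": 1.0}, "B": {"CPU": 8.0, "GPU": 1.0}}

print(place({"CPU": 1.0}, "A", available, totals))  # fits locally
print(place({"CPU": 2.0}, "A", available, totals))  # forwarded to a remote node
print(place({"GPU": 2.0}, "A", available, totals))  # no node has 2 GPUs
```

As the text notes, the location and size of a task's arguments are ignored here too; only resource quantities enter the decision.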

::

forward
------
| | resource arguments actor/worker
| v available local available
Placeable ----------> Waiting --------> Ready ---------> Running
| ^ ^ | ^
actor | | actor | actor worker | | worker
created | | created | created blocked | | unblocked
v | (remote) | (local) v |
WaitForActorCreation--------- Blocked
---------------------------------
| |
| forward | forward
|---------------- |
node with ------| | arguments |
resources forward| | resource | local | actor/worker
joins | v available | --------> | available
---------------------- Placeable ----------> Waiting Ready ---------> Running
| | | ^ ^ <-------- ^ | ^
| |--------- | | | local arg | | |
| | | | | evicted | worker | | worker
| | actor | | | | blocked | | unblocked
| resources | created | | actor | --------------- | |
| infeasible | | | created | actor | |
| | | | (remote) | created v |
| | v | | (local) Blocked
| | WaitForActorCreation----------
| v
----Infeasible
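
The updated diagram can also be read as a transition table. The sketch below encodes only the edges whose endpoints and labels are unambiguous in the flattened diagram; the "forward" edges and the edge from Placeable into WaitForActorCreation are left out because their exact endpoints or triggers cannot be recovered with confidence here. The enum, the event strings, and the `step` helper are hypothetical names for illustration and do not come from the raylet code.

```python
from enum import Enum


class TaskState(Enum):
    PLACEABLE = "Placeable"
    WAIT_FOR_ACTOR_CREATION = "WaitForActorCreation"
    WAITING = "Waiting"
    READY = "Ready"
    RUNNING = "Running"
    BLOCKED = "Blocked"
    INFEASIBLE = "Infeasible"


# (current state, event label taken from the diagram) -> next state
TRANSITIONS = {
    (TaskState.PLACEABLE, "resource available"): TaskState.WAITING,
    (TaskState.PLACEABLE, "resources infeasible"): TaskState.INFEASIBLE,
    (TaskState.INFEASIBLE, "node with resources joins"): TaskState.PLACEABLE,
    (TaskState.WAIT_FOR_ACTOR_CREATION, "actor created (local)"): TaskState.WAITING,
    (TaskState.WAIT_FOR_ACTOR_CREATION, "actor created (remote)"): TaskState.PLACEABLE,
    (TaskState.WAITING, "arguments local"): TaskState.READY,
    (TaskState.READY, "local arg evicted"): TaskState.WAITING,
    (TaskState.READY, "actor/worker available"): TaskState.RUNNING,
    (TaskState.RUNNING, "worker blocked"): TaskState.BLOCKED,
    (TaskState.BLOCKED, "worker unblocked"): TaskState.RUNNING,
}


def step(state: TaskState, event: str) -> TaskState:
    """Apply one event, rejecting transitions the table does not list."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event!r}")


# Walk a task from placement to execution.
state = TaskState.PLACEABLE
for event in ("resource available", "arguments local", "actor/worker available"):
    state = step(state, event)
print(state.value)
```

Keeping the table as data makes it straightforward to check a queue-handling change against the documented states, for example by asserting that every observed transition appears in `TRANSITIONS`.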