dstackai · peterschmidt85 · Feb 20, 2026 · Feb 19, 2026 · Feb 19, 2026 · Feb 19, 2026
diff --git a/docs/docs/concepts/gateways.md b/docs/docs/concepts/gateways.md
@@ -110,7 +110,11 @@ router:
 
 </div>
 
-!!! info "Policy"
+If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
+
+> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
+
+??? info "Policy"
     The `policy` property allows you to configure the routing policy:
 
     * `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue. 
@@ -119,9 +123,6 @@ router:
     * `round_robin` &mdash; Cycles through workers in order.                                                             
 
 
-> Currently, services using this type of gateway must run standard SGLang workers. See the [example](../../examples/inference/sglang/index.md).
->
-> Support for prefill/decode disaggregation and auto-scaling based on inter-token latency is coming soon.
 
 ### Public IP
 

diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
@@ -182,6 +182,8 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 > The `scaling` property requires creating a [gateway](gateways.md).
 
+<span id="replica-groups"></span>
+
 ??? info "Replica groups"
     A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules.
 
@@ -230,8 +232,9 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
     > Properties such as `regions`, `port`, `image`, `env` and some other cannot be configured per replica group. This support is coming soon.
 
-??? info "Disaggregated serving"
-    Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
 
 ### Authorization
 

diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
@@ -9,7 +9,7 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGL
 
 ## Apply a configuration
 
-Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SgLang.
+Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
 
 === "NVIDIA"
 
@@ -108,15 +108,106 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
 ```
 </div>
 
-!!! info "SGLang Model Gateway"
-    If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+!!! info "Router policy"
+    If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
 
-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
+
+## Configuration options
+
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
+
+<div editor-title="examples/inference/sglang/pd.dstack.yml">
+
+```yaml
+type: service
+name: prefill-decode
+image: lmsysorg/sglang:latest
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode prefill \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: H200
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode decode \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000
+    resources:
+      gpu: H200
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+# Custom probe is required for PD disaggregation
+probes:
+  - type: http
+    url: /health_generate
+    interval: 15s
+
+router:
+  type: sglang
+  pd_disaggregation: true
+```
+
+</div>
+
+Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
+
+#### Gateway
+
+Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
+
+For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
+
+<div editor-title="gateway.dstack.yml">
+
+```yaml
+type: gateway
+name: gateway-name
+
+backend: kubernetes
+region: any
+
+domain: example.com
+router:
+  type: sglang
+```
+
+</div>
+
+<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
 
 ## Source code
 
-The source-code of this example can be found in
-[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang).
+The source-code of these examples can be found in
+[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang).
 
 ## What's next?
 

diff --git a/examples/inference/sglang/pd.dstack.yml b/examples/inference/sglang/pd.dstack.yml
@@ -0,0 +1,51 @@
+type: service
+name: prefill-decode
+image: lmsysorg/sglang:latest
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+          python -m sglang.launch_server \
+            --model-path $MODEL_ID \
+            --disaggregation-mode prefill \
+            --disaggregation-transfer-backend mooncake \
+            --host 0.0.0.0 \
+            --port 8000 \
+            --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: 1
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+          python -m sglang.launch_server \
+            --model-path $MODEL_ID \
+            --disaggregation-mode decode \
+            --disaggregation-transfer-backend mooncake \
+            --host 0.0.0.0 \
+            --port 8000
+    resources:
+      gpu: 1
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+probes:
+  - type: http
+    url: /health_generate
+    interval: 15s
+
+router:
+  type: sglang
+  pd_disaggregation: true