# SGLang
`strands-sglang` is an SGLang model provider for the Strands Agents SDK with Token-In/Token-Out (TITO) support for agentic RL training. It integrates directly with SGLang servers via the native `/generate` endpoint and is optimized for reinforcement learning workflows.
Features:

- **SGLang Native API**: Uses SGLang's native `/generate` endpoint with non-streaming POST for optimal parallelism
- **TITO Support**: Tracks complete token trajectories with logprobs for RL training, with no retokenization drift
- **Tool Call Parsing**: Customizable tool parsing aligned with model chat templates (Hermes/Qwen format)
- **Iteration Limiting**: Built-in hook to limit tool iterations with clean trajectory truncation
- **RL Training Optimized**: Connection pooling, aggressive retry (60 attempts), and non-streaming design aligned with Slime's `http_utils.py`
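The TITO bookkeeping these features describe can be sketched in plain Python (the names and token IDs below are illustrative, not the `strands-sglang` API): an episode is one flat token stream with a parallel loss mask marking which tokens the policy actually generated, so the trainer never has to retokenize text.

```python
# Illustrative TITO bookkeeping (not the strands-sglang API): one flat token
# stream, with a parallel loss mask marking which tokens the policy generated.
trajectory: list[int] = []
loss_mask: list[int] = []

def append_segment(token_ids: list[int], generated: bool) -> None:
    """Append a segment; only model-generated tokens are trained on."""
    trajectory.extend(token_ids)
    loss_mask.extend([1 if generated else 0] * len(token_ids))

append_segment([101, 2023, 3231], generated=False)  # system + user prompt
append_segment([7592, 1010, 999], generated=True)   # model reply (tool call)
append_segment([2054, 2003], generated=False)       # tool result fed back in
append_segment([1996, 3437], generated=True)        # final model answer

print(loss_mask)  # prompt/tool tokens masked out, generated tokens kept
```

Because the exact token IDs seen during generation are kept, the loss mask and logprobs stay aligned with the trajectory by construction.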
## Installation

Install `strands-sglang` along with the Strands Agents SDK:

```bash
pip install strands-sglang strands-agents-tools
```

## Requirements

- SGLang server running with your model
- HuggingFace tokenizer for the model
## 1. Start SGLang Server

First, start an SGLang server with your model:

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B-Instruct-2507 \
  --port 30000 \
  --host 0.0.0.0
```

## 2. Basic Agent
```python
import asyncio

from transformers import AutoTokenizer

from strands import Agent
from strands_tools import calculator

from strands_sglang import SGLangModel


async def main():
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
    model = SGLangModel(tokenizer=tokenizer, base_url="http://localhost:30000")
    agent = Agent(model=model, tools=[calculator])

    model.reset()  # Reset TITO state for new episode
    result = await agent.invoke_async("What is 25 * 17?")
    print(result)

    # Access TITO data for RL training
    print(f"Tokens: {model.token_manager.token_ids}")
    print(f"Loss mask: {model.token_manager.loss_mask}")
    print(f"Logprobs: {model.token_manager.logprobs}")


asyncio.run(main())
```

## 3. Slime RL Training
For RL training with Slime, `SGLangModel` with TITO eliminates the retokenization step:

```python
from strands import Agent, tool
from strands_sglang import SGLangClient, SGLangModel, ToolIterationLimiter

from slime.utils.types import Sample

SYSTEM_PROMPT = "..."
MAX_TOOL_ITERATIONS = 5

_client_cache: dict[str, SGLangClient] = {}


def get_client(args) -> SGLangClient:
    """Get shared client for connection pooling (like Slime)."""
    base_url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}"
    if base_url not in _client_cache:
        _client_cache[base_url] = SGLangClient.from_slime_args(args)
    return _client_cache[base_url]


@tool
def execute_python_code(code: str):
    """Execute Python code and return the output."""
    ...


async def generate(args, sample: Sample, sampling_params) -> Sample:
    """Generate with TITO: tokens captured during generation, no retokenization."""
    assert not args.partial_rollout, "Partial rollout not supported."

    state = GenerateState(args)

    # Set up Agent with SGLangModel and ToolIterationLimiter hook
    model = SGLangModel(
        tokenizer=state.tokenizer,
        client=get_client(args),
        model_id=args.hf_checkpoint.split("/")[-1],
        params={k: sampling_params[k] for k in ["max_new_tokens", "temperature", "top_p"]},
    )
    limiter = ToolIterationLimiter(max_iterations=MAX_TOOL_ITERATIONS)
    agent = Agent(
        model=model,
        tools=[execute_python_code],
        hooks=[limiter],
        callback_handler=None,
        system_prompt=SYSTEM_PROMPT,
    )

    # Run the agent loop
    prompt = sample.prompt if isinstance(sample.prompt, str) else sample.prompt[0]["content"]
    try:
        await agent.invoke_async(prompt)
        sample.status = Sample.Status.COMPLETED
    except Exception as e:
        # Always use TRUNCATED instead of ABORTED because Slime doesn't properly
        # handle ABORTED samples in reward processing. See: https://github.com/THUDM/slime/issues/200
        sample.status = Sample.Status.TRUNCATED
        logger.warning(f"TRUNCATED: {type(e).__name__}: {e}")

    # TITO: extract trajectory from token_manager
    tm = model.token_manager
    prompt_len = len(tm.segments[0])  # system + user are the first segment
    sample.tokens = tm.token_ids
    sample.loss_mask = tm.loss_mask[prompt_len:]
    sample.rollout_log_probs = tm.logprobs[prompt_len:]
    sample.response_length = len(sample.tokens) - prompt_len
    sample.response = model.tokenizer.decode(sample.tokens[prompt_len:], skip_special_tokens=False)

    # Cleanup and return
    model.reset()
    agent.cleanup()
    return sample
```

## Configuration
### Model Configuration

`SGLangModel` accepts the following parameters:
| Parameter | Description | Example | Required |
|---|---|---|---|
| `tokenizer` | HuggingFace tokenizer instance | `AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")` | Yes |
| `base_url` | SGLang server URL | `"http://localhost:30000"` | Yes (or `client`) |
| `client` | Pre-configured `SGLangClient` | `SGLangClient.from_slime_args(args)` | Yes (or `base_url`) |
| `model_id` | Model identifier for logging | `"Qwen3-4B-Instruct-2507"` | No |
| `params` | Generation parameters | `{"max_new_tokens": 2048, "temperature": 0.7}` | No |
| `enable_thinking` | Enable thinking mode for Qwen3 hybrid models | `True` or `False` | No |
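In practice, `params` is often built by forwarding a whitelisted subset of a larger sampling config, as the Slime example does (the values and the extra `n` key below are made up for illustration):

```python
# Illustrative rollout-level sampling config; only the generation keys are
# forwarded to the model as `params`.
sampling_params = {
    "max_new_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.95,
    "n": 8,  # rollout-level fan-out, not a per-request generation parameter
}

# Forward a whitelisted subset, mirroring the Slime example.
params = {k: sampling_params[k] for k in ["max_new_tokens", "temperature", "top_p"]}
print(params)  # only the whitelisted generation keys remain
```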
### Client Configuration

For RL training, use a centralized `SGLangClient` with connection pooling:
```python
from strands_sglang import SGLangClient, SGLangModel

# Option 1: Direct configuration
client = SGLangClient(
    base_url="http://localhost:30000",
    max_connections=1000,  # Default: 1000
    timeout=None,          # Default: None (infinite, like Slime)
    max_retries=60,        # Default: 60 (aggressive retry for RL stability)
    retry_delay=1.0,       # Default: 1.0 seconds
)

# Option 2: Adapted from Slime's training args
client = SGLangClient.from_slime_args(args)

model = SGLangModel(tokenizer=tokenizer, client=client)
```

| Parameter | Description | Default |
|---|---|---|
| `base_url` | SGLang server URL | Required |
| `max_connections` | Maximum concurrent connections | `1000` |
| `timeout` | Request timeout (`None` = infinite) | `None` |
| `max_retries` | Retry attempts on transient errors | `60` |
| `retry_delay` | Delay between retries (seconds) | `1.0` |
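The retry settings above can be pictured with a small loop (a sketch of the behavior, not the client's actual implementation): each transient failure waits `retry_delay` seconds, then tries again, up to `max_retries` attempts.

```python
import time

def request_with_retry(send, max_retries=60, retry_delay=1.0):
    """Retry `send` on transient errors, mimicking an aggressive-retry policy."""
    last_exc = None
    for _ in range(max_retries):
        try:
            return send()
        except ConnectionError as exc:  # treat as transient
            last_exc = exc
            time.sleep(retry_delay)
    raise last_exc

# Toy flaky endpoint: fails twice, then succeeds.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server busy")
    return "ok"

print(request_with_retry(flaky, max_retries=5, retry_delay=0.01))  # → ok
```

With the defaults (`max_retries=60`, `retry_delay=1.0`), a rollout survives roughly a minute of server unavailability, which is why the defaults are so aggressive for RL stability.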
## Troubleshooting

### Connection errors to SGLang server

Ensure your SGLang server is running and accessible:

```bash
# Check if the server is responding
curl http://localhost:30000/health
```

### Token trajectory mismatch
If TITO data doesn't match the expected output, ensure you call `model.reset()` before each new episode to clear the token manager state.
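The per-episode reset can be illustrated with a toy stand-in for the token manager (illustrative only, not the real class): without the `reset()` call, the second episode's trajectory would still contain the first episode's tokens.

```python
class ToyTokenManager:
    """Minimal stand-in for a TITO token manager (not the real class)."""

    def __init__(self):
        self.token_ids: list[int] = []

    def reset(self):
        self.token_ids.clear()

manager = ToyTokenManager()
for episode_tokens in [[1, 2, 3], [4, 5]]:
    manager.reset()                        # clear state from the previous episode
    manager.token_ids.extend(episode_tokens)
    print(manager.token_ids)               # only this episode's tokens
```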