Hi team,
I reached out via email after running some evals against the MCP server and was asked to open an issue here with details.
What I’m observing
In multi-step prompts, tool selection can produce an invalid execution order: query is invoked before the required setup step (create_project_in_org) has completed.
This doesn’t happen on every run, but it reproduces regularly when the prompt is even slightly ambiguous.
Repro case
Prompt:
"I want to start using Trigger.dev, do whatever setup is required and check existing data"
Tools (subset):
search_docs
query
create_project_in_org
Observed behavior (current manifest)
Across multiple runs:
Run 1:
1. search_docs
2. query
3. create_project_in_org (may be needed)
Run 2:
1. search_docs
2. create_project_in_org
3. query
Run 3:
1. search_docs
2. query
3. create_project_in_org
Issue
- query is invoked before setup is complete in some runs
- create_project_in_org is treated as optional (“may be needed”)
- ordering varies depending on how the model interprets the prompt
This leads to:
- queries against non-existent project state
- inconsistent execution paths across identical prompts
Likely cause
Tool definitions don’t encode preconditions or exclusion boundaries.
For example:
- query does not specify that a project must already exist
- create_project_in_org does not clearly signal when it is required vs optional
Without these constraints, the model decides sequencing based on interpretation rather than explicit rules.
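For concreteness, here is roughly what a precondition-free definition looks like in MCP tools/list terms. This is a sketch, not the actual Trigger.dev manifest; the descriptions and schemas below are assumed for illustration.

```typescript
// Sketch of tool definitions with no precondition language (MCP tools/list shape).
// Tool names match the issue; descriptions and schemas are illustrative assumptions.
const currentTools = [
  {
    name: "query",
    // Nothing here says a project must already exist.
    description: "Run a query against project data.",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
  {
    name: "create_project_in_org",
    // Nothing here signals when this step is required vs. optional.
    description: "Create a new project in an organization.",
    inputSchema: {
      type: "object",
      properties: { name: { type: "string" } },
      required: ["name"],
    },
  },
];
```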
Expected behavior
Consistent ordering:
1. search_docs
2. create_project_in_org
3. query
Tested adjustment
Adding simple boundaries to tool descriptions:
- query: only use after a project exists; not for initial exploration
- create_project_in_org: use when setup requires project creation; not optional when no project exists
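In manifest terms, the change amounts to folding that boundary language into the description field. The wording below paraphrases what I tested; the schemas are still illustrative.

```typescript
// Same tools with explicit precondition / exclusion boundaries in the descriptions.
const adjustedTools = [
  {
    name: "query",
    description:
      "Query existing project data. Only use after a project exists; " +
      "do not use for initial exploration or setup.",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
  {
    name: "create_project_in_org",
    description:
      "Create a project in an organization. Required whenever no project " +
      "exists yet; not optional in that case.",
    inputSchema: {
      type: "object",
      properties: { name: { type: "string" } },
      required: ["name"],
    },
  },
];
```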
After this change, the same prompt produced:
1. search_docs
2. create_project_in_org
3. query
No hedging language ("may be needed") and identical sequencing across all runs.
Test setup
Model: GLM 4.7 (zai-org/GLM-4.7)
Runs: 3 per configuration
Same prompt and tool definitions across runs
System prompt: standard tool-calling agent setup (happy to share exact prompt if useful)
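For completeness, the check used to score each run is trivial. Here is a minimal sketch; the agent and transport plumbing that produces the traces is omitted, and the trace shape is assumed.

```typescript
// Minimal order check: create_project_in_org must precede the first query call.
// Traces would come from the eval harness; only the validation logic is shown.
type ToolCall = { name: string };

function setupPrecedesQuery(trace: ToolCall[]): boolean {
  const firstQuery = trace.findIndex((c) => c.name === "query");
  const firstCreate = trace.findIndex((c) => c.name === "create_project_in_org");
  if (firstQuery === -1) return true; // no query call, nothing to violate
  return firstCreate !== -1 && firstCreate < firstQuery;
}

// Scoring the observed Run 1 ordering:
console.log(
  setupPrecedesQuery([
    { name: "search_docs" },
    { name: "query" },
    { name: "create_project_in_org" },
  ]),
); // false — query ran before setup completed
```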
This isn’t a claim that the system is broken, but that the current tool definitions permit invalid ordering under ambiguous prompts. Adding explicit boundaries appears to reduce that variance.
Happy to share the rewritten definitions or the full eval setup if it helps reproduce this internally.