Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous works focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information, as defined in ToolSandbox, challenge even the most capable SOTA LLMs, providing new insights into the tool-use capabilities of LLMs.