Let's say we want to make a subagent that tests frontends. This would be a browser-use agent. So what should the main orchestrator do? The topology is as follows:

  1. Implement changes
  2. Decide how to test the changes and generate detailed test plans
  3. Launch browser-use agents that actually run those tests
  4. Each browser-use agent executes its plan and reports back
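
A minimal sketch of this topology, assuming a stub `run_browser_agent` in place of a real browser-use harness; all names here are hypothetical:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TestPlan:
    name: str
    steps: list[str]  # e.g. "click copy", "check clipboard contents"

async def run_browser_agent(plan: TestPlan) -> dict:
    # Placeholder: a real implementation would spawn a browser-use agent
    # (Playwright, the browser-use library, etc.) with the plan as its task.
    return {"plan": plan.name, "completed": True, "verdict": "pass"}

async def orchestrate(change_summary: str) -> list[dict]:
    # Step 2: in a real system an LLM would turn `change_summary` into
    # detailed test plans; hardcoded here to keep the sketch self-contained.
    plans = [
        TestPlan("copy button", ["click copy", "check clipboard contents"]),
        TestPlan("navbar tab", ["click the Pricing tab", "check the right tab is active"]),
    ]
    # Step 3: fan the plans out to browser-use subagents in parallel.
    return await asyncio.gather(*(run_browser_agent(p) for p in plans))

if __name__ == "__main__":
    print(asyncio.run(orchestrate("added copy-to-clipboard button")))
```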

When testing vibe-coded apps, there are two kinds of checks:

  1. Functional: does the app behave as expected when interacted with?
    1. Does this copy button actually save to my clipboard?
    2. Does clicking this tab in the navbar actually go to the correct tab?
  2. Visual: does the rendered page look right?
    1. Is there text overflow?
    2. Is anything off-center?
    3. Are gradients as expected?
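
One hypothetical way to encode this split in the test plan, since the two kinds need different verdict handling (the names are assumptions, not an existing API):

```python
from dataclasses import dataclass
from enum import Enum

class CheckKind(Enum):
    FUNCTIONAL = "functional"  # interact with the app, then assert behavior
    VISUAL = "visual"          # render the page, then judge appearance

@dataclass
class Check:
    kind: CheckKind
    action: str    # what the agent should do, e.g. "click the copy button"
    expected: str  # crisp pass/fail condition for functional; looser for visual

PLAN = [
    Check(CheckKind.FUNCTIONAL, "click the copy button", "clipboard contains the snippet"),
    Check(CheckKind.FUNCTIONAL, "click Pricing in the navbar", "Pricing tab is active"),
    Check(CheckKind.VISUAL, "look at the hero section", "no text overflow, heading centered"),
    Check(CheckKind.VISUAL, "look at the background", "gradient matches the design"),
]
```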

How should we prompt the agent to actually test this? How will the agent know that it has completed testing?
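
One answer, sketched below: put the checklist and an explicit completion criterion directly in the subagent's prompt, so "done" means "every check has a recorded verdict". The template is an assumption, not a known-good prompt:

```python
# Hypothetical prompt template for a browser-use test subagent. The key move
# is to define completion explicitly: the agent is done when every check has
# a recorded verdict, not when it "feels" finished.
PROMPT_TEMPLATE = """\
You are testing a web app at {url}.

Work through each check below. For FUNCTIONAL checks, perform the interaction
and record pass/fail against the expected behavior. For VISUAL checks, take a
screenshot and judge it against the expectation.

Checks:
{checks}

You are finished only when every check has a verdict. Reply with JSON:
{{"results": [{{"check": ..., "completed": ..., "verdict": ..., "evidence": ...}}]}}
"""

def render_prompt(url: str, checks: list[str]) -> str:
    return PROMPT_TEMPLATE.format(
        url=url,
        checks="\n".join(f"- {c}" for c in checks),
    )
```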

The agent outcome for visual checks will almost always be “success”, because the agent can complete its task and then check the visual stuff regardless of how the page looks; the actual visual verdict has to be reported separately. The agent outcome for functional checks can be a genuine failure, if the agent couldn't actually perform the action.
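
This is why the subagent's result schema should separate task completion from the verdict. A hedged sketch (field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    check: str
    completed: bool  # did the agent finish its steps? (nearly always True for visual)
    verdict: str     # "pass" or "fail": the actual judgment
    evidence: str    # screenshot path, clipboard contents, etc.

def interpret(result: CheckResult, kind: str) -> str:
    # For functional checks, failing to complete the steps is itself a failure.
    if kind == "functional" and not result.completed:
        return "fail"
    # For visual checks, completion tells us nothing; only the verdict counts.
    return result.verdict
```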

Benchmarks:

  • Mind2Web
    • a multimodal variant exists that includes screenshots
  • Mind2Web 2
    • targets “agentic search”
  • real evals
    • not public

Midtraining (VQA tasks on web UIs):