ByteDance: UI-TARS 7B (Model, 25/100) via “coordinate-based interaction targeting with sub-pixel precision”
UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Developed by ByteDance, it extends the UI-TARS framework with reinforcement...
Unique: Trained on diverse UI layouts to predict interaction coordinates with high precision, using visual context (element size, shape, text) to determine the optimal click target rather than simple center-of-bounding-box heuristics.
vs others: More accurate than bounding-box center calculations because it uses UI semantics to identify the region that is actually clickable. More robust than OCR-based coordinate detection because it also works on non-text elements such as icons.
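To make the contrast with the center-of-bounding-box heuristic concrete, here is a minimal sketch. It does not call UI-TARS or reproduce its output format; the `Box` type, the helper names, the screen and model image sizes, and the "predicted" point are all hypothetical values chosen for illustration. It shows a list row whose geometric center misses the only clickable control (a checkbox), and how a point predicted in the model's input image space could be rescaled to screen pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float


def bbox_center(box: Box) -> tuple[float, float]:
    """Baseline heuristic: click the geometric center of the element's box."""
    return ((box.left + box.right) / 2, (box.top + box.bottom) / 2)


def to_screen_pixels(point, model_size, screen_size):
    """Rescale a point from the model's input image space to screen pixels."""
    (x, y), (mw, mh), (sw, sh) = point, model_size, screen_size
    return (round(x * sw / mw), round(y * sh / mh))


def hits(box: Box, point) -> bool:
    """Return True if the point falls inside the box (edges included)."""
    x, y = point
    return box.left <= x <= box.right and box.top <= y <= box.bottom


# Assumed layout: a wide list row where only a small checkbox is clickable.
row = Box(left=0, top=200, right=800, bottom=280)
checkbox = Box(left=20, top=220, right=60, bottom=260)

# The center-of-box heuristic lands in the middle of the row, off the checkbox.
center_click = bbox_center(row)

# Stand-in for a model prediction made on an 800x600 downscaled screenshot
# (assumed value, not real model output), mapped to a 1600x1200 screen.
predicted_click = to_screen_pixels((20, 120), (800, 600), (1600, 1200))

print("center heuristic hits checkbox:", hits(checkbox, center_click))
print("predicted point hits checkbox:", hits(checkbox, predicted_click))
```

The rescaling step matters in practice because screenshots are often resized before being passed to a vision-language model, so a coordinate in the model's image space needs to be mapped back before it is used as a click target.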