Extending the use cases of AI into the physical realm
In the early 1950s, John von Neumann — the mathematician behind modern computing architecture, game theory, and the Manhattan Project — turned his attention to a deceptively simple question: can a machine build a copy of itself? His answer, developed in collaboration with Stanislaw Ulam, was the Universal Constructor: a theoretical automaton capable of reading a description of any machine, constructing that machine from raw materials, and then copying its own description into the offspring. It was a blueprint not just for a robot, but for life itself — abstracted into logic.
Von Neumann never saw it built. The hardware didn't exist. The control systems didn't exist. And crucially, the intelligence needed to interpret an open-ended environment, handle ambiguity, and make contextual decisions — that didn't exist either. His dream sat in the theoretical literature for seventy years, admired but unrealised.
Agentic AI changes the equation. A language model embedded in a robotic system can now read a high-level goal, decompose it into a sequence of physical actions, recover from unexpected states, and coordinate with other agents — all without hand-coded rules for every contingency. Pair that with advances in manipulator hardware, rapid-prototyping (a robot that can 3D-print its own replacement parts), and multi-agent coordination frameworks, and von Neumann's Universal Constructor stops being a thought experiment. The self-assembling, self-replicating robot is no longer a question of whether — only of how far we want to take it.
Traditional robots are deterministic state machines. The control loop is explicit: sense a value, apply a control law, drive an actuator. This works remarkably well for narrow, repetitive tasks in structured environments — a pick-and-place arm on an automotive line can operate at millimetre precision for decades, as long as the part is always in the same location, under the same light, on the same fixture. Remove any one of those constraints and the system breaks. The brittleness is not a bug in the implementation; it is a structural property of the architecture.
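That explicit loop can be made concrete in a few lines — a sketch, assuming an idealised one-axis plant and a proportional control law (all names and gains here are illustrative, not from any specific platform):

```python
# Minimal deterministic control loop: sense -> control law -> actuate.
# A proportional controller drives a simulated 1-D actuator toward a setpoint.

def p_controller(setpoint: float, measured: float, kp: float = 0.5) -> float:
    """Control law: command proportional to the tracking error."""
    return kp * (setpoint - measured)

def run_loop(setpoint: float, position: float = 0.0, steps: int = 50) -> float:
    """Run the sense/control/actuate cycle against an ideal plant."""
    for _ in range(steps):
        command = p_controller(setpoint, position)  # control law
        position += command                          # actuate (ideal plant)
    return position

print(f"final position: {run_loop(10.0):.3f}")
```

Every branch of this loop is written by a human in advance; nothing in it can respond to a situation the author did not anticipate — which is exactly the brittleness described above.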
The missing layer was always semantic understanding — the capacity to interpret a scene rather than just measure it. Classical CV could detect edges, segment by colour, match templates. What it could not do was generalise. A model trained to detect oranges on a conveyor had no useful representations for apples. Each new object required a new labelled dataset, a new model, a new integration test. The combinatorial explosion of real-world diversity made truly general robot perception seem intractable.
Foundation models trained on internet-scale data carry something qualitatively different: embedded world models. A vision-language model that has processed billions of image-caption pairs has developed internal representations of object categories, materials, spatial relationships, and physical affordances — not through explicit programming, but through statistical regularities in the data. The consequence for robotics is zero-shot generalisation: a model trained on grasping tasks can now interpret "hand me the blue one" and identify the correct object without a labelled dataset for that specific item, because the semantic category and the colour attribute are already factored into the model's representational space.
This changes the design of every layer in the perception-action pipeline. Understanding what changed, and where the hard problems still live, is the starting point for building systems that actually work.
At the sensing layer, not much has changed mechanically — cameras, lidar, IMUs, and encoders are the same transducers they were twenty years ago. What has changed is the expectation placed on them. A classical pipeline needed carefully calibrated, high-SNR input because every subsequent step was hand-coded and non-robust. Modern neural pipelines are far more tolerant of sensor noise and illumination variation, but they impose different failure modes: distribution shift is silent — the model continues to produce confident outputs as the input drifts outside the training domain.
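A common mitigation is a lightweight input-distribution monitor running alongside the model. A minimal sketch using per-frame mean intensity as the monitored statistic — a real system would track feature-space statistics, and the threshold here is illustrative:

```python
import numpy as np

def fit_baseline(frames: np.ndarray) -> tuple[float, float]:
    """Per-frame mean-intensity statistics from a calibration set."""
    means = frames.reshape(len(frames), -1).mean(axis=1)
    return float(means.mean()), float(means.std() + 1e-8)

def is_drifted(frame: np.ndarray, mu: float, sigma: float, z_max: float = 4.0) -> bool:
    """Flag a frame whose mean intensity is a statistical outlier."""
    z = abs(frame.mean() - mu) / sigma
    return z > z_max

rng = np.random.default_rng(0)
calib = rng.normal(120, 5, size=(100, 64, 64))  # synthetic calibration frames
mu, sigma = fit_baseline(calib)
dark = np.full((64, 64), 20.0)                  # severely underexposed frame
print(is_drifted(dark, mu, sigma))              # True — far outside baseline
```

The point is not the specific statistic but the architecture: the model's own confidence scores cannot be trusted to signal drift, so the check must live outside the model.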
At the feature extraction layer, vision transformers (ViTs) have largely displaced purely convolutional backbones for tasks requiring global context. The DINOv2 backbone from Meta AI, for instance, produces semantically rich patch embeddings without any task-specific supervision — a single pretrained backbone can serve detection, segmentation, depth estimation, and correspondence tasks through lightweight task heads. This dramatically reduces the per-task training burden.
At the planning layer, large language models have introduced a genuinely new capability: the ability to decompose natural language task descriptions into executable action sequences, reason about preconditions and failure modes, and adapt plans when execution deviates from expectation. This is the layer that makes "pick up the red cup and put it next to the sink" tractable without a custom state machine for every possible object-location pair. The hard problem is grounding — mapping abstract language-level plan steps to concrete robot primitives in a way that handles spatial ambiguity and physical constraint.
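A toy sketch of that grounding layer — mapping abstract plan steps onto a fixed primitive vocabulary. Everything here (the primitive names, the plan structure) is hypothetical; a real system resolves object references through perception and an LLM planner rather than a lookup:

```python
# Toy grounding layer: language-level plan steps -> robot primitives.
# PRIMITIVES is a hypothetical action vocabulary; a real planner would
# resolve object references via perception, not string formatting.

PRIMITIVES = {"move_to", "grasp", "release", "open_gripper"}

def ground_step(step: dict) -> list[str]:
    """Expand one abstract plan step into a primitive sequence."""
    if step["action"] == "pick":
        return [f"move_to({step['object']})", "open_gripper",
                f"grasp({step['object']})"]
    if step["action"] == "place":
        return [f"move_to({step['location']})", "release"]
    raise ValueError(f"ungroundable step: {step}")

plan = [{"action": "pick", "object": "red_cup"},
        {"action": "place", "location": "sink"}]
sequence = [p for step in plan for p in ground_step(step)]

for p in sequence:
    assert p.split("(")[0] in PRIMITIVES  # every step grounds to a known primitive
print(sequence)
```

The hard part that this sketch elides is exactly the grounding gap: turning `red_cup` into a 6-DoF grasp pose under spatial ambiguity and physical constraint.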
**Image classification.** Assigns a single class label to the entire image. Output is a probability distribution over N classes. No spatial localisation — the model cannot tell you where the object is.

**Object detection.** Predicts axis-aligned bounding boxes plus class labels for each object instance. The dominant metric is COCO mAP (mean average precision across IoU thresholds 0.5–0.95, 80 COCO classes). A detection head attaches to a feature backbone and adds location regression.

**Semantic segmentation.** Assigns a class label to every pixel. No distinction between individual object instances — two adjacent cars merge into a single "car" region. Useful for drivable-area detection, terrain classification, and scene understanding where instance identity is irrelevant.

**Instance segmentation.** Extends detection by producing a binary pixel mask for each detected object instance, not just a bounding box. Computationally heavier. Essential for tasks where the object boundary matters: bin picking, grasping, agricultural weed isolation. YOLOv8-seg is a common production choice.

**Pose estimation.** Two distinct problems: 6-DoF object pose (position + orientation of a known rigid object — foundational for robot grasping) and skeleton keypoint estimation (joint positions of a human or animal body). Both require knowledge of the 3D structure of the target.

**Depth estimation.** Stereo uses triangulation from two calibrated cameras — reliable absolute depth, but requires a baseline and calibration. Monocular learns depth priors from data — scale-ambiguous without reference objects, but cheap (one camera). DepthAnything v2 is a leading monocular model.

**Optical flow.** Estimates dense per-pixel 2D motion vectors between consecutive frames. Foundational for visual odometry, action recognition, and detecting moving objects in otherwise static scenes. RAFT is the canonical neural approach; classical Lucas-Kanade still runs on MCUs.
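The stereo case above reduces to one formula: depth Z = f·B/d, for focal length f in pixels, baseline B, and disparity d in pixels. A minimal sketch with illustrative numbers:

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulated depth from a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# 800 px focal length, 62.5 mm baseline, 25 px disparity -> 2.0 m
print(stereo_depth(800.0, 0.0625, 25.0))
```

The formula also explains the practical limits: depth resolution degrades quadratically with distance, and a small baseline (compact stereo modules) means poor accuracy beyond a few metres.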
| Library / Model | Primary Use | COCO mAP (det) | Speed | Edge Deploy |
|---|---|---|---|---|
| OpenCV 4.x | Classical CV, preprocessing, camera calibration, traditional geometry | No ML | <1ms | Any hardware, C++/Python |
| YOLOv8n | Real-time detection on edge hardware | 37.3 | 1.47ms A100 TRT | ONNX, TFLite, CoreML, OpenVINO |
| YOLOv8x | High-accuracy detection, latency-tolerant | 53.9 | 12.8ms A100 TRT | ONNX, TensorRT |
| RT-DETR-L | Transformer-based detection, no NMS required | 53.0 | 9.3ms A100 | ONNX export available |
| SAM2 | Zero-shot segmentation, interactive masking | N/A — mask IoU | ~35ms/frame | GPU required (A10G+ recommended) |
| DepthAnything v2 | Monocular depth, metric and relative modes | N/A — AbsRel 0.076 | 30fps (GPU) | ONNX export, quantised variants |
| MediaPipe | Pose, hand, face on-device at very low latency | Domain-specific | <5ms mobile | Mobile, web (WASM), some MCU support |
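The detection mAP figures above are averages of precision over IoU thresholds; box IoU itself is only a few lines (box format assumed to be `[x1, y1, x2, y2]`):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two unit boxes offset by half a side: intersection 0.5, union 1.5 -> 1/3
print(box_iou([0, 0, 1, 1], [0.5, 0, 1.5, 1]))
```

Averaging precision over IoU thresholds from 0.5 to 0.95 is what makes COCO mAP penalise sloppy box placement, not just missed detections.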
The open robotics ecosystem has matured significantly. These platforms represent the best starting points — chosen for documentation quality, community size, and realistic buildability. Costs are honest build-it-yourself estimates; kit prices are higher but reduce integration risk.
FarmBot Genesis is the best open-source starting point for autonomous garden weeding. The CNC gantry architecture is the right choice for this task — not because it's simpler than an arm, but because it is fundamentally better suited. A gantry provides sub-millimetre X/Y positioning across the entire working volume, deterministic path planning (no inverse kinematics required), and straightforward tool-head swapping. An arm requires IK solutions, has workspace singularities, and loses positional accuracy near the edges of its reach envelope. For a flat raised bed, the gantry wins on every practical dimension.
Here is the engineering reality of building a functional weed detection and removal system on top of this platform.
**Camera.** Mounted on the tool head, the camera moves with the gantry. A CS-mount lens allows focal-length selection — a 6mm lens at 300mm working height gives approximately 180mm × 135mm field of view. A fixed focal length avoids autofocus latency. The HQ sensor (Sony IMX477) has sufficient dynamic range for outdoor use, though direct sunlight at noon remains challenging.
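Those field-of-view numbers follow from the pinhole relation: ground coverage ≈ sensor dimension × working distance / focal length. The 3.6mm × 2.7mm active-area dimensions below are an assumption chosen to match the quoted field of view — the full IMX477 die is larger, and the usable area depends on the lens image circle and capture mode:

```python
def ground_fov_mm(sensor_dim_mm: float, distance_mm: float, focal_mm: float) -> float:
    """Pinhole approximation of the imaged ground extent at a given height."""
    return sensor_dim_mm * distance_mm / focal_mm

# 6 mm lens at 300 mm working height, assumed 3.6 mm x 2.7 mm active area
print(ground_fov_mm(3.6, 300, 6), ground_fov_mm(2.7, 300, 6))  # ~180 x ~135 mm
```

The same relation tells you the ground resolution: at 4056 pixels across 180mm, each pixel covers roughly 0.044mm — far finer than needed for centimetre-scale weed centroids.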
**Dataset and training.** The DeepWeeds dataset (CSIRO, CC BY 4.0) contains 17,509 images across 8 weed species photographed in North Queensland agricultural land — good for Australian conditions, but it requires domain adaptation for other geographies. An alternative is the WeedAI database (weed.ai), with 200k+ annotated images across multiple crop types and regions. Fine-tuning YOLOv8n from a COCO checkpoint takes approximately 2 hours on an RTX 3080 for 100 epochs at 640×640 input resolution. Expect 85–92% mAP@0.5 on the DeepWeeds validation split, degrading to 60–75% in novel environments without domain adaptation.
**Coordinate mapping.** The gantry provides X/Y coordinates in millimetres from a calibrated home position. The camera is offset from the tool-head centre by a known fixed vector. Converting pixel coordinates to gantry coordinates requires a planar homography calibration: place a calibration grid at bed-surface level, capture the grid image with the gantry at a known position, and solve for the 3×3 homography matrix mapping image pixels to millimetre coordinates.
Depth (Z) is estimated from the calibration: at a fixed working height, the camera-to-surface distance is constant. Tall plants break this assumption — a 10cm-tall weed's base is at the soil surface but its detection bounding box centroid will project to an incorrect gantry position unless corrected by the known plant height offset.
**Mechanical spike.** A 3–5mm steel spike driven into the soil at the detected weed centroid, optionally with rotation. No chemicals, no plumbing. ~90% effective on taproot weeds (dandelion, dock). Slower — 3–5 seconds per weed including gantry travel. Causes soil disturbance, which can promote adjacent weed germination. Works reliably in wet soil; hard, dry ground reduces effectiveness.

**Precision micro-dosing.** A peristaltic pump delivers 10–50µl of herbicide precisely at the weed location. Effective across weed types. Requires chemical-handling procedures, reservoir management, a nozzle-cleaning routine, and waterproof electronics. Regulatory requirements vary by herbicide and jurisdiction. Significant reduction in chemical use vs broadcast spraying (estimated 90–95%).

**Laser ablation.** The Carbon Robotics approach: a 150W CO2 laser at a 1cm² spot, ~200,000 weeds/hour on their commercial unit. Instantaneous, no chemicals, no soil disturbance. Requires a Class IV laser safety enclosure, interlock systems, a beam dump, and thermal imaging for spot verification. CO2 lasers at this power are expensive and require periodic tube replacement. Not practical for a DIY garden robot without significant safety engineering.
A minimal detection-and-projection loop (use `imgsz=320` instead of 640 for faster inference if the precision loss is acceptable):

```python
from ultralytics import YOLO
import cv2
import numpy as np

model = YOLO('yolov8n.pt')  # swap for fine-tuned DeepWeeds checkpoint
model.fuse()                # fuse conv+bn layers for faster inference

# Homography: maps image pixels → gantry mm coordinates.
# Calibrate once per session with a known grid target.
H = np.load('calibration_homography.npy')

cap = cv2.VideoCapture(0)   # FarmBot USB camera (or RTSP stream)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    results = model.predict(frame, conf=0.45, imgsz=640, device='cpu')
    for box in results[0].boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        # Project bounding box centroid to gantry coordinates
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        pt = np.array([[[cx, cy]]], dtype=np.float32)
        gantry_pt = cv2.perspectiveTransform(pt, H)[0][0]  # [mm_x, mm_y]
        label = (f"{model.names[cls]} {conf:.2f} → "
                 f"({gantry_pt[0]:.1f}mm, {gantry_pt[1]:.1f}mm)")
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, label, (int(x1), int(y1) - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 255, 0), 1)
    cv2.imshow('FarmBot Vision', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```
The Robot Operating System 2 is the de facto middleware for robot software. It provides a typed, distributed publish-subscribe message bus (DDS), a service/action framework, a hardware clock, the TF2 transform tree, and a large ecosystem of pre-built packages. Humble is the current LTS; Iron and Jazzy follow on a yearly cadence, but Humble has the broadest package support.
Key packages:

- **Nav2** — costmap-based 2D/3D navigation stack, replaces ROS1 `move_base`.
- **MoveIt2** — motion planning, collision checking, trajectory execution for arms.
- **ros2_control** — hardware abstraction layer; separates the control algorithm from hardware-specific driver code.
- **micro-ROS** — runs ROS2 on microcontrollers, connecting to the main ROS2 graph over serial/UDP.
DDS middleware choice matters. Cyclone DDS (Eclipse) has lower latency on a single machine — good for a self-contained robot. eProsima Fast DDS (the default) is better for distributed systems with nodes across multiple machines. Set `RMW_IMPLEMENTATION=rmw_cyclonedds_cpp` for single-machine deployments.
Gazebo Fortress (LTS, paired with Humble) and Gazebo Harmonic (newer, Ionic-compatible) are the open-source simulation standard. Physics via DART or Bullet, sensor simulation (lidar, depth camera, IMU), ROS2 bridge via `ros_gz`. Limitations: sensor simulation fidelity is adequate for algorithm development but not for photorealistic neural network training.
Isaac Sim (NVIDIA, Omniverse-based) provides photorealistic rendering via ray tracing, physically accurate sensor simulation (lidar, depth, RGB with noise models), and direct integration with Isaac ROS packages. Requires an RTX GPU and a machine with significant VRAM (16GB+). The domain randomisation and synthetic data generation capabilities make it the right choice if you need to train perception models in simulation before hardware deployment.
| Platform | CPU / GPU | AI TOPS | TDP | Best for |
|---|---|---|---|---|
| Raspberry Pi 5 | 4× ARM Cortex-A76 @ 2.4GHz | CPU only | 5–8W | Classical CV, lightweight ROS2 nodes, sensor fusion, gantry control |
| Coral USB Accelerator | Google Edge TPU (int8 only) | 4 TOPS (int8) | 2W | TFLite int8 model inference offload from Pi; requires full int8 quantisation |
| Jetson Orin Nano | 6-core ARM A78AE + 1024-core Ampere GPU | 40 TOPS | 7–15W | Full vision pipeline: YOLOv8, depth estimation, SAM2 on small images |
| Jetson AGX Orin | 12-core ARM + 2048-core Ampere + DLA ×2 | 275 TOPS | 15–60W | Multi-camera perception, real-time SLAM, lidar processing, full Nav2 stack |
| x86 + RTX 4060 | 3072 CUDA cores, 8GB GDDR6 | ~200 TOPS fp16 | 115W TDP | Development workstation, model training, offline data processing |
- OSRF publishes official ROS2 Docker images with consistent, reproducible environments.
- Use volume mounts for your workspace so build artifacts persist between container runs.
- GPU passthrough works via `--gpus all` on Linux with the NVIDIA Container Toolkit installed.
- X11 forwarding enables RViz, rqt, and Gazebo GUIs from within the container.
```shell
# Pull ROS2 Humble base image
docker pull ros:humble-ros-base

# Run with X11 forwarding for GUIs (RViz, Gazebo, rqt)
docker run -it --rm \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v $(pwd)/ros_ws:/ros_ws \
  ros:humble-ros-base bash

# With GPU passthrough (requires NVIDIA Container Toolkit)
docker run -it --rm --gpus all \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v $(pwd)/ros_ws:/ros_ws \
  nvcr.io/nvidia/isaac/ros:humble bash

# Inside container: create workspace, build
mkdir -p /ros_ws/src && cd /ros_ws
colcon build --symlink-install
source install/setup.bash

# Allow X11 connections from Docker (run on host before starting container)
xhost +local:docker
```