Diving deep into universal grasping


This article reviews the challenges in achieving universal grasping: the ability to pick up and manipulate a previously unknown object the first time it's seen, without any mistakes. We will cover:

  • Moravec's Paradox - Why things we find easy are hard for robots
  • How robot grasping should work and why it often doesn't
  • How humans solve grasping - it's actually very very complex
  • The most impressive companies developing universal grasping solutions
  • How to cut through corporate marketing and understand the state of the art with Robot Olympiads
  • Where Industrials are finding value

Why can't robots just get a grip?

How is it possible that robots can generate beautiful art but still can’t pick up a random object? AIs are designing ad campaigns. So why has it taken physical motion so long to ketch-up?

Heinz used AI to design its ad campaign

It's due to a phenomenon in AI and robotics known as Moravec’s Paradox.

It is easy to train computers to do things that humans find hard, like mathematics and logic, and it is hard to train them to do things humans find easy, like walking and image recognition.  - Moravec's Paradox

If you want to learn more about Moravec and his paradox, read more here.

Nothing epitomises this paradox more than universal grasping. Grasping is one of the last steps required to bring robotics into parity with humans. Today, most of the interest in grasping is driven by e-commerce: an industry where almost every stage has been automated except picking and packing. This stage remains very challenging due to the sheer variety of products. For example, Amazon has around 12 million unique product variants. But it’s not just e-commerce: Dyson is interested in universal picking for home robotics, and Google has a spinout company which is targeting pretty much every industry imaginable.

How should universal grasping work?

We’re talking about robot systems, so it's no surprise that universal grasping is tackled by the usual suspects: a robot, grippers, sensors and a control system. We’ll discuss the dominant technologies for each next week. If we abstract away from specific technological solutions, we can say there are four stages to grasping: planning the route, grasping the object, manipulating it through space and successfully placing it in the desired location. Although simple in principle, in practice an autonomous system must perform numerous complex tasks in each stage. We’ve broken down the main ones in this chart:


The 4 Stages of robot grasping (according to Remix)
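
To make the four-stage view concrete, here is a minimal Python sketch of a pick attempt as an ordered pipeline that stops at the first stage to fail. It is an illustration only, not how any particular vendor structures its software: the names `Stage`, `PickResult`, `run_pick` and the stand-in handlers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Stage(Enum):
    PLAN = 1        # plan the route to the object
    GRASP = 2       # grasp the object
    MANIPULATE = 3  # move it through space
    PLACE = 4       # place it in the desired location


@dataclass
class PickResult:
    succeeded: bool
    failed_stage: Optional[Stage] = None


def run_pick(handlers: Dict[Stage, Callable[[], bool]]) -> PickResult:
    """Run the four stages in order and stop at the first one that fails."""
    for stage in Stage:  # Enum iteration follows definition order
        if not handlers[stage]():
            return PickResult(succeeded=False, failed_stage=stage)
    return PickResult(succeeded=True)


# Hypothetical stand-in handlers: a real system would call its perception,
# planning and control code here instead of returning a fixed value.
handlers = {stage: (lambda: True) for stage in Stage}
handlers[Stage.MANIPULATE] = lambda: False  # e.g. the object slips in transit

result = run_pick(handlers)
print(result.failed_stage)  # Stage.MANIPULATE
```

The point of the structure is that a failure in any one stage ends the attempt, so every stage has to work for a pick to count as a success.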

Why it often doesn't work

It may seem like we're overcomplicating things, but in industry (or any dynamic environment) major issues can emerge in all four stages. Generally, the problems originate from limitations in perception and control. Outside of a lab, systems struggle to capture and process the level of detail required in real time. There is always a difference between the controller's view of reality and reality itself. It turns out that grasping is very sensitive to change, so these differences (small, in the best-case scenario) compound into a large impact on success rates.
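
As a rough illustration of how failures compound, suppose each of the four stages succeeds independently with some probability. The overall success rate is then the product of the per-stage rates. The 95% figure below is an assumption chosen for the arithmetic, not a measured value, and real stages are unlikely to be fully independent.

```python
def overall_success(stage_rates):
    """Multiply per-stage success rates, assuming the stages are independent."""
    total = 1.0
    for rate in stage_rates:
        total *= rate
    return total


# Assumed, illustrative numbers: each of the four stages works 95% of the time.
rates = [0.95, 0.95, 0.95, 0.95]
print(round(overall_success(rates), 3))  # 0.815
```

Under these assumptions, four stages that each look reliable on their own still fail on nearly one pick in five.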

To match a human, a system needs to hit cycle times of under 2 seconds for the pick and place (PnP) of a Never-Been-Seen-Before (NBSB? - what an awful acronym) object. Currently, this isn't enough time to accurately:

  • Identify an object's properties - e.g. centre of mass, friction, hardness, stiffness, number of moving components etc.
  • Model an object’s behaviour due to its properties - will it slide on the surface it's resting on if touched, will it deform when gripped, can it support its own weight etc.
  • Map the environment and keep track of it as the object moves in real time

As a result, in Stage 1, perception systems are still pretty poor at distinguishing new objects from backgrounds and other objects, and at automatically determining the best type of gripper to use, the best pick location for the gripper and how to move into position without collisions.


Looks like we have a collision here...

If Stage 1 goes well, Stage 2 is usually pretty smooth. Any mistakes though, and all bets are off. Issues can crop up if the gripper collides with other objects or accidentally unbalances the object it's trying to pick.

The first two stages are where researchers have focused the majority of their efforts and, as we’ll see over the next few weeks, where there has been the greatest success to date. Unfortunately, the problems really start in the last two stages.

In Stage 3, systems can struggle to transport an object without colliding with the robot itself or other parts of the environment. Any change in the object’s position within the gripper can throw the system off, and while solutions for avoiding the entanglement of objects exist, no general solution has been proposed to unhook complex object geometries.


Finally, in Stage 4, simply dropping objects, or even throwing them, is straightforward. However, accurate placement is tricky on a flat surface, let alone into a container. The friction between a part and the grippers can change how an item falls, as can its centre of mass etc. Ensuring in real time that collisions are avoided and delicate objects are not crushed has not received sufficient attention to date, and so remains a large barrier to industrial deployment of intelligent pick and place for NBSB objects.

How humans solve the issue

To put the challenge into context, let's look at how the human body is shaped to deal with the challenge. Try to reframe all of these “features” as if you had to design an electromechanical system with the same capabilities.


Human fingertips are probably the most sensitive skin areas in the animal world; they can feel the difference between a smooth surface and one with a pattern embedded just 13 nm deep. They’re able to differentiate between a wide range of textures, materials, temperatures, and pressures. They can sense pressure as light as a feather while being robust enough to handle loads of 20 kg, high temperatures, abrasives etc. Our sense of touch is so responsive it can bypass our brain, with the reaction to touch stimuli occurring in just 15 milliseconds.


Cameras are actually very similar to eyes (which makes sense - we copied them) but the resolution of the human eye is 576 megapixels in 3D - hard to beat. Even if standard machine vision cameras reach this level, it's our brains that really set human vision apart. And speaking of brains…


To understand our brains it's useful to contrast them with computers:

Jack Pearson