<aside> <img src="https://prod-files-secure.s3.us-west-2.amazonaws.com/93065546-5c75-483c-be16-cafbb40e21ec/886a6c2c-181c-4807-9665-f662f8c93c12/icon_autobot.jpeg" alt="icon_autobot" width="40px" /> “Now, all we need is a little Energon and a lot of luck” - Optimus Prime
</aside>
Isidoro Tamassia - <[email protected]>
Francesca Del Buono - <[email protected]>
Miłosz Jezierski - <[email protected]>
The code associated with this work can be found at https://github.com/TheEmotionalProgrammer/vit_pose_estimation
A space mission comprises multiple stages, each posing significant risks to its overall success. Spacecraft docking (Figure 1), critical for crew transfers and resupply missions, is one such high-risk stage, particularly when one of the two spacecraft is uncooperative, as is often the case in space debris removal missions. In these scenarios, it is desirable to equip the chaser spacecraft with tools that can rapidly estimate the pose of an uncooperative target using only onboard sensors and computers, since communication with Earth may be noisy, delayed, or altogether impossible.

Figure 1: A chaser spacecraft approaches the target space station during the docking process
To address this need, considerable effort has recently been put into monocular spacecraft pose estimation, defined as the problem of finding the relative position and orientation of a target spacecraft with respect to the camera reference frame mounted on a chaser spacecraft [2]. An effective pose estimation method should enable precise alignment and docking by providing accurate pose estimates in real time.
More practically, given an image and the corresponding intrinsic camera parameters, the goal is to estimate the translation and rotation between the camera and the target object [2]. The object's location in the camera reference frame is denoted by $\mathbf{r} = (x,y,z) \in \mathbb{R}^3$, and its orientation is commonly represented by a unit quaternion $\mathbf{q} = (q_0, q_1, q_2, q_3) \in \mathbb{R}^4$.
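To make this parameterization concrete, the unit quaternion above can be mapped to a standard $3\times 3$ rotation matrix. The sketch below assumes the scalar-first convention ($q_0$ is the scalar part), which matches the ordering $(q_0, q_1, q_2, q_3)$ used here, and uses the standard quaternion-to-rotation formula:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a scalar-first unit quaternion (q0, q1, q2, q3)
    into the corresponding 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)  # enforce unit norm before converting
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# The identity quaternion (1, 0, 0, 0) maps to the identity rotation.
R = quat_to_rotmat((1.0, 0.0, 0.0, 0.0))
```

Together, $\mathbf{r}$ and the rotation matrix derived from $\mathbf{q}$ fully specify the rigid transform between the camera frame and the target frame.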
Development in this area has been accelerated by the European Space Agency (ESA), which launched the Satellite Pose Estimation Challenge [1] in 2019, spurring the creation of multiple new algorithms based on a variety of techniques.
Unfortunately, the surprisingly high scores obtained by some of these models in simulated settings often do not translate into reliable performance in real settings. The main reason for this discrepancy is the heavy reliance on synthetic datasets for training, since collecting real satellite images is difficult. Models trained on synthetic data often underperform in real-world scenarios due to a phenomenon called domain shift, caused by differences in illumination conditions, level of detail of the spacecraft model, and background conditions between the rendered images and the real samples.
For instance, the hybrid CNN-based model created by the UniAdelaide researchers [13], which won the ESA pose estimation challenge, performs roughly 40 times worse on the real test data than on the synthetic data used for training.
While Convolutional Neural Networks (CNNs) demonstrate good in-distribution generalization, recent works [3] show that Vision Transformers (ViTs) [11] can generalize better to out-of-distribution samples, potentially leading to smaller accuracy drops when operating on images that differ significantly from the training data.
Another significant issue is that current state-of-the-art models are highly computationally intensive, which can limit their deployment on vehicles with limited onboard processing capabilities. For pose estimation to run in real time, inference must be fast, which is difficult when a model consists of hundreds of millions or even billions of parameters.
Proenca et al. [5] proposed URSONet, a straightforward approach that regresses position and orientation with a single CNN. URSONet stands out for its generalization capabilities, obtaining good scores on both the synthetic and the real test sets. However, it comprises a tremendous number of parameters (500 million) and incurs a huge computational cost, as it ensembles three ResNet-101 CNNs.