References:
[1] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in 6th Annual Conference on Robot Learning, 2022.
[2] I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in Conference on Robot Learning, 2022.
[3] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[4] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., “Ego4D: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022.