SmolVLA: a vision-language-action (VLA) paper on training real-world robots to perform tasks
Input:
Images of the surroundings, a task description in text, and the state (a sensorimotor snapshot at a given time). Sensorimotor states are projected into a single token using a linear layer to align with the token dimension of the language model.
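The projection step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `STATE_DIM`, `TOKEN_DIM`, the token counts, the random weights `W` and `b`, and the concatenation order are all hypothetical placeholders, not values taken from the paper.

```python
import numpy as np

STATE_DIM = 22   # illustrative size of the flat sensorimotor state vector
TOKEN_DIM = 64   # hypothetical; in practice this is the language model's hidden size

rng = np.random.default_rng(0)

# Hypothetical parameters of the linear layer (random here, learned in practice).
W = rng.normal(scale=0.02, size=(TOKEN_DIM, STATE_DIM))
b = np.zeros(TOKEN_DIM)


def project_state(state):
    """Map a flat state vector to a single token embedding of size TOKEN_DIM."""
    state = np.asarray(state, dtype=np.float64)
    if state.shape != (STATE_DIM,):
        raise ValueError(f"expected shape ({STATE_DIM},), got {state.shape}")
    return W @ state + b


# Placeholder sensor readings standing in for a real robot state.
state = rng.uniform(-1.0, 1.0, size=STATE_DIM)
state_token = project_state(state)[None, :]        # shape (1, TOKEN_DIM)

# Placeholder image and text embeddings (16 and 8 tokens are arbitrary counts).
image_tokens = rng.normal(size=(16, TOKEN_DIM))
text_tokens = rng.normal(size=(8, TOKEN_DIM))

# Assumed ordering: image tokens, then text tokens, then the single state token.
sequence = np.concatenate([image_tokens, text_tokens, state_token], axis=0)
print(sequence.shape)  # (25, 64)
```

However many readings the state contains, it adds exactly one row to the sequence, which keeps the state's contribution to the context length constant.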
Example input:

    inputs = {
        "image": image_tensor,  # e.g., [3, 224, 224] RGB image from robot's camera
        "instruction": "Put the red cube in the bin",
        "state": np.array([
            0.12, -1.05, 0.44, 0.31, -0.22, 0.08, 1.57,  # joint positions (7)
            0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00,    # joint velocities (7)
            0.85,                                        # gripper open (1.0=open, 0.0=closed)
            0.45, 0.12, 0.33,                            # end-effector position (x, y, z)
            0.0, 0.0, 0.0, 1.0,                          # end-effector orientation (quaternion)
            # any other relevant sensor readings
        ])
    }

Output:
Produced by the action expert, which predicts the action the robot should take next. Flow matching is a way to train the action expert so that it can generate smooth, realistic action sequences quickly and efficiently.
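To make the flow matching idea concrete, here is a minimal NumPy sketch of the generic recipe: interpolate between noise and a ground-truth action chunk, regress the velocity along that path, and generate actions by integrating the learned velocity from noise. It is not the paper's exact formulation. The straight-line path, uniform time sampling, Euler integrator, chunk size (50 steps of 6 dimensions), and every function name are assumptions for illustration, and conditioning on the observation is left out. A closed-form `oracle_velocity` stands in for the trained action expert so that the example runs without a network.

```python
import numpy as np


def interpolate(actions, noise, tau):
    """Point on the straight path from noise (tau=0) to actions (tau=1)."""
    return (1.0 - tau) * noise + tau * actions


def flow_matching_loss(velocity_fn, actions, rng):
    """Squared error between predicted and target velocity for one action chunk."""
    noise = rng.standard_normal(actions.shape)
    tau = rng.uniform(0.0, 0.99)           # assumption: uniform time, kept below 1
    noisy_actions = interpolate(actions, noise, tau)
    target_velocity = actions - noise       # derivative of the path w.r.t. tau
    predicted = velocity_fn(noisy_actions, tau)
    return float(np.mean((predicted - target_velocity) ** 2))


def sample_actions(velocity_fn, shape, rng, num_steps=10):
    """Generate an action chunk by Euler-integrating the velocity field from noise."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Hypothetical demonstration chunk: 50 future actions, 6 dimensions each.
demo_actions = np.linspace(-0.5, 0.5, 50 * 6).reshape(50, 6)


def oracle_velocity(x, tau):
    """Exact velocity field when the dataset holds only demo_actions.

    A trained action expert would approximate this with a network that is
    also conditioned on image, text, and state features.
    """
    return (demo_actions - x) / (1.0 - tau)


rng = np.random.default_rng(0)
loss = flow_matching_loss(oracle_velocity, demo_actions, rng)
generated = sample_actions(oracle_velocity, demo_actions.shape, rng)
print("loss:", loss)
print("max abs error:", float(np.abs(generated - demo_actions).max()))
```

With the oracle, the loss is zero up to floating-point error and ten Euler steps recover the demonstration chunk. Needing only a handful of integration steps to produce a whole chunk is what makes generation fast.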
...