The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains a state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT that uses structural reparameterization to lower the memory access cost by removing skip connections in the network. We further apply train-time overparameterization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression), with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.
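To make the reparameterization idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single-channel 1D signal, an odd-length kernel, zero padding, and no normalization layer; all function names are hypothetical. Under those assumptions, a train-time block of the form y = x + conv(x) can be replaced at inference by a single convolution whose centre tap is increased by 1, so the skip connection (and its extra memory read) disappears.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D cross-correlation with zero padding (output length == input length)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]


def train_time_block(x, kernel):
    """Train-time form (assumed): skip connection plus convolution, y = x + conv(x)."""
    return [xi + ci for xi, ci in zip(x, conv1d_same(x, kernel))]


def fold_skip_into_kernel(kernel):
    """Fold the identity branch into the kernel by adding 1 to the centre tap."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd so that a centre tap exists")
    folded = list(kernel)
    folded[len(kernel) // 2] += 1.0
    return folded


def inference_block(x, kernel):
    """Inference-time form: one convolution, no skip connection."""
    return conv1d_same(x, fold_skip_into_kernel(kernel))


# Illustrative values only; they are not taken from the paper.
x = [1.0, -2.0, 0.5, 3.0, 4.0]
kernel = [0.25, -0.5, 0.125]

y_train = train_time_block(x, kernel)
y_infer = inference_block(x, kernel)
max_diff = max(abs(a - b) for a, b in zip(y_train, y_infer))
```

Both forms compute the same output, so `max_diff` is zero up to floating-point rounding. In a real network the same folding would be applied per channel to a depthwise 2D convolution, and any normalization layer would need to be folded into the kernel first.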