Guidance
Understanding GPU utilization and timing details of the operations is the first step in profiling your model.
To learn more about Tensor Cores and mixed precision training, visit this site:
https://developer.nvidia.com/tensor_cores
For resources on training networks with mixed precision and making full use of Tensor Cores in TensorFlow models, see:
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#training_tensorflow
Note that if multiple kernels are observed for a single op, the additional kernels are likely performing data transposes that prepare the data for efficient use by Tensor Cores. The transposes themselves do not use Tensor Cores.
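To make the note above concrete, the following minimal sketch groups kernel-level timings by op and reports what fraction of each op's GPU time ran on Tensor Cores. It assumes you have already exported per-kernel records from your profiler. The record layout, the op and kernel names, and the `summarize_ops` helper are all hypothetical and are not part of any profiler's API.

```python
from collections import defaultdict

# Hypothetical kernel-level records:
# (op name, kernel name, duration in ns, uses Tensor Cores).
# Names and numbers are illustrative only; they do not come from a real profile.
KERNELS = [
    ("dense_1/MatMul", "transpose_in_example", 40, False),
    ("dense_1/MatMul", "gemm_fp16_example", 900, True),
    ("dense_1/MatMul", "transpose_out_example", 60, False),
    ("relu_1", "elementwise_example", 100, False),
]


def summarize_ops(kernels):
    """Group kernels by op and compute the share of op time spent on Tensor Cores."""
    totals = defaultdict(lambda: {"kernels": 0, "total_ns": 0, "tc_ns": 0})
    for op, _kernel, duration_ns, uses_tc in kernels:
        entry = totals[op]
        entry["kernels"] += 1
        entry["total_ns"] += duration_ns
        if uses_tc:
            entry["tc_ns"] += duration_ns

    summary = {}
    for op, entry in totals.items():
        total = entry["total_ns"]
        summary[op] = {
            "kernels": entry["kernels"],
            "total_ns": total,
            # Guard against ops whose kernels all report zero duration.
            "tc_fraction": entry["tc_ns"] / total if total else 0.0,
        }
    return summary


summary = summarize_ops(KERNELS)
for op, stats in summary.items():
    print(
        f"{op}: {stats['kernels']} kernel(s), "
        f"{stats['tc_fraction']:.0%} of time on Tensor Cores"
    )
```

In this made-up data, the matrix multiply op launches three kernels, but only the middle one uses Tensor Cores. An op can therefore be counted as using Tensor Cores while still spending part of its time in transpose kernels that do not.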