Achieving a high recognition rate for text in video action images is challenging because such images contain multiple types of text against unpredictable backgrounds. We propose a new method for classifying captions (edited text) and scene texts (text that is part of the image) in video images of the Yoga, Concert, Teleshopping, Craft, and Recipe classes. The proposed method introduces a new fusion criterion based on DCT and Fourier coefficients to extract features that represent the good clarity and visibility of captions, which separates them from scene texts. The variances of the coefficients at corresponding pixels of the DCT and Fourier images are computed to derive the respective weights. The weights and coefficients are then used to generate a fused image. Furthermore, the proposed method estimates sparsity in the Canny edge image of each fused image to derive rules for classifying caption and scene texts. Lastly, the proposed method is evaluated on images of the five action image classes mentioned above to validate the derived rules. Comparative studies with state-of-the-art methods on standard databases show that the proposed method outperforms existing methods in terms of classification. Recognition experiments before and after classification show that the recognition rate improves significantly after classification.
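
The fusion and sparsity steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the variances are computed or how the weights are normalised, so the sketch assumes a local window variance around each position and weights proportional to that variance. The function names `local_variance`, `fuse`, and `sparsity`, the `radius` parameter, and the definition of sparsity as the fraction of zero-valued edge pixels are all hypothetical choices made for illustration. The inputs are assumed to be two real-valued coefficient grids of equal size (for example, DCT coefficients and Fourier magnitudes), and the edge map is assumed to come from a separate Canny step that is not shown.

```python
def local_variance(grid, y, x, radius=1):
    """Population variance of the values in a window centred on (y, x).

    The window is clipped at the grid borders. Using a local window is an
    assumption; the source only says variances are computed per pixel.
    """
    h, w = len(grid), len(grid[0])
    vals = [grid[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def fuse(dct_coeffs, fourier_coeffs, radius=1):
    """Fuse two equally sized coefficient grids with variance-based weights.

    Assumed rule: each source is weighted by its share of the combined
    local variance; equal weights are used where both variances are zero.
    """
    h, w = len(dct_coeffs), len(dct_coeffs[0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            var_d = local_variance(dct_coeffs, y, x, radius)
            var_f = local_variance(fourier_coeffs, y, x, radius)
            total = var_d + var_f
            w_d = var_d / total if total > 0 else 0.5
            fused[y][x] = (w_d * dct_coeffs[y][x]
                           + (1.0 - w_d) * fourier_coeffs[y][x])
    return fused


def sparsity(edge_map):
    """Fraction of zero-valued pixels in a binary edge map (assumed measure)."""
    cells = [v for row in edge_map for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)
```

Under these assumptions, a grid with no local variation receives zero weight wherever the other grid varies, so the fused output follows the more informative source. A classification rule would then compare `sparsity` of the edge map against thresholds derived from training images; the abstract does not state those rules, so they are left out here.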