True Vision AI: Deepfake Video Detection Using a Hybrid Ensemble of Xception and Video Vision Transformer (ViViT)
DOI:
https://doi.org/10.47392/IRJAEM.2026.0317
Keywords:
Deepfake Detection, Xception, Video Vision Transformer (ViViT), Ensemble Learning.
Abstract
Deepfakes have become a pressing problem: what once took a team of visual effects experts weeks can now be done in minutes with freely available software, and the results are increasingly indistinguishable from real footage. We present True Vision AI, a deepfake video detection system built on a two-stream ensemble that combines spatial and temporal analysis. A fine-tuned Xception network (pre-trained on ImageNet) detects subtle visual inconsistencies in individual frames, while a Video Vision Transformer (ViViT-B/16x2, pre-trained on Kinetics-400) detects motion-level anomalies across frames. Features from both networks are merged into a unified 2,816-dimensional vector and fed into a compact classifier that labels the video as real or fake. Trained and tested on the Celeb-DF dataset (890 genuine videos and 808 deepfakes), the Xception model achieves 88.5% validation accuracy, ViViT achieves 87.0%, and the ensemble achieves 88.3%. The final system is deployed as a lightweight Flask API that returns a verdict, a confidence score, and a frame-level breakdown of where manipulation is likely to occur.
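The feature-merging step described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes the 2,816-dimensional vector is a plain concatenation of a 2,048-dimensional pooled Xception feature and a 768-dimensional ViViT-B feature (the standard output sizes of those backbones; the abstract gives only the total). The names `fuse` and `classify`, the single-layer logistic head, and the random weights are hypothetical stand-ins, not the authors' classifier.

```python
import math
import random

XCEPTION_DIM = 2048                    # assumed pooled Xception feature size
VIVIT_DIM = 768                        # assumed ViViT-B hidden size
FUSED_DIM = XCEPTION_DIM + VIVIT_DIM   # 2,816, matching the abstract


def fuse(spatial, temporal):
    """Concatenate the per-video spatial and temporal feature vectors."""
    if len(spatial) != XCEPTION_DIM or len(temporal) != VIVIT_DIM:
        raise ValueError("unexpected feature size")
    return list(spatial) + list(temporal)


def classify(fused, weights, bias=0.0):
    """Hypothetical logistic head: returns (label, confidence in that label)."""
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    p_fake = 1.0 / (1.0 + math.exp(-logit))
    label = "fake" if p_fake >= 0.5 else "real"
    confidence = p_fake if label == "fake" else 1.0 - p_fake
    return label, confidence


# Random placeholders standing in for real backbone outputs and trained weights.
rng = random.Random(0)
spatial = [rng.gauss(0, 1) for _ in range(XCEPTION_DIM)]
temporal = [rng.gauss(0, 1) for _ in range(VIVIT_DIM)]
weights = [rng.gauss(0, 0.01) for _ in range(FUSED_DIM)]

fused = fuse(spatial, temporal)
label, confidence = classify(fused, weights)
print(len(fused), label, round(confidence, 3))
```

In a real pipeline the two input vectors would come from the fine-tuned backbones, and the head would be trained on the fused features; the sketch only shows how the dimensions combine and how a verdict and confidence score could be derived from one fused vector.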
License
Copyright (c) 2026 International Research Journal on Advanced Engineering and Management (IRJAEM)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.