Abstract

In today’s digital landscape, video and image data have become pivotal and widely adopted means of communication. They serve not only as a ubiquitous mode of conveying information but also as indispensable evidential material across diverse domains, including law enforcement, forensic investigations, and the media. This study employs a systematic literature review (SLR) methodology to investigate the existing body of knowledge. A review and analysis of 90 primary research studies were conducted, revealing a range of research methodologies used to detect forged videos, including deep neural networks, convolutional neural networks, Deepfake analysis, watermarking networks, and clustering, among others. This breadth of techniques underscores the need to combat the evolving challenges posed by forged video content. The study shows that videos, owing to their dynamic nature, are susceptible to an array of manipulations, with key issues including frame insertion, deletion, and duplication. The principal challenges in the domain are copy-move forgery, object-based forgery, and frame-based forgery. This study serves as a comprehensive, structured, and summarized repository of the latest advancements and techniques to benefit researchers and practitioners in the field, and it elucidates the complex challenges inherent to video forensics.

1. Introduction

In the last few years, image production has increased exponentially. According to one estimate, around 1.4 trillion digital photos were expected to be taken in 2020 alone. Digital photographs are essential to our everyday lives: they not only preserve memories of family and friends but also appear on the covers of the main news publications, such as magazines, newspapers, and journals. Thanks to recent technological advancements, one may now effortlessly modify a digital picture or video using computer software or a mobile application. Identity theft is one instance in which someone’s identity can be taken by a fraudster who has access to their personal and financial data. To prevent such dire circumstances, law enforcement officials must use automatic tools or approaches to determine whether a person is innocent or the perpetrator [1].

When authentic digital data are in short supply, the main objective of synthetic data generation is to produce something that is extremely close to the real thing. Deepfake technology, which uses computer vision and graphics to swap one person’s face for another’s, is a major source of worry [2], as it seriously compromises the credibility of media sources. One must therefore check a video to determine whether its content is original; if a video’s authenticity is compromised, viewers may perceive it differently [3]. In recent years, there has been increased interest in object-based video forgery detection. However, until recently, the most common object-based forgery detectors still relied on handcrafted features, and their results were subpar [4]. Videos are vulnerable to manipulation attempts that change the intended meaning and trick the viewer. Previous methods of detecting video falsification located the altered areas using minute clues; attackers can, however, circumvent detection by erasing these clues through video compression or blurring [5].

With the advancement of multimedia editing capabilities in recent years, image and video alteration has become increasingly common [6]. Current frequency-domain face forgery detection techniques find that, in contrast to authentic photographs, images fabricated by generative adversarial networks (GANs) exhibit glaringly visible grid-like visual abnormalities in the frequency spectrum [7]. Video and image data are among the most significant and commonly used forms of communication today; in a variety of fields, including law enforcement, forensic research, and the media, they are used as proof and verified evidence. The problem of video and image counterfeiting has emerged along with the growth of video applications and data [8].

Nowadays, especially considering the ubiquitous sharing of videos on social media and websites, a lot of attention is dedicated to spotting video forgeries. Several video editing apps make it easy to alter video footage or even produce new videos [9]. The broad representational capacity of deep learning models, and its connection to different forensic aspects, remains incompletely explored; instead, existing approaches mostly concentrate on manually selected models and features for a limited task, such as copy-move or splicing [10]. The videos captured by surveillance cameras are frequently used in court as persuasive evidence and are widely employed to provide protection and security. Sometimes, after editing, various postprocessing steps are carried out to conceal the signs of counterfeiting. It has therefore become critical to scientifically assess the veracity and integrity of surveillance videos.

1.1. Importance and Contribution

Several survey papers in the literature cover different aspects of forged videos. Nayerifard et al. [11] conducted a literature review on traditional machine learning for image forensics, covering the time span from 2010 to 2021. Similarly, Stroebel et al. [12] conducted a systematic literature review (SLR) from 2021 to August 2022, highlighting the dominance of deep learning (DL) over machine learning (ML) in Deepfake detection, especially in nonmedical contexts. Another SLR, conducted by Chauhan et al. [13], highlights the use of various algorithms, including deep neural networks (DNNs), for detecting Deepfakes, particularly in the video game and cinema industries; its focus was to identify loopholes, datasets, and contemporary Deepfake techniques in the entertainment domain. Rana et al. [14] conducted an SLR on Deepfake detection covering the years from 2018 to 2020, categorizing methods into deep learning-based, classical machine learning-based, statistical, and blockchain-based techniques, and concluding that deep learning-based methods are the most effective in detecting Deepfakes. Shahzad et al. [15] emphasized the need for novel methods to detect face manipulation in Deepfakes and suggested combatting this threat through policies, regulations, and technological advancements. Tolosana et al. [16] presented a comprehensive survey focused solely on facial image manipulation techniques and discussed methods for detecting Deepfake-related manipulations. Yadav et al. [17] presented a survey on forgery and described Deepfake as an emerging AI-based technology for creating convincing fake videos, highlighting its potential for misuse, such as character defamation of politicians and celebrities.

This study emphasizes the importance of video and image data in today’s digital world across domains such as law enforcement and forensic investigations. This systematic literature review (SLR) covers primary research studies from 2016 to 2023, highlighting various research methodologies, such as deep neural networks, watermarking networks, and hybrid models, for detecting forged videos. Moreover, this study discusses key challenges such as frame manipulation and serves as a valuable resource for researchers and practitioners in the field of video forensics, addressing the complexities and challenges involved. The primary objective of this study is to collect information on fake video and image data and on techniques for identifying forged videos. The summary above shows that no other SLR covers the same publication period and scope. The main contributions of this paper are as follows:

This study
(i) provides an overview of the evolution of video and image forgery detection
(ii) covers not only traditional machine learning methods but also emerging methods such as deep learning, transfer learning, federated learning, watermarking, generative adversarial networks (GANs), and attention mechanisms for improving video forensics
(iii) provides a state-of-the-art summary of notable work in the forensic domain for the detection and verification of video authenticity
(iv) covers methodologies evaluated on different categories of private and public datasets, such as surveillance, security, action, social media, legal proceedings, wildlife, action detection, privacy or consent issues, and video forensics
(v) explores potential future trends and challenges of detection techniques for the research community in video and image manipulation

The rest of the paper is organized as follows: The next section describes the planning, design, and execution of the SLR. Further sections present the state-of-the-art summary of the prominent work in the field, followed by a detailed discussion of various techniques. The last section describes domain challenges, followed by a conclusion in the end.

2. Systematic Literature Review

An SLR comprises three steps: planning, implementation, and reporting. According to evidence-based software engineering (EBSE), the best research information is acquired and then used to evaluate research challenges. The identified studies were examined based on their titles, abstracts, and conclusions. The planning of the SLR is presented in Figure 1. The next subsections give specifics on how the SLR was carried out.

This systematic literature review (SLR) paper stands out in the field of forged video detection by offering a comprehensive and up-to-date perspective. Covering primary research studies from 2016 to 2023, it ensures readers are presented with the latest advancements and techniques in video forensics. Unlike studies that focus on specific methodologies and domains such as Deepfake, interframe, copy-move, and object-based detection, this review encompasses a wide array of research approaches, including deep neural networks, convolutional neural networks, and hybrid models, providing a holistic view of detection techniques. Emphasizing the critical challenges in video forensics, particularly frame manipulation, Deepfake, copy-move, and object-based manipulation issues, it offers practical insights and recommendations for researchers and practitioners. Furthermore, this review takes a global perspective, considering the impact of forged videos across various domains, making it a valuable and distinctive resource for addressing this pervasive issue.

2.1. Research Questions Formation

The research questions (RQs) addressed in this study are presented in Table 1.

2.2. Review Protocol Formation

This section outlines the SLR process and provides a quick overview of the SLR. The following subsections describe the search process, selection of papers, extraction of data, and analysis conducted.

2.2.1. Process of Search

First, we chose the keywords based on the language used in the forged-video detection sector. Then, a specialist looked up all the keywords’ synonyms, alternatives, and hypernyms. To limit the search results, we used the Boolean operators AND and OR, as well as the wildcard character “*”, in the search term. The synonyms were combined using the OR operator. For instance, the wildcard (*) denotes either a single alphanumeric character or a collection of alphanumeric characters in accordance with the IEEE Xplore search criteria. The population and intervention terms were combined using the AND operator.

To identify which studies should be regarded as primary, both primary and secondary searches, as well as snowball tracking, were employed. The search covered the years 2016 to 2023. The primary search was conducted by searching electronic journals, conference proceedings, and the grey literature. Research search engines and online databases (IEEE, BMJ, Springer, ISPRS, SAGE, ICCIT, IAES, ECS, BMC, ELSEVIER, etc.) were also used. Although general search engines such as Google and Google Scholar were also consulted, they were excluded from the total number of studies since they only included excerpts from other known research databases.

The secondary search examined the titles, abstracts, and conclusions of the papers identified by the primary search. Its results were then used to select the articles for analysis, applying the inclusion/exclusion criteria and quality criteria provided in the next section.

Snowball tracking, which entails reading through the reference lists of the final primary studies, was also carried out to make sure no pertinent studies were missed.

2.2.2. Exclusion Criteria of Study

To be considered, a primary study had to be written in English and available in full text; studies not meeting these conditions were excluded.

2.2.3. Inclusion Criteria of Study

A study’s relevance to the RQs was one of the selection criteria. Primary studies from business and/or research viewpoints were considered if they offered an empirical evaluation. The validity of an SLR significantly depends on the calibre of the chosen studies. Therefore, we only included peer-reviewed studies that met our standards for quality.

2.2.4. Criteria of Quality Assessment

An SLR must draw on high-quality research in order to produce reliable results and conclusions. This calls for sound SLR planning, appropriate keywords, and well-stated exclusion and inclusion criteria. A snowball tracking activity, which involved perusing the reference lists of each primary study, was then performed. The criteria used to further analyse the validity of the research are presented in Table 2.

2.2.5. Extraction of Data

Data for each RQ could be extracted in a structured, uniform, and consistent manner thanks to data extraction forms made in Microsoft Excel. The results were entered in the forms for further analysis and investigation. Definitions of certain RQ-related data are provided in Table 3.

2.2.6. Empirical Study

We identified the empirical research method applied in each primary study. The empirical techniques are categorized in Table 4.

2.2.7. Tool/Model

The measurement planning model aids software businesses in carrying out their measurement operations in a manner that helps them accomplish their objectives. We reviewed the key studies, together with their empirical validation, to identify what kinds of measurement planning models and related tools are available.

2.3. Method of Conducting SLR
2.3.1. Primary Study of Research

In Figure 2, the illustration outlines the systematic process of identifying and selecting primary studies. The initial step involved conducting a comprehensive primary search, which yielded a pool of 450 prominent papers to serve as our initial reference point. Subsequently, further exploration led to the discovery of several potential primary studies, as indicated. In addition, Table 5 complements this process by providing valuable information regarding the impact factor and the most recent updates of these studies.

2.3.2. Data Analysis and Extraction

Data were extracted using the data extraction forms (see the data extraction section). In our analysis, we used both qualitative and quantitative methods. Figure 3 displays the distribution of primary studies by publication year and the number of research publications per year. The average number of publications remained roughly the same every three to four years. These findings are rather surprising given that fake videos have been around for a long time; more empirical research would have been expected in past years. This might be due to the paucity of researchers in this subject and the lack of access to businesses that use forged-video identification; as a result, the community may be unable to exchange experiences and learn from one another. Figure 4 depicts the empirical techniques of the primary studies, categorized by conference and journal.

Table 5 shows the total number of articles, conferences, and search engines for dataset searches. Table 6 presents the state of the art across the previous 7 years of publications, whereas Table 7 presents the dataset details. Figure 5 shows the distribution of publishers across the reviewed research, and Figure 2 represents the total ratio and proportion of research articles used in this study. Figure 6 represents the primary study selection process and criteria.

Table 7 gives a thorough overview of the many datasets used in computer vision and video analysis, spanning multiple years and research references. These datasets cover a wide range of video formats, including MP4, MPEG-4, and high-resolution video coding standards, and vary greatly in size, from thousands to millions of samples. In addition, these datasets’ dimensions, which indicate the qualities of the video data, range from low resolutions to multidimensional feature vectors. The datasets span a wide spectrum of applications, from action recognition and Deepfake detection to video frame analysis and high-provenance image and video collections. Computer vision researchers and practitioners can use them to create and test new algorithms, ultimately improving video analysis and artificial intelligence in a variety of real-world applications.

3. Categories of Models

This section provides a discussion on models used in the domain of video forgery detection. The structure of models and prominent work based on these models is discussed.

3.1. Convolution Neural Network (CNN)

A convolutional neural network (CNN) is an advanced deep learning architecture specifically created to process and analyse visual input, such as pictures and videos. Owing to its capability to automatically learn hierarchical features from the input data, it has revolutionized computer vision tasks. Figure 7 shows the CNN structure.

3.1.1. Structure of CNN

The main components of a CNN are described as follows.

The foundational elements of the CNN are convolutional layers. Each layer is made up of a collection of filters (also known as kernels) that conduct convolutional operations by sliding over the input data, such as an image. The filters are responsible for detecting various elements of the input, such as edges, corners, and textures.

An activation function, frequently a ReLU (rectified linear unit), is applied elementwise after each convolution operation to introduce nonlinearity into the network. This helps the CNN model detect more intricate relationships and patterns in the data.

Pooling layers are used to control overfitting and reduce the spatial dimensions of the data. The most popular pooling method, max-pooling, keeps the maximum value of each small region of the feature map. This helps preserve key characteristics while reducing computational complexity.

After several convolutional and pooling layers, the data are sent through fully connected layers. These layers link each neuron in one layer to every neuron in the next and are responsible for making predictions using the high-level characteristics that the preceding layers have learnt.

Each convolutional block in a CNN architecture is made up of convolutional layers followed by activation and pooling layers. For classification tasks, a softmax layer is frequently placed after the final fully connected layers and provides probability scores for the various classes.
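The components described above can be illustrated with a small NumPy sketch (a toy example, not a production implementation; note that, as in most deep learning libraries, the sliding-window operation implemented here is technically cross-correlation):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D sliding-window operation: the kernel moves over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Elementwise nonlinearity applied after each convolution."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Max-pooling: keep the maximum of each size x size region."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

# A vertical-edge detector applied to a tiny 6x6 "image".
image = np.zeros((6, 6))
image[:, 3:] = 1.0                   # right half bright, left half dark
edge_kernel = np.array([[-1., 1.]])  # responds to a left-to-right brightening
feature_map = relu(conv2d(image, edge_kernel))   # shape (6, 5)
pooled = max_pool(feature_map)                   # shape (3, 2)
```

The feature map lights up only at the vertical edge, and pooling halves each spatial dimension while keeping that response, which is exactly the role the text assigns to convolution and pooling layers.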

3.1.2. Discussion of CNN Based Approaches

Ganguly et al. [2] proposed a deep learning model enhanced with a visual attention approach to distinguish false videos and images from authentic ones. To create the feature maps, they first extracted the facial region from the video frames and then passed the extracted region through a pretrained Xception model. Next, with the aid of the visual attention mechanism, they concentrated mostly on the artifacts remaining from Deepfake video manipulation. Their model was tested on two publicly accessible datasets, FaceForensics++ and Celeb-DF (V2). Kumar et al. [3] described deep characteristics as an important aspect in identifying fake and abnormal fluctuations in the footage. They used a parallel CNN model to extract deep features to uncover the disassociation between consecutive frames and detect video counterfeiting; their model also measured how far the correlation coefficient deviates across the deep features. Tan et al. [4] combined a two-dimensional/three-dimensional recurrent neural network and a convolutional neural network for the first time in a unique hybrid deep learning network, which they used for object-based video forgery detection under sophisticated encoding formats. Zhou et al. [5] offered a video watermarking network to detect manipulation; they trained a 3D-UNet-based watermark embedding network and a decoder that predicts the tampering mask.
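The intuition behind correlating deep features of consecutive frames can be shown with a minimal sketch (hypothetical feature vectors and threshold, not the actual model of any cited work): a spliced-in frame correlates weakly with its neighbour.

```python
import numpy as np

def frame_correlations(features):
    """Pearson correlation between feature vectors of consecutive frames."""
    return np.array([np.corrcoef(a, b)[0, 1]
                     for a, b in zip(features[:-1], features[1:])])

def flag_discontinuities(features, threshold=0.5):
    """Indices i where frames i and i+1 correlate unusually weakly."""
    return np.where(frame_correlations(features) < threshold)[0]

rng = np.random.default_rng(0)
base = rng.normal(size=128)
# Frames 0-4 drift smoothly; frame 5 is unrelated (simulating an inserted frame).
features = [base + 0.05 * rng.normal(size=128) for _ in range(5)]
features.append(rng.normal(size=128))
suspects = flag_discontinuities(features)   # points at the 4 -> 5 transition
```

Real detectors learn the features and the decision boundary rather than fixing a threshold by hand, but the discontinuity signal is the same.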

Fadl et al. [32] presented a system for detecting interframe forgeries (frame deletion, insertion, and duplication) using a 2D convolutional neural network (2D-CNN) with spatiotemporal fusion for deep automated feature extraction. For classification, a Gaussian RBF multiclass support vector machine (RBF-MSVM) is employed. Zheng et al. [33] offer a new end-to-end structure with two key phases. The first level is a fully temporal convolutional network (FTCN). Surprisingly, they discover that this architecture helps the model extract temporal cues and increases its capacity to generalize. The second phase, a temporal transformer network, seeks to investigate long-term temporal coherence. The suggested system is general and flexible, allowing direct training from scratch without the need for pretrained models or outside datasets.

Hau Nguyen et al. [41] retrained existing CNN models, originally trained on the ImageNet dataset, to identify video interframe forgeries. The suggested techniques are based on retrained CNN models that exploit the spatial-temporal correlations in a video to effectively identify interframe forgeries. To account for network mistakes, a confidence score is used rather than the raw output score from the networks. Test results have demonstrated that the suggested strategy is both more efficient and more accurate than recent methods. Long et al. [75] offer a novel forensic method that relies on the local spatiotemporal correlations within a video segment to identify frame deletions; they propose modifying the Convolutional 3D Neural Network (C3D) for the detection of frame drops. Rao and Ni [86] automated the learning of hierarchical representations from input RGB colour images with a convolutional neural network (CNN), a novel deep learning-based method for detecting image forgeries. The suggested CNN is designed particularly for applications such as copy-move detection and image splicing.

3.2. Deep Artificial Neural Network (ANN)
3.2.1. Structure of Deep ANN

An artificial neural network (ANN) with numerous layers between the input and output layers is known as a deep neural network (DNN) (see Figure 8). Because it has several hidden layers, it can learn and model complicated patterns and representations in the data, which is why the approach is dubbed “deep” learning. DNNs are a crucial part of deep learning, a branch of machine learning that has attracted substantial interest and achieved success in several fields, including speech recognition, natural language processing, and computer vision. Deep neural networks have proven incredibly effective at a variety of difficult tasks, including image and audio recognition, interpreting spoken language, playing games, and many other activities involving large amounts of high-dimensional data.
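A minimal forward pass through such a network illustrates the stacking of hidden layers between input and output (random weights and made-up layer sizes; a sketch, not a trained detector):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def forward(x, layers):
    """Pass the input through each (weights, bias) pair; ReLU on hidden layers."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:        # no activation on the output layer here
            x = relu(x)
    return x

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
# Three hidden layers between a 16-dim input and a 2-class output.
sizes = [16, 32, 32, 32, 2]
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
probs = softmax(forward(rng.normal(size=16), layers))   # e.g. P(real), P(fake)
```

Training would adjust the weights by backpropagation; here the point is only the layered structure that makes the network "deep".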

3.2.2. Discussion of Deep ANN Based Methods

Kaur et al. [37] centred on a highly effective strategy using a deep convolutional neural network (DCNN) to reveal interframe manipulation in videos. The suggested approach identifies forgeries without the need for extra pre-embedded frame data. Their algorithm classifies fabricated frames based on the correlation between frames and the irregularities detected using the DCNN, which distinguishes it from preexisting learning techniques; decoders used for batch input normalization speed up training. Zhong et al. [38] presented a method to efficiently extract multidimensional dense moment features from a video. Second, a new feature representation concatenates each feature submap index, representing each dimension of the feature, into a 9-digit dense moment feature index. Third, an interframe best-match approach is suggested to locate the best matches among each pixel’s 9-digit dense moment feature index; all the best matches together form the best-match map. D’Avino et al. [70] propose deep learning detection with an architecture based on autoencoders and recurrent neural networks. The autoencoder learns an intrinsic representation of the source during a training phase on a few clean frames. The forged material is then identified as anomalous, since it does not conform to the learnt model and is encoded with a substantial reconstruction error. To leverage temporal relationships, recurrent networks with the long short-term memory model are utilized. Preliminary findings on forged videos demonstrate the approach’s potential.
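The reconstruction-error idea behind the autoencoder scheme of D’Avino et al. can be illustrated with PCA as a linear stand-in for the autoencoder (a toy sketch on synthetic feature vectors, not their architecture): a model fitted on clean frames reconstructs conforming frames well and off-model (forged) frames poorly.

```python
import numpy as np

def fit_linear_autoencoder(clean, k=2):
    """PCA as a linear stand-in for an autoencoder: keep the top-k components."""
    mean = clean.mean(axis=0)
    _, _, vt = np.linalg.svd(clean - mean, full_matrices=False)
    return mean, vt[:k]                 # components spanning the "clean" subspace

def reconstruction_error(frames, mean, components):
    """Distance between each frame and its reconstruction from the model."""
    centered = frames - mean
    recon = centered @ components.T @ components
    return np.linalg.norm(centered - recon, axis=1)

rng = np.random.default_rng(2)
# Clean frames live in a 2D subspace of a 64-dim feature space (plus tiny noise).
basis = rng.normal(size=(2, 64))
clean = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 64))
mean, comps = fit_linear_autoencoder(clean, k=2)

test = np.vstack([rng.normal(size=(1, 2)) @ basis,   # conforms to the model
                  rng.normal(size=(1, 64)) * 3.0])   # "forged": off-subspace
errors = reconstruction_error(test, mean, comps)     # errors[1] >> errors[0]
```

A real autoencoder replaces the linear projection with learnt nonlinear encode/decode networks, but the anomaly criterion (large reconstruction error) is the same.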

Yadav and Salmani [17] present a survey article showing that the Deepfake approach combined with a generative adversarial network can produce results that appear realistic to human eyes. A false picture is created by the generative adversarial network by fusing together two separate photographs, but the images of Persons A and B must be comparable in terms of facial features and skin tone and must have been shot under the same lighting conditions. Deepfake can be used in two beneficial ways: it can be employed in the education sector to change the faces of historical figures and use them as study materials, or it can be used in the arts to change the faces of actors in movies, saving considerable money by reducing the need for CGI and VFX. Hosler et al. [44] enumerate the features, composition, and collection process of video-ACID, which contains videos with clear labels for testing camera model recognition software; using cutting-edge deep learning algorithms, they present baseline camera model identification results on these evaluation videos. Shou et al. [61] explicitly address the difficulties in training ODAS models by suggesting three unique techniques to deal with the lack of training data, and they carry out substantial experiments using ActivityNet and THUMOS’14.

3.3. Hybrid Neural Network Model
3.3.1. Structure of Hybrid Models

A hybrid neural network model combines parts of several neural network architectures or machine learning models (see Figure 9). The objective is to combine the advantages of each model to produce a more robust and adaptable system that can successfully complete a range of tasks. For instance, to process both image and sequence input concurrently, a hybrid neural network may incorporate elements of a convolutional neural network (CNN) and a recurrent neural network (RNN). This can be helpful in tasks such as video analysis or action detection in videos, which call for the processing of both visual and temporal information.
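Such a CNN-plus-RNN pipeline can be sketched end to end on toy data (random weights; one convolution stands in for the "CNN part" and a vanilla tanh cell for the "RNN part"):

```python
import numpy as np

def frame_features(frame, kernel):
    """'CNN part': one convolution + ReLU + global average pooling per frame."""
    kh, kw = kernel.shape
    oh, ow = frame.shape[0] - kh + 1, frame.shape[1] - kw + 1
    fmap = np.array([[np.sum(frame[i:i+kh, j:j+kw] * kernel)
                      for j in range(ow)] for i in range(oh)])
    return np.array([np.maximum(fmap, 0).mean()])    # a 1-dim feature

def rnn_over_video(video, kernel, Wx, Wh, b):
    """'RNN part': fold the per-frame features through a tanh RNN cell."""
    h = np.zeros(Wh.shape[0])
    for frame in video:
        x = frame_features(frame, kernel)
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h                                         # summary of the whole clip

rng = np.random.default_rng(3)
video = rng.normal(size=(8, 6, 6))                   # 8 frames of 6x6 "pixels"
kernel = np.array([[1., -1.]])
Wx, Wh, b = rng.normal(size=(4, 1)), rng.normal(size=(4, 4)), np.zeros(4)
h_final = rnn_over_video(video, kernel, Wx, Wh, b)
```

The CNN stage handles the visual content of each frame and the RNN stage carries state across frames, which is the division of labour the text describes.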

4. Categories of Forgeries

The prominent categories of forgeries in the domain are identified and discussed in this section.

4.1. Deepfake Based Forgery Detection

Wang et al. [7] worked to obtain a more thorough representation of the spatial and temporal features and suggested a discrete cosine transform-based forgery clue augmentation network (FCAN-DCT). FCAN-DCT comprises a backbone network and two branches: a Compact Feature Extraction (CFE) module and a Frequency Temporal Attention (FTA) module. They thoroughly evaluated two visible-light datasets, Wild Deepfake and Celeb-DF (v2), and also created Deepfake NIR, the first video forgery dataset based on the near-infrared modality. Afchar et al. [55] propose an approach to quickly and effectively spot face tampering in videos, focusing on Deepfake and Face2Face, two current methods used to produce hyperrealistic fake videos. They evaluate these fast networks on an existing dataset and on one they created from web videos. Do et al. [74] proposed a convolutional neural network to perform face forensics. They employ GANs to generate synthetic faces in a variety of resolutions and sizes to aid data augmentation. Furthermore, for strong face feature extraction, they transfer weights from a deep face recognition system to their network, which is then fine-tuned for real/fake picture categorization. They reported decent results on the AI Challenge validation data.
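The frequency-domain clue mentioned in the introduction (grid-like spectral anomalies in GAN imagery) can be demonstrated with a toy check on synthetic data (illustrative only; this is not the FCAN-DCT method, and the "artifact" here is a hand-planted periodic pattern):

```python
import numpy as np

def spectral_peak_ratio(image):
    """Ratio of the strongest non-DC spectral peak to total spectral energy.

    Grid-like artifacts concentrate energy in a few isolated peaks of the
    magnitude spectrum; natural textures spread energy more evenly.
    """
    spec = np.abs(np.fft.fft2(image))
    spec[0, 0] = 0.0                     # ignore the DC (mean brightness) term
    return spec.max() / spec.sum()

rng = np.random.default_rng(6)
natural = rng.normal(size=(32, 32))                        # artifact-free texture
grid_artifact = 2.0 * np.cos(2 * np.pi * np.arange(32) / 4)
forged = natural + grid_artifact[None, :]                  # period-4 grid overlay
```

Here `spectral_peak_ratio(forged)` is several times larger than `spectral_peak_ratio(natural)`, which is the kind of separation frequency-domain detectors learn to exploit.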

4.2. Frame-Based Forgery Detection

Munawar and Noreen [8] presented a deep learning method to resolve the issue of frame duplication at different frame rates. They proposed a novel deep learning framework made up of Inflated 3D (I3D) networks and a Siamese-based recurrent neural network (RNN). Their method first converted the videos into frames and extracted their characteristics. To find frame-to-frame duplication, an original and a fake video were fed into the I3D network; afterwards, several frames were combined to form a sequence. Sasikumar et al. [39] explore how the suggested approach employs two deep learning-based algorithms to detect video forgeries. The first is the Scale-Invariant Feature Transform (SIFT) feature extraction approach; the second is the Mean Shift Clustering (MSCL) algorithm, which groups comparable object frames from the retrieved video. The suggested model reports how many and which frames in a specific video were faked; it employs image processing methods and is a window-based application. Fayyaz et al. [46] explain how an attacker can compensate for a forged frame by adding the sensor pattern noise (SPN) to it. They then suggest a forgery detection method for such a situation that relies on the correlation between the noise residue and the SPN, as well as the noise residue from prior frames. Even if the attacker adds SPN to the forged frame to compensate, the interframe continuity of the noise will be interrupted and can therefore be detected.

Joshi and Jain [48] describe a passive tampering detection technique that may be applied to videos recorded with variable-size GOP structures. First, all the video frames of a given video sequence are retrieved. Then, for each pair of adjacent frames, the temporal difference between the real video and the video reconstructed via frame prediction error is determined. Finally, tampering is located using the estimated discrepancies. Nguyen et al. [63] employ a capsule network to identify several types of spoofs, such as replay attacks that leverage printed pictures or recorded videos, as well as computer-generated videos; this expands the use of capsule networks beyond their initial purpose to address inverse graphics issues. Singh and Singh [54] offer a passive-blind method using two separate algorithms to find frame and region duplication forgeries in videos. They examined video frame duplication forgery in three forms: duplication of a series of consecutive video frames over a long continuous run, duplication of numerous such sequences of various lengths at various locations, and duplication from other videos with different and identical dimensions, all of which can pose serious issues in a real-world setting. They also analysed fabricated regular and irregular regions at various positions, both within the same frame and from another frame, across one or more sequences of subsequent frames of the same video at those same locations.

Ulutas et al. [65] propose a novel frame duplication detection approach based on the “Bag-of-Words” (BoW) model, which researchers had previously employed for text analysis and image/video retrieval. Frame features, i.e., visual-word representations at keypoints, are used to find the sequence of repeated sections in the video. To increase efficiency and robustness, the approach computes content-based thresholds. Thirty-one test videos from various sources and the “Surrey University Library for Forensic Analysis” (SULFA) are used to evaluate the technique. Zhao et al. [66] present a passive-blind forensics method for video shots that identifies interframe forgeries through similarity analysis. The technique has two components: comparison of “Hue, Saturation, Value” (HSV) colour histograms, and double-checking via “Speeded Up Robust Features” (SURF) feature extraction with “Fast Library for Approximate Nearest Neighbors” (FLANN) matching, which further confirms the forgery type at the tampered locations.
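The first, histogram-based stage of Zhao et al. [66] might be sketched as follows. For brevity a single intensity channel replaces the full HSV histograms, and the function names are illustrative, not the paper’s.

```python
import numpy as np

def hist_signature(frame, bins=16):
    """Normalized intensity histogram as a per-frame signature (the paper
    uses HSV colour histograms; one channel is used here for brevity)."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def histogram_similarities(frames):
    """Histogram intersection between adjacent frames; a dip marks a
    candidate interframe forgery for SURF/FLANN double-checking."""
    sigs = [hist_signature(f) for f in frames]
    return [float(np.minimum(sigs[i], sigs[i + 1]).sum())
            for i in range(len(sigs) - 1)]
```

Adjacent frames from one shot intersect near 1.0; a spliced-in frame from different content produces a sharp dip.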

Voronin et al. [67] address the problem of detecting deleted frames in videos. Frame dropping is an editing operation in which a series of frames is removed to skip ahead to certain parts of the source video; automated identification of the dropped frames is a difficult problem in digital video forensics. They outline a spatial-temporal method based on a convolutional neural network and statistical analysis: confidence scores are computed for each frame using a collection of statistical procedures, and the output scores are produced by the convolutional neural network. Bozkurt et al. [73] offer a novel video forgery detection technique with improved execution time and detection capacity for faked frames. The frames’ features are extracted, and their correlations are visualized as a correlation image; the approach studies a line on this image to identify the forging operation, then applies two new techniques (shrinking/expanding) to the detected line to discover the precise position of the counterfeit. Su et al. [77] provide a rapid detection technique for identifying region duplication in videos based on Exponential-Fourier moments (EFMs). The system first extracts EFM features from each block in the current frame, performs a rapid match to identify probable matching pairs, and then locates the duplicated areas in the current frame using a postverification strategy (PVS). Finally, the tampered areas are tracked in succeeding frames using an adaptive parameter-based fast compression tracking technique (AFCT).
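At its core, the correlation image of Bozkurt et al. [73] is a pairwise frame-correlation matrix in which duplicated sequences appear as a bright off-diagonal line. A minimal sketch, omitting the line analysis and the shrinking/expanding refinement, might look like:

```python
import numpy as np

def correlation_image(frames):
    """Pairwise correlation between flattened frames; duplicated
    sequences show up as a bright off-diagonal line."""
    flat = np.array([f.ravel().astype(float) for f in frames])
    return np.corrcoef(flat)

def find_duplicates(corr, threshold=0.999):
    """Report off-diagonal entries above the threshold as
    duplicated frame pairs."""
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j] > threshold]
```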

Barhoom et al. [82] suggested a novel reverse method to identify frame duplication, aimed at preventing thefts enabled by blinding IP cameras at particular locations. Abbasi Aghamaleki and Behrad [83] propose a novel passive technique for tampering detection and localization in MPEG-x-coded videos. The approach can detect double compression and frame insertion or deletion across various GOP formats and lengths, and is developed from a theoretical investigation of the quantization error traces on the residual errors of P frames. Rigoni et al. [89] outline a methodology for finding altered content in digital audio-visual material. The architecture combines temporal and spatial watermarks without lowering the quality of the host videos; watermarks are embedded using a modified version of the “Quantization Index Modulation” (QIM) technique. The fragility of the QIM watermark enables pixel-level detection of local, global, and temporal tampering attacks, and the method can also determine the type of attack. The framework is fast, reliable, and precise.
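A minimal scalar version of QIM embedding and extraction illustrates the watermarking primitive behind Rigoni et al. [89]; this is a hedged sketch of textbook QIM, not the paper’s modified video scheme. Each bit selects one of two quantizer lattices offset by half the step size.

```python
import numpy as np

def qim_embed(values, bits, delta=8.0):
    """Quantization Index Modulation: move each value onto the quantizer
    lattice selected by its bit (lattice offset delta/2 for bit 1)."""
    values = np.asarray(values, dtype=float)
    offset = np.asarray(bits) * (delta / 2.0)
    return np.round((values - offset) / delta) * delta + offset

def qim_extract(values, delta=8.0):
    """Recover each bit by choosing the nearer of the two lattices."""
    values = np.asarray(values, dtype=float)
    d0 = np.abs(values - np.round(values / delta) * delta)
    off = values - delta / 2.0
    d1 = np.abs(off - np.round(off / delta) * delta)
    return (d1 < d0).astype(int)
```

Embedding distorts each value by at most delta/2, and any re-quantization or tampering that moves a value toward the wrong lattice flips the extracted bit, which is what makes the watermark fragile and thus useful for tamper localization.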

4.3. Copy-Move Forgery Detection

Parveen et al. [43] presented a pixel-based copy-move forgery detection technique to validate the authenticity of digital images. The steps are as follows: (1) convert the colour picture to grayscale; (2) split the grayscale image into overlapping blocks of size 8 × 8; (3) extract DCT-based features from several feature sets; (4) cluster the blocks using the K-means method; and (5) match features using radix sort. Experimental findings show that the approach can successfully identify the faked part of digital photos. Pavlović et al. [51] present a novel approach to identifying such changes that employs both common statistical measures and specific multifractal parameters as defining characteristics. Images are split into nonoverlapping, fixed-size blocks prior to processing, and the characteristics are computed for each block. The authors use a metaheuristic strategy to categorize the observed blocks and put forward a new semimetric function for comparing the similarity between blocks. Simulations demonstrate that the strategy delivers high accuracy and recall at minimal computing cost.
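Steps (1)-(5) above can be sketched roughly as follows. This illustration hedges in two places: a plain lexicographic sort substitutes for the paper’s K-means clustering and radix-sort matching, and the first raveled DCT coefficients substitute for a proper zig-zag scan.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def block_features(gray, bsize=8, ncoef=9):
    """Low-frequency DCT feature vector for every overlapping block."""
    d = dct_matrix(bsize)
    feats, positions = [], []
    h, w = gray.shape
    for y in range(h - bsize + 1):
        for x in range(w - bsize + 1):
            block = gray[y:y + bsize, x:x + bsize].astype(float)
            coefs = d @ block @ d.T
            feats.append(coefs.ravel()[:ncoef])
            positions.append((y, x))
    return np.array(feats), positions

def find_copy_move(gray, bsize=8, min_dist=8):
    """Sort block features lexicographically (in place of the paper's
    K-means + radix sort) and report identical, well-separated blocks."""
    feats, pos = block_features(gray, bsize)
    order = np.lexsort(feats.T[::-1])  # first coefficient is primary key
    matches = []
    for a, b in zip(order[:-1], order[1:]):
        if np.allclose(feats[a], feats[b], atol=1e-6):
            (y1, x1), (y2, x2) = pos[a], pos[b]
            if abs(y1 - y2) + abs(x1 - x2) >= min_dist:
                matches.append(tuple(sorted((pos[a], pos[b]))))
    return matches
```

Sorting puts identical feature vectors next to each other, so matching reduces to comparing adjacent rows; the minimum-distance check discards trivially overlapping blocks.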

Liu et al. [52] divide the image into complicated and smooth regions using superpixels and K-means clustering. “Scale-Invariant Feature Transform” (SIFT) features are employed in the complicated regions to find tampering, while a sector mask feature and an RGB colour feature are proposed to identify tampering in the smooth zones; for copy-move detection, incorrect matches are filtered out in both categories of region. Jia et al. [53] consider three practical requirements and provide an innovative method to identify frame copy-move forgeries. A coarse-to-fine detection approach is developed based on “optical flow” (OF) and stable parameters: coarse detection examines the consistency of the OF sum to locate potentially manipulated positions, fine detection then determines the exact position of the forgery, and duplicated frame pairs are matched using OF correlation to further limit false detections.
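The coarse stage of the consistency check in Jia et al. [53] can be approximated in a few lines. As a hedge, mean absolute frame difference stands in for the true optical-flow sum, and the window-matching logic is a simplification of the paper’s coarse-to-fine pipeline.

```python
import numpy as np

def motion_signature(frames):
    """Per-pair motion magnitude: mean absolute frame difference, used
    here as a crude stand-in for the optical-flow sum."""
    return np.array([np.mean(np.abs(frames[i + 1].astype(float) -
                                    frames[i].astype(float)))
                     for i in range(len(frames) - 1)])

def coarse_duplicate_windows(sig, win=3, tol=1e-9):
    """Coarse stage: windows of the motion signature that repeat almost
    exactly elsewhere mark candidate duplicated frame sequences."""
    hits = []
    for i in range(len(sig) - win + 1):
        for j in range(i + win, len(sig) - win + 1):
            if np.max(np.abs(sig[i:i + win] - sig[j:j + win])) < tol:
                hits.append((i, j))
    return hits
```

Copy-moved frames replay the same motion pattern, so their signature window recurs; the fine stage would then verify the candidate pairs frame by frame.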

D’Amiano et al. [69] present a novel method for detecting and localising video copy-move frauds. Well-crafted video copy-moves can be challenging to find, especially when a uniform background is duplicated to occlude foreground items; the authors employ a dense-field technique with invariant features that provides resilience to multiple postprocessing operations and identifies both additive and occlusive copy-moves. Mizher et al. [78] survey several forms of video falsification, research and characterize video forgery detection approaches, discuss challenges facing current detection systems, and summarize the proposed ideas. Zhu et al. [79] track SIFT features over time to create temporal concentration SIFT (TCSIFT), which substantially compresses the number of local features to eliminate visual redundancy while retaining as many of SIFT’s benefits as possible; the SIFT features are stored stably together with temporal information. Experiments on two distinct datasets, CC WEB VIDEO and TRECVID, show that the technique achieves comparable accuracy with smaller storage, faster execution, and the ability to adapt to varied video transformations.

4.4. Object-Based Forgery Detection

Kim et al. [6] proposed a brand-new object-based forged-frame identification network. Their technique leveraged symmetrically overlapping motion residuals to improve the ability to distinguish between video frames; because the motion residual characteristics were produced from overlapped temporal frames, the deep neural network could exploit temporal fluctuations in the video stream. Alsakar et al. [9] introduce and discuss a newly created passive video forgery system. An arbitrary number of core tensors is chosen to find and detect the two significant forms of forgery, insertion and deletion; these tensor data are orthogonally transformed to achieve greater data reduction and to provide useful characteristics for tracing forgeries throughout the whole video. Experimental findings and comparisons demonstrate the superiority of the method, which can identify and locate both forms of attack with precision of up to 99%.
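A simplified take on the motion-residual idea (not Kim et al.’s exact overlapped construction) is to subtract a temporal median over a symmetric window of neighbouring frames, so that moving or inserted objects stand out against the static background:

```python
import numpy as np

def motion_residual(frames, idx, radius=1):
    """Motion residual for frame idx: absolute difference from the
    temporal median of a symmetric window of neighbouring frames."""
    lo = max(0, idx - radius)
    hi = min(len(frames), idx + radius + 1)
    window = np.stack([frames[i].astype(float) for i in range(lo, hi)])
    return np.abs(frames[idx].astype(float) - np.median(window, axis=0))
```

Pixels that agree with their temporal neighbourhood cancel out, leaving energy only where content changed; a deep network fed such residuals sees the temporal fluctuations directly rather than raw appearance.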

Jin et al. [10] presented a dual-stream framework for object-based video forgery detection. First, discriminative characteristics are extracted by two distinct branches. A “Conditional Random Field” (CRF) layer is then applied to the segmentation results after dual-stream feature fusion. Finally, a video tracking approach is incorporated to enforce temporal consistency, and depth information is used to improve the localization results. Aloraini et al. [40] address object removal forgery and provide a unique method based on sequential and patch analysis to identify forged regions in videos. By modelling video sequences as stochastic processes, sequential analysis can identify fake videos by monitoring changes in their properties, while anomalous patches that visualize the movement of deleted items pinpoint the fabricated areas.

Chen et al. [56] enhance object detection’s resilience across domains. They approach the domain shift from two angles: (1) the instance-level shift, such as object appearance and size, and (2) the image-level shift, such as picture style and lighting. To lessen the domain gap, they propose image-level and instance-level domain adaptation components on top of the state-of-the-art Faster R-CNN model; grounded in H-divergence theory, both components are realized through adversarial training of a domain classifier. Aneja et al. [59] create a convolutional captioning method for images, demonstrate performance comparable to an LSTM baseline on the challenging MSCOCO dataset with a shorter training time per parameter, and provide convincing arguments for convolutional language generation through a thorough analysis. Mathai et al. [85] suggested using statistical moment characteristics and a normalised cross-correlation factor to detect and locate video forgeries. The characteristics are computed from the prediction-error array for each frame block (a set of a certain number of continuous frames in the video); the normalised cross-correlation of those attributes is higher between duplicated frame blocks than between nonduplicated ones. Duplication is verified using a threshold based on the mean squared error, and the method also discovers the position of the duplicated block.
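The two-stage NCC-plus-MSE check of Mathai et al. [85] can be outlined as follows, assuming feature vectors per frame block have already been computed; the thresholds and function names here are illustrative, not the paper’s.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two feature vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 1.0

def detect_block_duplication(features, ncc_min=0.999, mse_max=1e-6):
    """Flag frame-block pairs with near-perfect NCC, then verify each
    candidate with a mean-squared-error threshold (two-stage check)."""
    hits = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if ncc(features[i], features[j]) > ncc_min:
                mse = float(np.mean((features[i] - features[j]) ** 2))
                if mse < mse_max:
                    hits.append((i, j))
    return hits
```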

5. Discussion of Challenges

Several challenges emerge from the discussion in the previous sections. One is the reliance on static images rather than video sequences for Deepfake detection [2]. Further key challenges are the lack of evaluation on real-world video forgery scenarios [5], the shortage of large-scale datasets for comprehensive performance assessment [3], and the absence of extensive evaluation on diverse video datasets, which limits the generalizability of the proposed frameworks [4].

Reliance on motion residual-based analysis is not effective against tampering techniques that do not significantly alter motion patterns [6]. Existing frequency-domain face forgery detection methods focus only on single frames and overlook the discriminative parts and temporal frequency clues among different frames in synthesized videos [7].

Real-time identification of duplicated frames in large videos with varying frame rates is not feasible due to computational limitations, lack of generalization, and low accuracy [8]. Schemes based on tensor representation and orthogonal tracing feature algorithms may be limited to the tested attack types when detecting and locating insertion and deletion forgery in videos [9]. Performance in complex or crowded environments, robustness to variations in Wi-Fi signal strength, and scalability to larger surveillance networks are also areas of concern [35].

A system’s ability to detect multiple interframe forgeries within a single video is an issue that has not been explored explicitly [32]. The scalability of a proposed framework and the detection of complex forgery types may pose challenges as more models and coding parameters are added, requiring further research [36]. The specific types or variations of video face forgery that temporal cues may not effectively address also require investigation [33]. Some approaches are limited in efficiency and computational requirements, especially on machines with low memory [37]. Performance and accuracy degrade further against more complex and advanced video forgeries beyond those mentioned, such as content-aware manipulations or Deepfake videos [38]. A system’s security and performance under large geometric attacks, such as editing or embedding logos, likewise deserves deeper investigation [39]. Much work has been done on detecting forged videos with object removal and moving backgrounds; however, further investigation is needed to improve detection performance at the pixel level [40].

6. Discussion for Key Questions

6.1. In the Literature, Which Technologies, Models, Methods, and Practices Are Discussed?

To address these challenges, a novel deep learning framework combining an “Inflated 3D” (I3D) network with a Siamese-based “Recurrent Neural Network” (RNN) is suggested. The first stage of this framework is the conversion of videos into frames and the extraction of features; an original and a fake video are then fed to the I3D network to find frame-to-frame duplication. Other effective practices discussed in the literature include a large convolutional neural network that reveals interframe manipulation in videos; a 3D-UNet-based watermark embedding network with a decoder that forecasts the tampering mask; clustering methods that expedite block matching during image forgery detection; and a “Forgery Clue Augmentation Network” (FCAN-DCT) based on the discrete cosine transform, used to create a more thorough spatial-temporal feature representation. FCAN-DCT comprises a backbone network and two branches: a “Compact Feature Extraction” (CFE) module and a “Frequency Temporal Attention” (FTA) module.

6.2. When Implementing Forged Video Detection, What Techniques/Methods/Models Are Developed for Metric Selection?

Frame duplication is recognized by a Siamese (twin) RNN integrated with I3D operating on a collection of forged sequence frames. Insertion and deletion forgeries are found and located using SVD tube-fiber tensor construction. Segmentation results are improved through dual-stream feature fusion and a “Conditional Random Field” (CRF) layer. Deep automated feature extraction and fusion of spatiotemporal information is performed using a “2D convolutional neural network” (2D-CNN), with classification by a Gaussian RBF multiclass support vector machine (RBF-MSVM). A “fully temporal convolution network” (FTCN) is also employed; its major insight is to keep the temporal convolution kernel size constant while reducing the spatial convolution kernel size. Video interframe manipulation is detected using a “deep convolutional neural network” (DCNN).

6.3. What Mitigation Techniques Are Advised to be Used to Recognize Forged Videos?

The CNN core model is based on the Xception model. Deep learning architectures are appropriate for object-based forgery detection in the more sophisticated H.265 encoding format and in more realistic real-world scenarios. Recommended directions include examining how well video forgery detection holds up under increasingly hostile, real-world disturbances; improving the process of discovering and recognizing additional sorts of attacks; and developing technology that can identify several interframe forgeries within a single video.

7. Conclusions

This article presents a comprehensive overview of significant research endeavours aimed at identifying counterfeit videos, a pressing issue exacerbated by the widespread accessibility of image and video editing tools in today’s technology-driven era. As digital manipulation capabilities have become increasingly accessible, the potential for misuse and harm has grown in tandem with the creative possibilities. The malicious use of facial alteration raises concerns regarding real-world injustices inflicted upon unsuspecting individuals. The emergence of potent computer software and mobile applications has ushered in a new era of visual manipulation, allowing virtually anyone to manipulate images and videos effortlessly. The systematic literature review (SLR) conducted for this study has encompassed pivotal research articles, revealing various methods for identifying forged videos. These methods leverage cutting-edge technologies, including deep neural networks, convolutional neural networks, Deepfake analysis, watermarking networks, and clustering, among others. The synergy of success factors and mitigation strategies forms a cohesive framework of solutions to address the multifaceted challenges associated with counterfeit video content. The study concludes that the major challenges include frame duplication, deletion, and insertion; frame rate fluctuation and loop detection are also important problems from the standpoint of duplicated frames. Due to computational restrictions, a lack of generalization, and poor accuracy, real-time identification of duplicated frames in large videos with varying frame rates remains infeasible.

Data Availability

The data used to support the findings of this study are included within this article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.