Journal of information and communication convergence engineering 2023; 21(2): 152-158
Published online June 30, 2023
https://doi.org/10.56977/jicce.2023.21.2.152
Zheng-Dong Hou 1, Ki-Hong Kim 2*, Gao-He Zhang 3, and Peng-Hui Li 4
1 Department of Visual Contents, Dongseo University, Busan 47011, Republic of Korea
2 Department of Visual Animation, Dongseo University, Busan 47011, Republic of Korea
3 Department of Visual Contents, Dongseo University, Busan 47011, Republic of Korea
4 Department of Visual Contents, Dongseo University, Busan 47011, Republic of Korea

© Korea Institute of Information and Communication Engineering
Correspondence to: Ki-Hong Kim (E-mail: khkim@g.dongseo.ac.kr)
Department of Visual Animation, Dongseo University, Busan 47011, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In recent years, as computer-generated imagery has spread to more industries, realistic facial animation has become an important research topic. The current approach to realistic facial animation is to create realistically rendered 3D characters, but characters created with traditional methods always differ from the actual person and are costly in terms of staff and time. Deepfake technology can produce realistic faces and replicate facial animation. Once the AI model is trained, the facial details and animation are generated automatically by the computer, and the model can be reused, reducing the human and time costs of realistic facial animation. In addition, this study summarizes how facial information is captured, proposes a new workflow for video-to-image conversion, and demonstrates through no-reference image quality assessment that the new workflow yields higher-quality images and better face-swapping results.
Keywords: Artificial Intelligence, Deepfake, Facial animation, Animation
People rely on facial expressions to convey emotions and intentions. Because humans are sensitive to subtle facial movements, many details such as muscle movement, wrinkles, and skin composition must be considered when creating realistic facial expressions with computer technology, which makes realistic facial animation difficult to achieve with computer graphics models. Realistic facial animation production consists of facial modeling and animation data acquisition. In early facial modeling, models and maps were created in 3D software, the face models were bound to skeletons for control, and the result was then animated; to obtain realistic face models, researchers turned to laser scanning or image scanning [1-3]. Facial animation data acquisition techniques include speech-driven, image-based, and data capture techniques [4,5].

Nguyen et al. (2020) proposed a computer vision system that uses a non-contact Kinect sensor for real-time tracking of rigid head and non-rigid facial mimic movements, together with a subject-specific texture generation subsystem that enhances the realism of the generated models with texture information; a head animation subsystem with a graphical user interface was also developed. Pan et al. (2022) proposed MienCap, a real-time motion capture system that combines traditional blendshape animation techniques with machine learning models; it drives character expressions in a geometrically consistent and perceptually efficient way and could find its way into VR filmmaking and animation pipelines. Ye, Song, and Zhao (2022) developed a facial acquisition system based on an infrared structured-light sensor to obtain high-fidelity, accurate facial expression models: accurate, dense point clouds are captured, template models are morphed into the captured expressions, and the real-time 3D meshes are textured with high-resolution images from three color cameras. Gu, Zhou, and Huang (2020) proposed a landmark-driven network that generates realistic talking facial animations in which more facial details are created, preserved, and transferred from multiple source images rather than a single one; the acquisition subnetwork learns to carefully warp and merge facial regions directly from five source images with distinct landmarks, while the learning pipeline renders facial organs from the training face space to compensate. Vougioukas, Petridis, and Pantic (2019) presented a system for generating talking-head videos using a temporal GAN with two discriminators that capture different aspects of the video.

In recent years, artificial intelligence techniques have been widely adopted for their ability to solve complex problems, and researchers have turned to machine learning to improve the quality and feasibility of facial animation [S. W. Bailey and P. Chandran, 2020], [T. T. Nguyen, 2019], [T. Karras, 2017]. With a sufficiently large dataset, AI can learn how to produce facial animations for a variety of humans. The increased speed, quality, and usability of the learned AI models make them an excellent solution to some of the major problems of traditional approaches to facial animation.
Since research on AI in facial animation is relatively new and information on its use cases is limited, this study focuses on Deepfake, one such AI technique: it details the key parts of the realistic facial animation solution and the improved parts of the workflow, and finally evaluates the quality of the produced facial animation to demonstrate the solution's feasibility. Fig. 1 gives an overview of the proposed scheme. First, a face model identical to the human model is created by photogrammetry, a simulated avatar is then created with a game engine plug-in, and Audio2Face provides the facial animation data for the avatar. A digital camera then captures the human model's face as a video file, Adobe Media Encoder decomposes the video into images that are delivered to the AI model for training, and finally a hyper-realistic facial animation is obtained.
Deepfake refers to a class of techniques for creating synthetic media in which a person in an image or video has their face swapped with another person's. The common underlying mechanisms are deep learning models such as autoencoders and generative adversarial networks (GANs), which have been widely used in computer vision. Deepfake methods typically require large amounts of image and video data to train models that create photo-realistic images and videos [15-17]. Because of their simplicity, Deepfake applications can be used both by professionals and by users with limited computer skills [18].
In the Deepfake technique, the computer learns two data sets to generate an AI model with which faces can be interchanged. The autoencoder size, encoder size, and decoder size are important components of the AI model: the autoencoder size is the size of the model's middle (bottleneck) layer and affects how much image detail the AI can generate to complement the face, while the encoder and decoder sizes determine how the extracted face images are divided into a matrix of squares to be learned by the model; the larger the size, the more squares the image is divided into and the faster the AI model learns. Fig. 2 shows the autoencoder and encoder workflow of Deepfake, in which the decoder converts the data squares in the model back into images [19,20].
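A minimal sketch of this shared-encoder, two-decoder arrangement is shown below (an illustrative PyTorch example, not DeepFaceLab's actual implementation; the 128 × 128 input size, layer widths, and latent dimension are arbitrary assumptions):

```python
# Minimal sketch of the deepfake-style autoencoder: one shared encoder,
# one decoder per identity (A = source person, B = target character).
# Assumes 128x128 RGB crops; all layer sizes are illustrative only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.LeakyReLU(0.1),    # 128 -> 64
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),   # 64 -> 32
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.1),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, latent_dim),  # "autoencoder size": bottleneck width
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),  # 16 -> 32
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.1),   # 32 -> 64
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),          # 64 -> 128
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 128, 16, 16)
        return self.net(h)

encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()  # one decoder per face identity

# Training reconstructs each identity through the shared encoder:
#   loss_a = || decoder_a(encoder(x_a)) - x_a ||
#   loss_b = || decoder_b(encoder(x_b)) - x_b ||
# Swapping at inference time routes identity A through decoder B:
x_a = torch.rand(1, 3, 128, 128)
swapped = decoder_b(encoder(x_a))
```

Because the encoder is shared between the two identities, it learns a common representation of pose and expression, while each decoder learns identity-specific appearance; this is what makes the face swap possible at inference time.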
Deepfake's encoder divides images into a matrix of squares, so the clarity of the picture material the AI model learns from is crucial. Deepfake data are generally collected by shooting video with a digital camera and then converting the video file into pictures. With the development of film and television technology there are now many ways to convert video into pictures, but because they use different compression methods, the quality of the resulting pictures also differs. For a digital image, each pixel is a sampling point with a corresponding sampling value: the finer the image is divided, the more pixels and sampling points there are and the higher the image clarity; conversely, fewer pixels means lower clarity. Because the human eye's subjective sensitivity to brightness and chromaticity varies, it is difficult to distinguish such quality differences with the naked eye, so images converted from the same source video by different methods must be evaluated numerically with no-reference image clarity measures. The general principle of the BRISQUE algorithm is to extract mean subtracted contrast normalized (MSCN) coefficients from the image and fit them to a statistical model whose parameters serve as quality features. The Tenengrad function is a gradient-based measure: in image processing, well-focused images are generally considered to have sharper edges and therefore larger gradient values. The Laplacian measure is sensitive and gives fast results for images of different sizes [21-24].
Adobe Media Encoder is a video and audio encoding application that can encode audio and video files into a variety of distribution formats for different applications and audiences. It combines the numerous settings provided by the major audio and video formats and includes presets designed to export files compatible with specific delivery media [25]. Deepfake tooling uses FFmpeg for image conversion; FFmpeg is an open-source program that can record and convert digital audio and video and turn them into streams. It is licensed under the LGPL or GPL and provides a complete solution for recording, converting, and streaming audio and video, including the advanced audio/video codec library libavcodec, much of which was developed from scratch to ensure high portability and codec quality [26,27]. To ensure the objectivity of the data, this study uses both methods to convert the video files into pictures and then performs no-reference image quality assessment with three algorithms, BRISQUE, Tenengrad, and Laplacian, implemented as Python code in Spyder (Anaconda 3); Table 1 shows the code.
Table 1. Implementation code
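A minimal Python sketch corresponding to these three measures (not the authors' exact Table 1 code; it assumes opencv-contrib-python provides the cv2.quality BRISQUE implementation, and the BRISQUE model/range file paths are placeholders for the files distributed with that module):

```python
# Sketch of the three no-reference clarity measures used in this study.
# Assumes opencv-contrib-python (for cv2.quality) and numpy are installed;
# brisque_model_live.yml / brisque_range_live.yml are placeholder paths.
import cv2
import numpy as np

def tenengrad(gray: np.ndarray) -> float:
    """Gradient-based sharpness: mean squared Sobel gradient magnitude."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return float(np.mean(gx ** 2 + gy ** 2))

def laplacian_var(gray: np.ndarray) -> float:
    """Laplacian-based sharpness: variance of the Laplacian response."""
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def brisque(image_bgr: np.ndarray,
            model="brisque_model_live.yml",
            range_file="brisque_range_live.yml") -> float:
    """BRISQUE score via OpenCV's quality module (lower is better)."""
    score = cv2.quality.QualityBRISQUE_compute(image_bgr, model, range_file)
    return float(score[0])

if __name__ == "__main__":
    img = cv2.imread("frame_00001.png")           # placeholder file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    print("Tenengrad:", tenengrad(gray))
    print("Laplacian:", laplacian_var(gray))
    print("BRISQUE:  ", brisque(img))
```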
Deepfake's video-to-picture command provides two formats, PNG and JPG, so two sets of picture clips named DF_PNG and DF_JPG were obtained with Deepfake's command, and the same video was also converted to PNG and JPG with Adobe Media Encoder (AME_PNG and AME_JPG). The source video used in this experiment is 36 s long, with a resolution of 3840 × 2160 px and a frame rate of 60 fps. The DF_PNG group was decomposed into 2190 images, each 3840 × 2160 px with a bit depth of 24; the average image size in the DF_JPG group is 680 KB, and in the DF_PNG group 8 MB. Decomposing the same video with Adobe Media Encoder, the AME_PNG group contains 2910 images at 3840 × 2160 px with a bit depth of 24; the average image size in the AME_JPG group is 3 MB, and in the AME_PNG group 8.5 MB. Thirty images were randomly selected from each group for the no-reference image clarity evaluation.
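For reference, the FFmpeg side of this extraction can be reproduced directly; the following is a hedged sketch in Python (file and folder names are placeholders, and the exact arguments the Deepfake tooling passes to FFmpeg may differ):

```python
# Sketch of extracting every frame of a source video with FFmpeg, as the
# Deepfake tooling does internally. Paths are placeholders; requires the
# ffmpeg binary to be available on PATH.
import subprocess
from pathlib import Path

def extract_frames(video: str, out_dir: str, fmt: str = "png") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pattern = str(out / f"frame_%05d.{fmt}")   # frame_00001.png, frame_00002.png, ...
    # -y overwrites existing files; one numbered image is written per video frame
    subprocess.run(["ffmpeg", "-y", "-i", video, pattern], check=True)

extract_frames("model_face.mp4", "DF_PNG", fmt="png")
extract_frames("model_face.mp4", "DF_JPG", fmt="jpg")
```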
Fig. 3 shows the clarity measurements for the four picture groups. A smaller BRISQUE value indicates better picture clarity, whereas larger Tenengrad and Laplacian values indicate better clarity. Among the four groups, the best clarity was obtained when the video was converted to JPG format with Adobe Media Encoder, so this study proposes using Adobe Media Encoder to produce the picture material used for AI learning in Deepfake.
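The sampling-and-averaging step can be scripted in the same way; below is a minimal sketch using Laplacian variance as the example measure (folder names follow the image groups above, the random seed is arbitrary, and the same loop applies to BRISQUE and Tenengrad):

```python
# Sketch of the evaluation protocol: draw 30 random images from each group
# and compare the mean clarity score across groups.
import random
from pathlib import Path
import cv2

def mean_clarity(folder: str, n: int = 30, seed: int = 0) -> float:
    files = sorted(Path(folder).glob("*.*"))       # assumes the folder holds only frames
    random.Random(seed).shuffle(files)
    scores = []
    for f in files[:n]:
        gray = cv2.imread(str(f), cv2.IMREAD_GRAYSCALE)
        scores.append(cv2.Laplacian(gray, cv2.CV_64F).var())
    return float(sum(scores) / len(scores))

for group in ["DF_PNG", "DF_JPG", "AME_PNG", "AME_JPG"]:
    print(group, mean_clarity(group))
```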
To reuse the AI models generated by Deepfake, they must be pre-trained, because Deepfake carries over the mapping effect of previous training. This study uses DeepFaceLab for AI model training. The size of the AI model file does not change with the amount of material or the training time, and pre-training saves the trained mapping effect as a starting point, so a pre-trained AI model is more efficient to reuse. The AI model generated by Deepfake consists of six files, and replacing the _SAEHD_data.dst file is enough to reuse the pre-training, which greatly reduces the training time needed to reach usable imagery. Because pre-training learns from only one set of picture material, only mask training is set to True; if face angles are missing from the Src material, training takes longer, so various face angles must be added during pre-training. Fig. 4 shows the pre-training process of the AI model used in this study; it can be seen that the simulation of the character's expressions is essentially complete. The loss value evaluates the face-swapping result, with smaller values indicating a better swap, so the AI model was judged ready for formal training.
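As a rough illustration of this reuse step (the folder layout and file names are placeholders inferred from the description above and may differ between tool versions):

```python
# Hedged sketch of reusing a pre-trained model: copy the pre-trained model
# folder into a new workspace, then replace only the dst data file so
# training maps to the new target faces. All paths are placeholders.
import shutil
from pathlib import Path

pretrained_model = Path("pretrain_workspace/model")   # saved pre-training result
project_model = Path("project_workspace/model")       # model folder of the new project

shutil.copytree(pretrained_model, project_model, dirs_exist_ok=True)  # reuse weights
shutil.copy(Path("new_dst/_SAEHD_data.dst"),                          # swap the dst data file
            project_model / "_SAEHD_data.dst")
```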
Two sets of data (Src and Dst) are required for formal training of the AI model. The Dst material used in this study is MetaHuman animation material: the MetaHuman character model is created from photo-scanned face data of a real person, and the facial animation is produced with Omniverse Audio2Face (Fig. 5). The Src material consists of 13 videos shot with a digital camera, from which Adobe Media Encoder output a total of 33,554 valid JPG images.
During training, masked_training and random_warp are turned on first; the loss value drops very quickly at the beginning of training and the image appears quickly. When most of the outlines in the fifth column of the preview image are basically accurate, lr_dropout is set to True and training continues until the loss value drops very slowly or shows repeated signs of rebound. At that point, check the eye direction in the training results; to ensure the clarity of the eye direction and the side profile of the face, set eyes_mouth_prio and uniform_yaw to True, and set lr_dropout, random_warp, and random_flip to False. In the final stage, the GAN value is set; it controls the speed of AI model learning, and too large a value causes the AI model to fail, so it is better to start from 0.0001. Enabling the GAN also increases GPU usage.
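The staged settings described above can be summarized as follows (a descriptive sketch only; the keys mirror the option names quoted in the text, and stage transitions are judged from the preview window and loss curve rather than fixed iteration counts):

```python
# Summary of the staged training settings described above. This is a plain
# data structure for documentation, not a runnable training configuration API.
TRAINING_STAGES = [
    {   # Stage 1: rough structure; loss drops quickly
        "masked_training": True,
        "random_warp": True,
    },
    {   # Stage 2: once the preview outlines are mostly accurate
        "lr_dropout": True,
    },
    {   # Stage 3: refine eyes and side profile once the loss plateaus
        "eyes_mouth_prio": True,
        "uniform_yaw": True,
        "lr_dropout": False,
        "random_warp": False,
        "random_flip": False,
    },
    {   # Final stage: enable GAN refinement; large values can destabilize the model
        "gan_power": 0.0001,   # start small; also raises GPU usage
    },
]
```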
In this study, the synthesized face after training is divided into nine parts (Fig. 5) matching the original face, and SSIM evaluation is performed between the real face and the Deepfake face, both as a whole and part by part. The Structural Similarity Index Measure (SSIM) is a metric that measures the degree of similarity between two digital images. The SSIM evaluation was performed in MATLAB; the SSIM index was 0.6398 for the two face images overall and 0.3326, 0.7299, 0.4737, 0.694, 0.7127, 0.8046, 0.6696, 0.7664, and 0.5956 for the individual parts. Observing the images and SSIM values shows that Deepfake simulates and replicates the details of the real face and handles light and shadow well (Table 2).
Table 2. SSIM evaluation results

SSIM evaluation values (3 × 3 facial regions)
0.333 | 0.730 | 0.474
0.694 | 0.713 | 0.805
0.670 | 0.766 | 0.596
SSIM overall evaluation: 0.6398
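The SSIM evaluation above was carried out in MATLAB; an equivalent check can be sketched in Python with scikit-image (a minimal sketch; the image file names are placeholders, and both images must share the same resolution):

```python
# Sketch of the SSIM comparison: overall score plus a 3x3 grid of per-region
# scores between the real face and the Deepfake face.
# Assumes scikit-image and OpenCV; image file names are placeholders.
import cv2
from skimage.metrics import structural_similarity as ssim

real = cv2.imread("real_face.png")
fake = cv2.imread("deepfake_face.png")   # must match the real image's resolution

# Older scikit-image versions use multichannel=True instead of channel_axis.
print("overall SSIM:", ssim(real, fake, channel_axis=-1))

h, w = real.shape[:2]
for row in range(3):
    for col in range(3):
        ys = slice(row * h // 3, (row + 1) * h // 3)
        xs = slice(col * w // 3, (col + 1) * w // 3)
        score = ssim(real[ys, xs], fake[ys, xs], channel_axis=-1)
        print(f"region ({row}, {col}): {score:.3f}")
```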
Although it was objectively verified that the AI model generates realistic facial animation, an uncanny valley test was needed to confirm whether viewers perceive the animation as realistic. The questionnaire was based on some of the questions proposed by Bartneck, Kulic, and Croft (2009) [28]. Because that scale was designed for robots, this study first screened respondents to select suitable questions; after screening, a total of four items were identified, set up as shown in Table 3. The survey was conducted offline at Dongseo University in Busan, Korea, from October 15 to October 18, 2022: respondents watched the produced video and recorded their impressions according to the questionnaire. Sixty-seven people participated, mainly professors and students.
Table 3. Questionnaire used to assess impressions of the facial animation

Please rate your impression of the facial animation on these scales:
Fake | 1 | 2 | 3 | 4 | 5 | Natural
Machinelike | 1 | 2 | 3 | 4 | 5 | Humanlike
Artificial moving | 1 | 2 | 3 | 4 | 5 | Lifelike moving
Rigidly | 1 | 2 | 3 | 4 | 5 | Elegantly
After averaging the questionnaire data (Fig. 6), the produced video scored above the scale midpoint when compared with the real person, and the motion-perception score of the animation was also above the midpoint; thus, the feasibility of the solution proposed in this study can be demonstrated. Summarizing the respondents' feedback outside the questionnaire items, the animation of the eyes and the realism of the hair affect the overall judgment.
The application of artificial intelligence in the film and television industry is becoming increasingly mature. With the development of the Internet, film and television content has become richer: movies and TV series use artificial intelligence to change actors' ages and even let deceased actors appear in the virtual world. Advances in film and television technology have made the production of virtual humans simpler; in particular, their facial animation can already be created in various ways, such as facial capture. However, both facial capture and traditional facial animation production require a great deal of labor and time. The solution proposed in this paper uses Deepfake to train an AI model to learn a real human face and its expression movements and then replaces the face of a CG character. Because the AI model can be reused, and because skin and other facial details are replicated almost perfectly, this approach can greatly reduce the time and labor costs of facial animation production and is a new attempt in the CG film and television industry. The experimental results and the subjective survey show that the proposed solution performs well in reproducing a real human; however, because the face has only simple animation and the hair rendering quality is limited, the mean scores of the facial animation evaluation, although above the midpoint, exceed it only slightly. Future studies will address eye animation and hair realism.
The material obtained from the Internet in this study was used for study and research purposes only, and all data that might infringe on personal interests were deleted after the experiment. Consent for the use of facial information was obtained from the individual concerned.
received a bachelor's degree in advertising from Qingdao Agricultural University, China (2016), and a master's degree from the Department of Visual Contents, Dongseo University, Korea (2019). Currently a Ph.D. student in the Department of Visual Contents, Dongseo University (2022).
Research interests include virtual characters, 3D reconstruction, and artificial intelligence learning.
2006: R&D supervisor for the animated feature film Alexander the Great. 2007: R&D supervisor for the animated feature film San Antonio. 2008: R&D supervisor for the animated feature film Carol. 2010: executive producer, 7 Ride Films. 2010: R&D supervisor for the TV series Parada of PotteryIs. 2010-present: professor of Visual Animation at Dongseo University and director of the Software Convergence Center.
Areas of interest: animation content, 3D CG, visual artificial intelligence, motion data, photogrammetry, and 3D implementation.
born in Shandong Province, China, in 1996; received a Bachelor of Arts degree from Zhongnan University of Economics and Law in 2019 and a Bachelor of Arts degree from Dongseo University, South Korea. He obtained a master's degree in engineering from Dongseo University in 2022 and began artificial intelligence research at Dongseo University in the same year.
Bachelor of Arts, China-Korea Institute of New Media, Zhongnan University of Economics and Law. Currently studying for a master's degree in the Department of Visual Contents, Dongseo University.