Journal of information and communication convergence engineering 2024; 22(1): 56-63
Published online March 31, 2024
https://doi.org/10.56977/jicce.2024.22.1.56
© Korea Institute of Information and Communication Engineering
Correspondence to : Jung Hee Seo (E-mail: jhseo@tu.ac.kr)
Department of Computer Engineering, Tongmyong University, Busan 48520, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Despite the rapid strides in content-based image retrieval, a notable disparity persists between the visual features of images and the semantic features discerned by humans. Hence, image retrieval based on the association of semantic similarities recognized by humans with visual similarities is a difficult task for most image-retrieval systems. Our study endeavors to bridge this gap by refining image semantics, aligning them more closely with human perception. Deep learning techniques are used to semantically classify images and retrieve those that are semantically similar to personalized images. Moreover, we introduce a keyword-based image retrieval approach that enables automatic labeling of images in mobile environments. The proposed approach can improve the performance of a mobile device with limited resources and bandwidth by performing retrieval based on the visual features and keywords of the image on the mobile device.
Keywords: CBIR, Image Retrieval, Deep Learning, CNN, Feature Detection
With the rapid expansion of communication technology and the widespread use of digital and mobile devices, the generation of images has surged exponentially in daily life. Consequently, most of the memory of personal digital devices is consumed by images. In addition, personalized image data are being increasingly shared through offline and online methods, such as websites and social media. This surge in image generation has led to an increased interest among researchers in image retrieval methods.
Effective algorithms for content-based image retrieval (CBIR) have been developed through research over the past several years. Image-based query retrieval holds promise for efficient image retrieval and finds applications across various fields within computer vision and artificial intelligence. Studies have also been conducted on CBIR using Big Data and deep learning techniques.
Conventional image retrieval techniques are used in various fields, including facial recognition [1,2], iris recognition [3], person identification [4], searching for clothing or other products [5], searching for food and groceries [6], and fingerprint recognition [7,8].
Earlier studies on image-retrieval systems primarily relied on text-based frameworks, with image retrieval being conducted through these methods. CBIR emerged later and its focus shifted toward automatic image annotation. Many CBIR systems use these methods or a combination of them to reduce semantic differences between images [9].
Text-based image retrieval (TBIR) methods, however, pose challenges: users are required to manually input keywords, and the retrieved images are limited in how well they semantically align with those keywords. By contrast, CBIR can solve the problem of text retrieval at a more fundamental level by utilizing a computer’s visual processing capability [10].
CBIR employs feature extraction and matching to classify images based on human semantic perspectives. Feature extraction in CBIR is the first step of image retrieval. It is invariant to image scaling and rotation and is partially invariant to illumination changes. Additionally, feature extraction in CBIR is well-localized in both spatial and frequency domains, laying the foundation for accurate image retrieval, particularly for databases with numerous single features [11].
However, despite the rapid advancement of CBIR, there are significant differences between the visual features of images and semantic features recognized by humans. Therefore, image retrieval that associates semantic similarities recognized by humans with visual similarities is a difficult task for most image retrieval systems, and results in high memory usage and computational complexity owing to large-scale image processing.
CBIR systems excel at automatically extracting visual content from images using low-level features such as color or texture in image queries. However, users generally prefer to query images according to high-level concepts such as keywords [12].
For several years, CBIR has generated considerable interest among researchers in the imaging field, and many studies have focused on CBIR. However, most image-retrieval methods are susceptible to variations in color, texture, and shape, posing difficulties for content-based image retrieval.
The scale-invariant feature transform (SIFT) is a visual feature extraction method that transforms an image into a collection of local feature vectors. This function is invariant to the translation, scaling, or rotation of the images. Furthermore, it is partially invariant to changes in illumination. Recently, researchers have proposed using SIFT to solve CBIR problems [13].
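As a brief illustration of how such local descriptors are typically obtained (this is background on SIFT-based CBIR, not part of the proposed system), the following sketch extracts SIFT keypoints and their 128-dimensional descriptors with OpenCV; the image path is illustrative.

```python
import cv2

# Load an image and convert it to grayscale (path is illustrative).
image = cv2.imread("query.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect SIFT keypoints and compute their 128-dimensional descriptors,
# which are invariant to translation, scaling, and rotation.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(len(keypoints), descriptors.shape)  # N keypoints, (N, 128) descriptor matrix
```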
A notable disparity exists between the low-level visual features of an image and the semantic features recognized by humans. To overcome this issue, image classification using deep-learning technology has been investigated, and many achievements have been made through studies on deep-learning-based image classification. Deep learning has demonstrated exceptional performance in object detection and segmentation [14] and has been leveraged to identify similar images based on semantic similarities [9,15].
Convolutional neural network (CNN)-based image retrieval has developed rapidly in recent years owing to the limited expressive capabilities of existing functions and innovations in image processing via deep neural networks. Given CNN’s substantial advancements in image classification, researchers have explored using pre-trained CNN models to conduct classification tasks in CBIR [5].
In this study, we subdivided images semantically using deep-learning techniques to retrieve images that are semantically similar to a personalized image. Additionally, we introduce a keyword-based image retrieval approach that automatically labels images in a mobile environment, reinforcing their semantics to align with human perception. Therefore, the gap between the content-based visual semantics of an image and the semantic features of an image recognized by humans can be reduced. Moreover, the memory consumption of digital devices can be reduced.
The paper is structured as follows: Section 2 discusses the studies related to conventional image retrieval. Section 3 reinforces the semantics of images and presents a technique for retrieving similar images based on keywords using automatic image labeling in a mobile environment. Section 4 provides implementation results and analysis, while Section 5 offers our conclusions.
The demand for cutting-edge technologies capable of efficiently processing vast amounts of data is ever-increasing, and CBIR stands out as a powerful method for retrieving diverse pictures and videos from extensive image databases [16]. CBIR enables the retrieval of relevant images from a database by utilizing the content of interest as the input image. This technique is used to search for similar products on e-commerce sites, such as Alibaba, Amazon, and eBay [5].
Image classification based on semantics provides a semantically classified hierarchical image database. Pandey et al. leveraged the benefits of such databases in their study and proposed a system that automatically assigns semantics to images through an adaptive combination of multiple visual features [15].
Existing image-retrieval methods include CBIR [5,16,17], symbol image representation [18], hash algorithms [19,20], CNNs for retrieving similar images [14,21], the SIFT [11,22], image semantics [5,9,15,23], geo-multimedia [13], and entropy-based retrieval [24]. These methods present various approaches for effective and efficient image retrieval.
Punitha et al. [18] proposed a method for representing symbolic images in a symbolic image database, ensuring invariance to image transformation and facilitating exact match retrieval.
Cheng et al. [19] primarily improved feature learning, loss functions, and learning methods to increase image-retrieval efficiency and to learn hash functions and hash codes more efficiently. They also proposed an adaptive asymmetric residual hash method based on a residual hash, an integrated network, and fast-supervised discrete hashing.
Zhang et al. [14] utilized deep-learning techniques to construct a semantic database using a location estimation method based on semantic information.
Pandey et al. [9] developed content-based semantics and image retrieval systems tailored for semantically classified hierarchical image databases.
Munjal et al. [16] combined CBIR and TBIR support methods to structure a collection of photographs systematically. Their approach simplifies information collection and facilitates offline image retrieval through the automatic generation of text metadata.
Wang et al. [11] developed an effective content-based web image search engine using SIFT feature matching. SIFT descriptors capture the local features of an image, remaining invariant to scaling, translation, and rotation, while also exhibiting partial invariance to illumination changes and affine transformation. To reduce invalid feature matches, a dynamic probability function replaces the original fixed thresholds used to determine similarity distances between the query and the database built from the training images. Search performance is further improved by preprocessing each original image and saving its keypoints in XML format.
Wangming et al. [22] utilized Lowe’s SIFT properties, renowned for their unique local invariant characteristics in CBIR. The visual contents of the query and database images were extracted and described as a 128-dimensional SIFT feature vector using the CBIR system.
Weng et al. [21] introduced an effective framework leveraging convolutional neural architecture search (CNAS) to address diverse image classification tasks.
Li et al. [5] reviewed technological developments regarding image representation and database retrieval. They explained the practical applications of CBIR in fashion image retrieval, person reidentification, e-commerce product retrieval, remote-sensing image retrieval, and trademark image retrieval. Furthermore, they examined the challenges of big data and future research directions for deep learning.
CBIR is primarily performed using large-scale databases, whereas image-retrieval methods for small amounts of data are lacking. Because most CBIR methods require a large amount of data, it is difficult to collect images for retrieval tasks and expensive to assign labels, which places numerous limitations on the development of CBIR [5].
Hence, there is a need for a CBIR-based retrieval technique capable of identifying similar images using a small amount of data based on human semantics, similar to how humans can easily classify semantically similar images.
In this study, we visualize the results of retrieving similar images in a mobile environment and devise an effective search strategy. To this end, we add personalized images through transfer learning to subdivide the semantics and perform image classification.
Using the trained model, we automatically extract labels from gallery images or images captured by the camera and save them as tag properties. The goal is to concretize the semantics of the images by minimizing the gap between the semantics of the values stored in the tag properties of the images and the semantics of the images perceived by humans.
Semantics-based image classification plays a crucial role in retrieving similar images. To reinforce the visual semantics of an image, the semantics can be concretized in detail through transfer learning. Consequently, this study proposes an enhanced architecture based on the MobileNet CNN architecture.
High data accuracy and a low degree of overfitting must be maintained to improve the quality of the CNN model in deep learning. Hence, a large amount of training data is required. Nevertheless, as with the CBIR method, there are many practical limitations to collecting large amounts of training data. To address this issue, transfer learning is employed as a compensatory measure.
Fig. 1 shows the overall system procedure for retrieving similar images based on semantics.
The Input Image Data of the Create Model Module in Fig. 1 consist of hierarchical nodes to which the semantic meanings of the images recognized by humans are assigned so that they can be used as inputs for the learning model. These nodes are used to retrain a pretrained model on the newly added data.
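The sketch below shows one way such hierarchically labeled input data might be prepared, assuming a Keras/TensorFlow workflow in which each hierarchy node corresponds to a labeled image directory; the directory name, split ratio, and batch size are illustrative rather than the paper's actual configuration.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)

# Each leaf node of the semantic hierarchy is a directory of images whose name
# serves as the label (e.g., semantic_hierarchy/sea, semantic_hierarchy/mountain, ...).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "semantic_hierarchy/",
    validation_split=0.2, subset="training", seed=123,
    image_size=IMG_SIZE, batch_size=32)

val_ds = tf.keras.utils.image_dataset_from_directory(
    "semantic_hierarchy/",
    validation_split=0.2, subset="validation", seed=123,
    image_size=IMG_SIZE, batch_size=32)

class_names = train_ds.class_names  # hierarchy-node labels, later carried with the model
```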
The pretrained feature extractor employs MobileNet as the base model, comprising convolutional and pooling layers. It extracts visual features from a lower-level to a higher-level layer. MobileNet was already trained using the ImageNet dataset.
The classifier configuration incorporates fully connected (dense) layers with dropout, facilitating hierarchical classification. Fine-tuning involves retraining the added data using the pretrained model: the entire layer stack is fine-tuned, starting from a low learning rate that is gradually increased, to create a new model for image classification. The TensorFlow Lite Model step converts the newly developed model into the TensorFlow Lite format, including metadata, enabling it to operate on a mobile device.
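A minimal sketch of this transfer-learning setup, assuming the Keras API with MobileNet as the frozen feature extractor and the train_ds/val_ds datasets prepared above; the dropout rate, learning rates, and epoch counts are illustrative, not the values used in the paper.

```python
import tensorflow as tf

NUM_CLASSES = len(class_names)  # number of hierarchy-node labels (from the dataset sketch)

# Pretrained feature extractor: MobileNet trained on ImageNet, classification head removed.
base_model = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base_model.trainable = False  # frozen during the initial training phase

# Classifier configuration: pooling plus dense/dropout layers on top of the extractor.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),  # MobileNet expects inputs in [-1, 1]
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)

# Fine-tuning: unfreeze the base model and retrain the whole network at a low learning rate.
base_model.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```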
The embedded model within the mobile app module, as depicted in Fig. 1, runs the model that has been transformed into the TensorFlow Lite format on the mobile device.
User Interface Design represents the design of the screen on the mobile device. Label Detection and Assign Semantic (Tag) takes the Query Image as input, extracts the label carrying the semantic meaning of the image, and assigns it to the tag property of the image. Similar Image Search based on Tag retrieves similar images according to keywords through the tag properties.
In this study, deep-learning technology was employed to train a model for semantic image classification. The trained model was then converted into a TensorFlow Lite model to work on a mobile device. The model runs on the mobile device and automatically extracts labels from the personalized image. The extracted label is assigned to the tag property of the image, facilitating the retrieval of similar images based on keywords. Therefore, similar images can be retrieved according to keywords through tag properties that reinforce the visual features of the content image with semantic features perceived by human vision.
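A hedged sketch of the conversion step, assuming the standard TensorFlow Lite converter; in the described system the class labels are embedded as TFLite metadata, which is approximated here with a plain label file, and the file names are illustrative.

```python
import tensorflow as tf

# Convert the fine-tuned Keras model to TensorFlow Lite so it can run on the mobile device.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimization
tflite_model = converter.convert()

with open("semantic_classifier.tflite", "wb") as f:
    f.write(tflite_model)

# The hierarchy-node labels travel with the model; the paper embeds them as model
# metadata, whereas this sketch simply writes them to a text file.
with open("labels.txt", "w") as f:
    f.write("\n".join(class_names))
```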
The proposed procedure for similar image retrieval is outlined in Algorithm 1.
In Step 1, the semantic system of images and their visual hierarchical structure are manually established, categorizing the images into hierarchical datasets. Each node within this hierarchy represents specific image semantics and functions as a label for subsequent extraction.
Step 2 constructs the transfer-learning model by training a new model on the recently added data, leveraging MobileNet as a pretrained model. MobileNet is used only as a feature extractor. Image classification (Classifier Configuration) assigns the categories contained in the images based on the features extracted by the pretrained network. The pretrained feature extractor consists of a convolutional layer and a pooling layer, and the Classifier Configuration classifies the newly added data into a hierarchical structure.
In Step 3, the newly added data are retrained with the pretrained model through fine-tuning, yielding a refined model tailored for image classification.
Step 4 transforms the generated model into a TensorFlow Lite model to add it to the mobile app and generates metadata that include image labels in the model.
Step 5 runs the TensorFlow Lite model, which includes the metadata, on the mobile device and uses it to automatically extract labels for the query image.
Step 6 sorts the automatically extracted labels by accuracy. Among the extracted labels, the label with the highest accuracy is set as the tag for the query image. This tag is then used to perform keyword-based image retrieval on the mobile device.
Step 7 displays similar image retrieval results for the query image on the mobile device.
In Step 8, if the semantics of the image need to be modified, the image tag can be manually edited individually or collectively.
Algorithm 1. Keyword-based similar image retrieval method
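The sketch below approximates Steps 5-7 with the Python TensorFlow Lite interpreter as a desktop stand-in (the actual app uses the Android TensorFlow Lite runtime); the file names, preprocessing, and the in-memory tag store are illustrative.

```python
import numpy as np
import tensorflow as tf

# Step 5 (sketch): run the converted model on a query image and obtain label scores.
interpreter = tf.lite.Interpreter(model_path="semantic_classifier.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

img = tf.keras.utils.load_img("query_image.jpg", target_size=(224, 224))
x = np.expand_dims(np.asarray(img, dtype=np.float32), axis=0)  # preprocessing must match training

interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])[0]

# Step 6 (sketch): pick the highest-confidence label as the image tag.
labels = open("labels.txt").read().splitlines()
tag, confidence = max(zip(labels, scores), key=lambda p: p[1])

# Step 7 (sketch): keyword-based retrieval over tag properties; a prefix such as
# "mou" matches "mountain", as in the experiments. The tag store is illustrative.
image_tags = {"IMG_001.jpg": "mountain", "IMG_002.jpg": "sea", "IMG_003.jpg": tag}

def search_by_keyword(prefix, tags=image_tags):
    return [name for name, t in tags.items() if t.startswith(prefix)]

print(tag, float(confidence), search_by_keyword("mou"))
```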
The experimental environment of this study utilized Google TensorFlow and Android-based mobile programming, facilitating the construction of a hierarchical database grounded in visual semantics through a transfer learning model.
The datasets used to train the model in this experiment were open databases. The experiment was conducted using image data collected from hierarchical visual image databases, including ImageNet’s ILSVRC2012 dataset, the Oxford-IIIT Pet dataset, the Flickr dataset, and TensorFlow’s flower-photos dataset. The experiment evaluated the effectiveness of keyword-based similar-image retrieval on a mobile device.
A gallery image or an image captured by a camera was used as the query image, and the features of the image were extracted using a hierarchical database through image-classification training. Labels similar to the visual features of the images were extracted. This model operates on Android-based mobile devices, serving image-labeling tasks by extracting semantic meanings from images. The extracted labels were saved as tag properties of the images. These properties are represented by a hierarchical search structure that visualizes images on a mobile device. The system’s efficiency in assigning semantics to images was validated through effective visual and intuitive retrieval, offering a compelling rationale for aligning human-perceived image features with visual features.
This approach can substantially increase the efficiency of keyword-based similar-image retrieval on a mobile device using only tag information derived from the visual features of the images.
Fig. 2 and Fig. 3 present the loss rate and accuracy of the learning model, respectively. The model maintained a consistent loss rate and accuracy. Additionally, the learning model achieved a loss value of 0.31616 and an accuracy of 0.9342.
Fig. 4 and Fig. 5 depict the inference results obtained from the proposed model. For example, in Fig. 4, “pred:sea” represents the label inferred from the image, and “label:sea” represents the actual label of the image. Notably, the recognition rate for males was relatively low. However, as indicated by the 1×8 results in the figure, the recognition rate was high, except when a man was incorrectly identified as a woman.
Fig. 6 illustrates the process of extracting an image tag on a mobile device. “Images Display” at the top (a) shows the result of saving the image using the Gallery or Camera application. Tapping the image transitions the screen to “ImageView,” depicted in (b) at the top of the figure. When the “TAG SEARCH” button is tapped, the image’s label (mountain) and confidence value (0.9986299) are displayed. The extracted label is then saved as a tag property of the image. Finally, (c) presents the list of tag property values stored in an image.
At the bottom of the figure, (d) and (e) display the outcomes of retrieving similar images based on keywords associated with sea and mountain by entering “sea” and “mou” as the Tag property values, respectively. (f) shows the result of incorrectly predicting a “baby” as a “woman” in the case where “wo” was entered as the Tag property to search for “woman.” These results indicate significant potential for enhancement in the proposed system, particularly concerning the semantic classification of people.
Therefore, the experimental results provide significant motivation to narrow the gap between the visual features of images and the semantics perceived by humans. However, further studies are needed on this disparity, particularly in the context of identifying individuals.
We propose an approach for efficient keyword-based similar-image retrieval on mobile devices through automatic image labeling. This approach semantically divides images using deep learning; therefore, the gap between the content-based visual features of images and their semantics can be reduced. It is also possible to reduce storage consumption in digital devices, which can improve the performance of mobile devices with limited resources and bandwidth by searching according to the visual features and keywords of the image on the mobile device. The proposed approach achieved outstanding performance in the semantic classification of various images.
Moreover, because CBIR can be implemented on devices with limited resources, such as mobile devices and embedded systems, our approach bridges the gap between the semantics of an image perceived by humans and the visual features of the image.
Jung-Hee Seo
She received a B.S. degree in Computer Science from Silla University in 1994, an M.S. degree in Computer Science and Statistics from Kyungsung University in 1997, and a Ph.D. degree in Electronic Commerce System from Pukyong National University in 2006. She has been an assistant professor with the Department of Computer Engineering, Tongmyong University, since 2000. Her research interests include Remote Education, Multimedia, Image Processing, Information Protection, and Mobile Applications.