Journal of information and communication convergence engineering 2024; 22(1): 80-87
Published online March 31, 2024
https://doi.org/10.56977/jicce.2024.22.1.80
© Korea Institute of Information and Communication Engineering
Aaron Daniel Snowberger and Choong Ho Lee*, Member, KIICE
Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea
Correspondence to: Choong Ho Lee (E-mail: chlee@hanbat.ac.kr)
Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The Manchu language holds historical significance, but a complete dataset of Manchu script letters for training optical character recognition machine-learning models is currently unavailable. Therefore, this paper describes the process of creating a robust dataset of extracted Manchu script letters. Rather than performing automatic letter segmentation based on whitespace or the thickness of the central word stem, an image of the Manchu script was manually inspected, and one copy of the desired letter was selected as a region of interest. This selected region of interest was used as a template to match all other occurrences of the same letter within the Manchu script image. Although the dataset in this study contained only 4,000 images of five Manchu script letters, these letters were collected from twenty-eight writing styles. A full dataset of Manchu letters is expected to be obtained through this process. The collected dataset was normalized and trained using a simple convolutional neural network to verify its effectiveness.
Keywords: Character Extraction, Data Collection, Dataset Creation, Manchu Characters, Template Matching
The Manchu language served as the lingua franca of the Qing Dynasty for over 200 years, facilitating trade and exchange throughout Asia and Europe. During that time, scholars in Russia, Joseon Korea, and Tokugawa Japan, as well as European missionaries, took a keen interest in Manchu for its usefulness in making the Chinese language more accessible [1].
However, today, only about 50 people can still read and write Manchu [2]. This poses a significant risk of losing numerous historical documents. Estimates indicate that nearly 20 percent of the 10 million documents in the First Historical Archives of China in Beijing were written in Manchu, and the provincial archive in Harbin may hold up to 60 tons of Manchu documents [3]. There are also large archives of Qing Dynasty documents in Shenyang, Taipei, Dalian, Changchun, and Mongolia [4], as well as overseas archives in Japan, Europe, and North America [5].
Three major factors complicate the recovery of these documents. First, Manchu is a dying language. The last generation of native speakers is slowly aging out of the population. Second, few scholars have actively studied Manchu or conducted translation work using Manchu. Third, the vast amount of archival data is overwhelming. One estimate suggests that even if 100 people spent 100 years translating all Manchu archives, they would still be unable to finish [3].
Optical character recognition (OCR) and machine-learning technologies are therefore becoming increasingly important for preserving and translating historical documents such as those written in Manchu. However, no dataset of extracted and labeled Manchu script letters is currently available to aid in building an OCR model. This study therefore outlines a process for simultaneously extracting and labeling Manchu letters from images of Manchu script to build and normalize a machine-learning dataset. The dataset gathered and normalized for this study was then used to train a simple neural network to confirm the viability of the approach.
Since the early 2000s, assorted studies have been conducted on Manchu script recognition. These studies typically fall into one of three categories: recognition of Manchu letters based on strokes [6-8], segmentation or extraction of Manchu letters from words for recognition [9,10], and segmentation-free recognition of Manchu words [11-14]. Of these, the segmentation-free approach appears to be the most intuitive, given the complexity of stroke and letter classifications. However, the segmentation-free recognition of Manchu words is not free of difficulties.
One difficulty is the need for a significantly larger dataset of Manchu words for training, as opposed to a dataset containing only letters or strokes. However, because public Manchu datasets are largely unavailable, custom datasets are typically created for research purposes. Huang et al. [11] found that creating a Manchu word dataset often results in an imbalance, lacking sufficient word quantity for effective training, and thus necessitates augmentation.
The second difficulty is the possibility of additional noise or deformations appearing in Manchu word images when they are resized, by squashing or stretching, to match the input shape of the convolutional neural network (CNN). Zheng et al. [12] attempted to solve this problem by introducing a CNN with spatial pyramid pooling (SPP) rather than max-pooling in the final layer. The SPP layer contains three independent pooling layers, each dividing an input feature map into patches and creating a fixed-dimensional vector that can be input into the last fully connected layer of the CNN. This allows the recognition of Manchu words of arbitrary sizes without segmentation.
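To illustrate how such an SPP layer yields a fixed-length vector regardless of input size, the following is a minimal PyTorch sketch; the grid sizes (1, 2, 4) are illustrative assumptions, not the configuration used in [12].

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, grids=(1, 2, 4)):
    # x: feature map of shape (batch, channels, H, W) with arbitrary H and W.
    # Each grid size n pools the map into an n x n array of patches, so the
    # concatenated output always has channels * (1 + 4 + 16) features.
    pooled = [F.adaptive_max_pool2d(x, n).flatten(start_dim=1) for n in grids]
    return torch.cat(pooled, dim=1)  # fixed-dimensional vector for the FC layer
```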
The third difficulty concerns the algorithmic complexity of segmentation-free Manchu word classifiers. For example, Zhang et al. [14] presented a 10-step algorithm for performing segmentation-free recognition with a deep CNN, where two steps of the algorithm required two separate appendices. The first appendix explains the seventh step, an extraction method for the vector outline of the j-th letter of a Manchu word, in ten additional steps. The second appendix explains the eighth step, which calculates the connection strength between the i-th contrast image and the j-th letter, in six additional steps.
Therefore, despite the increased frequency and promising results of segmentation-free studies in recent years, they remain far from a validated approach for Manchu word recognition. Accordingly, in a previous study [15], we attempted a new type of Manchu script letter segmentation with limited success. In that study, we used Python to segment images of Manchu script into lines and words based on the surrounding white space. We then segmented each Manchu word into individual letters by cropping the word image horizontally at the pixel rows containing the fewest black pixels; in other words, we cropped the image where no strokes overlapped vertically with the central word stem. Fig. 1 illustrates this segmentation method: three cropping locations are displayed in the histogram, at pixel rows 7, 13, and 25, dividing the word into four letters.
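A minimal sketch of that projection-profile idea follows; the black-pixel threshold is an illustrative assumption, and the actual cut-selection logic in [15] was more involved.

```python
import numpy as np

def find_cut_rows(word_img, max_black=2):
    # word_img: binary image of one vertical Manchu word (1 = black stroke).
    # Candidate cuts are pixel rows where almost no stroke crosses the stem.
    profile = word_img.sum(axis=1)          # black pixels per horizontal row
    return np.where(profile <= max_black)[0]

def crop_at_rows(word_img, cut_rows):
    # Crop the word horizontally at the given rows; e.g., cuts at rows
    # 7, 13, and 25 divide the word into four letter images.
    bounds = [0, *cut_rows, word_img.shape[0]]
    return [word_img[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]
```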
Although the segmentation method described above precisely separated lines and words of text, two additional problems arose. First, when applied to a collection of varying handwriting styles, the method proved ineffective at segmenting letters, particularly where long letter strokes overlapped vertically into the horizontal space of other letters. Second, even when the segmentation method worked perfectly, the issue of labeling the segmented data persisted.
Therefore, this paper presents a process for the simultaneous extraction and labeling of Manchu script letters to create a machine-learning dataset. Although the method presented here requires the manual identification of Manchu script letters in an image, it is effective for numerous documents and handwriting styles.
To verify the effectiveness of this extraction and labeling method, a small dataset of Manchu script letters was collected, normalized, and trained using a small neural network. After training for 50 epochs, the network classified the test images with an accuracy of 98.75%.
The proposed process for Manchu script letter extraction and labeling uses OpenCV’s selectROIs and matchTemplate functions. First, the letters to be collected are defined in a label array. Next, selectROIs is used to select a region of interest containing a single Manchu letter. The selected ROI is then used as a template in the matchTemplate function to locate the matching copies in the given image. Because overlapping bounding boxes may exist for a matched template, a non-maximum suppression algorithm is used to minimize repeated matches.
The resulting matched bounding boxes are used to crop and save matching copies of the letters into labeled folders. This process is repeated for each image of Manchu script in each folder of images collected from the scanned books.
In this study, five different texts were used to gather 4,000 Manchu letters from 28 writing styles [16-20]. This allowed us to create a dataset with large data variance. Table 1 lists the selected texts, the number of scanned images, and the number of script styles for each text.
Table 1. Number of images and writing styles collected from each text

| Text | Images | Writing Styles |
| --- | --- | --- |
| Gospel of St. Matthew | 79 | 1 |
| Yumin tingzheng | 138 | 6 |
| A Manchu Grammar | 409 | 4 |
| Manchu Written Text | 24 | 1 |
| A Textbook for Reading Documents | 154 | 16 |
| TOTAL | 804 | 28 |
First, the PDF copies of each text were converted into JPG images. Next, each JPG image was visually enhanced using Photoshop and cropped into portions containing only the Manchu script. The images were then preprocessed in Python with erosion and dilation functions to enhance stroke features. Fig. 2 shows a sample of each type of text before preprocessing, and Fig. 3 shows a sample of the preprocessed images. Several different writing styles are present in Li [20]; these appear in the bottom row of both figures.
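A minimal sketch of this preprocessing step, assuming OpenCV; the file name and the 2×2 kernel are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread('page_001.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical filename
kernel = np.ones((2, 2), np.uint8)                      # assumed kernel size

# On dark-stroke/light-background images, erosion (a local minimum filter)
# thickens the dark strokes and closes small gaps; the subsequent dilation
# (a local maximum filter) restores the stroke width.
img = cv2.erode(img, kernel, iterations=1)
img = cv2.dilate(img, kernel, iterations=1)
```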
A total of 804 Manchu script images were cropped, preprocessed, and organized into folders based on the source text. Next, a Python program was developed to select ROIs and match the templates across all images.
Initially, we define an array of letters to be visually located in the Manchu script images. The letter names are used as folder labels for the cropped and saved letter images and are also prepended to the filename of each cropped and saved letter.
For this study, the Manchu letters ‘j,’ ‘u,’ ‘w,’ ‘a,’ and ‘n’ were selected. The word ‘juwan’ means ‘ten’ in Manchu. In some images, the full word ‘juwan’ was present as a unit, making ROI selection simpler. However, in most images, letters were selected separately from different words. Fig. 4 shows the word ‘juwan’ in several styles.
When running, the Python program opens the first image in each folder in full-size grayscale. The selectROIs function provides a crosshair for letter selection in the image. The first letter, ‘j,’ is located and selected. After selection, ‘enter’ or ‘space’ is pressed on the keyboard to advance the selectROIs function to wait for the second letter input, ‘u.’ This process is repeated until all five letters in ‘juwan’ are selected. Fig. 5 illustrates the selection method.
These letter selections are stored in a template array that is used later when matching the templates. After all five letters are selected, ‘ESC’ is pressed on the keyboard to end the selectROIs function and advance to the template-matching step.
If one of the five letters cannot be located in an image, nothing is selected for it, and ‘enter’ or ‘space’ is pressed to advance selectROIs to the next selection. If none of the five letters can be located in an image, nothing is selected, and ‘ESC’ is pressed to end the selectROIs function and proceed directly to template matching.
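A minimal sketch of this selection step, assuming one box is drawn per letter in order; the file name and window title are illustrative.

```python
import cv2

letters = ['j', 'u', 'w', 'a', 'n']
img = cv2.imread('manchu_page_001.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical file

# selectROIs opens a window with a crosshair; draw one box per letter in
# order, pressing ENTER or SPACE after each, and ESC when finished.
rois = cv2.selectROIs('Select j, u, w, a, n', img, showCrosshair=True)
cv2.destroyAllWindows()

# Pair each drawn box with its label and crop the letter template.
templates = {label: img[y:y + h, x:x + w]
             for label, (x, y, w, h) in zip(letters, rois)}
```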
If the image template array created with selectROIs is not empty, each letter template is processed individually. First, if no template exists for a given letter, the letter is skipped, and the template of the next letter is evaluated. When a template exists for a letter, the program locates the letter’s folder in the project’s root directory and stores the folder path as a variable. If a folder for that letter does not yet exist, the program creates a new folder with that letter’s label and stores the folder path as a variable.
After the destination folder path is stored, the original grayscale Manchu-script image and the grayscale letter template are passed to OpenCV’s matchTemplate function. The TM_SQDIFF_NORMED method is used for template matching, which returns an output array of the comparison results.
The matchTemplate function is often used with OpenCV’s minMaxLoc function to determine the single best match for a template in the output array; for TM_SQDIFF and TM_SQDIFF_NORMED, the best match corresponds to the minimum value returned by minMaxLoc. Because we need to find all good matches for a given template in the source image, we instead use NumPy’s ‘np.where’ function to find all x- and y-coordinates where the normalized squared difference is below a given threshold. In this study, the threshold was set to 0.05.
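A minimal sketch of this matching step, assuming grayscale NumPy arrays for the page image and template; the function name is ours.

```python
import cv2
import numpy as np

def match_letter(image, template, threshold=0.05):
    # TM_SQDIFF_NORMED: values near 0 indicate the closest matches.
    result = cv2.matchTemplate(image, template, cv2.TM_SQDIFF_NORMED)
    ys, xs = np.where(result <= threshold)  # all match locations, not just the best
    h, w = template.shape[:2]
    # Return (x1, y1, x2, y2) bounding boxes for every location under the threshold.
    return np.array([[x, y, x + w, y + h] for x, y in zip(xs, ys)])
```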
However, using ‘np.where’ results in multiple overlapping detections for each match. Therefore, we also used a non-maximum suppression function proposed by Malisiewicz et al. [21], which was ported to Python by Rosebrock [22], to filter the best matches from numerous overlapping detections.
The non-maximum suppression function computes the area of each matched bounding box and sorts the boxes by their bottom-right y-coordinates. It then loops through the matches, computing each remaining box’s intersection with the current box (the maximum of the starting x- and y-coordinates and the minimum of the ending x- and y-coordinates), from which the intersection width, height, and overlap ratio are derived. Finally, it deletes from the array any boxes whose overlap ratio exceeds the overlap threshold. This leaves only one (or very few) bounding boxes per matched letter. In this study, the overlap threshold was set to 0.95. An example of the number of detected bounding boxes before and after non-maximum suppression is shown in Fig. 6.
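The following is a condensed sketch of that routine, following Rosebrock’s Python port [22] of the Malisiewicz et al. algorithm [21]; variable names are illustrative.

```python
import numpy as np

def non_max_suppression(boxes, overlap_thresh=0.95):
    # boxes: array of (x1, y1, x2, y2); returns the boxes that survive.
    if len(boxes) == 0:
        return boxes
    boxes = boxes.astype(float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    idxs = np.argsort(y2)              # sort by bottom-right y-coordinate
    keep = []
    while len(idxs) > 0:
        last = len(idxs) - 1
        i = idxs[last]
        keep.append(i)
        # Intersection of the current box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[idxs[:last]])
        yy1 = np.maximum(y1[i], y1[idxs[:last]])
        xx2 = np.minimum(x2[i], x2[idxs[:last]])
        yy2 = np.minimum(y2[i], y2[idxs[:last]])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        overlap = (w * h) / area[idxs[:last]]
        # Drop the current box and every box overlapping it beyond the threshold.
        idxs = np.delete(
            idxs, np.concatenate(([last], np.where(overlap > overlap_thresh)[0])))
    return boxes[keep].astype(int)
```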
The remaining array of matched bounding boxes is returned to the Python program. The program then crops and saves each match from the original image to its letter folder using the stored folder path variable. Each cropped and saved image filename includes the letter name, followed by an underscore, and a timestamp when the image was cropped and saved. Appending a timestamp to each image’s filename was more effective than appending sequential numbers because of the possibility of overwriting previously saved files. Fig. 7 shows a selection of twelve cropped ‘j’ templates.
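A small sketch of this crop-and-save step; the timestamp format and folder layout are illustrative assumptions.

```python
import os
import time
import cv2

def save_matches(image, boxes, letter, out_root='.'):
    # Crop each matched bounding box and save it under the letter's folder.
    folder = os.path.join(out_root, letter)
    os.makedirs(folder, exist_ok=True)
    for (x1, y1, x2, y2) in boxes:
        crop = image[y1:y2, x1:x2]
        # A nanosecond timestamp keeps filenames unique across runs,
        # avoiding the overwriting risk of sequential numbering.
        cv2.imwrite(os.path.join(folder, f'{letter}_{time.time_ns()}.jpg'), crop)
```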
After the first image is processed, the Python program loops sequentially through every remaining image in the given folder, performing the same series of steps listed above. The reason for manual letter selection (Step 1) for every image in the folder is twofold. First, although most source folders contain images with the same handwriting style, multiple handwriting styles were present in the case of Li [20]. Second, even when every image used the same handwriting style, there was sufficient variation between the individual images to require separate processing of each image. When only the first selected image was used for template selection and matching across all images, numerous false positives were detected in later images. Therefore, using only the first image in the folder for template selection and matching is ineffective for the entire folder. Rather, locating one copy of each letter in each image separately and allowing the program to perform template matching for each image sequentially was found to be more effective. Fig. 8 shows a flowchart of the process involved in selecting letters and matching templates for all Manchu script images.
After letter selection, most letters were matched two to three times per page, resulting in most letter folders containing over 1,000 matching images. However, the ‘a’ folder contained nearly 50,000 matches. This was the simplest letter, consisting of only a single stroke protruding to the left of the central stem, and was thus easily overmatched. The ‘n’ folder likewise contained over three times as many matches as most other letter folders.
To correct this overmatching problem in future studies, we suggest two possible solutions. First, because ‘n’ is an extremely common word ending, it might be more effective to combine it with ‘a’ or whatever other letter precedes it. Second, the threshold for the non-maximum suppression algorithm in the previous section should be decreased to 0.5 or even 0.3. Both suggestions would reduce the number of matched images saved by the program. This might also result in a less tedious process of manual folder inspection later.
After all Manchu script images were processed and the matching templates were saved into folders labeled with the letter name, each folder was manually inspected to remove false positives or incorrect matches. Because the ‘a’ and ‘n’ folders contained multiple times the number of matched images as the other folders, a script was run to systematically relocate every nth image into new folders.
For the ‘a’ folder, every tenth image was relocated to a separate folder; for the ‘n’ folder, every sixth image; and for the ‘j’ folder, every second image.
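A minimal sketch of this relocation step, together with the random final sampling described in the next paragraph; function names and the sorted-by-filename ordering are illustrative assumptions.

```python
import os
import random
import shutil

def relocate_every_nth(src, dst, n):
    # Move every nth image (sorted by filename) from src into dst,
    # e.g., n=10 for 'a', n=6 for 'n', and n=2 for 'j'.
    os.makedirs(dst, exist_ok=True)
    for i, name in enumerate(sorted(os.listdir(src))):
        if i % n == 0:
            shutil.move(os.path.join(src, name), os.path.join(dst, name))

def sample_final(folder, k=800):
    # Randomly choose the k images kept for the final dataset.
    return random.sample(sorted(os.listdir(folder)), k)
```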
This systematic relocation resulted in five letter folders with almost 1,000 images each. Additionally, each folder was manually inspected using Windows Explorer with the Medium Icons view, and incorrect matches were discarded. Finally, a Python script was used to randomly select 800 images from each folder for the final dataset. Table 2 lists the number of letters per folder after relocation and manual inspection.
Table 2. Results of Manchu letter selection and matching

| Letter | # of Matches | # Relocated | After Inspection | Final |
| --- | --- | --- | --- | --- |
| j | 2,245 | 1,655 | 828 | 800 |
| u | 998 | 998 | 886 | 800 |
| w | 1,028 | 1,028 | 957 | 800 |
| a | 47,726 | 1,160 | 1,158 | 800 |
| n | 6,717 | 1,073 | 1,013 | 800 |
The resulting 4,000-image dataset was normalized in the same manner as the MNIST digits dataset [23]. Each image was opened in grayscale and binarized, with pixel values under 80 set to 0 (black) and all other values set to 1 (white). The images were then inverted. The Python Imaging Library’s getbbox function was used to find the bounding box around each letter and crop the image to that bounding box. The images were resized to 28 pixels on the long side, and padding was added to create square 28×28-pixel images. Fig. 9 shows a sample of both the original and normalized datasets for the letter ‘j.’
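A minimal PIL sketch of this normalization, using 0/255 pixel values rather than 0/1 for display convenience; the function name is ours.

```python
from PIL import Image

def normalize_letter(path):
    img = Image.open(path).convert('L')
    # Binarize and invert in one step: values under 80 become white (the
    # letter), all others black (the background).
    img = img.point(lambda p: 255 if p < 80 else 0)
    # getbbox finds the box around the nonzero (letter) pixels.
    img = img.crop(img.getbbox())
    # Scale the long side to 28 px, preserving the aspect ratio.
    w, h = img.size
    scale = 28 / max(w, h)
    img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))))
    # Pad the short side with black to produce a square 28x28 image.
    canvas = Image.new('L', (28, 28), 0)
    canvas.paste(img, ((28 - img.width) // 2, (28 - img.height) // 2))
    return canvas
```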
Finally, a simple neural network was built and trained to verify the effectiveness of this Manchu dataset creation method.
First, the dataset was loaded into an array of (image, label) tuples using the folder name as the label. Every fifth image was placed in the test dataset. The remaining images were used as the training dataset. The test dataset contained 160 copies of each letter (800 images in total), and the training dataset contained 640 copies of each letter (3,200 images in total). Images were converted to PyTorch Tensors, and the labels were integer-encoded. The letter ‘j’ was encoded as 0, ‘u’ as 1, ‘w’ as 2, ‘a’ as 3, and ‘n’ as 4. Finally, data loaders were created for the training and test sets with a batch size of 100, and the shuffle parameter was set to True.
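A condensed sketch of this split and encoding, assuming samples is a list of (28×28 array, letter) pairs read from the letter folders.

```python
import torch
from torch.utils.data import DataLoader

encode = {'j': 0, 'u': 1, 'w': 2, 'a': 3, 'n': 4}
train, test = [], []
for i, (img, letter) in enumerate(samples):
    tensor = torch.tensor(img, dtype=torch.float32).unsqueeze(0)  # shape (1, 28, 28)
    # Every fifth image goes to the test set (800 test, 3,200 training).
    (test if i % 5 == 0 else train).append((tensor, encode[letter]))

train_loader = DataLoader(train, batch_size=100, shuffle=True)
test_loader = DataLoader(test, batch_size=100, shuffle=True)
```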
The neural network was built and trained in PyTorch and was based on LeNet-5 [24] with some modifications. The network consisted of two convolutional layers, each followed by a ReLU activation function and a max-pooling layer. Both convolutional layers used a kernel size of 5, a stride of 1, and a padding of 2. The first convolutional layer had one input channel and 16 output channels; the second had 16 and 32, respectively. Both max-pooling layers used a kernel size of 2 and a stride of 2. The output of the second convolutional layer was flattened and sent to the final fully connected layer, which had 1,568 inputs and five outputs, one per letter.
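A PyTorch sketch consistent with this description (the class name is ours); with these shapes, the parameter count works out to the stated 21,093.

```python
import torch.nn as nn

class ManchuNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, 5)  # 1,568 inputs, 5 letters

    def forward(self, x):
        return self.classifier(self.features(x).flatten(start_dim=1))
```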
The network model is relatively small, with a total of 21,093 parameters. However, this small model was employed to assess the effectiveness of the data collection and labeling approaches. Fig. 10 provides a summary of the model.
The neural network was trained for 50 epochs with a batch size of 100. Stochastic gradient descent was used with cross-entropy loss, a learning rate of 0.01, and a momentum of 0.5. After training, the model achieved an overall accuracy of 98.75% on the test dataset. Fig. 11 shows the comprehensive test results for each letter, as well as the overall accuracy of the model’s predictions.
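A sketch of this training loop, reusing ManchuNet and train_loader from the earlier sketches.

```python
import torch
import torch.nn as nn

model = ManchuNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

for epoch in range(50):
    for images, labels in train_loader:   # batches of 100
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```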
After evaluating the classification accuracy of the model, a script was created to randomly display a series of images from the test dataset with both predicted and true labels. The images were output in a 10 × 10 grid with predicted labels on the left and true labels on the right in parentheses. Correct predictions are displayed in green and incorrect predictions are shown in red. A sample of this 10 × 10 grid of predicted and labeled images is shown in Fig. 12.
The focus of this study was to describe the process of creating a dataset of Manchu script letters for machine learning. The process involves manually selecting Manchu script letters from script images and then performing template matching to extract and label matching copies of each selected letter. In this study, the size of the created dataset was intentionally kept small (using only five letters) to quickly assess the effectiveness of the method. The dataset was normalized and trained using a simple CNN model. Likewise, the CNN was also kept small to focus on the data collection method rather than on the neural network architecture.
Unlike other studies, the data collection method used here does not employ automatic letter segmentation based on stroke features or the central stem of a word. Rather, it relies on the manual identification of Manchu letters in an image of Manchu script. This is an important distinction because it means that letters in multiple handwriting styles can be extracted and labeled in the same way.
Although this method of Manchu letter extraction and labeling is time-consuming, manual letter selection for each image also eliminates the need to sort and label letters that were segmented using an automatic extraction process. Additionally, the process outlined in this study provides an effective approach for gathering data from various handwriting styles across numerous source texts.
Therefore, we expect to create a full dataset of Manchu script letters with an unlimited number of writing styles using this method. This is significant because no Manchu script letter dataset currently exists that can be used for machine learning. Additionally, this method may prove beneficial for creating datasets of other script-based languages, such as Urdu, Pashto, Bangla, Kannada, and ancient Mongolian, which are often referenced in Manchu OCR studies.
Aaron Daniel Snowberger
Aaron Daniel Snowberger received his Ph.D. degree in Information and Communication Engineering from Hanbat National University, Korea, in 2024. He received his B.S. degree in Computer Science from the College of Engineering, University of Wyoming, USA, in 2006 and his M.F.A. degree in Media Design from Full Sail University, USA, in 2011. He taught at Jeonju University from 2010 to 2023 and now lectures in computer science elsewhere. His research interests include computer vision, natural language processing, signal processing, machine and deep learning, and language.
Choong Ho Lee
Choong Ho Lee received his B.S. and M.S. degrees from Yonsei University, Seoul, Korea, in 1985 and 1987, respectively, and his Ph.D. degree in Information Science from Tohoku University, Sendai, Japan, in 1998. He worked as a researcher at KT from 1987 to 2000. He has been at Hanbat National University, Daejeon, Korea, since 2000, where he is presently a Professor in the Department of Information and Communication Engineering. His research interests include digital image processing, computer vision, machine learning, big data analysis, and software education.