
Regular paper


Journal of information and communication convergence engineering 2024; 22(1): 80-87

Published online March 31, 2024

https://doi.org/10.56977/jicce.2024.22.1.80

© Korea Institute of Information and Communication Engineering

Manchu Script Letters Dataset Creation and Labeling

Aaron Daniel Snowberger and Choong Ho Lee*, Member, KIICE

Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea

Correspondence to: Choong Ho Lee (E-mail: chlee@hanbat.ac.kr)
Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea

Received: June 7, 2023; Revised: August 25, 2023; Accepted: September 25, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The Manchu language holds historical significance, but a complete dataset of Manchu script letters for training optical character recognition machine-learning models is currently unavailable. Therefore, this paper describes the process of creating a robust dataset of extracted Manchu script letters. Rather than performing automatic letter segmentation based on whitespace or the thickness of the central word stem, an image of the Manchu script was manually inspected, and one copy of the desired letter was selected as a region of interest. This selected region of interest was used as a template to match all other occurrences of the same letter within the Manchu script image. Although the dataset in this study contained only 4,000 images of five Manchu script letters, these letters were collected from twenty-eight writing styles. A full dataset of Manchu letters is expected to be obtained through this process. The collected dataset was normalized and trained using a simple convolutional neural network to verify its effectiveness.

Keywords: Character Extraction, Data Collection, Dataset Creation, Manchu Characters, Template Matching

I. INTRODUCTION

The Manchu language served as the lingua franca of the Qing Dynasty for over 200 years, facilitating trade and exchange throughout Asia and Europe. During that time, scholars in Russia, Joseon Korea, and Tokugawa Japan, as well as European missionaries, took a keen interest in Manchu for its usefulness in making the Chinese language more accessible [1].

However, today, only about 50 people can still read and write Manchu [2]. This implies a significant risk of losing numerous historical documents. Estimates indicate that nearly 20 percent of the 10 million documents in the First Historical Archives of China in Beijing were written in Manchu. In the provincial archive of Harbin, there may be up to 60 tons of Manchu documents [3]. There are also large archives of Qing Dynasty documents in Shenyang, Taipei, Daliang, Changchun, and Mongolia [4], as well as overseas archives in Japan, Europe, and North America [5].

Three major factors complicate the recovery of these documents. First, Manchu is a dying language. The last generation of native speakers is slowly aging out of the population. Second, few scholars have actively studied Manchu or conducted translation work using Manchu. Third, the vast amount of archival data is overwhelming. One estimate suggests that even if 100 people spent 100 years translating all Manchu archives, they would still be unable to finish [3].

Optical character recognition (OCR) and machine-learning technologies are therefore becoming increasingly important for preserving and translating historical documents such as those written in Manchu. However, no dataset of extracted and labeled Manchu script letters is currently available to aid in building an OCR model. This study therefore outlines a process for simultaneously extracting and labeling Manchu letters from images of Manchu script to build and normalize a machine-learning dataset. The dataset gathered and normalized for this study was trained using a simple neural network to confirm the viability of the approach.

II. RELATED WORK

Since the early 2000s, a variety of studies have been conducted on Manchu script recognition. These studies typically fall into one of three categories. The first focuses on recognizing Manchu letters based on strokes [6-8]. The second focuses on the segmentation or extraction of Manchu letters from words for recognition [9-10]. The third focuses on the segmentation-free recognition of Manchu words [11-14]. Of these three approaches, the segmentation-free method appears to be the most intuitive, given the complexity of stroke and letter classifications. However, the segmentation-free recognition of Manchu words is not free of difficulties.

One difficulty is the need for a significantly larger dataset of Manchu words for training, as opposed to a dataset containing only letters or strokes. However, because public Manchu datasets are largely unavailable, custom datasets are typically created for research purposes. Huang et al. [11] found that creating a Manchu word dataset often results in an imbalance, lacking sufficient word quantity for effective training, and thus necessitates augmentation.

The second difficulty is the possibility of additional noise or deformations appearing in Manchu word images when they are resized, by squashing or stretching, to match the input shape of the convolutional neural network (CNN). Zheng et al. [12] attempted to solve this problem by introducing a CNN with spatial pyramid pooling (SPP) rather than max-pooling in the final layer. The SPP layer contains three independent pooling layers, each dividing an input feature map into patches and creating a fixed-dimensional vector that can be input into the last fully connected layer of the CNN. This allows the recognition of Manchu words of arbitrary sizes without segmentation.
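As a rough illustration of the pooling idea, the following PyTorch sketch shows an SPP layer of this kind. It is not the exact configuration used in [12]; the three pyramid levels (1×1, 2×2, and 4×4) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SpatialPyramidPooling(nn.Module):
        """Three independent pooling levels turn a feature map of any
        spatial size into one fixed-length vector (levels are assumed)."""
        def __init__(self, levels=(1, 2, 4)):
            super().__init__()
            self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(n) for n in levels)

        def forward(self, x):                # x: (batch, channels, H, W)
            return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)

    # A 32-channel map of any H x W becomes a 32*(1+4+16) = 672-dim vector:
    # SpatialPyramidPooling()(torch.randn(1, 32, 17, 23)).shape == (1, 672)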

The third difficulty concerns the algorithmic complexity of segmentation-free Manchu word classifiers. For example, Zhang et al. [14] presented a 10-step algorithm for performing segmentation-free recognition with a deep CNN, where two steps of the algorithm required two separate appendices. The first appendix explains the seventh step, an extraction method for the vector outline of the j-th letter of a Manchu word, in ten additional steps. The second appendix explains the eighth step, which calculates the connection strength between the i-th contrast image and the j-th letter, in six additional steps.

Therefore, despite the growing number of segmentation-free studies and their promising results in recent years, they remain far from a validated approach for Manchu word recognition. Accordingly, in a previous study [15], we attempted a new type of Manchu script letter segmentation with limited success. In that study, we used Python to segment images of Manchu script into lines and words based on the surrounding white space. We then segmented each Manchu word into individual letters by cropping the image horizontally at the vertical pixel rows containing the lowest number of black pixels. In other words, we cropped the image so that no strokes vertically overlapped with the central word stem. Fig. 1 illustrates this segmentation method, and a code sketch follows below. Three cropping locations are displayed in the histogram, where the word is divided into letters; these cuts fall at vertical pixel rows 7, 13, and 25, cropping the word into four letters.

Fig. 1. Manchu word segmentation according to the lowest count of black pixels in each horizontal row of the image.
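For illustration, a minimal sketch of this projection-based cropping is shown below. It is not the code used in [15]; it assumes a grayscale word image in which ink pixels are dark, and the minimum-gap parameter min_gap is a hypothetical detail.

    import cv2
    import numpy as np

    def segment_word_by_projection(word_img, min_gap=3):
        # Count ink pixels in each vertical pixel row (a projection profile).
        profile = (word_img < 128).sum(axis=1)
        # Candidate cut rows: rows whose ink count is (near-)minimal.
        candidates = np.where(profile <= profile.min() + 1)[0]
        cuts, last = [], -min_gap
        for r in candidates:
            if r - last >= min_gap:            # keep cut rows apart
                cuts.append(int(r))
                last = r
        bounds = [0] + cuts + [word_img.shape[0]]
        return [word_img[a:b] for a, b in zip(bounds, bounds[1:]) if b - a > 1]

    # letters = segment_word_by_projection(
    #     cv2.imread("word.png", cv2.IMREAD_GRAYSCALE))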

Although the segmentation method described above precisely separated lines and words of text, two additional problems arose. First, when applied to a collection of varying handwriting styles, the method proved ineffective for segmenting letters. This was particularly true in cases where long letter strokes overlapped vertically into the horizontal space of other letters. Second, even when the segmentation method worked perfectly, the issue of labeling the segmented data persisted.

Therefore, this paper presents a process for the simultaneous extraction and labeling of Manchu script letters to create a machine-learning dataset. Although the method presented here requires the manual identification of Manchu script letters in an image, it is effective for numerous documents and handwriting styles.

To verify the effectiveness of this extraction and labeling method, a small dataset of Manchu script letters was collected, normalized, and trained using a small neural network. After training for 50 epochs, the network classified the test images with an accuracy of 98.75%.

III. SYSTEM MODEL AND METHODS

The proposed process for Manchu script letter extraction and labeling uses OpenCV’s selectROIs and matchTemplate functions. First, the letters to be collected are defined in a label array. Next, selectROIs is used to select a region of interest (ROI) containing a single Manchu letter. The selected ROI is then used as a template in the matchTemplate function to locate matching copies in the given image. Because overlapping bounding boxes may exist for a matched template, a non-maximum suppression algorithm is used to minimize repeated matches.

The resulting matched bounding boxes are used to crop and save copies of the letters in labeled folders. This pattern is repeated for each image of Manchu script in each folder of images collected from scanned books.
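The following minimal Python sketch (not the authors’ released code) illustrates this loop. The helper functions match_letter, non_max_suppression, and save_matches are illustrative names sketched in the subsections below, and the sketch assumes all five letters are selected in order; the actual program also handles skipped letters, as described in the next subsection.

    import os
    import cv2

    LETTERS = ["j", "u", "w", "a", "n"]   # label array for the letters to collect

    def process_folder(folder):
        for name in sorted(os.listdir(folder)):
            img = cv2.imread(os.path.join(folder, name), cv2.IMREAD_GRAYSCALE)
            # One ROI is drawn per letter; 'enter'/'space' confirms each box
            # and 'ESC' ends the selection stage.
            rois = cv2.selectROIs("select letters", img)
            cv2.destroyAllWindows()
            for letter, (x, y, w, h) in zip(LETTERS, rois):
                template = img[y:y + h, x:x + w]
                boxes = match_letter(img, template)    # template matching
                boxes = non_max_suppression(boxes)     # filter overlapping boxes
                save_matches(img, boxes, letter)       # crop into the letter folder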

A. Data Collection

In this study, five different texts were used to gather 4,000 Manchu letters from 28 writing styles [16-20]. This allowed us to create a dataset with large data variance. Table 1 lists the selected texts, the number of scanned images, and the number of script styles for each text.

Table 1. Number of images and writing styles collected from each text

Text                                Images   Writing styles
Gospel of St. Matthew                   79                1
Yumen tingzheng                        138                6
A Manchu Grammar                       409                4
Manchu Written Text                     24                1
A Textbook for Reading Documents       154               16
TOTAL                                  804               28


First, the PDF copies of each text were converted into JPG images. Next, each JPG image was visually enhanced using Photoshop and cropped into portions containing only the Manchu script. The images were then preprocessed in Python with erosion and dilation functions to enhance stroke features. Fig. 2 shows a sample of each type of text before preprocessing. Fig. 3 shows a sample of the preprocessed images. Various writing styles are present in Li [20] and are shown in the bottom row of both figures.

Fig. 2. JPG images before preprocessing. Top row from left to right: Gospel of St. Matthew [16], Yumen tingzheng [17], A Manchu Grammar with Analysed Texts [18], Manchu Written Text [19]. Bottom row: Manchu: A Textbook for Reading Documents [20].

Fig. 3. Preprocessed images. Top row from left to right: Gospel of St. Matthew [16], Yumen tingzheng [17], A Manchu Grammar with Analysed Texts [18], Manchu Written Text [19]. Bottom row: Manchu: A Textbook for Reading Documents [20].

B. Dataset Creation

1) Letter Selection

A total of 804 Manchu script images were cropped, preprocessed, and organized into folders based on the source text. Next, a Python program was developed to select ROIs and match the templates across all images.

Initially, we define an array of letters to be visually located in the Manchu script images. The letter names are used as folder labels for the cropped and saved letter images. Each letter name is also prepended to the filename of its cropped and saved letter images.

For this study, the Manchu letters ‘j,’ ‘u,’ ‘w,’ ‘a,’ and ‘n’ were selected. The word ‘juwan’ means ‘ten’ in Manchu. In some images, the full word ‘juwan’ was present as a unit, making ROI selection simpler. However, in most images, letters were selected separately from different words. Fig. 4 shows the word ‘juwan’ in several styles.

Fig. 4. ‘Juwan,’ the Manchu word for ‘ten,’ in several styles (left), and the division of individual letters making up the word (right).

When running, the Python program opens the first image in each folder in full-size grayscale. The selectROIs function provides a crosshair for letter selection in the image. The first letter, ‘j,’ is located and selected. After selection, ‘enter’ or ‘space’ is pressed on the keyboard to advance the selectROIs function to wait for the second letter input, ‘u.’ This process is repeated until all five letters in ‘juwan’ are selected. Fig. 5 illustrates the selection method.

Fig. 5. A Manchu script image with a selection of the Manchu letter ‘j’.

These letter selections are stored in a template array that is used later when matching the templates. After all five letters are selected, ‘ESC’ is pressed on the keyboard to end the selectROIs function and advance to the template-matching stage.

If one of the five letters cannot be located in an image, nothing is selected for it, and ‘enter’ or ‘space’ is pressed on the keyboard to advance selectROIs to the next selection. If none of the five letters can be located in an image, nothing is selected, and ‘ESC’ is pressed on the keyboard to end the selectROIs function and advance to the template-matching stage.

2) Template Matching

If the image template array created with selectROIs is not empty, each letter template is processed individually. If no template exists for a given letter, that letter is skipped, and the template of the next letter is evaluated. When a template exists for a letter, the program locates the letter’s folder in the project’s root directory and stores the folder path in a variable. If a folder for that letter does not yet exist, the program creates a new folder with that letter’s label and stores the folder path in a variable.

After the destination folder path is stored, the original grayscale Manchu script image and the grayscale letter template are passed to OpenCV’s matchTemplate function. The TM_SQDIFF_NORMED method is used for template matching, which returns an output array of the comparison results.

The matchTemplate function is often used with OpenCV’s minMaxLoc function to determine the single best match for the template in the output array. For TM_SQDIFF and TM_SQDIFF_NORMED, the best match corresponds to the minimum value returned by minMaxLoc. Because we need to find all good matches for a given template in the source image, we instead use NumPy’s ‘np.where’ function to find all x- and y-coordinates where the normalized squared difference is below a given threshold. In this study, the threshold was set to 0.05.
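A minimal sketch of this matching step is shown below (illustrative code, not the authors’ released program); the function name match_letter is an assumption carried over from the earlier skeleton.

    import cv2
    import numpy as np

    def match_letter(image, template, threshold=0.05):
        # TM_SQDIFF_NORMED scores each location; 0.0 is a perfect match.
        h, w = template.shape[:2]
        result = cv2.matchTemplate(image, template, cv2.TM_SQDIFF_NORMED)
        ys, xs = np.where(result <= threshold)   # keep all close matches
        # Candidate boxes (x1, y1, x2, y2), before non-maximum suppression.
        return np.array([(x, y, x + w, y + h) for x, y in zip(xs, ys)])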

However, using ‘np.where’ results in multiple overlapping detections for each match. Therefore, we also used a non-maximum suppression function proposed by Malisiewicz et al. [21], which was ported to Python by Rosebrock [22], to filter the best matches from numerous overlapping detections.

3) Non-Maximum Suppression

The non-maximum suppression function computes the area of each matched bounding box and sorts the boxes by their bottom-right y-coordinates. It then loops through the matches and calculates the maximum x- and y-coordinates for the start of each bounding box and the minimum x- and y-coordinates for the end of each bounding box, which together define the intersection of each pair of boxes. From the resulting width and height of each intersection, it computes the overlap ratio of the bounding boxes. Finally, it deletes from the array any matching bounding boxes that exceed the overlap threshold, leaving only one (or very few) bounding boxes per matched letter. In this study, the overlap threshold was set to 0.95. An example of the number of detected bounding boxes before and after non-maximum suppression is shown in Fig. 6.

Fig. 6. Example of performing non-maximum suppression on detected bounding boxes with an overlap threshold of 0.95.
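The sketch below paraphrases that function, following Rosebrock’s port [22] of the approach by Malisiewicz et al. [21]:

    import numpy as np

    def non_max_suppression(boxes, overlap_thresh=0.95):
        if len(boxes) == 0:
            return boxes
        boxes = boxes.astype(float)
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        area = (x2 - x1 + 1) * (y2 - y1 + 1)
        idxs = np.argsort(y2)                      # sort by bottom-right y
        keep = []
        while len(idxs) > 0:
            last = len(idxs) - 1
            i = idxs[last]
            keep.append(i)
            # Intersection of the picked box with every remaining box.
            xx1 = np.maximum(x1[i], x1[idxs[:last]])
            yy1 = np.maximum(y1[i], y1[idxs[:last]])
            xx2 = np.minimum(x2[i], x2[idxs[:last]])
            yy2 = np.minimum(y2[i], y2[idxs[:last]])
            w = np.maximum(0, xx2 - xx1 + 1)
            h = np.maximum(0, yy2 - yy1 + 1)
            overlap = (w * h) / area[idxs[:last]]
            # Drop the picked box and everything overlapping it too much.
            idxs = np.delete(idxs, np.concatenate(
                ([last], np.where(overlap > overlap_thresh)[0])))
        return boxes[keep].astype(int)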

4) Cropping and Saving Matches

The remaining array of matched bounding boxes is returned to the Python program. The program then crops and saves each match from the original image to its letter folder using the stored folder path variable. Each cropped and saved image filename includes the letter name, followed by an underscore and a timestamp recording when the image was cropped and saved. Appending a timestamp to each filename was more effective than appending sequential numbers, which risked overwriting previously saved files. Fig. 7 shows a selection of twelve cropped ‘j’ templates.

Fig. 7. Sample of matched and cropped images for the Manchu letter ‘j.’
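A sketch of this cropping and saving step is given below; the folder layout under dataset/ and the timestamp format are illustrative assumptions.

    import os
    from datetime import datetime

    import cv2

    def save_matches(image, boxes, letter, out_root="dataset"):
        folder = os.path.join(out_root, letter)
        os.makedirs(folder, exist_ok=True)
        for (x1, y1, x2, y2) in boxes:
            # Microsecond timestamps avoid overwriting earlier files.
            stamp = datetime.now().strftime("%Y%m%d%H%M%S%f")
            cv2.imwrite(os.path.join(folder, f"{letter}_{stamp}.jpg"),
                        image[y1:y2, x1:x2])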

5) Processing All Images in a Source Folder

After the first image is processed, the Python program loops sequentially through every remaining image in the given folder, performing the same series of steps listed above. The reason for manual letter selection (Step 1) for every image in the folder is twofold. First, although most source folders contain images with the same handwriting style, multiple handwriting styles were present in the case of Li [20]. Second, even when every image used the same handwriting style, there was sufficient variation between the individual images to require separate processing of each image. When only the first selected image was used for template selection and matching across all images, numerous false positives were detected in later images. Therefore, using only the first image in the folder for template selection and matching is ineffective for the entire folder. Rather, locating one copy of each letter in each image separately and allowing the program to perform template matching for each image sequentially was found to be more effective. Fig. 8 shows a flowchart of the process involved in selecting letters and matching templates for all Manchu script images.

Fig. 8. Manchu letter selection and template-matching process.

6) Matching Results

After letter selection, most letters were matched two to three times per page, resulting in most letter folders containing over 1,000 matched images. However, the letter ‘a’ folder contained nearly 50,000 matches. This letter has the simplest shape, containing only a single stroke protruding to the left of the central stem; thus, it was easily overmatched. Likewise, the ‘n’ folder contained over three times as many matches as the other letter folders.

To correct this overmatching problem in future studies, we suggest two possible solutions. First, because ‘n’ is an extremely common word ending, it might be more effective to combine it with ‘a’ or whatever other letter precedes it. Second, the threshold for the non-maximum suppression algorithm in the previous section should be decreased to 0.5 or even 0.3. Both suggestions would reduce the number of matched images saved by the program. This might also result in a less tedious process of manual folder inspection later.

7) Manual Inspection

After all Manchu script images were processed and the matched templates were saved into folders labeled with the letter name, each folder was manually inspected to remove false positives and incorrect matches. Because the ‘a’ and ‘n’ folders contained several times as many matched images as the other folders, a script was run to systematically relocate every nth image into a new folder.

For the ‘a’ folder, every tenth image was relocated to a separate folder; for the ‘n’ folder, every sixth image; and for the ‘j’ folder, every second image, as sketched below.
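A minimal sketch of such a thinning script is shown below; the folder names are illustrative assumptions.

    import os
    import shutil

    def relocate_every_nth(src, dst, n):
        # Move every n-th file (sorted by name) into a new folder.
        os.makedirs(dst, exist_ok=True)
        for i, name in enumerate(sorted(os.listdir(src))):
            if i % n == 0:
                shutil.move(os.path.join(src, name), os.path.join(dst, name))

    # relocate_every_nth("matches/a", "relocated/a", 10)  # every tenth image
    # relocate_every_nth("matches/n", "relocated/n", 6)   # every sixth image
    # relocate_every_nth("matches/j", "relocated/j", 2)   # every second image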

This systematic relocation resulted in five letter folders with almost 1,000 images each. Additionally, each folder was manually inspected using Windows Explorer with the Medium Icons view, and incorrect matches were discarded. Finally, a Python script was used to randomly select 800 images from each folder for the final dataset. Table 2 lists the number of letters per folder resulting from relocating the letters and manually inspecting each folder.

Table 2. Results of Manchu letter selection and matching

Letter   # of matches   # relocated   After inspection   Final
j               2,245         1,655                828     800
u                 998           998                886     800
w               1,028         1,028                957     800
a              47,726         1,160              1,158     800
n               6,717         1,073              1,013     800


8) Dataset Normalization

The resulting 4,000-image dataset was normalized in the style of the MNIST digits dataset [23]. Each image was opened in grayscale and binarized, with pixel values under 80 set to 0 (black) and all other values set to 1 (white). The images were then inverted. The Python Imaging Library’s getbbox function was used to find the bounding box around each letter and crop the image to that bounding box. The images were resized to a length of 28 pixels on the long side, and padding was added to create square images of 28×28 pixels. Fig. 9 shows a sample of both the original dataset and the normalized dataset for the letter ‘j.’

Fig. 9. Sample of matched images for the letter ‘j’ (left), and normalized images (right).
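A sketch of this normalization pipeline is shown below (illustrative code; the exact resampling and padding details of the original script are assumptions).

    from PIL import Image

    def normalize_letter(path, size=28, threshold=80):
        img = Image.open(path).convert("L")
        img = img.point(lambda p: 0 if p < threshold else 255)  # binarize
        img = Image.eval(img, lambda p: 255 - p)                # invert ink to white
        bbox = img.getbbox()                                    # box around the letter
        if bbox:
            img = img.crop(bbox)
        w, h = img.size
        scale = size / max(w, h)                                # long side -> 28 px
        img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))))
        canvas = Image.new("L", (size, size), 0)                # black padding
        canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
        return canvas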

9) Verification of Dataset Effectiveness

Finally, a simple neural network was built and trained to verify the effectiveness of this Manchu dataset creation method.

First, the dataset was loaded into an array of (image, label) tuples using the folder name as the label. Every fifth image was placed in the test dataset. The remaining images were used as the training dataset. The test dataset contained 160 copies of each letter (800 images in total), and the training dataset contained 640 copies of each letter (3,200 images in total). Images were converted to PyTorch Tensors, and the labels were integer-encoded. The letter ‘j’ was encoded as 0, ‘u’ as 1, ‘w’ as 2, ‘a’ as 3, and ‘n’ as 4. Finally, data loaders were created for the training and test sets with a batch size of 100, and the shuffle parameter was set to True.
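A sketch of this loading and splitting step is shown below; the file layout and JPG extension are assumptions carried over from the earlier sketches.

    import glob
    import os

    import numpy as np
    import torch
    from PIL import Image
    from torch.utils.data import DataLoader

    LABELS = {"j": 0, "u": 1, "w": 2, "a": 3, "n": 4}   # integer encoding

    def load_split(root="dataset"):
        train, test = [], []
        for letter, idx in LABELS.items():
            files = sorted(glob.glob(os.path.join(root, letter, "*.jpg")))
            for i, f in enumerate(files):
                img = torch.from_numpy(
                    np.array(Image.open(f), dtype=np.float32) / 255.0)
                sample = (img.unsqueeze(0), idx)     # (1, 28, 28) tensor and label
                (test if i % 5 == 0 else train).append(sample)  # every fifth -> test
        return train, test

    train_set, test_set = load_split()
    train_loader = DataLoader(train_set, batch_size=100, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=100, shuffle=True)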

The neural network was built and trained in PyTorch and was based on LeNet-5 [24] with some modifications. The network consisted of two convolutional layers, each followed by a ReLU activation function and a max-pooling layer. Both convolutional layers used a kernel size of 5, a stride of 1, and a padding of 2. The first convolutional layer had one input channel and 16 output channels; the second had 16 input channels and 32 output channels. Both max-pooling layers used a kernel size of two and a stride of two. The output of the second convolutional block was flattened and sent to the final fully connected layer, which had 1,568 inputs and five outputs, one per letter.
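This description corresponds to a model along the following lines (a sketch; the class and layer names are assumptions):

    import torch.nn as nn

    class ManchuNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),   # 28x28 -> 14x14
                nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),   # 14x14 -> 7x7
            )
            self.classifier = nn.Linear(32 * 7 * 7, 5)   # 1,568 inputs, 5 letters

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

Counting parameters for this sketch gives 416 + 12,832 + 7,845 = 21,093, matching the total reported below.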

The network model is relatively small, with a total of 21,093 parameters. However, this small model was employed to assess the effectiveness of the data collection and labeling approaches. Fig. 10 provides a summary of the model.

Fig. 10. Model summary for the LeNet-5-based CNN used to train this dataset.

IV. RESULTS

The neural network was trained for 50 epochs with a batch size of 100. Stochastic gradient descent was used with cross-entropy loss, a learning rate of 0.01, and a momentum of 0.5. After training, the model was evaluated on the test dataset with an overall accuracy of 98.75%. Fig. 11 shows the comprehensive test results for each letter, as well as the overall accuracy of the model’s predictions.

Fig. 11. Comprehensive accuracy measurements for each letter in the test dataset as well as the overall accuracy of the model’s predictions.
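A minimal sketch of this training and evaluation loop, reusing the loaders and model sketched above, might look as follows:

    import torch
    import torch.nn as nn

    model = ManchuNet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(50):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    # Evaluate overall accuracy on the held-out test set.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(1) == labels).sum().item()
            total += labels.size(0)
    print(f"test accuracy: {correct / total:.2%}")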

After evaluating the classification accuracy of the model, a script was created to randomly display a series of images from the test dataset with both predicted and true labels. The images were output in a 10 × 10 grid with predicted labels on the left and true labels on the right in parentheses. Correct predictions are displayed in green and incorrect predictions are shown in red. A sample of this 10 × 10 grid of predicted and labeled images is shown in Fig. 12.

Fig. 12. Randomly selected images from the test dataset with the model’s classification predictions.

V. CONCLUSION

The focus of this study was to describe the process of creating a dataset of Manchu script letters for machine learning. The process involves manually selecting Manchu script letters from script images and then performing template matching to extract and label matching copies of each selected letter. In this study, the size of the created dataset was intentionally kept small (using only five letters) to quickly assess the effectiveness of the method. The dataset was normalized and trained using a simple CNN model. Likewise, the CNN was also kept small to focus on the data collection method rather than on the neural network architecture.

The data collection method in this study is unique in that it does not employ automatic letter segmentation based on stroke features or the central stem of a word, unlike other studies. Rather, it relies on the manual identification of Manchu letters in an image of Manchu script. This is an important distinction because it implies that letters in multiple handwriting styles can be similarly extracted and labeled.

Although this method of Manchu letter extraction and labeling is time-consuming, manual letter selection for each image also eliminates the need to sort and label letters that were segmented using an automatic extraction process. Additionally, the process outlined in this study provides an effective approach for gathering data from various handwriting styles across numerous source texts.

Therefore, we expect to create a full dataset of Manchu script letters with an unlimited number of writing styles using this method. This is significant because no Manchu script letter dataset currently exists that can be used for machine learning. Additionally, this method may prove beneficial for creating datasets of other script-based languages, such as Urdu, Pashto, Bangla, Kannada, and ancient Mongolian, which are often referenced in Manchu OCR studies.

REFERENCES

  1. M. Saarela, “The early modern travels of Manchu: A script and its study in East Asia and Europe,” Philadelphia: University of Pennsylvania Press, 2020. DOI: 10.9783/9780812296938.
  2. “Manchu ethnologue,” Internet Archive, 2016, [Online], Available: https://web.archive.org/web/20161217235916/https://www.ethnologue.com/18/language/mnc/.
  3. D. Lague, “Manchu language lives mostly in archives,” The New York Times, 17 Mar. 2007, [Online], Available: https://www.nytimes.com/2007/03/17/world/asia/18manchu_side.html.
  4. J. Miyawaki-Okada, “Report on the Manchu documents stored at the Mongolian national central archives of history,” Saksaha: A Journal of Manchu Studies, vol. 4, 1999. DOI: 10.3998/saksaha.13401746.0004.002.
  5. M. C. Elliott, “The Manchu-language archives of the Qing Dynasty and the origins of the palace memorial system,” Late Imperial China, vol. 22, no. 1, pp. 1-70, Jun. 2001. DOI: 10.1353/late.2001.0002.
  6. G. Y. Zhang, J. J. Li, R. W. He, and A. X. Wang, “An offline recognition method of handwritten primitive Manchu characters based on strokes,” in Ninth International Workshop on Frontiers in Handwriting Recognition, Kokubunji, Japan, pp. 432-437, 2004. DOI: 10.1109/IWFHR.2004.16.
  7. G. Y. Zhang, J. J. Li, and A. X. Wang, “A new recognition method for the handwritten Manchu character unit,” in Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, Dalian, China, pp. 3339-3344, 2006. DOI: 10.1109/ICMLC.2006.258471.
  8. S. Xu, M. Li, and M. Q. Zhu, “Manchu text extract based on fuzzy clustering,” Information Technology Journal, vol. 12, no. 24, pp. 8323-8327, Dec. 2013. DOI: 10.3923/itj.2013.8323.8327.
  9. S. Xu, G. Q. Qi, M. Li, R. R. Zheng, and C. John, “An improved Manchu character recognition method,” Journal of Mechanical Engineering Research and Developments, vol. 39, no. 2, pp. 536-543, 2016. DOI: 10.7508/jmerd.2016.02.033.
  10. S. Xu, M. Li, R. R. Zheng, and S. Michael, “Manchu character segmentation and recognition method,” Journal of Discrete Mathematical Sciences and Cryptography, vol. 20, no. 1, pp. 43-53, Dec. 2016. DOI: 10.1080/09720529.2016.1177965.
  11. D. Huang, M. Li, R. Zheng, S. Xu, and J. Bi, “Synthetic data and DAG-SVM classifier for segmentation-free Manchu word recognition,” in 2017 International Conference on Computing Intelligence and Information System (CIIS), Nanjing, China, pp. 46-50, 2017. DOI: 10.1109/CIIS.2017.15.
  12. M. Li, R. Zheng, S. Xu, Y. Fu, and D. Huang, “Manchu word recognition based on convolutional neural network with spatial pyramid pooling,” in 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, pp. 1-6, 2018. DOI: 10.1109/CISP-BMEI.2018.8633131.
  13. R. Zheng, M. Li, J. He, J. Bi, and B. Wu, “Segmentation-free multifont printed Manchu word recognition using deep convolutional features and data augmentation,” in 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, pp. 1-6, 2018. DOI: 10.1109/CISP-BMEI.2018.8633208.
  14. D. D. Zhang, Y. Liu, Z. W. Wang, and D. P. Wang, “OCR with the deep CNN model for ligature script-based languages like Manchu,” Scientific Programming, vol. 2021, pp. 1-9, Jun. 2021. DOI: 10.1155/2021/5520338.
  15. A. Snowberger and C. H. Lee, “A new segmentation and extraction method for Manchu character units,” in Proceedings of the 2022 International Conference on Future Information and Communication Engineering, Jeju, Korea, pp. 42-47, 2022.
  16. S. Lipovtsov, Gospel of St. Matthew in Manchu, British and Foreign Bible Society, 1822, [Online], Available: http://orthodox.cn/bible/manchu/.
  17. Z. Jifa, “Yumen tingzheng,” Wenshizhe Press, 2000.
  18. P. G. von Mollendorff, “A Manchu Grammar: With Analysed Texts,” Windham Press, 2013.
  19. K. Yoshihiro, “Manchu Written Text,” University of Tokyo Press, 1996.
  20. G. R. Li, “Manchu: A Textbook for Reading Documents,” 2nd ed., Honolulu, HI: National Foreign Language Resource Center, 2010.
  21. T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplar-SVMs for object detection and beyond,” in International Conference on Computer Vision, 2011, [Online], Available: https://www.cs.cmu.edu/~tmalisie/projects/iccv11/index.html. DOI: 10.1109/ICCV.2011.6126229.
  22. A. Rosebrock, “(Faster) Non-maximum suppression in Python,” 16 Feb. 2015, [Online], Available: https://pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/.
  23. Y. LeCun and C. Cortes, “MNIST handwritten digit database,” ATT Labs, 2010, [Online], Available: http://yann.lecun.com/exdb/mnist/.
  24. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. DOI: 10.1109/5.726791.

Aaron Daniel Snowberger

Aaron Daniel Snowberger received his Ph.D. degree in Information and Communication Engineering from Hanbat National University, Korea, in 2024. He received his B.S. degree in Computer Science from the College of Engineering, University of Wyoming, USA, in 2006, and his M.F.A. degree in Media Design from Full Sail University, USA, in 2011. He taught at Jeonju University from 2010 to 2023 and now lectures in Computer Science elsewhere. His research interests include computer vision, natural language processing, signal processing, machine and deep learning, and language.


Choong Ho Lee

Choong Ho Lee received the Ph.D. degree in Information Science from Tohoku University, Sendai, Japan, in 1998, and the B.S. and M.S. degrees from Yonsei University, Seoul, Korea, in 1985 and 1987, respectively. He worked as a researcher at KT from 1987 to 2000. He has been at Hanbat National University, Daejeon, Korea, since 2000, where he is presently a professor in the Department of Information and Communication Engineering. His research interests include digital image processing, computer vision, machine learning, big data analysis, and software education.


Article

Regular paper

Journal of information and communication convergence engineering 2024; 22(1): 80-87

Published online March 31, 2024 https://doi.org/10.56977/jicce.2024.22.1.80

Copyright © Korea Institute of Information and Communication Engineering.

Manchu Script Letters Dataset Creation and Labeling

Aaron Daniel Snowberger and Choong Ho Lee *, Member, KIICE

Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea

Correspondence to:Choong Ho Lee (E-mail: chlee@hanbat.ac.kr)
Department of Information and Communication Engineering, Hanbat National University, Daejeon 34158, Republic of Korea

Received: June 7, 2023; Revised: August 25, 2023; Accepted: September 25, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The Manchu language holds historical significance, but a complete dataset of Manchu script letters for training optical character recognition machine-learning models is currently unavailable. Therefore, this paper describes the process of creating a robust dataset of extracted Manchu script letters. Rather than performing automatic letter segmentation based on whitespace or the thickness of the central word stem, an image of the Manchu script was manually inspected, and one copy of the desired letter was selected as a region of interest. This selected region of interest was used as a template to match all other occurrences of the same letter within the Manchu script image. Although the dataset in this study contained only 4,000 images of five Manchu script letters, these letters were collected from twenty-eight writing styles. A full dataset of Manchu letters is expected to be obtained through this process. The collected dataset was normalized and trained using a simple convolutional neural network to verify its effectiveness.

Keywords: Character Extraction, Data Collection, Dataset Creation, Manchu Characters, Template Matching

I. INTRODUCTION

The Manchu language served as the lingua franca of the Qing Dynasty for over 200 years, facilitating trade and exchange throughout Asia and Europe. During that time, scholars in Russia, Joseon Korea, Tokugawa Japan, and European missionaries took a keen interest in Manchu for its usefulness in making the Chinese language more accessible [1].

However, today, only about 50 people can still read and write Manchu [2]. This implies a significant risk of losing numerous historical documents. Estimates indicate that nearly 20 percent of the 10 million documents in the First Historical Archives of China in Beijing were written in Manchu. In the provincial archive of Harbin, there may be up to 60 tons of Manchu documents [3]. There are also large archives of Qing Dynasty documents in Shenyang, Taipei, Daliang, Changchun, and Mongolia [4], as well as overseas archives in Japan, Europe, and North America [5].

Three major factors complicate the recovery of these documents. First, Manchu is a dying language. The last generation of native speakers is slowly aging out of the population. Second, few scholars have actively studied Manchu or conducted translation work using Manchu. Third, the vast amount of archival data is overwhelming. One estimate suggests that even if 100 people spent 100 years translating all Manchu archives, they would still be unable to finish [3].

Therefore, optical character recognition and machine learning technologies are becoming increasingly important for preserving and translating historical documents such as those written in Manchu. However, there is currently no available dataset of extracted and labeled Manchu script letters to aid in building an optical character recognition (OCR) model. Therefore, this study outlines a process for simultaneously extracting and labeling Manchu letters from images of Manchu scripts to build and normalize a machine-learning dataset. The dataset gathered and normalized for this study was trained using a simple neural network to confirm the viability of the approach.

II. RELATED WORK

Since the early 2000s, assorted studies have been conducted on Manchu script recognition. These studies typically fall into one of three categories. The first type of study focuses on recognizing Manchu letters based on strokes [6-8]. The second type of study focuses on the segmentation or extraction of Manchu letters from words for recognition [9-10]. The third type of study focuses on the segmentation-free recognition of Manchu words [11-14]. Of these three methods, the segmentation-free method appears to be the most intuitive, given the complexity of stroke and letter classifications. However, the segmentation-free recognition of Manchu words is not free of difficulties.

One difficulty is the need for a significantly larger Manchu words dataset for training, as opposed to a dataset containing only letters or strokes. However, because public datasets for the Manchu are largely unavailable, custom datasets are typically created for research purposes. Huang et al. [11] found that creating a Manchu word dataset often results in an imbalance, lacking sufficient word quantity for effective training, and thus necessitates augmentation.

The second difficulty is the possibility of additional noise or deformations appearing in Manchu word images when they are resized, by squashing or stretching, to match the input shape of the convolutional neural network (CNN). Zheng et al. [12] attempted to solve this problem by introducing a CNN with spatial pyramid pooling (SPP) rather than max-pooling in the final layer. The SPP layer contains three independent pooling layers, each dividing an input feature map into patches and creating a fixed-dimensional vector that can be input into the last fully connected layer of the CNN. This allows the recognition of Manchu words of arbitrary sizes without segmentation.

The third difficulty concerns the algorithmic complexity of segmentation-free Manchu word classifiers. For example, Zhang et al. [14] presented a 10-step algorithm for performing segmentation-free recognition with a deep CNN, where two steps of the algorithm required two separate appendices. The first appendix explains the seventh step, an extraction method for the vector outline of the j-th letter of a Manchu word, in ten additional steps. The second appendix explains the eighth step, which calculates the connection strength between the i-th contrast image and the j-th letter, in six additional steps.

Therefore, despite the increased frequency and potential outcomes of segmentation-free studies in recent years, they remain far from a validated approach for Manchu word recognition. Accordingly, in a previous study [15], we attempted a new type of Manchu script letter segmentation with limited success. In that study, we used Python to segment the images of the Manchu script into lines and words based on the surrounding white space. We then segmented each Manchu word into individual letters by cropping the image horizontally in vertical pixel rows where the image had the lowest number of black pixels. In other words, we cropped the image so that no strokes vertically overlapped with the central word stem. Fig. 1 illustrates this segmentation method. Three cropping locations are displayed in the histogram, where the word is divided into letters. These three locations are located at vertical pixel rows 7, 13, and 25, which crops the word into four letters.

Figure 1. Manchu word segmentation according to the lowest count of black pixels in each horizontal row of the image.

Although the segmentation method described above was effective for separating lines and words of text precisely, two additional problems arose. First, when applied to a collection of varying handwriting styles, the method proved ineffective in segmenting letters. This was particularly true in cases where long letter strokes overlapped vertically into the horizontal space of other letters. Second, even when the segmentation method worked perfectly, the issue of labeling the segmented data persisted.

Therefore, this paper presents a process for the simultaneous extraction and labeling of Manchu script letters to create a machine-learning dataset. Although the method presented here requires the manual identification of Manchu script letters in an image, it is effective for numerous documents and handwriting styles.

To verify the effectiveness of this extraction and labeling method, a small dataset of Manchu script letters was collected, normalized, and trained using a small neural network. After training for 50 epochs, the network classified the test images with an accuracy of 98.75%.

III. SYSTEM MODEL AND METHODS

The proposed process for Manchu script letter extraction and labeling uses OpenCV’s selectROIs and matchTemplate functions. First, the letters to be collected are defined in a label array. Next, selectROIs is used to select a region of interest containing a single Manchu letter. The selected ROI is then used as a template in the matchTemplate function to locate the matching copies in the given image. Because overlapping bounding boxes may exist for a matched template, a non-maximum suppression algorithm is used to minimize repeated matches.

The resulting matching bounding boxes are used to crop and save matching copies of the letters in labeled folders. This pattern is observed for each image of Manchu script in each folder of images collected from scanned books.

A. Data Collection

In this study, five different texts were used to gather 4,000 Manchu letters from 28 writing styles [16-20]. This allowed us to create a dataset with large data variance. Table 1 lists the selected texts, the number of scanned images, and the number of script styles for each text.

Table 1 . Number of images and writing styles collected from each text.

TextImagesWriting Styles
Gospel of St. Matthew791
Yumin tingzheng1386
A Manchu Grammar4094
Manchu Written Text241
A Textbook for Reading Documents15416
TOTAL80428


First, the PDF copies of each text were converted into JPG images. Next, each JPG image was visually enhanced using Photoshop and cropped into portions containing only the Manchu script. The images were then preprocessed in Python with erosion and dilation functions to enhance stroke features. Fig. 2 shows a sample of each type of text before preprocessing. Fig. 3 shows a sample of the preprocessed images. Various differing writing styles are present in Li [20] and are shown in the bottom row of both images.

Figure 2. JPG images before preprocessing. Top row from left to right: Gospel of St. Matthew [16], Yumen tingzheng [17], A Manchu Grammar with Analyzed Texts [18], Manchu Written Text [19]. Bottom row: Manchu, A Textbook for Reading Documents [20].

Figure 3. Preprocessed images. Top row from left to right: Gospel of St. Matthew [16], Yumen tingzheng [17], A Manchu Grammar with Analyzed Texts [18], Manchu Written Text [19]. Bottom row: Manchu, A Textbook for Reading Documents [20].

B. Dataset Creation

1) Letter Selection

A total of 804 Manchu script images were cropped, preprocessed, and organized into folders based on the source text. Next, a Python program was developed to select ROIs and match the templates across all images.

Initially, we define an array of letters to be visually located in the Manchu script images. The letter names are used as folder labels for the cropped and saved letter images. Letter names are also appended at the beginning of the filename for each cropped and saved letter.

For this study, the Manchu letters ‘j,’ ‘u,’ ‘w,’ ‘a,’ and ‘n’ were selected. The word ‘juwan’ means ‘ten’ in Manchu. In some images, the full word ‘juwan’ was present as a unit, making ROI selection simpler. However, in most images, letters were selected separately from different words. Fig. 4 shows the word ‘juwan’ in several styles.

Figure 4. ’Juwan,’ the Manchu word for ‘ten’ in several styles (left), and the division of individual letters making up the word (right).

When running, the Python program opens the first image in each folder in full-size grayscale. The selectROIs function provides a crosshair for letter selection in the image. The first letter, ‘j’ is located and selected. After selection, ‘enter’ or ‘space’ are pressed on the keyboard to advance the selectROIs function to wait for the second letter input, ‘u.’ This process is repeated until all five letters in ‘juwan’ are selected. Fig. 5 illustrates the selection method.

Figure 5. A Manchu script image with a selection of the Manchu letter ‘j’.

These letter selections are stored in a template array that is used later when matching the templates. After all five letters are selected, ‘ESC’ is pressed on the keyboard to end the selectROIs function and advance to the matchTemplates function.

If one of the five letters cannot be located in an image, nothing is selected, and ‘enter’ or ‘space’ is pressed on the keyboard to advance selectROIs to the next selection. If all five letters cannot be located in an image, nothing is selected, and ‘ESC’ is pressed on the keyboard to end the selectROIs function and advance to the matchTemplates function.

2) Template Matching

If the image template array created with selectROIs is not empty, each letter template is processed individually. First, if no template exists for a given letter, the letter is skipped, and the template of the next letter is evaluated. When a template exists for a letter, the program locates the letter’s folder in the project’s root directory and stores the folder path as a variable. If a folder for that letter does not yet exist, the program creates a new folder with that letter’s label and stores the folder path as a variable.

After the saved destination folder path is stored, the original grayscale Manchu-script image and grayscale letter templates are passed to OpenCV’s matchTemplates function. The TM_SQDIFF_NORMED method is used for template matching, which returns an output array of the comparison results.

The matchTemplates function is often used with OpenCV’s minMaxLoc function to determine the best match for the template in the output array. For TM_SQDIFF and TM_SQDIFF_NORMED, the best matches are obtained using the minimum value of the minMaxLoc function. Because we need to determine all the best matches for a given template in the source image, we use the ‘np.where’ function to find all x- and y-coordinates where the normalized correlation coefficient is below a given threshold. In this study, the threshold was set to 0.05.

However, using ‘np.where’ results in multiple overlapping detections for each match. Therefore, we also used a non-maximum suppression function proposed by Malisiewicz et al. [21], which was ported to Python by Rosebrock [22], to filter the best matches from numerous overlapping detections.

3) Non-Maximum Suppression

The non-maximum suppression function computes the area of each matched template’s bounding boxes and sorts them using the bottom-right y-coordinates. It then loops through all matches and calculates the maximum x- and y-coordinates for the start of each bounding box and the minimum x- and y-coordinates for the end of each bounding box. Subsequently, the width and height of each bounding box are computed. It then computes the overlap ratio of the bounding boxes. Finally, it deletes matching bounding boxes from the array that exceed the overlap threshold. This should result in only one (or extremely few) overlapping bounding boxes. In this study, the overlap threshold is set to 0.95. An example of the number of detected bounding boxes before and after non-maximum suppression is shown in Fig. 6.

Figure 6. Example of performing non-maximum suppression on detected bounding boxes with an overlap threshold of 0.95.

4) Cropping and Saving Matches

The remaining array of matched bounding boxes is returned to the Python program. The program then crops and saves each match from the original image to its letter folder using the stored folder path variable. Each cropped and saved image filename includes the letter name, followed by an underscore, and a timestamp when the image was cropped and saved. Appending a timestamp to each image’s filename was more effective than appending sequential numbers because of the possibility of overwriting previously saved files. Fig. 7 shows a selection of twelve cropped ‘j’ templates.

Figure 7. Sample of matched and cropped images for the Manchu letter ‘j.’

5) Processing All Images in a Source Folder

After the first image is processed, the Python program loops sequentially through every remaining image in the given folder, performing the same series of steps listed above. The reason for manual letter selection (Step 1) for every image in the folder is twofold. First, although most source folders contain images with the same handwriting style, multiple handwriting styles were present in the case of Li [20]. Second, even when every image used the same handwriting style, there was sufficient variation between the individual images to require separate processing of each image. When only the first selected image was used for template selection and matching across all images, numerous false positives were detected in later images. Therefore, using only the first image in the folder for template selection and matching is ineffective for the entire folder. Rather, locating one copy of each letter in each image separately and allowing the program to perform template matching for each image sequentially was found to be more effective. Fig. 8 shows a flowchart of the process involved in selecting letters and matching templates for all Manchu script images.

Figure 8. Manchu letter selection and matching Manchu process.

6) Matching Results

After letter selection, most letters were matched two to three times per page, resulting in most letter folders containing over 1,000 matching images. However, the letter ‘a’ folder contained nearly 50,000 matches. This letter was the simplest letter, containing only a single stroke protruding to the left of the central stem. Thus, it was easily overmatched. Likewise, the ‘n’ folder also contained over three times as many matches as the other letter folders.

To correct this overmatching problem in future studies, we suggest two possible solutions. First, because ‘n’ is an extremely common word ending, it might be more effective to combine it with ‘a’ or whatever other letter precedes it. Second, the threshold for the non-maximum suppression algorithm in the previous section should be decreased to 0.5 or even 0.3. Both suggestions would reduce the number of matched images saved by the program. This might also result in a less tedious process of manual folder inspection later.

7) Manual Inspection

After all Manchu script images were processed and the matching templates were saved into folders labeled with the letter name, each folder was manually inspected to remove false positives or incorrect matches. Because the ‘a’ and ‘n’ folders contained multiple times the number of matched images as the other folders, a script was run to systematically relocate every nth image into new folders.

For the ‘a’ folder, every tenth image was relocated to a separate folder. For the ‘n’ folder, every sixth image was relocated. And for the ‘j’ folder, every second image was relocated.

This systematic relocation of letters resulted in five-letter folders with almost 1,000 images each. Additionally, each folder was manually inspected using Windows Explorer with the Medium Icons view and incorrect matches were discarded. Finally, a Python script was used to select 800 images randomly from each folder for the final dataset. Table 2 lists the number of letters per folder resulting from relocating the letters and manually inspecting each folder.

Table 2 . Results of Manchu letter selection and matching.

Letter# of Matches# RelocatedAfter InspectionFinal
j2,2451,655828800
u998998886800
w1,0281,028957800
a47,7261,1601,158800
n6,7171,0731,013800


8) Dataset Normalization

The resulting 4,000-image dataset was normalized like the MNIST digits dataset [23]. Each image was opened in grayscale and binarized with pixel values under 80 set to 0 (black), and all other values set to 1 (white). Subsequently, images were inverted. The Python Image Library’s getbbox function was used to find the bounding box around each letter and crop the image to its bounding box. The images were resized to a length of 28 pixels on the long side, and padding was added to create square images of 28×28 pixels. Fig. 9 shows a sample of both the original dataset and the normalized dataset for the letter ‘j.’

Figure 9. Sample of matched images for the letter ‘j’ (left), and normalized images (right).

9) Verification of Dataset Effectiveness

Finally, a simple neural network was built and trained to verify the effectiveness of this Manchu dataset creation method.

First, the dataset was loaded into an array of (image, label) tuples using the folder name as the label. Every fifth image was placed in the test dataset. The remaining images were used as the training dataset. The test dataset contained 160 copies of each letter (800 images in total), and the training dataset contained 640 copies of each letter (3,200 images in total). Images were converted to PyTorch Tensors, and the labels were integer-encoded. The letter ‘j’ was encoded as 0, ‘u’ as 1, ‘w’ as 2, ‘a’ as 3, and ‘n’ as 4. Finally, data loaders were created for the training and test sets with a batch size of 100, and the shuffle parameter was set to True.

The neural network was built and trained in PyTorch and was based on LeNet-5 [24] with some modifications. The network consisted of two convolutional layers, followed by ReLU activation functions and max-pooling layers. Both convolutional layers used a kernel size of 5, a stride of 1, and a padding of 2. The size of the in_channels for the first convolutional layer was one and the size of the out_channels was 16. The in- and out-channels of the second convolutional layer were 16 and 32, respectively. Both max-pooling layers used a kernel size of two and a stride of two. The output from the second convolutional layer was flattened and sent to the final fully connected layer. The fully connected layer had 1,568 inputs and five outputs for five letters.

The network model is relatively small, with a total of 21,093 parameters. However, this small model was employed to assess the effectiveness of the data collection and labeling approaches. Fig. 10 provides a summary of the model.

Figure 10. Model summary for the LeNet-5-based CNN used to train this dataset.

IV. RESULTS

The neural network was trained for 50 epochs with a batch size of 100. A stochastic gradient descent was used with a cross-entropy loss, a learning rate of 0.01, and a momentum of 0.5. After training, the model was evaluated on the test dataset with an overall accuracy of 98.75%. Fig. 11 shows the comprehensive test results for each letter, as well as the overall accuracy of the model’s predictions.

Figure 11. Comprehensive accuracy measurements for each letter in the test dataset as well as the overall accuracy of the model’s predictions.

After evaluating the classification accuracy of the model, a script was created to randomly display a series of images from the test dataset with both predicted and true labels. The images were output in a 10 × 10 grid with predicted labels on the left and true labels on the right in parentheses. Correct predictions are displayed in green and incorrect predictions are shown in red. A sample of this 10 × 10 grid of predicted and labeled images is shown in Fig. 12.

Figure 12. Randomly selected images from the test dataset with the model’s classification predictions.

V. CONCLUSION

The focus of this study was to describe the process of creating a dataset of Manchu script letters for machine learning. The process involves manually selecting Manchu script letters from script images and then performing template matching to extract and label matching copies of each selected letter. In this study, the size of the created dataset was intentionally kept small (using only five letters) to quickly assess the effectiveness of the method. The dataset was normalized and trained using a simple CNN model. Likewise, the CNN was also kept small to focus on the data collection method rather than on the neural network architecture.

Unlike other studies, the data collection method described here does not employ automatic letter segmentation based on stroke features or the central stem of a word. Rather, it relies on the manual identification of Manchu letters in an image of Manchu script. This is an important distinction because it means that letters in multiple handwriting styles can be extracted and labeled in the same way.

Although this method of Manchu letter extraction and labeling is time-consuming, manual letter selection for each image eliminates the need to sort and label letters segmented by an automatic extraction process. Additionally, the process outlined in this study provides an effective approach for gathering data from various handwriting styles across numerous source texts.

Therefore, we expect that a full dataset of Manchu script letters, covering an arbitrary number of writing styles, can be created using this method. This is significant because no Manchu script letter dataset suitable for machine learning currently exists. Additionally, this method may prove beneficial for creating datasets of other script-based languages, such as Urdu, Pashto, Bangla, Kannada, and ancient Mongolian, which are often referenced in Manchu OCR studies.

Figure 1. Manchu word segmentation according to the lowest count of black pixels in each horizontal row of the image.

Figure 2. JPG images before preprocessing. Top row from left to right: Gospel of St. Matthew [16], Yumen tingzheng [17], A Manchu Grammar with Analyzed Texts [18], Manchu Written Text [19]. Bottom row: Manchu, A Textbook for Reading Documents [20].

Figure 3. Preprocessed images. Top row from left to right: Gospel of St. Matthew [16], Yumen tingzheng [17], A Manchu Grammar with Analyzed Texts [18], Manchu Written Text [19]. Bottom row: Manchu, A Textbook for Reading Documents [20].

Figure 4. ‘Juwan,’ the Manchu word for ‘ten’ in several styles (left), and the division of individual letters making up the word (right).

Figure 5. A Manchu script image with a selection of the Manchu letter ‘j.’

Figure 6. Example of performing non-maximum suppression on detected bounding boxes with an overlap threshold of 0.95.

Figure 7. Sample of matched and cropped images for the Manchu letter ‘j.’

Figure 8. Manchu letter selection and matching process.

Figure 9. Sample of matched images for the letter ‘j’ (left), and normalized images (right).

Figure 10. Model summary for the LeNet-5-based CNN used to train this dataset.

Figure 11. Comprehensive accuracy measurements for each letter in the test dataset, as well as the overall accuracy of the model’s predictions.

Figure 12. Randomly selected images from the test dataset with the model’s classification predictions.

Table 1. Number of images and writing styles collected from each text.

Text                               Images   Writing Styles
Gospel of St. Matthew                  79                1
Yumen tingzheng                       138                6
A Manchu Grammar                      409                4
Manchu Written Text                    24                1
A Textbook for Reading Documents      154               16
TOTAL                                 804               28

Table 2. Results of Manchu letter selection and matching.

Letter   # of Matches   # Relocated   After Inspection   Final
j               2,245         1,655                828     800
u                 998           998                886     800
w               1,028         1,028                957     800
a              47,726         1,160              1,158     800
n               6,717         1,073              1,013     800

References

  1. M. Saarela, “The early modern travels of Manchu: A script and its study in East Asia and Europe,” Philadelphia: University of Pennsylvania Press, 2020. DOI: 10.9783/9780812296938.
  2. “Manchu ethnologue,” Internet Archive, 2016, [Online], Available: https://web.archive.org/web/20161217235916/https://www.ethnologue.com/18/language/mnc/.
  3. D. Lague, “Manchu language lives mostly in archives,” The New York Times, 17 Mar. 2007, [Online], Available: https://www.nytimes.com/2007/03/17/world/asia/18manchu_side.html.
  4. J. Miyawaki-Okada, “Report on the Manchu documents stored at the Mongolian national central archives of history,” Saksaha: A Journal of Manchu Studies, vol. 4, 1999. DOI: 10.3998/saksaha.13401746.0004.002.
  5. M. C. Elliott, “The Manchu-language archives of the Qing Dynasty and the origins of the palace memorial system,” Late Imperial China, vol. 22, no. 1, pp. 1-70, Jun. 2001. DOI: 10.1353/late.2001.0002.
  6. G. Y. Zhang, J. J. Li, R. W. He, and A. X. Wang, “An offline recognition method of handwritten primitive Manchu characters based on strokes,” in Ninth International Workshop on Frontiers in Handwriting Recognition, Kokubunji, Japan, pp. 432-437, 2004. DOI: 10.1109/IWFHR.2004.16.
  7. G. Y. Zhang, J. J. Li, and A. X. Wang, “A new recognition method for the handwritten Manchu character unit,” in Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, Dalian, China, pp. 3339-3344, 2006. DOI: 10.1109/ICMLC.2006.258471.
  8. S. Xu, M. Li, and M. Q. Zhu, “Manchu text extract based on fuzzy clustering,” Information Technology Journal, vol. 12, no. 24, pp. 8323-8327, Dec. 2013. DOI: 10.3923/itj.2013.8323.8327.
  9. S. Xu, G. Q. Qi, M. Li, R. R. Zheng, and C. John, “An improved Manchu character recognition method,” Journal of Mechanical Engineering Research and Developments, vol. 39, no. 2, pp. 536-543, 2016. DOI: 10.7508/jmerd.2016.02.033.
  10. S. Xu, M. Li, R. R. Zheng, and S. Michael, “Manchu character segmentation and recognition method,” Journal of Discrete Mathematical Sciences and Cryptography, vol. 20, no. 1, pp. 43-53, Dec. 2016. DOI: 10.1080/09720529.2016.1177965.
  11. D. Huang, M. Li, R. Zheng, S. Xu, and J. Bi, “Synthetic data and DAG-SVM classifier for segmentation-free Manchu word recognition,” in 2017 International Conference on Computing Intelligence and Information System (CIIS), Nanjing, China, pp. 46-50, 2017. DOI: 10.1109/CIIS.2017.15.
  12. M. Li, R. Zheng, S. Xu, Y. Fu, and D. Huang, “Manchu word recognition based on convolutional neural network with spatial pyramid pooling,” in 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, pp. 1-6, 2018. DOI: 10.1109/CISP-BMEI.2018.8633131.
  13. R. Zheng, M. Li, J. He, J. Bi, and B. Wu, “Segmentation-free multifont printed Manchu word recognition using deep convolutional features and data augmentation,” in 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, pp. 1-6, 2018. DOI: 10.1109/CISP-BMEI.2018.8633208.
  14. D. D. Zhang, Y. Liu, Z. W. Wang, and D. P. Wang, “OCR with the deep CNN model for ligature script-based languages like Manchu,” Scientific Programming, vol. 2021, pp. 1-9, Jun. 2021. DOI: 10.1155/2021/5520338.
  15. A. Snowberger and C. H. Lee, “A new segmentation and extraction method for Manchu character units,” in Proceedings for 2022 International Conference on Future Information and Communication Engineering, Jeju, Korea, pp. 42-47, 2022.
  16. S. Lipovtsov, Gospel of St. Matthew in Manchu, British and Foreign Bible Society, 1822, [Online], Available: http://orthodox.cn/bible/manchu/.
  17. Z. Jifa, “Yumen tingzheng,” Wenshizhe Press, 2000.
  18. P. G. von Mollendorff, “A Manchu Grammar: With Analysed Texts,” Windham Press, 2013.
  19. K. Yoshihiro, “Manchu Written Text,” University of Tokyo Press, 1996.
  20. G. R. Li, “Manchu: A Textbook for Reading Documents,” 2nd ed., Honolulu, HI: National Foreign Language Resource Center, 2010.
  21. T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplar-SVMs for object detection and beyond,” in International Conference on Computer Vision, 2011, [Online], Available: https://www.cs.cmu.edu/~tmalisie/projects/iccv11/index.html. DOI: 10.1109/ICCV.2011.6126229.
  22. A. Rosebrock, “(Faster) Non-maximum suppression in python,” 16 Feb. 2015. [Online], Available: https://pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/.
  23. Y. Lecun and C. Cortes, “MNIST handwritten digit database,” ATT Labs, 2010, [Online], Available: http://yann.lecun.com/exdb/mnist/.
  24. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. DOI: 10.1109/5.726791.