Journal of information and communication convergence engineering 2024; 22(4): 288-295
Published online December 31, 2024
https://doi.org/10.56977/jicce.2024.22.4.288
© Korea Institute of Information and Communication Engineering
Correspondence to : Khang Nhut Lam (E-mail: lnkhang@ctu.edu.vn)
Department of Information Technology, Can Tho University, Can Tho 94100, Vietnam
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Recipe generation is an important task in both research and real life. In this study, we explore several pretrained language models that generate recipes from a list of text-based ingredients. Our recipe-generation models use a standard self-attention mechanism in Transformer and integrate a re-attention mechanism in Vision Transformer. The models were trained using a common paradigm based on cross-entropy loss and the BRIO paradigm combining contrastive and cross-entropy losses to achieve the best performance faster and eliminate exposure bias. Specifically, we utilize a generation model to produce N recipe candidates from ingredients. These initial candidates are used to train a BRIO-based recipe-generation model to produce N new candidates, which are used for iteratively fine-tuning the model to enhance the recipe quality. We experimentally evaluated our models using the RecipeNLG and CookingVN-recipe datasets in English and Vietnamese, respectively. Our best model, which leverages BART with re-attention and is trained using BRIO, outperforms the existing models.
Keywords Attention mechanism, BART, Recipe-generation model, Transformer
Individuals who cook their own meals often encounter the difficult decision of what dish to make using the ingredients in their kitchens or refrigerators. Some people also find it tedious to cook the same dishes repeatedly. Companies have released gadgets to help people cook, manage food, and calculate calories. Researchers have introduced models for generating cooking recipes from text-based ingredients or food images.
This study explored models that propose written recipes from a textual description of available ingredients, a natural language generation (NLG) challenge. In other words, we developed a system to create cooking recipes from input ingredients. The candidate NLG models considered were from the GPT family, BART [1], and T5 [2]. The GPT models are autoregressive and use only the Transformer decoder, whereas the BART and T5 models are sequence-to-sequence and follow the Transformer architecture [3]. For tasks that use a language model as the backbone, autoregressive models are often the “preferable choice”; however, when tasks “require the model to look back, analyze multiple pieces of content, or engage in extensive re-reading” [4], sequence-to-sequence models are “better choices” [5]. Safwat [6] claims that BART “demonstrated superior scores over T5 on the NLG evaluation” problem. In addition, the BRIO training paradigm [7], based on a non-deterministic distribution, helps the generation model achieve its best performance faster and eliminates exposure bias between training and inference, specifically in abstractive text summarization.
In this study, we fine-tuned and trained cooking recipe generation models using a non-deterministic distribution to improve recipe quality. This study makes the following contributions. We expand the CookingVN-recipe dataset for recipe generation in Vietnamese. We discuss the integration of a re-attention mechanism, which was originally used for image processing, into recipe generation. In addition, we present an adaptation of the BRIO paradigm to train these models to achieve optimal performance. Our proposed recipe-generation models with re-attention and training using BRIO outperformed existing models.
The inputs to a recipe-generation system are food images and/or text-based ingredients. Hence, we classified recipe generation approaches into two classes based on the input type: food image- and ingredient text-based.
Many cooking recipe generation systems use food images as inputs, and produce recipes as outputs. These systems commonly consist of a model for recognizing ingredients in images and a model for generating recipes based on the ingredients. This category can be further divided into three subcategories based on the methodology used.
Retrieval-based methods determine the most relevant recipes from food images in a database. These approaches involve an image recognition model to identify ingredients from images and a retrieval model to search for the best-match recipes. For example, Lim et al. [8] used CNNs to recognize ingredients from images, which were then used to retrieve matching recipes from a dataset using Elasticsearch1). Similarly, Morol et al. [9] employed CNN-based models to recognize ingredients in images and used a linear search to find the best-match recipes in a database.
Generation-based methods create recipes using generation models with ingredients identified in the images. Kumar et al. [10] used ResNet to extract features from food images and a feed-forward neural model to recognize ingredients. These extracted features and ingredients were concatenated and fed into Transformer to generate recipes.
Hybrid methods use both food images and the corresponding original recipes from a database to generate new recipes. Wang et al. [11] proposed a food cross-modal retrieval module for creating new recipes from original recipes and recipe trees. In particular, the authors used graph attention networks to build tree embeddings from food images and inferred recipe trees using RNN. FIRE [12] uses BLIP [13] to create titles, Vision Transformer (ViT) [14] to extract ingredients, and T5 to generate recipes.
These models typically employ deep-learning-based language models to create new recipes from input ingredients. Chef Transformer2) based on T5 [2] is an example of such a model. The GPT-2 model [15] has been used to develop several cooking recipe generators, including RecipeGPT [16], RecipeNLG [17], RecipeMC [18], and Ratatouille [19]. Fujita et al. [20] increased the system's ability to reflect input ingredients in output recipes using an encoder-decoder model with reinforcement learning. Reusch et al. [21] constructed RecipeGM using LSTM to encode ingredients and decode recipes and enhanced this model for better recipe generation with multi-head self-attention. Lam et al. [22] proposed a ViT5-based model [23] for generating Vietnamese cooking recipes.
Encoder-decoder models often fail to capture ingredients and create recipes with similar phrases due to their vanilla attention mechanism, which focuses on the same ingredients in each decoder output [20]. To solve this issue, we used a re-attention mechanism [24] that has been applied mostly in computer vision. Liu et al. [7] found that the inference of encoder-decoder models can be ineffective because these models use a deterministic target distribution that assigns all probability mass to the reference text. In this study, we employed a non-deterministic distribution in the training step to improve recipe generation. Fig. 1 illustrates the architecture of the proposed cooking recipe generator. Initially, a Transformer-based recipe-generation model is used to generate N candidates from the input ingredients. Each candidate consists of a food title, ingredients, and a cooking recipe. These initial candidates are then used to train the BRIO-based recipe-generation model to produce N new candidates, which are iteratively used to further fine-tune the BRIO-based recipe generator to improve the quality of the cooking recipes. The remainder of this section describes the cooking recipe generation model.
Direct Transformer- [3] and BART-based [1] models are used as backbones to produce recipes from ingredients. Each layer in both the Encoder and Decoder of the standard Transformer has sub-layers of self-attention and a fully connected network. The self-attention in Transformer is computed as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

where Q, K, and V are the matrices of the query, key, and value, respectively, and d_k is the dimension of the key vectors.
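As a concrete illustration, this scaled dot-product attention can be sketched in a few lines of NumPy (a toy sketch with random inputs of our own choosing, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```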
Given the input text X = {x1, ..., xn} and the reference text Y* = {y1*, ..., yl*}, the generation model g, which is Transformer or BART in our study, is trained using the cross-entropy loss Lxent between the decoder’s output and the reference text:

Lxent = −Σ_{j=1}^{l} log p_{gθ}(yj* | Y*_{<j}, X)

where θ is the set of parameters of g and Y*_{<j} denotes the reference tokens preceding position j.
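This maximum-likelihood objective can be illustrated with a small sketch (the uniform toy predictions are ours; a real model would supply the decoder's log-probabilities):

```python
import numpy as np

def xent_loss(log_probs, reference_ids):
    """Cross-entropy of a reference sequence under the model.

    log_probs:     (seq_len, vocab) log-probabilities from the decoder
    reference_ids: (seq_len,) gold token ids y*_1 .. y*_l
    """
    # sum of negative log-likelihoods of each gold token
    return -sum(log_probs[j, t] for j, t in enumerate(reference_ids))

# toy example: vocabulary of 4 tokens, reference of length 3
probs = np.full((3, 4), 0.25)          # uniform predictions
loss = xent_loss(np.log(probs), [0, 3, 1])
print(round(loss, 4))  # 3 * ln(4) = 4.1589
```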
The ingredients are segmented into words, vectorized, and added to the positional embeddings. These vectors are fed into the Encoder of the backbone model. Reference recipes are similarly processed: segmented into words, vectorized, added to positional embeddings, and fed into the Decoder. The output of the last Encoder layer is fed into every Decoder layer at the encoder-decoder attention sub-layers. The output of the final Decoder layer is passed through the linear and softmax layers to predict the generated tokens.
Zhou et al. [24] observed that the self-attention mechanism does not “learn effective concepts for representation learning.” They proposed a re-attention mechanism and achieved impressive results using ViT [14]. The authors replaced the self-attention layer within the Transformer block with the re-attention layer, which is obtained using the following equation:

Re-Attention(Q, K, V) = Norm(ω^T · softmax(QK^T / √d_k))V
where ω is a learnable transformation matrix, which is multiplied by the self-attention map along the head dimension. Both ViT and our recipe-generation models adhere to the standard Transformer architecture, except that ViT operates on images, whereas the recipe-generation models operate on text. We replaced the self-attention layer in both the Encoder and Decoder of the backbone models with a re-attention layer to increase the diversity of the attention matrices across layers and to avoid the attention collapse issue.
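A minimal sketch of re-attention follows; we use simple row re-normalization as a stand-in for the Norm operation of [24] and a hand-picked ω, both assumptions of this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(Q, K, V, omega):
    """Q, K, V: (heads, n, d_k); omega: (heads, heads) learnable mixing matrix."""
    d_k = K.shape[-1]
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # per-head maps (H, n, n)
    # mix the attention maps across the head dimension with omega
    mixed = np.einsum('hg,gnm->hnm', omega, attn)
    # re-normalize rows so each stays a distribution (stand-in for Norm)
    mixed = mixed / mixed.sum(axis=-1, keepdims=True)
    return mixed @ V

rng = np.random.default_rng(1)
H, n, d = 2, 5, 4
Q, K, V = (rng.normal(size=(H, n, d)) for _ in range(3))
out = re_attention(Q, K, V, np.eye(H))  # identity omega reduces to self-attention
print(out.shape)  # (2, 5, 4)
```

With a non-identity ω, each head's output mixes information from the other heads' attention maps, which is what diversifies the attention matrices across layers.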
The BRIO technique [7] is used to train a generation model with a non-deterministic distribution. In the recipe-generation task, the backbone model acts as an autoregressive generation model g to create recipe candidates from input ingredients and as an evaluation model to calculate the quality scores of the candidates. The model is trained using a contrastive loss Lctr with a weight of γ and the cross-entropy loss Lxent:

L = Lxent + γ·Lctr
Lctr is obtained using the following equation:

Lctr = Σ_i Σ_{j>i} max(0, f(Yj) − f(Yi) + λij)

where Yi and Yj are recipe candidates ordered such that ROUGE(Yi, Y*) > ROUGE(Yj, Y*) for i < j, λij is the margin, and f(Yi) is the estimated probability of candidate Yi.
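The ranking behavior of Lctr can be sketched as follows, assuming candidates are pre-sorted by ROUGE and a rank-proportional margin λij = (j − i)·λ, as in BRIO:

```python
def contrastive_loss(scores, margin=0.01):
    """BRIO-style ranking loss over candidate scores f(Y_i).

    `scores` are the model's candidate scores, already sorted so that
    ROUGE(Y_1, Y*) > ROUGE(Y_2, Y*) > ...; the margin grows with the rank gap.
    """
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            lam = (j - i) * margin
            loss += max(0.0, scores[j] - scores[i] + lam)
    return loss

# a correctly ordered candidate list with wide score gaps incurs zero loss
print(contrastive_loss([-0.1, -0.5, -0.9]))  # 0.0
```

Intuitively, the loss is zero only when better candidates (by ROUGE) already receive higher model scores by at least the margin, so minimizing it teaches the model to rank its own outputs.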
We performed experiments using Google Colab with a Tesla T4 GPU and 16 GB of RAM to create cooking recipes in both English and Vietnamese. Owing to hardware limitations, we used 500,000 recipes randomly extracted from the English RecipeNLG dataset [17]. For the Vietnamese dataset, we expanded the CookyVN-recipe dataset [22], using Selenium3) to crawl data from the Internet and Underthesea4) to clean and preprocess the data. We renamed the dataset CookingVN-recipe to make it more familiar to users; it currently contains 77,024 recipes.
The datasets were organized into six columns: titles, servings, ingredients, quantities, instructions (or recipes), and links. The titles, ingredients, and recipes were used to train the models. Each dataset was divided into three parts: 75% for training, 8% for validation, and 17% for testing. Transformer [3] and BART [1] were utilized as backbones for generating English recipes, whereas BARTpho [25] was used as a backbone for generating Vietnamese recipes.
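The 75/8/17 split can be reproduced with a simple shuffled partition (a sketch; the seed and helper name are ours):

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split into 75% train / 8% validation / 17% test."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train, n_val = int(0.75 * n), int(0.08 * n)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

# CookingVN-recipe has 77,024 recipes
train, val, test = split_dataset(range(77024))
print(len(train), len(val), len(test))  # 57768 6161 13095
```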
We conducted experiments to generate recipes using the standard Transformer with the following configuration and parameters: 6 encoder layers, 6 decoder layers, 8 attention heads, 5 epochs, a batch size of 4, 4,000 warm-up steps, a learning rate of 0.01, and the Adam optimizer. The maximum lengths of the ingredients and cooking recipes were 120 and 1,024, respectively. Both Greedy and Beam search were used for recipe generation. We applied a variety of beam sizes (3, 4, and 6) with a length normalization of α = 0.6. Table 1 lists the evaluation results of the recipe generators using Transformer.
Table 1. Results of recipe generators using Transformer

Search Algorithm | Size | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---|---
Greedy | - | 49.76 | 23.78 | 40.06
Beam | 3 | 49.98 | 24.23 | 40.43
Beam | 4 | 49.92 | 24.20 | 40.41
Beam | 6 | 49.86 | 24.24 | 40.43
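The length normalization with α = 0.6 used in the beam search can be sketched as follows; we assume the GNMT-style length penalty, which is one common formulation (the exact variant used is an assumption of this sketch):

```python
def length_normalized_score(log_prob_sum, length, alpha=0.6):
    """Score a beam candidate: log P(Y) divided by the length penalty
    lp(Y) = ((5 + |Y|)^alpha) / ((5 + 1)^alpha)."""
    lp = ((5 + length) ** alpha) / ((5 + 1) ** alpha)
    return log_prob_sum / lp

# without normalization, longer candidates are unfairly penalized;
# dividing by lp(Y) softens the penalty for long sequences
print(length_normalized_score(-10.0, 1))   # -10.0 (lp = 1 at length 1)
print(length_normalized_score(-20.0, 10))  # about -11.5, better than raw -20
```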
Because the Beam search performed better than the Greedy search, it was used for further experiments. In addition, because of hardware limitations, we experimented with a Beam size of four. Next, we replaced the self-attention layers with re-attention layers in the Encoder, Decoder, and both. The results of the recipe-generation model are presented in Table 2.
Table 2. Results of recipe generators using Transformer and re-attention

Transformer models | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
Encoder w/ re-attention | 49.21 | 23.96 | 40.15
Decoder w/ re-attention | 49.20 | 23.76 | 39.88
Both w/ re-attention | 48.98 | 23.58 | 39.75
Similar to the experiments with Transformer, the BART-based models were first used to create cooking recipes from ingredients with the following parameters: 6 encoder layers, 6 decoder layers, 12 attention heads, 5 epochs, a learning rate of 0.0001, 20,000 warm-up steps, a batch size of 4, and the Adam optimizer.
Next, we modified these backbones by replacing the self-attention layers with re-attention layers. Instead of directly using the attention matrix from self-attention, re-attention uses a transformation matrix to mix the attention matrices from different heads in the same self-attention layer. The transformation matrix is a parameter learned during training.
The backbone generation models were fine-tuned to generate recipe candidates, which were subsequently used to improve the quality of the recipe-generation models in the next iteration. In particular, the BART model was fine-tuned to create N = 6 recipe candidates (Beam size of 6) to ensure the diversity of the candidates. The candidates were sorted in descending order based on the average F1-scores for ROUGE-1, ROUGE-2, and ROUGE-L. These recipe candidates were iteratively used to enhance the recipe-generation model. Experiments conducted by Liu et al. [7] demonstrated that their abstractive summarization model achieved the best performance after two fine-tuning steps. Therefore, the models were fine-tuned twice.
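The candidate-sorting step can be sketched as below (the helper name and toy scores are ours; in practice, the ROUGE F1-scores are computed against the reference recipe):

```python
def rank_candidates(candidates, rouge_scores):
    """Sort candidates by the mean of ROUGE-1/2/L F1 against the reference.

    rouge_scores[i] = (rouge1_f1, rouge2_f1, rougeL_f1) for candidates[i].
    """
    avg = [sum(s) / len(s) for s in rouge_scores]
    order = sorted(range(len(candidates)), key=lambda i: avg[i], reverse=True)
    return [candidates[i] for i in order]

# toy example with three candidate recipes
ranked = rank_candidates(['a', 'b', 'c'],
                         [(0.4, 0.2, 0.3), (0.6, 0.3, 0.5), (0.5, 0.1, 0.2)])
print(ranked)  # ['b', 'a', 'c']
```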
The backbone model was fine-tuned using MLE and label smoothing [26], with smoothing parameters of 0.1 for the RecipeNLG dataset and 0.01 for the CookingVN-recipe dataset. The output of the model was used to obtain the cross-entropy loss Lxent and the contrastive loss Lctr with a margin λ of 0.01. The final loss L of the model is the sum of Lxent and Lctr multiplied by their corresponding weights, which are 0.1 and 10 in our experiments, respectively; this loss is then backpropagated through the model. After a sufficient number of data samples has been processed (gradient accumulation steps of 8), the model performs an optimizer update, adjusts the learning rate, and increments the update-step counter. When the update-step counter reaches the evaluation interval (1,000), the entire model is evaluated as follows:
• Scoring role: The model was evaluated based on its ability to score candidates created using the pretrained recipe-generation model and select the final output.
• Generation role: The model was evaluated based on its ability to generate cooking recipes. The model was trained using two loss functions such that Lxent helped enhance token generation to create complete recipes, and Lctr helped select the best recipes.
If the results improved, the model replaced the previously saved model; otherwise, it was ignored.
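The bookkeeping described above (gradient accumulation steps of 8 and an evaluation interval of 1,000 updates) can be sketched as a simple schedule; this is an illustrative sketch, not the actual training code:

```python
def training_schedule(num_batches, accum_steps=8, eval_interval=1000):
    """Yield ('step', k) after every `accum_steps` batches (one optimizer
    update) and ('eval', k) after every `eval_interval` updates."""
    update_step = 0
    for batch_idx in range(1, num_batches + 1):
        if batch_idx % accum_steps == 0:
            update_step += 1
            yield ('step', update_step)
            if update_step % eval_interval == 0:
                yield ('eval', update_step)  # evaluate; keep model if improved

events = list(training_schedule(16_000))
print(events[-1])  # ('eval', 2000)
```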
Table 3 displays the results of the proposed models. The experiments show that the re-attention mechanism slightly improves the performance of recipe-generation models. In particular, when using BART and BARTpho as backbones, the ROUGE-L scores of the BRIO-based models with re-attention increase by 3.61% and 3.38%, respectively, compared with the BRIO-based models without re-attention. Cooking recipe generation is a text-generation task whose models are typically trained using maximum likelihood estimation, i.e., a deterministic distribution. The BRIO training technique, which achieved state-of-the-art results in abstractive summarization, also enhances recipe-generation performance. Furthermore, the BART-based recipe-generation models with a re-attention mechanism trained using BRIO perform best. Lam et al. [27] reported decreased performance in abstractive summarization using ViT5 trained with BRIO; however, ViT5 trained using the common technique outperforms our BART-based models. Therefore, we will explore ViT5-based models for cooking recipe generation in future studies.
Table 3. Results of the proposed recipe generators

Model | Dataset used | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---|---
Coverage loss and reinforcement learning [20] | Food.com | 31.00 | 7.00 | 32.40
RecipeGPT [16] | Recipe1M+ | - | - | 37.00
RecipeNLG [17] | RecipeNLG | - | 13.54 | -
RecipeMC [18] | RecipeNLG | 50.50 | 24.20 | -
Chef Transformer | RecipeNLG | - | 24.70 | -
BART | RecipeNLG | 50.24 | 26.22 | 39.81
BART w/ re-attention | RecipeNLG | 51.27 | 26.27 | 40.12
BART-BRIO | RecipeNLG | 54.86 | 27.43 | 43.80
BART w/ re-attention and BRIO | RecipeNLG | 56.25 | 26.36 | 45.38
ViT5 [22] | CookyVN-recipe | 64.45 | 35.92 | 38.21
BARTpho | CookingVN-recipe | 56.49 | 29.72 | 34.09
BARTpho w/ re-attention | CookingVN-recipe | 56.91 | 29.34 | 35.72
BARTpho-BRIO | CookingVN-recipe | 58.41 | 27.88 | 36.38
BARTpho w/ re-attention and BRIO | CookingVN-recipe | 59.75 | 28.12 | 37.61
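The relative ROUGE-L gains quoted in the text can be verified directly from Table 3:

```python
def pct_gain(new, old):
    """Relative improvement of `new` over `old`, in percent, rounded to 2 dp."""
    return round((new - old) / old * 100, 2)

# BART: re-attention + BRIO (45.38) vs. BRIO alone (43.80)
print(pct_gain(45.38, 43.80))  # 3.61
# BARTpho: re-attention + BRIO (37.61) vs. BRIO alone (36.38)
print(pct_gain(37.61, 36.38))  # 3.38
```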
Table 4 shows the titles, ingredients, and recipes created by the different models using the following input ingredients: cocoa powder, icing sugar, coconut, butter, wheat biscuits, and eggs. Recipes generated using BART tend to be short and simple, whereas those created using BART with re-attention are more detailed. The BRIO-based recipe-generation model produced the longest and most detailed cooking steps. However, these models may omit certain ingredients when generating recipes. In particular, “wheat biscuits” and “egg white” are not used in the recipes for “Biscuit Balls” and “Cocoa Biscuits,” respectively.
Table 4. Cooking recipes created from the input ingredients cocoa powder, icing sugar, coconut, butter, wheat biscuits, and egg. Each input ingredient is colored for presentation purposes only
Models | Titles | Ingredients | Recipes |
---|---|---|---|
BART | Biscuit Balls | 1/2 cup cocoa powder, 1/4 cup icing sugar, 2 tablespoons coconut, 3 tablespoons butter melted, 4 wheat biscuits crushed, egg beaten | Mix cocoa, icing sugar, coconut and butter. Roll into balls. Roll in beaten egg and place on a greased baking tray. Bake at 350°F for 10–12 minutes or until golden brown. Serve warm. |
BART-BRIO | Cocoa Biscuits | 1/2 cup cocoa powder, icing sugar, 1/4 cup coconut, 4 wheat biscuits, egg white | Preheat oven to 350°F. In a small bowl, mix together the cocoa powder, icing sugar and coconut. Set aside. Cut each wheat biscuit in half lengthwise. Scoop out the insides of the biscuits, leaving a 1/2-inch-thick shell. Fill the shells with the cocoa mixture. Place the shells on a baking sheet and bake for 15-20 minutes or until golden brown. Remove from the oven and allow to cool on the baking sheet for 5 minutes before removing to a wire rack to cool completely. |
BART w/re-attention | Chocolate Coconut Crumble | 1/2 cup cocoa powder, 1/4 cup icing sugar, 2 tablespoons coconut, 3 tablespoons butter, 4 wheat biscuits, egg | Preheat oven to 350°F. Grease a baking tray. In a small bowl, mix cocoa powder, icing sugar, coconut and butter. Set aside. In a large bowl, combine biscuits and egg. Stir in cocoa mixture. Transfer to prepared tray. Bake for 20–25 minutes or until golden brown. Serve warm |
BART w/re-attention and BRIO | Coconut Wheat Biscuits | 1/2 cup cocoa powder, 1/4 cup icing sugar, 2 tablespoons butter, 4 wheat biscuits, egg white | Preheat oven to 350°F. In a small bowl, combine cocoa powder, icing sugar, butter and egg white; mix well. Dip biscuits in cocoa mixture, then roll in coconut mixture. Place on a baking sheet lined with parchment paper. Bake for 10-12 minutes or until golden brown |
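The ingredient omissions noted above can be detected with a naive containment check (a sketch of our own; robust matching would require normalizing ingredient phrases):

```python
def missing_ingredients(inputs, recipe_text):
    """Return input ingredients that never appear in the generated text."""
    text = recipe_text.lower()
    return [ing for ing in inputs if ing.lower() not in text]

# "wheat biscuits" is an input but is absent from the generated recipe
print(missing_ingredients(["cocoa powder", "wheat biscuits"],
                          "Mix cocoa powder and butter. Roll into balls."))
# ['wheat biscuits']
```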
Interestingly, BRIO-based models with re-attention can use new ingredients that are not among the input ingredients to make the recipes complete, and they seem to balance the detail and nature of the created recipes. For example, given the input ingredients {chicken breast, salad dressing, breadcrumbs, mozzarella cheese, parmesan}, the new ingredient “spaghetti” is introduced to cook “Chicken Parmesan.” The generated ingredients are 1 lb boneless skinless chicken breast, 1/2 cup salad dressing, 2 cups breadcrumbs, shredded mozzarella cheese, parmesan parmigiano-reggiano grated cheese, and 3/4 cup cooked spaghetti broken into bite-sized pieces. The recipe is “Preheat oven to 350 degrees. Spray a 9×13 inch baking dish with nonstick cooking spray. Place chicken in baking dish. Pour salad dressing over chicken. Sprinkle with bread crumbs and cheese. Bake for 30 minutes or until chicken is no longer pink in the center and crumbs are lightly browned. Serve with spaghetti”. We observed that some input ingredients may be absent from the generated ingredient list yet still appear in the cooking recipes created by BRIO-based models with re-attention.
End-to-end cooking recipe generation systems that use generation models encounter several challenges. First, as mentioned previously, these models may introduce new ingredients or omit some input ingredients. Second, the generated recipes may include irrelevant cooking steps. For example, similar to the findings of Lam et al. [22], our models may occasionally produce recipes that simply place cleaned fish or beef into a pot without mentioning the necessary preceding steps, such as cutting or seasoning. Moreover, a given cooking step may depend on the products of previous steps. To address this issue, cooking steps can be converted into an action sequence planning problem, as suggested by Pareek et al. [28]. Third, people usually decide on cooking methods, such as boiling, frying, or grilling, before looking for detailed cooking instructions. A limitation of the proposed approach is that the models do not pay much attention to generating food titles that match the cooking instructions; therefore, creating food names that reflect cooking methods must be carefully considered. Finally, people tend to prefer specific cooking methods and have personal tastes (e.g., low sugar, low salt, or avoiding oil). To generate recipes that meet user demands, recipe-generation systems should be able to analyze user food tastes, such as those based on recipe ratings [29] or additional user requirements [30].
In future work, we will develop a multimodal recipe-generation model that generates recipes that not only meet user demands but also provide logical cooking steps. Furthermore, we will employ an ingredient-understanding model to identify the main ingredients and capture user requirements regarding ingredients; a food-name generation model to produce food titles that reflect a specific cooking method in an obvious or creative manner; and a taste-prediction model to predict the user-preferred food flavor. The outputs of these models will be used to enhance the recipe-generation model.
We presented an overview of cooking recipe generation models and developed models that use Transformer and BART as backbones. We expanded the CookingVN-recipe dataset to 77,024 recipes. The highlighted contributions of our work include integrating the re-attention mechanism into cooking recipe generation models to enhance recipe quality and employing the BRIO paradigm to train recipe-generation models for optimal performance. The BRIO-based recipe-generation models performed better than existing models. In addition, we analyzed the drawbacks of generation-based cooking recipe models and suggested approaches to address them. Our study not only enhances cooking recipe generation from text-based ingredients but also has the potential to be integrated with recipe generation from food images, where our models can be employed in the recipe-generation phase after the ingredients are predicted from the images.
Khang Nhut Lam
earned her Master’s degree in Information Technology from Ewha Womans University, Seoul, Korea, in 2009, and her Ph.D. degree in Computer Science from the University of Colorado, Colorado Springs, USA, in 2015. Since 2015, she has been a lecturer in the Department of Information Technology at Can Tho University. Her research interests include Natural Language Processing, Question-Answering systems, Image Captioning, and Deep Learning.
My-Khanh Thi Nguyen
received a Bachelor’s degree in Information Technology from Can Tho University, Vietnam, in 2024. Her research interests include Image Captioning, Text Generation, and Deep Learning.
Huu Trong Nguyen
is a senior student in Information Technology at Can Tho University, Vietnam. His research interests focus on Text Generation and Deep Learning.
Vi Trieu Huynh
received a Bachelor’s degree in Information Technology in 2020, and a Master’s degree in Computer Science in 2024, both from Can Tho University, Vietnam. He is an IT developer at FPT University, Vietnam. His research interests include Natural Language Processing, Text Generation, and Deep Learning.
Van Lam Le
is a senior lecturer in the Department of Information Technology, Can Tho University, Vietnam. He received a Master’s degree in Information Technology from the University of Newcastle, Australia, and a Ph.D. degree in Computer Science from Victoria University of Wellington, New Zealand. His research is focused on Network Security, IoT, Digital Transformation, and Machine Learning.
Jugal Kalita
earned his Bachelor of Technology degree from the Indian Institute of Technology in Kharagpur, India. He received his Master of Science degree from the University of Saskatchewan, Canada, and a Master of Science and a Ph.D. from the University of Pennsylvania. He teaches a variety of classes, having taught almost 20 different classes during his career at UCCS. His research interests are in Natural Language Processing, Computational Linguistics, and Machine Learning including Deep Learning.
Khang Nhut Lam1*, My-Khanh Thi Nguyen1, Huu Trong Nguyen1, Vi Trieu Huynh2, Van Lam Le1, and Jugal Kalita3

1Department of Information Technology, Can Tho University, Can Tho 94100, Vietnam
2Research and Application Development Department, FPT University, Can Tho 94100, Vietnam
3Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA
Direct Transformer- [3] and BART-based [1] models are used as backbones to produce recipes from ingredients. Each layer in both the Encoder and Decoder of the standard Transformer has sub-layers of self-attention and a fully connected network. The self-attention in Transformer is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the matrices of the query, key, and value, respectively, and d_k is the dimension of the key vectors.
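The scaled dot-product self-attention described above can be sketched in NumPy as follows. This is a minimal single-head version for illustration; the matrix shapes and function names are our own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors
```

Each output row is a convex combination of the value vectors, weighted by how strongly the corresponding query attends to each key.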
Given the input text X = {x1, ..., xn} and the reference text Y* = {y1*, ..., yl*}, the generation model g, which is Transformer or BART in our study, is trained using the cross-entropy loss Lxent between the decoder's output and the reference text:

$$\mathcal{L}_{xent} = -\sum_{l} \log p_{g_{\theta}}\left(y^{*}_{l} \mid Y^{*}_{<l}, X\right)$$

where θ is the set of parameters of g and Y*<l = {y1*, ..., y(l-1)*} denotes the reference tokens preceding position l.
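The teacher-forced cross-entropy objective amounts to summing the negative log-probabilities that the decoder assigns to each reference token. A toy sketch (the dict-per-step representation of the decoder's softmax output is our own simplification):

```python
import math

def cross_entropy_loss(step_probs, reference_ids):
    """Cross-entropy of the reference recipe under teacher forcing:
    L_xent = -sum_l log p(y*_l | Y*_<l, X).

    step_probs[l] is a toy per-step distribution over token ids (a dict),
    standing in for the decoder's softmax output at step l."""
    return -sum(math.log(probs[tok])
                for probs, tok in zip(step_probs, reference_ids))
```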
The ingredients are segmented into words, vectorized, and added to the positional embeddings. These vectors are fed into the Encoder of the backbone models. The reference recipes are similarly processed, segmented into words, vectorized, added to positional embeddings, and fed into the Decoder. The output of the last Encoder layer is fed into every Decoder layer at the encoder-decoder attention sub-layer. The output of the final Decoder layer is passed through the linear and softmax layers to predict the generated tokens.
Zhou et al. [24] observed that the self-attention mechanism does not "learn effective concepts for representation learning." They proposed a re-attention mechanism and achieved impressive results using ViT [14]. The authors replaced the self-attention layer within the Transformer block with a re-attention layer, which is obtained using the following equation:

$$\mathrm{Re\mbox{-}Attention}(Q, K, V) = \mathrm{Norm}\left(\omega^{T}\,\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\right)V$$

where ω is a learnable transformation matrix that is multiplied with the self-attention map along the head dimension, and Norm(·) is a normalization function. Both ViT and our recipe-generation models adhere to the standard Transformer architecture, except that ViT operates on images, whereas the recipe-generation models operate on text. We replaced the self-attention layer in both the Encoder and Decoder of the backbone models with a re-attention layer to increase the diversity of the attention matrices across layers and to avoid the attention-collapse issue.
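The head-mixing step of re-attention can be sketched in NumPy as follows. This is our own simplification: row re-normalization stands in for the Norm(·) operation (BatchNorm in the original paper), and the shapes are illustrative:

```python
import numpy as np

def re_attention(attn_maps, omega):
    """Re-attention: mix per-head self-attention maps with a learnable matrix.

    attn_maps: array of shape (H, n, n), the softmax attention maps of H heads.
    omega:     array of shape (H, H), the learnable transformation matrix
               applied along the head dimension (omega^T in the equation).
    """
    mixed = np.einsum('gh,hnm->gnm', omega.T, attn_maps)
    # Re-normalize each row so it remains a valid attention distribution;
    # this stands in for the Norm(.) operation of the original paper.
    return mixed / mixed.sum(axis=-1, keepdims=True)
```

With ω set to the identity matrix, re-attention reduces to the original per-head attention; a non-trivial ω lets each layer recombine information across heads, which is what counteracts attention collapse.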
The BRIO technique [7] is used to train a generation model with a non-deterministic distribution. In the recipe-generation task, the backbone model acts as an autoregressive generation model g to create recipe candidates from input ingredients and as an evaluation model to calculate the quality scores of the candidates. The model is trained using the contrastive loss Lctr with a weight of γ and the cross-entropy loss Lxent:

$$\mathcal{L} = \mathcal{L}_{xent} + \gamma\,\mathcal{L}_{ctr}$$

Lctr is obtained using the following equation:

$$\mathcal{L}_{ctr} = \sum_{i}\sum_{j>i} \max\left(0,\ f(Y_j) - f(Y_i) + \lambda_{ij}\right)$$

where Yi and Yj are recipe candidates ranked such that ROUGE(Yi, Y*) > ROUGE(Yj, Y*) for i < j, λij is the margin, and f(Yi) is the estimated probability of candidate Yi.
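The contrastive term is a pairwise margin ranking loss over the candidate list. A minimal sketch, assuming the candidates' scores are already sorted by descending ROUGE and that the margin grows linearly with rank distance, i.e., λij = (j − i)·λ:

```python
def contrastive_loss(scores, margin=0.01):
    """BRIO-style ranking loss over candidate recipes.

    `scores` holds f(Y_i) for each candidate, already sorted so that
    scores[i] belongs to the candidate with the i-th highest ROUGE
    against the reference. Assumes lambda_ij = (j - i) * margin.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss
```

When the model already scores better candidates higher (by at least the margin), the loss is zero; otherwise, each mis-ranked pair contributes a penalty.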
We performed experiments using Google Colab with a Tesla T4 GPU and 16 GB of RAM to create cooking recipes in both English and Vietnamese. Owing to hardware limitations, we used 500,000 recipes randomly extracted from the English RecipeNLG dataset [17]. For the Vietnamese dataset, we expanded the CookyVN-recipe dataset [22] using Selenium3) to crawl data from the Internet and employed Underthesea4) to clean and preprocess the data. We renamed the dataset CookingVN-recipe to make it more familiar to users; it currently contains 77,024 recipes.
The datasets were organized into six columns: titles, servings, ingredients, quantities, instructions (or recipes), and links. The titles, ingredients, and recipes were used to train the models. Each dataset was divided into three parts: 75% for training, 8% for validation, and 17% for testing. Transformer [3] and BART [1] were utilized as backbones for generating English recipes, whereas BARTpho [25] was used as a backbone for generating Vietnamese recipes.
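The 75/8/17 split above can be reproduced with a simple shuffle-and-slice, sketched below (the function name and fixed seed are our own illustrative choices):

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split into 75% train / 8% validation / 17% test,
    the proportions used for both datasets in this study."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train = int(0.75 * n)
    n_val = int(0.08 * n)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])
```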
We conducted experiments to generate recipes using the standard Transformer with the following configuration and parameters: 6 encoder layers, 6 decoder layers, 8 attention heads, 5 epochs, a batch size of 4, 4,000 warm-up steps, a learning rate of 0.01, and the Adam optimizer. The maximum lengths of the ingredients and cooking recipes were 120 and 1,024 tokens, respectively. Both Greedy and Beam search were used for recipe generation. We applied a variety of beam sizes (3, 4, and 6) with length normalization of α = 0.6. Table 1 lists the evaluation results of recipe generators using Transformer.
Table 1. Results of recipe generators using Transformer.

| Search Algorithm | Size | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| Greedy | - | 49.76 | 23.78 | 40.06 |
| Beam | 3 | 49.98 | 24.23 | 40.43 |
| Beam | 4 | 49.92 | 24.20 | 40.41 |
| Beam | 6 | 49.86 | 24.24 | 40.43 |
Because the Beam search performed better than the Greedy search, it was used for further experiments. In addition, because of hardware limitations, we experimented with a Beam size of four. Next, we replaced the self-attention layers with re-attention layers in the Encoder, Decoder, and both. The results of the recipe-generation model are presented in Table 2.
Table 2. Results of recipe generators using Transformer and re-attention.

| Transformer models | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Encoder w/ re-attention | 49.21 | 23.96 | 40.15 |
| Decoder w/ re-attention | 49.20 | 23.76 | 39.88 |
| Both w/ re-attention | 48.98 | 23.58 | 39.75 |
Similar to the experiments with Transformer, the BART-based models were first used to create cooking recipes from ingredients using the following parameters: 6 encoder layers, 6 decoder layers, 12 attention heads, 5 epochs, a learning rate of 0.0001, 20,000 warm-up steps, a batch size of 4, and the Adam optimizer.
Next, we modified these backbones by replacing the self-attention layers with re-attention layers. Instead of directly using the attention matrix for self-attention, re-attention uses a transformation matrix to mix the attention matrices from different heads in the same self-attention layer. The transformation matrix is a parameter learned during training.
The backbone generation models were fine-tuned to generate recipe candidates, which were subsequently used to improve the quality of the recipe-generation models in the next iteration. In particular, the BART model was fine-tuned to create N = 6 recipe candidates (Beam size of 6) to ensure the diversity of the candidates. The candidates were sorted in descending order based on the average F1-scores for ROUGE-1, ROUGE-2, and ROUGE-L. These recipe candidates were iteratively used to enhance the recipe-generation model. Experiments conducted by Liu et al. [7] demonstrated that their abstractive summarization model achieved the best performance after two fine-tuning steps. Therefore, the models were fine-tuned twice.
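Before each BRIO fine-tuning round, the N = 6 candidates are sorted by their average ROUGE F1 against the reference. The ranking step can be sketched as below; to keep the sketch self-contained, a toy unigram-overlap F1 stands in for the actual average of ROUGE-1, ROUGE-2, and ROUGE-L:

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Toy stand-in for a ROUGE F1 score (the study averages the F1-scores
    of ROUGE-1, ROUGE-2, and ROUGE-L; a unigram-overlap F1 is used here
    only to keep the example dependency-free)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def rank_candidates(candidates, reference):
    """Sort generated recipe candidates in descending order of their score
    against the reference, as done before each BRIO fine-tuning round."""
    return sorted(candidates, key=lambda c: unigram_f1(c, reference),
                  reverse=True)
```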
The backbone model was fine-tuned using MLE and label smoothing [26]. The label-smoothing parameter was 0.1 for the RecipeNLG dataset and 0.01 for the CookingVN-recipe dataset. The output of the model was used to compute the cross-entropy loss Lxent and the contrastive loss Lctr with a margin λ of 0.01. The final loss L of the model is the sum of Lxent and Lctr multiplied by their corresponding weights, which are 0.1 and 10 in our experiments, respectively. The final loss L was then backpropagated through the model. After a sufficient number of data samples had passed through the model (gradient-accumulation steps of 8), an optimizer update was performed and recorded in an update-step variable. Whenever the update-step variable reached the evaluation interval (=1,000), the entire model was evaluated in two roles:
• Scoring role: The model was evaluated based on its ability to score candidates created using the pretrained recipe-generation model and select the final output.
• Generation role: The model was evaluated based on its ability to generate cooking recipes. The model was trained using two loss functions such that Lxent helped enhance token generation to create complete recipes, and Lctr helped select the best recipes.
If the results improved, the model replaced the previously saved model; otherwise, it was ignored.
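The gradient-accumulation and evaluation-interval bookkeeping described above can be sketched as follows (function and variable names are our own; the forward/backward pass is elided as a comment):

```python
def training_schedule(num_batches, accum_steps=8, eval_interval=1000):
    """Sketch of the update/evaluation bookkeeping: one optimizer update
    every `accum_steps` batches, and a full evaluation (scoring role and
    generation role) every `eval_interval` updates."""
    update_step = 0
    events = []
    for batch in range(1, num_batches + 1):
        # ... forward pass, L = 0.1 * L_xent + 10 * L_ctr, backward pass ...
        if batch % accum_steps == 0:
            update_step += 1
            events.append(('update', update_step))
            if update_step % eval_interval == 0:
                events.append(('evaluate', update_step))
    return events
```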
Table 3 displays the results of the proposed models. The experiments show that the re-attention mechanism slightly improves the performance of recipe-generation models. In particular, when using BART and BARTpho as backbones, the ROUGE-L scores of the BRIO-based models with re-attention increase by 3.61% and 3.38%, respectively, compared with the BRIO-based models without re-attention. Cooking recipe generation is a text generation task whose models are typically trained using maximum likelihood estimation, that is, a deterministic distribution. The BRIO training technique, which achieved state-of-the-art results in abstractive summarization, also enhances recipe-generation performance. Furthermore, the BART-based recipe-generation models with a re-attention mechanism trained using BRIO perform the best. Lam et al. [27] reported decreased performance in abstractive summarization using ViT5 trained with BRIO. However, ViT5 trained using a common technique outperforms our BARTpho-based models. Therefore, we will explore ViT5-based models for cooking recipe generation in future studies.
Table 3. Results of the proposed recipe generators.

| Model | Dataset used | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| Coverage loss and reinforcement learning [20] | Food.com | 31.00 | 7.00 | 32.40 |
| RecipeGPT [16] | Recipe1M+ | - | - | 37.00 |
| RecipeNLG [17] | RecipeNLG | - | 13.54 | - |
| RecipeMC [18] | RecipeNLG | 50.50 | 24.20 | - |
| Chef Transformer | RecipeNLG | - | 24.70 | - |
| BART | RecipeNLG | 50.24 | 26.22 | 39.81 |
| BART w/ re-attention | RecipeNLG | 51.27 | 26.27 | 40.12 |
| BART-BRIO | RecipeNLG | 54.86 | 27.43 | 43.80 |
| BART w/ re-attention and BRIO | RecipeNLG | 56.25 | 26.36 | 45.38 |
| ViT5 [22] | CookyVN-recipe | 64.45 | 35.92 | 38.21 |
| BARTpho | CookingVN-recipe | 56.49 | 29.72 | 34.09 |
| BARTpho w/ re-attention | CookingVN-recipe | 56.91 | 29.34 | 35.72 |
| BARTpho-BRIO | CookingVN-recipe | 58.41 | 27.88 | 36.38 |
| BARTpho w/ re-attention and BRIO | CookingVN-recipe | 59.75 | 28.12 | 37.61 |
Table 4 shows the titles, ingredients, and recipes created by the different models from the following input ingredients: cocoa powder, icing sugar, coconut, butter, wheat biscuits, and eggs. Recipes generated using BART tend to be short and simple, whereas those created using BART with re-attention are more detailed. The BRIO-based recipe-generation model produced the longest and most detailed cooking steps. However, these models may omit certain ingredients when generating recipes. In particular, "wheat biscuits" and "egg white" are not used in the recipes for "Biscuit Balls" and "Cocoa Biscuits," respectively.
Table 4. Cooking recipes created from the input ingredients cocoa powder, icing sugar, coconut, butter, wheat biscuits, and egg (in the original article, each input ingredient is colored for presentation purposes only).

| Models | Titles | Ingredients | Recipes |
|---|---|---|---|
| BART | Biscuit Balls | 1/2 cup cocoa powder, 1/4 cup icing sugar, 2 tablespoons coconut, 3 tablespoons butter melted, 4 wheat biscuits crushed, egg beaten | Mix cocoa, icing sugar, coconut and butter. Roll into balls. Roll in beaten egg and place on a greased baking tray. Bake at 350°F for 10–12 minutes or until golden brown. Serve warm. |
| BART-BRIO | Cocoa Biscuits | 1/2 cup cocoa powder, icing sugar, 1/4 cup coconut, 4 wheat biscuits, egg white | Preheat oven to 350°F. In a small bowl, mix together the cocoa powder, icing sugar and coconut. Set aside. Cut each wheat biscuit in half lengthwise. Scoop out the insides of the biscuits, leaving a 1/2-inch-thick shell. Fill the shells with the cocoa mixture. Place the shells on a baking sheet and bake for 15–20 minutes or until golden brown. Remove from the oven and allow to cool on the baking sheet for 5 minutes before removing to a wire rack to cool completely. |
| BART w/ re-attention | Chocolate Coconut Crumble | 1/2 cup cocoa powder, 1/4 cup icing sugar, 2 tablespoons coconut, 3 tablespoons butter, 4 wheat biscuits, egg | Preheat oven to 350°F. Grease a baking tray. In a small bowl, mix cocoa powder, icing sugar, coconut and butter. Set aside. In a large bowl, combine biscuits and egg. Stir in cocoa mixture. Transfer to prepared tray. Bake for 20–25 minutes or until golden brown. Serve warm. |
| BART w/ re-attention and BRIO | Coconut Wheat Biscuits | 1/2 cup cocoa powder, 1/4 cup icing sugar, 2 tablespoons butter, 4 wheat biscuits, egg white | Preheat oven to 350°F. In a small bowl, combine cocoa powder, icing sugar, butter and egg white; mix well. Dip biscuits in cocoa mixture, then roll in coconut mixture. Place on a baking sheet lined with parchment paper. Bake for 10–12 minutes or until golden brown. |
Interestingly, BRIO-based models with re-attention can use new ingredients that are not among the input ingredients to make the recipes complete, and they seem to balance both the detail and the nature of the cooking recipes created. For example, given the input ingredients {chicken breast, salad dressing, breadcrumbs, mozzarella cheese, parmesan}, the new ingredient "spaghetti" is introduced to cook "Chicken Parmesan." The generated ingredients include 1 lb boneless skinless chicken breast, 1/2 cup salad dressing, 2 cups breadcrumbs, shredded mozzarella cheese, grated parmesan (parmigiano-reggiano) cheese, and 3/4 cup cooked spaghetti broken into bite-sized pieces. The recipe is "Preheat oven to 350 degrees. Spray a 9×13 inch baking dish with nonstick cooking spray. Place chicken in baking dish. Pour salad dressing over chicken. Sprinkle with bread crumbs and cheese. Bake for 30 minutes or until chicken is no longer pink in the center and crumbs are lightly browned. Serve with spaghetti". We observed that some input ingredients may not be present in the generated ingredients but may appear in the cooking recipes created using BRIO-based models with re-attention.
End-to-end cooking recipe generation systems that use generation models encounter several challenges. First, as mentioned previously, these models may introduce new ingredients or omit some input ingredients. Second, the generated recipes may include irrelevant cooking steps. For example, similar to the findings of Lam et al. [22], our models occasionally produce recipes that simply place cleaned fish or beef into a pot without mentioning the necessary preceding steps, such as cutting or seasoning. Moreover, a certain cooking step may use products from previous steps. To address this issue, cooking steps can be converted into an action-sequence planning problem, as suggested by Pareek et al. [28]. Third, people usually decide on cooking methods, such as boiling, frying, or grilling, before looking for detailed cooking instructions. A limitation of the proposed approach is that the models do not pay much attention to generating food titles that match the cooking instructions. Therefore, creating food names that reflect cooking methods must be carefully considered. Finally, people tend to prefer specific cooking methods and have personal tastes (e.g., low sugar, low salt, or avoiding oil). To generate recipes that meet user demands, recipe-generation systems should be able to analyze user food tastes, for example, based on recipe ratings [29] or additional user requirements [30].
In future work, we will develop a multimodal recipe-generation model to generate recipes that not only meet user demands, but also provide logical cooking steps. Furthermore, we will employ an ingredient understanding model to identify the main ingredients and understand user requirements regarding ingredients to use a food name generation model to produce food titles, which include information about a specific cooking method in an obvious or creative manner; and a taste prediction model to predict the user-preferred food flavor. The outputs of these models will be used to enhance the recipe-generation model.
We present an overview of cooking recipe generation models and develop models that use Transformer and BART as backbones. We expanded the CookingVN-recipe dataset to 77,024 recipes. The highlighted contributions of our work include integrating the re-attention mechanism into cooking recipe generation models to enhance recipe quality and employing the BRIO paradigm to train the recipe-generation models for optimal performance. The BRIO-based recipe-generation models performed better than the existing models. In addition, we analyzed the drawbacks of generation-based cooking recipe models and discussed approaches to address them. Our study not only enhances cooking recipe generation from text-based ingredients but also has the potential to be integrated with recipe-generation models based on food images: the proposed models can be employed in the recipe-generation phase after the ingredients are predicted from the images.