Journal of information and communication convergence engineering 2024; 22(2): 98-108
Published online June 30, 2024
https://doi.org/10.56977/jicce.2024.22.2.98
© Korea Institute of Information and Communication Engineering
Correspondence to: Seog Chung Seo (E-mail: scseo@kookmin.ac.kr)
Department of Financial Information Security, Kookmin University, Seoul 02707, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Scrypt is a password-based key derivation function with a memory-hard structure proposed by Colin Percival in 2009. Scrypt was intentionally designed with a memory-intensive structure to make password cracking using ASICs, GPUs, and similar hardware more difficult. However, in this study, we thoroughly analyzed the operation of Scrypt and proposed strategies to maximize computational parallelism in GPU environments. Through these optimizations, we achieved an outstanding performance improvement of 8284.4% compared with traditional CPU-based Scrypt computation. Moreover, the GPU-optimized implementation presented in this paper outperforms simple GPU-based Scrypt processing by a significant margin, providing a performance improvement of 204.84% on an RTX 3090. These results demonstrate the effectiveness of our proposed approach in harnessing the computational power of GPUs and achieving remarkable performance gains in Scrypt calculations. Our proposed implementation is the first GPU implementation of Scrypt and demonstrates the ability to crack Scrypt efficiently.
Keywords GPU, CUDA, Scrypt, Password-Based Key Derivation Function, Optimization
Password-based key-derivation functions are used for two primary purposes: first, to hash passwords so that an attacker who gains access to a password file does not immediately possess the passwords contained therein; second, to generate cryptographic keys used to encrypt and authenticate data [1]. All currently used key-derivation algorithms are constructed from hash functions against which no pre-image attacks are known, making it nearly impossible to attack the key-derivation function itself directly. Florêncio and Herley [2] found that users choose passwords with an average bit-strength of 40.54 bits. This level of complexity is usually less than that required for cryptographic keys; therefore, brute-force attacks are more likely to succeed.
A graphics processing unit (GPU) is a device that is effective for the parallel processing of computations using many threads. With advancements in GPU performance, general-purpose computing on GPUs (GPGPU), which performs general-purpose computations on GPU devices, has emerged. As a result, compute unified device architecture (CUDA) technology has been developed to expose the computational capabilities of the GPU architecture through existing programming languages [3,4].
Scrypt is a password-based key-derivation function proposed by Percival in 2009 [1]. Originally designed for the online backup service Tarsnap, Scrypt is now mainly used in cryptocurrency applications such as Bither, Bitcoin Core, and Bitcoin Knots wallet files. To defend against brute-force attacks, which have a high success rate against key-derivation algorithms, Scrypt employs a memory-hard structure that requires a large amount of memory for execution. This approach restricts the amount of parallel processing that an attacker can perform using custom hardware or GPU architectures [1].
In recent years, the Bitcoin mining industry’s remarkable growth and lucrative economic incentives have led to significant advancements in application-specific integrated circuit (ASIC) hash units [6]. For adversaries equipped with ASICs, brute-forcing a password database appears feasible, and existing pricing functions do not provide sufficient deterrence while remaining manageable for honest central processing unit (CPU) users. Consequently, there is growing interest in developing ASIC-resistant hash functions and proof-of-work schemes that do not grant ASICs a significant advantage [6]. Numerous studies on ASIC-resistant hash functions and proof-of-work schemes include [1,6-16]. ASICs possess two fundamental advantages over CPUs (or general-purpose GPUs): a smaller area and better energy efficiency when normalized for speed. Achieving ASIC resistance essentially involves reducing these area and energy-efficiency advantages. Prior research on ASIC resistance has primarily followed the memory-hard function approach initially proposed by Percival [1].
This paper presents a methodology for optimizing the ASIC-resistant Scrypt algorithm in a GPU environment to maximize parallel computation. Four different computation models are proposed for Scrypt calculations with p-values of 2 and 4, and the optimal combinations for each p-value are determined through experiments. The proposed optimized Scrypt software offers a remarkable 8284.4% improvement in processing throughput compared with the Scrypt implementation in OpenSSL running on a CPU. Additionally, by leveraging GPU thread parallelization techniques and utilizing internal shared memory, the GPU-optimized Scrypt implementation achieved a 204.84% increase in the processing throughput compared with a simple GPU-based Scrypt implementation. The proposed implementation is the first GPU implementation of Scrypt, demonstrating its ability to efficiently crack Scrypt.
GPUs were originally designed for graphics rendering. With improvements in hardware, a new technology called general-purpose computing on GPUs (GPGPU) emerged, enabling general-purpose computing tasks traditionally handled by CPUs to be processed on GPUs. NVIDIA introduced CUDA technology, which allows GPU operations to be programmed using general-purpose languages such as C/C++/Python [3]. By adapting an algorithm's computation to the GPU architecture through this technology, its performance can be improved.
The GPU architecture comprises numerous threads organized in a hierarchical structure. These threads are designed to perform the same task and are assigned to grids and blocks, with multiple threads forming a block and multiple blocks forming a grid. Each thread is identified by its grid, block, and thread indices. When a kernel function is launched, threads are scheduled and executed in warps of 32 (or 64 on some architectures) [4]. The GPU architecture provides three primary memory areas: global, shared, and constant. Global memory has the largest storage capacity but incurs a significant performance overhead for memory access and storage. Shared memory is accessible by threads within the same block and has a lower access overhead. Constant memory has a relatively smaller storage capacity than global memory, has a lower access overhead, and is a read-only memory area.
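To make the grid/block/thread hierarchy concrete, the following minimal CUDA sketch shows how a thread derives a unique global index from its block and thread indices; the kernel name, array, and launch dimensions are illustrative rather than taken from the paper.

```c
// Minimal CUDA kernel: each thread computes its global index from its
// block index (blockIdx.x) and thread index (threadIdx.x).
__global__ void add_one(int *data, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (gid < n)                                      // guard against excess threads
        data[gid] += 1;
}

// Host-side launch: a grid of 4 blocks, each with 256 threads (8 warps of 32).
// add_one<<<4, 256>>>(d_data, 1024);
```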
Compared with other processor architectures, the GPU incurs a larger performance overhead in memory access and storage speed; therefore, an efficient memory-access strategy is necessary.
Scrypt is a password-based key-derivation function proposed by Colin Percival and is primarily used in Litecoin. The input parameters of Scrypt are listed in Table 1. In Scrypt, the parameter p represents the number of scryptROMix functions executed in parallel. At the time of Scrypt's design, setting r = 8 and p = 1 appeared to yield good results. However, as memory latency and CPU parallelism increase, the optimal values of both r and p are likely to increase as well [1]. In addition, because the scryptROMix computations are independent, using larger values of p increases the computational cost without increasing memory usage [1]. Therefore, in this study, we measured the performance of Scrypt using values larger than those that seemed adequate in 2009, specifically p = 2 and p = 4. Scrypt consists of four main functions: PBKDF2-HMAC-SHA256, Salsa20/8, scryptBlockMix, and scryptROMix. The overall structure of the Scrypt algorithm is illustrated in Fig. 1.
Table 1. Parameters of the Scrypt algorithm
| Parameter | Purpose |
|---|---|
| P | Passphrase, an octet string |
| S | Salt, an octet string |
| N | Memory cost parameter |
| r | Block size parameter |
| p | Parallelization parameter |
| keyLen | Output length of the derived key |
Salsa20/8 is a hash function obtained by reducing Salsa20 to eight rounds; it is not a cryptographic hash function [17]. By reducing the number of rounds, Salsa20/8 becomes less secure but achieves higher performance. It takes a 64-byte input and produces a 64-byte output, and it is used internally by the scryptBlockMix algorithm. The scryptBlockMix function processes each block using Salsa20/8 and XOR operations. After all operations are complete, scryptBlockMix has updated its 2r internal blocks and reorders them so that the even-indexed blocks precede the odd-indexed ones, as shown in Fig. 2. The scryptROMix function is the core component of Scrypt and is designed to create a memory-hard structure. It computes a large number of pseudorandom values, stores them in memory, and accesses them in a random order, as illustrated in Fig. 3.
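As a reference for the structure just described, the following C-style sketch outlines the two loops of scryptROMix as specified in RFC 7914: N sequential scryptBlockMix calls that fill the table V, followed by N data-dependent reads from V. The helper functions scrypt_block_mix() and integerify() are assumed for illustration and are not the paper's implementation.

```c
#include <stdint.h>
#include <string.h>

void scrypt_block_mix(uint8_t *X, uint32_t r);      /* assumed helper */
uint32_t integerify(const uint8_t *X, uint32_t r);  /* assumed helper */

/* Sketch of scryptROMix (RFC 7914 structure).  B holds one 128*r-byte block
 * and V holds N such blocks. */
void scrypt_romix(uint8_t *B, uint8_t *V, uint32_t N, uint32_t r)
{
    uint8_t X[128 * 8];                       /* working block; assumes r <= 8 */
    memcpy(X, B, 128 * r);

    /* Phase 1: store N successive blocks in V (the memory-hard part). */
    for (uint32_t i = 0; i < N; i++) {
        memcpy(&V[(size_t)i * 128 * r], X, 128 * r);
        scrypt_block_mix(X, r);               /* X <- scryptBlockMix(X) */
    }

    /* Phase 2: N pseudorandom, data-dependent reads from V. */
    for (uint32_t i = 0; i < N; i++) {
        uint32_t j = integerify(X, r) % N;    /* index derived from the current X */
        for (uint32_t k = 0; k < 128 * r; k++)
            X[k] ^= V[(size_t)j * 128 * r + k];
        scrypt_block_mix(X, r);
    }
    memcpy(B, X, 128 * r);
}
```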
Scrypt uses PBKDF2-HMAC-SHA256 at the beginning and end of its operation. The scryptROMix function is the key function of the Scrypt algorithm and is executed p times [5]; a code-level sketch of this overall flow is given after Table 2. As shown in Table 2, the scryptROMix function accounts for more than 90% of the performance overhead among all functions in the Scrypt algorithm. Therefore, this study optimizes Scrypt by reducing the performance load of the scryptROMix function.
Table 2. Clock cycles for each function (r = 8, p = 2, N = 1024, derived key length = 64)
| Algorithm | Clock Cycles | Ratio (%) |
|---|---|---|
| PBKDF2 (block creation) | 166,833 | 1.03 |
| scryptROMix | 16,141,658 | 98.58 |
| PBKDF2 (key creation) | 65,009 | 0.39 |
| Total | 16,373,500 | 100 |
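The following C-style sketch summarizes this overall flow (block-creation PBKDF2, p scryptROMix calls, key-creation PBKDF2), using the single-iteration PBKDF2 specified for Scrypt; the function signatures are assumptions for illustration, not the paper's code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed helper signatures for this sketch. */
void pbkdf2_hmac_sha256(const uint8_t *pw, size_t pwlen,
                        const uint8_t *salt, size_t saltlen,
                        uint32_t iterations, uint8_t *out, size_t outlen);
void scrypt_romix(uint8_t *B, uint8_t *V, uint32_t N, uint32_t r);

void scrypt(const uint8_t *pw, size_t pwlen, const uint8_t *salt, size_t saltlen,
            uint32_t N, uint32_t r, uint32_t p, uint8_t *dk, size_t dklen)
{
    uint8_t *B = malloc((size_t)128 * r * p);   /* p blocks of 128*r bytes */
    uint8_t *V = malloc((size_t)128 * r * N);   /* scratch table reused by each call */

    /* PBKDF2 (block creation): one iteration expanding P and S into B. */
    pbkdf2_hmac_sha256(pw, pwlen, salt, saltlen, 1, B, (size_t)128 * r * p);

    /* p independent scryptROMix calls (~98.58% of the cycles in Table 2). */
    for (uint32_t i = 0; i < p; i++)
        scrypt_romix(&B[(size_t)128 * r * i], V, N, r);

    /* PBKDF2 (key creation): one iteration with the updated B as the salt. */
    pbkdf2_hmac_sha256(pw, pwlen, B, (size_t)128 * r * p, 1, dk, dklen);

    free(B);
    free(V);
}
```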
Password-cracking techniques are essential in the field of information security because they highlight the importance of addressing vulnerabilities and protecting sensitive information [18]. Most operating systems and applications use key-derivation functions (KDFs) to convert plain-text passwords into hashed passwords, thus preventing attackers from obtaining them. Because KDFs are one-way functions, brute-force attacks, which test all combinations up to a certain length, are the only way to recover plain-text passwords from hashed passwords [19]. PBKDF2 is a widely used KDF approved by the National Institute of Standards and Technology (NIST), and optimization studies on PBKDF2 exist [20]. Scrypt is another KDF that internally uses the PBKDF2 function. According to [1], Scrypt is 260 times more expensive to attack than PBKDF2; moreover, with parameters suitable for file encryption, attacking Scrypt is 20,000 times more expensive than attacking PBKDF2. In this study, we utilize the optimization techniques presented in [20] for PBKDF2 and introduce a novel optimization method for Scrypt, which is believed to be more secure than PBKDF2. Taking advantage of the unique features of the GPU architecture, we conducted the first optimization of Scrypt on a GPU platform.
The scryptROMix function is the most computationally expensive operation in the Scrypt algorithm; therefore, optimizing it leads to the overall optimization of Scrypt. The input parameter p determines how many times the scryptROMix function is invoked within Scrypt, and each invocation is independent of the others. Therefore, on the GPU architecture, the parallelization code can be configured such that each thread computes one scryptROMix function and each block executes one Scrypt function. In other words, the number of blocks per grid equals the number of Scrypt instances executed simultaneously, and the number of threads per block equals p. The PBKDF2-HMAC-SHA256 function used at the beginning and end of Scrypt can be computed on either the CPU or the GPU. Thus, we compared the following four methods.
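A minimal sketch of the launch configuration implied by this mapping is shown below; the kernel name and its arguments are hypothetical placeholders.

```c
#include <stdint.h>

// Hypothetical kernel: each thread computes one scryptROMix lane.
__global__ void scrypt_kernel(uint8_t *B, uint8_t *V, uint32_t N, uint32_t r);

// One block per Scrypt instance, p threads per block, as described above.
void launch_scrypt(uint8_t *d_B, uint8_t *d_V, uint32_t N, uint32_t r,
                   int num_instances, int p)
{
    dim3 grid(num_instances);   // blocks per grid  = Scrypt instances run concurrently
    dim3 block(p);              // threads per block = p (scryptROMix lanes)
    scrypt_kernel<<<grid, block>>>(d_B, d_V, N, r);
    // blockIdx.x selects the Scrypt instance, threadIdx.x selects its lane.
}
```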
The first method uses the PBKDF2 function on the CPU at both the beginning and end of Scrypt. The overall structure of the first method is illustrated in Fig. 4. Users who have existing Password and Salt values execute the PBKDF2 function on the CPU and obtain the initial input values for scryptROMix from the resulting output. The initial input block values are copied to the GPU using the cudaMemcpy() function, the CUDA API function for transferring data between the CPU and the GPU. Each thread then feeds the corresponding initial block value from the cudaMemcpy() operation into the scryptROMix function using its unique index value threadIdx.x (tid) and obtains the resulting value B'. The cudaMemcpy() function is then used to copy the resulting block B' from the scryptROMix function back to the CPU. The copied result is used as the input to the final PBKDF2 function, which generates the derived key value dkout. This method minimizes computation on the GPU but involves copying a significant amount of memory.
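For illustration, a host-side sketch of this data flow for a single Scrypt instance is given below (PBKDF2 on the CPU at both ends, scryptROMix on the GPU); the PBKDF2 helper and romix_kernel are assumed names, not the paper's code.

```c
#include <stdint.h>
#include <cuda_runtime.h>

// Assumed helpers for this sketch.
void pbkdf2_hmac_sha256(const uint8_t *pw, size_t pwlen, const uint8_t *salt,
                        size_t saltlen, uint32_t iters, uint8_t *out, size_t outlen);
__global__ void romix_kernel(uint8_t *B, uint8_t *V, uint32_t N, uint32_t r);

void scrypt_method1(const uint8_t *pw, size_t pwlen, const uint8_t *salt, size_t saltlen,
                    uint32_t N, uint32_t r, uint32_t p, uint8_t *dkout, size_t dklen,
                    uint8_t *h_B, uint8_t *d_B, uint8_t *d_V)
{
    size_t blen = (size_t)128 * r * p;                            /* p blocks of 128*r bytes */

    pbkdf2_hmac_sha256(pw, pwlen, salt, saltlen, 1, h_B, blen);   /* CPU: block creation */

    cudaMemcpy(d_B, h_B, blen, cudaMemcpyHostToDevice);           /* copy B to the GPU   */
    romix_kernel<<<1, p>>>(d_B, d_V, N, r);                       /* p scryptROMix lanes */
    cudaMemcpy(h_B, d_B, blen, cudaMemcpyDeviceToHost);           /* copy B' back        */

    pbkdf2_hmac_sha256(pw, pwlen, h_B, blen, 1, dkout, dklen);    /* CPU: key creation   */
}
```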
The second method uses the PBKDF2 function on the CPU in the initial step and on the GPU in the final step. The overall structure of the second method is illustrated in Fig. 5. As in the first method, the initial input values for the scryptROMix function are obtained using the PBKDF2 function executed on the CPU, and this process continues until the output values of the scryptROMix function are obtained. However, in the second method, the final PBKDF2 function is executed on the GPU; the remaining steps are the same as in the first method. The scryptROMix output values are then used as input values for the PBKDF2 function at their respective positions to derive the key. In this process, it is important to note that, in all optimization approaches, each thread of every block performs an individual scryptROMix computation rather than executing an entire Scrypt function. Therefore, conditional statements must be employed to ensure that only one PBKDF2 operation occurs in each block. Any flow-control instruction (if, switch, do, for, while) can significantly affect the effective instruction throughput by causing the threads of the same warp to diverge (i.e., follow different execution paths). If this occurs, the different execution paths must be serialized, thereby increasing the total number of instructions executed for that warp. To obtain the best performance when the control flow depends on the thread ID, the control condition should be written to minimize the number of divergent warps [4]. After all scryptROMix operations are completed, the __syncthreads() function is called so that all threads are synchronized before the PBKDF2 operation proceeds. The __syncthreads() function synchronizes all threads within a block; although it is crucial for coordinating threads, it may introduce a slight performance penalty [4]. Finally, the derived key value is transferred from the GPU to the CPU via memory copying. This method involves less memory copying between the CPU and GPU than the first method; however, it requires the __syncthreads() function and conditional statements.
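A minimal device-side sketch of this guard-and-synchronize pattern is shown below; dev_scrypt_romix() and dev_pbkdf2() are assumed device helpers, and per-instance addressing details are omitted.

```c
#include <stdint.h>

// Assumed device helpers (not the paper's code).
__device__ void dev_scrypt_romix(uint8_t *B, uint8_t *V, int instance, int lane);
__device__ void dev_pbkdf2(uint8_t *B, uint8_t *dk, int instance);

// Second method: every thread runs one scryptROMix, then a single thread
// per block runs the final PBKDF2.
__global__ void scrypt_method2_kernel(uint8_t *B, uint8_t *V, uint8_t *dk)
{
    dev_scrypt_romix(B, V, blockIdx.x, threadIdx.x);  // one lane per thread

    __syncthreads();                 // wait until all p scryptROMix lanes finish

    // Exactly one PBKDF2 per block; the condition depends only on the thread
    // ID, which keeps the number of divergent warps minimal [4].
    if (threadIdx.x == 0)
        dev_pbkdf2(B, dk, blockIdx.x);
}
```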
The third method uses the PBKDF2 function on the GPU for the initial operation and on the CPU for the final operation. The overall structure of the third method is illustrated in Fig. 6. The Password and Salt values are copied from the CPU to the GPU using the cudaMemcpy() function, and the PBKDF2 operation is then performed to obtain the initial input block for Scrypt. As in the final PBKDF2 operation of the second method, a conditional statement is used to ensure that only one operation is performed per block. To prevent the operation sequences from interleaving due to the concurrent nature of the GPU architecture, the __syncthreads() function must be used; this ensures that all threads within a block synchronize their execution before proceeding. The computed results are then assigned to the scryptROMix function as inputs according to the specific thread numbers, as shown in Fig. 6. The blocks of B' obtained from the scryptROMix function are then sent back from the GPU to the CPU through memory copying. On the CPU side, the obtained blocks of B' are used as inputs to the last PBKDF2 function to derive the derived key. Like the second method, this method copies less memory between the CPU and GPU than the first method; however, it requires the __syncthreads() function and conditional statements.
The fourth method performs all PBKDF2 computations in the GPU environment. The overall structure of the fourth method is illustrated in Fig. 7. As in the third method, the Password and Salt values are copied to the GPU, and the GPU executes the initial PBKDF2 function to obtain the input values for the scryptROMix function and then computes the scryptROMix results. Subsequently, as in the second method, the resulting values from the scryptROMix function are used as inputs to the PBKDF2 function, corresponding to their respective positions, to derive the key. To prevent race conditions, the __syncthreads() function is used between each PBKDF2 computation and the scryptROMix function, and each PBKDF2 function is guarded by a conditional statement. Finally, the derived key is copied from the GPU to the CPU to complete the process. This method requires the least amount of memory copying between the CPU and GPU; however, it includes two conditional statements and two __syncthreads() calls.
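Combining the pieces above, a skeleton of the fourth method's kernel might look as follows; the device helpers are hypothetical, and reading Password/Salt from constant memory (discussed later) is an assumption of this sketch.

```c
#include <stdint.h>

// Assumed device helpers; Password and Salt are assumed to be read from
// constant memory inside the PBKDF2 helpers (see the constant-memory discussion).
__device__ void dev_pbkdf2_blocks(uint8_t *B, int instance);                  // block creation
__device__ void dev_scrypt_romix(uint8_t *B, uint8_t *V, int instance, int lane);
__device__ void dev_pbkdf2_key(const uint8_t *B, uint8_t *dk, int instance);  // key creation

// Fourth method: both PBKDF2 steps and scryptROMix run on the GPU.
__global__ void scrypt_method4_kernel(uint8_t *B, uint8_t *V, uint8_t *dk)
{
    if (threadIdx.x == 0)                        // first guarded PBKDF2
        dev_pbkdf2_blocks(B, blockIdx.x);
    __syncthreads();                             // B must be complete before ROMix

    dev_scrypt_romix(B, V, blockIdx.x, threadIdx.x);  // one lane per thread
    __syncthreads();                             // all lanes must finish before PBKDF2

    if (threadIdx.x == 0)                        // second guarded PBKDF2
        dev_pbkdf2_key(B, dk, blockIdx.x);
}
```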
Each method differs in the amount of memory copying, the presence of conditional statements, and the use of the __syncthreads() function. The impact of each factor varies depending on the environment; therefore, to determine the best method, all four methods must be compared in the same environment. In this study, we aimed to obtain optimal results within a given parameter space by fixing the size of each parameter and varying the number of Scrypt algorithms operated simultaneously, which corresponds to the number of blocks per grid.
In the GPU architecture, there are three types of memory regions: global, shared, and constant. The global memory region provides the largest storage capacity, but has a higher performance overhead in terms of access and storage compared to other memory regions. To optimize the Scrypt algorithm with a memory-hard structure, it is important to minimize the usage of global memory and maximize the utilization of memory regions such as shared memory and constant memory, which have relatively higher performance and efficiency in access and storage.
Shared memory is a memory region shared among threads within the same block. It has a smaller storage capacity than global memory but offers faster access. To optimize memory usage, it is important to identify frequently accessed portions of the code and allocate them to shared memory. The GPU used in this study was a GeForce RTX 3090, which has a maximum shared-memory size of 48 KB per block. The arrays used in the scryptROMix function and their corresponding memory sizes are listed in Table 3. The parts of the scryptROMix function that are heavily accessed by each thread and whose memory usage is below 48 KB are the X and T arrays used when generating a new block with scryptBlockMix inside scryptROMix. Thus, we placed the X and T arrays in shared memory. When using shared memory, it is important to minimize bank conflicts, which force threads to run serially, and to write code in a bank-conflict-free manner. In the GPU architecture, shared memory is divided into equally sized memory banks; according to NVIDIA's official technical documentation, the RTX 3090 has 32 banks [4]. Different banks can be accessed simultaneously, but if different threads of a warp access the same bank, the accesses are serialized, which diminishes the parallelism of the GPU architecture and reduces performance. Therefore, to eliminate bank conflicts in shared memory, we pad the arrays X and T in a manner similar to the example shown in Fig. 8. The array sizes are specified in the 4th and 5th lines of Algorithm 1 to ensure that each thread accesses a padded version of X and T, which effectively mitigates bank conflicts. Furthermore, as shown in line 6 of Algorithm 2, we use tid and Blen to perform indexing that aligns with the padding scheme in Fig. 8, ensuring that threads access the appropriate elements in the padded X and T arrays. A simplified sketch of this padding and indexing scheme is given after Table 3.
Table 3. Memory for each array and number of accesses
| Array | Memory (bytes) | Number of array accesses |
|---|---|---|
| B | 128 * r * p | 256 * r * p |
| V | 128 * N * r * p | ((66r+1) * N + 64r + 1) * p |
| X | 128 * r * p | ((34r+1) * N + 32 * r) * p |
| T | 128 * r * p | (64r+1) * N * p |
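As a rough illustration of the padding idea (the exact layout in Fig. 8 and Algorithms 1-2 may differ), the sketch below gives each thread its own Blen-word slice of X and T in shared memory plus one padding word, so that threads with consecutive tid values start in different banks; the kernel and parameter names are hypothetical.

```c
#include <stdint.h>

#define BLEN  (128 * 8 / 4)   /* 32-bit words per 128*r-byte block, assuming r = 8   */
#define P_MAX 4               /* threads per block (scryptROMix lanes), assuming p = 4 */

__global__ void romix_shared_kernel(const uint32_t *gB)
{
    /* One extra word per thread shifts each slice by one bank, so the lanes
     * no longer collide on the same bank for a given word offset. */
    __shared__ uint32_t X[P_MAX * (BLEN + 1)];
    __shared__ uint32_t T[P_MAX * (BLEN + 1)];

    int tid = threadIdx.x;
    uint32_t *myX = &X[tid * (BLEN + 1)];   /* tid- and Blen-based indexing */
    uint32_t *myT = &T[tid * (BLEN + 1)];

    for (int k = 0; k < BLEN; k++)          /* load this lane's block from global memory */
        myX[k] = gB[tid * BLEN + k];

    /* scryptBlockMix would operate on myX and myT here (omitted in this sketch). */
    myT[0] = myX[0];
}
```

With a stride of BLEN + 1 words (257, which is coprime to the 32 banks), word k of different lanes falls into different banks, whereas an unpadded stride of 256 words would map every lane's word k to the same bank.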
Constant memory is a memory region with a smaller storage capacity than global memory but faster access; it is read-only and ideal for storing values that remain constant throughout program execution. The constant memory region available on the RTX 3090 is 64 KB. The Password and Salt values used in Scrypt remain constant throughout the computation. If 2048 Scrypt instances execute concurrently, the combined size of the Password and Salt available to each individual instance is limited to a maximum of 32 bytes. Therefore, storing the Password and Salt values in the constant memory region can improve performance by taking advantage of its high access speed. To store the Password and Salt values in the read-only constant memory region, the cudaMemcpyToSymbol() function is used on the CPU side. This function copies data from the CPU to a constant memory area on the GPU and offers higher data-copy speeds than the cudaMemcpy() function. By storing Password and Salt in constant memory, the memory-copy performance of the 2nd to 4th approaches, in which Password and Salt were previously copied from the CPU to the GPU, can be improved. Additionally, in the 2nd to 4th approaches, where PBKDF2 functions run on the GPU, accesses to the global memory region within the PBKDF2 functions can be minimized.
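A minimal sketch of this use of constant memory is shown below; the array names, the per-instance split of the 32-byte budget (16-byte password and 16-byte salt for 2048 instances), and the helper function are assumptions.

```c
#include <stdint.h>
#include <cuda_runtime.h>

#define MAX_INSTANCES 2048
#define PW_BYTES      16      /* assumed split of the 32-byte per-instance budget */
#define SALT_BYTES    16

/* Read-only constant-memory regions filling the 64-KB constant area. */
__constant__ uint8_t c_password[MAX_INSTANCES * PW_BYTES];
__constant__ uint8_t c_salt[MAX_INSTANCES * SALT_BYTES];

void upload_inputs(const uint8_t *h_pw, const uint8_t *h_salt, int n)
{
    /* cudaMemcpyToSymbol copies host data into the GPU constant region. */
    cudaMemcpyToSymbol(c_password, h_pw,   (size_t)n * PW_BYTES);
    cudaMemcpyToSymbol(c_salt,     h_salt, (size_t)n * SALT_BYTES);
}
```

Device code (for example, the on-GPU PBKDF2 helpers) can then index c_password and c_salt by blockIdx.x without touching global memory.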
The combined code for all the approaches is presented in Algorithms 1 and 2.
The implementation was developed using Visual Studio 2019 (Release, x64), CUDA runtime version 10.2, and an RTX 3090. Performance was measured while adjusting the number of concurrently running Scrypt instances and the value of p. All input values except p were fixed at r = 8, N = 1024, and dkLen = 64. In the initial proposal for Scrypt, it was believed that setting p = 1 would yield the best results. However, as memory latency and CPU parallelism have increased over time, it has become apparent that the optimal values of both r and p are likely to increase as well [1]. In other words, the understanding of the algorithm evolved, and higher values of both r and p may be more suitable for achieving better performance and security in light of the changing computational landscape. Therefore, in this study, we used p-values of 2 and 4.

The performance of each optimization method and of the simple GPU porting approach when p is set to 2 is shown in Fig. 9. Among the optimization methods, the fourth approach demonstrated the highest performance, with a 204.84% improvement over the simple GPU porting approach for p = 2. The performance of each optimization method and of the simple GPU porting approach when p is set to 4 is shown in Fig. 10. Among these, the second approach demonstrated the highest performance, showing a 168.14% improvement over the simple GPU porting approach for p = 4. Furthermore, compared with the performance on a CPU (Ryzen 9 5900X), we achieved remarkable improvements of 8284.4% for p = 2 and 8122.5% for p = 4. The CPU code used for the measurements is based on OpenSSL 1.1.1s [21]. We also observed throughputs of 2618.32 Scrypt/s for p = 8 and 705.75 Scrypt/s for p = 16.

We also used the same parameters as [22] for the performance measurements, and the results are listed in Table 4. In [22], Le et al. compared the performance of a Scrypt port on an RTX 3090 with their optimization scheme. Although GPU optimization is not the main focus of [22], we carried out this comparison to show the differences between our method and other GPU porting methods. The parameters used in [22] are passlen = 1024 bits, Saltlen = 1024 bits, N = 1024, r = 1, p = 1, and dkout = 32. Because the key to our optimization approach lies in cases where p is greater than 1, we observe only an 18.45% performance improvement, which is not a significant difference. Because [22] did not measure performance for different values of p, an exact comparison is difficult; however, we expect the performance difference to become more pronounced as p increases.

Based on these results, when p is small, indicating a smaller thread size, the fourth method demonstrates the highest performance; conversely, when p exceeds a certain threshold, the second approach performs best. From these findings, it can be inferred that as the thread size increases, the performance overhead from conditional statements and __syncthreads() calls becomes more significant than the overhead from the increased amount of memory copying.
In this study, we optimized the Scrypt algorithm in a GPU environment using various approaches. We used the characteristics of each GPU memory region and modified the processing method such that one block handles one Scrypt, with each thread performing the scryptROMix function, which is a core function of the Scrypt algorithm. As a result, when 2048 instances of Scrypt were executed simultaneously, we achieved performance improvements of 204.84% for p = 2 and 168.14% for p = 4 compared with simple GPU porting. Moreover, when compared to the CPU performance, we observed remarkable improvements of 8284.4% for p = 2 and 8122.5% for p = 4. This is the first optimization of Scrypt in a GPU environment. In future work, we plan to extend our optimization techniques to other password-based KDFs in a GPU environment.
This work was supported by the National Research Foundation of Republic of Korea (NRF) grant funded by MSIT (No. 2022R1C1C1013368, 100%).
SeongJun Choi
is currently pursuing the B.S. degree in the Department of Information Security, Cryptology, and Mathematics, Kookmin University. His research interests include the efficient implementation of cryptographic hash functions in GPU environments.
DongCheon Kim
received the B.S. degree from the Department of Information Security, Cryptology, and Mathematics, Kookmin University, where he is currently pursuing the master’s degree in financial information security. His research interests include parallel programming environments like GPUs, post-quantum cryptography, and cryptographic module validation programs.
Seog Chung Seo
received the Ph.D. degree from Korea University, Seoul, Korea, in 2011. He worked as a research staff member at Samsung Electronics from September 2011 to April 2014 and was a senior researcher at the Affiliated Institute of ETRI from 2014 to 2018. He is currently an associate professor at Kookmin University, South Korea. His research interests include public-key cryptography and its efficient implementation on various IT devices, cryptographic module validation programs, network security, and data authentication algorithms.