Longest Duplicate Substring
LeetCode Problem #1044 (Hard)
Problem Statement
Given a string S
, consider all duplicated substrings: (contiguous) substrings of S that occur 2 or more times. (The occurrences may overlap.)
Return any duplicated substring that has the longest possible length. (If S
does not have a duplicated substring, the answer is ""
.)
Testcases
Problem Explanation
In a particular string, there are several substrings which can be formed using the given arrangement of letters. The problem specifies that the substrings are contiguous. In this context, this means that the letters of each substring occur in a contiguous manner, rather than substrings themselves occurring in a contiguous manner. These substrings may overlap with other substrings and they may also occur multiple times at any place within the original string.
For instance, in the given example, the substring "ana"
occurs twice within the original string, "banana"
. There is no other substring with a greater length which occurs more than two times within the string.
In another example, we can see that the substring "xyz"
occurs twice within the original string, "abcxyzdefxyz"
. The letters in the substring are contiguous but the substrings themselves are not required to be sequential in occurrence.
We must find one such substring which is of the greatest possible length and also occurs a minimum of twice within the original string.
Prerequisite Knowledge
Substring
A substring is a contiguous sequence of characters that can be formed from a string, where the order of the characters matter. This is different from a subsequence, where the characters can be found anywhere in the string but must retain relative sequential order.
In order for a sequence of characters to be considered a substring of a string, all the letters of the substring must be found in the same arrangement in the original string.
Example
Let us consider the string "coding"
.
The sequence of elements in this string is as follows:
The substrings of this string must have characters that are found in the same consecutive sequential arrangement as in the original string.
Correct Substring: "cod"
Incorrect Substring: "cng"
To further understand the difference a substring and a subsequence, refer to the Prerequisite Knowledge section in Is Subsequence:
Rabin-Karp Algorithm
The Rabin-Karp Algorithm is a pattern-matching algorithm which uses the concept of hashing to efficiently filter the characters that do not match within the initial phase itself, instead of traversing through the entire array of characters, unlike naive string matching algorithms.
Naive String Matching
In the naive approach to string matching, during each iteration, the substring that is being searched slides over each individual character to check if the characters at the shifted positions match the characters in the substring.
The time complexity of this approach is O(m*n)
where m
is the length of the original string and n
is the length of the string.
Rabin-Karp String Matching
In the Rabin-Karp approach, we assign an arbitrary weight value to each character in the original string. We must consider three values here.
m
-> The length of the original stringn
-> The length of the substringd
-> Any suitable value, conventionally taken as the number of characters present in the input set (original string).
Using a hash formula, we can calculate the value of a particular substring. The hash value of this substring is compared with that of a substring of the same length from the original string. Only if the hash values match, the character-by-character comparison is performed. Otherwise, the substring slides over to the next subsequent window of the same length as the substring. This is also known as a rolling hash.
The time complexity of this algorithm is O(m+n)
, which is significantly better than the time complexity of the naive approach.
Sources
For more detailed explanations of how this algorithm works, refer to:
Algorithm
Rabin-Karp/Binary Search Approach
Longest Duplicate Substring
Pre-compute the required powers of 26 modulus a prime number for the purpose of calculating with single precision arithmetic.
During each iteration, check if the substring with length mid is found in the original string using the rabin_karp() helper function.
Perform binary search to check if there is a longer substring which is duplicated in the original string.
Rabin-Karp Helper Function
Compute the hash value for the first "length" characters.
Store the hash value in a vector.
Using a sliding window, matain the hash values of previous substrings to be able to reuse them in subsequent iterations.
Compare the hash value of the substring to match with the hash value of the window in the original string.
a. If the hash values match, do a character-by-character comparison to check if the substrings match.
b. If a match does not exist in that window, add the hash value to the map.
Return the longest substring that matches, or an empty string if no substring matches.
Code
Last updated
Was this helpful?