Longest Duplicate Substring

LeetCode Problem #1044 (Hard)

Problem Statement

Given a string S, consider all duplicated substrings: (contiguous) substrings of S that occur 2 or more times. (The occurrences may overlap.)

Return any duplicated substring that has the longest possible length. (If S does not have a duplicated substring, the answer is "".)

Note:

2 <= S.length <= 10^5

S consists of lowercase English letters.

Testcases

Example 1:

Input: "banana"
Output: "ana"

Example 2:

Input: "abcd"
Output: ""

Problem Explanation

In a particular string, there are several substrings which can be formed using the given arrangement of letters. The problem specifies that the substrings are contiguous. In this context, this means that the letters of each substring occur in a contiguous manner, rather than substrings themselves occurring in a contiguous manner. These substrings may overlap with other substrings and they may also occur multiple times at any place within the original string.

For instance, in the given example, the substring "ana" occurs twice within the original string, "banana". There is no other substring with a greater length which occurs more than two times within the string.

Length = 3

"banana"
 "ana"
   "ana"

In another example, we can see that the substring "xyz" occurs twice within the original string, "abcxyzdefxyz". The letters in the substring are contiguous but the substrings themselves are not required to be sequential in occurrence.

Length = 3

"abcxyzdefxyz"
   "xyz"
         "xyz"

We must find one such substring which is of the greatest possible length and also occurs a minimum of twice within the original string.

Prerequisite Knowledge

Substring

A substring is a contiguous sequence of characters that can be formed from a string, where the order of the characters matter. This is different from a subsequence, where the characters can be found anywhere in the string but must retain relative sequential order.

In order for a sequence of characters to be considered a substring of a string, all the letters of the substring must be found in the same arrangement in the original string.

Example

Let us consider the string "coding".

The sequence of elements in this string is as follows:

c o d i n g
1 2 3 4 5 6

The substrings of this string must have characters that are found in the same consecutive sequential arrangement as in the original string.

Correct Substring: "cod"

c o d
1 2 3 (consecutive sequential order)

Incorrect Substring: "cng"

c n g
1 5 6 (relative sequential order)

Although the characters in the incorrect substring maintain relative sequential order, they are not consecutive to one another as present in the original string.

To further understand the difference a substring and a subsequence, refer to the Prerequisite Knowledge section in Is Subsequence:

Is Subsequence

Rabin-Karp Algorithm

The Rabin-Karp Algorithm is a pattern-matching algorithm which uses the concept of hashing to efficiently filter the characters that do not match within the initial phase itself, instead of traversing through the entire array of characters, unlike naive string matching algorithms.

Naive String Matching

In the naive approach to string matching, during each iteration, the substring that is being searched slides over each individual character to check if the characters at the shifted positions match the characters in the substring.

String: "abcabdabe", Substring: "cab"

abcabdabe
cab      -> no match
 cab     -> no match
  cab    -> match

The time complexity of this approach is O(m*n) where m is the length of the original string and n is the length of the string.

Rabin-Karp String Matching

In the Rabin-Karp approach, we assign an arbitrary weight value to each character in the original string. We must consider three values here.

m -> The length of the original string
n -> The length of the substring
d -> Any suitable value, conventionally taken as the number of characters present in the input set (original string).

Using a hash formula, we can calculate the value of a particular substring. The hash value of this substring is compared with that of a substring of the same length from the original string. Only if the hash values match, the character-by-character comparison is performed. Otherwise, the substring slides over to the next subsequent window of the same length as the substring. This is also known as a rolling hash.

The time complexity of this algorithm is O(m+n), which is significantly better than the time complexity of the naive approach.

Sources

For more detailed explanations of how this algorithm works, refer to:

Algorithm

Rabin-Karp/Binary Search Approach

Longest Duplicate Substring

Pre-compute the required powers of 26 modulus a prime number for the purpose of calculating with single precision arithmetic.
During each iteration, check if the substring with length mid is found in the original string using the rabin_karp() helper function.
Perform binary search to check if there is a longer substring which is duplicated in the original string.

Rabin-Karp Helper Function

Compute the hash value for the first "length" characters.
Store the hash value in a vector.
Using a sliding window, matain the hash values of previous substrings to be able to reuse them in subsequent iterations.
Compare the hash value of the substring to match with the hash value of the window in the original string.
a. If the hash values match, do a character-by-character comparison to check if the substrings match.
b. If a match does not exist in that window, add the hash value to the map.
Return the longest substring that matches, or an empty string if no substring matches.

Time Complexity: O(n+m)

Code

Global Variables

int prime = 19260817;
string res;
vector<int> power;

Longest Duplicate Substring

string longestDupSubstring(string S) {
    string res = "";
    int len = S.length();
    power = vector<int>(len, 1);
	  // For the hash calculation, pre-compute pow(26, k) where 0 < k < S.length() modulus prime
    for (int i = 1 ; i < len; i++) {
        power[i] = (power[i - 1] * 26) % prime;
    }
    // Use binary search to find the largest possible length of a duplicate substring
    int left = 0, right = len;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        string curr = rabin_karp(mid, S);
        if (curr.length() == 0) {
            right = mid - 1;
        } 
        else {
            if (curr.length() > res.length()) {
                res = curr;
            }
            else {
                left = mid + 1;
            }
        }
    }
    return res;
}

Rabin-Karp Helper Function

string rabin_karp(int reqLen, string &str) {
    // If the required length is 0, return the empty string
    if (reqLen == 0) {
        return "";
    }
    unordered_map<int, vector<int>> hash = unordered_map<int, vector<int>>();
    long long current = 0;
		// Compute the hash value of the first "length" characters
    for (int i = 0 ; i < reqLen; i++) {
        current = ((current * 26) % prime + (str[i] - 'a')) % prime;
    }
    // Store the result in a hashmap that maps from the hash value to starting index
    hash[current] = vector<int>(1, 0);
    for (int i = reqLen ; i < str.length(); i++) {
        // Use a sliding window to maintain the hash value of the current substring
        current = ((current - (long long) power[reqLen - 1] * (str[i - reqLen] - 'a')) % prime + prime) % prime;
        current = (current * 26 + (str[i] - 'a')) % prime;
        // If the hash value does not exist already, add it to the map
        if (hash.find(current) == hash.end()) {
            hash[current] = vector<int>(1, i - reqLen + 1);
        } 
        else {
		    // Compare each character to check for a match
		        for (auto itr : hash[current]) {
		            if (strcmp((str.substr(itr, reqLen)).data(), str.substr(i - reqLen + 1, reqLen).data()) == 0) {
                    return str.substr(itr, reqLen);
                }
            }
            // If no match is found, add the hash value to the map
            hash[current].push_back(i - reqLen + 1);
        }
    }
    return "";
}

PreviousH-Index II NextPermutation Sequence

Last updated 4 years ago

Was this helpful?