# Longest Duplicate Substring

## Problem Statement

Given a string `S`, consider all *duplicated substrings*: (contiguous) substrings of S that occur 2 or more times. (The occurrences may overlap.)

Return **any** duplicated substring that has the longest possible length.  (If `S` does not have a duplicated substring, the answer is `""`.)

{% hint style="info" %}
Note:

`2 <= S.length <= 10^5`

`S` consists of lowercase English letters.
{% endhint %}

## Testcases

```
Example 1:

Input: "banana"
Output: "ana"

Example 2:

Input: "abcd"
Output: ""
```

## Problem Explanation

In a particular string, there are several substrings which can be formed using the given arrangement of letters. The problem specifies that the substrings are *contiguous*. In this context, this means that the *letters of each substring occur in a contiguous manner*, rather than substrings themselves occurring in a contiguous manner. These substrings may overlap with other substrings and they may also occur multiple times at any place within the original string.&#x20;

For instance, in the given example, the substring `"ana"` occurs twice within the original string, `"banana"`. There is no other substring with a greater length which occurs more than two times within the string.&#x20;

{% code title="Length = 3" %}

```
"banana"
 "ana"
   "ana"
```

{% endcode %}

In another example, we can see that the substring `"xyz"` occurs twice within the original string, `"abcxyzdefxyz"`. The letters in the substring are contiguous but the substrings themselves are not required to be sequential in occurrence.&#x20;

{% code title="Length = 3" %}

```
"abcxyzdefxyz"
   "xyz"
         "xyz"
```

{% endcode %}

We must find one such substring which is of the *greatest possible length* and also occurs a *minimum of twice* within the original string.

## Prerequisite Knowledge

### Substring

A substring is a contiguous sequence of characters that can be formed from a string, where the order of the characters matter. This is different from a subsequence, where the characters can be found anywhere in the string but must retain relative sequential order.&#x20;

In order for a sequence of characters to be considered a substring of a string, all the letters of the substring must be found in the same arrangement in the original string.

#### Example

Let us consider the string `"coding"`.&#x20;

The sequence of elements in this string is as follows:

```
c o d i n g
1 2 3 4 5 6
```

The substrings of this string must have characters that are found in the same consecutive sequential arrangement as in the original string.

**Correct Substring**: `"cod"`

```
c o d
1 2 3 (consecutive sequential order)
```

**Incorrect Substring**: `"cng"`

```
c n g
1 5 6 (relative sequential order)
```

{% hint style="info" %}
Although the characters in the incorrect substring maintain relative sequential order, they are not consecutive to one another as present in the original string.
{% endhint %}

To further understand the difference a substring and a subsequence, refer to the *Prerequisite Knowledge* section in **Is Subsequence**:

{% content-ref url="/pages/-M9SAvK0xLy5SmX2AytF" %}
[Is Subsequence](/june-leetcoding-challenge/week-2/is-subsequence.md)
{% endcontent-ref %}

### Rabin-Karp Algorithm

The Rabin-Karp Algorithm is a pattern-matching algorithm which uses the concept of hashing to efficiently filter the characters that do not match within the initial phase itself, instead of traversing through the entire array of characters, unlike naive string matching algorithms.

#### Naive String Matching

In the naive approach to string matching, during each iteration, the substring that is being searched slides over each individual character to check if the characters at the shifted positions match the characters in the substring.

{% code title="String: "abcabdabe", Substring: "cab"" %}

```
abcabdabe
cab      -> no match
 cab     -> no match
  cab    -> match
```

{% endcode %}

The time complexity of this approach is `O(m*n)` where `m` is the length of the original string and `n` is the length of the string.&#x20;

#### Rabin-Karp String Matching

In the Rabin-Karp approach, we assign an arbitrary weight value to each character in the original string. We must consider three values here.&#x20;

* `m` -> The length of the original string
* `n` -> The length of the substring
* `d` -> Any suitable value, conventionally taken as the number of characters present in the input set (original string).

Using a hash formula, we can calculate the value of a particular substring. The hash value of this substring is compared with that of a substring of the same length from the original string. Only if the hash values match, the character-by-character comparison is performed. Otherwise, the substring slides over to the next subsequent window of the same length as the substring. This is also known as a *rolling hash*.

![Character Weights](/files/-MAbML2wk2fg2bbQYItb)

![Substring](/files/-MAbMSahQW6Wu2LvUBQa)

![Original String Sliding Window Comparison](/files/-MAbM_7uENGnyzouhdLr)

The time complexity of this algorithm is `O(m+n)`, which is significantly better than the time complexity of the naive approach.

#### Sources

For more detailed explanations of how this algorithm works, refer to:

* <https://www.programiz.com/dsa/rabin-karp-algorithm> \[Source of Images]
* <https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/>
* <https://www.youtube.com/watch?v=BQ9E-2umSWc>

## Algorithm

### Rabin-Karp/Binary Search Approach

**Longest Duplicate Substring**

1. Pre-compute the required powers of 26 modulus a prime number for the purpose of calculating with single precision arithmetic.
2. During each iteration, check if the substring with length mid is found in the original string using the rabin\_karp() helper function.
3. Perform binary search to check if there is a longer substring which is duplicated in the original string.

**Rabin-Karp Helper Function**

1. Compute the hash value for the first "length" characters.
2. Store the hash value in a vector.
3. Using a sliding window, matain the hash values of previous substrings to be able to reuse them in subsequent iterations.
4. Compare the hash value of the substring to match with the hash value of the window in the original string.

   &#x20;a. If the hash values match, do a character-by-character comparison to check if the substrings match.

   &#x20;b. If a match does not exist in that window, add the hash value to the map.
5. Return the longest substring that matches, or an empty string if no substring matches.

{% hint style="info" %}
Time Complexity: O(n+m)
{% endhint %}

## Code

{% code title="Global Variables" %}

```cpp
int prime = 19260817;
string res;
vector<int> power;
```

{% endcode %}

{% code title="Longest Duplicate Substring" %}

```cpp
string longestDupSubstring(string S) {
    string res = "";
    int len = S.length();
    power = vector<int>(len, 1);
	  // For the hash calculation, pre-compute pow(26, k) where 0 < k < S.length() modulus prime
    for (int i = 1 ; i < len; i++) {
        power[i] = (power[i - 1] * 26) % prime;
    }
    // Use binary search to find the largest possible length of a duplicate substring
    int left = 0, right = len;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        string curr = rabin_karp(mid, S);
        if (curr.length() == 0) {
            right = mid - 1;
        } 
        else {
            if (curr.length() > res.length()) {
                res = curr;
            }
            else {
                left = mid + 1;
            }
        }
    }
    return res;
}
```

{% endcode %}

{% code title="Rabin-Karp Helper Function" %}

```cpp
string rabin_karp(int reqLen, string &str) {
    // If the required length is 0, return the empty string
    if (reqLen == 0) {
        return "";
    }
    unordered_map<int, vector<int>> hash = unordered_map<int, vector<int>>();
    long long current = 0;
		// Compute the hash value of the first "length" characters
    for (int i = 0 ; i < reqLen; i++) {
        current = ((current * 26) % prime + (str[i] - 'a')) % prime;
    }
    // Store the result in a hashmap that maps from the hash value to starting index
    hash[current] = vector<int>(1, 0);
    for (int i = reqLen ; i < str.length(); i++) {
        // Use a sliding window to maintain the hash value of the current substring
        current = ((current - (long long) power[reqLen - 1] * (str[i - reqLen] - 'a')) % prime + prime) % prime;
        current = (current * 26 + (str[i] - 'a')) % prime;
        // If the hash value does not exist already, add it to the map
        if (hash.find(current) == hash.end()) {
            hash[current] = vector<int>(1, i - reqLen + 1);
        } 
        else {
		    // Compare each character to check for a match
		        for (auto itr : hash[current]) {
		            if (strcmp((str.substr(itr, reqLen)).data(), str.substr(i - reqLen + 1, reqLen).data()) == 0) {
                    return str.substr(itr, reqLen);
                }
            }
            // If no match is found, add the hash value to the map
            hash[current].push_back(i - reqLen + 1);
        }
    }
    return "";
}
```

{% endcode %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://keerthisreeraghu.gitbook.io/june-leetcoding-challenge/week-3/longest-duplicate-substring.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
