Prefix Operations and Tries | CS 61B Data Structures, Spring 2019

Data Structures Summary

The problem we are presented: Given a stream of data, retrieve information of interest.

All of the data structures we have discussed so far exist to solve the search problem. How, you might ask? Each of the data structures we’ve learned stores information in a scheme that makes searching efficient in specific scenarios.

Remember that these are Abstract Data Types. This means that we define the behavior, not the implementation.

[Table: the possible implementations from previous chapters for the Abstract Data Types above. Chaining HT = Chaining Hash Table.]

The data structures we have devised in earlier chapters can be used in implementing many different ADTs. You’ll also notice the red-colored implementations (meaning poor performance), which tell us that not all implementations are optimal for the behavior we are trying to achieve.

Abstraction

Abstraction often happens in layers. An Abstract Data Type can often contain two abstract ideas that boil down to one implementation. Let’s consider some examples:

Recall the Priority Queue ADT: we were attempting to find an implementation that would be efficient for PQ operations. We decided that our Priority Queue would be implemented using a Heap Ordered Tree, but as we saw, we had several approaches (1A, 1B, 1C, 2, 3) to representing a tree for heaps.
A similar idea is an External Chaining Hash Table. This data structure is implemented using an array of buckets, but these buckets can be implemented using an ArrayList, Resizing Array, Linked List, or BST.

These two examples tell us that we can often describe an ADT in terms of another ADT, and that Abstract Data Types have layers of abstraction, each defining a behavior that is more specific than the idea that came before it.

Tries

Tries will serve as a new implementation (beyond those we have previously learned) of a Set and a Map, with some special functionality for certain types of data and information.

Inventing the Trie

Here are some key ideas that we will use:

Every node stores only one letter
Nodes can be shared by multiple keys

Therefore, we can insert “sam”, “sad”, “sap”, “same”, “a”, and “awls” into a tree structure that contains single-character nodes. An important observation is that most of these words share prefixes, so we can exploit these similarly structured strings in our structure. In other words, we don’t store the same prefix (e.g. “sa-”) multiple times.
Take a look at the graphic below to see what a trie looks like:

Tries work by storing each ‘character’ of our keys as a node. Keys that share common prefixes share the same nodes. To check if the trie contains a key, walk down the tree from the root along the correct nodes.

Since we are going to share nodes, we must figure out some way to represent which strings belong in our set and which don’t. We will solve this problem by coloring the node for the last character of each string blue. Observe our final strategy below.

Suppose we have inserted strings into our set and ended up with the trie above. We must now figure out how searching works in this scheme. To search, we traverse the trie, comparing each character of the string as we go down. There are only two cases in which we fail to find a string: either the final node is white, or we fall off the tree.

contains(“sam”) : true, blue node
contains(“sa”) : false, white node
contains(“a”) : true, blue node
contains(“saq”) : false, fell off tree
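
The search rule above can be sketched in Java. This is a minimal illustration, not the course’s reference code: the class name SketchTrieSet is hypothetical, a HashMap stands in for the children of each node, and the boolean isKey plays the role of the blue/white color.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; a sketch of the marked-node idea only.
class SketchTrieSet {
    private static class Node {
        boolean isKey; // true = "blue" node: some key ends here
        Map<Character, Node> next = new HashMap<>(); // assumption: HashMap children
    }

    private final Node root = new Node();

    /** Walks down from the root, creating missing nodes, then marks the last node. */
    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.isKey = true;
    }

    /** Returns true only if the walk stays on the tree and ends at a marked node. */
    boolean contains(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.get(key.charAt(i));
            if (n == null) {
                return false; // fell off the tree
            }
        }
        return n.isKey; // false for a "white" node
    }
}
```

After adding “sam”, “sad”, “sap”, “same”, “a”, and “awls”, this sketch gives the same four answers as the list above.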

Summary

A key takeaway is that we can often improve a general-purpose data structure when we add specificity to our problem, often by adding additional constraints. For example, we improved our implementation of HashMap when we restricted the keys to ASCII characters only, creating an extremely efficient data structure.

There is a distinction between ADTs and specific implementations. As an example, Disjoint Sets is an ADT: any Disjoint Sets implementation has the methods connect(x, y) and isConnected(x, y). There are four different ways to implement those methods: Quick Find, Quick Union, Weighted QU, and WQUPC.
The Trie is a specific implementation for Sets and Maps that is specialized for strings.
- We give each node a single character and each node can be a part of several keys inside of the trie.
- Searching will only fail if we hit an unmarked node or we fall off the tree
- A note on naming: “trie” is short for “retrieval tree”. Almost everyone pronounces it “try”, but Edward Fredkin suggested it be pronounced “tree”.

Implementation

Approach 1

We’ll take a first approach with the idea that each node stores a letter, its children, and a color. Since we know each node’s key is a character, we can use the DataIndexedCharMap class we defined earlier to map to all of a node’s children. Remember that each node can have at most as many children as there are possible characters.

    public class TrieSet {
        private static final int R = 128; // ASCII
        private Node root;  // root of trie

        private static class Node {
            private char ch;
            private boolean isKey;
            private DataIndexedCharMap<Node> next;

            private Node(char c, boolean b, int R) {
                ch = c;
                isKey = b;
                next = new DataIndexedCharMap<Node>(R);
            }
        }
    }

    public class DataIndexedCharMap<V> {
        private V[] items;

        public DataIndexedCharMap(int R) {
            items = (V[]) new Object[R];
        }

        public void put(char c, V val) {
            items[c] = val;
        }

        public V get(char c) {
            return items[c];
        }
    }

Improvement

But we can make an important observation: each link corresponds to a character if and only if that character exists. Therefore, we can remove the Node’s character variable and instead infer the character from the node’s position in its parent’s DataIndexedCharMap.

    public class TrieSet {
        private static final int R = 128; // ASCII
        private Node root;  // root of trie

        private static class Node {
            private boolean isKey;
            private DataIndexedCharMap<Node> next;

            private Node(boolean b, int R) {
                isKey = b;
                next = new DataIndexedCharMap<Node>(R);
            }
        }
    }

Problem: Zooming in on a single node with one child, we can observe that its next variable, the DataIndexedCharMap object, will have mostly null links if nodes in our tree have relatively few children. We will have 128 links, with 127 equal to null and 1 being used. This means that we are wasting a lot of space! We will explore alternative representations further on.
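
To make the wasted space concrete, here is a small sketch of the improved design with add and contains filled in. The names ArrayTrieSet, nullCount, and unusedRootLinks are hypothetical helpers added only to count empty links, and the sketch assumes every key contains only ASCII characters (codes 0 to 127).

```java
// Hypothetical sketch; assumes ASCII keys only (chars 0..127).
class DataIndexedCharMap<V> {
    private final V[] items;

    @SuppressWarnings("unchecked")
    DataIndexedCharMap(int R) {
        items = (V[]) new Object[R];
    }

    void put(char c, V val) {
        items[c] = val;
    }

    V get(char c) {
        return items[c];
    }

    /** Hypothetical helper: counts slots that hold no child. */
    int nullCount() {
        int count = 0;
        for (V item : items) {
            if (item == null) {
                count++;
            }
        }
        return count;
    }
}

class ArrayTrieSet {
    private static final int R = 128; // ASCII

    private static class Node {
        boolean isKey;
        DataIndexedCharMap<Node> next = new DataIndexedCharMap<>(R);
    }

    private final Node root = new Node();

    /** Follows (or creates) one link per character, then marks the final node. */
    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (n.next.get(c) == null) {
                n.next.put(c, new Node());
            }
            n = n.next.get(c);
        }
        n.isKey = true;
    }

    boolean contains(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.get(key.charAt(i));
            if (n == null) {
                return false; // fell off the tree
            }
        }
        return n.isKey;
    }

    /** Hypothetical helper: number of null links in the root's child array. */
    int unusedRootLinks() {
        return root.next.nullCount();
    }
}
```

With only the key “a” inserted, the root holds one used link and 127 null ones, which is exactly the situation described above.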

Performance

Given a Trie with N keys, the runtimes for our Map/Set operations are as follows:

add : Θ(1)
contains : Θ(1)

Why is this the case? The runtime is independent of the number of keys in the trie. In the worst case we traverse only the characters of a single key, so the work is proportional to that key’s length and never to N.

However, as we mentioned above, our current design is extremely wasteful, since each node contains an array with a slot for every possible character, even if that character never appears.

Child Tracking

The problem we ran into was the space wasted by using a DataIndexedCharMap object to track each node’s children: we initialize many null slots that never hold a child.

Alternate Idea #1: Hash-table-based Trie. This won’t create an array of 128 slots per node. Instead, each node starts with a hash table of default size and resizes it only when the load factor requires.
Alternate Idea #2: BST-based Trie. Again, this creates child pointers only when necessary, storing each node’s children in a BST. We still have to worry about the runtime of searching in this BST, but this is not a bad approach.
DataIndexedCharMap
- Space: 128 links per node
- Runtime: Θ(1)
BST
- Space: C links per node, where C is the number of children
- Runtime: O(log R), where R is the size of the alphabet
Hash Table
- Space: C links per node, where C is the number of children
- Runtime: O(R), where R is the size of the alphabet

Since R is a fixed number, we can think of all of these runtimes as constant. There is a slight memory and efficiency trade-off (with BST/Hash Tables vs. DataIndexedCharMap). The runtimes for Trie operations are still constant without any caveats. Tries will especially thrive with some special operations.
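
As a rough sketch of the two alternatives, the trie below keeps each node’s children in a java.util.Map and lets the caller choose the map type. The class name MapBackedTrieSet, the factory methods, and linkCount are all hypothetical. HashMap stands in for the hash-table idea and TreeMap (a balanced search tree) for the BST idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical sketch: children live in a Map, so a node only pays for links it uses.
class MapBackedTrieSet {
    private static class Node {
        boolean isKey;
        final Map<Character, Node> next;

        Node(Map<Character, Node> next) {
            this.next = next;
        }
    }

    private final Supplier<Map<Character, Node>> childMaps;
    private final Node root;

    private MapBackedTrieSet(Supplier<Map<Character, Node>> childMaps) {
        this.childMaps = childMaps;
        this.root = new Node(childMaps.get());
    }

    /** Hash-table-based trie: children stored in a HashMap. */
    static MapBackedTrieSet hashBased() {
        return new MapBackedTrieSet(HashMap::new);
    }

    /** BST-based trie: children stored in a TreeMap. */
    static MapBackedTrieSet bstBased() {
        return new MapBackedTrieSet(TreeMap::new);
    }

    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            Node child = n.next.get(c);
            if (child == null) {
                child = new Node(childMaps.get());
                n.next.put(c, child);
            }
            n = child;
        }
        n.isKey = true;
    }

    boolean contains(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.get(key.charAt(i));
            if (n == null) {
                return false;
            }
        }
        return n.isKey;
    }

    /** Hypothetical helper: total number of child links stored in the whole trie. */
    int linkCount() {
        return linkCount(root);
    }

    private static int linkCount(Node n) {
        int total = n.next.size();
        for (Node child : n.next.values()) {
            total += linkCount(child);
        }
        return total;
    }
}
```

For the keys “sam”, “sad”, “sap”, “same”, “a”, and “awls”, either variant stores 10 links in total, whereas the DataIndexedCharMap design would allocate 128 links in each of the 11 nodes (1408 links).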

Trie String Operations

We can see that Tries offer us constant time lookup and insertion, but do they actually perform better than BSTs or Hash Tables? Possibly not. For every string we have to traverse through every character, whereas in BSTs we have access to the entire string immediately. So what are Tries good for then?

Prefix Matching

The main appeal of tries is their ability to efficiently support specific string operations like prefix matching. For example, to implement keysWithPrefix, we can traverse to the end of the prefix and return all keys in the part of the trie below that point.

Challenging Warmup Exercise: Collecting Trie Keys

    collect():
        Create an empty list of results x.
        For character c in root.next.keys():
            Call colHelp(“c”, x, root.next.get(c)).
        Return x.

    colHelp(String s, List<String> x, Node n):
        If n.isKey, x.add(s).
        For character c in n.next.keys():
            Call colHelp(s + c, x, n.next.get(c)).

Challenge: Give an algorithm for keysWithPrefix.
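
One possible answer, offered as a sketch rather than the official solution: walk to the node at the end of the prefix, then run the same colHelp recursion from there, starting with the prefix as the accumulated string. The class name PrefixTrieSet is hypothetical, and the children are kept in a TreeMap only so that results come back in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of collect and keysWithPrefix.
class PrefixTrieSet {
    private static class Node {
        boolean isKey;
        Map<Character, Node> next = new TreeMap<>(); // assumption: sorted children
    }

    private final Node root = new Node();

    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.isKey = true;
    }

    /** Returns every key in the trie. */
    List<String> collect() {
        List<String> x = new ArrayList<>();
        for (Map.Entry<Character, Node> e : root.next.entrySet()) {
            colHelp(String.valueOf(e.getKey()), x, e.getValue());
        }
        return x;
    }

    /** Adds s if n is marked, then recurses into each child with s extended by one character. */
    private void colHelp(String s, List<String> x, Node n) {
        if (n.isKey) {
            x.add(s);
        }
        for (Map.Entry<Character, Node> e : n.next.entrySet()) {
            colHelp(s + e.getKey(), x, e.getValue());
        }
    }

    /** Returns every key that starts with the given prefix. */
    List<String> keysWithPrefix(String prefix) {
        List<String> x = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < prefix.length(); i++) {
            n = n.next.get(prefix.charAt(i));
            if (n == null) {
                return x; // no key has this prefix
            }
        }
        colHelp(prefix, x, n);
        return x;
    }
}
```

The cost of keysWithPrefix in this sketch depends on the length of the prefix plus the size of the subtree below it, not on the total number of keys in the trie.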

Autocomplete

When you type into any search engine, for example Google, it suggests what you are about to type. This is extremely helpful and convenient. Say we were searching for “How are you doing”: if we just type “how are” into Google, we will see that it suggests this exact query.
One way to achieve this is using a Trie! We will build a map from strings to values.

Values will represent how important Google thinks that string is (Probably frequency)
Store billions of strings efficiently since they share nodes, less wasteful duplicates
When a user types a query, we can call the method keysWithPrefix(x) and return the 10 strings with the highest value

One major flaw with this system appears when the user types a short string. The number of keys with that prefix may be in the millions, when in reality we only want 10. A possible solution to this issue is to store, in each node, the best value of any key in that node’s subtree. We can then consider children in the order of their best value.
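
Here is one way the best-value idea might look in code. It is a sketch under several assumptions: the names AutocompleteTrie, Candidate, best, and topK are hypothetical, values are plain ints, and each key is inserted once (overwriting a key with a lower value would leave stale best fields). The search keeps candidates in a priority queue ordered by best value, so it only expands subtrees that could still contain a top result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of best-first autocomplete over a trie map.
class AutocompleteTrie {
    private static class Node {
        Integer value;                // null when no key ends at this node
        int best = Integer.MIN_VALUE; // highest value of any key in this subtree
        Map<Character, Node> next = new HashMap<>();
    }

    /** A queue entry: either a subtree to expand (node != null) or a finished key (node == null). */
    private static class Candidate {
        final int priority;
        final String s;
        final Node node;

        Candidate(int priority, String s, Node node) {
            this.priority = priority;
            this.s = s;
            this.node = node;
        }
    }

    private final Node root = new Node();

    /** Inserts a key and raises the best value on every node along its path. Assumes one put per key. */
    void put(String key, int value) {
        Node n = root;
        n.best = Math.max(n.best, value);
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
            n.best = Math.max(n.best, value);
        }
        n.value = value;
    }

    /** Returns up to k keys with the given prefix, highest value first. */
    List<String> topK(String prefix, int k) {
        List<String> results = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < prefix.length(); i++) {
            n = n.next.get(prefix.charAt(i));
            if (n == null) {
                return results; // no key has this prefix
            }
        }
        PriorityQueue<Candidate> pq =
                new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));
        pq.add(new Candidate(n.best, prefix, n));
        while (!pq.isEmpty() && results.size() < k) {
            Candidate c = pq.poll();
            if (c.node == null) {
                results.add(c.s); // nothing left in the queue can beat this key
                continue;
            }
            if (c.node.value != null) {
                pq.add(new Candidate(c.node.value, c.s, null));
            }
            for (Map.Entry<Character, Node> e : c.node.next.entrySet()) {
                pq.add(new Candidate(e.getValue().best, c.s + e.getKey(), e.getValue()));
            }
        }
        return results;
    }
}
```

Because a subtree’s priority is an upper bound on every key inside it, a key that reaches the front of the queue is guaranteed to be at least as good as anything not yet explored. Keys with equal values may come back in either order.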

Another optimization is to merge nodes that are redundant. This would give us a “radix trie”, which holds characters as well as strings in each node. We won’t discuss this in depth.

Summary

Knowing the types of data that you are storing can give you great power in creating efficient data structures. Specifically for implementing Maps and Sets, if we know that all keys will be Strings, we can use a Trie:

Tries theoretically have better performances for searching and insertion than hash tables or balanced search trees
There are multiple ways to store the children of every node of the trie, specifically three. All three are fine, but the hash table is the most natural
- DataIndexedCharMap (Con: excessive use of space, Pro: speed efficient)
- Bushy BST (Con: slower child search, Pro: space efficient)
- Hash Table (Con: higher cost per link, Pro: space efficient)
Tries may not actually be faster in practice, but they support special string operations that other implementations don’t
- longestPrefixOf and keysWithPrefix are easily implemented since the trie is stored character by character
- keysWithPrefix allows for algorithms like autocomplete to exist, which can be optimized through use of a priority queue.
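
To illustrate why longestPrefixOf falls out of the character-by-character layout, here is a short sketch. It assumes the usual meaning of the operation (the longest stored key that is a prefix of the query string), and the class name LongestPrefixTrie and the choice to return null when nothing matches are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of longestPrefixOf.
class LongestPrefixTrie {
    private static class Node {
        boolean isKey;
        Map<Character, Node> next = new HashMap<>();
    }

    private final Node root = new Node();

    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.isKey = true;
    }

    /** Returns the longest stored key that is a prefix of query, or null if there is none. */
    String longestPrefixOf(String query) {
        Node n = root;
        int longest = root.isKey ? 0 : -1; // length of the longest matching key so far
        for (int i = 0; i < query.length(); i++) {
            n = n.next.get(query.charAt(i));
            if (n == null) {
                break; // fell off the tree; keep the best match seen so far
            }
            if (n.isKey) {
                longest = i + 1;
            }
        }
        return longest < 0 ? null : query.substring(0, longest);
    }
}
```

A single walk down the trie is enough: every marked node passed along the way is a stored key that prefixes the query, so the last one seen is the longest.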