Data Structures Summary

The problem we are presented with: given a stream of data, retrieve information of interest.

All of the data structures we have discussed so far exist to solve the search problem. How, you might ask? Each of the data structures we’ve learned stores information in a scheme that makes searching efficient in specific scenarios.

Remember that these are Abstract Data Types. This means that we define the behavior, not the implementation.

[Figure: the possible implementations from previous chapters for the Abstract Data Types above. Chaining HT: chaining hash table.]

The data structures we have devised in earlier chapters can be used to implement many different ADTs. In the figure, the implementations colored red perform poorly, telling us that not every implementation is optimal for the behavior we are trying to achieve.

Abstraction

Abstraction often happens in layers. Abstract Data Types can often contain two abstract ideas boiling down to one implementation. Let’s consider some examples:

  • Recall the Priority Queue ADT: we were looking for an implementation that would be efficient for PQ operations. We decided to implement our Priority Queue using a Heap Ordered Tree, but as we saw, there were several approaches (1A, 1B, 1C, 2, 3) to representing a tree for heaps.
  • A similar idea is an External Chaining Hash Table. This data structure is implemented using an array of buckets, but these buckets can be implemented using either an ArrayList, Resizing Array, Linked List, or BST.

These two examples tell us that we can often describe an ADT in terms of another ADT, and that Abstract Data Types have layers of abstraction, each defining a behavior that is more specific than the idea that came before it.

Tries

Tries will serve as a new implementation (compared with what we have previously learned) of a Set and a Map, with some special functionality for certain types of data and information.

Inventing the Trie

Here are some key ideas that we will use:

  • Every node stores only one letter
  • Nodes can be shared by multiple keys

Therefore, we can insert “sam”, “sad”, “sap”, “same”, “a”, and “awls” into a tree structure that contains single-character nodes. An important observation is that most of these words share prefixes, so our structure can take advantage of the shared parts. In other words, we don’t store the same prefix (e.g. “sa-”) multiple times.
Take a look at the graphic below to see what such a trie looks like:


Tries work by storing each ‘character’ of our keys as a node. Keys that share common prefixes share the same nodes. To check if the trie contains a key, walk down the tree from the root along the correct nodes.

Since we are going to share nodes, we must figure out some way to represent which strings belong in our set and which don’t. We will solve this problem by coloring the node for the last character of each string blue. Observe our final strategy below.

Suppose we have inserted strings into our set and ended up with the trie above. We must now figure out how searching will work in this scheme. To search, we traverse the trie, comparing against each character of the string as we go down. Thus, there are only two cases in which we fail to find a string: either the final node is white, or we fall off the tree.

  • contains(“sam”) : true, blue node
  • contains(“sa”) : false, white node
  • contains(“a”) : true, blue node
  • contains(“saq”) : false, fell off tree

Summary

A key takeaway is that we can often improve a general-purpose data structure when we add specificity to our problem, often by adding additional constraints. For example, we improved our implementation of HashMap when we restricted the keys to only be ASCII characters, creating an extremely efficient data structure.

  • There is a distinction between ADTs and specific implementations. As an example, Disjoint Sets is an ADT: any Disjoint Sets has the methods connect(x, y) and isConnected(x, y). There are four different ways to implement those methods: Quick Find, Quick Union, Weighted QU, and WQUPC.
  • The Trie is a specific implementation for Sets and Maps that is specialized for strings.
    • We give each node a single character and each node can be a part of several keys inside of the trie.
    • Searching will only fail if we hit an unmarked node or we fall off the tree
    • A note on naming: “trie” is short for Retrieval tree. Almost everyone pronounces it “try”, but Edward Fredkin suggested it be pronounced “tree”.
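To make the ADT versus implementation distinction above a little more concrete, here is a minimal sketch in Java. The method names connect and isConnected come from the notes; the interface name DisjointSets, the class name QuickFindDS, and the choice of integer items 0 to n-1 are assumptions made for illustration. Only Quick Find is shown, and the other three implementations would satisfy the same interface.

```java
// Sketch under assumptions: items are the integers 0..n-1, and the names
// DisjointSets and QuickFindDS are illustrative, not taken from the notes.

/** The ADT: it states the behavior only. */
interface DisjointSets {
    void connect(int x, int y);
    boolean isConnected(int x, int y);
}

/** One possible implementation (Quick Find): id[i] is the set number of item i. */
class QuickFindDS implements DisjointSets {
    private final int[] id;

    QuickFindDS(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) {
            id[i] = i; // every item starts in its own set
        }
    }

    @Override
    public void connect(int x, int y) {
        int xid = id[x];
        int yid = id[y];
        if (xid == yid) {
            return; // already in the same set
        }
        // Relabel every member of x's set with y's set number.
        for (int i = 0; i < id.length; i++) {
            if (id[i] == xid) {
                id[i] = yid;
            }
        }
    }

    @Override
    public boolean isConnected(int x, int y) {
        return id[x] == id[y];
    }
}
```

Code written against DisjointSets does not need to change if QuickFindDS is later swapped for a faster implementation.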

Implementation

Approach 1

We’ll take a first approach with the idea that each node stores a letter, its children, and a color. Since we know each node key is a character, we can use the DataIndexedCharMap class we defined earlier to map to all of a node’s children. Remember that each node can have at most as many children as there are possible characters.

public class TrieSet {
    private static final int R = 128; // ASCII
    private Node root;  // root of trie

    private static class Node {
        private char ch;
        private boolean isKey;
        private DataIndexedCharMap<Node> next;

        private Node(char c, boolean b, int R) {
            ch = c;
            isKey = b;
            next = new DataIndexedCharMap<Node>(R);
        }
    }
}

public class DataIndexedCharMap<V> {
    private V[] items;

    public DataIndexedCharMap(int R) {
        items = (V[]) new Object[R];
    }

    public void put(char c, V val) {
        items[c] = val;
    }

    public V get(char c) {
        return items[c];
    }
}

Improvement

But we can make an important observation: each link corresponds to a character if and only if that character exists. Therefore, we can remove the Node’s character variable and instead infer the character from the node’s position in its parent’s DataIndexedCharMap.

public class TrieSet {
    private static final int R = 128; // ASCII
    private Node root;  // root of trie

    private static class Node {
        private boolean isKey;
        private DataIndexedCharMap<Node> next;

        private Node(boolean b, int R) {
            isKey = b;
            next = new DataIndexedCharMap<Node>(R);
        }
    }
}
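The notes show only the node layout, so here is one way add and contains might look on top of it. This is a sketch, not the course's code: the class name ArrayTrieSet is hypothetical, a plain Node[] stands in for DataIndexedCharMap to keep the block self-contained, and keys are assumed to be ASCII.

```java
// Sketch under assumptions: ASCII keys only; Node[] replaces DataIndexedCharMap;
// the class name and both methods are illustrative additions.
class ArrayTrieSet {
    private static final int R = 128; // ASCII

    private static class Node {
        boolean isKey;              // true if some key ends at this node ("blue")
        Node[] next = new Node[R];  // one slot per possible character
    }

    private final Node root = new Node();

    /** Inserts key, creating a node for each character not already present. */
    public void add(String key) {
        Node cur = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c >= R) {
                throw new IllegalArgumentException("non-ASCII character: " + c);
            }
            if (cur.next[c] == null) {
                cur.next[c] = new Node();
            }
            cur = cur.next[c];
        }
        cur.isKey = true; // mark the last character's node
    }

    /** True only if the walk never falls off the tree and ends on a marked node. */
    public boolean contains(String key) {
        Node cur = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c >= R || cur.next[c] == null) {
                return false; // fell off the tree
            }
            cur = cur.next[c];
        }
        return cur.isKey; // false here means we stopped on an unmarked node
    }
}
```

After adding “sam”, “sad”, “sap”, “same”, “a”, and “awls”, this sketch reproduces the four search cases listed earlier: contains(“sam”) and contains(“a”) are true, while contains(“sa”) and contains(“saq”) are false.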

Problem: Zooming in on a single node with one child we can observe that its next variable, the DataIndexedCharMap object, will have mostly null links if nodes in our tree have relatively few children. We will have 128 links with 127 equal to null and 1 being used. This means that we are wasting a lot of excess space! We will explore alternative representations further on.

Performance

Given a Trie with N keys, the runtimes for our Map/Set operations are as follows:

  • add : Θ(1)
  • contains : Θ(1)

Why is this the case? It doesn’t matter how many items we have in our trie; the runtime is independent of the number of keys. In the worst case we traverse only the length of one key, which is unrelated to the number of keys in the trie. Put differently, the work grows with the length of the key being added or searched for, not with N.

However, as we mentioned above, our current design is extremely wasteful: each node contains an array with a slot for every possible character, even if no key uses that character.

Child Tracking

The problem we were running into was the space wasted by using a DataIndexedCharMap object to track each node’s children: every node initializes many null slots that never hold a child.

  • Alternate Idea #1: Hash-Table based Trie. This won’t create an array of 128 spots per node. Instead, each node’s hash table starts at a default size and resizes only when its load factor requires it.

  • Alternate Idea #2: BST based Trie. Again this will only create children pointers when necessary, and we will store the children in the BST. Obviously, we still have the worry of the runtime for searching in this BST, but this is not a bad approach.

  • DataIndexedCharMap
    • Space: 128 links per node
    • Runtime: Θ(1)
  • BST
    • Space: C links per node, where C is the number of children
    • Runtime: O(log R), where R is the size of the alphabet
  • Hash Table
    • Space: C links per node, where C is the number of children
    • Runtime: O(R), where R is the size of the alphabet

There is a slight memory and efficiency trade-off between BSTs/Hash Tables and DataIndexedCharMap. Because the alphabet size R is a fixed number, the runtimes for Trie operations are still constant with respect to the number of keys. Tries especially thrive with some special operations.
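As a rough illustration of the child-tracking alternatives above, the sketch below keeps each node's children in a java.util.Map, so a node only pays for the links it actually has. The class name MapTrieSet and the linkCount helper are hypothetical; HashMap gives the hash-table variant, and swapping in TreeMap (a balanced search tree in the standard library) would give a BST-style variant.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch under assumptions: names are illustrative; HashMap stands in for the
// "hash table" idea, and new TreeMap<>() could be used for the "BST" idea.
class MapTrieSet {
    private static class Node {
        boolean isKey;
        Map<Character, Node> next = new HashMap<>(); // only existing children are stored
    }

    private final Node root = new Node();

    public void add(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            // Create the child link only when it is first needed.
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.isKey = true;
    }

    public boolean contains(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            cur = cur.next.get(c);
            if (cur == null) {
                return false; // fell off the tree
            }
        }
        return cur.isKey;
    }

    /** Total number of child links stored: the sum of C over all nodes. */
    public int linkCount() {
        return linkCount(root);
    }

    private static int linkCount(Node n) {
        int total = n.next.size();
        for (Node child : n.next.values()) {
            total += linkCount(child);
        }
        return total;
    }
}
```

For the six example keys (“sam”, “sad”, “sap”, “same”, “a”, “awls”), this sketch stores 10 links across 11 nodes, whereas the array-based layout would reserve 11 × 128 = 1408 slots. Each stored link costs more than an array slot, which is the trade-off mentioned above.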

Trie String Operations

We can see that Tries offer us constant time lookup and insertion, but do they actually perform better than BSTs or Hash Tables? Possibly not. For every string we have to traverse through every character, whereas in BSTs we have access to the entire string immediately. So what are Tries good for then?

Prefix Matching

The main appeal of tries is the ability to efficiently support specific string operations like prefix matching. For example, to implement keysWithPrefix, we can traverse to the end of the prefix and return all keys in the part of the trie below that point.

Challenging Warmup Exercise: Collecting Trie Keys

collect():
    Create an empty list of results x.
    For character c in root.next.keys():
        Call colHelp(“c”, x, root.next.get(c)).
    Return x.

colHelp(String s, List<String> x, Node n):
    If n.isKey, then x.add(s).
    For character c in n.next.keys():
        Call colHelp(s + c, x, n.next.get(c)).

Challenge: Give an algorithm for keysWithPrefix.
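One possible answer to the challenge, written as a Java sketch rather than as the official solution: walk down to the node at the end of the prefix, then reuse colHelp from there with the prefix as the starting string. The class name PrefixTrieSet is hypothetical, and a TreeMap is used for the children only so that results come back in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch under assumptions: class name is illustrative; TreeMap children are a
// choice made here so that keys are collected in alphabetical order.
class PrefixTrieSet {
    private static class Node {
        boolean isKey;
        Map<Character, Node> next = new TreeMap<>();
    }

    private final Node root = new Node();

    public void add(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.isKey = true;
    }

    /** Returns every key in the trie, following the collect() pseudocode. */
    public List<String> collect() {
        List<String> x = new ArrayList<>();
        for (Map.Entry<Character, Node> e : root.next.entrySet()) {
            colHelp(String.valueOf(e.getKey()), x, e.getValue());
        }
        return x;
    }

    /** Returns every key that starts with prefix (including prefix itself if it is a key). */
    public List<String> keysWithPrefix(String prefix) {
        List<String> x = new ArrayList<>();
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.next.get(c);
            if (n == null) {
                return x; // no key has this prefix
            }
        }
        colHelp(prefix, x, n);
        return x;
    }

    private void colHelp(String s, List<String> x, Node n) {
        if (n.isKey) {
            x.add(s);
        }
        for (Map.Entry<Character, Node> e : n.next.entrySet()) {
            colHelp(s + e.getKey(), x, e.getValue());
        }
    }
}
```

With the example keys from earlier, keysWithPrefix(“sa”) on this sketch yields “sad”, “sam”, “same”, and “sap”, and the work done depends on the size of the subtree below the prefix rather than on the whole trie.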

Autocomplete

When you type into a search engine such as Google, it suggests what you are about to type. This is extremely helpful and convenient. Say we were searching for “How are you doing”. If we just type “how are” into Google, we will see that it suggests this exact query.
One way to achieve this is using a Trie! We will build a map from strings to values.

  • Values will represent how important Google thinks that string is (Probably frequency)
  • Store billions of strings efficiently since they share nodes, less wasteful duplicates
  • When a user types a query, we can call the method keysWithPrefix(x) and return the 10 strings with the highest value

One major flaw with this system shows up when the user types a short string. You can imagine that the number of keys with that prefix is in the millions, when in reality we only want 10. A possible solution is to store in each node the best value of any key in the subtree below it. We can then consider children in order of their best value.
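The “best value per node” idea can be sketched as a best-first search driven by a priority queue. Everything named here is an assumption for illustration: the class AutocompleteTrie, the methods put and topK, and integer values standing in for importance. The sketch also assumes each key is inserted once, since it does not repair best values when a key's value is lowered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch under assumptions: names and int weights are illustrative, and each
// key is assumed to be inserted only once.
class AutocompleteTrie {
    private static class Node {
        boolean isKey;
        int value;                     // value of the key ending here, if isKey
        int best = Integer.MIN_VALUE;  // best value of any key in this subtree
        Map<Character, Node> next = new HashMap<>();
    }

    /** A frontier item: either an unexplored subtree or a finished key. */
    private static class Item {
        final Node node;
        final String text;
        final int priority;
        final boolean isResult;

        Item(Node node, String text, int priority, boolean isResult) {
            this.node = node;
            this.text = text;
            this.priority = priority;
            this.isResult = isResult;
        }
    }

    private final Node root = new Node();

    public void put(String key, int value) {
        Node cur = root;
        cur.best = Math.max(cur.best, value);
        for (char c : key.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
            cur.best = Math.max(cur.best, value); // keep subtree maxima up to date
        }
        cur.isKey = true;
        cur.value = value;
    }

    /** Returns up to k keys with the given prefix, highest value first. */
    public List<String> topK(String prefix, int k) {
        List<String> results = new ArrayList<>();
        Node start = root;
        for (char c : prefix.toCharArray()) {
            start = start.next.get(c);
            if (start == null) {
                return results;
            }
        }
        PriorityQueue<Item> pq =
                new PriorityQueue<>((a, b) -> Integer.compare(b.priority, a.priority));
        pq.add(new Item(start, prefix, start.best, false));
        while (!pq.isEmpty() && results.size() < k) {
            Item it = pq.poll();
            if (it.isResult) {
                results.add(it.text); // nothing left in the queue can beat this key
                continue;
            }
            if (it.node.isKey) {
                pq.add(new Item(it.node, it.text, it.node.value, true));
            }
            for (Map.Entry<Character, Node> child : it.node.next.entrySet()) {
                Node n = child.getValue();
                pq.add(new Item(n, it.text + child.getKey(), n.best, false));
            }
        }
        return results;
    }
}
```

Because a subtree's best value is an upper bound on every key inside it, the search can stop as soon as k keys have been popped, without visiting the millions of low-value keys that share a short prefix.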

Another optimization is to merge nodes that are redundant. This would give us a “radix trie”, which holds characters as well as strings in each node. We won’t discuss this in depth.

Summary

Knowing the types of data that you are storing can give you great power in creating efficient data structures. Specifically for implementing Maps and Sets, if we know that all keys will be Strings, we can use a Trie:

  • Tries theoretically have better performance for searching and insertion than hash tables or balanced search trees
  • There are multiple ways to store the children of every node of the trie, specifically three. These three are all fine, but hash table is the most natural
    • DataIndexedCharMap (Con: excessive use of space, Pro: speed efficient)
    • Bushy BST (Con: slower child search, Pro: space efficient)
    • Hash Table (Con: higher cost per link, Pro: space efficient)
  • Tries may not actually be faster in practice, but they support special string operations that other implementations don’t
    • longestPrefixOf and keysWithPrefix are easily implemented since the trie is stored character by character
    • keysWithPrefix allows for algorithms like autocomplete to exist, which can be optimized through use of a priority queue.
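The summary mentions longestPrefixOf without showing it, so here is a hedged sketch of how it might work: walk the trie along the query string and remember the last marked node seen. The class name LongestPrefixTrie and the choice to return null when no key matches are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch under assumptions: class name and the null-on-no-match convention
// are illustrative choices, not taken from the notes.
class LongestPrefixTrie {
    private static class Node {
        boolean isKey;
        Map<Character, Node> next = new HashMap<>();
    }

    private final Node root = new Node();

    public void add(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.isKey = true;
    }

    /** Returns the longest key that is a prefix of query, or null if there is none. */
    public String longestPrefixOf(String query) {
        Node cur = root;
        int longest = root.isKey ? 0 : -1; // length of the best match so far
        for (int i = 0; i < query.length(); i++) {
            cur = cur.next.get(query.charAt(i));
            if (cur == null) {
                break; // fell off the tree; keep the best match found so far
            }
            if (cur.isKey) {
                longest = i + 1;
            }
        }
        return longest < 0 ? null : query.substring(0, longest);
    }
}
```

With the example keys from earlier, longestPrefixOf(“sample”) on this sketch returns “sam”, and longestPrefixOf(“sa”) returns null because no key ends at or before that node.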

Prefix Operations and Tries | CS 61B Data Structures, Spring 2019
