Introduction to Tries

HOLIDAYCOLLAB2017

BLOG 2

Hi Guys! After weeks of being unwell, I am finally back to doing what I love. Blogging here, of course! But, this time again I am not the author. This is a continuation of my holiday collab 2017 blogs. In the last blog, Roksi introduced us to Binary trees.

Author of this blog is : Mayak Patel  (pic below)

(Find him on Instagram : @you._know.who)

mayank.jpg

Without further ado, let’s dive into Mayank’s introduction to Tries write-up!


Introduction

Hello code-heads!Welcome back to the blog. As the title goes, this blog is about a data structure called “Trie”. A Trie data structure turns out to be handy in many situations, some of which will be discussed as you read further. There are many Algorithms as well as data structures for searching particular words in a given text. There are some of these which are included in the standard libraries, but there are many which are NOT. Trie is a good example of one that is NOT included in standard libraries of any language.

OKAY, BUT WHY DO WE NEED TRIES ?

Let’s dive right in to the terminologies of a Trie data structure. Suppose you have a Word and and a Dictionary. A dictionary can be thought of a collection of many words. Now, if we want to know, if a single word is present in the dictionary or not, Tries are helpful. The problem we are addressing here can be solved in many ways, like using HashSets, Hash Tables etc. Now the question that comes up is,

Why do we need to learn about and use a completely new data structure (Trie), when there exist multiple solutions that we already know of?

Here’s why :

  1. The time taken to insert a new word to the dictionary in O(length), where length refers to the length of the new word. This time is a lot better than that of a Set, and a a little bit better than that of a hash table.
  2. Sets and the hash tables can only find in a dictionary words that match exactly with the single word that we are finding; but tries allow us to find words that have a single character different, a prefix in common, a character missing, etc. Interesting right ?

BUT, ARE THEY EVEN USED IN REALITY ?

The answer is YES ! Below are some of the applications of the Trie data structure in the Industry:

  1. A web browser : When you start typing a url, say google.com, the browser can auto complete your text, or can show possibilities of text that you could be writing. This can be done using a Trie very fast !
  2. Word search based on a prefix : Suppose, we want to query our dictionary for all the words starting with the prefix “Ma”, it can be very easily done with the help of tries.
  3. Do you know how an orthographic corrector can check that every word that you type is in a dictionary? Again a trie.
  4. Suggested corrections of the words that are present in the text but not in the dictionary.

What is a TRIE ?

So, what exactly is a Trie?

Now, that we have had a good introduction about tries answering how they are used, how are they better than sets and hash tables in some scenarios etc, maybe you would want to get into the details of  what tries are, starting with why tries have this name. The word ‘Trie’ is an infix of the word “retrieval”, because Trie can find a single word in a dictionary with only a prefix of the word. The main idea behind the Trie data structure is the following :

  1. Trie is a tree where each vertex represents a single word or a prefix.
  2. The root represents an empty string (“”), the vertices that are direct sons of the root represent prefixes of length 1, the vertices that are 2 edges of distance from the root represent prefixes of length 2, the vertices that are 3 edges of distance from the root represent prefixes of length 3 and so on. In other words, a vertex that are k edges of distance of the root have an associated prefix of length k.
  3. The next figure shows a trie with the words “tree”, “trie”, “algo”, “all”, and “also.”
trie-01.jpg
Fig 1. TRIE representation example

Representing a TRIE

The representation of a Trie is very similar to that of a Tree upto an extent. Just like trees, tries are also collections of nodes.

So, what sets Tries apart from Trees?

The answer is the Structure of the nodes!

In a tree the nodes generally contain data and the links to the children nodes. But, each Trie node contains a list of characters, each of which can have multiple children. For implementing a Trie, we need to code a structure with some fields that indicate the values stored at each vertex. As we want to know the number of words that match with a given string, every vertex should have a field to indicate that this vertex represents a complete word or only a prefix (for simplicity, a complete word is considered also a prefix) and how many words in the dictionary are represented by that prefix (there can be repeated words in the dictionary). This task can be done with only one integer field word.

Because we want to know the number of words that have prefix like a given string, we need another integer field prefixes that indicates how many words have the prefix of the vertex. Also, each vertex must have references to all his possible sons (26 references). Knowing all these details, our structure should have the following members:

structure Trie
   integer word;
   integer prefixes;
   reference edges[26];

Basic operations on a Trie

Now that we know learned about how a Trie structure can be implemented, let us have a look at the operations we perform on a trie.

There are 3 of such basic functions:

  1. Adding a new word to the dictionary: This function will accept a string argument(word) and add that word to the dictionary.

    trie02.png
    Fig 2. addWord operation
  2. Counting the number of words starting with a given prefix: This function will accept a String prefix and return an Integer denoting the number of words in the dictionary that start with the given prefix.
  3. Counting the number of words exactly matching with a given string: This function will accept a String word and return an Integer denoting the number of words in the dictionary that match exactly with the word.

    trie03.png
    Fig 3. countWords and countPrefixes operations

COMPLEXITY ANALYSIS

Trie is an efficient information retrieval data structure. The complexity for searching can be brought down to optimal limit using tries. Using Trie, we can search a key in O(M) time, where M is the maximum key length.

But this optimal search complexity comes with a penalty. The penalty is on Trie storage requirements. The storage requirements are high because every node in a Trie consists of multiple branches (26), and so on.


Conclusion

Hope this blog has given you an idea about what a Trie is and why do we even need a Trie. By now, you must also have an idea about how a Trie is implemented and what are the various operations that can be performed. Listed next are the links to some of the practice problems that might help you master the application of tries and hence enable you to use them in solving real-world problems. Have fun! That’s all for now.


FURTHER REFERENCES

  1. Interactive application to learn about Trie
  2. Top coder notes on Trie
  3. Solve problems on Trie
  4. one more Trie problem

Let us know what you think. Was this helpful?

Don’t forget to comment below and encourage us, if it was helpful to you. 🙂


Roksi talks Binary Trees!

HOLIDAYCOLLAB2017

BLOG 1

This is the first blog in my #holidaycollab series.

Author of this blog is Roksi Freeman (pic below).

(Find her on Instagram: @byproxyroksi)

IMG_0061.JPG

It’s the season of trees! What better time to discuss the tree data structures than now, right? ;-). So, here’s Roksi bringing to us the geeky joy of this festive season with her blog write-up below on Binary Search Trees! Hope you guys enjoy reading as well as learning!


bst_pic.png
Pic1. xkcd comic about the essence of BST and holiday spirit

“Not only is that terrible in general, but you KNOW Billy’s going to open the root present first, and then everyone is going to have to wait while the heap is rebuilt.” credit: xkcd.com

Hello and happy holidays! What a great time to talk about the most festive data structure there is, trees! Today the festive trees of choice are binary search trees (BST), also called sorted binary search trees. Binary search trees are a data structure representing a boolean function, which is useful in computer science, or just for fun.


Background

Data trees originally came from the basic idea of Shannon expansion, in which a function is split into two sub-functions by assigning a variable. Binary search trees showed up in 1960, invented by computer scientists Windley, Booth, Colin, Douglass & Hibbard. At that time they were invented, binary search trees were used for sorting, searching and file maintenance. The invention of binary search trees was a collaborative effort over the course of around three years between 1959 and 1962, and now they’re an important data structure in computer science useful for search, insertion, and deletion in enormous sets of data. Other than the published papers, there isn’t that much out there about how binary search trees were invented, so I’ve cited some of the early work at the end of this article for those who want to get into the weeds with BSTs.


Binary Search Tree

A binary search tree is a rooted, sorted binary tree. General binary trees consist of a root node at the top, followed by “children” nodes in the following layers. Terminal nodes at the bottom layer are called leaves. With a binary search tree, the values of the right subtree of each node must be greater than the root, the left must be lesser than the root. While binary search trees can have at most two children, a node with no children can still be called a binary tree. The layers of a tree are zero indexed, starting with layer 0, followed by layer 1 and 2 and so on, making the maximum nodes you can have at the level i, 2^i. A binary search tree has multiple states; proper, balanced, complete and perfect. A totally useless binary search tree can have only one root node and be improper. A binary tree can also only have one child and be improper, but it must be left-leaning.

Proper Balanced Complete Perfect
Every node has 0 or 2 children Sub-trees of any node has a height difference no greater than 1 Every layer full, and as left as possible Every layer is full, including the leaves

Usage

Binary trees are useful if you want to store a modifiable (and potentially tremendous) collection of data in a computer memory, and be able to search it quickly as well as efficiently insert and remove elements. For comparison, consider arrays and linked lists. You may have been coding and noticed that in a for loop if you want to add data to an array, you either have to copy an array and rebuild it to add on, or you can create an empty array with x spaces to insert elements, instead of building one cell at a time. Efficiency-wise, it’s better to create a large empty array than to copy and rebuild the same one over and over again. The reason to use a binary search tree is efficiency, just like most searches are faster with a sorted array than a default unsorted array. The next logical step is to explain a binary search tree as a linked list, which is also a BST, but not a perfect one or even a really good one.

Analysis of various operations

Comparision with other data structures such as arrays and linked lists.

The table below lists the various complexities for different operations of data structures:

Array Linked list Sorted array BST
Searching O(n) O(n) O(log n) O(log n)
Insertion O(1) O(1) O(n) O(log n)
Deletion O(n) O(n) O(n) O(log n)

 

p1.png
Fig1. Linked list or unbalanced Binary search tree

While a linked list is technically a binary tree, it’s not efficient to search. To minimize the time to search a tree it should be dense, each node having two children, so you want it as close to perfect as possible because a perfect binary tree stores the nodes at a minimum height, which allows for minimized traversal time. Height is key here. Binary search trees are measured by height and density, so the above list has a height of 7. With a linked list it takes up to O(n) time to search, in this case 7, whereas if you had a denser tree, the height could be as low as 3 with for a tree with 7 nodes.

p2.png
Fig 2. Balanced Binary search tree

The dense tree has a search time from O(log n) (Fig 2.), since like a binary search each traversal discards half the search space, to at worst O(n)(Fig 1.), which is much preferable if you don’t want a search to take ages. A binary search tree should be balanced to limit the amount you have to traverse because you have to traverse the height of the tree. In intro to computer science when the professor talks about binary searches, what makes a binary search tree more efficient is that the left child is always lesser than the root and the right child is always greater, so when searching you know that you don’t have to go down the right subtree if your value is smaller than the root. So in a binary search tree, whatever selection you make at the time is guaranteed to be the efficient path to your search. Unlike a greedy algorithm that always chooses the best option at the time, but may not be the most efficient choice long term.

A binary search tree may not be able to clean your room, but when it comes to search algorithms they are pretty spiffy. Donald Knuth, a seminal voice in computer science and a wiz on binary trees, says in his 17th Annual Christmas Tree Lecture, that trees are the most interesting thing to come out of computer science in the last 25 years.

Binary trees of all varieties are also useful for doodling in math class as Vi Hart could tell you. Or, in the spirit of the holidays, say Santa wants to look up child #345,575,423 to pull their naughty/nice records. An array containing hundreds of millions of records might take hundreds of seconds to search, so instead Santa could implement a data structure like a binary search tree and he wouldn’t have to waste so much time going over his list (twice)


Caveats

Binary search trees are awesome and infinitely useful, but not flawless. It’s less efficient if you don’t have a perfect and balanced tree. The shape of the tree also depends on the order of insertion and can be degenerated.

In a binary search tree you can search quite easily, and although you can make deletions easily you have to rebuild the tree at the point after the deletion, so, in the xkcd comic above, opening the root present first is a deletion, so the whole tree has to be rebuilt. This wouldn’t be seen as a problem for many younger siblings the world over, but thankfully most kid siblings don’t know about data structures and binary search trees. There are self-balancing trees which help avoid this issue, but that’s more a question of implementation.


In Conclusion

Hopefully, the awesome people reading about data structures over the holidays have enjoyed making the festivities a little bit more technical. I for one, have had a major breakthrough with the comparison of data trees and holiday decorations, and I’m sure my relatives will also be rolling their eyes at me with whatever math inspired holiday themes to come up next year. Cheers!


Further References

Here are some references if you want to get deeper into the weeds with data structures and binary search trees:
[1] MIT OCW’s materials for Introduction to Algorithms
[2] Nick Parlante on Binary Search Trees
[3] Hibbard, Thomas N. “Some combinatorial properties of certain trees with applications to searching and sorting.” Journal of the ACM (JACM) 9.1 (1962): 13-28.
[4] Douglas, Alexander S. “Techniques for the recording of, and reference to data in a computer.” The Computer Journal 2.1 (1959): 1-9.
[5] Windley, Peter F. “Trees, forests and rearranging.” The Computer Journal 3.2 (1960): 84-88.


What did you think of this blog? Let us know at decodergirlblog@gmail.com

Happy Holidays from both of us!

Thank you Roksi! You are awesome! 🙂


 

#Holidaycollab


What is Holiday collab?

Holidays is the season of giving. Keeping this spirit in mind, I decided to take the social media game beyond just sharing and liking pictures!

So, this Holiday season of 2017, I have tried to collaborate with folks I connected with on Instagram – to gift you all some exciting geeky content here. The holiday blogs section shall encompass this and much more.

Join in to explore and read some amazing blogs written by me and various enthusiastic folks pertaining to data structures (isn’t it the perfect time to talk about trees? ;)), getting involved in STEM efforts, introduction to cloud computing and IoT and much more.


Big shout out to all the folks who spent a part of their holidays creating this for us. Thank you, guys! 🙂

Here is the holiday collab squad :


IMG_0061Roksi Freeman : Roksi has studied cognitive neuroscience at Hampshire, and is currently work at an open source education non-profit run out of MIT (OCW), where she is responsible for their neuroscience, linguistics and Bio-engineering departments. You can check out some other past and current projects, or her cv. She eventually plans to go to graduate school to study more neuroscience.


mayankMayak Patel: Currently a student in Computer Science. He is also a Machine learning enthusiast. He loves solving algorithmic problems and has developed a knack for the same. He looks forward to learning and growing in the field of machine learning and Artificial Intelligence.

 


 

Welcome 2018, the geeky way! 🙂

HAPPY NEW YEAR!

Cheers!

geekgirl and the #Holidaycollab squad! 🙂