Introduction to Tries

HOLIDAYCOLLAB2017

BLOG 2

Hi Guys! After weeks of being unwell, I am finally back to doing what I love. Blogging here, of course! But, this time again I am not the author. This is a continuation of my holiday collab 2017 blogs. In the last blog, Roksi introduced us to Binary trees.

Author of this blog is : Mayak Patel  (pic below)

(Find him on Instagram : @you._know.who)

mayank.jpg

Without further ado, let’s dive into Mayank’s introduction to Tries write-up!


Introduction

Hello code-heads!Welcome back to the blog. As the title goes, this blog is about a data structure called “Trie”. A Trie data structure turns out to be handy in many situations, some of which will be discussed as you read further. There are many Algorithms as well as data structures for searching particular words in a given text. There are some of these which are included in the standard libraries, but there are many which are NOT. Trie is a good example of one that is NOT included in standard libraries of any language.

OKAY, BUT WHY DO WE NEED TRIES ?

Let’s dive right in to the terminologies of a Trie data structure. Suppose you have a Word and and a Dictionary. A dictionary can be thought of a collection of many words. Now, if we want to know, if a single word is present in the dictionary or not, Tries are helpful. The problem we are addressing here can be solved in many ways, like using HashSets, Hash Tables etc. Now the question that comes up is,

Why do we need to learn about and use a completely new data structure (Trie), when there exist multiple solutions that we already know of?

Here’s why :

  1. The time taken to insert a new word to the dictionary in O(length), where length refers to the length of the new word. This time is a lot better than that of a Set, and a a little bit better than that of a hash table.
  2. Sets and the hash tables can only find in a dictionary words that match exactly with the single word that we are finding; but tries allow us to find words that have a single character different, a prefix in common, a character missing, etc. Interesting right ?

BUT, ARE THEY EVEN USED IN REALITY ?

The answer is YES ! Below are some of the applications of the Trie data structure in the Industry:

  1. A web browser : When you start typing a url, say google.com, the browser can auto complete your text, or can show possibilities of text that you could be writing. This can be done using a Trie very fast !
  2. Word search based on a prefix : Suppose, we want to query our dictionary for all the words starting with the prefix “Ma”, it can be very easily done with the help of tries.
  3. Do you know how an orthographic corrector can check that every word that you type is in a dictionary? Again a trie.
  4. Suggested corrections of the words that are present in the text but not in the dictionary.

What is a TRIE ?

So, what exactly is a Trie?

Now, that we have had a good introduction about tries answering how they are used, how are they better than sets and hash tables in some scenarios etc, maybe you would want to get into the details of  what tries are, starting with why tries have this name. The word ‘Trie’ is an infix of the word “retrieval”, because Trie can find a single word in a dictionary with only a prefix of the word. The main idea behind the Trie data structure is the following :

  1. Trie is a tree where each vertex represents a single word or a prefix.
  2. The root represents an empty string (“”), the vertices that are direct sons of the root represent prefixes of length 1, the vertices that are 2 edges of distance from the root represent prefixes of length 2, the vertices that are 3 edges of distance from the root represent prefixes of length 3 and so on. In other words, a vertex that are k edges of distance of the root have an associated prefix of length k.
  3. The next figure shows a trie with the words “tree”, “trie”, “algo”, “all”, and “also.”
trie-01.jpg
Fig 1. TRIE representation example

Representing a TRIE

The representation of a Trie is very similar to that of a Tree upto an extent. Just like trees, tries are also collections of nodes.

So, what sets Tries apart from Trees?

The answer is the Structure of the nodes!

In a tree the nodes generally contain data and the links to the children nodes. But, each Trie node contains a list of characters, each of which can have multiple children. For implementing a Trie, we need to code a structure with some fields that indicate the values stored at each vertex. As we want to know the number of words that match with a given string, every vertex should have a field to indicate that this vertex represents a complete word or only a prefix (for simplicity, a complete word is considered also a prefix) and how many words in the dictionary are represented by that prefix (there can be repeated words in the dictionary). This task can be done with only one integer field word.

Because we want to know the number of words that have prefix like a given string, we need another integer field prefixes that indicates how many words have the prefix of the vertex. Also, each vertex must have references to all his possible sons (26 references). Knowing all these details, our structure should have the following members:

structure Trie
   integer word;
   integer prefixes;
   reference edges[26];

Basic operations on a Trie

Now that we know learned about how a Trie structure can be implemented, let us have a look at the operations we perform on a trie.

There are 3 of such basic functions:

  1. Adding a new word to the dictionary: This function will accept a string argument(word) and add that word to the dictionary.

    trie02.png
    Fig 2. addWord operation
  2. Counting the number of words starting with a given prefix: This function will accept a String prefix and return an Integer denoting the number of words in the dictionary that start with the given prefix.
  3. Counting the number of words exactly matching with a given string: This function will accept a String word and return an Integer denoting the number of words in the dictionary that match exactly with the word.

    trie03.png
    Fig 3. countWords and countPrefixes operations

COMPLEXITY ANALYSIS

Trie is an efficient information retrieval data structure. The complexity for searching can be brought down to optimal limit using tries. Using Trie, we can search a key in O(M) time, where M is the maximum key length.

But this optimal search complexity comes with a penalty. The penalty is on Trie storage requirements. The storage requirements are high because every node in a Trie consists of multiple branches (26), and so on.


Conclusion

Hope this blog has given you an idea about what a Trie is and why do we even need a Trie. By now, you must also have an idea about how a Trie is implemented and what are the various operations that can be performed. Listed next are the links to some of the practice problems that might help you master the application of tries and hence enable you to use them in solving real-world problems. Have fun! That’s all for now.


FURTHER REFERENCES

  1. Interactive application to learn about Trie
  2. Top coder notes on Trie
  3. Solve problems on Trie
  4. one more Trie problem

Let us know what you think. Was this helpful?

Don’t forget to comment below and encourage us, if it was helpful to you. 🙂


Roksi talks Binary Trees!

HOLIDAYCOLLAB2017

BLOG 1

This is the first blog in my #holidaycollab series.

Author of this blog is Roksi Freeman (pic below).

(Find her on Instagram: @byproxyroksi)

IMG_0061.JPG

It’s the season of trees! What better time to discuss the tree data structures than now, right? ;-). So, here’s Roksi bringing to us the geeky joy of this festive season with her blog write-up below on Binary Search Trees! Hope you guys enjoy reading as well as learning!


bst_pic.png
Pic1. xkcd comic about the essence of BST and holiday spirit

“Not only is that terrible in general, but you KNOW Billy’s going to open the root present first, and then everyone is going to have to wait while the heap is rebuilt.” credit: xkcd.com

Hello and happy holidays! What a great time to talk about the most festive data structure there is, trees! Today the festive trees of choice are binary search trees (BST), also called sorted binary search trees. Binary search trees are a data structure representing a boolean function, which is useful in computer science, or just for fun.


Background

Data trees originally came from the basic idea of Shannon expansion, in which a function is split into two sub-functions by assigning a variable. Binary search trees showed up in 1960, invented by computer scientists Windley, Booth, Colin, Douglass & Hibbard. At that time they were invented, binary search trees were used for sorting, searching and file maintenance. The invention of binary search trees was a collaborative effort over the course of around three years between 1959 and 1962, and now they’re an important data structure in computer science useful for search, insertion, and deletion in enormous sets of data. Other than the published papers, there isn’t that much out there about how binary search trees were invented, so I’ve cited some of the early work at the end of this article for those who want to get into the weeds with BSTs.


Binary Search Tree

A binary search tree is a rooted, sorted binary tree. General binary trees consist of a root node at the top, followed by “children” nodes in the following layers. Terminal nodes at the bottom layer are called leaves. With a binary search tree, the values of the right subtree of each node must be greater than the root, the left must be lesser than the root. While binary search trees can have at most two children, a node with no children can still be called a binary tree. The layers of a tree are zero indexed, starting with layer 0, followed by layer 1 and 2 and so on, making the maximum nodes you can have at the level i, 2^i. A binary search tree has multiple states; proper, balanced, complete and perfect. A totally useless binary search tree can have only one root node and be improper. A binary tree can also only have one child and be improper, but it must be left-leaning.

Proper Balanced Complete Perfect
Every node has 0 or 2 children Sub-trees of any node has a height difference no greater than 1 Every layer full, and as left as possible Every layer is full, including the leaves

Usage

Binary trees are useful if you want to store a modifiable (and potentially tremendous) collection of data in a computer memory, and be able to search it quickly as well as efficiently insert and remove elements. For comparison, consider arrays and linked lists. You may have been coding and noticed that in a for loop if you want to add data to an array, you either have to copy an array and rebuild it to add on, or you can create an empty array with x spaces to insert elements, instead of building one cell at a time. Efficiency-wise, it’s better to create a large empty array than to copy and rebuild the same one over and over again. The reason to use a binary search tree is efficiency, just like most searches are faster with a sorted array than a default unsorted array. The next logical step is to explain a binary search tree as a linked list, which is also a BST, but not a perfect one or even a really good one.

Analysis of various operations

Comparision with other data structures such as arrays and linked lists.

The table below lists the various complexities for different operations of data structures:

Array Linked list Sorted array BST
Searching O(n) O(n) O(log n) O(log n)
Insertion O(1) O(1) O(n) O(log n)
Deletion O(n) O(n) O(n) O(log n)

 

p1.png
Fig1. Linked list or unbalanced Binary search tree

While a linked list is technically a binary tree, it’s not efficient to search. To minimize the time to search a tree it should be dense, each node having two children, so you want it as close to perfect as possible because a perfect binary tree stores the nodes at a minimum height, which allows for minimized traversal time. Height is key here. Binary search trees are measured by height and density, so the above list has a height of 7. With a linked list it takes up to O(n) time to search, in this case 7, whereas if you had a denser tree, the height could be as low as 3 with for a tree with 7 nodes.

p2.png
Fig 2. Balanced Binary search tree

The dense tree has a search time from O(log n) (Fig 2.), since like a binary search each traversal discards half the search space, to at worst O(n)(Fig 1.), which is much preferable if you don’t want a search to take ages. A binary search tree should be balanced to limit the amount you have to traverse because you have to traverse the height of the tree. In intro to computer science when the professor talks about binary searches, what makes a binary search tree more efficient is that the left child is always lesser than the root and the right child is always greater, so when searching you know that you don’t have to go down the right subtree if your value is smaller than the root. So in a binary search tree, whatever selection you make at the time is guaranteed to be the efficient path to your search. Unlike a greedy algorithm that always chooses the best option at the time, but may not be the most efficient choice long term.

A binary search tree may not be able to clean your room, but when it comes to search algorithms they are pretty spiffy. Donald Knuth, a seminal voice in computer science and a wiz on binary trees, says in his 17th Annual Christmas Tree Lecture, that trees are the most interesting thing to come out of computer science in the last 25 years.

Binary trees of all varieties are also useful for doodling in math class as Vi Hart could tell you. Or, in the spirit of the holidays, say Santa wants to look up child #345,575,423 to pull their naughty/nice records. An array containing hundreds of millions of records might take hundreds of seconds to search, so instead Santa could implement a data structure like a binary search tree and he wouldn’t have to waste so much time going over his list (twice)


Caveats

Binary search trees are awesome and infinitely useful, but not flawless. It’s less efficient if you don’t have a perfect and balanced tree. The shape of the tree also depends on the order of insertion and can be degenerated.

In a binary search tree you can search quite easily, and although you can make deletions easily you have to rebuild the tree at the point after the deletion, so, in the xkcd comic above, opening the root present first is a deletion, so the whole tree has to be rebuilt. This wouldn’t be seen as a problem for many younger siblings the world over, but thankfully most kid siblings don’t know about data structures and binary search trees. There are self-balancing trees which help avoid this issue, but that’s more a question of implementation.


In Conclusion

Hopefully, the awesome people reading about data structures over the holidays have enjoyed making the festivities a little bit more technical. I for one, have had a major breakthrough with the comparison of data trees and holiday decorations, and I’m sure my relatives will also be rolling their eyes at me with whatever math inspired holiday themes to come up next year. Cheers!


Further References

Here are some references if you want to get deeper into the weeds with data structures and binary search trees:
[1] MIT OCW’s materials for Introduction to Algorithms
[2] Nick Parlante on Binary Search Trees
[3] Hibbard, Thomas N. “Some combinatorial properties of certain trees with applications to searching and sorting.” Journal of the ACM (JACM) 9.1 (1962): 13-28.
[4] Douglas, Alexander S. “Techniques for the recording of, and reference to data in a computer.” The Computer Journal 2.1 (1959): 1-9.
[5] Windley, Peter F. “Trees, forests and rearranging.” The Computer Journal 3.2 (1960): 84-88.


What did you think of this blog? Let us know at decodergirlblog@gmail.com

Happy Holidays from both of us!

Thank you Roksi! You are awesome! 🙂


 

#Holidaycollab


What is Holiday collab?

Holidays is the season of giving. Keeping this spirit in mind, I decided to take the social media game beyond just sharing and liking pictures!

So, this Holiday season of 2017, I have tried to collaborate with folks I connected with on Instagram – to gift you all some exciting geeky content here. The holiday blogs section shall encompass this and much more.

Join in to explore and read some amazing blogs written by me and various enthusiastic folks pertaining to data structures (isn’t it the perfect time to talk about trees? ;)), getting involved in STEM efforts, introduction to cloud computing and IoT and much more.


Big shout out to all the folks who spent a part of their holidays creating this for us. Thank you, guys! 🙂

Here is the holiday collab squad :


IMG_0061Roksi Freeman : Roksi has studied cognitive neuroscience at Hampshire, and is currently work at an open source education non-profit run out of MIT (OCW), where she is responsible for their neuroscience, linguistics and Bio-engineering departments. You can check out some other past and current projects, or her cv. She eventually plans to go to graduate school to study more neuroscience.


mayankMayak Patel: Currently a student in Computer Science. He is also a Machine learning enthusiast. He loves solving algorithmic problems and has developed a knack for the same. He looks forward to learning and growing in the field of machine learning and Artificial Intelligence.

 


 

Welcome 2018, the geeky way! 🙂

HAPPY NEW YEAR!

Cheers!

geekgirl and the #Holidaycollab squad! 🙂


 

Writing an effective Résumé

Hello World! Hope it’s going well for you guys. As the blog title reads, this write-up is about some simple ways to make an effective resume. Personally, I have been on both the sides- the applicant as well as the interviewer. So, I can say that I have a fair idea of what would make a good resume :). The aim of this blog is to share what I think may be helpful to you guys. So, without further ado let’s start decoding easy ways to make a better resume.


I. Use a good editing tool

Let’s start with the basics ! This is the first thing that made my resume look and feel 100 times better.  Earlier I would use Microsoft word for my editing. And, psst, I sucked at it! I do not know the shortcuts to getting the right boundaries, page offsets and also getting alignments I want. (Sure, judge me all you want! :D). Then I started using online editors (such as overleaf) that allow me to build resume using Latex.

What better way than to “code” your resume ;). This trick has been so useful. Sure, it needs a little more effort than word. But, the outcome is worth the effort. It is easy to use them, they allow for anytime-anywhere access, since they store your profile online.

There are many templates available. I would say – do check them out~ .I have added my template picks in case you guys decide to give it a try (at the end of this blog). Let’s now focus on what should be the content, in the following two tricks.


II. Focus on your strengths and interests

This tricks needs your answer to two important questions. Firstly, what you are confident about and secondly how is it relevant to what you want out of this job.

One way to finding answer for these is, to write down what is it that you enjoy and are confident about along with how this can be aligned with the job requirements (the job that you are applying for).

For example – it can be programming in Python or Java or may be you have a good understanding of machine learning algorithms or you love to design new algorithms. Whatever it is, just write down on a paper first.

Then, begin to build your resume around this topic of interest. What is your relevant experience or how have you prepared for a career in it (do you have hobby projects, coursework etc), try and list them. Keep the focus around this through your resume.

Why? Because, the content on your resume to some extent dictates what you may be questioned about. So, when you focus on your strengths while adding content, it automatically translates to confidence in handling the questions thrown at you, during the course of your interview.


III. Avoid this !

I understand the urge to stand out, with your resume. I also understand the pressure of a job search. But, I cannot stress here enough that you should put things in your resume that is truly your work. It’s not too difficult for a person in the industry to identify when you blatantly lie or go overboard on exaggerating. Do not write about things that you have not done or do not understand completely. And, when you do put up impressive numbers or things you have done, you should be able to explain as to how you achieved that. Else, this simply adds unnecessary pressure when you are at the interview, in case you are asked about it.

For example, if you want to write about how you improved the reliability of a software by 15%, then you should be ready to justify how you achieved that number. Just throwing some metrics in the resume is not going to take you a long way. So, focus on what you have done and be prepared to answer how you achieved those numbers as well.


Few more good points I can think of :

  • Put your educational qualification (if experienced, add the highest level degree)
  • Keep your sentences short and effective (long sentences make the reader lose interest)
  • Put other work experience or projects that are not directly relevant to the job you are seeking in a separate section (For example – “Additional Experience“)
  • If you are sharing your resume online, you could add clickable links to your website or project pages, conference journals etc, throughout your resume
  • You could also mention about interpersonal skills (good team player etc). But, also be prepared to prove it or provide examples if needed

And finally proof read, also get it reviewed before using it !

Good luck !


Templates

Have selected two templates that you could start with, as listed below :

Template 1  Recent College graduate resume

  • contains all the information needed – fits into one page and is also not cluttered
  • The sections seem sufficient and the content is crisp

Template 2 Experienced candidate resume

  • clearly lists all the prior experience in a concise format

When you open the template with an overleaf account, it opens up the display to edit the contents. Left view has the Latex code that can be edited and the right side shows the output resume document, as in the snapshot below.

overleaf_snapshot

And, that’s all I had for this tidbit. Hope these simple tricks helped you guys make a better resume. As always let me know what you feel. You can reach me at decodergirlblog@gmail.com.

Note : This is not sponsored. Overleaf was useful to me and hence I am sharing info about it.

 

Linked Lists

Data structures Series  – Blog 02


Linked Lists

Welcome back! Let’s continue with the data structures theme. Here’s the blog on the next data structure – that would solve the problem we had with arrays. Essentially, these two problems :

a. Arrays had fixed size , that needed to be specified when we define them. Resizing is an issue. So, developer might allocate a worst case array size, that may not be efficient if the number of elements to be held varies.

b. When we want to rearrange data in an array or insert data in between or front of an array, it  gets expensive.

Linked lists are the data structures which will help us in addressing these issues. Arrays and linked lists both essentially provide a way to store data for a client. Although linked lists to be discussed today, are good for cases arrays may not be great, they do come with their own limitations. We will cover it all in this blog, keep on reading.


Basic Stuff

If you have ever wondered why people often quiz you about linked lists? Wonder no more. It’s  because these are the most basic data structures that form the foundation for understanding more advanced data structures ( such as: binary trees, queues) etc.

Let’s suppose you need to write some basic C program to read all words (strings) contained in an input file. The first question that we would have is – How would you store the words read by the program? If we used a string array here, what would be the size to be allocated for the array? The number of words in the file is not known.

In this case, if we used a linked list, we can read words without wasting memory or be running out of memory (as would be the case, if we used arrays). This is because, as the name implies, linked lists are nothing but a list of elements linked together using a pointer parameter, in addition to the data that needs to be held – words in this case. So, the items to be inserted are just linked to the list, dynamically as and when required.

Now that we have a use case in mind, let’s learn about the internals of linked lists, operations supported and some overheads associated with their usage.


Internals

Linked list data structure are similar to arrays, but use an extra parameter – a pointer to link the items together. With this additional overhead of pointer per entry of data to be held, we can overcome the disadvantages of arrays and use memory efficiently. Linked lists use a different strategy for allocating memory for elements as compared to arrays. They allocate memory per element, when a new element needs to be inserted, not as a block as in case of arrays.

As shown below, a set of integer elements are organized in a linked list as shown on the right. We can note that there is an extra address field, that links the items together in the list. Values {2,3,5,7,8} are the data elements. We can clear;y see there is an overhead per element, to hold the pointer to next data element, in case of linked lists. But, even with this, we may be more memory efficient than arrays, when we perform operations such as always insert to the top of the list or storing elements in memory when the number of elements is not known apriori.

arr_ll
Fig 1. Arrays and linked lists elements in memory

We define a linked list element/node and a create a linked list, as below :

/* define node element for a Singly linked list*/
typedef struct Node{
int data;
Node* next;
}Node;

/* function to create and return linked list node */
Node* create_node(int data){
Node* n = (Node*) malloc(sizeof(Node));
n->data = data; // store the data
n->next = NULL; // initialize next address to NULL
return n;
}

/* Example code to create a linked list of nodes 
*  Creates a linked list with three data elements/nodes 
*  Linked list : n1->n2->n3->NULL */ 

Node* create_test_list(){

Node* n1 = create_node(10);
Node* n2 = create_node(20);
Node* n3 = create_node(30);

// update pointers
n1->next = n2; // n1 holds pointer to n2
n2->next = n3; //n2  holds pointer to n3

// n3 is the last node of this list and 
// next shall be stored as NULL to imply this.

}

Operations supported

We can implement various operations on linked lists such as insert(), delete(), lookup(), hasloop(), removeloop()  and so on. It would be a good idea to write code for these operations, to understand linked lists better. Check [1] and [2] under references below to get started.


Doubly Linked List

These are similar to singly linked lists discussed above. But, instead of holding pointer only to the next element, there is an additional pointer per node to point to the previous element as well. Why do we need them? Wait for next week’s blog 🙂 ll1.png

/* Singly Linked list node */
typedef struct Node{
int data;
Node* next;
} Node;

/* Doubly Linked list node */
typedef struct d_Node{
int data;
Node* next ; 
Node* previous; // extra pointer in doubly linked list
} d_Node;


Drawbacks

What do you think would be the drawback of a linked list, other than the storage of extra parameter – i.e., a pointer per element. Yes! It is the insert at the nth location and also lookup nth element operations. In arrays, we had an O(1) look up and also O(1) updates.

But, with linked lists, we have to iterate all the n-1 nodes, to find the place to insert nth node to the list. Also, if we want to look-up nth node, we need to traverse n-1 nodes, before we get the value of nth node- resulting in O(n) cost.

Visual is always better to understand. Please check the picture below.

ll1
Fig 2. Insert at nth position, operation on arrays and linked list

 


Further Reading

  1. Linked list problems (Stanford)
  2. Solve some problems related to linked lists

Next, we shall explore the Binary Search Tree data structure which is realized using a doubly linked list discussed here. As always, write to me at decodergirlblog@gmail.com if you have any queries or suggestions. I would love to hear from you. Until then, keep decoding and enjoying your coffee ! 😉

 

Did you know?

More coffee you donate, higher is your coder karma! Oh! there shall also be more blogs published here. Thank you for being awesome! 😀

$1.00