
Lab 14: Hash Tables

 

Objective

To get experience using the Hash Table data structure.


 

Task

For this lab, you will write a program that performs textual analysis on an input file to determine which words are used most frequently. Given an input file and a number $N$, your program will print the $N$ most commonly used words in the file, along with how many times each one is used.

The best way to do this is to use a Hash Table (with Java's HashMap class). Create a table that maps strings (the words) onto integers (how many times the word has appeared).

As you read each word of the input file, check whether it is in the table already. If it is, look up its value, add one to it, and put it back in. If it is not, insert it with a value of one.
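The counting step described above can be sketched as follows. This is only one possible shape for it, not the required solution: the class name `WordCountSketch` and the method `countWords` are made up for illustration, and the sketch scans a short in-memory string rather than a file so that it runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCountSketch {
    // Hypothetical helper: counts how many times each
    // whitespace-separated word appears in the input.
    static Map<String, Integer> countWords(Scanner input) {
        Map<String, Integer> counts = new HashMap<>();
        while (input.hasNext()) {
            String word = input.next();
            if (counts.containsKey(word)) {
                // Already in the table: look up, add one, put back.
                counts.put(word, counts.get(word) + 1);
            } else {
                // First appearance: insert with a value of one.
                counts.put(word, 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumption: a small string stands in for the input file here.
        Map<String, Integer> counts =
            countWords(new Scanner("this is a test this is"));
        System.out.println(counts.get("this")); // prints 2
    }
}
```

To read from a real file instead, the same method should work with a `Scanner` constructed from a `java.io.File`, which requires handling `FileNotFoundException`.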

Then you'll need to loop through the table to find the word with the highest count, keeping track of the best word and its count as you go. At the end of the loop, print that word and its count, then remove the word from the table. Repeat this process $N$ times to get the top $N$ words.
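A minimal sketch of this find-the-maximum-and-remove loop is shown below. The names `TopWordsSketch` and `topWords` are hypothetical, as are the counts placed in the table in `main`. The sketch assumes every count is at least one, and it returns the formatted lines rather than printing them directly, which is just one way to organize it. When two words are tied, which one comes first depends on the iteration order of the `HashMap`, which is unspecified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopWordsSketch {
    // Hypothetical helper: returns up to n lines describing the most
    // frequent words. It removes those words from counts as it goes.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<String> lines = new ArrayList<>();
        for (int rank = 1; rank <= n && !counts.isEmpty(); rank++) {
            String bestWord = null;
            int bestCount = 0; // assumes every stored count is >= 1
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    bestWord = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            lines.add("#" + rank + ": " + bestWord + " (" + bestCount + " uses)");
            // Remove the winner so the next pass finds the runner-up.
            counts.remove(bestWord);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical counts, for illustration only.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("this", 4);
        counts.put("is", 3);
        counts.put("a", 1);
        for (String line : topWords(counts, 2)) {
            System.out.println(line);
        }
        // prints:
        // #1: this (4 uses)
        // #2: is (3 uses)
    }
}
```

Each pass scans the whole table, so finding the top $N$ words this way takes roughly $N$ full passes. That is fine for small $N$; sorting the entries once would be an alternative, but it is not the approach the lab describes.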

(You don't need to worry about treating different capitalizations as the same word, or about words with punctuation at the end.)


 

Example Run

When you are first testing your program, you can use the very short test.txt file, which should produce the following output when $N$ is 2:

Please enter the file name: test.txt
Please enter the number of words to view: 2
#1: this (4 uses)
#2: is (3 uses)

Then you can test your program on shakespeare.txt, which contains the complete works of William Shakespeare (from Project Gutenberg). It should produce the following output when $N$ is 10:

Please enter the file name: ../shakespeare.txt
Please enter the number of words to view: 10
#1: the (23197 uses)
#2: I (19540 uses)
#3: and (18263 uses)
#4: to (15592 uses)
#5: of (15507 uses)
#6: a (12516 uses)
#7: my (10824 uses)
#8: in (9565 uses)
#9: you (9059 uses)
#10: is (7831 uses)

 

Submitting

When you are done, please submit the Java code under the assignment in Canvas.

Copyright © 2022 Ian Finlayson | Licensed under an Attribution-NonCommercial 4.0 International License.