To get experience using the Hash Table data structure.
For this assignment, you will write a program that performs textual analysis on an input file to determine which words are used most frequently. Given an input file and a number $N$, your program will print the $N$ most commonly used words in the file, along with how many times each one is used.
The best way to do this is to use a hash table (with Java's HashMap class). Create a table that maps strings (the words) onto integers (how many times the word has appeared).
When you read each word of the input file, check if it is in the hash table already. If it is, look up its value, add one to it, and put it back in. If it's not, insert it with a value of one.
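The counting step described above might be sketched roughly as follows. This is only one possible shape, not the required solution: the class name `WordCounts` and the method name `countWords` are made up for illustration, and the sketch assumes words are simply whitespace-separated tokens read with a `Scanner`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper class; the name is an assumption for this sketch.
class WordCounts {
    // Reads whitespace-separated words from the scanner and returns a table
    // mapping each word to the number of times it appeared.
    static Map<String, Integer> countWords(Scanner in) {
        Map<String, Integer> counts = new HashMap<>();
        while (in.hasNext()) {
            String word = in.next();
            if (counts.containsKey(word)) {
                // Seen before: look up the value, add one, and put it back.
                counts.put(word, counts.get(word) + 1);
            } else {
                // First occurrence: insert with a count of one.
                counts.put(word, 1);
            }
        }
        return counts;
    }
}
```

For instance, calling `countWords(new Scanner("a b a"))` would give a table in which `a` maps to 2 and `b` maps to 1. To read from a file instead, the `Scanner` could be constructed from a `java.io.File`.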
Then you'll need to loop through the table to find the word with the highest count. Keep track of the highest count, and the word it belongs to, as you loop through. At the end of the loop, print that word and its count, and remove the word from the table. Repeat this process $N$ times to get the top $N$ words.
(You don't need to worry about treating different cases as the same word, or about words with punctuation at the end.)
When you are first testing your program, you can use the very short test.txt file. With a value of 2, your program should output the following:
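As a rough sketch of the repeated find-and-remove loop, assuming a `Map<String, Integer>` of word counts has already been built: the class name `TopWords` and the method name `topLines` are hypothetical, and returning the formatted lines as a list (rather than printing directly) is a choice made here only so the result is easy to check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the name is an assumption for this sketch.
class TopWords {
    // Returns up to n lines of the form "#1: word (4 uses)", most frequent first.
    // Each reported word is removed from counts, so the table is modified.
    static List<String> topLines(Map<String, Integer> counts, int n) {
        List<String> lines = new ArrayList<>();
        for (int rank = 1; rank <= n && !counts.isEmpty(); rank++) {
            String bestWord = null;
            int bestCount = 0; // every stored count is at least 1
            // One pass over the table to find the current highest count.
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    bestWord = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            lines.add("#" + rank + ": " + bestWord + " (" + bestCount + " uses)");
            // Remove the winner only after the pass, so the next pass finds the runner-up.
            counts.remove(bestWord);
        }
        return lines;
    }
}
```

Each returned line can then be printed with `System.out.println`. Note that when two words have the same count, the order in which they are reported depends on the iteration order of the `HashMap`, which is not specified.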
Please enter the file name: test.txt
Please enter the number of words to view: 2
#1: this (4 uses)
#2: is (3 uses)
Then you can test your program on shakespeare.txt, which contains the complete works of William Shakespeare (from Project Gutenberg). The punctuation has been removed and all words have been converted to lower-case. With a value of 10, your program should output the following:
Please enter the file name: shakespeare.txt
Please enter the number of words to view: 10
#1: the (27377 uses)
#2: and (26081 uses)
#3: i (20716 uses)
#4: to (19661 uses)
#5: of (17473 uses)
#6: a (14722 uses)
#7: you (13630 uses)
#8: my (12489 uses)
#9: in (10996 uses)
#10: that (10915 uses)
When you are done, please submit the Java code under the assignment in Canvas.
Copyright © 2024 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.