Home CPSC 425

String Matching

Overview

String matching is the problem of searching for a string inside a larger piece of text. Many programs require string matching including text editors and web browsers.

The C library provides the strstr function which performs string matching.

We will write our own solutions to the string matching problem sequentially, and also in parallel.


Program Base

The following base C program reads in a text file, then asks the user what string to search for:


#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

int string_search(char* data, char* target) {
    /* TODO */


}

unsigned long ms_diff(struct timeval a, struct timeval b) {
    unsigned long msa = (a.tv_sec * 1000) + (a.tv_usec / 1000);
    unsigned long msb = (b.tv_sec * 1000) + (b.tv_usec / 1000);
    return msb - msa;
}

int main(int argc, char** argv) {
    FILE* file;
    unsigned int length;
    char* data;
    char target[256];
    int found;
    struct timeval a, b;

    /* get start time */
    gettimeofday(&a, NULL);

    /* open the file */
    if (argc lt; 2) {
        printf("Please pass an input file.\n");
        return 0;
    }
    file = fopen(argv[1], "r");
    if (!file) {
        printf("Could not open %s for reading.\n", argv[1]);
        return 0;
    }

    /* find the length of the file */
    fseek(file, 0L, SEEK_END);
    length = ftell(file);
    fseek(file, 0L, SEEK_SET);

    /* read the file into memory */
    data = malloc(length * sizeof(char) + 1);
    memset(data, 0, length);
    fread(data, sizeof(char), length, file);

    gettimeofday(&b, NULL);

    unsigned long read = ms_diff(a, b);
    printf("Reading took %lu ms.\n", read);

    /* ask what should be searched for */
    printf("Enter search term: ");
    scanf("%s", target);

    gettimeofday(&a, NULL);
    /* now do the searching */
    found = string_search(data, target);
    gettimeofday(&b, NULL);
    unsigned long search = ms_diff(a, b);
    printf("Searching took %lu ms.\n", search);


    if (found) {
        printf("Found it!\n");
    } else {
        printf("Not found...\n");
    }

    return 0;
}

We only need to complete the string_search function which should return 1 if the target is found, and 0 if not. .


Sequential Solution

How can we write a function which will solve this problem sequentially?


int string_search(char* data, char* target, int data_length) {
    int target_length = strlen(target);
    for (int i = 0; i < data_length; i++) {
        for (int c = 0; c < target_length; c++) {
            if (target[c] != data[i + c]) {
                break;
            }

            if (c == target_length - 1) {
                return 1;
            }
        }
    }   

    return 0;
}

Test Files

We can use the following files to test our program:


Applying Amdahl's Law

Amdahl's law tells us how much speedup we can expect, given how much of our program must be executed sequentially.

Note that reading a file the first time can be much slower than reading it again immediately afterwards. This is because the OS caches the data from the hard drive in memory making it much faster to access!


Parallelization

How can we parallelize this program?

Using MPI, we used multiple processes to solve the problem. First, process 0 distributed the search string with MPI_Bcase:


    /* ask what should be searched for */
    if (rank == 0) {
        printf("Enter search term: ");
        fflush(stdout);
        scanf("%s", target);
    }

    /* broadcast the data from process 0 to all others */
    MPI_Bcast(target, 256, MPI_CHAR, 0, MPI_COMM_WORLD);

Then we have each process split up the range of starting points, and perform the search as before, but on the smaller range:


int string_search(char* data, char* target, int data_length, int rank, int size) {
    /* find our starting and ending points */
    int start = rank * (data_length / size);
    int end = (rank + 1) * (data_length / size);
    int target_length = strlen(target);

    /* loop through each possible starting point */
    for (int i = start; i < end; i++) {
        /* loop through whole target string */
        for (int c = 0; c < target_length; c++) {
            /* if this character does not match, break */
            if (target[c] != data[i + c]) {
                break;
            }

            /* if we've gotten through all of target, it matches */
            if (c == target_length - 1) {
                return 1;
            }
        }
    }   

    /* if we get down here, we found no match */
    return 0;
}

Finally, all the processors reduce the results of the search down to one process. This is done with the MPI_LOR, logical or operation. Then process 0, which is the root of the reduction, prints the answer:


    /* reduce the results of all searches into process 0 */
    int result;
    MPI_Reduce(&found, &result, 1, MPI_INT, MPI_LOR, 0, MPI_COMM_WORLD);

    /* print final answer */
    if (rank == 0) {
        if (result) {
            printf("Found it!\n");
        } else {
            printf("Not found...\n");
        }
    }

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.