Monday, September 27, 2010

Stack Overflow: String Replace

I've recently begun exploring Stack Overflow.  Helping to answer people's questions has a considerable appeal to one, such as I, who enjoys teaching.  Sometimes you can find a question that no one has answered, as few know the answer.  Other times, almost any professional programmer can answer the question, and often these are from CS students trying to solve their homework.  On occasion, I may repost interesting questions here.

For my first repost, we have a StringReplace function that needs optimizing.  Independent of the questioner's implementation, let's consider the core algorithm.

StringReplace(OldText, NewText, BaseString)
    for SubString of length OldText in BaseString
        if (OldText == SubString) {Replace text}

This is how we want to replace.  Find each instance of OldText in BaseString and replace.  Given the nature of the question, our implementation will be written in C and not use any libraries (like regex, CRT, etc).

Working backwards, we'll need to return a final string of some indeterminate size.  Then the BaseString, as it is modified, is stored in this final string, in the following form, where each base is a subset of the string:

When the problem is viewed this way, an implementation is suggested, recursive.  The following code expands upon this approach (though avoiding some casting requirements, necessities of sizeof(char), and possibility of unicode support).

#define StringReplace(x, y, z) StringReplaceHelper(x, y, z, 0)
char* StringReplaceHelper(char* OldText, 
                          char* NewText, 
                          char* BaseString,
                          unsigned long space)
{
    char* start = BaseString, ret = NULL;
    unsigned long offset = 0;

    while (*BaseString != '\0')
    {
        if (Match(OldText, BaseString))
        {
             offset = (BaseString - start);
             // The next search will begin after 
             // the replacement text
             ret = StringReplaceHelper(OldText, NewText,
                                 BaseString + strlen(OldText), 
                                 space + offset + strlen(NewText));
             break;
        }

        BaseString++;
    }

    // If the end of string, then this is the last piece
    // Else copy in subset and replacement piece
    if (*BaseString == '\0')
    {
        offset = (BaseString - start);
        ret = (char*) malloc((space + offset));
        memcpy(ret + space, start, offset));
    }
    else
    {
        memcpy(ret + space, start, offset);

        // Don't actually do the following, the processor
        // will have to walk NewText twice
        memcpy(ret + offset, NewText, strlen(NewText));
    }

    return ret;
}

First, the routine Match() is basically a string compare, which can have its own interesting optimizations including partial skip-ahead.  Second, a final version wouldn't use strlen, except perhaps at the start.  The lengths can be passed between iterations.  Finally, using memcpy and malloc aren't cheating as I've written my own implementations in the past, and therefore their omission in the present is just for the sake of brevity.

But this is just what I'd currently write for a string replace.  You can follow the link to Stack Overflow and see the original request (and my suggestion for improvement).  Or perhaps consider your own.

No comments: