Index Home About Blog
From: torek@elf.bsdi.com (Chris Torek)
Newsgroups: comp.lang.c
Subject: Re: strsep question
Date: 8 Apr 1999 14:43:24 -0700

In article <slrn7gpsgc.quh.pds@uberhacker.org> Paul Schmidt
<pds@uberhacker.org> writes:
>How do you tell how many string strsep was able to parse?

I cannot quite figure out what this question *means*.

First, though, a bit of background:  The strsep() function is
something I invented, rather reluctantly, during the conversion of
the 4.3BSD C library to ANSI C.  Various (mostly non-ANSI) bits of
this library have reason to read a line from a text file and break
it up into token.  ISO 9899 provides a strtok() function, taken
from System V, that does a horribly inadequate job of this.  For
instance, suppose you have a C string (in a modifiable array) of
the form:

	"  this    is a\tset of\t\twords\n"

The strtok() function is good for breaking this line into the word
sequence: <this>, <is>, <a>, <set>, <of>, <words>, using code of
the form:

	for (p = strtok(line, " \t\n"); p != NULL; p = strtok(NULL, " \t\n"))
		printf("<%s>\n", p);

But now suppose that the line consists of possibly-empty fields
delimited by vertical bars:

    # format: ID-tag|manufacturer|model|serial number|location
	B5343|Acme|road runner trap|51298LX9|back yard of building 7
	B9159|Ace|bandage||coyote's leg

Note that the second item, the Ace bandage the coyote wears, does not
*have* a serial number.  If these lines are parsed with:

	id = strtok(line, "|");
	mfg = strtok(NULL, "|");
	model = strtok(NULL, "|");
	serialno = strtok(NULL, "|");
	location = strtok(NULL, "|");
	junk = strtok(NULL, "|");
	if (id == NULL || mfg == NULL || model == NULL || serialno == NULL ||
	    location == NULL || junk != NULL) {
		fprintf(stderr, "badly formatted line (%d)", lineno);
		abort();
	}
	printf("The %s %s is at/in/on %s\n", manufacturer, model, location);

it will claim that the Ace bandage line is badly formatted.  Thus,
strtok() is useless for this particular application, where serial
numbers may be omitted.

But suppose we are only parsing whitespace-separated text.  This,
obviously, is the job strtok() is designed for.  So, suppose we
put this in the C library, inside some routine.  Then you, the
applications programmer, have something like this:

	for (p = strtok(line, " \t\n"); p != NULL; p = strtok(NULL, " \t\n")) {
		printf("<%s>\n", p);
		...
		library_function();
		...
	}

Now, it turns out that library_function() sometimes eventually
calls our C-library routine that in turn calls strtok().  What do
you think will happen to <this>, <is>, <a>, <set>, <of>, <words>?
Remember that to pluck out the full set of words, you must NOT pass
strtok() anything other than NULL as its first argument.  But alas!
you called the library_function(), which resulted in an internal
call to strtok() that passed some sort of "buf" or "line" or other
non-NULL parameter, and now strtok() has forgotten about *your*
line.  So now you get:  <this>, <is>, <a>; the last three words
mysteriously vanish.

In a word, the problem with strtok() is that it is not "re-entrant".
It stores some secret internal state.

Now suppose we went to fix the first problem -- that strtok() cannot
handle empty fields -- while ignoring the second, more serious
problem.  The obvious way to do this is to change the definition
from, in effect:

	char *strtok(char *buf, const char *delim) {
		char *p;
		static char *secret_state;

		p = buf ? buf : secret_state;
		if (p == NULL)
			return (NULL);
		/*
		 * Skip (span) initial delimiters.  Find end of non-
		 * delimiter "word" part, and if there is anything after
		 * that make sure the next strtok(NULL) will pick up there.
		 */
		p += strspn(p, delim);
		secret_state = p + strcspn(p, delim);
		if (*secret_state)
			*secret_state++ = 0;
		else				/* System V ... */
			secret_state = NULL;	/* omits this part */
		return (p);
	}

to, in effect:

	char *improved_strtok(char *buf, const char *delim) {
		char *p;
		static char *secret_state;

		p = buf ? buf : secret_state;
		if (p == NULL)
			return (NULL);
		/*
		 * Do *not* skip (span) initial delimiters.  Find end of
		 * non-delimiter "word" part, and if there is anything after
		 * that make sure the next strtok(NULL) will pick up there.
		 */
		secret_state = p + strcspn(p, delim);
		if (*secret_state)
			*secret_state++ = 0;
		else
			secret_state = NULL;
		return (p);
	}

This improved version does everything the old one did, because if
we *want* to ignore empty "words", we can just use:

	for (p = improved_strtok(line, " \t\n"); p != NULL;
	     p = improved_strtok(NULL, " \t\n"))
		if (*p)
			printf("<%s>\n", p);

That is, we can insert our own test for "an empty word" and ignore
it.  Using the unimproved strtok, we cannot insert a test for having
accidentally skipped over an empty word, because the only thing
that remembers *where* (e.g.) the "serial number" field is, is the
"secret_state" variable -- and this is known only to strtok().

Now the only remaining problem is reentrancy.  Again, the obvious
solution is not to keep *secret* state, but rather to require that
the caller provide the state-space variable.  So:

	for (p = third_strtok(line, &state, " \t\n"); p != NULL;
	     p = third_strtok(NULL, &state, " \t\n"))
		if (*p)
			printf("<%s>\n", p);

This third variant can be simplified by requiring that "state" be
set up for the first pass, and removing the line-or-NULL argument:

	state = line;
	for (p = fourth_strtok(" \t\n", &state); p != NULL;
	     p = fourth_strtok(" \t\n", &state))
		if (*p)
			printf("<%s>\n", p);

or more simply:

	state = line;
	while ((p = strsep(" \t\n", &state)) != NULL)
		if (*p)
			printf("<%s>\n", p);

and this is in fact how I came up with strsep().  (The name "strsep"
is a squatter: it occupies the ANSI/ISO namespace without a permit.
This, of course, is the way one gets functions into future versions
of standards.  Alas, this particular attempt failed, so this
"improved strtok" -- which is still not really *much* of an
improvement; it is just a patch for a broken design -- will not
appear in C9X.  I had hoped that people would see it as the obvious
improvement, add it, and redefine strtok() as just:

	char *strtok(char *buf, const char *delim) {
		static char *state;
		if (buf)
			state = buf;
		while ((p = strsep(delim, &state)) != NULL)
			if (*p)
				break;
		return (p);
	}

Instead, we have the ugly function strtok_r() in POSIX; it looks
much like third_strtok() above.  It still spans leading delimiters,
and the parameters are not quite in the above order.)
--
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA	Domain:	torek@bsdi.com	+1 510 234 3167
http://claw.bsdi.com/torek/  (not always up)	I report spam to abuse@.

Index Home About Blog