r – Test if characters are in a string

Use the grepl function

grepl(needle, haystack, fixed = TRUE)

like so:

chars <- "test"
value <- "es"
grepl(value, chars, fixed = TRUE)
# TRUE

Use ?grepl to find out more.

Answer

Sigh, it took me 45 minutes to find the answer to this simple question. The answer is: grepl(needle, haystack, fixed=TRUE)

# Correct
> grepl("1+2", "1+2", fixed=TRUE)
[1] TRUE
> grepl("1+2", "123+456", fixed=TRUE)
[1] FALSE

# Incorrect
> grepl("1+2", "1+2")
[1] FALSE
> grepl("1+2", "123+456")
[1] TRUE

Interpretation

grep is named after the Linux executable, which is itself an acronym of Global Regular Expression Print: it reads lines of input and prints the ones that match the arguments you gave. Global meant it searched every line of the input (and the match could occur anywhere on a line). I'll explain Regular Expression below, but the idea is that it's a smarter way to match a string (R calls this character, e.g. class("abc")). And Print because it's a command line program, so emitting output means printing to its output stream.

Now, the grep program is basically a filter from lines of input to lines of output. And it seems that R's grep function similarly takes a vector of inputs. For reasons that are utterly unknown to me (I only started playing with R about an hour ago), it returns a vector of the indexes that match, rather than a list of matches.
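To make the index-versus-match distinction concrete, here is a small sketch. The haystack vector is made up for illustration; value = TRUE is a documented argument of base R's grep that returns the matching elements instead of their positions.

```r
# Illustrative data, not from the original question.
haystack <- c("apple", "banana", "mango")

# Default behaviour: integer indexes of the elements containing "an".
idx <- grep("an", haystack, fixed = TRUE)
print(idx)

# value = TRUE: the matching elements themselves.
hits <- grep("an", haystack, fixed = TRUE, value = TRUE)
print(hits)
```

Indexing the vector with the result (haystack[idx]) gives the same elements as value = TRUE.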

But, back to your original question: what we really want is to know whether we found the needle in the haystack, a true/false value. They apparently decided to name this function grepl, as in grep but with a Logical return value (they call true and false logical values, e.g. class(TRUE)).
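A quick sketch of that logical return value, again with a made-up haystack: grepl gives one TRUE/FALSE per element, and any() collapses them into a single answer when you only care whether the needle appears somewhere.

```r
# Illustrative data, not from the original question.
haystack <- c("apple", "banana", "mango")

# One logical value per element of the haystack.
found <- grepl("an", haystack, fixed = TRUE)
print(found)
print(class(found))

# Collapse to a single TRUE/FALSE: does any element contain the needle?
print(any(found))
```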

So, now we know where the name came from and what it's supposed to do. Let's get back to Regular Expressions. The arguments, even though they are strings, are used to build regular expressions (henceforth: regex). A regex is a way to match a string (if this definition irritates you, let it go). For example, the regex "a" matches the character a, the regex "a*" matches the character a 0 or more times, and the regex "a+" matches the character a 1 or more times. Hence in the example above, the needle we are searching for, "1+2", when treated as a regex, means one or more 1 followed by a 2… but ours is followed by a plus!

1+2
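The quantifier behaviour described above can be checked directly. This is only a sketch with made-up inputs; regmatches and regexpr are base R functions used here to show which part of the string the regex actually matched.

```r
# "a+" as a regex: one or more "a".
has_a_run <- grepl("a+", "caat")      # there is a run of "a"
no_a_run  <- grepl("a+", "ct")        # no "a" at all

# "1+2" as a regex: one or more "1", immediately followed by "2".
regex_112     <- grepl("1+2", "112")  # "11" then "2"
regex_literal <- grepl("1+2", "1+2")  # the "1" is followed by "+", not "2"

# What the regex matched inside "123+456": the substring "12".
matched <- regmatches("123+456", regexpr("1+2", "123+456"))

# With fixed = TRUE the needle is compared character for character.
fixed_literal <- grepl("1+2", "1+2", fixed = TRUE)
fixed_112     <- grepl("1+2", "112", fixed = TRUE)

print(c(has_a_run, no_a_run, regex_112, regex_literal, fixed_literal, fixed_112))
print(matched)
```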

So, if you used grepl without setting fixed, your needles would accidentally be treated as regexes, and that would accidentally work quite often; we can see it even works for the OP's example. But that's a latent bug! We need to tell it the input is a string, not a regex, which is apparently what fixed is for. Why fixed? No clue. Bookmark this answer, because you're probably going to have to look it up 5 more times before you get it memorized.
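Here is a minimal illustration of why the bug stays hidden. The inputs are chosen for the example: a needle without regex metacharacters behaves the same in both modes, while a needle such as "." (which as a regex matches any character) gives different answers.

```r
# No metacharacters in the needle: regex mode and fixed mode agree.
plain_regex <- grepl("es", "test")
plain_fixed <- grepl("es", "test", fixed = TRUE)

# "." is a regex wildcard, so regex mode matches any non-empty string...
dot_regex <- grepl(".", "test")
# ...while fixed mode looks for a literal dot, which "test" does not contain.
dot_fixed <- grepl(".", "test", fixed = TRUE)

print(c(plain_regex, plain_fixed, dot_regex, dot_fixed))
```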

A few final thoughts

The better your code is, the less history you have to know to make sense of it. Every argument can have at least two interesting values (otherwise it wouldn't need to be an argument). The docs list 9 arguments here, which means there are at least 2^9=512 ways to invoke it. That's a lot of work to write, test, and remember… decouple such functions (split them up, remove dependencies on each other; string things are different from regex things, which are different from vector things).

Some of the options are also mutually exclusive. Don't give users incorrect ways to use the code, i.e. the problematic invocation should be structurally nonsensical (such as passing an option that doesn't exist), not logically nonsensical (where you have to emit a warning to explain it). Put metaphorically: replacing the front door in the side of the 10th floor with a wall is better than hanging a sign that warns against its use, but either is better than neither.

In an interface, the function defines what the arguments should look like, not the caller (because the caller depends on the function; inferring everything that everyone might ever want to call it with makes the function depend on the callers, too, and this type of cyclical dependency will quickly clog a system up and never provide the benefits you expect). Be very wary of conflating types: it's a design flaw that things like TRUE and 0 and "abc" are all vectors.
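One way to act on the advice about splitting such functions up is a narrow wrapper that only does literal substring tests. The sketch below is an assumption for illustration: contains_text is a hypothetical helper, not part of base R, and its argument order and checks are just one possible design.

```r
# Hypothetical helper (not in base R): literal substring test only,
# with no regex-related options exposed to the caller.
contains_text <- function(haystack, needle) {
  stopifnot(
    is.character(haystack),
    is.character(needle),
    length(needle) == 1L,
    !is.na(needle)
  )
  grepl(needle, haystack, fixed = TRUE)
}

literal_hit <- contains_text("1+2", "1+2")
per_element <- contains_text(c("test", "toast"), "es")

print(literal_hit)
print(per_element)
```

Because the wrapper has no fixed argument, forgetting it is no longer a possible mistake.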

Answer

You want grepl:

> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE
