Is there a faster way to recode character data when it’s actually a factor?

I often deal with character data that need some recoding. One common scenario is that a character vector that's being recorded is a factor in essence, but not necessarily in class. Consider for example, a `chr` vector such as the following `vec`:

``````set.seed(2021)
vec <- sample(rep(c("animal_dog_xyz", "animal_cat_abc", "animal_alligator_tyl"), 10))
vec
#>  [1] "animal_dog_xyz"       "animal_alligator_tyl" "animal_cat_abc"
#>  [4] "animal_cat_abc"       "animal_alligator_tyl" "animal_alligator_tyl"
#>  [7] "animal_cat_abc"       "animal_cat_abc"       "animal_cat_abc"
#> [10] "animal_dog_xyz"       "animal_dog_xyz"       "animal_cat_abc"
#> [13] "animal_alligator_tyl" "animal_alligator_tyl" "animal_alligator_tyl"
#> [16] "animal_cat_abc"       "animal_dog_xyz"       "animal_alligator_tyl"
#> [19] "animal_alligator_tyl" "animal_cat_abc"       "animal_dog_xyz"
#> [22] "animal_cat_abc"       "animal_cat_abc"       "animal_dog_xyz"
#> [25] "animal_dog_xyz"       "animal_dog_xyz"       "animal_dog_xyz"
#> [28] "animal_dog_xyz"       "animal_alligator_tyl" "animal_alligator_tyl"
``````

Created on 2021-07-19 by the reprex package (v2.0.0)

If I want to recode this vector and extract only the animal name, I would go with a solution that is suited for character data:

``````library(stringr)

sapply(str_split(vec, "_",  n = 3), `[`, 2)
#>  [1] "dog"       "alligator" "cat"       "cat"       "alligator" "alligator"
#>  [7] "cat"       "cat"       "cat"       "dog"       "dog"       "cat"
#> [13] "alligator" "alligator" "alligator" "cat"       "dog"       "alligator"
#> [19] "alligator" "cat"       "dog"       "cat"       "cat"       "dog"
#> [25] "dog"       "dog"       "dog"       "dog"       "alligator" "alligator"
``````

The Problem

If the vector is very long, such recoding procedure takes a long time. R would iterate over each vector element and apply the procedure. This seems inefficient given that there are only 3 unique values in the vector. In other words, we don't need to go one-by-one over the elements and figure out what should be the recoded value.

Here, `vec_long` is of length 30000. This is how much it would take to recode it on my machine.

``````vec_long <- sample(rep(c("animal_dog_xyz", "animal_cat_abc", "animal_alligator_tyl"), 10000))
length(vec_long)
#> [1] 30000

library(microbenchmark)

microbenchmark(sapply(str_split(vec_long, "_",  n = 3), `[`, 2))
#> Unit: milliseconds
#>                                             expr     min       lq     mean
#>  sapply(str_split(vec_long, "_", n = 3), `[`, 2) 51.6972 52.66918 57.42299
#>    median      uq     max neval
#>  54.47867 58.7653 115.754   100
``````

Is there a way to leverage the fact that this vector is actually a `factor`? Thus to identify the unique values ("levels"), recode them, and re-deploy to the entire vector length? Is there such a procedure that will speed up processing time?

Thanks!

EDIT

I just want to summarize my testing based on @GKi's answer, @ThomasIsCoding's answer, and @user20650's comment.

``````## The Data
set.seed(2021)

unique_vals <- c("animal_dog_xyz", "animal_cat_abc", "animal_alligator_tyl")

vec <- sample(rep(unique_vals, 10))
vec_long <- sample(rep(unique_vals, 1000))
vec_very_long <- sample(unique_vals, 100000))
``````
``````## The functions

## function #1 -- as @user20650 proposed
via_fac_levels <- function(x) {
x_factor <- factor(x)
levels(x_factor) <- sapply(str_split(levels(x_factor), "_",  n = 3), `[`, 2)
as.character(x_factor)
}
####################

## function #2 --  as @GKi proposed
via_fac_no_levels <- function(x) {
x_factor <- as.factor(x)
x_factor <- sapply(strsplit(levels(x_factor), "_", TRUE), `[`, 2)[x_factor]
as.character(x_factor)
}
####################

## function #3 -- the original slow method shown in the question
via_chr_only <- function(x) {
sapply(str_split(x, "_",  n = 3), `[`, 2)
}

####################

## function #4 -- as @ThomasIsCoding proposed
read.table(text = paste0(x, collapse = "\n"), sep = "_", header = FALSE)\$V2
}

###################

## function #5 -- forcats::fct_relabel()
via_fct_relabel <- function(x) {
x_factor <- as.factor(x)
x_factor <- fct_relabel(x_factor, ~sapply(str_split(.x, "_",  n = 3), `[`, 2))
as.character(x_factor)
}
``````
``````## Performance assessment
### I ran it on Rstudio cloud
bm_short <- bench::mark(fac_levels = via_fac_levels(vec),
fac_no_levels = via_fac_no_levels(vec),
chr = via_chr_only(vec),
fct_relabel = via_fct_relabel(vec),
iterations = 1000)

bm_long <- bench::mark(fac_levels = via_fac_levels(vec_long),
fac_no_levels = via_fac_no_levels(vec_long),
chr = via_chr_only(vec_long),
fct_relabel = via_fct_relabel(vec_long),
iterations = 1000)

bm_very_long <- bench::mark(fac_levels = via_fac_levels(vec_very_long),
fac_no_levels = via_fac_no_levels(vec_very_long),
chr = via_chr_only(vec_very_long),
fct_relabel = via_fct_relabel(vec_very_long),
iterations = 1000)
``````
``````## visualize
library(ggplot2)
library(tidyr)
library(ggbeeswarm)
library(beeswarm)

autoplot(bm_short) + ggtitle("data of length 30")
autoplot(bm_long) + ggtitle("data of length 3000")
autoplot(bm_very_long) + ggtitle("data of length 300000")
``````

``````## verify all functions give the same output
v1 <- via_fac_levels(vec_long)
v2 <- via_fac_no_levels(vec_long)
v3 <- via_chr_only(vec_long)