This is a quick tutorial covering a couple of functions you can use to clean and fix problems in your data.

These functions can be skipped or used one after the other; it’s up to you. They are wrapper functions for cleaning your data while maintaining appropriate categorical (factor) levels.

This tutorial assumes that you already know how to load/import your data. We will be using the feedr package example data set: finches.

head(finches)
##    animal_id       date                time logger_id     species age sex    site_name       lon      lat
## 1 0620000514 2016-01-28 2016-01-28 12:34:25      2200 House Finch AHY   F Kamloops, BC -120.3612 50.66778
## 2 0620000514 2016-01-28 2016-01-28 12:34:28      2200 House Finch AHY   F Kamloops, BC -120.3612 50.66778
## 3 041868D861 2016-01-28 2016-01-28 12:35:41      2200 House Finch AHY   M Kamloops, BC -120.3612 50.66778
## 4 06200004F8 2016-01-28 2016-01-28 12:35:52      2200 House Finch AHY   F Kamloops, BC -120.3612 50.66778
## 5 06200004F8 2016-01-28 2016-01-28 12:35:59      2200 House Finch AHY   F Kamloops, BC -120.3612 50.66778
## 6 06200004F8 2016-01-28 2016-01-28 12:36:02      2200 House Finch AHY   F Kamloops, BC -120.3612 50.66778

## check_ids()

This function can be used to remove any animal_ids that are present in your data but that you know aren’t really animals. For example, if you use a ‘wand’ to test the deployment of your loggers, the wand’s animal_id should be removed prior to analysis. Occasionally there are also animal_ids that are error codes (e.g. 0000000000). You may wish to determine why these are present (you probably should!), but for analysis they should be removed as well.

This function works by comparing the animal_ids in the data to an external animal_id data set (the animal_id index). The index is expected to have at least two columns: animal_id and species. The species column should contain either the species identity (e.g. House Finch or HOFI) or an error code (e.g. wand or error).
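
As a rough illustration of the expected layout, the sketch below builds a small index by hand in base R and picks out the rows that carry an error code instead of a species. The ids, the species values, and the `omit_codes` vector are made up for illustration and are not taken from feedr; in practice the index would usually be read from a file.

```r
# Hypothetical index built by hand; ids are stored as character strings
animal_index <- data.frame(
  animal_id = c("0620000514", "041868D861", "06200004F8", "0000000000"),
  species   = c("House Finch", "House Finch", "House Finch", "error"),
  stringsAsFactors = FALSE
)

# Codes assumed (for this sketch only) to mark ids that are not real animals
omit_codes <- c("wand", "error")

# Rows of the index whose species entry is one of those codes
flagged <- animal_index[animal_index$species %in% omit_codes, ]
flagged
```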

In addition to removing error or wand ids, this function will also report which animal_ids are in your data sets, but not in your animal_id index, and which are in your animal_id index, but not in your data sets. This will help you determine whether you are detecting RFID tags that are not in your master index (weird!) and whether some RFID tags have never been recorded by a logger.
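
The two-way comparison described above can be sketched with base R set operations. This is not how check_ids() is necessarily implemented; `data_ids` and `index_ids` are hypothetical vectors used only to show the idea.

```r
# Hypothetical ids seen by the loggers and ids listed in the index
data_ids  <- c("0620000514", "041868D861", "06200004F8")
index_ids <- c("0620000514", "041868D861", "062000043E")

# Ids detected in the data that are missing from the index
in_data_only <- setdiff(data_ids, index_ids)

# Ids listed in the index that were never recorded in the data
in_index_only <- setdiff(index_ids, data_ids)

in_data_only
in_index_only
```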

### With no error/wand ids

Let’s load an animal index file. Note that it contains no error or wand ids.

animal_index <- read.csv("./Data/animal_index.csv")
animal_index
##    animal_id     species
## 1 0620000514 House Finch
## 2 041868D861 House Finch
## 3 06200004F8 House Finch
## 4 062000043E House Finch
## 5 041868D396 House Finch

You need to give this function a data set, and it will return a cleaned data set. Here we’ll save it as r_clean:

r_clean <- check_ids(finches, ids = animal_index)
## All animal_ids in your data are also in your animal_id index
## All animal_ids in your animal_id index are also in your data
## animal_id index (id) data frame doesn't contain any animal_ids to omit
## No animal_ids have been omitted

This output shows that all the animal_ids in the data are also in the index and vice versa. Further, there were no omitted ids (error or wand ids).

Note: You can also skip loading the index and simply provide check_ids() with the location of the index file:

r_clean <- check_ids(finches, ids = "./Data/animal_index.csv")
## All animal_ids in your data are also in your animal_id index
## All animal_ids in your animal_id index are also in your data
## animal_id index (id) data frame doesn't contain any animal_ids to omit
## No animal_ids have been omitted

### With error/wand ids

Let’s see how it works if you have a ‘wand’ or ‘error’ code in your index file that matches an animal_id in your data set.

animal_index
##    animal_id     species
## 1 0620000514        wand
## 2 041868D861       error
## 3 06200004F8 House Finch
## 4 062000043E House Finch
## 5 041868D396 House Finch
r_clean <- check_ids(finches, ids = animal_index)
## All animal_ids in your data are also in your animal_id index
## All animal_ids in your animal_id index are also in your data
## The following animal_ids have been omitted: 0620000514, 041868D861

Here we omitted two animal_ids, one associated with a wand (0620000514) and one with an error (041868D861).

Note that the first two messages are unchanged: every animal_id in the data is still in the index, and vice versa.
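
For intuition, removing flagged ids while keeping the factor levels tidy might look like the following in base R. This is only a sketch of the idea, not feedr's implementation; `reads`, `index` and `omit` are hypothetical names, and the data are made up.

```r
# Hypothetical reads with animal_id stored as a factor
reads <- data.frame(
  animal_id = factor(c("0620000514", "041868D861", "06200004F8", "06200004F8")),
  logger_id = c("2200", "2200", "2200", "2200")
)

# Hypothetical index in which two ids are flagged as wand/error
index <- data.frame(
  animal_id = c("0620000514", "041868D861", "06200004F8"),
  species   = c("wand", "error", "House Finch"),
  stringsAsFactors = FALSE
)

# Ids to omit: those whose species entry is a wand or error code
omit <- index$animal_id[index$species %in% c("wand", "error")]

# Drop the matching rows, then drop factor levels that are no longer used
kept <- droplevels(reads[!(reads$animal_id %in% omit), ])
levels(kept$animal_id)
```

Without `droplevels()`, the omitted ids would remain as empty factor levels and could still show up in later summaries.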

### animal_ids present in data set but not in the index

In this example, 041868D396 has been removed from the index, so check_ids() reports that it is present in the data but not in the index:

animal_index
##    animal_id     species
## 1 0620000514        wand
## 2 041868D861       error
## 3 06200004F8 House Finch
## 4 062000043E House Finch
r_clean <- check_ids(finches, ids = animal_index)
## Some animal_ids present in your data do not exist in the animal_id index: 041868D396
## All animal_ids in your animal_id index are also in your data
## The following animal_ids have been omitted: 0620000514, 041868D861

## check_problems()

This function is only necessary if, for some reason, you’re getting errors in the recorded animal_ids.

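It corrects every instance of a problematic animal_id according to the list you provide. Here the list is a data frame with one column of original ids (original_id) and one column of corrected ids (corrected_id):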

problems <- read.csv("./Data/problems.csv")
problems
##   original_id corrected_id
## 1  06200004F8   041B6BEF6B
## 2  041868D396   041B999F6B

Original animal_ids:

finches$animal_id[1:5]
## [1] 0620000514 0620000514 041868D861 06200004F8 06200004F8
## Levels: 041868D396 041868D861 062000043E 06200004F8 0620000514

Fix the problems and look at the new animal_ids:

r_clean <- check_problems(finches, problems = problems)
## The following animal ids have been corrected:
## 06200004F8 to 041B6BEF6B
## 041868D396 to 041B999F6B
r_clean$animal_id[1:5]
## [1] 0620000514 0620000514 041868D861 041B6BEF6B 041B6BEF6B
## Levels: 041868D861 062000043E 0620000514 041B6BEF6B 041B999F6B

Note that the animal_ids have been modified, but also that the factor levels have been updated to match.
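
To see why the factor levels matter, here is a minimal base R sketch of correcting ids by renaming factor levels. It is an assumption about how such a correction could be done, not feedr's actual code, and it does not try to reproduce the level ordering shown above; `ids` is a hypothetical vector.

```r
# Hypothetical factor of recorded ids
ids <- factor(c("0620000514", "06200004F8", "041868D861", "06200004F8"))

# Corrections in the same layout as the problems table
problems <- data.frame(
  original_id  = c("06200004F8", "041868D396"),
  corrected_id = c("041B6BEF6B", "041B999F6B"),
  stringsAsFactors = FALSE
)

# Look up each level in the list of original ids (NA means no correction)
lv  <- levels(ids)
hit <- match(lv, problems$original_id)

# Rename only the levels that have a correction; values follow automatically
lv[!is.na(hit)] <- problems$corrected_id[hit[!is.na(hit)]]
levels(ids) <- lv

ids
```

Renaming the levels, not the individual values, means the old id cannot linger as an unused level.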

Now that erroneous and problematic animal_ids have been removed or corrected, your data is ready to be transformed.