This is a quick tutorial covering a couple of functions you can use to clean and fix problems in your data.
These functions are optional: you can skip them, use one on its own, or use them one after the other. They are wrapper functions that clean your data while maintaining appropriate categorical (factor) levels.
This tutorial assumes that you already know how to load/import your data. We will be using the feedr package example data set: finches.
head(finches)
## # A tibble: 6 × 10
## animal_id date time logger_id species age sex site_name lon lat
## <fct> <date> <dttm> <fct> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 0620000514 2016-01-28 2016-01-28 12:34:25 2200 House Finch AHY F Kamloops, BC -120. 50.7
## 2 0620000514 2016-01-28 2016-01-28 12:34:28 2200 House Finch AHY F Kamloops, BC -120. 50.7
## 3 041868D861 2016-01-28 2016-01-28 12:35:41 2200 House Finch AHY M Kamloops, BC -120. 50.7
## 4 06200004F8 2016-01-28 2016-01-28 12:35:52 2200 House Finch AHY F Kamloops, BC -120. 50.7
## 5 06200004F8 2016-01-28 2016-01-28 12:35:59 2200 House Finch AHY F Kamloops, BC -120. 50.7
## 6 06200004F8 2016-01-28 2016-01-28 12:36:02 2200 House Finch AHY F Kamloops, BC -120. 50.7
check_ids()
This function can be used to remove any animal_ids that are present, but which you know aren’t really animals. For example, if you use a ‘wand’ to test the deployment of your loggers, this is an animal_id that you should remove prior to analysis. Further, occasionally there are animal_ids that are error codes (e.g. 0000000000). You may wish to determine why these are present (you probably should!), but once again, they should be removed for analysis.
This function works by comparing the list of animal_ids in the data to an external animal_id data set. The data set is expected to have at least two columns: animal_id and species. The species column should contain either the species identity (e.g. House Finch or HOFI) or an error code (e.g. wand or error).
In addition to removing error or wand ids, this function will also report which animal_ids are in your data sets but not in your animal_id index, and which are in your animal_id index but not in your data sets. This will help you determine whether you are detecting RFID tags that are not in your master index (weird!) and whether some RFID tags have never been recorded by a logger.
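The two-way comparison described above can be pictured with base R's setdiff(). This is only a sketch of the idea, not the internals of check_ids(); the data frames obs and index are made-up miniature examples.

```r
# Hypothetical miniature data: observed reads and an animal_id index
obs <- data.frame(
  animal_id = c("0620000514", "041868D861", "041868D861", "06200004F8"),
  stringsAsFactors = FALSE
)
index <- data.frame(
  animal_id = c("0620000514", "041868D861", "062000043E"),
  species   = c("House Finch", "House Finch", "House Finch"),
  stringsAsFactors = FALSE
)

# ids detected by loggers but missing from the index
in_data_not_index <- setdiff(unique(obs$animal_id), index$animal_id)

# ids listed in the index but never recorded by a logger
in_index_not_data <- setdiff(index$animal_id, unique(obs$animal_id))
```

Here in_data_not_index holds the one tag that was read but never registered, and in_index_not_data holds the one registered tag that was never read.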
Let’s load an animal index file. Note that there are no error or wand ids.
animal_index <- read.csv("./Data/animal_index.csv")
animal_index
## animal_id species
## 1 0620000514 House Finch
## 2 041868D861 House Finch
## 3 06200004F8 House Finch
## 4 062000043E House Finch
## 5 041868D396 House Finch
You need to give this function a data set, and it will return a cleaned data set. Here we’ll save it as r_clean:
r_clean <- check_ids(finches, ids = animal_index)
## All animal_ids in your data are also in your animal_id index
## All animal_ids in your animal_id index are also in your data
## animal_id index (id) data frame doesn't contain any animal_ids to omit
## No animal_ids have been omitted
This output shows that all the animal_ids in the data are also in the index and vice versa. Further, there were no omitted ids (error or wand ids).
Note: You can also skip loading the index and simply provide check_ids() with the location of the index file:
r_clean <- check_ids(finches, ids = "./Data/animal_index.csv")
## All animal_ids in your data are also in your animal_id index
## All animal_ids in your animal_id index are also in your data
## animal_id index (id) data frame doesn't contain any animal_ids to omit
## No animal_ids have been omitted
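If you do load the index yourself, one thing worth checking is that ids made up entirely of digits keep their leading zeros. The sketch below uses made-up CSV text in place of a real file and shows the effect of colClasses in base R's read.csv(). Whether check_ids() already handles this when given a file location is not stated here, so treat this as a precaution rather than a requirement.

```r
# Made-up CSV text standing in for an index file on disk
csv_text <- "animal_id,species
0620000514,House Finch
0000000000,error"

# Default reading: an all-digit id column is converted to a number,
# which silently drops the leading zeros
idx_default <- read.csv(text = csv_text)

# Reading every column as character keeps the ids exactly as written
idx_char <- read.csv(text = csv_text, colClasses = "character")
```

With a real file, the same colClasses argument can be passed alongside the file path.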
Let’s see how it works if you did have a ‘wand’ or ‘error’ code in your index file that matched an animal_id in your data set.
animal_index
## animal_id species
## 1 0620000514 wand
## 2 041868D861 error
## 3 06200004F8 House Finch
## 4 062000043E House Finch
## 5 041868D396 House Finch
r_clean <- check_ids(finches, ids = animal_index)
## All animal_ids in your data are also in your animal_id index
## All animal_ids in your animal_id index are also in your data
## The following animal_ids have been omitted: 0620000514, 041868D861
Here we omitted two animal_ids, one associated with a wand (0620000514) and one with an error (041868D861).
Note that nothing else changed.
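The omission step can be sketched in base R as a filter followed by droplevels(). This is an illustration under assumptions (the labels "wand" and "error" and the toy data frames are made up for the example), not the actual implementation of check_ids().

```r
# Hypothetical reads stored as a factor, as in the finches data
obs <- data.frame(
  animal_id = factor(c("0620000514", "041868D861", "06200004F8", "06200004F8"))
)
index <- data.frame(
  animal_id = c("0620000514", "041868D861", "06200004F8"),
  species   = c("wand", "error", "House Finch"),
  stringsAsFactors = FALSE
)

# ids flagged in the species column as non-animals (assumed labels)
omit <- index$animal_id[index$species %in% c("wand", "error")]

# Drop the matching rows, then drop the now-unused factor levels
cleaned <- obs[!(obs$animal_id %in% omit), , drop = FALSE]
cleaned$animal_id <- droplevels(cleaned$animal_id)
```

Dropping the unused levels matters because a factor otherwise keeps the omitted ids as empty categories, which can show up later in summaries and plots.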
animal_ids present in the data set but not in the index:
animal_index
## animal_id species
## 1 0620000514 wand
## 2 041868D861 error
## 3 06200004F8 House Finch
## 4 062000043E House Finch
r_clean <- check_ids(finches, ids = animal_index)
## Some animal_ids present in your data do not exist in the animal_id index: 041868D396
## All animal_ids in your animal_id index are also in your data
## The following animal_ids have been omitted: 0620000514, 041868D861
check_problems()
This function is only necessary if, for some reason, you’re getting errors in the recorded animal_ids.
It will correct all instances of an animal_id according to the list provided, which pairs each original_id with its corrected_id.
problems <- read.csv("./Data/problems.csv")
problems
## original_id corrected_id
## 1 06200004F8 041B6BEF6B
## 2 041868D396 041B999F6B
Original animal_ids:
finches$animal_id[1:5]
## [1] 0620000514 0620000514 041868D861 06200004F8 06200004F8
## Levels: 041868D396 041868D861 062000043E 06200004F8 0620000514
Fix the problems and view the new animal_ids:
r_clean <- check_problems(finches, problems = problems)
## The following animal ids have been corrected:
## 06200004F8 to 041B6BEF6B
## 041868D396 to 041B999F6B
r_clean$animal_id[1:5]
## [1] 0620000514 0620000514 041868D861 041B6BEF6B 041B6BEF6B
## Levels: 041868D861 062000043E 0620000514 041B6BEF6B 041B999F6B
Note that the animal_ids have been modified, but also that the factor levels have been updated to match.
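To see why the values and the factor levels change together, here is a minimal base R sketch that renames levels from a corrections table. It is not the code behind check_problems(), and the ids vector is a made-up example. It only shows that relabelling a factor's levels changes every matching value at once.

```r
# Hypothetical ids stored as a factor, plus a corrections table
ids <- factor(c("0620000514", "06200004F8", "06200004F8", "041868D396"))
problems <- data.frame(
  original_id  = c("06200004F8", "041868D396"),
  corrected_id = c("041B6BEF6B", "041B999F6B"),
  stringsAsFactors = FALSE
)

# Look up each existing level in the corrections table
lev <- levels(ids)
hit <- match(lev, problems$original_id)

# Replace only the levels that have a correction; the rest stay unchanged
lev[!is.na(hit)] <- problems$corrected_id[hit[!is.na(hit)]]
levels(ids) <- lev
```

After this, no original_id remains among the levels, and each affected observation carries its corrected id.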
Now that erroneous or problematic records have been removed or corrected, your data is ready to be transformed.