Fostering the next generation of open science with R

Karthik Ram (@_inundata)

Supported by:

These data are hard to get to

Open Science

Instructions for preparation of the Biographical Sketch have been revised to rename the "Publications" section to "Products" and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

Why R?

The old way...

Why R?

A better way

glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)

This is reproducible and repeatable, and can serve as an analytic workflow.
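As a minimal sketch of what such a scripted, repeatable analysis can look like, the example below simulates a small data set and fits the model from a fixed seed, so every run gives the same coefficients. The variable names mirror the formula above, but the simulated data and the choice of a binomial family are assumptions made only for illustration.

```r
# Simulated stand-in for `mydata`; the real data are not part of the slides.
set.seed(42)
n <- 200
mydata <- data.frame(
  a = rnorm(n),
  c = rnorm(n),
  z = rbinom(n, 1, 0.5)
)

# Outcome generated from a known logistic model (an assumption for this sketch).
eta <- 0.5 * mydata$a - 0.3 * mydata$c + 0.8 * mydata$z + 0.4 * mydata$a * mydata$z
mydata$y <- rbinom(n, 1, plogis(eta))

# Same formula as above; family = binomial is assumed, maxit is passed on
# to glm.control().
fit <- glm(y ~ -1 + a + c + z + a:z, data = mydata,
           family = binomial, maxit = 30)

coefs <- coef(fit)
print(round(coefs, 3))
```

Because the seed, the data step and the model call all live in one script, anyone can rerun it and compare results line by line.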

Open Science needs open source tools

Open data + code

Enable access to scientific data repositories, full-text of articles, and science metrics and also facilitate a culture shift in the scientific community.

More info @ ropensci.org/packages

Data
Treebase, Fishbase, Flybase, GBIF, Vertnet, Dryad, ITIS, NPN, Taxize, AntWeb

Journals
PLOS, Springer, textmine, pensoft

Hybrid
figshare, Mendeley, DataONE, rAltmetric, rEML, rNEXML

Search full text of 100k+ open access articles - `rplos`

library(rplos)
plot_throughtime(list("reproducible science"), 500)

Accessing data behind papers - `rdryad`

library(rdryad)

# Get the URL for a data file
dryaddat <- download_url("10255/dryad.1759")

# Get a file given the URL
file <- dryad_getfile(dryaddat)

Mapping biodiversity data - `rgbif`

library(rgbif)
# Up to 1000 georeferenced monarch butterfly records, returned as a data frame
distribution <- occurrencelist(sciname = "Danaus plexippus", coordinatestatus = TRUE,
    maxresults = 1000, latlongdf = TRUE)

Species distribution modeling

World Bank climate knowledge portal `rWBclimate`

library(rWBclimate)
cal_basin_id <- c(365, 280, 281, 282, 273, 274, 275)
cal_basin <- create_map_df(cal_basin_id)
cal_basin_dat <- get_ensemble_temp(cal_basin_id, "annualanom", 2080, 2100)

Full code example

Resolve taxonomic names

library(taxize)
splist <- c("Helanthus annuus", "Pinos contorta", "Collomia grandiflorra", "Abies magnificaa",
    "Rosa california", "Datura wrighti", "Mimulus bicolour", "Nicotiana glauca",
    "Maddia sativa", "Bartlettia scapposa")
splist_tnrs <- tnrs(query = splist, getpost = "POST", source = "iPlant_TNRS")
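To illustrate what name resolution does with misspelled input, here is a small offline sketch using plain edit distance from base R. It is not how TNRS or `taxize` work internally; the `accepted` reference list is hypothetical and hard-coded rather than fetched from any service.

```r
# Hypothetical reference list of accepted names (not fetched from any service).
accepted <- c("Helianthus annuus", "Pinus contorta", "Collomia grandiflora",
              "Abies magnifica", "Datura wrightii", "Nicotiana glauca")

# A few of the misspelled names from the list above.
submitted <- c("Helanthus annuus", "Pinos contorta", "Abies magnificaa")

# Edit distance between every submitted name and every accepted name.
d <- adist(submitted, accepted)

# For each submitted name, pick the accepted name with the smallest distance.
best <- apply(d, 1, which.min)
resolved <- data.frame(
  submitted = submitted,
  matched   = accepted[best],
  distance  = d[cbind(seq_along(submitted), best)],
  stringsAsFactors = FALSE
)
print(resolved)
```

A real resolver also handles synonyms, authorities and partial matches, which is why delegating to a dedicated service is preferable to this kind of hand-rolled matching.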

Taxize queries 11 different name resolution services

Encyclopedia of Life
Taxonomic Name Resolution Service
Integrated Taxonomic Information System
Phylomatic
uBio
Global Names Resolver
Global Names Index
IUCN Red List
Tropicos
Plantminer
The Plant List (theplantlist.org)
Catalogue of Life
Global Invasive Species Database

Measure research impact in real time

Tracking altmetrics - `rAltmetric`, `ALM`

library(rAltmetric)

altmetrics("doi/10.1038/489201a")

## Altmetrics on: "Future impact: Predicting scientific success" with altmetric_id: 942310 published in Nature.
##   provider count
## 1 Facebook     1
## 2    Feeds    10
## 3  Google+     1
## 4    Cited   179
## 5   Tweets   159
## 6 Accounts   171

Sharing unpublished data - `rfigshare`

Using figshare's API it is now possible to share figures, data and any other object generated in `R` directly to any figshare account.

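library(rfigshare)
fs_auth()  # uses API keys to log in
# Create a new article and keep its id (title and description are placeholders)
id <- fs_create(title = "My dataset", description = "A short description", type = "dataset")
# Upload a file to that article ("mydata.csv" is a placeholder path)
fs_upload(id, "mydata.csv")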

A guide to using the AntWeb package

@_inundata
ropensci.github.io/antweb-guide

Installing the package

install.packages("AntWeb", dependencies = TRUE)
# Requires R version 3.0.1 or higher

Unique taxa in the data source

library(AntWeb)
families <- aw_unique(rank = "subfamily")
head(families)

      subfamily
  1      apidae
  2  bethylidae
  3  braconidae
  4   cynipidae
  5  diapriidae
  6 diaspididae

nrow(families)

  [1] 69

Unique taxa in the data source

library(AntWeb)
genera <- aw_unique(rank = "genus")
head(genera)

            genus
  1    Aenictinae
  2 Amblyoponinae
  3        Apidae
  4        Attini
  5  Basicerotini
  6    Bethylidae

nrow(genera)

  [1] 470

Unique taxa in the data source

library(AntWeb)
species <- aw_unique(rank = "species")
head(species)

       specificEpithet
  1       basicerotini
  2              indet
  3             indet.
  4 orizabanum_complex
  5         abbreviata
  6         abdelazizi

nrow(species)

  [1] 10413
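The first few rows above include placeholder entries such as `indet` and `indet.`, so some cleaning is usually needed before counting species. The sketch below works on a hand-typed copy of those six rows; the filtering rule (drop indeterminate entries and anything containing an underscore) is an assumption, not something the AntWeb package does for you.

```r
# Hand-typed copy of the six rows printed above.
species <- data.frame(
  specificEpithet = c("basicerotini", "indet", "indet.", "orizabanum_complex",
                      "abbreviata", "abdelazizi"),
  stringsAsFactors = FALSE
)

# Assumed rule: "indet"/"indet." mark indeterminate specimens, and an
# underscore marks a species complex rather than a single epithet.
is_placeholder <- grepl("^indet\\.?$", species$specificEpithet) |
  grepl("_", species$specificEpithet, fixed = TRUE)

clean_species <- species[!is_placeholder, , drop = FALSE]
print(clean_species)
```

A pattern-based rule like this cannot catch every problem (for example, a tribe name entered in the species field), so a manual check of the remainder is still worthwhile.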

Download data by any catalog number

data_by_code <- aw_code(occurrenceid = "CAS:ANTWEB:alas188691")
data_by_code

  [Total results on the server]: 1 
  [Args]: 
  occurrenceId = cas:antweb:alas188691 
  [Dataset]: [1 x 16] 
  [Data preview] :
                                                    specimens.url
  1  http://antweb.org/api/v2/?occurrenceId=CAS:ANTWEB:alas188691
  NA                                                         
     specimens.catalogNumber specimens.family specimens.subfamily
  1               alas188691       formicidae          myrmicinae
  NA                                                 
     specimens.genus specimens.specificEpithet  specimens.scientific_name
  1    Crematogaster              curvispinosa crematogaster curvispinosa
  NA                                                         
     specimens.typeStatus specimens.stateProvince specimens.country
  1                                       Heredia                  
  NA                                                   
     specimens.dateIdentified        specimens.habitat
  1                2000-06-06 Sendero Jaguar Hojarasca
  NA                                          
     specimens.minimumElevationInMeters specimens.geojson.type
  1                                  50                  point
  NA                                                  
     decimal_latitude decimal_longitude
  1          10.42118         -84.02531
  NA

Download data with bounding boxes

bbox <- "6.7117,-87.7867,13.2595,-76.9277" # Central America
crematogaster <- aw_data_all(genus = "crematogaster", bbox = bbox)
aw_map(crematogaster)
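To make the bounding-box idea concrete, the following offline sketch parses the same string and filters a small data frame locally. The coordinate order (minimum latitude, minimum longitude, maximum latitude, maximum longitude) is an assumption read off the values; check the AntWeb documentation before relying on it. The `records` data frame is hypothetical apart from the coordinates of the specimen shown earlier.

```r
bbox <- "6.7117,-87.7867,13.2595,-76.9277"

# ASSUMPTION: the order is min lat, min lon, max lat, max lon.
b <- as.numeric(strsplit(bbox, ",", fixed = TRUE)[[1]])
names(b) <- c("lat_min", "lon_min", "lat_max", "lon_max")

# Hypothetical records: the first uses the coordinates of specimen alas188691,
# the second is a made-up point far outside the box.
records <- data.frame(
  id  = c("alas188691", "example_far_away"),
  lat = c(10.42118, 37.87),
  lon = c(-84.02531, -122.27),
  stringsAsFactors = FALSE
)

in_box <- records$lat >= b[["lat_min"]] & records$lat <= b[["lat_max"]] &
  records$lon >= b[["lon_min"]] & records$lon <= b[["lon_max"]]
inside <- records[in_box, ]
print(inside)
```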

Download all available data on a taxonomic group

crematogaster <- aw_data_all(genus = "crematogaster", georeferenced = TRUE)

Search around specific coordinates with an error radius

san_andres <- aw_coords("12.54,-81.72", r = 10)
# Search around the San Andres Island.
nrow(san_andres$data)

  [1] 8
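A radius search amounts to keeping points whose great-circle distance from the centre is below a threshold. The sketch below shows that calculation with the haversine formula in base R; it assumes the radius is in kilometres and uses made-up test points, and it says nothing about how the AntWeb server implements the search.

```r
# Great-circle distance in kilometres (haversine formula, spherical Earth).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Centre of the search, as in the call above.
centre <- c(lat = 12.54, lon = -81.72)

# Made-up points for illustration.
pts <- data.frame(
  label = c("same point", "0.05 degrees north", "mainland"),
  lat   = c(12.54, 12.59, 10.42118),
  lon   = c(-81.72, -81.72, -84.02531),
  stringsAsFactors = FALSE
)
pts$dist_km <- haversine_km(centre[["lat"]], centre[["lon"]], pts$lat, pts$lon)

# Keep points within an assumed 10 km radius.
nearby <- pts[pts$dist_km <= 10, ]
print(nearby)
```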

Find recent images

last_five_days <- aw_images(since = 5)

Species occurrence data (SPOCC)

Combine various data sources
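Combining sources mostly means mapping each provider's column names onto one shared schema. The sketch below does this by hand in base R; it is not the `spocc` API, the `standardise()` helper is hypothetical, and the second record is invented for illustration.

```r
# Records from two sources with different column names.
from_antweb <- data.frame(
  scientific_name   = "crematogaster curvispinosa",
  decimal_latitude  = 10.42118,
  decimal_longitude = -84.02531,
  stringsAsFactors  = FALSE
)
from_other <- data.frame(   # invented record from a hypothetical second source
  name = "Danaus plexippus",
  lat  = 19.4,
  lon  = -99.1,
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename columns to a shared schema and tag the source.
standardise <- function(df, name, lat, lon, source) {
  data.frame(
    name   = df[[name]],
    lat    = df[[lat]],
    lon    = df[[lon]],
    source = source,
    stringsAsFactors = FALSE
  )
}

combined <- rbind(
  standardise(from_antweb, "scientific_name", "decimal_latitude",
              "decimal_longitude", "antweb"),
  standardise(from_other, "name", "lat", "lon", "other")
)
print(combined)
```

Keeping a `source` column makes it possible to trace every row back to its provider after merging.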

Live Shiny app

Natively render GeoJSON on GitHub
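For readers who have not seen the format, this is a minimal sketch of writing point records as a GeoJSON FeatureCollection using only base R. The file name and the `pts` data frame are illustrative assumptions, and the string building assumes names contain no quote characters; a JSON library would be more robust for real data.

```r
# Illustrative points: a specimen location and the San Andres search centre.
pts <- data.frame(
  name = c("alas188691", "san_andres_centre"),
  lat  = c(10.42118, 12.54),
  lon  = c(-84.02531, -81.72),
  stringsAsFactors = FALSE
)

# One GeoJSON Feature per row. GeoJSON coordinates are [longitude, latitude].
features <- sprintf(
  '{"type":"Feature","geometry":{"type":"Point","coordinates":[%s,%s]},"properties":{"name":"%s"}}',
  formatC(pts$lon, format = "f", digits = 5),
  formatC(pts$lat, format = "f", digits = 5),
  pts$name
)
geojson <- sprintf('{"type":"FeatureCollection","features":[%s]}',
                   paste(features, collapse = ","))

# Write to a temporary file; a file with the .geojson extension committed to a
# GitHub repository or gist is rendered as a map.
out_file <- file.path(tempdir(), "specimens.geojson")
writeLines(geojson, out_file)
```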

View on gists

Visualize data with CartoDB

Sharing robust data products

The Ecological Metadata Language (EML; Jones et al., 2001) is a comprehensive metadata standard that has been adopted by a sector of the larger international ecological research community.

EML provides a common structure for these resources, so that ecologists can better document, share, and interpret ecological data.

The EML standard enables data integration at the machine level (with little or no human intervention).

EML has four general descriptors at the top of the hierarchy. One can choose to describe a dataset, a protocol, a citation, or software.

Without metadata, a data table such as this one is useless.

A table with limited metadata

Valid EML


<?xml version="1.0"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" packageId="reml_3794487.58997023" system="reml">
  <dataset>
    <title>reml example</title>
    <creator>
      <individualName>
        <givenName>Karthik</givenName>
        <surName>Ram</surName>
      </individualName>
      <electronicMailAddress>karthik.ram@gmail.com</electronicMailAddress>
    </creator>

Units are well defined


          <attributeName>ct</attributeName>
          <attributeDefinition>count</attributeDefinition>
          <measurementScale>
            <nominal>
              <nonNumericDomain>
                <textDomain>
                  <definition>number</definition>
                </textDomain>
              </nonNumericDomain>
            </nominal>
          </measurementScale>
        </attribute>
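Before generating EML, it helps to write the column definitions and units down next to the data. The sketch below builds a small data dictionary in base R; it does not use the EML package, the `dat` table and its `site` column are hypothetical, and only the `ct` definition and the `number` unit echo the fragment above.

```r
# Hypothetical data table with a count column named `ct`.
dat <- data.frame(
  site = c("A", "A", "B"),
  ct   = c(12L, 7L, 30L),
  stringsAsFactors = FALSE
)

# Column definitions and units, keyed by column name.
col_defs  <- c(site = "Collection site code", ct = "count")
unit_defs <- c(site = NA, ct = "number")

# One row of metadata per column of `dat`.
data_dictionary <- data.frame(
  attribute  = names(dat),
  definition = unname(col_defs[names(dat)]),
  unit       = unname(unit_defs[names(dat)]),
  stringsAsFactors = FALSE
)

# Fail early if any column was left without a definition.
stopifnot(!anyNA(data_dictionary$definition))
print(data_dictionary)
```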

Writing valid EML and uploading to a persistent repo is simple

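library(EML)
# `dat` is the data frame to document and `meta` its column metadata;
# both are assumed to have been created earlier in the session.
description <- "Specimen records of genus Crematogaster from the AntWeb database"

# Write the data and its metadata as an EML file
eml_write(dat = dat, meta, title = "Crematogaster records", description = description,
    creator = "Karthik Ram <karthik@ropensci.org>", file = "crematogaster.xml")

# Publish the EML file and data to figshare
eml_publish("crematogaster.xml", description = description, categories = "Ecology",
    tags = "entomology", destination = "figshare", visibility = "public")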

A reproducible workflow in R

Load your own data

Load all raw, untransformed data.

→

Acquire additional data from the web

For example, resolve taxonomic names or acquire additional datasets.

→

Document everything with metadata

The EML package makes it easy to add valid EML to your data.

→

Submit to a persistent repository

Share your data by submitting it to figshare or to a repository at your institution.

Generate interactive maps, viewers.
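The four steps can be sketched end to end in a single script. Everything here is a stand-in chosen so the example runs offline: the raw file is generated on the spot, the name `lookup` replaces a call to a web service, the metadata is a plain list rather than EML, and the "deposit" is just a local folder.

```r
# 1. Load your own raw, untransformed data (written here so the sketch is
#    self-contained).
raw_file <- file.path(tempdir(), "raw_counts.csv")
write.csv(data.frame(species = c("Pinos contorta", "Abies magnifica"),
                     ct = c(12L, 7L)),
          raw_file, row.names = FALSE)
raw <- read.csv(raw_file, stringsAsFactors = FALSE)

# 2. Acquire additional information; a hard-coded lookup stands in for a
#    name resolution service. The raw data are left untouched.
lookup <- c("Pinos contorta" = "Pinus contorta")
clean <- raw
fix <- clean$species %in% names(lookup)
clean$species[fix] <- unname(lookup[clean$species[fix]])

# 3. Document everything with metadata (a plain list here, not EML).
metadata <- list(
  title   = "Example counts",
  created = format(Sys.Date()),
  columns = c(species = "Resolved scientific name", ct = "count")
)

# 4. Save data and metadata side by side, ready to submit to a repository.
out_dir <- file.path(tempdir(), "deposit")
dir.create(out_dir, showWarnings = FALSE)
write.csv(clean, file.path(out_dir, "counts_clean.csv"), row.names = FALSE)
dput(metadata, file = file.path(out_dir, "metadata.R"))
```

Keeping `raw` unchanged and deriving `clean` from it means every transformation is recorded in code rather than applied by hand.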

ropensci.org/AntWeb

ropensci on GitHub @ropensci on Twitter Questions or comments to: karthik dot ram at berkeley dot edu
