Fostering the next generation of open science with R



Karthik Ram (@_inundata)

Supported by:









These data are hard to get to





Open Science




Source: PLOS, 2007



Instructions for preparation of the Biographical Sketch have been revised to rename the "Publications" section to "Products" and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

Issuance of a new NSF Proposal & Award Policies and Procedures Guide (October 4th)

Why R?

The old way...

Why R?

A better way



glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)


This is reproducible and repeatable, and can serve as an analytic workflow.
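As a minimal sketch of what "reproducible" means here, the model call above can be run end to end on simulated data. The data frame below is hypothetical (the slides do not show `mydata`); the variable names simply mirror the formula, and the coefficients used to simulate `y` are arbitrary.

```r
# Hypothetical stand-in for `mydata`: three predictors and a response
# simulated from known coefficients, so the fit can be checked.
set.seed(42)
n <- 200
mydata <- data.frame(
  a = rnorm(n),
  c = rnorm(n),
  z = rnorm(n)
)
mydata$y <- with(mydata, 1.5 * a - 0.5 * c + 2 * z + 0.8 * a * z + rnorm(n, sd = 0.1))

# Same call as on the slide: no intercept (-1), main effects, and an a:z interaction.
# `maxit` is passed through to glm.control().
fit <- glm(y ~ -1 + a + c + z + a:z, data = mydata, maxit = 30)

# Estimated coefficients, in the order a, c, z, a:z
print(coef(fit))
```

Because the seed, the data, and the model are all in the script, anyone rerunning it gets the same estimates.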



Open Science needs open source tools



Source: Revolution Analytics, 2010; Nature editorial, 2012


Open data + code

Source: Wolkovich et al. Global Change Biology, 2012.





Enable access to scientific data repositories, full-text of articles, and science metrics and also facilitate a culture shift in the scientific community.



More info @ ropensci.org/packages

      
Data: Treebase, Fishbase, Flybase, GBIF, Vertnet, Dryad, ITIS, NPN, Taxize, AntWeb

Journals: PLOS, Springer, textmine, pensoft

Hybrid: figshare, Mendeley, DataONE, rAltmetric, rEML, rNEXML

Search full text of 100k+ open access articles - rplos


library(rplos)
plot_throughtime(list("reproducible science"), 500)


Accessing data behind papers - dryad

library(rdryad)
# Get the URL for a data file
dryaddat <- download_url("10255/dryad.1759")

# Get a file given the URL
file <- dryad_getfile(dryaddat)

Mapping biodiversity data - rgbif

library(rgbif)
distribution <- occurrencelist(sciname = "Danaus plexippus", coordinatestatus = TRUE,
    maxresults = 1000, latlongdf = TRUE)

Species distribution modeling

World Bank climate knowledge portal - rWBclimate

library(rWBclimate)
cal_basin_id <- c(365, 280, 281, 282, 273, 274, 275)
cal_basin <- create_map_df(cal_basin_id)
cal_basin_dat <- get_ensemble_temp(cal_basin_id, "annualanom", 2080, 2100)

Full code example

Resolve taxonomic names

library(taxize)
splist <- c("Helanthus annuus", "Pinos contorta", "Collomia grandiflorra", "Abies magnificaa",
    "Rosa california", "Datura wrighti", "Mimulus bicolour", "Nicotiana glauca",
    "Maddia sativa", "Bartlettia scapposa")
splist_tnrs <- tnrs(query = splist, getpost = "POST", source = "iPlant_TNRS")
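To illustrate the idea behind name resolution without calling any web service, here is a rough sketch that matches misspelled names to a reference list by edit distance using base R's `adist()`. This is not how TNRS works internally; the short `reference` vector is an assumption made for the example, and a real service checks names against a full taxonomic database.

```r
# Three of the misspelled names from the list above
queries <- c("Helanthus annuus", "Pinos contorta", "Abies magnificaa")

# Hypothetical reference list of accepted names (illustration only)
reference <- c("Helianthus annuus", "Pinus contorta",
               "Abies magnifica", "Nicotiana glauca")

# Edit distance between every query and every reference name
d <- utils::adist(queries, reference)

# For each query, pick the reference name with the smallest distance
best <- apply(d, 1, which.min)
resolved <- data.frame(
  submitted = queries,
  matched = reference[best],
  distance = d[cbind(seq_along(queries), best)],
  stringsAsFactors = FALSE
)
print(resolved)
```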

Taxize queries 13 different taxonomic data services

Encyclopedia of Life
Taxonomic Name Resolution Service
Integrated Taxonomic Information System
Phylomatic
uBio
Global Names Resolver
Global Names Index
IUCN Red List
Tropicos
Plantminer
theplantlist.org
Catalogue of Life
Global Invasive Species Database

Measure research impact in real time



Tracking altmetrics - rAltmetric, ALM

library(rAltmetric)
altmetrics("doi/10.1038/489201a")
## Altmetrics on: "Future impact: Predicting scientific success" with altmetric_id: 942310 published in Nature.
##   provider count
## 1 Facebook     1
## 2    Feeds    10
## 3  Google+     1
## 4    Cited   179
## 5   Tweets   159
## 6 Accounts   171


Sharing unpublished data - (figshare)

Using figshare's API, it is now possible to share figures, data, and any other object generated in R directly to any figshare account.


library(rfigshare)
# Log in using API keys
fs_auth()
id <- fs_create()
fs_upload(id, r_objects)



A guide to using the AntWeb package



@_inundata  
ropensci.github.io/antweb-guide



Installing the package

install.packages("AntWeb", dependencies = TRUE)
# Requires R version 3.0.1 or higher


Unique taxa in the data source

library(AntWeb)
families <- aw_unique(rank = "subfamily")
head(families)
      subfamily
  1      apidae
  2  bethylidae
  3  braconidae
  4   cynipidae
  5  diapriidae
  6 diaspididae
nrow(families)
  [1] 69


Unique taxa in the data source

library(AntWeb)
genera <- aw_unique(rank = "genus")
head(genera)
            genus
  1    Aenictinae
  2 Amblyoponinae
  3        Apidae
  4        Attini
  5  Basicerotini
  6    Bethylidae
nrow(genera)
  [1] 470


Unique taxa in the data source

library(AntWeb)
species <- aw_unique(rank = "species")
head(species)
       specificEpithet
  1       basicerotini
  2              indet
  3             indet.
  4 orizabanum_complex
  5         abbreviata
  6         abdelazizi
nrow(species)
  [1] 10413


Download data by any catalog number

data_by_code <- aw_code(occurrenceid = "CAS:ANTWEB:alas188691")
data_by_code
  [Total results on the server]: 1 
  [Args]: 
  occurrenceId = cas:antweb:alas188691 
  [Dataset]: [1 x 16] 
  [Data preview] :
                                                    specimens.url
  1  http://antweb.org/api/v2/?occurrenceId=CAS:ANTWEB:alas188691
  NA                                                         
     specimens.catalogNumber specimens.family specimens.subfamily
  1               alas188691       formicidae          myrmicinae
  NA                                                 
     specimens.genus specimens.specificEpithet  specimens.scientific_name
  1    Crematogaster              curvispinosa crematogaster curvispinosa
  NA                                                         
     specimens.typeStatus specimens.stateProvince specimens.country
  1                                       Heredia                  
  NA                                                   
     specimens.dateIdentified        specimens.habitat
  1                2000-06-06 Sendero Jaguar Hojarasca
  NA                                          
     specimens.minimumElevationInMeters specimens.geojson.type
  1                                  50                  point
  NA                                                  
     decimal_latitude decimal_longitude
  1          10.42118         -84.02531
  NA                           


Download data with bounding boxes

bbox <- "6.7117,-87.7867,13.2595,-76.9277" # Central America
crematogaster <- aw_data_all(genus = "crematogaster", bbox = bbox)
aw_map(crematogaster)
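The bounding-box string can be unpacked and applied by hand, which helps when checking which records a box should capture. The sketch below assumes the four values are ordered as minimum latitude, minimum longitude, maximum latitude, maximum longitude (inferred from the Central America values above, not confirmed by the package documentation). `parse_bbox()` and `in_bbox()` are hypothetical helpers, not part of AntWeb.

```r
# Hypothetical helper: split the bbox string into named numeric limits.
# Assumed order: lat_min, lon_min, lat_max, lon_max.
parse_bbox <- function(bbox) {
  parts <- as.numeric(strsplit(bbox, ",", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 4, !anyNA(parts))
  names(parts) <- c("lat_min", "lon_min", "lat_max", "lon_max")
  parts
}

# Hypothetical helper: TRUE for each point that falls inside the box
in_bbox <- function(lat, lon, bbox) {
  b <- parse_bbox(bbox)
  lat >= b[["lat_min"]] & lat <= b[["lat_max"]] &
    lon >= b[["lon_min"]] & lon <= b[["lon_max"]]
}

bbox <- "6.7117,-87.7867,13.2595,-76.9277" # Central America

# Two coordinates taken from these slides plus one point far outside the box
pts <- data.frame(
  decimal_latitude = c(10.42118, 12.54, 37.87),
  decimal_longitude = c(-84.02531, -81.72, -122.27)
)
pts$inside <- in_bbox(pts$decimal_latitude, pts$decimal_longitude, bbox)
print(pts)
```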


Download all available data on a taxonomic group

crematogaster <- aw_data_all(genus = "crematogaster", georeferenced = TRUE)


Search around specific coordinates with an error radius

san_andres <- aw_coords("12.54,-81.72", r = 10)
# Search around the San Andres Island.
nrow(san_andres$data)
  [1] 8
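A radius search of this kind amounts to keeping the records whose great-circle distance from the centre point is below a threshold. The sketch below uses the haversine formula in base R; it assumes the radius is in kilometres (the slide does not state the unit), and `haversine_km()` and the two test points are hypothetical, not part of AntWeb.

```r
# Great-circle distance in km between points given in decimal degrees
# (haversine formula, mean Earth radius of 6371 km assumed)
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

centre <- c(lat = 12.54, lon = -81.72) # San Andres Island
r <- 10                                # assumed to be km

# Hypothetical records: one close to the island, one in mainland Costa Rica
records <- data.frame(
  decimal_latitude = c(12.58, 10.42118),
  decimal_longitude = c(-81.70, -84.02531)
)
records$dist_km <- haversine_km(centre[["lat"]], centre[["lon"]],
                                records$decimal_latitude,
                                records$decimal_longitude)
records$within <- records$dist_km <= r
print(records)
```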


Find recent images

last_five_days <- aw_images(since = 5)


Species occurrence data (SPOCC)






Combine various data sources

Live Shiny app


Natively render geojson on GitHub

View on gists


Visualize data with CartoDB



Sharing robust data products



The Ecological Metadata Language (EML; Jones et al., 2001) is a comprehensive metadata standard that has been adopted by a sector of the larger international ecological research community.


EML provides a common structure for these resources, to better enable ecologists to document, share, and interpret ecological data.


The EML standard enables data integration at the machine level (with little or no human intervention).




EML has four general descriptors at the top of the hierarchy. One can choose to describe a dataset, a protocol, a citation, or software.

Without metadata, a data table such as this one is useless.

A table with limited metadata
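One lightweight way to see what metadata adds is to keep a small data dictionary next to the table and check that every column is documented. The sketch below is plain base R, not the EML package; the table and its column names are hypothetical, and the dictionary fields borrow the `attributeName` and `attributeDefinition` terms used in EML.

```r
# Hypothetical data table with terse column names
dat <- data.frame(
  site = c("A", "A", "B"),
  stg = c("adult", "juvenile", "adult"),
  ct = c(12L, 30L, 7L),
  stringsAsFactors = FALSE
)

# Data dictionary: one row per column, stating what it means and its unit
dictionary <- data.frame(
  attributeName = c("site", "stg", "ct"),
  attributeDefinition = c("sampling site code", "life stage", "count"),
  unit = c(NA, NA, "number"),
  stringsAsFactors = FALSE
)

# Columns present in the data but missing from the dictionary
undocumented <- setdiff(names(dat), dictionary$attributeName)
if (length(undocumented) > 0) {
  warning("Undocumented columns: ", paste(undocumented, collapse = ", "))
}
print(dictionary)
```

A formal standard such as EML goes further by making these descriptions machine-readable and validating them against a schema.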

Valid EML


<?xml version="1.0"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:ds="eml://ecoinformatics.org/dataset-2.1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" packageId="reml_3794487.58997023" system="reml">
  <dataset>
    <title>reml example</title>
    <creator>
      <individualName>
        <givenName>Karthik</givenName>
        <surName>Ram</surName>
      </individualName>
      <electronicMailAddress>karthik.ram@gmail.com</electronicMailAddress>
    </creator>

Units are well defined


          <attributeName>ct</attributeName>
          <attributeDefinition>count</attributeDefinition>
          <measurementScale>
            <nominal>
              <nonNumericDomain>
                <textDomain>
                  <definition>number</definition>
                </textDomain>
              </nonNumericDomain>
            </nominal>
          </measurementScale>
        </attribute>

Writing valid EML and uploading to a persistent repo is simple

require(EML)
description <- "Specimen records of genus Crematogaster from the AntWeb database"


eml_write(dat = dat, meta, title = "Crematogaster records", description = description,
    creator = "Karthik Ram <karthik@ropensci.org>", file = "crematogaster.xml")


eml_publish("crematogaster.xml", description = description, categories = "Ecology",
    tags = "entomology", destination = "figshare", visibility = "public")

A reproducible workflow in R



Load your own data
	

Load all raw, untransformed data.


   
Acquire additional data from the web

For example, resolve taxonomic names or acquire additional datasets.


   
Document everything with metadata

The EML package makes it really easy to add valid EML to your data


   
Submit to a persistent repository

Share your data by submitting it to figshare or to a repository at your institution.


Generate interactive maps, viewers.


ropensci.org/AntWeb

ropensci on GitHub
@ropensci on Twitter
Questions or comments to: karthik dot ram at berkeley dot edu
