Open Data Science Conference and iRODS User Group meeting
Raminder Singh
Research Data Services
Research Technologies, Indiana University
July 7th, 2016
Technologies Discussed
• Julia is a high-level, high-performance dynamic programming language for technical computing with familiar syntax. It
provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical
function library.
• Stan is a probabilistic programming language for statistical modeling, data analysis, and prediction in the social,
biological, and physical sciences, engineering, and business.
• Scikit-learn is a Python library with classification, regression, and clustering algorithms, including support vector
machines, random forests, gradient boosting, k-means, and DBSCAN. It is designed to interoperate with other
libraries such as NumPy and SciPy.
• Apache Spark is a cluster computing framework whose application programming interface is centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of data items that is distributed over a
cluster of machines and maintained in a fault-tolerant way.
• Apache Hadoop implements a computational paradigm named MapReduce, in which the application is divided into
many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and
analysis.
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for evaluating these programs.
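The MapReduce idea described above (split the work into small independent fragments, then combine the partial results) can be sketched in plain Python. This is only an illustration of the paradigm on a single machine, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, and the sample input, are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one fragment of input."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is mapped independently of the others, which is why a
    # real cluster could run (or re-run) any of these calls on any node.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: three small fragments of work.
fragments = ["Spark and Hadoop", "Hive on Hadoop", "Pig on Hadoop"]
counts = word_count(fragments)
```

Because no map call depends on another, a failed fragment can simply be re-executed, which is the property the Hadoop bullet refers to.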
About Companies of Keynote Speakers
• Booz Allen Hamilton: Core business is the provision of management, technology, and security services
to civilian government agencies. http://www.boozallen.com/datascience
• RapidMiner: Integrated environment for machine learning, data mining, text mining, predictive analytics
and business analytics. https://rapidminer.com/
• CrowdFlower: Data enrichment and data mining offered as Software as a Service. https://www.crowdflower.com/
Topics for Training Workshops
• Using R for Data Analytics
– https://github.com/zachmayer/forecast
• Building Real-time Recommender Systems with Spark ML, Kafka, and the PANCAKE STACK
– http://advancedspark.com/
• Analyzing Open Data in Healthcare using Public APIs and Reproducible Workflows
– https://github.com/jhajagos/health-open-data-workshop
List of Good Talks Available Online
• Kirk Borne – “2 Most Important Things in Data Science”
– https://www.opendatascience.com/conferences/odsc-east-2016-kirk-borne-the-2-most-important-things-in-data-science/
• Experiment
• Data collection
• Tomorrow’s Map Room: Data Portals
– https://www.opendatascience.com/blog/tomorrows-map-room-data-portals/
• Interactive Data Visualizations in R with Shiny and ggplot2
– https://www.opendatascience.com/conferences/odsc-east-2016-joe-cheng-zev-ross-interactive-data-visualizations-in-r-with-shiny-and-ggplot2/
• Bokeh is a Python interactive visualization library that targets modern web browsers for
presentation. Comparable tools are Shiny in R and D3 in JavaScript. http://bokeh.pydata.org
– https://www.opendatascience.com/conferences/odsc-east-2016-peter-wang-interactive-viz-of-a-billion-points-with-bokeh-datashader/
• Exaptive Xap Store is an 'app store' for data applications. Exaptive is standardizing a set of libraries to be
used to create networks. http://www.exaptive.com/data-application-gallery
Objectives for Attending
• iRODS features and architecture
• User Community
• Use Cases and Solutions built on iRODS
• Future development and directions
Questions
• Can I write rules in other languages?
• Is it possible to attach it to existing storage?
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical
policy recommendations?
iRODS Implements Four Main Functions
Data Virtualization: iRODS provides a logical representation of files stored
in physical storage locations. This logical view is a virtual file system, and
the capability it provides is called data virtualization.
Data Discovery: iRODS records information about data, called metadata, in
its catalog. Metadata is extremely useful for data discovery, that is, for
locating relevant data within large data sets.
Workflow Automation: Once data is stored and available in the catalog, it
often needs to be migrated, secured, or otherwise processed.
Secure Collaboration: Data is most useful when it’s in the hands of the
right people. There is a recognized need in the public research community
to publish data sets that accompany written articles.
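The first two functions can be illustrated with a toy catalog in Python. This is a minimal sketch of the concepts only and does not use or imitate the iRODS API; the class `ToyCatalog`, its methods, and all paths and metadata values are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str   # the name users see in the virtual file system
    physical_path: str  # where the bytes actually live
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Hypothetical in-memory catalog; a stand-in for a real metadata catalog."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, physical_path, **metadata):
        self._objects[logical_path] = DataObject(
            logical_path, physical_path, dict(metadata)
        )

    def resolve(self, logical_path):
        # Data virtualization: callers only ever use the logical path.
        return self._objects[logical_path].physical_path

    def move(self, logical_path, new_physical_path):
        # The physical location changes; the logical path stays the same.
        self._objects[logical_path].physical_path = new_physical_path

    def find(self, **criteria):
        # Data discovery: locate objects whose metadata matches every criterion.
        return sorted(
            obj.logical_path
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


catalog = ToyCatalog()
catalog.register("/zone/home/lab/run1.csv", "/mnt/disk_a/0001.dat",
                 project="genomics", year="2016")
catalog.register("/zone/home/lab/run2.csv", "/mnt/tape_b/0002.dat",
                 project="climate", year="2016")

# Migrating the file does not change how users refer to it.
catalog.move("/zone/home/lab/run1.csv", "/mnt/archive/0001.dat")
hits = catalog.find(project="genomics")
```

The `move` call is the kind of step that workflow automation would trigger on a schedule or on an event, while users keep working with the unchanged logical path.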
Getting R to talk to iRODS
Bernhard Sonderegger, Nestlé Institute of Health Sciences
• The R language is an environment with a large and highly active user community in the field of data
science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS
data objects and metadata from the R language. Information is passed to the R functions as native R
objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using
standard R constructs.
• To maximize performance and maintain a simple architecture, the implementation heavily relies on the
icommands C++ code wrapped using Rcpp bindings.
• The R-irods package has been engineered to have semantics equivalent to the icommands and can
easily be used as a basis for further customization. At NIHS we have created an ontology-aware
package on top of R-irods to ensure consistent metadata annotations and to facilitate query
construction.
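The idea of handing metadata to analysis code as native tabular objects can be sketched as follows. The sketch is in Python rather than R and does not reproduce the R-irods interface, which is not shown in these slides; it only assumes that iRODS metadata consists of attribute-value-unit (AVU) triples attached to data objects. The function name `avus_to_records` and the sample values are hypothetical.

```python
from collections import defaultdict


def avus_to_records(avus):
    """Pivot (object, attribute, value, unit) tuples into one dict per object.

    Simplification: if an attribute occurs more than once on the same
    object, only the last value is kept.
    """
    by_object = defaultdict(dict)
    for obj, attribute, value, unit in avus:
        by_object[obj][attribute] = f"{value} {unit}" if unit else value
    return [{"object": obj, **by_object[obj]} for obj in sorted(by_object)]


# Hypothetical AVU rows, as a metadata query might return them.
avus = [
    ("sample_b.csv", "species", "mouse", ""),
    ("sample_a.csv", "species", "human", ""),
    ("sample_a.csv", "weight", "70", "kg"),
]
records = avus_to_records(avus)
```

Each record is one row, so the result can be loaded directly into a table structure, which is the role data frames play in the package described above.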
Review
Questions
• Can I write rules in other languages?
– YES
• Is it possible to attach it to existing storage?
– YES. There are tools to load the existing data.
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical
policy recommendations?
– A reference implementation of the RDA recommendations is available at
https://github.com/DICE-UNC/policy-workbook. It needs some work to be updated and
tested against the latest version of iRODS.
iRODS User Group Meeting notes and slides
• http://irods.org/documentation/articles/irods-user-group-meeting-2016/ - Use case slides
• http://irods.org/wp-content/uploads/2016/06/technical-overview-2016-web.pdf - Tech report
• http://slides.com/irods/ - Workshop slides
• https://github.com/DICE-UNC/policy-workbook - RDA policies implementation
• http://www.cyverse.org/ - iRODS as a service
• http://irods.org/documentation/articles/ - Other articles
• http://www.odum.unc.edu/
• http://datafed.org/about/use-cases/
• http://renci.org/news/virtual-institute-for-social-research/
Editor's Notes
Intro Slide
Stan: http://mc-stan.org/
CrowdFlower combines the best of human and machine intelligence to enrich data for the world's most innovative companies.