Open Data Science Conference and iRODS User Group meeting
Raminder Singh
Research Data Services
Research Technologies, Indiana University
July 7th, 2016
ODSC East 2016
https://www.odsc.com/boston
Technologies Discussed
• Julia is a high-level, high-performance dynamic programming language for technical computing with familiar syntax. It
provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical
function library.
• Stan is a probabilistic programming language for statistical modeling, data analysis, and prediction in the social,
biological, and physical sciences, engineering, and business.
• Scikit-learn is a Python library with classification, regression, and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with other
libraries like NumPy and SciPy.
• Apache Spark is a cluster-computing framework whose application programming interface is centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a
cluster of machines and maintained in a fault-tolerant way.
• Apache Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into
many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and
analysis.
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for evaluating these programs.
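
The Map/Reduce paradigm described above can be illustrated with a minimal single-process sketch in plain Python. It does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration only. Each input fragment is mapped independently, which is the property that lets a real cluster run or re-run a fragment on any node.

```python
from collections import defaultdict


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(fragments):
    # Fragments do not depend on each other, so on a cluster each call to
    # map_phase could be scheduled (or retried) on a different node.
    pairs = [pair for fragment in fragments for pair in map_phase(fragment)]
    return reduce_phase(shuffle(pairs))


# Hypothetical input, split into three small fragments of work.
fragments = ["spark hadoop hive", "hadoop pig", "spark spark"]
counts = word_count(fragments)
print(counts)
```

Hive and Pig sit one level above this: a query or script written in their languages is translated into jobs of this map, shuffle, and reduce shape.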
Keynote Speakers
About the Companies of the Keynote Speakers
• Booz Allen Hamilton: Core business is the provision of management, technology, and security services
to civilian government agencies. http://www.boozallen.com/datascience
• RapidMiner: Integrated environment for machine learning, data mining, text mining, predictive analytics,
and business analytics. https://rapidminer.com/
• CrowdFlower: Data enrichment and data mining offered as software as a service (SaaS). https://www.crowdflower.com/
Other Interesting Speakers
Topics for Training Workshops
• Using R for Data Analytics
– https://github.com/zachmayer/forecast
• Building a Real-time Recommender System with Spark ML, Kafka, and the PANCAKE STACK
– http://advancedspark.com/
• Analyzing Open Data in Healthcare Using Public APIs and Reproducible Workflows
– https://github.com/jhajagos/health-open-data-workshop
List of Good Talks Available Online
• Kirk Borne – “2 Most Important Things in Data Science”
– https://www.opendatascience.com/conferences/odsc-east-2016-kirk-borne-the-2-most-important-things-in-data-science/
• Experiment
• Data collection
• Tomorrow’s Map Room: Data Portals
– https://www.opendatascience.com/blog/tomorrows-map-room-data-portals/
• Interactive Data Visualizations in R with Shiny and ggplot2
– https://www.opendatascience.com/conferences/odsc-east-2016-joe-cheng-zev-ross-interactive-data-visualizations-in-r-with-shiny-and-ggplot2/
• Bokeh is a Python interactive visualization library that targets modern web browsers for
presentation. Comparable tools are Shiny in R and D3 in JavaScript. http://bokeh.pydata.org
– https://www.opendatascience.com/conferences/odsc-east-2016-peter-wang-interactive-viz-of-a-billion-points-with-bokeh-datashader/
• Exaptive Xap Store is an 'app store' for data applications. They are standardizing a set of libraries to be
used to create networks. http://www.exaptive.com/data-application-gallery
Objectives for Attending
• iRODS features and architecture
• User Community
• Use Cases and Solutions built over iRODS
• Future development and directions
Questions
• Can I write rules in other languages?
• Is it possible to attach it to existing storage?
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical
policy recommendations?
iRODS Implements Four Main Functions
Data Virtualization: iRODS provides a logical representation of files stored
in physical storage locations. This logical view is called a virtual file system,
and the capabilities it provides are called Data Virtualization.
Data Discovery: iRODS keeps information about data in a catalog. This
information, called metadata, is extremely useful for Data Discovery: locating
relevant data within large data sets.
Workflow Automation: Once data is stored and available in the catalog, it
often needs to be migrated, secured, or otherwise processed.
Secure Collaboration: Data is most useful when it’s in the hands of the
right people. There is a recognized need in the public research community
to publish data sets that accompany written articles.
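
The first two functions above, data virtualization and data discovery, can be sketched with a toy in-memory catalog in Python. This is not the iRODS API: the `Catalog` class, its `register`, `resolve`, and `find` methods, and every path and metadata value are hypothetical names invented for illustration. The sketch only shows the idea that users work with logical paths and metadata while the catalog keeps track of where the bytes physically live.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One logical file, its physical replicas, and its metadata."""
    logical_path: str
    replicas: list
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy catalog (hypothetical, not iRODS): logical paths -> replicas + metadata."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, physical_path, **metadata):
        # Several physical replicas may sit behind one logical path.
        obj = self._objects.setdefault(logical_path, DataObject(logical_path, []))
        obj.replicas.append(physical_path)
        obj.metadata.update(metadata)

    def resolve(self, logical_path):
        """Data virtualization: translate a logical path to physical locations."""
        return list(self._objects[logical_path].replicas)

    def find(self, **criteria):
        """Data discovery: return logical paths whose metadata match all criteria."""
        return sorted(
            path
            for path, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


# Hypothetical example data.
catalog = Catalog()
catalog.register("/zone/home/alice/run1.csv", "diskA:/vault/0001", project="genomics", year="2016")
catalog.register("/zone/home/alice/run1.csv", "tapeB:/archive/9f3c", project="genomics")
catalog.register("/zone/home/bob/notes.txt", "diskA:/vault/0002", project="survey")

replicas = catalog.resolve("/zone/home/alice/run1.csv")
matches = catalog.find(project="genomics")
print(replicas, matches)
```

In this picture, workflow automation would correspond to code that runs whenever `register` is called, and secure collaboration to access checks inside `resolve` and `find`.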
EMC2 Case of Adaptive Hierarchical Metadata Using MetaLnx
Getting R to talk to iRODS
Bernhard Sonderegger, Nestlé Institute of Health Sciences
• The R language is an environment with a large and highly active user community in the field of data
science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS
data objects and metadata from the R language. Information is passed to the R functions as native R
objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using
standard R constructs.
• To maximize performance and maintain a simple architecture, the implementation relies heavily on the
icommands C++ code wrapped using Rcpp bindings.
• The R-irods package has been engineered to have semantics equivalent to the icommands and can
easily be used as a basis for further customization. At NIHS we have created an ontology-aware
package on top of R-irods to ensure consistent metadata annotations and to facilitate query
construction.
Review
Questions
• Can I write rules in other languages?
– YES
• Is it possible to attach it to existing storage?
– YES. There are tools to load the data.
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical
policy recommendations?
– https://github.com/DICE-UNC/policy-workbook is a reference implementation of the
RDA recommendations. It needs some work to update and test it with the latest
version of iRODS.
iRODS User Group Meeting notes and slides
• http://irods.org/documentation/articles/irods-user-group-meeting-2016/ : Use case slides
• http://irods.org/wp-content/uploads/2016/06/technical-overview-2016-web.pdf : Technical report
• http://slides.com/irods/ : Workshop slides
• https://github.com/DICE-UNC/policy-workbook : RDA policies implementation
• http://www.cyverse.org/ : iRODS as a service
• http://irods.org/documentation/articles/ : Other articles
• http://www.odum.unc.edu/
• http://datafed.org/about/use-cases/
• http://renci.org/news/virtual-institute-for-social-research/

Editor's Notes

  1. Intro Slide
  2. Stan : http://mc-stan.org/
  3. CrowdFlower combines the best of human and machine intelligence to enrich data for the world's most innovative companies.
  4. User community