This document discusses big data tools and management at large scales. It introduces Hadoop, an open-source software framework for distributed storage and processing of large datasets using MapReduce. Hadoop allows parallel processing of data across thousands of nodes and has been adopted by large companies like Yahoo!, Facebook, and Baidu to manage petabytes of data and perform tasks like sorting terabytes of data in hours.
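The MapReduce model mentioned above can be illustrated with a small, single-machine sketch. It is not Hadoop code and not the presenter's implementation. It only mimics the map, shuffle, and reduce phases in plain Python, using a count of `<meta>` tag names across pages as the task. That task is an assumption loosely based on the "meta tag example" named in the talk outline, and all function and class names here (`map_page`, `shuffle`, `reduce_counts`, `count_meta_tags`, `MetaTagCollector`) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects the name attribute of every named <meta> tag in one page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name.lower())


def map_page(html):
    """Map step: emit a (meta_name, 1) pair for each named meta tag."""
    collector = MetaTagCollector()
    collector.feed(html)
    return [(name, 1) for name in collector.names]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_meta_tags(pages):
    """Run map, shuffle, and reduce in-process over a list of HTML strings."""
    pairs = [pair for html in pages for pair in map_page(html)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls would run in parallel on the nodes holding the data, and the framework would perform the grouping step; here everything runs sequentially so that the shape of the computation is visible.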
1. Big Tools for Big Data Analytics and Management at web scale. IIPC General Assembly, Singapore, May 2010. Lewis Crawford, Web Archiving Programme Technical Lead, British Library.
11. BigSheets and the open source stack: Top level Apache Project; Yahoo! contributed open source; IBM Research Licence; Insight Engine; Spreadsheet Paradigm; SQL 'like' programming language; Distributed processing and file system.
Introduction: the problem of big data; Hadoop (map/reduce, HDFS); BigSheets; Pig; the open source stack. Analytics: the meta tag example. Data management: ARC to WARC; JHOVE; format migration (FLV to MPEG-4?). Simple examples: Iraq Inquiry video link extraction; slash page crawl, election sites extraction; newspapers. Back to analytics: the next generation access tool, targeted at researchers (Cooliris, network / swirl, spreadsheet, Seadragon).
Straw poll of how much archive material there is in the room: 3 petabytes.
Add diagram?
Add PIG
IBM Insight Engine
New York Times example
Seadragon notes: Review the current access tool. Search by title, URL, or full text; browse by subject or special collection. With more websites, search results are already in the millions. Provide tools to mine the data (a renewable resource?).