Large-scale social media analysis with Hadoop

Washington, DC
May 23, 2010
A tutorial in conjunction with ICWSM 2010

Over the last several years there has been a rapid increase in the number, variety, and size of readily available social media data sets. As a result, there is a growing demand for software that can extract relevant information from these data. For sufficiently large data sets (e.g., networks with hundreds of millions of nodes), distributed solutions are often necessary, because the storage and memory limits of a single machine are prohibitive. Hadoop, an open-source implementation of the map/reduce paradigm, is an increasingly popular framework for processing such data sets at the terabyte and petabyte scale using commodity hardware.

In this tutorial we will discuss the use of Hadoop for processing large-scale social data sets. We will first cover the map/reduce paradigm in general and subsequently discuss the particulars of Hadoop's implementation. We will then present several use cases for Hadoop in analyzing example data sets, examining the design and implementation of various algorithms with an emphasis on social network analysis. Accompanying data sets and code will be made available.
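
As a rough illustration of the map/reduce paradigm applied to a social network, the sketch below counts the degree of each node in an undirected edge list. It is a single-process Python simulation, not Hadoop's API and not the tutorial's accompanying code; the function names (map_edge, shuffle, reduce_degree, degree_counts) and the toy edge list are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import chain


def map_edge(edge):
    """Map step: emit a (node, 1) pair for each endpoint of an undirected edge."""
    source, target = edge
    yield source, 1
    yield target, 1


def shuffle(pairs):
    """Group values by key, mimicking what a map/reduce framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(node, counts):
    """Reduce step: sum the counts emitted for a single node."""
    return node, sum(counts)


def degree_counts(edges):
    """Run map, shuffle, and reduce in sequence on an in-memory edge list."""
    mapped = chain.from_iterable(map_edge(edge) for edge in edges)
    grouped = shuffle(mapped)
    return dict(reduce_degree(node, counts) for node, counts in grouped.items())


# Hypothetical toy network, for illustration only.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")]
degrees = degree_counts(edges)
print(degrees)
```

In a distributed setting the map and reduce functions would run in parallel across many machines, with the framework handling the grouping step; here everything runs in one process so that the data flow is easy to follow.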

Bio: Jake Hofman is a member of the Human Social Dynamics group at Yahoo! Research. His work involves data-driven modeling of complex systems, focusing on applications of machine learning and statistical inference to large-scale data. He holds a B.S. in Electrical Engineering from Boston University and a Ph.D. in Physics from Columbia University.