Saturday, December 8, 2012

GraphBuilder is Live!

I am glad to report that I received the following announcement from our core collaborator Ted Willke of Intel Labs. As you may know, Intel Labs has been developing tools that help researchers and data scientists format and clean their data into graph form, so that applications like GraphLab can be used much more easily.

Here is a quote from Ted's announcement on the Intel blog:



But until recently, only the wizards of Big Data were able to rapidly extract knowledge from a different type of structure within the data, a type that is best modeled by tree or graph structures.  Imagine the pattern of hyperlinks connecting Wikipedia pages or the connections between Tweeters and Followers on Twitter.  In these models, a line is drawn between two bits of information if they are related to each other in some way.  The nature of the connection can be less obvious than in these examples and made specifically to serve a particular algorithm.  For example, a popular form of machine learning called Latent Dirichlet Allocation (a mouthful, I know) can create “word clouds” of topics in a set of documents without being told the topics in advance. All it needs is a graph that connects word occurrences to the filenames.  Another algorithm can accurately guess the type of noun (i.e., person, place, or thing) if given a graph that connects noun phrases to surrounding context phrases.
Many of these graphs are very large, with tens of billions of vertices (i.e., things being related) and hundreds of billions of edges (i.e., the relationships).  And, many that model natural phenomena possess power-law degree distributions, meaning that many vertices connect to a handful of others, but a few may have edges to a substantial portion of the vertices.  For instance, a graph of Twitter relationships would show that many people only have a few dozen followers while only a handful of celebrities have millions. This is all very problematic for parallel computation in general and MapReduce in particular.  As a result, Carlos Guestrin and his crack team at the University of Washington in Seattle have developed a new framework, called GraphLab, that is specifically designed for graph-based parallel machine learning.  In many cases, GraphLab can process such graphs 20-50X faster than Hadoop MapReduce.  Learn more about their exciting work here.
Carlos is a member of the Intel Science and Technology Center for Cloud Computing, and we started working with him on graph-based machine learning and data mining challenges in 2011.  Quickly it became clear that no one had a good story about how to construct large-scale graphs that frameworks like GraphLab could digest.  His team was constantly writing scripts to construct different graphs from various unstructured data sources.  These scripts ran on a single machine and would take a very long time to execute.  Essentially, they were using a labor-intensive, low-performance method to feed information to their elegant high-performance GraphLab framework.  This simply would not do.
Scanning the environment, we identified a more general hole in the open source ecosystem: A number of systems were out there to process, store, visualize, and mine graphs but, surprisingly, not to construct them from unstructured sources.  So, we set out to develop a demo of a scalable graph construction library for Hadoop.  Yes, for Hadoop.  Hadoop is not good for graph-based machine learning but graph construction is another story.  This work became GraphBuilder, which was demonstrated in July at the First GraphLab Workshop on Large-Scale Machine Learning and open sourced this week at 01.org (under Apache 2.0 licensing).
Anyway, we are quite excited about this progress - we are sure that GraphBuilder is going to turn into a very useful application!
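
To make the LDA example from the quote a bit more concrete, here is a minimal single-machine sketch of the kind of graph it describes: a bipartite graph whose edges connect words to the files they occur in, weighted by occurrence count. This is not GraphBuilder's API (GraphBuilder is a Hadoop library); the function name `build_word_doc_edges`, the toy documents, and the simple letters-only tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_word_doc_edges(docs):
    """Return (word, filename, count) edges for a word-document graph.

    `docs` maps a filename to its raw text. Tokenization is a naive
    lowercase letters-only split, chosen only for illustration.
    """
    edges = []
    for filename, text in sorted(docs.items()):
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for word, n in sorted(counts.items()):
            edges.append((word, filename, n))
    return edges


# Hypothetical toy corpus: two tiny "files".
docs = {
    "a.txt": "graphs connect vertices with edges",
    "b.txt": "edges connect words to files",
}
edges = build_word_doc_edges(docs)

# Degree of each word vertex = number of files it appears in.
word_degree = Counter(word for word, _, _ in edges)
```

A script like this is exactly the single-machine approach the quote says does not scale; the point of GraphBuilder is to do this kind of construction in parallel on Hadoop.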


I also got from Nilesh Jain, GraphBuilder project owner, a link to the project release notes.
