Large Scale Data Mining

The new electronic world is made up of millions of computers processing and collecting billions of pieces of information every day. This information comes in the form of ATM transactions, cellular and long-distance phone call records, scientific observations, stock market information, web site transaction logs -- the list goes on. Unfortunately, the scale of this data far outpaces the ability of state-of-the-art data mining systems, and valuable information is left undiscovered. Further, these data sources exist over months or years, and the processes generating them will often change during this time. For example, seasonal effects, new products or promotions, natural disasters, and changing economic conditions can all lead to some type of concept drift. In these situations old data may actually be misleading, and a data mining system must carefully consider the way such data is used. Finally, in many environments even storing the data for later use is prohibitively difficult or expensive, and data mining must instead be done on the stream of data as it arrives.

We are exploring this new data mining environment, and propose a number of design criteria for any data mining system that hopes to succeed in it: the system must incrementally process data as it arrives; require at most one look at each piece of data (and do so faster than the data arrives); be ready to provide a usable model at any point; produce a model that is equivalent (within some user-supplied epsilon) to the model that would be produced by a conventional batch system; and keep its model up-to-date as the underlying process generating data changes over time. To date, we have developed a general method for scaling learning algorithms to these massive data environments. We have used a version of the method to develop two decision tree induction systems, two clustering systems (for k-means clustering and EM clustering), two Bayesian network learning algorithms, and a system for mining relational data -- each of which meets many of our proposed design criteria. In the future we hope to extend the method, evaluate its effectiveness in scaling up a wider class of learning algorithms, work to make our current systems as practical and useful as possible, and evaluate our work on large real-world data mining problems.
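
As a rough illustration of the first four criteria (incremental processing, a single look at each item, a usable model at any point, and agreement within a user-supplied epsilon), the minimal Python sketch below estimates a mean over a stream. It is a toy stand-in, not one of the systems described on this page: the class name StreamingMean, the function learn_until_confident, the parameter delta, and the choice of the Hoeffding bound as the stopping test are all assumptions made for the example. It also does not address the fifth criterion (tracking a process that changes over time).

```python
import math


class StreamingMean:
    """Hypothetical one-pass estimator for values known to lie in [0, value_range]."""

    def __init__(self, value_range=1.0):
        self.value_range = value_range
        self.n = 0          # number of items seen so far
        self.mean = 0.0     # current model: the running mean

    def update(self, x):
        # Incremental update: each item is looked at once and then discarded.
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def model(self):
        # A usable model is available at any point in the stream.
        return self.mean

    def error_bound(self, delta):
        # Two-sided Hoeffding bound (an assumption of this sketch): with
        # probability at least 1 - delta, the running mean is within this
        # distance of the true mean of the process generating the data.
        if self.n == 0:
            return math.inf
        return self.value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * self.n))


def learn_until_confident(stream, epsilon, delta, value_range=1.0):
    """Consume the stream only until the model is within epsilon (w.p. 1 - delta)."""
    estimator = StreamingMean(value_range)
    for x in stream:
        estimator.update(x)
        if estimator.error_bound(delta) <= epsilon:
            break  # more data would not change the answer by more than epsilon
    return estimator
```

With this stopping rule the number of items consumed depends only on epsilon, delta, and the value range, not on how long the stream is. For example, calling learn_until_confident on a generator of a million uniform random values with epsilon=0.05 and delta=0.01 stops after roughly a thousand items.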

People

Pedro Domingos
Geoff Hulten
Yeuhi Abe
Laurie Spencer
Chun-hsiang Hung

Software

Keep watching here for the release of our VFML toolkit for mining massive data streams and databases.

Publications

For a high-level overview:

Catching Up with the Data: Research Issues in Mining Data Streams [PS] [PDF]: Pedro Domingos and Geoff Hulten. Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001. Santa Barbara, CA: http://www.cs.cornell.edu/johannes/dmkd2001.htm.

Complete List of Publications:

Mining Massive Relational Databases [PS] [PDF]: Geoff Hulten, Pedro Domingos, and Yeuhi Abe. Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data, 2003. Acapulco, Mexico: IJCAII.
Mining Complex Models from Arbitrarily Large Databases in Constant Time [PS] [PDF]: Geoff Hulten and Pedro Domingos. Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining (pp. 525-531), 2002. Edmonton, Canada: ACM Press.
Learning from Infinite Data in Finite Time [PS] [PDF]: Pedro Domingos and Geoff Hulten. Advances in Neural Information Processing Systems (NIPS) 14, 2002. Cambridge, MA: MIT Press.
Mining Time-Changing Data Streams [PS] [PDF]: Geoff Hulten, Laurie Spencer and Pedro Domingos. In Proc. 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 97-106), 2001. San Francisco, CA: ACM Press.
A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering [PS] [PDF]: Pedro Domingos and Geoff Hulten. In Proc. 18th International Conference on Machine Learning (ICML) (pp. 106-113), 2001. Williamstown, MA: Morgan Kaufmann.
Catching Up with the Data: Research Issues in Mining Data Streams [PS] [PDF]: Pedro Domingos and Geoff Hulten. Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001. Santa Barbara, CA: http://www.cs.cornell.edu/johannes/dmkd2001.htm.
Mining High-Speed Data Streams [PS] [PDF]: Pedro Domingos and Geoff Hulten. In Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 71-80), 2000. Boston, MA: ACM Press.