Large Scale Data Mining
The new electronic world is made up of millions of computers
processing and collecting billions of pieces of information every
day. This information comes in the form of ATM transactions, cellular
and long-distance phone call records, scientific observations, stock
market information, web site transaction logs -- the list goes
on. Unfortunately, the scale of this data far outpaces the ability of
state-of-the-art data mining systems, and valuable information is left
undiscovered. Further, these data sources exist over months or years,
and the processes generating them will often change during this
time. For example, seasonal effects, new products or promotions,
natural disasters, and changing economic conditions can all lead to
some type of concept drift. In these situations old data may actually
be misleading and a data mining system must carefully consider the way
such data is used. Finally, in many environments even storing the data
for later use is prohibitively difficult or expensive, and data mining
must instead be done on the stream of data, as it arrives.
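As a minimal sketch of one way to limit the influence of old data on a
stream, the Python class below keeps a running mean in which every new
observation shrinks the weight of all earlier ones by a fixed factor. The
class name DecayedMean and its decay parameter are hypothetical, chosen for
illustration only; exponential forgetting is just one possible policy and
is not taken from the systems described on this page.

```python
class DecayedMean:
    """Running mean that down-weights old observations exponentially.

    Hypothetical illustration: decay=1.0 gives the ordinary mean, while
    smaller values make the estimate track recent data more closely.
    """

    def __init__(self, decay=0.99):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.weighted_sum = 0.0   # sum of observations, each scaled by its weight
        self.total_weight = 0.0   # sum of the weights themselves

    def update(self, x):
        # One look at each observation: older data is scaled down, then x is added.
        self.weighted_sum = self.decay * self.weighted_sum + x
        self.total_weight = self.decay * self.total_weight + 1.0

    @property
    def value(self):
        if self.total_weight == 0.0:
            raise ValueError("no data seen yet")
        return self.weighted_sum / self.total_weight


# Usage: the process generating the data shifts from 0.0 to 10.0 halfway through.
drifting = [0.0] * 200 + [10.0] * 200
recent = DecayedMean(decay=0.9)
plain = DecayedMean(decay=1.0)
for x in drifting:
    recent.update(x)
    plain.update(x)
```

After the shift, the decayed estimate follows the new level closely, while
the undecayed mean still averages the old and new regimes together. Both
use constant memory and never store the stream.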
We are exploring this new data mining environment, and propose a
number of design criteria for any data mining system that hopes to
succeed in it: the system must incrementally process data as it
arrives; require at most one look at each piece of data (and do so
faster than the data arrives); be ready to provide a usable model at
any point; produce a model that is equivalent (within some
user-supplied epsilon) to the model that would be produced by a
conventional batch system; and keep its model up to date as the
underlying process generating data changes over time. To date, we have
developed a general method for scaling learning algorithms to these
massive data environments. We have used a version of the method to
develop two decision tree induction systems, two clustering systems
(for k-means clustering and EM clustering), two Bayesian network
learning algorithms, and a system for mining relational data -- each
of which meets many of our proposed design criteria. In the future we
hope to extend the method, evaluate its effectiveness in scaling up a
wider class of learning algorithms, work to make our current systems
as practical and useful as possible, and evaluate our work on large
real-world data mining problems.
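The criteria of single-pass processing, an always-available model, and
agreement with a batch result within a user-supplied epsilon can be
illustrated on a deliberately simple task: estimating a mean. The sketch
below assumes a one-sided Hoeffding bound is used to decide when enough
data has been seen, and that observations are independent and lie in a
known range. The names hoeffding_epsilon, StreamingMean, and
learn_until_confident, and the parameters value_range and delta, are
illustrative assumptions, not the interface of any system listed on this
page.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Deviation bound after n independent observations in a bounded range.

    With probability at least 1 - delta, the true mean exceeds the sample
    mean by no more than the returned amount (one-sided Hoeffding bound).
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class StreamingMean:
    """Toy 'model': an incrementally updated mean, usable at any point."""

    def __init__(self, value_range, epsilon, delta):
        self.value_range = value_range
        self.epsilon = epsilon    # user-supplied tolerance
        self.delta = delta        # allowed probability of exceeding epsilon
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Each observation is seen once and then discarded.
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def converged(self):
        # True once the bound guarantees the requested tolerance.
        if self.n == 0:
            return False
        return hoeffding_epsilon(self.value_range, self.delta, self.n) <= self.epsilon


def learn_until_confident(stream, value_range, epsilon, delta):
    """Consume the stream only until the epsilon guarantee is met."""
    model = StreamingMean(value_range, epsilon, delta)
    for x in stream:
        model.update(x)
        if model.converged():
            break
    return model


# Usage: a long stream of values in [0, 1]; only a prefix of it is read.
stream = (float(i % 2) for i in range(10000))
model = learn_until_confident(stream, value_range=1.0, epsilon=0.05, delta=0.01)
```

With value_range=1.0, epsilon=0.05, and delta=0.01, the loop stops after
922 observations however long the stream is, because the bound depends
only on the number of observations seen and not on the size of the data.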
People
Software
Keep watching here for the release of our VFML toolkit for mining
massive data streams and databases.
Publications
For a high level overview:
- Catching Up with the Data:
Research Issues in Mining Data Streams
[PS]
[PDF]
- Pedro Domingos and Geoff Hulten. Workshop on Research Issues
in Data Mining and Knowledge Discovery, 2001. Santa Barbara, CA:
http://www.cs.cornell.edu/johannes/dmkd2001.htm.
Complete List of Publications:
- Mining Massive Relational Databases
[PS]
[PDF]
- Geoff Hulten, Pedro Domingos, and Yeuhi Abe. Proceedings
of the IJCAI-2003 Workshop on Learning Statistical Models from
Relational Data, 2003. Acapulco, Mexico: IJCAII.
- Mining Complex Models from Arbitrarily Large Databases in Constant Time
[PS]
[PDF]
- Geoff Hulten and Pedro Domingos. Proceedings of the
Eighth International Conference on Knowledge Discovery and Data Mining
(pp. 525-531), 2002. Edmonton, Canada: ACM Press.
- Learning from Infinite Data in Finite Time
[PS]
[PDF]
- Pedro Domingos and Geoff Hulten. Advances in Neural Information
Processing Systems (NIPS) 14, 2002. Cambridge, MA: MIT Press.
- Mining Time-Changing Data Streams
[PS]
[PDF]
- Geoff Hulten, Laurie Spencer, and Pedro Domingos. In Proc. 7th ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining (pp. 97-106), 2001. San Francisco, CA: ACM Press.
- A General Method for Scaling Up Machine Learning Algorithms
and its Application to Clustering
[PS]
[PDF]
- Pedro Domingos and Geoff Hulten. In Proc. 18th International
Conference on Machine Learning (ICML) (pp. 106-113),
2001. Williamstown, MA: Morgan Kaufmann.
- Catching Up with the Data:
Research Issues in Mining Data Streams
[PS]
[PDF]
- Pedro Domingos and Geoff Hulten. Workshop on Research Issues in Data
Mining and Knowledge Discovery, 2001. Santa Barbara, CA:
http://www.cs.cornell.edu/johannes/dmkd2001.htm.
- Mining High-Speed Data Streams
[PS]
[PDF]
- Pedro Domingos and Geoff Hulten. In Proc. of the 6th ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining (pp. 71-80), 2000. Boston, MA: ACM Press.