Krzysztof Fidelis
Structural Biology, Lawrence Livermore National Laboratory
fidelis1@llnl.gov

Tom Slezak
Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory
slezak@llnl.gov
We are at a point where science, business, and government face many common technical challenges in their struggle to keep moving forward. The challenges stem from our growing ability to create and collect complex data at tremendous rates, followed naturally by the need to manage and analyze that same data. To maximize the value of this data, we must first be able to access and manipulate it as structured, coherent, integrated information. Scientific data management, the discipline of improving the scientist's interactions with data, addresses this need in the scientific arena (the domain of this pilot project). Data mining and knowledge discovery is a cornerstone of scientific data management, and of advanced data management in all sectors more generally. Data mining has the narrower focus of helping automate the exploration and characterization of the data being generated, leaving the analyst or scientist free to concentrate on the actual interpretation and use of the data.
This white paper briefly describes a new, aggressive effort in large-scale data mining at Lawrence Livermore National Laboratory (LLNL). In the short term, this effort will focus on several mission-critical questions in the Human Genome project. We will adapt current data mining techniques to the genome domain, extend them to quantify the accuracy of inference results, and lay the groundwork for a more extensive R&D effort in large-scale data mining. A major aspect of the approach is that we will be leveraging a fully staffed data warehousing effort in the human genome area. The long-term goal is to build a strong, applications-oriented research program in large-scale data mining. The tools, experience, and skills gained will be directly applicable to a wide spectrum of tasks involving advanced analytics for large spatial and multidimensional data. This includes applications in ensuring Global Security (non-proliferation, stockpile stewardship), enabling Global Ecology (Materials Database for Industrial Ecology), advancing the Biosciences (Human Genome Project), and supporting work for others (Battlefield Management, Health Care).
There are many challenges on the path to effective large-scale data mining. The issues are broad, and their solutions require expertise from several domains in computer science, mathematics, and statistics; each can be viewed as an axis along which current technology must be scaled to match the demands of a data-intensive domain.
There are three main strategies for overcoming the barriers facing large-scale data mining:
1. Build scalable I/O architectures and interfaces. For example, build intelligent interfaces between DBMS systems and tertiary storage.
2. Develop algorithms that work with non-traditional data types such as time sequences, protein sequences, or 3D structures (see the featurization sketch after this list).
3. Scale algorithms to work with larger data sets. This involves reducing algorithm complexity, effective parallelization, and controlled sampling or filtering.
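To make strategy #2 concrete, here is a minimal sketch, in Python, of one standard way to expose a non-traditional data type to conventional mining algorithms: mapping a protein sequence to a fixed-length k-mer composition vector. The function name and the choice of k are illustrative assumptions of ours, not part of the pilot's design.

# Minimal sketch: turn a non-traditional data type (a protein sequence)
# into a fixed-length feature vector via k-mer counts, so that
# conventional mining algorithms can consume it.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(sequence, k=2):
    """Return normalized counts for every possible k-mer over the
    20 standard amino acids, in a fixed lexicographic order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts[''.join(p)] / total
            for p in product(AMINO_ACIDS, repeat=k)]

# Example: a 400-dimensional dipeptide-composition vector.
vec = kmer_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", k=2)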
Several major efforts at LLNL focus on the first issue, including the High Performance Storage System (HPSS), the Message Passing Interface, and the Scalable I/O Facility. Other work revolving around scalable interfaces includes the Interface Data Repository and the Conquest/OASIS projects [2, 9]. We plan to leverage these efforts in the future. The second focus, developing algorithms that work with non-traditional data types, is important as well. In our judgment, however, the highest-payoff research area for the genome domain, and for large-scale data mining in general, is #3: scaling algorithms to work with large data sets. The key enablers will come from a focus on parallel codes and controlled sampling techniques.
For the pilot, we will adapt existing parallel codes such as SPRINT [8] and MLC++ [3] to the human genome domain. Where advantageous, we will extend other algorithms with the more obvious parallel optimizations, such as numerical and graph-search parallel techniques. Our primary focus, however, will be to build a controlled random sampling framework based on the expected utility of additional sampling.
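As a flavor of the parallel optimizations we have in mind, the following is a toy sketch of ours (not code from SPRINT or MLC++) of attribute-parallel split evaluation in the spirit of SPRINT [8]: candidate splits on different attributes are scored independently, so they can be farmed out across processes. All names and the tiny data set are illustrative.

# Toy sketch of one parallelization pattern used by classifiers such
# as SPRINT [8]: candidate splits for different attributes can be
# scored independently, so the work is distributed across processes.
from concurrent.futures import ProcessPoolExecutor

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_for_attribute(args):
    """Scan one attribute's sorted values; return (impurity, attr, threshold)."""
    attr, column, labels = args
    order = sorted(range(len(column)), key=lambda i: column[i])
    best = (float("inf"), attr, None)
    for cut in range(1, len(order)):
        left = [labels[i] for i in order[:cut]]
        right = [labels[i] for i in order[cut:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[0]:
            best = (score, attr, column[order[cut]])
    return best

def parallel_best_split(table, labels):
    """table: dict mapping attribute name -> list of numeric values."""
    tasks = [(attr, col, labels) for attr, col in table.items()]
    with ProcessPoolExecutor() as pool:
        return min(pool.map(best_split_for_attribute, tasks))

if __name__ == "__main__":
    table = {"gc_content": [0.3, 0.7, 0.6, 0.2], "length": [120, 90, 300, 80]}
    print(parallel_best_split(table, ["a", "b", "b", "a"]))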
The key concept behind random sampling is the realization that the economically rational decision is often to use a small portion of the available data to answer a question. For interactive data mining tasks, a user might want only a rough idea of the predicted value of a variable, and so would want to cut off the search when the expected accuracy reaches, say, +/-10%. Similarly, the user might stop when it is 95% certain that the current answer will not change no matter how much of the remaining data is seen. The methods that support these abilities can also be used to answer questions like: "What sample size will probably be large enough to produce inferences within a given expected error?" and "What is the estimated accuracy of this prediction, given the data that was used to produce it?"
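A minimal sketch of such a stopping rule follows, under the simplifying assumptions of simple random sampling of a numeric attribute and a normal approximation to the estimate; all function names and parameters are illustrative, not the pilot's actual framework.

# Minimal sketch of interval-based early stopping: draw records one
# batch at a time and stop as soon as the 95% confidence interval on
# the running mean is within +/-10% of the estimate.
import random
import statistics

def estimate_mean(draw, batch=100, rel_err=0.10, z=1.96, max_n=100_000):
    """draw() returns one sampled value; stop when the CI half-width
    falls below rel_err * |mean|, or when max_n records are seen."""
    sample = []
    while len(sample) < max_n:
        sample.extend(draw() for _ in range(batch))
        mean = statistics.fmean(sample)
        half = z * statistics.stdev(sample) / len(sample) ** 0.5
        if abs(mean) > 0 and half <= rel_err * abs(mean):
            return mean, half, len(sample)
    return mean, half, len(sample)

# Example: estimate the mean of a synthetic "marker spacing" column
# without scanning the full table.
random.seed(0)
mean, half, n = estimate_mean(lambda: random.gauss(50.0, 12.0))
print(f"mean ~ {mean:.1f} +/- {half:.1f} after {n} samples")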
The theoretical basis for these techniques stems from the idea that the utility of additional sampling is a measure of the expected marginal gain in inference quality [4, 5]. The research is strongly related to ideas from learning from sparse data [6], sampling theory [1], game theory [7], and anytime algorithms [10].
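One way to make this precise (the notation below is ours, a simplified rendering of the treatment in [4, 5]): let $Q(n)$ denote the expected quality of the inference built from $n$ records, and let $c$ be the cost of touching one additional record. The expected net utility of drawing $m$ more records is

\[ U(m) = \mathbb{E}[\,Q(n+m)\,] - Q(n) - c\,m, \]

and sampling continues only while $\max_{m>0} U(m) > 0$, that is, while the expected marginal gain in inference quality outweighs the marginal cost of additional data access.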
There are two clear benefits to this approach. First, this research will address a critical shortcoming in many state-of-the-art mining algorithms: the lack of a well-grounded metric by which to judge the quality of inference results. The marginal gain in inference quality is the key to providing this metric. Second, we can control the large-scale aspect by taking samples of the data that are large enough to return high-quality answers, yet much smaller than the full corpus.
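To illustrate how the marginal gain can serve simultaneously as a stopping rule and as the quality metric attached to an answer, here is a self-contained toy sketch; the trivial nearest-centroid classifier, the synthetic data, and every threshold are illustrative assumptions of ours, not the pilot's design.

# Toy sketch: grow the training sample in batches, track held-out
# accuracy, stop when the marginal gain per batch drops below a
# threshold, and report that accuracy as the quality of the answer.
import random

def centroid_fit(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    cents = {}
    for c in set(labels):
        xs = [p for p, l in zip(points, labels) if l == c]
        cents[c] = tuple(sum(v) / len(xs) for v in zip(*xs))
    return cents

def accuracy(cents, points, labels):
    def predict(p):
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
    return sum(predict(p) == l for p, l in zip(points, labels)) / len(labels)

def mine_with_quality(draw, batch=200, eps=0.005, max_n=20_000):
    """Grow the sample until an extra batch improves held-out accuracy
    by less than eps; return the model, its quality, and the sample size."""
    train = [draw() for _ in range(batch)]
    test = [draw() for _ in range(1000)]
    quality = 0.0
    while len(train) < max_n:
        model = centroid_fit(*zip(*train))
        new_quality = accuracy(model, *zip(*test))
        if new_quality - quality < eps:
            return model, new_quality, len(train)
        quality = new_quality
        train += [draw() for _ in range(batch)]
    return model, quality, len(train)

# Example: two synthetic classes of 2-D points.
random.seed(1)
def draw():
    c = random.choice("ab")
    mu = (0.0, 0.0) if c == "a" else (2.0, 2.0)
    return (random.gauss(mu[0], 1.0), random.gauss(mu[1], 1.0)), c

model, q, n = mine_with_quality(draw)
print(f"accuracy ~ {q:.2f} from {n} training records")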
The technical focus of this pilot project is such that the results could be readily transferred to nearly any application of data mining to large data sets. The usefulness of data mining techniques in business, scientific, and government applications is well known. We believe data mining will also become a central technology in data warehousing and intelligent integration, aiding in schema and row-identity analysis.
The initial domain of study for the pilot is the Human Genome Project. This domain is stimulating and challenging, and the staff is interested and already contributing to a highly leverageable R&D project: "Data Warehousing and Integration for Scientific Data Management". The warehousing project marks a significant push to collect, organize, and integrate the corpus of genetic map data, sequence data, protein structure, taxonomy, and other information that exists at LLNL and at other sites around the world. The data infrastructure is currently under construction and should be in place by FY98. The warehousing effort addresses the heavy data management demands described in the Barriers section, and will establish an excellent testbed environment for the data mining effort.
The Human Genome Project poses several important research questions that data mining can help answer.
The first usable prototype of the warehouse will be deployed by the end of FY97 and will provide the testbed for the data mining research. From the research standpoint, we will be productive within a few months of project initiation. We expect that with two scientists committed to the pilot project, strong practical results will be achievable within half a year of the start date. Based on research results already achieved by one of the PIs in this area [4, 5, 6], we believe the chance of success is high.
There are significant resources that this effort will be able to leverage. The largest impact will come from the data warehousing R&D effort between the Center for Applied Scientific Computing (CASC) and the Biology and Biotechnology Research Program (BBRP). That project is fully funded, staffed with two postdocs and three-quarters of a staff scientist from CASC, as well as several staff scientists from BBRP.
Additional LLNL resources can be leveraged as well.
LLNL is strongly positioned to make a significant impact on the Human Genome effort in the short term, and on both the theoretical and practical aspects of the new field of large-scale data mining. The combined knowledge, tools, collaborations, and other resources available to this effort at LLNL will enable significant progress in a short time frame.
[1] R. Bechhofer and D. Goldsman; "Sequential Identification and Ranking Procedures", The University of Chicago Press, Chicago, IL, 1968.
[2] P. Brown, R. Troy, D. Fisher, S. Louis, J. McGraw, R. Musick; "Metadata for Balanced Performance", Proc. of the 1st IEEE Metadata Conference, 1996.
[3] R. Kohavi, G. John, R. Long, D. Manley, K. Pfleger; "MLC++: A Machine Learning Library in C++", Tools with Artificial Intelligence, 1994.
[4] R. Musick, J. Catlett, S. Russell; "An Efficient Method for Constructing Approximate Decision Trees for Large Databases", Proc. of the 10th Int'l Conference on Machine Learning, 1993.
[5] R. Musick; "Belief Network Induction", Ph.D. Dissertation, UCB Tech Report CSD-95-863, University of California, Berkeley, 1994.
[6] R. Musick; "Rethinking the Learning of Belief Network Probabilities", Proc. of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, 1996.
[7] S. Russell and E. Wefald; "Do the Right Thing: Studies in Limited Rationality", MIT Press, Cambridge, MA, 1991.
[8] J. Shafer, R. Agrawal, M. Mehta; "Fast Serial and Parallel Classification of Very Large Databases", Proc. of the 22nd Int'l Conference on Very Large Databases, 1996.
[9] P. Stolorz, H. Nakamura, E. Mesrobian, R. Muntz, E. Shek, J. Santos, J. Yi, K. Ng, S. Chien, C. Mechoso, J. Farrara; "Fast Spatio-Temporal Data Mining of Large Geophysical Datasets", Proc. of the 1st Int'l Conference on Knowledge Discovery and Data Mining, 1995.
[10] S. Zilberstein; "Using Anytime Algorithms in Intelligent Systems", AI Magazine, 17(3), 1996.
The principal investigators on this project all have significant experience and visibility in their respective fields.
Dr. Ron Musick earned his Ph.D. in Computer Science at the University of California, Berkeley. At LLNL, Dr. Musick has been deeply involved in data and information management, and is the P.I. of the above-mentioned data warehousing R&D project. He is the Program Chair for the
Dr. Krzysztof Fidelis (Structural Biology) leads the LLNL effort in protein structure prediction and assessment. He is a co-founder and organizer of the international conferences on the Critical Assessment of Techniques for Protein Structure Prediction, and Director of the Livermore-based Center for Critical Assessment of Protein Structure Prediction. His current research is devoted to the development of protein fold recognition methods, with applications to the structural and functional characterization of genomic and expressed sequence data. He holds a Ph.D. in Biophysics/Physical Chemistry.
Mr. Tom Slezak (Human Genome) has spent more than 18 years at LLNL, most of it involving computational support of BBRP research (nine of those years on the Genome project). He is the architect of all of BBRP's database, analysis, and graphical tools. He designed and implemented the physical map assembly and integration solutions and all database abstractions. He is an acknowledged expert in genome informatics, has been invited to speak at major genome labs in England, France, Germany, and Japan, and has consulted at many major biotech and pharmaceutical companies. He holds a B.S. and an M.S. in Computer Science.
The data warehousing R&D effort began in FY97, and the pilot project described in this white paper has not yet begun. Most of our related prior results are research-oriented. Publications directly relevant to the research problems in large-scale data mining include:
P. Brown, R. Troy, D. Fisher, S. Louis, J. McGraw, R. Musick; "Metadata for Balanced Performance", Proc. of the 1st IEEE Metadata Conference, 1996.
R. Musick, J. Catlett, S. Russell; "An Efficient Method for Constructing Approximate Decision Trees for Large Databases", Proc. of the 10th Int'l Conference on Machine Learning, 1993.
R. Musick; "Belief Network Induction", Ph.D. Dissertation, UCB Tech Report CSD-95-863, University of California, Berkeley, 1994.
R. Musick; "Rethinking the Learning of Belief Network Probabilities", Proc. of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, 1996.
R. Musick; "Minimal Assumption Distribution Propagation in Belief Networks", Proc. of the Int'l Conference on Uncertainty in Artificial Intelligence, 1993.
There is an existing, effective collaboration between BBRP and CASC upon which this effort would build. The principal investigators on this proposal are already working together on the data warehousing LDRD. We are also exploring further collaborations with Professor Shlomo Zilberstein of the University of Massachusetts Amherst, a leader in the area of anytime algorithms, and with Dr. Paul Stolorz of NASA's Jet Propulsion Laboratory.