Frequent subtree mining
In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold.[1] It is a more general form of the maximum agreement subtree problem.[2]
Definition
Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern.[3]
Formal definition
The problem of frequent subtree mining has been formally defined as:[1]
- Given a threshold minfreq, a class of trees , a transitive subtree relation between trees , a finite set of trees , the frequent subtree mining problem is the problem of finding all trees such that no two trees in are isomorphic and
- where d is an anti-monotone function such that if then
Algorithms
In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.[4]
Applications
Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining.[1] Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining.[4] Other domains in which frequent subtree mining is useful include computational biology,[5][6] RNA structure analysis,[6] pattern recognition,[7] bioinformatics,[8] and analysis of the KEGG GLYCAN database.[9]
Challenges
Checking whether a pattern (or a transaction) supports a given subgraph is an NP-complete problem, since it is an NP-complete instance of the subgraph isomorphism problem.[7] Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".[10]
References
- 1 2 3 Chi, Yun; Muntz, Richard R.; Nijssen, Siegfried; Kok, Joost N. (28 June 2005). "Frequent Subtree Mining - An Overview". Fundamenta Informaticae. 66: 161–198. Retrieved 3 July 2014.
- ↑ Deepak, Akshay; Fernández-Baca, David; Tirthapura, Srikanta; Sanderson, Michael J.; McMahon, Michelle M. (July 2013). "EvoMiner: frequent subtree mining in phylogenetic databases". Knowledge and Information Systems. doi:10.1007/s10115-013-0676-0. Retrieved 4 July 2014.
- ↑ Dai, H., Srikant, R. and Zhang, C. (2004). "Advances in Knowledge Discovery and Data Mining." 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26–28, 2004, Proceedings. 1st ed. p. 65.
- 1 2 Zaki, Mohammed J. (2002). "Efficiently mining frequent trees in a forest". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining: 71–80. Retrieved 16 June 2014.
- ↑ Deepak, Akshay, David Fernández-Baca, Srikanta Tirthapura, Michael J. Sanderson, and Michelle M. McMahon. "EvoMiner: frequent subtree mining in phylogenetic databases." Knowledge and Information Systems (2011): 1-32.
- 1 2 Chi, Yun, Yirong Yang, and Richard R. Muntz. "Canonical forms for labelled trees and their applications in frequent subtree mining." Knowledge and Information Systems 8, no. 2 (2005): 203–234.
- 1 2 Chi, Yun; Yang, Yirong; Muntz, Richard R. (2004). "Mining Frequent Rooted Trees and Free Trees Using Canonical Forms" (PDF). Knowledge and Information Systems. Retrieved 16 June 2014.
- ↑ Xiao, Yongqiao; Yao, Jenq-Foung; Li, Zhigang; Dunham, Margaret H. (2003). "Efficient data mining for maximal frequent subtrees". Third IEEE International Conference on Data Mining: 379–386. Retrieved 16 June 2014.
- ↑ Aoki-Kinoshita, Kiyoko F. (2009). Glycome Informatics: Methods and Applications. CRC Press. p. 141. ISBN 9781420083347.
- ↑ Zou, Lei, Yansheng Lu, Huaming Zhang, and Rong Hu. "Mining frequent induced subtree patterns with subtree-constraint." In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on, pp. 3–7. IEEE, 2006.