Frequent subtree mining

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold.^[1] It is a more general form of the maximum agreement subtree problem.^[2]

Definition

Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern.^[3]

Formal definition

The problem of frequent subtree mining has been formally defined as:^[1]

Given a threshold minfreq, a class of trees

{\mathcal {C}}

, a transitive subtree relation

P\preceq T

between trees

P,T\in {\mathcal {C}}

, a finite set of trees

{\mathcal {D}}\subseteq {\mathcal {C}}

, the frequent subtree mining problem is the problem of finding all trees

{\mathcal {P}}\subset {\mathcal {C}}

such that no two trees in

{\mathcal {P}}

are isomorphic and

\forall P\in {\mathcal {P}}:\quad \mathrm {freq} (P,{\mathcal {D}})=\sum \nolimits _{T\in {\mathcal {D}}}d(P,T)\geq \mathrm {minfreq} ,

where

d

is an anti-monotone function such that if

P'\preceq P

then

\forall T\in {\mathcal {C}}:\quad d(P',T)\geq d(P,T).

Algorithms

In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.^[4]

Applications

Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining.^[1] Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining.^[4] Other domains in which frequent subtree mining is useful include computational biology,^[5]^[6] RNA structure analysis,^[6] pattern recognition,^[7] bioinformatics,^[8] and analysis of the KEGG GLYCAN database.^[9]

Challenges

Checking whether a pattern (or a transaction) supports a given subgraph is an NP-complete problem, since it is an NP-complete instance of the subgraph isomorphism problem.^[7] Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".^[10]

References

1 2 3 Chi, Yun; Muntz, Richard R.; Nijssen, Siegfried; Kok, Joost N. (28 June 2005). "Frequent Subtree Mining - An Overview". Fundamenta Informaticae. 66: 161–198. Retrieved 3 July 2014.
↑ Deepak, Akshay; Fernández-Baca, David; Tirthapura, Srikanta; Sanderson, Michael J.; McMahon, Michelle M. (July 2013). "EvoMiner: frequent subtree mining in phylogenetic databases". Knowledge and Information Systems. doi:10.1007/s10115-013-0676-0. Retrieved 4 July 2014.
↑ Dai, H., Srikant, R. and Zhang, C. (2004). "Advances in Knowledge Discovery and Data Mining." 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26–28, 2004, Proceedings. 1st ed. p. 65.
1 2 Zaki, Mohammed J. (2002). "Efficiently mining frequent trees in a forest". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining: 71–80. Retrieved 16 June 2014.
↑ Deepak, Akshay, David Fernández-Baca, Srikanta Tirthapura, Michael J. Sanderson, and Michelle M. McMahon. "EvoMiner: frequent subtree mining in phylogenetic databases." Knowledge and Information Systems (2011): 1-32.
1 2 Chi, Yun, Yirong Yang, and Richard R. Muntz. "Canonical forms for labelled trees and their applications in frequent subtree mining." Knowledge and Information Systems 8, no. 2 (2005): 203–234.
1 2 Chi, Yun; Yang, Yirong; Muntz, Richard R. (2004). "Mining Frequent Rooted Trees and Free Trees Using Canonical Forms" (PDF). Knowledge and Information Systems. Retrieved 16 June 2014.
↑ Xiao, Yongqiao; Yao, Jenq-Foung; Li, Zhigang; Dunham, Margaret H. (2003). "Efficient data mining for maximal frequent subtrees". Third IEEE International Conference on Data Mining: 379–386. Retrieved 16 June 2014.
↑ Aoki-Kinoshita, Kiyoko F. (2009). Glycome Informatics: Methods and Applications. CRC Press. p. 141. ISBN 9781420083347.
↑ Zou, Lei, Yansheng Lu, Huaming Zhang, and Rong Hu. "Mining frequent induced subtree patterns with subtree-constraint." In Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on, pp. 3–7. IEEE, 2006.

This article is issued from Wikipedia - version of the 9/2/2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.