With the emergence of big data, inducing regression trees on very large data
sets has become a common data mining task. Although centralized algorithms for
computing ensembles of classification and regression trees are well studied in
machine learning and data mining, their distributed versions still raise
scalability, efficiency, and accuracy issues.
Most state-of-the-art tree learning algorithms require data to reside in memory
on a single machine.
Adopting this approach for trees on big data is not feasible, as the limited
resources of a single machine lead to scalability problems. While more
scalable implementations of tree learning algorithms have been proposed, they
typically require specialized parallel computing architectures, which renders
those algorithms complex and error-prone.
In this thesis, we introduce two approaches to computing ensembles of
regression trees on very large training data sets using the MapReduce framework
as an underlying tool. The first approach employs the entire MapReduce cluster
to learn tree ensembles in parallel and in a fully distributed fashion. The
second approach exploits locality and independence in the tree learning process.