Definition
A random forest is a machine learning algorithm for classifying data (classification) and predicting numeric values (regression). It works by combining the predictions of many decision trees, each trained on a random subset of the training data.
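For a concrete sense of the workflow, below is a minimal sketch using scikit-learn's RandomForestClassifier on its bundled iris dataset. The hyperparameters shown (n_estimators, random_state, test_size) are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small example dataset and hold out a portion for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train an ensemble of 100 decision trees, each on a bootstrap sample.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# score() reports the accuracy of the combined (majority-vote) predictions.
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```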
History of Random Forests
- The decision tree concept took shape in the early 1980s and later formed the foundation of random forests. J. Ross Quinlan’s ID3 (1986) and C4.5 (1993) were among the most notable algorithms.
- Leo Breiman introduced bootstrap aggregating (bagging) in 1996. He trained several decision trees on different bootstrap samples of the data and merged their predictions.
- In 2001, Breiman named the technique random forests and improved it by letting each split in a tree consider only a random subset of features, which decorrelates the trees and makes the whole ensemble more accurate.
- Random forests became popular in the early 2000s for their accuracy, ease of use, and versatility.
- Between 2000 and 2010, researchers and practitioners continued to explore the model’s capabilities, and its applications expanded to finance, bioinformatics, and environmental modeling.
- Currently, random forests are integral to machine learning. They are often cited in academic research and used as benchmarks in machine learning competitions.
How a Random Forest Works
- A random forest begins with bootstrap sampling: drawing many random samples from the original dataset with replacement. Because sampling is with replacement, the same data point can appear in several samples, and even more than once within one sample.
- A decision tree is trained on each bootstrap sample. When splitting a node, each tree considers only a random subset of the features, which introduces further randomness and decorrelates the trees.
- The trees’ predictions are combined for the final result: classification tasks take the majority vote across the trees, while regression tasks take the average of the trees’ outputs (see the sketch after this list).
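The three steps above can be sketched directly in code. The toy implementation below is for illustration only: it borrows scikit-learn's DecisionTreeClassifier for the individual trees and relies on its max_features="sqrt" option for the per-split feature subsetting, and the tree count of 25 is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n_samples, n_trees = len(X), 25

# Steps 1 and 2: for each tree, draw a bootstrap sample (with replacement)
# and fit a decision tree that considers only a random subset of features
# (max_features="sqrt") at every split.
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap indices
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(10**6))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: aggregate by majority vote (for regression, average instead).
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = np.apply_along_axis(
    lambda v: np.bincount(v.astype(int)).argmax(), 0, votes
)
print(f"Training accuracy of the toy forest: {(majority == y).mean():.3f}")
```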
Random Forest Use Cases
- Classification and regression: Random forests can categorize items and predict numeric outcomes in fields such as healthcare, banking, and stock market forecasting.
- Feature importance: They also help identify which input features contribute most to a model’s predictions (see the sketch after this list).
- Handling noisy, large datasets: Random forests scale to large, complex datasets with many input variables and remain robust to outliers and noise.
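As a brief illustration of the feature-importance use case, the sketch below reads scikit-learn's built-in impurity-based importances from a fitted forest; the iris dataset and the hyperparameter values are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# Impurity-based importances sum to 1.0 across all features;
# higher values indicate features the trees relied on more for splits.
ranked = sorted(
    zip(data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```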