This project implements a Hybrid Apriori algorithm for market basket analysis, designed specifically to handle extremely sparse transaction datasets. The implementation includes optimizations for efficient processing and comprehensive analysis tools for interpreting the results.
The analysis uses two datasets:
-
Sales1998.txt:
- 34,070 transactions with an average of 4.83 items per transaction
- 1,559 unique items
- Extremely sparse: the most frequent item (277) appears in only 0.42% of transactions
-
productList.txt:
- Contains names for all 1,560 products
- Maps product IDs to their corresponding product names
- Enhances the interpretability of analysis results
The HybridApriori
class implements a modified version of the classic Apriori algorithm with several optimizations:
-
Efficient Data Structures:
- Uses sets for transactions to speed up item lookups
- Employs Counter for efficient item counting
- Utilizes defaultdict for counting co-occurrences
-
Frequency-Based Sorting:
- Sorts items by frequency in descending order to optimize the search process
-
Pruning Strategies:
- Applies the Apriori principle to prune candidate itemsets
- Implements early termination if no frequent itemsets are found at a certain level
-
Memory Efficiency:
- Processes transactions in a streaming fashion
- Only stores frequent itemsets, not all candidate itemsets
-
Product Name Integration: - Loads product names from productList.txt
- Maps product IDs to their corresponding names
- Enhances visualization and result interpretation by displaying product names alongside IDs
Due to the extreme sparsity of the dataset, the analysis required using an absolute support count (10 transactions, equivalent to 0.0294%) instead of a percentage-based threshold. This approach yielded:
- 6 frequent 2-itemsets
- 12 association rules with high lift values (ranging from 23 to 32)
- Item 132 (Denny 60 Watt Lightbulb) appearing in 2 different frequent itemsets, making it the most connected item
- Interestingly, the top 3 most frequent items (277 "Great English Muffins", 1352 "Carrington Ice Cream", 846 "Nationeel Fudge Brownies") don't appear in any frequent itemsets
- Enhanced interpretability through product name integration, revealing meaningful associations like "Carlson Low Fat Sour Cream" frequently purchased with "Big Time Apple Cinnamon Waffles"
main.py
: Implementation of the Hybrid Apriori algorithm and analysis toolsSales1998.txt
: Raw transaction data (each line represents a transaction)productList.txt
: Product names corresponding to product IDsfrequent_itemsets.csv
: Output file containing discovered frequent itemsets with product namesassociation_rules.csv
: Output file containing generated association rules with product namesrequirements.txt
: List of Python dependencies
- Python 3.13 or higher
- Required packages: numpy, pandas, matplotlib
- Clone this repository:
git clone https://github.com/ZamoRzgar/basket-analysis.git
cd basket-analysis
- Install required packages:
pip install -r requirements.txt
To run the analysis:
python main.py
The script will:
- Load and analyze the transaction data
- Apply the Hybrid Apriori algorithm with appropriate parameters
- Generate and display frequent itemsets and association rules
- Create visualizations of the results
- Save the findings to CSV files
The algorithm uses the following default parameters:
- Minimum support count: 10 transactions (0.0294%)
- Minimum confidence: 0.05 (5%)
- Maximum itemset length: 3
These parameters can be adjusted in the main()
function to suit different analysis needs.
The implementation includes several visualization tools:
- Support distribution for frequent itemsets
- Top frequent items
- Top association rules by lift
- Item relationship networks
The analysis reveals strong associations between specific item pairs, with lift values indicating that these associations are 23-32 times stronger than would be expected by random chance. These insights can be valuable for:
- Product placement strategies
- Targeted promotions and cross-selling
- Inventory management
- Customer behavior understanding
Potential extensions to this project include:
- Testing different support thresholds to balance pattern discovery and statistical significance
- Implementing alternative algorithms like FP-Growth or Eclat for comparison
- Incorporating temporal aspects to analyze how associations change over time
- Applying clustering techniques to group similar items before running association rule mining
- Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 487-499.
- Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.
- Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley.
This project is licensed under the MIT License - see the LICENSE file for details.
Zamo Rzgar Ahmed
If you found this project helpful, please consider giving it a star ⭐