8000 GitHub - ZamoRzgar/Basket-Analysis: Using Hybrid Apriori Algorithm for Basket Market Analysis
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

ZamoRzgar/Basket-Analysis

Repository files navigation

Market Basket Analysis using Hybrid Apriori Algorithm

Python License

Project Overview

This project implements a Hybrid Apriori algorithm for market basket analysis, designed specifically to handle extremely sparse transaction datasets. The implementation includes optimizations for efficient processing and comprehensive analysis tools for interpreting the results.

Dataset

The analysis uses two datasets:

  1. Sales1998.txt:

    • 34,070 transactions with an average of 4.83 items per transaction
    • 1,559 unique items
    • Extremely sparse: the most frequent item (277) appears in only 0.42% of transactions
  2. productList.txt:

    • Contains names for all 1,560 products
    • Maps product IDs to their corresponding product names
    • Enhances the interpretability of analysis results

Algorithm Implementation

The HybridApriori class implements a modified version of the classic Apriori algorithm with several optimizations:

  1. Efficient Data Structures:

    • Uses sets for transactions to speed up item lookups
    • Employs Counter for efficient item counting
    • Utilizes defaultdict for counting co-occurrences
  2. Frequency-Based Sorting:

    • Sorts items by frequency in descending order to optimize the search process
  3. Pruning Strategies:

    • Applies the Apriori principle to prune candidate itemsets
    • Implements early termination if no frequent itemsets are found at a certain level
  4. Memory Efficiency:

    • Processes transactions in a streaming fashion
    • Only stores frequent itemsets, not all candidate itemsets
  5. Product Name Integration:

    • Loads product names from productList.txt
    • Maps product IDs to their corresponding names
    • Enhances visualization and result interpretation by displaying product names alongside IDs

Key Findings

Due to the extreme sparsity of the dataset, the analysis required using an absolute support count (10 transactions, equivalent to 0.0294%) instead of a percentage-based threshold. This approach yielded:

  • 6 frequent 2-itemsets
  • 12 association rules with high lift values (ranging from 23 to 32)
  • Item 132 (Denny 60 Watt Lightbulb) appearing in 2 different frequent itemsets, making it the most connected item
  • Interestingly, the top 3 most frequent items (277 "Great English Muffins", 1352 "Carrington Ice Cream", 846 "Nationeel Fudge Brownies") don't appear in any frequent itemsets
  • Enhanced interpretability through product name integration, revealing meaningful associations like "Carlson Low Fat Sour Cream" frequently purchased with "Big Time Apple Cinnamon Waffles"

Files in this Repository

  • main.py: Implementation of the Hybrid Apriori algorithm and analysis tools
  • Sales1998.txt: Raw transaction data (each line represents a transaction)
  • productList.txt: Product names corresponding to product IDs
  • frequent_itemsets.csv: Output file containing discovered frequent itemsets with product names
  • association_rules.csv: Output file containing generated association rules with product names
  • requirements.txt: List of Python dependencies

Installation & Setup

Prerequisites

  • Python 3.13 or higher
  • Required packages: numpy, pandas, matplotlib

Installation

  1. Clone this repository:
git clone https://github.com/ZamoRzgar/basket-analysis.git
cd basket-analysis
  1. Install required packages:
pip install -r requirements.txt

Usage

To run the analysis:

python main.py

The script will:

  1. Load and analyze the transaction data
  2. Apply the Hybrid Apriori algorithm with appropriate parameters
  3. Generate and display frequent itemsets and association rules
  4. Create visualizations of the results
  5. Save the findings to CSV files

Parameters

The algorithm uses the following default parameters:

  • Minimum support count: 10 transactions (0.0294%)
  • Minimum confidence: 0.05 (5%)
  • Maximum itemset length: 3

These parameters can be adjusted in the main() function to suit different analysis needs.

Visualizations

The implementation includes several visualization tools:

  • Support distribution for frequent itemsets
  • Top frequent items
  • Top association rules by lift
  • Item relationship networks

Results

The analysis reveals strong associations between specific item pairs, with lift values indicating that these associations are 23-32 times stronger than would be expected by random chance. These insights can be valuable for:

  • Product placement strategies
  • Targeted promotions and cross-selling
  • Inventory management
  • Customer behavior understanding

Future Work

Potential extensions to this project include:

  • Testing different support thresholds to balance pattern discovery and statistical significance
  • Implementing alternative algorithms like FP-Growth or Eclat for comparison
  • Incorporating temporal aspects to analyze how associations change over time
  • Applying clustering techniques to group similar items before running association rule mining

References

  1. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 487-499.
  2. Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.
  3. Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Zamo Rzgar Ahmed


If you found this project helpful, please consider giving it a star ⭐

About

Using Hybrid Apriori Algorithm for Basket Market Analysis

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

0