Chandan Rajpurohit

An Artist With Technical Skills

As the world shifts towards the future, data has become a driving force.

Data mining and its applications regularly yield insights into the world and its nature. We are able to predict various outcomes with utmost confidence.

Association rules have greatly influenced sales.

Let’s dive into the concept of association rules in data mining.

What is an Association Rule?

Association rule mining uses the objects or items in databases and other information repositories to find frequent patterns, associations, correlations, or structures.

In association rule mining we search for relationships among items, associating one item with another.

It can help you find, for example, the likelihood that certain items are bought together.

The rule can be stated as “Given that someone has bought item x, they are likely to buy item y”.
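To make "likely" concrete, a rule x -> y is usually scored by its support count and its confidence. The sketch below is a minimal illustration on hypothetical toy baskets (the items and the helper names `support_count` and `confidence` are assumptions for this example); the confidence formula matches the one used in the code later in this post, count(x and y together) divided by count(x).

```python
# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y, transactions):
    """Confidence of the rule x -> y: count(x and y together) / count(x)."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x, y = {"bread"}, {"milk"}
print(support_count(x | y, transactions))  # baskets containing both bread and milk
print(confidence(x, y, transactions))      # share of bread baskets that also contain milk
```

Here 2 of the 3 baskets containing bread also contain milk, so the rule bread -> milk has a confidence of about 67%.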

What is the Apriori Algorithm?

The Apriori algorithm solves the frequent itemset problem. It determines which combinations of items occur together by analyzing a data set.

It is at the core of various data mining problems. It is also used to find association rules in basket-item relations.
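As a rough illustration of the level-wise idea, here is a compact sketch that grows frequent itemsets one item at a time, using the common join-and-prune formulation (a candidate is kept only if all of its smaller subsets are already frequent). The function name `apriori`, the toy baskets, and the threshold are assumptions for this example, and the candidate generation differs from the full listing under "Code".

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count each candidate and keep only those meeting the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy data, not the solved example from this post.
baskets = [
    ["i1", "i2", "i3"],
    ["i1", "i2"],
    ["i1", "i3"],
    ["i2", "i3"],
    ["i1", "i2", "i3", "i4"],
]
result = apriori(baskets, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 2, item i4 is dropped at the first level, and the largest frequent itemset is {i1, i2, i3}, which appears in two baskets.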

Code

from itertools import combinations


def getItemsetCountPairs(itemsets_): #takes arg like this [('i1', 'i4'),('i2','i3')]
    ##print("inside getcandidate")
    Candidate_ = {itemset_:0 for itemset_ in itemsets_}
    ##print("\tCandidate: ", Candidate_)
    for itemset_ in itemsets_:        # itemset is a tuple
        ##print("\titemset: ",itemset_)
        for transaction in dataset:
            ##print("\t\tdataset[transaction]: ", dataset[transaction])
            ##print("\t\t",itemset_,"is subset of", set(dataset[transaction]),"?")

            if str == type(itemset_):
                ##print("\t\tSingleton set")
                ##print("\t\t", set([itemset_]).issubset(set(dataset[transaction])))
                if set([itemset_]).issubset(dataset[transaction]):
                    Candidate_[itemset_] += 1
            else:
                ##print("\t\t", set(itemset_).issubset(set(dataset[transaction])))
                if set(itemset_).issubset(dataset[transaction]):
                    Candidate_[itemset_] += 1

    return Candidate_    #returning something like this {('i1','i2'):2 ,('i2','i3'):3}


def getShortlistedPairs(Candidate_):
    Candidate_2 = {}
    for itemset_ in Candidate_:
        SC = Candidate_[itemset_]
        if SC >= min_SC:
            Candidate_2[itemset_] = Candidate_[itemset_]

    return Candidate_2


def getItems(itemsets_):
    items =[]
    for itemset in itemsets_:
        if tuple == type(itemset):
            for item in itemset:
                if item not in items:
                    items.append(item)
        elif str == type(itemset):
            if itemset not in items:
                items.append(itemset)
    return  items


def genAssociations(itemsets_):
    associations = []

    if type(itemsets_[0]) == str:
        itemset_set = set(itemsets_)  # because only one itemset
        for i in range(1,len(itemset_set)):
            As_list = list(combinations(itemset_set,i))
            for A in As_list:      # if bug itemset_set -> itemsets_
                A_set = set(A)
                B_set = itemset_set - A_set
                associations.append([A_set,B_set])
        return  associations


    for itemset in itemsets_:
        itemset_set = set(itemset)
        for i in range(1, len(itemset_set)):
            As_list = list(combinations(itemset_set,i))

            for A in As_list:  # if bug itemset_set -> itemsets_
                A_set = set(A)
                B_set = itemset_set - A_set
                associations.append([A_set, B_set])
    return associations

def getSupportCount(itemset_):
    supportCount = 0
    for transaction in dataset:
        if itemset_.issubset(dataset[transaction]):
            supportCount +=1
    return supportCount

def getConfidence(A_,B_):
    return getSupportCount(A_|B_)/getSupportCount(A_)



n_Ts = int(input("Enter no. of transactions: "))
dataset = {"T"+str(_+1):[] for _ in range(n_Ts)}

for i in range(1,n_Ts+1):
    items = input("Enter items for T{}: ".format(i)).split()
    for item in items:
        dataset["T"+str(i)].append(item)

min_SC = int(input("Enter Minimum Support Count: "))
confidence_threshold = int(input("Enter Confidence Threshold % : "))

itemsets = []
for key in dataset:           # Identifying Itemsets
    for item in dataset[key]:
        if item not in itemsets:
            itemsets.append(item)

##print(itemsets)
Candidate = getItemsetCountPairs(itemsets)
##print(Candidate)
Candidate = getShortlistedPairs(Candidate)
##print("--------------------------------------")
##print("After shortlisting")
##print(Candidate)
#Candidate = getCandidate([('i1', 'i2'), ('i2','i3')])
#print(Candidate)
Candidate_old = Candidate
no_of_items_in_itemset = 1
while max(Candidate.values())>=min_SC :
    ##print("___________________________________________________________")
    Candidate_old = Candidate
    ##print("before shortlisting: ", Candidate)
    Candidate = getShortlistedPairs(Candidate)
    ##print("After shortlisting", Candidate)

    no_of_items_in_itemset += 1
    ##print(Candidate.keys())
    items = getItems(Candidate.keys())
    ##print("items=", items)
    if len(items) < no_of_items_in_itemset:
        Candidate_old = Candidate
        break
    itemsets = list(combinations(items,no_of_items_in_itemset))
    ##print("itemsets:",itemsets)

    Candidate = getItemsetCountPairs(itemsets)



##print("Final", Candidate_old)

frequent_sets = list(Candidate_old.keys())

associations = genAssociations(frequent_sets)

#print(associations)
#print(len(associations))

confidences = []
confidence_percentages = []

for association in associations:
    A,B = association
    confidences.append(getConfidence(A,B))

#print("confidences",confidences)

true_rules_indexes =[]

for i in range(len(confidences)):
    if confidences[i]*100 > confidence_threshold:
        true_rules_indexes.append(i+1)    # incrementing by 1 for display

#print(true_rules_indexes)

############ Displaying final output #############
print("\nFrequent itemset(s) are: ")
for itemset_ in Candidate_old:
    print("itemset: {"+str(itemset_).strip("()")+"} support count:", Candidate_old[itemset_])
print()
print("Sr. No.\tAssociation Rule\tSupport Count\tConfidence\tConfidence %")
for i,association,confidence in zip(range(1,len(associations)+1),associations,confidences):
    A_,B_ = association
    sc = getSupportCount(A_|B_)    # use this row's rule, not the leftover A, B from the earlier loop
    print("{}\t\t{}\t\t{}\t\t\t\t{}\t\t\t{}".format(i,str(A_).strip("{}")+"->"+str(B_).strip("{}"), sc, confidence,confidence*100))
print()
print("If the minimum confidence threshold is {} (Given),\n"
      "then only the rules {} are the output and \n"
      "final association rules generated which are strong.".format(confidence_threshold,str(true_rules_indexes).strip("[]")))

Output

output

Solved Example

Consider the following problem with minimum support = 60% and minimum confidence = 80%.

Problem

Solution

Minimum support = 60%, i.e. minimum support count = 3.
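The conversion from a percentage to a support count can be sketched as follows. The helper name `min_support_count` and the figure of 5 transactions are assumptions for illustration, since the problem table is not reproduced here; 5 is simply one total for which 60% gives 3.

```python
def min_support_count(percent, n_transactions):
    """Smallest whole number of transactions that is at least `percent`% of the total."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (percent * n_transactions + 99) // 100


# Assumption: the database holds 5 transactions.
print(min_support_count(60, 5))
```

Rounding up matters when the percentage does not divide evenly: an itemset must appear in at least that many whole transactions to count as frequent.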

Step 1: Scan D (the database) for the count of each candidate.

The candidate list is {A,B,C,D,E,K}. Find the support count of each candidate.

step 1

Step 2: Compare candidate support count with minimum support count.

step 2

Step 3: Generate candidate

step 3

Step 4: Scan D for count of each candidate and find the final support.

step 4

Step 5: Compare candidate support count with minimum support count.

step 5

Step 6: Generate candidate

step 6

Step 7: Scan D for count of candidate.

step 7

Step 8: Compare candidate support count with minimum support count.

step 8

Step 9: So the data contains the frequent itemset {A,B,D}.

Therefore, the association rules that can be generated from this frequent itemset are as follows:

step 9

If the minimum confidence threshold is 80% (given), then only rules 3, 4, and 6 are output; these are the strong association rules.
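A three-item frequent itemset such as {A,B,D} yields six candidate rules, one for each way of splitting it into a non-empty left side and a non-empty right side. The sketch below enumerates them; the helper name `candidate_rules` is an assumption, and the numbering it prints follows sorted order, which may not match the numbering used in the example above.

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every (antecedent, consequent) split of `itemset` into two non-empty parts."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


for number, (lhs, rhs) in enumerate(candidate_rules({"A", "B", "D"}), start=1):
    print(number, ",".join(lhs), "->", ",".join(rhs))
```

Each of these rules is then scored with its confidence, and only those that meet the threshold are kept as strong rules.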

Practice with more sets of examples.

THIS IS THE WAY !!!!

Happy Learning 🙂

