The application of data mining techniques for knowledge discovery holds great promise. Data mining algorithms have been successfully applied in many application areas, including retail and telecommunications. However, applying these methods in the medical domain is challenging because the data sets are often very large and complex, with numerous rare variables such as diagnoses, procedures and drugs. These variables are rare because most people are healthy (and therefore have few diagnoses, procedures and drugs), and those who are not suffer from a wide variety of conditions (so relatively few people share the same set of conditions).
The data used in this assignment is claims data. Claims data is generated when hospitals, pharmacies and other health care providers send claims to third-party payers to receive reimbursement for their services. The claims data includes information on diagnoses, procedures and drug prescriptions, as well as place of service and the patient’s age, gender and ZIP code. A collection of members’ claims has the benefit of giving a “bird’s eye view” of patients’ health care.
Diabetes is a disease in which the body does not produce or properly use insulin. Diabetes is a chronic disease, and it is believed that over 18 million Americans suffer from it. Effective disease management of this group, and advancing understanding of the disease, is therefore of great importance.
The data (members.xlsx) used in this study contains information on over 17,000 diabetic patients. We have data on their diagnoses, procedures and drugs over a 12-month observation period (as well as some other details from the observation period), together with their overall health care costs in the 12 months following the observation period, called the result period. We will apply association rules to gain insights into diabetes, risky health care patterns and exploratory data mining in general.
Association Rule Analysis
1. For each of the diagnosis columns, count the number of members with the diagnosis.
a. Provide a readable table (translating the diagnosis codes into actual conditions) as an exhibit of the top 10 diagnoses and their counts.
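The counting step in question 1 can be sketched outside XLMiner as follows. This is a minimal illustration, assuming each diagnosis column holds a 0/1 indicator per member; the three toy rows are made up, and `count_diagnoses` is a hypothetical helper, not part of the assignment materials.

```python
from collections import Counter

# Toy stand-in for members.xlsx: one dict per member, 1 = diagnosis present.
# ASSUMPTION: diagnosis columns are 0/1 flags. The values are invented.
members = [
    {"Diag_DD0002": 1, "Diag_DD0046": 1, "Diag_DD0266": 0},
    {"Diag_DD0002": 0, "Diag_DD0046": 1, "Diag_DD0266": 0},
    {"Diag_DD0002": 1, "Diag_DD0046": 1, "Diag_DD0266": 1},
]


def count_diagnoses(rows, prefix="Diag_"):
    """Count, for each diagnosis column, the members flagged with it."""
    counts = Counter()
    for row in rows:
        for column, flag in row.items():
            if column.startswith(prefix) and flag:
                counts[column] += 1
    return counts


diagnosis_counts = count_diagnoses(members)
# most_common(10) returns the ten largest counts, highest first.
top_diagnoses = diagnosis_counts.most_common(10)
```

Translating the codes into condition names would still require the code dictionary supplied with the data.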
2. Run the association rule algorithm in XLMiner, on diagnosis information only (columns Diag_DD0002 through Diag_DD0266 – leave out Diag_DD0000), with all the default parameter settings.
a. What is the average confidence of the rules created?
b. What is the average lift of the rules created?
c. Briefly explain the reasons behind the average confidence and the average lift.
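As a reminder of what the three rule statistics measure, the sketch below computes them for a single rule using the standard textbook definitions. It is an illustration only: the transactions are invented, `rule_metrics` is a hypothetical helper, and support is reported as a count of members to match the count-style thresholds (174 and 17) used in this assignment.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support (count), confidence and lift for antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    # Confidence: share of antecedent holders who also have the consequent.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    # Lift: confidence relative to the consequent's overall frequency.
    base_rate = n_consequent / n if n else 0.0
    lift = confidence / base_rate if base_rate else 0.0
    return {"support": n_both, "confidence": confidence, "lift": lift}


# Invented toy data: each set lists one member's diagnoses.
transactions = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
metrics = rule_metrics(transactions, {"A"}, {"B"})
```

A lift above 1 means the consequent is more common among members with the antecedent than in the population as a whole.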
3. One of the difficulties in creating good rules from medical data is that most diseases are rare; setting the minimum support too high will therefore produce uninteresting rules, if any at all. Run the association rule algorithm again with the default confidence setting, but change the minimum support setting, first to 174 and then to 17.
a. How many rules were created when the minimum support is 174?
b. How many rules were created when the minimum support is 17?
4. Analyze the three rules that have the maximum lift (you may need to order your rules) when the minimum support is set to 174. Provide a brief interpretation of these three rules and explain why all of them have the same support, confidence and lift.
5. Delete Diag_DD0046 from the data (or otherwise exclude it from the analysis). Rerun the Association Rules algorithm with a) default settings, b) minimum support set to 174, c) minimum support set to 17.
a. How many rules were created when the minimum support is 174?
b. How many rules were created when the minimum support is 17?
6. Out of the latter two runs, select the rule with the highest lift ratio. Explain the rule in words.
a. Hypothetically, if you were a doctor and a diabetic patient walked in already diagnosed with {the rule's antecedent diagnosis}, how could the rule potentially guide your work (with all the simplifying assumptions needed to answer this question)?
7. One of the main reasons behind the interest in data mining in health care is the hope that, through intervention and prevention, one can help reduce health care costs by identifying early those patients who are at risk of high health care costs. To use association rules toward this goal, it is not enough to run them on the whole data set, as nothing there distinguishes costly patients from non-costly ones. To distinguish high-risk patients from low-risk patients, we need to identify rules that have good support and high confidence in the high-cost population, but low support in the not-high-cost population. In particular, we are interested in identifying rules of the type:
{group of diagnosis codes} -> {High cost in a future period}
The variable TA2 contains the overall cost in the year following the observation period. Define a new variable that equals one if the overall cost is ≥ $40,000, and zero otherwise.
Run the Association Rule algorithm again using all diagnosis variables (continue to exclude the diabetes diagnosis and Diag_DD0000) and the new variable. Set the support as low as possible (but no smaller than 15) and keep the confidence at the default setting. Sort the resulting rules by “consequent (c)”, and identify rules of the form above.
a. What was your support setting?
b. How many rules did you identify?
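The high-cost indicator from question 7, and the evaluation of one candidate rule of the form {group of diagnosis codes} -> {High cost in a future period}, can be sketched as below. The records, the diagnosis labels "X" and "Y", and both helper functions are hypothetical; only the TA2 column name and the $40,000 threshold come from the assignment text.

```python
HIGH_COST_THRESHOLD = 40_000  # dollars, as stated in the assignment

# Invented toy records: a set of diagnoses and the result-period cost (TA2).
members = [
    {"diagnoses": {"X", "Y"}, "TA2": 52_000},
    {"diagnoses": {"X", "Y"}, "TA2": 41_000},
    {"diagnoses": {"X"}, "TA2": 3_000},
    {"diagnoses": {"Y"}, "TA2": 40_000},
    {"diagnoses": set(), "TA2": 900},
]


def add_high_cost_flag(rows, threshold=HIGH_COST_THRESHOLD):
    """Return copies of the rows with high_cost = 1 if TA2 >= threshold."""
    return [{**row, "high_cost": int(row["TA2"] >= threshold)} for row in rows]


def rule_to_high_cost(rows, antecedent):
    """Support (count), confidence and lift for antecedent -> high cost."""
    antecedent = set(antecedent)
    matching = [row for row in rows if antecedent <= row["diagnoses"]]
    hits = sum(row["high_cost"] for row in matching)
    base_rate = sum(row["high_cost"] for row in rows) / len(rows)
    confidence = hits / len(matching) if matching else 0.0
    lift = confidence / base_rate if base_rate else 0.0
    return {"support": hits, "confidence": confidence, "lift": lift}


flagged = add_high_cost_flag(members)
metrics = rule_to_high_cost(flagged, {"X", "Y"})
```

Note that a cost of exactly $40,000 is flagged as high cost, because the definition uses ≥ rather than >.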
8. Create an exhibit that summarizes the rules (translate the diagnosis codes into actual conditions), their support, confidence and lift.
a. Briefly discuss the main characteristics of the rules.
9. Renal Failure is in general a very expensive condition to treat, as either the patient needs to be treated with dialysis or undergo a transplant, which is a costly surgery. It is therefore not surprising that some of the rules above include Renal Failure.
a. In the population of just over 17,000, how many suffer from renal failure?
b. Out of those that suffer from renal failure, how many have high costs following the observation period?
c. Based on the above answers, why did the Association Rule Analysis not give us the rule:
{Renal Failure} -> {High Cost}?