Identifying High-Risk Groups: Medical Costs

juliamtw20
Oct 20, 2024

Access the complete code on Kaggle.

Objective

In this project, I aimed to leverage the gathered data to benefit both companies, such as insurance providers, and individuals. By utilizing Clustering, I successfully grouped all individuals into 4 distinct clusters, each representing different risk levels. The main characteristics of each group were identified, providing valuable insights into their healthcare costs and potential risks. This analysis can help companies implement more targeted risk assessments and offer individuals a better understanding of their future healthcare costs.

For Individuals

Understanding Factors Driving Their Healthcare Costs: Business Meaning: By understanding the main contributors to their medical expenses, people can make more informed decisions about their health and lifestyle. This also allows them to plan for future medical expenses, potentially leading to better budgeting and financial planning.

For Businesses

Identifying High-Risk Groups: Business Meaning: Identifying groups (e.g., smokers, high BMI individuals) that are likely to have higher medical costs can help businesses adjust policies, design preventive health programs, or customize premium plans.

To achieve this objective, the analysis will follow a systematic approach that begins by exploring the data to uncover patterns and relationships, then delves deeper into statistical analysis and modeling to quantify the impact of each variable on healthcare costs. The following steps outline how this will be accomplished:

Data Cleaning and Transformation: Ensure the dataset is clean and reliable.
Exploratory Data Analysis (EDA): Grasp the initial insights into the variables and how they may relate to healthcare costs.
Statistical Analysis: Apply statistical tests and correlation analyses to evaluate the significance of the relationships.
Clustering Analysis: Group individuals into clusters based on similar characteristics.

Data Cleaning and Transformation

Let's start with a glimpse at our dataset.

The dataset contains 1,337 records and 7 columns: age, sex, bmi, number of children, smoking habit, region, and charges.

To ensure the dataset is clean, I handle duplicates, missing values, and outliers.

# Check for and remove duplicates
print(df[df.duplicated()])          # one duplicated row is found
df.drop_duplicates(inplace=True)
print(df[df.duplicated()])          # empty now: the duplicate has been removed

# Missing values
print(df.isnull().sum())            # there are no missing values

# Outliers (IQR rule on charges)
Q1 = df['charges'].quantile(0.25)
Q3 = df['charges'].quantile(0.75)
IQR = Q3 - Q1

# Flag rows whose charges fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
criteria = (df['charges'] < (Q1 - 1.5 * IQR)) | (df['charges'] > (Q3 + 1.5 * IQR))
print(df[criteria])          # inspect the flagged rows
df2 = df[~criteria].copy()   # keep only the non-outlier rows

Feature Encoding

Because the project includes a modeling step, I encode the non-numeric features so that they can be taken into consideration.

sex: Male = 0, Female = 1
smoker: No = 0, Yes = 1
region: one-hot encode

df3 = df2.copy()
df3['sex'] = df2['sex'].apply(lambda x: 1 if x== 'female' else 0)
df3['smoker'] = df2['smoker'].apply(lambda x: 1 if x=='yes' else 0)
df3 = pd.get_dummies(df3, columns=['region'], prefix='region')
df3

Now the encoded data looks like the following.

Understanding Factors Driving Medical Costs -- EDA (Exploratory Data Analysis) and Statistical Analysis

The main purpose of EDA is to understand the data's characteristics, distributions, and relationships.

Data Distribution

Understanding the distribution of the data helps to uncover patterns, identify potential biases, and ensure that the analysis results are accurate. It is particularly useful for recognizing the presence of outliers, skewness, and the overall shape of the data, all of which can influence the outcomes of statistical analysis and machine learning models.

# Distribution of all variables
import matplotlib.pyplot as plt
import seaborn as sns

numericVars = ['age', 'bmi', 'children', 'charges']
fig, ax = plt.subplots(nrows=2, ncols=4, figsize=(16, 6))
# Top row: histograms of the numeric variables
for n, var in enumerate(numericVars):
    sns.histplot(data=df2, x=var, kde=True, ax=ax[0, n])
# Bottom row: counts of the categorical variables
sns.countplot(x='sex', data=df2, ax=ax[1, 0], palette='Blues')
sns.countplot(x='smoker', data=df2, ax=ax[1, 1], palette='Blues')
sns.countplot(x='region', data=df2, ax=ax[1, 2], palette='Blues')
ax[1, 3].axis('off')  # the eighth panel is unused

fig.suptitle('Distribution of All Variables')
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.show()

In exploring the data distribution, several important patterns were identified. The age variable exhibited a roughly uniform distribution, with a notable concentration of individuals around 20 years old, while the BMI variable followed a near-normal distribution. In contrast, the charges variable was right-skewed due to the presence of high-cost individuals. The number of children was heavily skewed towards smaller family sizes. Additionally, while the dataset had a relatively even distribution across regions, smokers were outnumbered by non-smokers, creating an imbalance in that variable. Understanding these distributions provides a foundation for more detailed analysis, particularly in identifying high-risk groups and modeling medical costs.

Correlation Analysis

The primary aim of correlation analysis is to assess the strength and direction of relationships between variables, particularly how they may influence healthcare costs (charges), which is vital for identifying factors that contribute to higher medical costs.

The correlation coefficients revealed several notable relationships. For instance, there was a moderate positive correlation between age and charges (r = 0.44), indicating that as individuals age, their medical costs tend to increase. Smoking status showed an even stronger positive correlation with charges (r = 0.6), suggesting that smoking is associated with higher healthcare expenses.
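The coefficients above come from a correlation matrix over the encoded features. The sketch below shows one way such a matrix can be computed with pandas. The small DataFrame is invented illustrative data standing in for df3, not the real dataset, so its numbers will not match the values reported above.

```python
import pandas as pd

# Toy stand-in for the encoded DataFrame (df3); all values are invented for illustration.
toy = pd.DataFrame({
    'age':     [19, 25, 33, 41, 52, 60],
    'bmi':     [27.9, 22.1, 30.4, 28.0, 33.5, 25.8],
    'smoker':  [1, 0, 0, 1, 0, 1],
    'charges': [16000, 2500, 4400, 21000, 9800, 28000],
})

# Pearson correlation of every feature with charges, strongest first.
corr_with_charges = (
    toy.corr(method='pearson')['charges']
       .drop('charges')
       .sort_values(ascending=False)
)
print(corr_with_charges)
```

Passing the full matrix from `toy.corr()` to `sns.heatmap(..., annot=True)` is a common way to visualize all pairwise coefficients at once.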

Bivariate Analysis

Comparing every unique value in each variable by their charges, we can observe the following patterns:

Charges appear to increase with the person's age.
As the number of children increases, the charges also show a slight upward trend.
Whether an individual smokes or not causes a significant difference in charges.
Charges vary by region, with the Northern region costing more than the Southern region, while the Eastern region incurs higher charges compared to the Western region.
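A comparison like this can be reproduced by averaging charges within each unique value of a variable. The snippet below is a minimal sketch under that assumption. The helper name `mean_charges_by` and the toy rows are hypothetical and only illustrate the grouping step.

```python
import pandas as pd

# Toy rows with the same column names as the dataset; values are invented for illustration.
toy = pd.DataFrame({
    'smoker':  ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'region':  ['northeast', 'southwest', 'southeast',
                'northwest', 'southwest', 'northeast'],
    'charges': [16000.0, 2500.0, 4400.0, 21000.0, 9800.0, 28000.0],
})

def mean_charges_by(df, column):
    """Average charges for each unique value of `column`, highest first."""
    return df.groupby(column)['charges'].mean().sort_values(ascending=False)

print(mean_charges_by(toy, 'smoker'))
print(mean_charges_by(toy, 'region'))
```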

Statistical Significance

To further investigate whether these relationships are statistically significant, I applied several statistical tests:

Independent t-test (ttest_ind): This method was selected for analyzing the impact of the binary categorical variables 'sex' and 'smoker' on healthcare charges, since a t-test is the appropriate tool for checking whether the means of two independent groups differ significantly.
One-Way ANOVA (f_oneway): This test was used to assess the differences in charges across the categorical variable 'region'. ANOVA is suitable for this analysis because it allows for the comparison of means across multiple groups (in this case, four regions).
Pearson Correlation (pearsonr): This method was chosen for assessing the relationship between numerical variables such as 'age', 'bmi', and 'children' with healthcare charges. Pearson correlation is ideal for evaluating linear relationships between continuous variables.
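The three tests listed above are all available in `scipy.stats`. The sketch below shows how each one might be called. The data is synthetic (generated with a fixed seed so that charges rise with age and jump for smokers), so the printed p-values illustrate the workflow only and are not the results of this analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, f_oneway, pearsonr

# Synthetic stand-in data; the effect sizes below are assumptions for illustration.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age': rng.integers(18, 65, size=n),
    'smoker': rng.choice(['yes', 'no'], size=n, p=[0.2, 0.8]),
    'region': rng.choice(['northeast', 'northwest', 'southeast', 'southwest'], size=n),
})
toy['charges'] = (
    250 * toy['age']
    + 20000 * (toy['smoker'] == 'yes').astype(int)
    + rng.normal(0, 3000, size=n)
)

# Independent t-test: do smokers and non-smokers differ in mean charges?
t_stat, p_smoker = ttest_ind(
    toy.loc[toy['smoker'] == 'yes', 'charges'],
    toy.loc[toy['smoker'] == 'no', 'charges'],
)

# One-way ANOVA: do mean charges differ across the four regions?
groups = [g['charges'].to_numpy() for _, g in toy.groupby('region')]
f_stat, p_region = f_oneway(*groups)

# Pearson correlation: linear relationship between age and charges.
r_age, p_age = pearsonr(toy['age'], toy['charges'])

print(f"smoker t-test p-value: {p_smoker:.4g}")
print(f"region ANOVA p-value:  {p_region:.4g}")
print(f"age vs charges:        r = {r_age:.2f}, p = {p_age:.4g}")
```

A p-value below the chosen significance level (commonly 0.05) is read as evidence that the variable is related to charges.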

In summary, while sex does not significantly affect healthcare costs, smoking status, age, BMI, and the number of children are important factors. The analysis highlights the need for targeted strategies to address high healthcare costs, particularly for smokers and older individuals, as well as the influence of regional disparities.

Identifying High-Risk Groups -- Clustering

From the previous correlation analysis and the statistical significance testing results, I selected the variables most related to the charges. Notably, I excluded the region dummies from the analysis as they introduced noise into the clustering task, potentially skewing the results and making it harder to identify distinct high-risk groups based on more relevant factors.

To group individuals effectively based on their likelihood of incurring high medical costs, I employed K-Means Clustering. Before implementing the clustering algorithm, I utilized the Elbow Method to determine the optimal number of clusters. This technique involves plotting the Within-Cluster Sum of Squares (WCSS) against the number of clusters. By examining the 'elbow' point in the graph, where the rate of decrease in WCSS sharply changes, I aimed to identify the most suitable number of clusters that balances the complexity of the model with the variance explained by the clusters.
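For reference, the quantity plotted by the Elbow Method can be written out as follows. The notation is my own: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is that cluster's centroid, and $x_i$ is a scaled feature vector. scikit-learn exposes this value as the `inertia_` attribute of a fitted `KMeans` model.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

WCSS always decreases as k grows, which is why the elbow (the point of diminishing returns) is used rather than the minimum.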

# K-Means Clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Scale the Features 
features = df3[['children', 'age', 'bmi', 'smoker']]
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Elbow Method: record the WCSS (inertia) for k = 1..10
wcss = []
for i in range(1, 11):
    # Fit once per k (the original fitted each model twice, which was redundant)
    kmeans = KMeans(n_clusters=i, n_init=1).fit(features_scaled)
    wcss.append(kmeans.inertia_)

# Plot WCSS
plt.figure(figsize=(4, 2))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

To determine the optimal number of clusters, we select the value of k at the “elbow”, the point after which the inertia starts decreasing in a roughly linear fashion. Based on the plot, I conclude that the optimal number of clusters for the data is 4.

The K-Means algorithm partitions the data into distinct clusters, helping to uncover patterns and groupings within the dataset that are indicative of high-risk individuals likely to incur higher healthcare costs. The number of individuals in each cluster is shown on the right.
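Cluster sizes like these can be obtained by fitting K-Means with the chosen k and counting the labels. The following is a minimal sketch on randomly generated stand-in data, not the real dataset. `random_state` and `n_init=10` are my additions to make the toy run reproducible. The original code does not set a seed, so its cluster numbering can differ between runs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Random stand-in data with the same four features used for clustering.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'children': rng.integers(0, 5, size=120),
    'age': rng.integers(18, 65, size=120),
    'bmi': rng.normal(30, 6, size=120).round(1),
    'smoker': rng.choice([0, 1], size=120, p=[0.8, 0.2]),
})

# Scale first so that no feature dominates the distance computation.
features_scaled = StandardScaler().fit_transform(toy)

# Fit K-Means with k = 4 and attach the label of each row.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
toy['cluster'] = kmeans.fit_predict(features_scaled)

# Number of individuals in each cluster.
cluster_sizes = toy['cluster'].value_counts().sort_index()
print(cluster_sizes)
```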

Visualizations

Charges of Each Cluster

By visualizing charges, we can identify which groups incur higher medical costs, providing insights into the potential healthcare needs and risk profiles of different populations. This information can guide healthcare providers and insurers in developing targeted strategies for managing costs and improving care.

# plot_df is df2 with the cluster label attached (it is built in the clustering code further below)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 3))
# Left: average charges per cluster
ax[0].set_title('Avg. Charges in Each Cluster')
plot_df.groupby('cluster')['charges'].mean().plot(kind='bar', ax=ax[0])
ax[0].set_ylabel('charges')
# Right: spread of charges per cluster
sns.boxplot(data=plot_df, x='cluster', y='charges', ax=ax[1], color='#1f77b4')
ax[1].set_title('Charges in Each Cluster (Boxplot)')

We can see that Cluster 3 holds the highest average charges, indicating that individuals in this cluster likely have more complex health needs or higher medical expenses. This may suggest a concentration of high-risk individuals or those with chronic conditions within this group.

In contrast, Cluster 0 exhibits the lowest average charges, which could indicate that this cluster comprises individuals with generally better health profiles or lower healthcare utilization rates.

Data Distribution for Clusters

To gain a deeper understanding of the characteristics defining each cluster, we analyze the data distribution across various features. This visualization allows us to observe patterns and variations in key variables, such as age, BMI, and smoking status, within each segment. By understanding these distributions, we can better interpret the underlying factors that contribute to the healthcare costs associated with each cluster, ultimately leading to more informed decision-making in healthcare management and policy development.

# Fit the final model with k = 4. Cluster labels are arbitrary and may change
# between runs unless random_state is set.
kmeans = KMeans(n_clusters=4, n_init=1)
df3['cluster'] = kmeans.fit_predict(features_scaled)
plot_df = df2.copy()
plot_df['cluster'] = df3['cluster'].astype('category')

# Assumed definition: charVars is not shown in this excerpt
charVars = ['sex', 'smoker', 'region']

fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(16, 4))
fig.suptitle('Clusters in Each Variable')

for i in range(0, 3):
    # Top row: counts of each categorical value per cluster
    plot_df.groupby('cluster')[charVars[i]].value_counts().unstack(fill_value=0).plot(kind='bar', ax=ax[0, i], cmap='Paired')
    # Bottom row: numeric variable vs. charges, colored by cluster
    plot_df.plot(kind='scatter', x=numericVars[i], y='charges', ax=ax[1, i], c='cluster', cmap='RdBu', alpha=0.6)
    ax[0, i].set_ylabel('Count')
    ax[1, i].set_xlabel(numericVars[i])

plt.subplots_adjust(wspace=0.3, hspace=0.3)

The distribution of key demographic and health-related variables across the identified clusters reveals several significant patterns:

Gender Distribution: Both female and male participants are uniformly distributed across all four clusters, indicating no substantial gender bias in the clustering process.
Smoking Status: Notably, all smokers are concentrated within Cluster 3. This suggests a potential association between smoking and the characteristics (high-risk) of this cluster.
Age Distribution: Cluster 0 primarily consists of a younger population; recall that Cluster 0 also has the lowest charges. In contrast, Cluster 1 has a significant concentration of older individuals, which could correlate with higher healthcare expenses due to age-related health issues.
BMI Distribution: In terms of Body Mass Index (BMI), Cluster 3 shows a concentration of individuals with BMI values ranging from 20 to 30, which runs counter to intuition. The other clusters do not exhibit any pronounced patterns in BMI.
Number of Children: Cluster 2 includes a significant proportion of individuals with more than two children.

Looking at the results cluster by cluster, keep in mind that Clusters 0 and 3 have the lowest and highest charges, respectively.

The plots show that Cluster 3 has the highest average charges, comprising all smokers and individuals with the lowest mean BMI. This suggests a strong association between smoking status and elevated healthcare costs, potentially indicating higher health risks associated with smoking. On the other hand, Cluster 0 features the lowest average charges and consists predominantly of the youngest population. This may imply that younger individuals tend to have lower healthcare costs. However, the average costs in Cluster 3 are relatively higher despite its younger demographic, indicating that smoking may lead to greater healthcare expenses than aging alone.
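A per-cluster summary table is a compact way to check statements like these against the numbers. The sketch below assumes a DataFrame shaped like plot_df with numeric features and a cluster column. The rows and the resulting profile are invented for illustration and are not the project's actual results.

```python
import pandas as pd

# Toy stand-in for plot_df with cluster labels already assigned; values are invented.
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 2, 2, 3, 3],
    'age':     [19, 23, 55, 61, 38, 42, 30, 47],
    'bmi':     [24.5, 31.0, 29.8, 33.1, 27.4, 35.2, 26.0, 28.3],
    'smoker':  [0, 0, 0, 0, 0, 0, 1, 1],
    'charges': [1800, 2300, 11500, 13200, 6900, 7400, 19500, 23800],
})

# One row per cluster: size, average age and BMI, share of smokers, average charges.
profile = toy.groupby('cluster').agg(
    n=('charges', 'size'),
    mean_age=('age', 'mean'),
    mean_bmi=('bmi', 'mean'),
    smoker_share=('smoker', 'mean'),
    mean_charges=('charges', 'mean'),
).round(2)
print(profile)
```

Sorting the table with `profile.sort_values('mean_charges')` orders the clusters from lowest to highest cost, which makes the risk ranking explicit.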

Conclusion

In this project, we conducted a comprehensive analysis of the factors influencing insurance charges, utilizing various statistical methods and clustering techniques to uncover significant patterns within the dataset. Our exploration revealed several critical insights:

Statistically Significant Variables: Through t-tests, ANOVA, and Pearson correlation, we identified key variables that significantly affect insurance charges, particularly age, smoking status, and BMI. The analysis highlighted the stark contrast in charges between smokers and non-smokers, confirming the substantial financial impact of smoking on medical charges.
Cluster Analysis: By employing K-means clustering, we effectively categorized individuals into distinct groups based on their characteristics and associated charges. Notably, our clustering results revealed that Cluster 3 comprises primarily smokers and individuals with higher average charges, while Cluster 0 represents the younger population with the lowest charges. This segmentation aids in understanding high-risk groups and tailoring insurance offerings accordingly.

By recognizing the characteristics of high-risk groups, insurers can implement more personalized and equitable pricing models that accurately reflect individual risk profiles. As the dataset expands, clustering techniques can help reduce dimensionality and minimize noise caused by complex data, thereby yielding more precise and actionable insights.

The patterns identified through this analysis can be utilized to assess the risk levels of future individuals entering the system, allowing insurers to inform them of potential medical costs they may face down the line. Moving forward, further investigations could explore the integration of additional variables, such as lifestyle factors and socioeconomic status, to enhance the predictive power of the models. Moreover, expanding the dataset to include a broader demographic could improve the generalizability of the findings, ensuring they remain relevant across diverse populations.
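As a rough illustration of how a future individual could be assigned to one of the existing clusters, the sketch below reuses a fitted scaler and K-Means model to label a new record. Everything here is hypothetical: the training rows are random stand-in data, and `assign_cluster`, `FEATURES`, and `new_person` are names introduced for this example. The key design point is that the new record must be transformed with the scaler fitted on the training data rather than a freshly fitted one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

FEATURES = ['children', 'age', 'bmi', 'smoker']

# Random stand-in training data; not the real dataset.
rng = np.random.default_rng(7)
train = pd.DataFrame({
    'children': rng.integers(0, 5, size=150),
    'age': rng.integers(18, 65, size=150),
    'bmi': rng.normal(30, 6, size=150).round(1),
    'smoker': rng.choice([0, 1], size=150, p=[0.8, 0.2]),
})

# Fit the scaler and the clustering model once, on the training data.
scaler = StandardScaler().fit(train[FEATURES])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(scaler.transform(train[FEATURES]))

def assign_cluster(person):
    """Return the cluster label for one new individual given as a dict."""
    row = pd.DataFrame([person], columns=FEATURES)
    return int(kmeans.predict(scaler.transform(row))[0])

new_person = {'children': 2, 'age': 45, 'bmi': 31.2, 'smoker': 1}
print(assign_cluster(new_person))
```

The returned label only becomes a risk level once it is matched against that cluster's observed charges, as in the cluster comparison earlier in the post.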