Published on Jan 21, 2021
Hello world, I'm Rodney Osodo, an undergraduate student at Jomo Kenyatta University of Agriculture and Technology in Kenya. I've been interested in quantum computing for a while now, and I'm excited to share what I learned from my most recent experience with it.
This is my Quantum Open Source Foundation project on building a quantum variational classifier using a heart attack dataset. The purpose of this project was to help me gain insight into the actual construction of a quantum model applied to real data. By sharing these insights, I hope to help many of you understand the dynamics of quantum machine learning that I grasped while doing this project. This post is the first in the series and covers exploratory data analysis and data preprocessing.
Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on a dataset so as to discover patterns, spot anomalies, test hypotheses and check assumptions. It is usually good practice in data science to explore the data first before getting your hands dirty and starting to build models.
My EDA process was short because, as alluded to earlier, I would like to focus on quantum machine learning rather than dwell on data science!
To start off, we imported the necessary libraries: pandas, numpy, matplotlib, seaborn and pandas profiling, and loaded our dataset:
data = pd.read_csv(DATA_PATH)
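For completeness, the imports assumed above would look roughly like this; the aliases pd, np, plt, sns and pp match how the libraries are used later in the post, and the DATA_PATH value below is a placeholder rather than the author's actual path.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp
DATA_PATH = "data/heart.csv"  # placeholder path to the raw CSV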
The original dataset can be found here.
We then checked the shape of the dataset and found that we have 14 columns: 13 features and one target variable. We also have 294 rows, meaning 294 datapoints that we will use to train and test our model. When we check the columns, we see the actual names in the heading. The explanation of each column is as follows:
print(data.columns)
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num '], dtype='object')
Attribute Information
When we look at the head of the data, we see some missing values labeled ?. We will fix this later. We then check the info of the data and see that most columns have dtype object instead of float64. We will also need to fix this, because our model can't understand an object dtype; it only works with numerical values such as floats and integers.
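A minimal sketch of those two checks, using standard pandas calls:
data.head()  # first few rows; the missing values appear as '?'
data.info()  # per-column dtypes; most columns are reported as object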
When we check the unique values of each column,
def check_unique(df):
    """
    Checks the unique values in each column
    :param df: The dataframe
    """
    for col in df.columns:
        unique = df[col].unique()
        print("Column: {} has unique values: {}\n".format(col, unique))
we see that sex, cp, fbs, exang, oldpeak, slope, ca, thal and num are categorical variables, while the others are continuous variables. This will help inform us when deciding how to fill the null values in each column.
When we look at the target distribution, we can see that the classes are almost balanced, since neither class makes up more than 2/3 of the dataset.
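A quick way to check this balance; note the trailing space in the raw column name, which we rename a little further below:
data['num '].value_counts()  # counts per target class ('num ' is renamed to 'num' later)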
Looking at the sex and age distributions, we see that most of the people in the dataset are male. We can also see that as you age, you become more prone to a heart attack up until the age of 57, where, interestingly, your chances start reducing. Age 54 is the peak at which most people are susceptible to a heart attack.
This pairplot shows us the distribution of every class.
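The original post shows these distributions as figures; a rough sketch of how similar plots could be generated with seaborn (the specific plot calls here are my assumption, not the author's code):
sns.countplot(x='sex', data=data)  # sex distribution
plt.show()
sns.histplot(data=data, x='age')   # age distribution
plt.show()
sns.pairplot(data)                 # pairwise distributions (works best once the '?' values are cleaned up below)
plt.show()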
First, we rename the target column 'num ' (note the trailing space) to 'num':
data = data.rename(columns={'num ': 'num'})
We saw earlier that there are missing values labeled ? in the dataset. One way of handling this is to change every ? in the data to np.nan.
def fix_missing_values(df):
    """
    Changes ? in the data to np.nan
    :param df: The dataframe
    :return df: Fixed dataframe
    """
    cols = df.columns
    for col in cols:
        for i in range(len(df[col])):
            if df.loc[i, col] == '?':
                df.loc[i, col] = np.nan
    return df
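We then apply the function to our dataframe (a usage sketch; the call itself is not shown in the original post). For reference, pandas' built-in replace can achieve the same thing in one line.
data = fix_missing_values(data)
# equivalent one-liner: data = data.replace('?', np.nan)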
We also need to change the data type of these columns from object to float64:
def change_dtype(df):
    """
    Changes the data type from object to float64
    :param df: The dataframe
    :return df: Fixed dataframe
    """
    cols = df.columns
    for col in cols:
        if df[col].dtype == 'O':
            df[col] = df[col].astype("float64")
    return df
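Again, a usage sketch (the call is not shown in the original post); pd.to_numeric(df[col], errors='coerce') would be a common alternative for this kind of cast.
data = change_dtype(data)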
We will delete the columns where more than half of the values are empty. This is because if we tried to fill them, they would contain more fabricated values than real ones.
For columns with continuous values where fewer than half of the entries are null, we can fix the gaps by replacing them with the mean, mode or median. In our case I used the mean for trestbps, chol and thalach. The fbs, restecg and exang columns we will fix with the mode value, since they take only a handful of discrete values:
def fix_missing_values(df):
    """
    Fixes the missing values by either deleting columns or filling values
    :param df: The dataframe
    :return df: Fixed dataframe
    """
    def delete_missing_values(df):
        """
        Keeps only the columns where fewer than half of the values are Null,
        dropping the rest (slope, ca and thal)
        """
        df = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'num']]
        return df

    def fill_with_mean(df):
        """
        Fills the NaN values with the mean value of the column
        """
        cols = ['trestbps', 'chol', 'thalach']
        for col in cols:
            df[col].fillna(value=df[col].mean(), inplace=True)
        return df

    def fill_with_mode(df):
        """
        Fills the NaN values with the mode value of the column
        """
        cols = ['fbs', 'restecg', 'exang']
        for col in cols:
            df[col].fillna(value=df[col].mode()[0], inplace=True)
        return df

    df = delete_missing_values(df)
    df = fill_with_mean(df)
    df = fill_with_mode(df)
    return df
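Note that this second fix_missing_values reuses the name of the earlier ?-replacement helper, so it should be defined (and called) after that step has already run. A usage sketch:
data = fix_missing_values(data)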
Finally, we drop any duplicate rows:
data.drop_duplicates(inplace=True)
One way of doing better and faster exploratory data analysis is to use pandas profiling. It returns an interactive report in HTML format, which makes it quick to analyse the data.
We first check the correlation of each column with the target variable:
data.drop('num', axis=1).corrwith(data['num']).plot(kind='bar', grid=True, figsize=(12, 8), title="Correlation with target")
We see that exang has the highest positive correlation, followed by oldpeak and then cp. Thalach has the highest negative correlation.
Based on the correlation of the features with the target variable, we chose the 4 most highly positively correlated features. We finish off by checking the pandas profiling report:
pp.ProfileReport(df=data, dark_mode=True, explorative=True)
From the overview, we are able to see that we have two warnings, which we will ignore because:
df_index has unique values: it is the index column, so it naturally has all unique values.
oldpeak has 188 (64.2%) zeros: the data is not normalized, hence most values are 0.
We then normalize the data using sklearn.preprocessing.MinMaxScaler to the range -2pi to 2pi. This is to ensure we utilize the Hilbert space appropriately, as we will be encoding the data into quantum states via rotation angles.
# imports needed by normalize_data
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
def normalize_data(dataPath="../../Data/Processed/data.csv"):
    data = pd.read_csv(dataPath)
    data = shuffle(data, random_state=42)
    X, Y = data[['sex', 'cp', 'exang', 'oldpeak']].values, data['num'].values
    # normalize the data
    scaler = MinMaxScaler(feature_range=(-2 * np.pi, 2 * np.pi))
    X = scaler.fit_transform(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
    return X_train, X_test, Y_train, Y_test
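A usage sketch, assuming the cleaned dataframe has been saved to the CSV path used as the default argument:
X_train, X_test, Y_train, Y_test = normalize_data()
print(X_train.shape, X_test.shape)  # roughly 70% / 30% of the 294 rows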