Working with files

Working with files#

For most of your health data science work you will be working with larger datasets contains tens or hundreds of features. Unless your dataset is very large the storage medium is likely to be a file. Here we will take a look at how to read in the file from both a local storage, URLs and compressed archives. We’ll then take a short look at how to summarise whole or subsets of a DataFrame

Reading data into pandas from a flat file#

A common task is to work with data stored in one or more files. For example in a Comma Seperated Value (CSV) or other type of delimited (e.g. tab or pipe) file.

There are a number of scenarios you may encounter:

The data file is held locally on your machine (or network drive)
The data file is accessed via a URL (e.g. it is located in GitHub or hosted on a third party website.)
The data file is compressed e.g. in a .zip format

The good news is that reading from a local directory and reading from a remote URL is identical in pandas. In both cases we can use the pd.read_csv() function specifying either the local path to the file or the url.

As an example let’s read in the famous Wisonsin Breast Cancer dataset from Github.

# Wisconsin breast cancer dataset URL
url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
      + 'hpdm139-datasets/main/wisconsin.csv'

# read into dataframe
df = pd.read_csv(url, index_col='id')

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 Unnamed: 0               569 non-null    int64  
 diagnosis                569 non-null    object 
 radius_mean              569 non-null    float64
 texture_mean             569 non-null    float64
 perimeter_mean           569 non-null    float64
 area_mean                569 non-null    float64
 smoothness_mean          569 non-null    float64
 compactness_mean         569 non-null    float64
 concavity_mean           569 non-null    float64
 concave points_mean      569 non-null    float64
symmetry_mean            569 non-null    float64
fractal_dimension_mean   569 non-null    float64
radius_se                569 non-null    float64
texture_se               569 non-null    float64
perimeter_se             569 non-null    float64
area_se                  569 non-null    float64
smoothness_se            569 non-null    float64
compactness_se           569 non-null    float64
concavity_se             569 non-null    float64
concave points_se        569 non-null    float64
symmetry_se              569 non-null    float64
fractal_dimension_se     569 non-null    float64
radius_worst             569 non-null    float64
texture_worst            569 non-null    float64
perimeter_worst          569 non-null    float64
area_worst               569 non-null    float64
smoothness_worst         569 non-null    float64
compactness_worst        569 non-null    float64
concavity_worst          569 non-null    float64
concave points_worst     569 non-null    float64
symmetry_worst           569 non-null    float64
fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 146.7+ KB

df.shape

(569, 32)

Now let’s read in the same file, but this time it is compressed in .zip format.

url = 'https://raw.githubusercontent.com/health-data-science-OR/' \
      + 'hpdm139-datasets/main/wisconsin.zip'
df = pd.read_csv(url, index_col='id')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 Unnamed: 0               569 non-null    int64  
 diagnosis                569 non-null    object 
 radius_mean              569 non-null    float64
 texture_mean             569 non-null    float64
 perimeter_mean           569 non-null    float64
 area_mean                569 non-null    float64
 smoothness_mean          569 non-null    float64
 compactness_mean         569 non-null    float64
 concavity_mean           569 non-null    float64
 concave points_mean      569 non-null    float64
symmetry_mean            569 non-null    float64
fractal_dimension_mean   569 non-null    float64
radius_se                569 non-null    float64
texture_se               569 non-null    float64
perimeter_se             569 non-null    float64
area_se                  569 non-null    float64
smoothness_se            569 non-null    float64
compactness_se           569 non-null    float64
concavity_se             569 non-null    float64
concave points_se        569 non-null    float64
symmetry_se              569 non-null    float64
fractal_dimension_se     569 non-null    float64
radius_worst             569 non-null    float64
texture_worst            569 non-null    float64
perimeter_worst          569 non-null    float64
area_worst               569 non-null    float64
smoothness_worst         569 non-null    float64
compactness_worst        569 non-null    float64
concavity_worst          569 non-null    float64
concave points_worst     569 non-null    float64
symmetry_worst           569 non-null    float64
fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 146.7+ KB

Describing features and making selections#

Now that you are working with a larger dataset you will want to summarise it. Good news pandas makes this very easy! You can use Dataframe.describe() to quickly summarise data. The result of the method .decribe() is a DataFrame that contains summary statistics or each variable. By default the index of the results Dataframe is the summary statistic descriptor.

Before we do that we will drop the ‘unnamed: 0’ column as it is not useful.

# drop the 'unnamed' column; axis = 1 for columns.
df = df.drop(labels=['Unnamed: 0'], axis=1)

df.describe()

	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	symmetry_mean	fractal_dimension_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
count	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	...	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000	569.000000
mean	14.127292	19.289649	91.969033	654.889104	0.096360	0.104341	0.088799	0.048919	0.181162	0.062798	...	16.269190	25.677223	107.261213	880.583128	0.132369	0.254265	0.272188	0.114606	0.290076	0.083946
std	3.524049	4.301036	24.298981	351.914129	0.014064	0.052813	0.079720	0.038803	0.027414	0.007060	...	4.833242	6.146258	33.602542	569.356993	0.022832	0.157336	0.208624	0.065732	0.061867	0.018061
min	6.981000	9.710000	43.790000	143.500000	0.052630	0.019380	0.000000	0.000000	0.106000	0.049960	...	7.930000	12.020000	50.410000	185.200000	0.071170	0.027290	0.000000	0.000000	0.156500	0.055040
25%	11.700000	16.170000	75.170000	420.300000	0.086370	0.064920	0.029560	0.020310	0.161900	0.057700	...	13.010000	21.080000	84.110000	515.300000	0.116600	0.147200	0.114500	0.064930	0.250400	0.071460
50%	13.370000	18.840000	86.240000	551.100000	0.095870	0.092630	0.061540	0.033500	0.179200	0.061540	...	14.970000	25.410000	97.660000	686.500000	0.131300	0.211900	0.226700	0.099930	0.282200	0.080040
75%	15.780000	21.800000	104.100000	782.700000	0.105300	0.130400	0.130700	0.074000	0.195700	0.066120	...	18.790000	29.720000	125.400000	1084.000000	0.146000	0.339100	0.382900	0.161400	0.317900	0.092080
max	28.110000	39.280000	188.500000	2501.000000	0.163400	0.345400	0.426800	0.201200	0.304000	0.097440	...	36.040000	49.540000	251.200000	4254.000000	0.222600	1.058000	1.252000	0.291000	0.663800	0.207500

8 rows × 30 columns

When there is a large number of features pandas might not be able to show the full summary to you. There’s a few ways to get sub-summary’s if needed. The first is to transpose the results of describe.

df.describe().T

	count	mean	std	min	25%	50%	75%	max
radius_mean	569.0	14.127292	3.524049	6.981000	11.700000	13.370000	15.780000	28.11000
texture_mean	569.0	19.289649	4.301036	9.710000	16.170000	18.840000	21.800000	39.28000
perimeter_mean	569.0	91.969033	24.298981	43.790000	75.170000	86.240000	104.100000	188.50000
area_mean	569.0	654.889104	351.914129	143.500000	420.300000	551.100000	782.700000	2501.00000
smoothness_mean	569.0	0.096360	0.014064	0.052630	0.086370	0.095870	0.105300	0.16340
compactness_mean	569.0	0.104341	0.052813	0.019380	0.064920	0.092630	0.130400	0.34540
concavity_mean	569.0	0.088799	0.079720	0.000000	0.029560	0.061540	0.130700	0.42680
concave points_mean	569.0	0.048919	0.038803	0.000000	0.020310	0.033500	0.074000	0.20120
symmetry_mean	569.0	0.181162	0.027414	0.106000	0.161900	0.179200	0.195700	0.30400
fractal_dimension_mean	569.0	0.062798	0.007060	0.049960	0.057700	0.061540	0.066120	0.09744
radius_se	569.0	0.405172	0.277313	0.111500	0.232400	0.324200	0.478900	2.87300
texture_se	569.0	1.216853	0.551648	0.360200	0.833900	1.108000	1.474000	4.88500
perimeter_se	569.0	2.866059	2.021855	0.757000	1.606000	2.287000	3.357000	21.98000
area_se	569.0	40.337079	45.491006	6.802000	17.850000	24.530000	45.190000	542.20000
smoothness_se	569.0	0.007041	0.003003	0.001713	0.005169	0.006380	0.008146	0.03113
compactness_se	569.0	0.025478	0.017908	0.002252	0.013080	0.020450	0.032450	0.13540
concavity_se	569.0	0.031894	0.030186	0.000000	0.015090	0.025890	0.042050	0.39600
concave points_se	569.0	0.011796	0.006170	0.000000	0.007638	0.010930	0.014710	0.05279
symmetry_se	569.0	0.020542	0.008266	0.007882	0.015160	0.018730	0.023480	0.07895
fractal_dimension_se	569.0	0.003795	0.002646	0.000895	0.002248	0.003187	0.004558	0.02984
radius_worst	569.0	16.269190	4.833242	7.930000	13.010000	14.970000	18.790000	36.04000
texture_worst	569.0	25.677223	6.146258	12.020000	21.080000	25.410000	29.720000	49.54000
perimeter_worst	569.0	107.261213	33.602542	50.410000	84.110000	97.660000	125.400000	251.20000
area_worst	569.0	880.583128	569.356993	185.200000	515.300000	686.500000	1084.000000	4254.00000
smoothness_worst	569.0	0.132369	0.022832	0.071170	0.116600	0.131300	0.146000	0.22260
compactness_worst	569.0	0.254265	0.157336	0.027290	0.147200	0.211900	0.339100	1.05800
concavity_worst	569.0	0.272188	0.208624	0.000000	0.114500	0.226700	0.382900	1.25200
concave points_worst	569.0	0.114606	0.065732	0.000000	0.064930	0.099930	0.161400	0.29100
symmetry_worst	569.0	0.290076	0.061867	0.156500	0.250400	0.282200	0.317900	0.66380
fractal_dimension_worst	569.0	0.083946	0.018061	0.055040	0.071460	0.080040	0.092080	0.20750

If there are common strings in the column labels you can use the filter() method to restrict your results summary.

df.filter(like='worst').describe().T.round(2)

	count	mean	std	min	25%	50%	75%	max
radius_worst	569.0	16.27	4.83	7.93	13.01	14.97	18.79	36.04
texture_worst	569.0	25.68	6.15	12.02	21.08	25.41	29.72	49.54
perimeter_worst	569.0	107.26	33.60	50.41	84.11	97.66	125.40	251.20
area_worst	569.0	880.58	569.36	185.20	515.30	686.50	1084.00	4254.00
smoothness_worst	569.0	0.13	0.02	0.07	0.12	0.13	0.15	0.22
compactness_worst	569.0	0.25	0.16	0.03	0.15	0.21	0.34	1.06
concavity_worst	569.0	0.27	0.21	0.00	0.11	0.23	0.38	1.25
concave points_worst	569.0	0.11	0.07	0.00	0.06	0.10	0.16	0.29
symmetry_worst	569.0	0.29	0.06	0.16	0.25	0.28	0.32	0.66
fractal_dimension_worst	569.0	0.08	0.02	0.06	0.07	0.08	0.09	0.21

df.filter(like='mean').describe().T.round(2)

	count	mean	std	min	25%	50%	75%	max
radius_mean	569.0	14.13	3.52	6.98	11.70	13.37	15.78	28.11
texture_mean	569.0	19.29	4.30	9.71	16.17	18.84	21.80	39.28
perimeter_mean	569.0	91.97	24.30	43.79	75.17	86.24	104.10	188.50
area_mean	569.0	654.89	351.91	143.50	420.30	551.10	782.70	2501.00
smoothness_mean	569.0	0.10	0.01	0.05	0.09	0.10	0.11	0.16
compactness_mean	569.0	0.10	0.05	0.02	0.06	0.09	0.13	0.35
concavity_mean	569.0	0.09	0.08	0.00	0.03	0.06	0.13	0.43
concave points_mean	569.0	0.05	0.04	0.00	0.02	0.03	0.07	0.20
symmetry_mean	569.0	0.18	0.03	0.11	0.16	0.18	0.20	0.30
fractal_dimension_mean	569.0	0.06	0.01	0.05	0.06	0.06	0.07	0.10

Selecting a subset of the DataFrame#

Let’s assume, for example, that you need to take a closer look at the smoothness statistics for samples where the area_mean is above the 3rd quartile (782).

The first step is to create a DataFrame (here called extremes) that contains the subset of data where area_mean > the 3rd quartile.

# get the threshold (could be hard coded as 782 as an alternative)
threshold = df.describe().T.loc['area_mean']['75%']

# get the extremes of the dataset
extremes = df.loc[df['area_mean'] > threshold]
extremes.shape

(142, 31)

Given the naming convention in the dataset we can use .filter for columns containing the string smoothness and call the describe method.

extremes.filter(like='smoothness').describe()

	smoothness_mean	smoothness_se	smoothness_worst
count	142.000000	142.000000	142.000000
mean	0.100433	0.006653	0.138356
std	0.011829	0.003180	0.018561
min	0.073710	0.002667	0.087740
25%	0.092672	0.004951	0.124750
50%	0.099985	0.006106	0.139950
75%	0.107775	0.007547	0.150150
max	0.144700	0.031130	0.187300

An alternative is to specify the list of columns you want to summarise.

to_select = ['smoothness_mean', 'smoothness_se', 'smoothness_worst']
extremes[to_select].describe()

	smoothness_mean	smoothness_se	smoothness_worst
count	142.000000	142.000000	142.000000
mean	0.100433	0.006653	0.138356
std	0.011829	0.003180	0.018561
min	0.073710	0.002667	0.087740
25%	0.092672	0.004951	0.124750
50%	0.099985	0.006106	0.139950
75%	0.107775	0.007547	0.150150
max	0.144700	0.031130	0.187300

Working with files

Contents

Working with files#

Reading data into pandas from a flat file#

Describing features and making selections#

Selecting a subset of the DataFrame#