DataMiningPratice(1)-BasicImportofDatasets

August 9, 2023

Utilize public data to import two completely different data, pre-process them, and merge them

Trends in pre-sale prices for new private apartments across the country

Will the trend of real estate price fluctuations from 2013 to recent years be reflected in the sale price of apartments? Analyze the data in the public data portal using Pandas’ reshape functions such as melt, concat, pivot, and transpose. And we will summarize and analyze the data through groupby, pivot_table, info, describe, value_counts, etc. This will teach you how to handle the process of taking two completely different types of data, cleaning them, and merging them. You will be able to understand the difference between numerical and categorical data for preprocessed results and visualize them in various graphs.

Coverage

Utilize public data to import two completely different data, pre-process them, and merge them
Cultivate the perspective of looking at numerical and categorical data
Understanding various visualization methods depending on the format of data

Training

After downloading public data, load it into Jupyter notebook
Summarize and analyze data with Pandas
Data preprocessing and merging
Handling numeric and categorical data
Practicing bar plot, line plot, scatter plot, correlation (lm plot), heat map, box and whisker plot, swarm plot, frequency distribution table, and histogram (distplot)

# The datasets to use should be in the same folder

%ls data

전국 평균 분양가격(2013년 9월부터 2015년 8월까지).csv
주택도시보증공사_전국 평균 분양가격(2019년 12월).csv

import pandas as pd

Loading datasets

UnicodeDecode Error: Add encoding euc-kr or cp949 The latest sale price file is downloaded and loaded into a variable called df_last. After loading the data, output the number of rows and columns through shape.

Load the latest sale price file and store it in a variable called df_last.

# the shape of data : (columns & rows)

df_last = pd.read_csv("data/주택도시보증공사_전국 평균 분양가격(2019년 12월).csv", encoding = "cp949")
df_last.shape

(4335, 5)

Preview the file with head and tail. If you put ? after a method, the docstring of the method is output through a function called self-invocation. You can also open the same document by pressing the Shift + Tab keys inside the () parenthesis of the method. You can increase the pop-up window by pressing Shift + Tab + Tab.

df_last.head()

	지역명	규모구분	연도	월	분양가격(㎡)
0	서울	전체	2015	10	5841
1	서울	전용면적 60㎡이하	2015	10	5652
2	서울	전용면적 60㎡초과 85㎡이하	2015	10	5882
3	서울	전용면적 85㎡초과 102㎡이하	2015	10	5721
4	서울	전용면적 102㎡초과	2015	10	5879

df_last.tail()

	지역명	규모구분	연도	월	분양가격(㎡)
4330	제주	전체	2019	12	3882
4331	제주	전용면적 60㎡이하	2019	12	NaN
4332	제주	전용면적 60㎡초과 85㎡이하	2019	12	3898
4333	제주	전용면적 85㎡초과 102㎡이하	2019	12	NaN
4334	제주	전용면적 102㎡초과	2019	12	3601

There are “NaN”, “not a number” results. We need to refine them

Load data from 2015 to the latest

Import the national average sale price (from September 2013 to August 2015) file. Put it in a variable called df_first and output the number of rows and columns in shape.

Outputs a list of files in the applicable folder or path.

%ls data

전국 평균 분양가격(2013년 9월부터 2015년 8월까지).csv
주택도시보증공사_전국 평균 분양가격(2019년 12월).csv

df_first = pd.read_csv("data/전국 평균 분양가격(2013년 9월부터 2015년 8월까지).csv", encoding = "cp949")
df_first.shape

(17, 22)

Preview the data frame contained in the df_first variable as head.

df_first.head()

	지역	2013년12월	2014년1월	2014년2월	2014년3월	2014년4월	2014년5월	2014년6월	2014년7월	2014년8월	...	2014년11월	2014년12월	2015년1월	2015년2월	2015년3월	2015년4월	2015년5월	2015년6월	2015년7월	2015년8월
0	서울	18189	17925	17925	18016	18098	19446	18867	18742	19274	...	20242	20269	20670	20670	19415	18842	18367	18374	18152	18443
1	부산	8111	8111	9078	8965	9402	9501	9453	9457	9411	...	9208	9208	9204	9235	9279	9327	9345	9515	9559	9581
2	대구	8080	8080	8077	8101	8267	8274	8360	8360	8370	...	8439	8253	8327	8416	8441	8446	8568	8542	8542	8795
3	인천	10204	10204	10408	10408	10000	9844	10058	9974	9973	...	10020	10020	10017	9876	9876	9938	10551	10443	10443	10449
4	광주	6098	7326	7611	7346	7346	7523	7659	7612	7622	...	7752	7748	7752	7756	7861	7914	7877	7881	8089	8231

5 rows × 22 columns

Preview the data frame contained in the df_first variable with tail.

df_first.tail()

	지역	2013년12월	2014년1월	2014년2월	2014년3월	2014년4월	2014년5월	2014년6월	2014년7월	2014년8월	...	2014년11월	2014년12월	2015년1월	2015년2월	2015년3월	2015년4월	2015년5월	2015년6월	2015년7월	2015년8월
12	전북	6282	6281	5946	5966	6277	6306	6351	6319	6436	...	6583	6583	6583	6583	6542	6551	6556	6601	6750	6580
13	전남	5678	5678	5678	5696	5736	5656	5609	5780	5685	...	5768	5784	5784	5833	5825	5940	6050	6243	6286	6289
14	경북	6168	6168	6234	6317	6412	6409	6554	6556	6563	...	6881	6989	6992	6953	6997	7006	6966	6887	7035	7037
15	경남	6473	6485	6502	6610	6599	6610	6615	6613	6606	...	7125	7332	7592	7588	7668	7683	7717	7715	7723	7665
16	제주	7674	7900	7900	7900	7900	7900	7914	7914	7914	...	7724	7739	7739	7739	7826	7285	7285	7343	7343	7343

5 rows × 22 columns

Summarize the Datasets

# Summarize with `info`.
df_last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   지역명      4335 non-null   object
 1   규모구분     4335 non-null   object
 2   연도       4335 non-null   int64 
 3   월        4335 non-null   int64 
 4   분양가격(㎡)  4058 non-null   object
dtypes: int64(2), object(3)
memory usage: 169.5+ KB

RangeIndex : index numbers & data entry numbers
the price: 4th column is different b/c of NaN numbers

Showing missing values

You can check whether the data is empty with isnull or isna. Missing values are displayed as True, but since True == 1, adding all of these values results in the number of missing values.

True == 1

True

False == 0

True

True + True + False

Check missing values via isnull.

df_last.isnull()

	지역명	규모구분	연도	월	분양가격(㎡)
0	False	False	False	False	False
1	False	False	False	False	False
2	False	False	False	False	False
3	False	False	False	False	False
4	False	False	False	False	False
...	...	...	...	...	...
4330	False	False	False	False	False
4331	False	False	False	False	True
4332	False	False	False	False	False
4333	False	False	False	False	True
4334	False	False	False	False	False

4335 rows × 5 columns

Use isnull to find missing values.

df_last.isnull().sum()

지역명          0
규모구분         0
연도           0
월            0
분양가격(㎡)    277
dtype: int64

Find missing values via isna.

df_last.isna().sum()

지역명          0
규모구분         0
연도           0
월            0
분양가격(㎡)    277
dtype: int64

Share on

Twitter Facebook LinkedIn

Wonha Leah Shin

DataMiningPratice(1)-BasicImportofDatasets

Trends in pre-sale prices for new private apartments across the country

Coverage

Training

Loading datasets

Load data from 2015 to the latest

Summarize the Datasets

Showing missing values

Share on

Leave a comment

You may also enjoy

Day175 - MLOps Review: Data Distribution Shifts And Monitoring (2)

Day174 - MLOps Review: Data Distribution Shifts And Monitoring (1)

Day173 - MLOps Review: Model Deployment And Prediction Service (3)

Day172 - MLOps Review: Model Deployment and Prediction Service (2)