Utilize public data to import two completely different data, pre-process them, and merge them

Trends in pre-sale prices for new private apartments across the country

Will the trend of real estate price fluctuations from 2013 to recent years be reflected in the sale price of apartments? Analyze the data in the public data portal using Pandasโ€™ reshape functions such as melt, concat, pivot, and transpose. And we will summarize and analyze the data through groupby, pivot_table, info, describe, value_counts, etc. This will teach you how to handle the process of taking two completely different types of data, cleaning them, and merging them. You will be able to understand the difference between numerical and categorical data for preprocessed results and visualize them in various graphs.

Coverage

  • Utilize public data to import two completely different data, pre-process them, and merge them
  • Cultivate the perspective of looking at numerical and categorical data
  • Understanding various visualization methods depending on the format of data

Training

  • After downloading public data, load it into Jupyter notebook
  • Summarize and analyze data with Pandas
  • Data preprocessing and merging
  • Handling numeric and categorical data
  • Practicing bar plot, line plot, scatter plot, correlation (lm plot), heat map, box and whisker plot, swarm plot, frequency distribution table, and histogram (distplot)
# The datasets to use should be in the same folder

%ls data
แ„Œแ…ฅแ†ซแ„€แ…ฎแ†จ แ„‘แ…งแ†ผแ„€แ…ฒแ†ซ แ„‡แ…ฎแ†ซแ„‹แ…ฃแ†ผแ„€แ…กแ„€แ…งแ†จ(2013แ„‚แ…งแ†ซ 9แ„‹แ…ฏแ†ฏแ„‡แ…ฎแ„แ…ฅ 2015แ„‚แ…งแ†ซ 8แ„‹แ…ฏแ†ฏแ„แ…กแ„Œแ…ต).csv
์ฃผํƒ๋„์‹œ๋ณด์ฆ๊ณต์‚ฌ_์ „๊ตญ ํ‰๊ท  ๋ถ„์–‘๊ฐ€๊ฒฉ(2019๋…„ 12์›”).csv
import pandas as pd

Loading datasets

UnicodeDecode Error: Add encoding euc-kr or cp949 The latest sale price file is downloaded and loaded into a variable called df_last. After loading the data, output the number of rows and columns through shape.

Load the latest sale price file and store it in a variable called df_last.

# the shape of data : (columns & rows)

df_last = pd.read_csv("data/์ฃผํƒ๋„์‹œ๋ณด์ฆ๊ณต์‚ฌ_์ „๊ตญ ํ‰๊ท  ๋ถ„์–‘๊ฐ€๊ฒฉ(2019๋…„ 12์›”).csv", encoding = "cp949")
df_last.shape
(4335, 5)

Preview the file with head and tail. If you put ? after a method, the docstring of the method is output through a function called self-invocation. You can also open the same document by pressing the Shift + Tab keys inside the () parenthesis of the method. You can increase the pop-up window by pressing Shift + Tab + Tab.

df_last.head()
์ง€์—ญ๋ช… ๊ทœ๋ชจ๊ตฌ๋ถ„ ์—ฐ๋„ ์›” ๋ถ„์–‘๊ฐ€๊ฒฉ(ใŽก)
0 ์„œ์šธ ์ „์ฒด 2015 10 5841
1 ์„œ์šธ ์ „์šฉ๋ฉด์  60ใŽก์ดํ•˜ 2015 10 5652
2 ์„œ์šธ ์ „์šฉ๋ฉด์  60ใŽก์ดˆ๊ณผ 85ใŽก์ดํ•˜ 2015 10 5882
3 ์„œ์šธ ์ „์šฉ๋ฉด์  85ใŽก์ดˆ๊ณผ 102ใŽก์ดํ•˜ 2015 10 5721
4 ์„œ์šธ ์ „์šฉ๋ฉด์  102ใŽก์ดˆ๊ณผ 2015 10 5879
df_last.tail()
์ง€์—ญ๋ช… ๊ทœ๋ชจ๊ตฌ๋ถ„ ์—ฐ๋„ ์›” ๋ถ„์–‘๊ฐ€๊ฒฉ(ใŽก)
4330 ์ œ์ฃผ ์ „์ฒด 2019 12 3882
4331 ์ œ์ฃผ ์ „์šฉ๋ฉด์  60ใŽก์ดํ•˜ 2019 12 NaN
4332 ์ œ์ฃผ ์ „์šฉ๋ฉด์  60ใŽก์ดˆ๊ณผ 85ใŽก์ดํ•˜ 2019 12 3898
4333 ์ œ์ฃผ ์ „์šฉ๋ฉด์  85ใŽก์ดˆ๊ณผ 102ใŽก์ดํ•˜ 2019 12 NaN
4334 ์ œ์ฃผ ์ „์šฉ๋ฉด์  102ใŽก์ดˆ๊ณผ 2019 12 3601

There are โ€œNaNโ€, โ€œnot a numberโ€ results. We need to refine them

Load data from 2015 to the latest

Import the national average sale price (from September 2013 to August 2015) file. Put it in a variable called df_first and output the number of rows and columns in shape.

Outputs a list of files in the applicable folder or path.

%ls data
แ„Œแ…ฅแ†ซแ„€แ…ฎแ†จ แ„‘แ…งแ†ผแ„€แ…ฒแ†ซ แ„‡แ…ฎแ†ซแ„‹แ…ฃแ†ผแ„€แ…กแ„€แ…งแ†จ(2013แ„‚แ…งแ†ซ 9แ„‹แ…ฏแ†ฏแ„‡แ…ฎแ„แ…ฅ 2015แ„‚แ…งแ†ซ 8แ„‹แ…ฏแ†ฏแ„แ…กแ„Œแ…ต).csv
์ฃผํƒ๋„์‹œ๋ณด์ฆ๊ณต์‚ฌ_์ „๊ตญ ํ‰๊ท  ๋ถ„์–‘๊ฐ€๊ฒฉ(2019๋…„ 12์›”).csv
df_first = pd.read_csv("data/แ„Œแ…ฅแ†ซแ„€แ…ฎแ†จ แ„‘แ…งแ†ผแ„€แ…ฒแ†ซ แ„‡แ…ฎแ†ซแ„‹แ…ฃแ†ผแ„€แ…กแ„€แ…งแ†จ(2013แ„‚แ…งแ†ซ 9แ„‹แ…ฏแ†ฏแ„‡แ…ฎแ„แ…ฅ 2015แ„‚แ…งแ†ซ 8แ„‹แ…ฏแ†ฏแ„แ…กแ„Œแ…ต).csv", encoding = "cp949")
df_first.shape
(17, 22)

Preview the data frame contained in the df_first variable as head.

df_first.head()
์ง€์—ญ 2013๋…„12์›” 2014๋…„1์›” 2014๋…„2์›” 2014๋…„3์›” 2014๋…„4์›” 2014๋…„5์›” 2014๋…„6์›” 2014๋…„7์›” 2014๋…„8์›” ... 2014๋…„11์›” 2014๋…„12์›” 2015๋…„1์›” 2015๋…„2์›” 2015๋…„3์›” 2015๋…„4์›” 2015๋…„5์›” 2015๋…„6์›” 2015๋…„7์›” 2015๋…„8์›”
0 ์„œ์šธ 18189 17925 17925 18016 18098 19446 18867 18742 19274 ... 20242 20269 20670 20670 19415 18842 18367 18374 18152 18443
1 ๋ถ€์‚ฐ 8111 8111 9078 8965 9402 9501 9453 9457 9411 ... 9208 9208 9204 9235 9279 9327 9345 9515 9559 9581
2 ๋Œ€๊ตฌ 8080 8080 8077 8101 8267 8274 8360 8360 8370 ... 8439 8253 8327 8416 8441 8446 8568 8542 8542 8795
3 ์ธ์ฒœ 10204 10204 10408 10408 10000 9844 10058 9974 9973 ... 10020 10020 10017 9876 9876 9938 10551 10443 10443 10449
4 ๊ด‘์ฃผ 6098 7326 7611 7346 7346 7523 7659 7612 7622 ... 7752 7748 7752 7756 7861 7914 7877 7881 8089 8231

5 rows ร— 22 columns

Preview the data frame contained in the df_first variable with tail.

df_first.tail()
์ง€์—ญ 2013๋…„12์›” 2014๋…„1์›” 2014๋…„2์›” 2014๋…„3์›” 2014๋…„4์›” 2014๋…„5์›” 2014๋…„6์›” 2014๋…„7์›” 2014๋…„8์›” ... 2014๋…„11์›” 2014๋…„12์›” 2015๋…„1์›” 2015๋…„2์›” 2015๋…„3์›” 2015๋…„4์›” 2015๋…„5์›” 2015๋…„6์›” 2015๋…„7์›” 2015๋…„8์›”
12 ์ „๋ถ 6282 6281 5946 5966 6277 6306 6351 6319 6436 ... 6583 6583 6583 6583 6542 6551 6556 6601 6750 6580
13 ์ „๋‚จ 5678 5678 5678 5696 5736 5656 5609 5780 5685 ... 5768 5784 5784 5833 5825 5940 6050 6243 6286 6289
14 ๊ฒฝ๋ถ 6168 6168 6234 6317 6412 6409 6554 6556 6563 ... 6881 6989 6992 6953 6997 7006 6966 6887 7035 7037
15 ๊ฒฝ๋‚จ 6473 6485 6502 6610 6599 6610 6615 6613 6606 ... 7125 7332 7592 7588 7668 7683 7717 7715 7723 7665
16 ์ œ์ฃผ 7674 7900 7900 7900 7900 7900 7914 7914 7914 ... 7724 7739 7739 7739 7826 7285 7285 7343 7343 7343

5 rows ร— 22 columns

Summarize the Datasets

# Summarize with `info`.
df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ์ง€์—ญ๋ช…      4335 non-null   object
 1   ๊ทœ๋ชจ๊ตฌ๋ถ„     4335 non-null   object
 2   ์—ฐ๋„       4335 non-null   int64 
 3   ์›”        4335 non-null   int64 
 4   ๋ถ„์–‘๊ฐ€๊ฒฉ(ใŽก)  4058 non-null   object
dtypes: int64(2), object(3)
memory usage: 169.5+ KB
  • RangeIndex : index numbers & data entry numbers

  • the price: 4th column is different b/c of NaN numbers

Showing missing values

You can check whether the data is empty with isnull or isna. Missing values are displayed as True, but since True == 1, adding all of these values results in the number of missing values.

True == 1
True
False == 0
True
True + True + False
2

Check missing values via isnull.

df_last.isnull()
์ง€์—ญ๋ช… ๊ทœ๋ชจ๊ตฌ๋ถ„ ์—ฐ๋„ ์›” ๋ถ„์–‘๊ฐ€๊ฒฉ(ใŽก)
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
... ... ... ... ... ...
4330 False False False False False
4331 False False False False True
4332 False False False False False
4333 False False False False True
4334 False False False False False

4335 rows ร— 5 columns

Use isnull to find missing values.

df_last.isnull().sum()
์ง€์—ญ๋ช…          0
๊ทœ๋ชจ๊ตฌ๋ถ„         0
์—ฐ๋„           0
์›”            0
๋ถ„์–‘๊ฐ€๊ฒฉ(ใŽก)    277
dtype: int64

Find missing values via isna.

df_last.isna().sum()
์ง€์—ญ๋ช…          0
๊ทœ๋ชจ๊ตฌ๋ถ„         0
์—ฐ๋„           0
์›”            0
๋ถ„์–‘๊ฐ€๊ฒฉ(ใŽก)    277
dtype: int64

Leave a comment