DataMiningPratice(1)-BasicImportofDatasets
Utilize public data to import two completely different data, pre-process them, and merge them
Trends in pre-sale prices for new private apartments across the country
Will the trend of real estate price fluctuations from 2013 to recent years be reflected in the sale price of apartments? Analyze the data in the public data portal using Pandasโ reshape functions such as melt, concat, pivot, and transpose. And we will summarize and analyze the data through groupby, pivot_table, info, describe, value_counts, etc. This will teach you how to handle the process of taking two completely different types of data, cleaning them, and merging them. You will be able to understand the difference between numerical and categorical data for preprocessed results and visualize them in various graphs.
Coverage
- Utilize public data to import two completely different data, pre-process them, and merge them
- Cultivate the perspective of looking at numerical and categorical data
- Understanding various visualization methods depending on the format of data
Training
- After downloading public data, load it into Jupyter notebook
- Summarize and analyze data with Pandas
- Data preprocessing and merging
- Handling numeric and categorical data
- Practicing bar plot, line plot, scatter plot, correlation (lm plot), heat map, box and whisker plot, swarm plot, frequency distribution table, and histogram (distplot)
# The datasets to use should be in the same folder
%ls data
แแ
ฅแซแแ
ฎแจ แแ
งแผแแ
ฒแซ แแ
ฎแซแแ
ฃแผแแ
กแแ
งแจ(2013แแ
งแซ 9แแ
ฏแฏแแ
ฎแแ
ฅ 2015แแ
งแซ 8แแ
ฏแฏแแ
กแแ
ต).csv
์ฃผํ๋์๋ณด์ฆ๊ณต์ฌ_์ ๊ตญ ํ๊ท ๋ถ์๊ฐ๊ฒฉ(2019๋
12์).csv
import pandas as pd
Loading datasets
UnicodeDecode Error: Add encoding euc-kr
or cp949
The latest sale price file is downloaded and loaded into a variable called df_last.
After loading the data, output the number of rows and columns through shape.
Load the latest sale price file and store it in a variable called df_last
.
# the shape of data : (columns & rows)
df_last = pd.read_csv("data/์ฃผํ๋์๋ณด์ฆ๊ณต์ฌ_์ ๊ตญ ํ๊ท ๋ถ์๊ฐ๊ฒฉ(2019๋
12์).csv", encoding = "cp949")
df_last.shape
(4335, 5)
Preview the file with head
and tail
.
If you put ?
after a method, the docstring of the method is output through a function called self-invocation.
You can also open the same document by pressing the Shift + Tab
keys inside the () parenthesis of the method.
You can increase the pop-up window by pressing Shift + Tab + Tab
.
df_last.head()
์ง์ญ๋ช | ๊ท๋ชจ๊ตฌ๋ถ | ์ฐ๋ | ์ | ๋ถ์๊ฐ๊ฒฉ(ใก) | |
---|---|---|---|---|---|
0 | ์์ธ | ์ ์ฒด | 2015 | 10 | 5841 |
1 | ์์ธ | ์ ์ฉ๋ฉด์ 60ใก์ดํ | 2015 | 10 | 5652 |
2 | ์์ธ | ์ ์ฉ๋ฉด์ 60ใก์ด๊ณผ 85ใก์ดํ | 2015 | 10 | 5882 |
3 | ์์ธ | ์ ์ฉ๋ฉด์ 85ใก์ด๊ณผ 102ใก์ดํ | 2015 | 10 | 5721 |
4 | ์์ธ | ์ ์ฉ๋ฉด์ 102ใก์ด๊ณผ | 2015 | 10 | 5879 |
df_last.tail()
์ง์ญ๋ช | ๊ท๋ชจ๊ตฌ๋ถ | ์ฐ๋ | ์ | ๋ถ์๊ฐ๊ฒฉ(ใก) | |
---|---|---|---|---|---|
4330 | ์ ์ฃผ | ์ ์ฒด | 2019 | 12 | 3882 |
4331 | ์ ์ฃผ | ์ ์ฉ๋ฉด์ 60ใก์ดํ | 2019 | 12 | NaN |
4332 | ์ ์ฃผ | ์ ์ฉ๋ฉด์ 60ใก์ด๊ณผ 85ใก์ดํ | 2019 | 12 | 3898 |
4333 | ์ ์ฃผ | ์ ์ฉ๋ฉด์ 85ใก์ด๊ณผ 102ใก์ดํ | 2019 | 12 | NaN |
4334 | ์ ์ฃผ | ์ ์ฉ๋ฉด์ 102ใก์ด๊ณผ | 2019 | 12 | 3601 |
There are โNaNโ, โnot a numberโ results. We need to refine them
Load data from 2015 to the latest
Import the national average sale price (from September 2013 to August 2015) file.
Put it in a variable called df_first
and output the number of rows and columns in shape
.
Outputs a list of files in the applicable folder or path.
%ls data
แแ
ฅแซแแ
ฎแจ แแ
งแผแแ
ฒแซ แแ
ฎแซแแ
ฃแผแแ
กแแ
งแจ(2013แแ
งแซ 9แแ
ฏแฏแแ
ฎแแ
ฅ 2015แแ
งแซ 8แแ
ฏแฏแแ
กแแ
ต).csv
์ฃผํ๋์๋ณด์ฆ๊ณต์ฌ_์ ๊ตญ ํ๊ท ๋ถ์๊ฐ๊ฒฉ(2019๋
12์).csv
df_first = pd.read_csv("data/แแ
ฅแซแแ
ฎแจ แแ
งแผแแ
ฒแซ แแ
ฎแซแแ
ฃแผแแ
กแแ
งแจ(2013แแ
งแซ 9แแ
ฏแฏแแ
ฎแแ
ฅ 2015แแ
งแซ 8แแ
ฏแฏแแ
กแแ
ต).csv", encoding = "cp949")
df_first.shape
(17, 22)
Preview the data frame contained in the df_first
variable as head
.
df_first.head()
์ง์ญ | 2013๋ 12์ | 2014๋ 1์ | 2014๋ 2์ | 2014๋ 3์ | 2014๋ 4์ | 2014๋ 5์ | 2014๋ 6์ | 2014๋ 7์ | 2014๋ 8์ | ... | 2014๋ 11์ | 2014๋ 12์ | 2015๋ 1์ | 2015๋ 2์ | 2015๋ 3์ | 2015๋ 4์ | 2015๋ 5์ | 2015๋ 6์ | 2015๋ 7์ | 2015๋ 8์ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ์์ธ | 18189 | 17925 | 17925 | 18016 | 18098 | 19446 | 18867 | 18742 | 19274 | ... | 20242 | 20269 | 20670 | 20670 | 19415 | 18842 | 18367 | 18374 | 18152 | 18443 |
1 | ๋ถ์ฐ | 8111 | 8111 | 9078 | 8965 | 9402 | 9501 | 9453 | 9457 | 9411 | ... | 9208 | 9208 | 9204 | 9235 | 9279 | 9327 | 9345 | 9515 | 9559 | 9581 |
2 | ๋๊ตฌ | 8080 | 8080 | 8077 | 8101 | 8267 | 8274 | 8360 | 8360 | 8370 | ... | 8439 | 8253 | 8327 | 8416 | 8441 | 8446 | 8568 | 8542 | 8542 | 8795 |
3 | ์ธ์ฒ | 10204 | 10204 | 10408 | 10408 | 10000 | 9844 | 10058 | 9974 | 9973 | ... | 10020 | 10020 | 10017 | 9876 | 9876 | 9938 | 10551 | 10443 | 10443 | 10449 |
4 | ๊ด์ฃผ | 6098 | 7326 | 7611 | 7346 | 7346 | 7523 | 7659 | 7612 | 7622 | ... | 7752 | 7748 | 7752 | 7756 | 7861 | 7914 | 7877 | 7881 | 8089 | 8231 |
5 rows ร 22 columns
Preview the data frame contained in the df_first
variable with tail
.
df_first.tail()
์ง์ญ | 2013๋ 12์ | 2014๋ 1์ | 2014๋ 2์ | 2014๋ 3์ | 2014๋ 4์ | 2014๋ 5์ | 2014๋ 6์ | 2014๋ 7์ | 2014๋ 8์ | ... | 2014๋ 11์ | 2014๋ 12์ | 2015๋ 1์ | 2015๋ 2์ | 2015๋ 3์ | 2015๋ 4์ | 2015๋ 5์ | 2015๋ 6์ | 2015๋ 7์ | 2015๋ 8์ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | ์ ๋ถ | 6282 | 6281 | 5946 | 5966 | 6277 | 6306 | 6351 | 6319 | 6436 | ... | 6583 | 6583 | 6583 | 6583 | 6542 | 6551 | 6556 | 6601 | 6750 | 6580 |
13 | ์ ๋จ | 5678 | 5678 | 5678 | 5696 | 5736 | 5656 | 5609 | 5780 | 5685 | ... | 5768 | 5784 | 5784 | 5833 | 5825 | 5940 | 6050 | 6243 | 6286 | 6289 |
14 | ๊ฒฝ๋ถ | 6168 | 6168 | 6234 | 6317 | 6412 | 6409 | 6554 | 6556 | 6563 | ... | 6881 | 6989 | 6992 | 6953 | 6997 | 7006 | 6966 | 6887 | 7035 | 7037 |
15 | ๊ฒฝ๋จ | 6473 | 6485 | 6502 | 6610 | 6599 | 6610 | 6615 | 6613 | 6606 | ... | 7125 | 7332 | 7592 | 7588 | 7668 | 7683 | 7717 | 7715 | 7723 | 7665 |
16 | ์ ์ฃผ | 7674 | 7900 | 7900 | 7900 | 7900 | 7900 | 7914 | 7914 | 7914 | ... | 7724 | 7739 | 7739 | 7739 | 7826 | 7285 | 7285 | 7343 | 7343 | 7343 |
5 rows ร 22 columns
Summarize the Datasets
# Summarize with `info`.
df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ์ง์ญ๋ช
4335 non-null object
1 ๊ท๋ชจ๊ตฌ๋ถ 4335 non-null object
2 ์ฐ๋ 4335 non-null int64
3 ์ 4335 non-null int64
4 ๋ถ์๊ฐ๊ฒฉ(ใก) 4058 non-null object
dtypes: int64(2), object(3)
memory usage: 169.5+ KB
-
RangeIndex : index numbers & data entry numbers
-
the price: 4th column is different b/c of NaN numbers
Showing missing values
You can check whether the data is empty with isnull
or isna
.
Missing values are displayed as True, but since True == 1
, adding all of these values results in the number of missing values.
True == 1
True
False == 0
True
True + True + False
2
Check missing values via isnull
.
df_last.isnull()
์ง์ญ๋ช | ๊ท๋ชจ๊ตฌ๋ถ | ์ฐ๋ | ์ | ๋ถ์๊ฐ๊ฒฉ(ใก) | |
---|---|---|---|---|---|
0 | False | False | False | False | False |
1 | False | False | False | False | False |
2 | False | False | False | False | False |
3 | False | False | False | False | False |
4 | False | False | False | False | False |
... | ... | ... | ... | ... | ... |
4330 | False | False | False | False | False |
4331 | False | False | False | False | True |
4332 | False | False | False | False | False |
4333 | False | False | False | False | True |
4334 | False | False | False | False | False |
4335 rows ร 5 columns
Use isnull
to find missing values.
df_last.isnull().sum()
์ง์ญ๋ช
0
๊ท๋ชจ๊ตฌ๋ถ 0
์ฐ๋ 0
์ 0
๋ถ์๊ฐ๊ฒฉ(ใก) 277
dtype: int64
Find missing values via isna
.
df_last.isna().sum()
์ง์ญ๋ช
0
๊ท๋ชจ๊ตฌ๋ถ 0
์ฐ๋ 0
์ 0
๋ถ์๊ฐ๊ฒฉ(ใก) 277
dtype: int64
Leave a comment