Change data type

The sale price column is stored with the object (string) dtype. Strings cannot be used in arithmetic, so the column has to be converted to numeric data. A plain conversion fails when missing or blank values are mixed in, so we change the data type with pd.to_numeric.

df_last["분양가격(㎡)"].sum()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Cell In[26], line 1
----> 1 df_last["분양가격(㎡)"].sum()


TypeError: can only concatenate str (not "int") to str

**Because the column holds strings, it has to be converted to numeric data, either with .astype(int) or with pd.to_numeric.**

df_last["분양가격(㎡)"].astype(int)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[30], line 1
----> 1 df_last["분양가격(㎡)"].astype(int)


ValueError: invalid literal for int() with base 10: '  '

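The conversion fails because some cells contain only whitespace, which cannot be parsed as an integer. pd.to_numeric stops on the same values and reports 'ValueError: Unable to parse string "  " at position 28'. These blank entries have to be removed or treated as missing values.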

pd.to_numeric(df_last["분양가격(㎡)"])
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

File ~/anaconda3/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()


ValueError: Unable to parse string "  "

ValueError: Unable to parse string "  " at position 28
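
One way to remove the whitespace-only entries explicitly is sketched below. It is a minimal illustration on a made-up Series (the name `prices` and its values are assumptions, not the real column), and it is an alternative to coercing with pd.to_numeric rather than the method this post goes on to use.

```python
import pandas as pd

# Hypothetical stand-in for the price column; the values are made up.
prices = pd.Series(["5841", "  ", "5652", " "], dtype=object)

# Strip surrounding whitespace so blank cells become empty strings.
stripped = prices.str.strip()

# Replace the empty strings with NaN, leaving every other value untouched.
cleaned = stripped.mask(stripped.eq(""))

# With the blanks gone, a direct cast to float succeeds.
as_float = cleaned.astype(float)
print(as_float)
```

The cast targets float rather than int because NaN cannot be stored in a plain int64 column.
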
  1. Check the type of NaN
import numpy as np
type(np.nan)

float
  2. **Force the conversion to numbers with errors='coerce'. Values that cannot be parsed become NaN, and because NaN is a float, the dtype of the new column is float64.**
df_last["분양가격"] = pd.to_numeric(df_last["분양가격(㎡)"], errors='coerce')
df_last["분양가격"].sum()
12813275.0

You can confirm that the new column has the float64 dtype instead of object.

df_last["분양가격"].head(1)
0    5841.0
Name: 분양가격, dtype: float64
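
The effect of errors='coerce' on the dtype can be seen on a small example. This is a sketch on made-up values (the Series names are assumptions), comparing a column that contains a blank string with one that does not.

```python
import pandas as pd

# Made-up sample with one whitespace-only entry.
raw = pd.Series(["5841", "  ", "5652"], dtype=object)

# The blank string cannot be parsed, so it is coerced to NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)   # float64, because NaN is a float
print(numeric.sum())   # 11493.0, sum() skips NaN by default

# Without any unparseable value the result stays an integer dtype.
clean = pd.to_numeric(pd.Series(["5841", "5652"], dtype=object))
print(clean.dtype)     # int64
```
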

Finding the sale price per pyeong

The data from 2013 on the public data portal is priced per pyeong, while this dataset is priced per ㎡. To view the sale price per pyeong, multiply the per-㎡ price by 3.3 (1 pyeong is about 3.3 ㎡) and add the result as a new "평당분양가격" (sale price per pyeong) column, the seventh column of the DataFrame.

df_last["평당분양가격"] = df_last["분양가격"] * 3.3
df_last.head(1)
   지역명  규모구분  연도    월   분양가격(㎡)  분양가격    평당분양가격
0  서울    전체      2015  10   5841          5841.0      19275.3
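
The unit conversion can be written as a small helper. This is a sketch only: the function name `price_per_pyeong` and the constant name are hypothetical, and the factor 3.3 is the rounded value used in this post (a more precise figure is about 3.3058 ㎡ per pyeong).

```python
import pandas as pd

# Rounded conversion factor used in the text; about 3.3058 more precisely.
SQM_PER_PYEONG = 3.3


def price_per_pyeong(price_per_sqm: pd.Series, factor: float = SQM_PER_PYEONG) -> pd.Series:
    """Convert a per-square-metre price to a per-pyeong price (hypothetical helper)."""
    return price_per_sqm * factor


# Made-up sample: one real-looking price and one missing value.
sample = pd.Series([5841.0, float("nan")])
result = price_per_pyeong(sample)
print(result)
```

Missing prices stay missing after the multiplication, so the new column has the same number of non-null values as the numeric price column.
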

Summarize the sales price

Inspect the DataFrame with info(). Columns 5 and 6, the newly added 분양가격 and 평당분양가격 columns, have the float64 dtype, while the original 분양가격(㎡) column is still object.

df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   지역명      4335 non-null   object 
 1   규모구분     4335 non-null   object 
 2   연도       4335 non-null   int64  
 3   월        4335 non-null   int64  
 4   분양가격(㎡)  4058 non-null   object 
 5   분양가격     3957 non-null   float64
 6   평당분양가격   3957 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB
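
The info() output shows 4058 non-null values in the original column but only 3957 after conversion, so 101 entries were present yet could not be parsed. A sketch of how such entries might be located is given below, on made-up data (the Series names are assumptions).

```python
import pandas as pd

# Made-up sample: two blank strings and one value that is genuinely missing.
original = pd.Series(["5841", "  ", None, "5652", " "], dtype=object)
converted = pd.to_numeric(original, errors="coerce")

# Present before the conversion, NaN afterwards: these were coerced.
lost = original.notna() & converted.isna()

print(int(lost.sum()))          # number of coerced entries
print(original[lost].tolist())  # the raw strings that failed to parse
```

Looking at the raw strings that failed is a useful check that coercion only discarded blanks and not values that should have been cleaned instead.
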

Summarize the 분양가격(㎡) column, which is the original column before conversion. Because its dtype is object, describe() returns a summary for strings:

  • unique: the number of distinct values
  • top: the most frequent value
  • freq: how many times the most frequent value occurs
df_last["분양가격(㎡)"].describe()
count     4058
unique    1753
top       2221
freq        17
Name: 분양가격(㎡), dtype: object

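Summarize the converted 분양가격 column. Because its dtype is float64, describe() returns numeric statistics:

  • std: the standard deviation
  • count: lower than before (3957 instead of 4058) because values coerced to NaN are excluded
  • mean: the average / min, max: the smallest and largest values / 25%: the first quartile (a quarter of the values fall below it) / 50%: the median / 75%: the third quartile (three quarters of the values fall below it)
  • mean > median: the maximum (12728) lies far above most of the data, which pulls the mean (about 3238) above the median (2874)
df_last["분양가격"].describe()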
count     3957.000000
mean      3238.128633
std       1264.309933
min       1868.000000
25%       2441.000000
50%       2874.000000
75%       3561.000000
max      12728.000000
Name: 분양가격, dtype: float64
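
The gap between the mean and the median can be reproduced on a tiny example. The values below are made up for illustration and are not taken from the dataset; they only mimic its shape, with one value far above the rest.

```python
import pandas as pd

# Made-up prices: four ordinary values and one very large one.
values = pd.Series([2000.0, 2400.0, 2800.0, 3500.0, 12000.0])

print(values.mean())          # 4540.0, pulled upward by the large value
print(values.median())        # 2800.0, unaffected by how large the maximum is
print(values.quantile(0.25))  # 2400.0, the first quartile
print(values.quantile(0.75))  # 3500.0, the third quartile
```

When the mean sits well above the median like this, the median is usually the more representative "typical" price.
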
