Change data type

The sale price column is stored with the object (string) dtype. Strings cannot be used in arithmetic, so the column has to be converted to a numeric type. A plain conversion fails when blank or missing values are mixed in with the numbers, so we use pd.to_numeric, which can handle such entries.

df_last["분양가격(㎡)"].sum()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Cell In[26], line 1
----> 1 df_last["분양가격(㎡)"].sum()


TypeError: can only concatenate str (not "int") to str

**The sale price column has the object (string) dtype. Because strings cannot be used in calculations, it has to be converted to a numeric type with .astype(int) or pd.to_numeric.**

df_last["분양가격(㎡)"].astype(int)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[30], line 1
----> 1 df_last["분양가격(㎡)"].astype(int)


ValueError: invalid literal for int() with base 10: '  '

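However, .astype(int) raises a ValueError because some entries contain only whitespace ("  "). These blank entries have to be removed or treated as missing values before the column can be converted. pd.to_numeric fails on the same entries and reports the position of the first one (28):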

pd.to_numeric(df_last["분양가격(㎡)"])
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

File ~/anaconda3/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()


ValueError: Unable to parse string "  "

ValueError: Unable to parse string "  " at position 28
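One way to deal with the whitespace-only entries by hand is to strip them and mark them as missing before converting. The snippet below is a minimal sketch under assumptions: the `raw` Series is a made-up sample standing in for df_last["분양가격(㎡)"], and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample standing in for df_last["분양가격(㎡)"]
raw = pd.Series(["5841", "  ", "5879", None], dtype="object")

# Strip surrounding whitespace; whitespace-only entries become empty strings
stripped = raw.str.strip()

# Replace empty strings with NaN so they no longer block the conversion
cleaned = stripped.mask(stripped == "")

# With only numeric strings and NaN left, the default strict mode succeeds
prices = pd.to_numeric(cleaned)
print(prices.tolist())
```

The result is float64 rather than int64, because the missing entries are represented as NaN.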
  1. Check the type of NaN. NaN is a float, so a column that contains NaN cannot keep an integer dtype. (pd.np is deprecated, so NumPy is imported directly.)
import numpy as np
type(np.nan)

float
  2. **Force the conversion to numbers with errors='coerce', which turns entries that cannot be parsed into NaN. The resulting dtype is float64 because NaN is a float.**
df_last["분양가격"] = pd.to_numeric(df_last["분양가격(㎡)"], errors = 'coerce')
df_last["분양가격"].sum()
12813275.0

The new column has the float64 dtype instead of object.

df_last["분양가격"].head(1)
0    5841.0
Name: 분양가격, dtype: float64
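The effect of errors='coerce' on the dtype can be reproduced on a small example. This is a sketch on a made-up Series, not the real data. The last step uses pandas' nullable "Int64" dtype, an optional alternative that this post does not use, to keep whole numbers alongside missing values.

```python
import pandas as pd

# Hypothetical sample: numeric strings mixed with a whitespace-only entry
raw = pd.Series(["5841", "  ", "5879"], dtype="object")

# Entries that cannot be parsed become NaN instead of raising ValueError
as_float = pd.to_numeric(raw, errors="coerce")
print(as_float.dtype)  # float64, because NaN is a float

# Optional: the nullable integer dtype stores missing values as <NA>
as_nullable_int = as_float.astype("Int64")
print(as_nullable_int.dtype)  # Int64
```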

Finding the sale price per pyeong

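The data from 2013 on the public data portal is based on the sale price per pyeong, while this dataset records the price per ㎡. To view the sale price on a per-pyeong basis, multiply the per-㎡ price by 3.3 (1 pyeong is roughly 3.3 ㎡) and store the result in a new "평당분양가격" (sale price per pyeong) column, which becomes the seventh column.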

df_last["평당분양가격"] = df_last["분양가격"] * 3.3
df_last.head(1)
   지역명  규모구분  연도    월   분양가격(㎡)  분양가격    평당분양가격
0  서울   전체    2015  10  5841       5841.0  19275.3
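The conversion can also be wrapped in a small helper so that the factor is named in one place. This is only a sketch: `SQM_PER_PYEONG` and `price_per_pyeong` are hypothetical names, and the sample values are made up apart from the 5841 shown in the first row.

```python
import pandas as pd

# Rounded factor used in this post; 1 pyeong is roughly 3.3058 ㎡
SQM_PER_PYEONG = 3.3

def price_per_pyeong(price_per_sqm: pd.Series, factor: float = SQM_PER_PYEONG) -> pd.Series:
    """Scale a per-㎡ price to a per-pyeong price; NaN stays NaN."""
    return price_per_sqm * factor

sample = pd.Series([5841.0, float("nan")])
result = price_per_pyeong(sample)
print(result)
```

Missing prices propagate through the multiplication, so rows without a price also have no per-pyeong price.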

Summarize the sales price

Inspect the columns with info(). The two new columns, at index 5 and 6, have the float64 dtype.

df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   지역명      4335 non-null   object 
 1   규모구분     4335 non-null   object 
 2   연도       4335 non-null   int64  
 3   월        4335 non-null   int64  
 4   분양가격(㎡)  4058 non-null   object 
 5   분양가격     3957 non-null   float64
 6   평당분양가격   3957 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB
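In the info() output, the original column has 4058 non-null values and the converted column has 3957, so 101 non-missing entries could not be parsed and became NaN. A quick way to see which raw values were lost is sketched below on a made-up frame. The `df` and `lost` names are illustrative, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample standing in for df_last
df = pd.DataFrame({"분양가격(㎡)": ["5841", "  ", None, "5879", "  "]}, dtype="object")
df["분양가격"] = pd.to_numeric(df["분양가격(㎡)"], errors="coerce")

# Rows where a raw value exists but the converted value is NaN
lost = df["분양가격(㎡)"].notna() & df["분양가격"].isna()

print(lost.sum())                                 # number of entries lost to coercion
print(df.loc[lost, "분양가격(㎡)"].value_counts())  # which raw strings they were
```

Running the same mask against df_last would list the strings that errors='coerce' silently turned into NaN.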

Summarize the original 분양가격(㎡) column, the one from before the conversion. Because its dtype is object, describe() reports categorical statistics:

  • unique: the number of distinct values
  • top: the most frequent value
  • freq: how many times the most frequent value occurs
df_last["분양가격(㎡)"].describe()
count     4058
unique    1753
top       2221
freq        17
Name: 분양가격(㎡), dtype: object
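The top and freq entries correspond to the first row of value_counts(). A minimal sketch on a made-up object Series (not the real data) shows the relationship:

```python
import pandas as pd

# Hypothetical object column with a repeated value, a blank, and a missing entry
raw = pd.Series(["2221", "2221", "3000", "  ", None], dtype="object")

summary = raw.describe()     # count, unique, top, freq
counts = raw.value_counts()  # occurrences per distinct value, most frequent first

print(summary)
print(counts.index[0], counts.iloc[0])  # same information as top and freq
```

Note that the whitespace-only string counts as a regular value here, while None is excluded from count.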

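Summarize the converted sale price column, which now holds numeric data. Because its dtype is float64, describe() reports numeric statistics:

  • std: the standard deviation
  • count: NaN entries are excluded, so the count is lower than for the original column
  • mean: the average; min and max: the smallest and largest values
  • 25%, 50%, 75%: the quartiles; 25% of the values fall below the 25% mark, 50% is the median, and 75% of the values fall below the 75% mark
  • The mean is higher than the median because the maximum lies far above the rest of the data and pulls the mean up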
df_last["분양가격"].describe()
count     3957.000000
mean      3238.128633
std       1264.309933
min       1868.000000
25%       2441.000000
50%       2874.000000
75%       3561.000000
max      12728.000000
Name: 분양가격, dtype: float64
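The gap between the mean and the median can be illustrated with a tiny sample. The five numbers below are simply the min, quartile, and max values from the summary above, reused as a toy dataset. They are not the real distribution, so the mean printed here differs from the real one.

```python
import pandas as pd

# Toy sample built from the summary values above; one value is far larger than the rest
prices = pd.Series([1868.0, 2441.0, 2874.0, 3561.0, 12728.0])

print(prices.mean())    # pulled upward by the single large value
print(prices.median())  # the middle value, unaffected by how large the maximum is
print(prices.quantile([0.25, 0.5, 0.75]))
```

When a few very expensive entries stretch the upper tail like this, the median is usually the more representative "typical" price.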
