DataMiningPractice(2)-RefiningDatasets

August 11, 2023

Change data type

The sale price column is stored with the object (string) dtype. Strings cannot be used in arithmetic, so the column has to be converted to a numeric type. A plain conversion fails when missing or blank values are mixed in, so we convert with pd.to_numeric.

df_last["분양가격(㎡)"].sum()

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Cell In[26], line 1
----> 1 df_last["분양가격(㎡)"].sum()

TypeError: can only concatenate str (not "int") to str

**Because the column has the object (string) dtype, it must be converted to a numeric type with .astype(int) or pd.to_numeric.**

df_last["분양가격(㎡)"].astype(int)

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[30], line 1
----> 1 df_last["분양가격(㎡)"].astype(int)

ValueError: invalid literal for int() with base 10: '  '

This raises a ValueError because some cells contain only whitespace ("  "), which int() cannot parse, so these whitespace-only strings need to be removed or treated as missing. A plain pd.to_numeric call fails on the same cells:

pd.to_numeric(df_last["분양가격(㎡)"])

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

File ~/anaconda3/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "  "

ValueError: Unable to parse string "  " at position 28
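One way to deal with the whitespace-only cells is to strip them and treat the resulting empty strings as missing before converting. The following is a minimal sketch on a small hypothetical Series (`price` is a stand-in, not the real `df_last` column), assuming whitespace-only strings are the only unparseable values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_last["분양가격(㎡)"]: numeric strings,
# one whitespace-only cell, and one missing value.
price = pd.Series(["5841", "  ", np.nan, "2221"], dtype=object)

# Strip surrounding whitespace; NaN stays NaN under the .str accessor.
stripped = price.str.strip()

# Whitespace-only cells are now empty strings; list where they occur.
blank_positions = stripped[stripped == ""].index.tolist()

# Treat empty strings as missing, then convert to a numeric dtype.
cleaned = pd.to_numeric(stripped.replace("", np.nan))

print(blank_positions)
print(cleaned.dtype)
```

Listing the blank positions first makes it easy to confirm that nothing other than empty cells is being discarded.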

Checking the type of NaN

NaN is a float, which is why a column that contains NaN cannot keep an integer dtype. The pd.np alias is deprecated and was removed in pandas 2.0, so NumPy is imported directly.

import numpy as np

type(np.nan)

float

**Force the conversion to numbers with errors='coerce', which turns values that cannot be parsed into NaN. The resulting dtype is float64 because NaN is a float.**

df_last["분양가격"] = pd.to_numeric(df_last["분양가격(㎡)"], errors = 'coerce')

df_last["분양가격"].sum()

12813275.0

You can confirm that the new column has the float64 dtype: the value 5841 is now displayed as 5841.0.

df_last["분양가격"].head(1)

0    5841.0
Name: 분양가격, dtype: float64
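The effect of errors='coerce' can be reproduced on a small sample. This is only a sketch: `raw` is a hypothetical three-value Series, not the real column.

```python
import pandas as pd

# Hypothetical sample mimicking the raw column: digits stored as
# strings plus one whitespace-only cell.
raw = pd.Series(["5841", "  ", "2221"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)  # float64, because NaN is a float
print(converted.sum())  # 8062.0, sum() skips NaN by default (skipna=True)
```

Because coercion silently discards values, it is worth checking how many entries became NaN before relying on the totals.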

Finding the sale price per pyeong

The data from 2013 on the public data portal is based on the pre-sale price per pyeong. To view this dataset's price per ㎡ on the same per-pyeong basis, multiply it by 3.3 (1 pyeong is about 3.3 ㎡) and store the result in a new "평당분양가격" (sale price per pyeong) column, which becomes the seventh column.

df_last["평당분양가격"] = df_last["분양가격"] * 3.3
df_last.head(1)

	지역명	규모구분	연도	월	분양가격(㎡)	분양가격	평당분양가격
0	서울	전체	2015	10	5841	5841.0	19275.3
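The conversion is a single column-wise multiplication. A minimal sketch on a hypothetical two-row frame (the constant name `SQM_PER_PYEONG` is introduced here for illustration only):

```python
import pandas as pd

# Rounded factor used in the text; 1 pyeong is about 3.3058 square meters.
SQM_PER_PYEONG = 3.3

# Hypothetical frame: one real-looking price per square meter and one NaN.
df = pd.DataFrame({"분양가격": [5841.0, float("nan")]})

# Multiplication is element-wise; NaN rows stay NaN.
df["평당분양가격"] = df["분양가격"] * SQM_PER_PYEONG

print(df.round(1))
```

Rows whose price is NaN remain NaN in the new column, so the missing-value count carries over unchanged.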

Summarize the sales price

Inspect the columns with info(). Columns 5 and 6 (분양가격 and 평당분양가격) have the float64 dtype.

df_last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   지역명      4335 non-null   object 
 1   규모구분     4335 non-null   object 
 2   연도       4335 non-null   int64  
 3   월        4335 non-null   int64  
 4   분양가격(㎡)  4058 non-null   object 
 5   분양가격     3957 non-null   float64
 6   평당분양가격   3957 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB
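The info() output shows 4058 non-null values in 분양가격(㎡) but only 3957 in 분양가격, so 101 non-null entries could not be parsed and became NaN. A sketch of how such entries might be counted and inspected, using a hypothetical sample rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: two parseable strings, two blank cells, one NaN.
raw = pd.Series(["5841", "  ", np.nan, "2221", "  "], dtype=object)
converted = pd.to_numeric(raw, errors="coerce")

# Present in the raw column but NaN after conversion: these were unparseable.
lost = raw.notna() & converted.isna()
n_lost = int(lost.sum())

print(n_lost)                       # 2
print(raw[lost].unique().tolist())  # ['  ']
```

Looking at the distinct lost values shows whether only blank cells were dropped or whether some real prices were stored in an unexpected format.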

Summarize the original 분양가격(㎡) column, which still has the object dtype. For an object column, describe() reports:

count: number of non-null values
unique: number of distinct values
top: the most frequent value
freq: how many times the most frequent value occurs

df_last["분양가격(㎡)"].describe()

count     4058
unique    1753
top       2221
freq        17
Name: 분양가격(㎡), dtype: object
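The top and freq entries correspond to the first row of value_counts(). A small sketch with a hypothetical object Series:

```python
import pandas as pd

# Hypothetical object column: "2221" appears twice, plus one missing value.
raw = pd.Series(["2221", "2221", "5841", None], dtype=object)

summary = raw.describe()     # count, unique, top, freq for object data
counts = raw.value_counts()  # sorted by frequency, missing values excluded

print(summary)
print(counts.index[0], counts.iloc[0])  # most frequent value and its count
```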

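Summarize the converted 분양가격 column, which is numeric (float64). For a numeric column, describe() reports:

count: number of non-null values (NaN is excluded, so the count is lower than for the original column)
mean: average
std: standard deviation
min, max: smallest and largest values
25%: first quartile (25% of the values are at or below it)
50%: median
75%: third quartile (75% of the values are at or below it)

The mean (3238) is noticeably higher than the median (2874) because the maximum (12728) lies far above the rest of the data and pulls the mean up.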

df_last["분양가격"].describe()

count     3957.000000
mean      3238.128633
std       1264.309933
min       1868.000000
25%       2441.000000
50%       2874.000000
75%       3561.000000
max      12728.000000
Name: 분양가격, dtype: float64
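The gap between mean and median can be illustrated with made-up numbers. In this sketch the five prices are hypothetical, and the last one plays the role of an unusually expensive region:

```python
import pandas as pd

# Hypothetical prices; the last value is a deliberate outlier.
prices = pd.Series([2000.0, 2400.0, 2900.0, 3500.0, 12000.0])

stats = prices.describe()
print(stats["mean"])  # 4560.0, pulled up by the outlier
print(stats["50%"])   # 2900.0, the median is unaffected

# Without the outlier the mean drops back near the median.
print(prices.iloc[:-1].mean())  # 2700.0
```

When a column is skewed like this, the median is usually the more representative "typical" price.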


Wonha Leah Shin
