DataMiningPractice(2)-RefiningDatasets
Change data type
The sale price column is stored as the object (string) type. Strings cannot be used in arithmetic, so the column has to be converted to numerical data. A plain conversion does not work when missing or blank values are mixed in, so we change the data type with pd.to_numeric.
df_last["분양가격(㎡)"].sum()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 df_last["분양가격(㎡)"].sum()
TypeError: can only concatenate str (not "int") to str
**As the TypeError shows, sum() fails because the column holds strings. Convert it to numerical data with .astype(int) or pd.to_numeric.**
df_last["분양가격(㎡)"].astype(int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[30], line 1
----> 1 df_last["분양가격(㎡)"].astype(int)
ValueError: invalid literal for int() with base 10: ' '
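The conversion fails. astype(int) raises the ValueError above, and pd.to_numeric raises a similar one:
'ValueError: Unable to parse string " " at position 28'
–> The column contains whitespace-only entries (" ") that have to be dealt with before the values can be parsed.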
pd.to_numeric(df_last["분양가격(㎡)"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/anaconda3/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
ValueError: Unable to parse string " " at position 28
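One way to see which entries block the conversion is to strip whitespace and look for strings that end up empty. The sketch below is only an illustration: it uses a small made-up Series (named prices here) in place of df_last["분양가격(㎡)"], so the positions it reports are not those of the real dataset.

```python
import pandas as pd

# Hypothetical sample standing in for df_last["분양가격(㎡)"]
prices = pd.Series(["5841", "  ", "2221", None, " "])

# Strings that are empty after stripping whitespace cannot be parsed as numbers.
# Missing values (None) are not strings, so they are not flagged here.
blank_mask = prices.str.strip().eq("")
blank_positions = prices[blank_mask].index.tolist()
print(blank_positions)
```

Applied to the real column, the same mask would show the rows that hold only spaces.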
- Checking the type of NaN, the value used for missing data
type(pd.np.nan)
/var/folders/bg/wtktbqcn3c7334w9nxnw3lxr0000gn/T/ipykernel_73203/907039516.py:1: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.
type(pd.np.nan)
float
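Because the warning above says pandas.np is going away, the same check can be written by importing NumPy directly. The sketch below also shows, on made-up values, why a single NaN is enough to make a column of whole numbers float64.

```python
import numpy as np
import pandas as pd

# NaN is a floating-point value
nan_type = type(np.nan)
print(nan_type)  # <class 'float'>

# A column that holds NaN alongside whole numbers is stored as float64
with_nan = pd.Series([5841, np.nan, 2221])
print(with_nan.dtype)  # float64
```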
- **Force the conversion to numbers with errors='coerce'. Values that cannot be parsed become NaN, and because NaN is a float, the resulting column has dtype float64.**
df_last["분양가격"] = pd.to_numeric(df_last["분양가격(㎡)"], errors = 'coerce')
df_last["분양가격"].sum()
12813275.0
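To make the effect of errors='coerce' easier to see, here is a minimal sketch on a made-up Series (the variable names raw and converted are illustrative, not from the original notebook). Unparseable entries turn into NaN, and sum() skips NaN by default, which is why the total above can be computed.

```python
import pandas as pd

# Hypothetical raw values: numeric strings, a whitespace-only entry, a missing value
raw = pd.Series(["5841", "  ", "2221", None])

# errors="coerce" replaces anything that cannot be parsed with NaN
converted = pd.to_numeric(raw, errors="coerce")
nan_count = int(converted.isna().sum())

# sum() ignores NaN unless skipna=False is passed
total = converted.sum()
print(converted.dtype, nan_count, total)
```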
You can confirm the change in the output: values that looked like integers (5841) are now floats (5841.0), and the dtype is float64 instead of object.
df_last["분양가격"].head(1)
0 5841.0
Name: 분양가격, dtype: float64
Finding the sale price per pyeong
The data from 2013 on the public data portal is based on the sale price per pyeong, while this dataset records the price per ㎡. One pyeong is about 3.3 ㎡, so multiply the per-㎡ price by 3.3 and store the result in a new 평당분양가격 ("sale price per pyeong") column, which becomes the seventh column.
df_last["평당분양가격"] = df_last["분양가격"] * 3.3
df_last.head(1)
| | 지역명 | 규모구분 | 연도 | 월 | 분양가격(㎡) | 분양가격 | 평당분양가격 |
|---|---|---|---|---|---|---|---|
| 0 | 서울 | 전체 | 2015 | 10 | 5841 | 5841.0 | 19275.3 |
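The unit conversion can be sketched on its own as follows. The constant name SQM_PER_PYEONG and the sample prices are made up for illustration. The factor 3.3 is the rounded value used in this post; one pyeong is closer to 3.3058 ㎡, so results differ slightly if the more precise factor is used.

```python
import pandas as pd

# Rounded factor used in this post; 1 pyeong is roughly 3.3058 square metres
SQM_PER_PYEONG = 3.3

# Hypothetical per-square-metre prices
price_per_sqm = pd.Series([5841.0, 2221.0])

# Price for one pyeong = price for one square metre * square metres per pyeong
price_per_pyeong = price_per_sqm * SQM_PER_PYEONG
print(price_per_pyeong.round(1).tolist())
```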
Summarize the sales price
Inspect the columns with info(). Columns 5 and 6 (분양가격 and 평당분양가격) now have the float64 type, while the original 분양가격(㎡) column is still object.
df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 지역명 4335 non-null object
1 규모구분 4335 non-null object
2 연도 4335 non-null int64
3 월 4335 non-null int64
4 분양가격(㎡) 4058 non-null object
5 분양가격 3957 non-null float64
6 평당분양가격 3957 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB
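The non-null counts in the info() output differ between the original and the converted column (4058 vs. 3957), which means 101 entries were turned into NaN by the conversion. A small sketch of how that difference could be counted directly, using made-up values rather than the real df_last:

```python
import pandas as pd

# Hypothetical raw values; "  " and "-" cannot be parsed, None is already missing
raw = pd.Series(["5841", "  ", "2221", None, "-"])
converted = pd.to_numeric(raw, errors="coerce")

missing_before = int(raw.isna().sum())       # missing in the raw column
missing_after = int(converted.isna().sum())  # missing after coercion
newly_missing = missing_after - missing_before
print(missing_before, missing_after, newly_missing)
```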
Summarize the 분양가격(㎡) column, which is the column before the conversion. Because it is an object column, describe() returns the summary for non-numeric data:
- unique –> number of distinct values
- top –> the most frequent value
- freq –> how many times the most frequent value occurs
df_last["분양가격(㎡)"].describe()
count 4058
unique 1753
top 2221
freq 17
Name: 분양가격(㎡), dtype: object
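The meaning of count, unique, top and freq is easier to verify on a tiny example. The Series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical string column: "2221" appears twice, one value is missing
raw = pd.Series(["2221", "2221", "5841", "3012", None])

summary = raw.describe()
print(summary)
```

Here count excludes the missing value, unique counts the distinct strings, top is the string that occurs most often, and freq is how often it occurs.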
Summarize the converted sale price column, which now holds numerical data. Because it is float64, describe() returns the numerical summary:
- count –> NaN values are excluded, so the count is lower than before (3957 instead of 4058)
- mean = average / std = standard deviation / min, max
- 25% –> first quartile (a quarter of the values are at or below it) / 50% = median / 75% = third quartile
- mean > median (3238 vs. 2874) <== the max (12728) lies far above most of the data, so a few very high prices pull the mean up
df_last["분양가격"].describe()
count 3957.000000
mean 3238.128633
std 1264.309933
min 1868.000000
25% 2441.000000
50% 2874.000000
75% 3561.000000
max 12728.000000
Name: 분양가격, dtype: float64
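The gap between mean and median can be reproduced with a few made-up numbers. This is only a sketch of the effect, not a re-analysis of the dataset: one large value raises the mean while the median stays with the bulk of the data.

```python
import pandas as pd

# Hypothetical prices with one value far above the rest
prices = pd.Series([2000.0, 2400.0, 2800.0, 3500.0, 12000.0])

mean_price = prices.mean()
median_price = prices.median()   # same as the 50% row in describe()
q1 = prices.quantile(0.25)       # 25% row: first quartile
q3 = prices.quantile(0.75)       # 75% row: third quartile
print(mean_price, median_price, q1, q3)
```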