DataMiningPractice(2)-RefiningDatasets
Change data type
The sale price column is stored with the object (string) dtype. Strings cannot be used in arithmetic, so the column has to be converted to a numeric type. The conversion fails when blank or missing values are mixed in with the numbers, so we use pd.to_numeric, which can handle them. Calling sum() on the raw column shows the problem:
df_last["분양가격(㎡)"].sum()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 df_last["분양가격(㎡)"].sum()
TypeError: can only concatenate str (not "int") to str
**The column can be converted to numeric data with .astype(int) or pd.to_numeric. Trying .astype(int) first:**
df_last["분양가격(㎡)"].astype(int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[30], line 1
----> 1 df_last["분양가격(㎡)"].astype(int)
ValueError: invalid literal for int() with base 10: ' '
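However, .astype(int) fails because the column contains a blank string " ". pd.to_numeric raises a similar error, “ValueError: Unable to parse string " " at position 28”, which points to the same bad input. The blank " " has to be dealt with before the conversion can succeed.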
pd.to_numeric(df_last["분양가격(㎡)"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/anaconda3/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
...
ValueError: Unable to parse string " " at position 28
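The two failures above can be reproduced on a small scale. The following is a minimal sketch, assuming a made-up Series of price strings with one blank entry (the values and the helper `try_convert` are hypothetical, not taken from the dataset). It also shows one way to locate the blank entries that block the conversion.

```python
import pandas as pd

# Hypothetical sample: numeric strings plus one blank entry, similar to the " " in the real column.
prices = pd.Series(["5841", "5652", " ", "5882"], dtype=object)

def try_convert(convert):
    """Run a conversion on `prices`; return the error message, or None if it succeeds."""
    try:
        convert(prices)
    except (ValueError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

# Both strict conversions are expected to fail on the blank string.
astype_error = try_convert(lambda s: s.astype(int))
to_numeric_error = try_convert(lambda s: pd.to_numeric(s))
print(astype_error)
print(to_numeric_error)

# Positions of entries that contain only whitespace.
blank_positions = prices[prices.str.strip() == ""].index.tolist()
print(blank_positions)
```

Finding the blank positions first makes it easier to decide whether to drop those rows or turn them into missing values.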
- Checking the type of NaN. The missing-value marker NaN is a float. The pandas.np module is deprecated, so import numpy directly.
import numpy as np
type(np.nan)
float
- **Force the conversion to numbers with errors='coerce'. Values that cannot be parsed become NaN, and since NaN is a float, the resulting dtype is float64.**
df_last["분양가격"] = pd.to_numeric(df_last["분양가격(㎡)"], errors='coerce')
df_last["분양가격"].sum()
12813275.0
The first row of the new column confirms that its dtype is now float64 rather than object.
df_last["분양가격"].head(1)
0 5841.0
Name: 분양가격, dtype: float64
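As a small illustration of the coercion step, here is a sketch on a made-up Series (the values are hypothetical). It shows that the unparseable entry becomes NaN, that the dtype becomes float64, and that sum() skips the NaN by default.

```python
import pandas as pd

# Hypothetical raw column: strings, with one blank entry that cannot be parsed.
raw = pd.Series(["5841", "5652", " ", "5882"], dtype=object)

# errors="coerce" replaces anything unparseable with NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)

# sum() ignores NaN by default, so only the three valid prices are added.
total = numeric.sum()
print(total)

# Number of entries that were present in the raw data but turned into NaN.
n_coerced = int(numeric.isna().sum() - raw.isna().sum())
print(n_coerced)
```

Coercion is convenient, but it silently discards whatever it cannot parse, so it is worth checking how many values were affected.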
Finding the sale price per pyeong
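The older data on the public data portal (from 2013) is based on the sale price per pyeong, whereas this dataset gives the price per ㎡. One pyeong is about 3.3 ㎡, so to view the price per pyeong, multiply the per-㎡ price by 3.3 and add the result as a new "sale price per pyeong" column (평당분양가격, the seventh column).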
df_last["평당분양가격"] = df_last["분양가격"] * 3.3
df_last.head(1)
| | 지역명 | 규모구분 | 연도 | 월 | 분양가격(㎡) | 분양가격 | 평당분양가격 |
|---|---|---|---|---|---|---|---|
| 0 | 서울 | 전체 | 2015 | 10 | 5841 | 5841.0 | 19275.3 |
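The unit conversion can be sketched on a tiny made-up DataFrame. The constant name `SQM_PER_PYEONG` and the sample rows are assumptions for illustration; the factor 3.3 is the rounded value used in the text.

```python
import pandas as pd

# Rounded conversion factor used in the text (hypothetical constant name).
SQM_PER_PYEONG = 3.3

# Hypothetical sample of per-square-metre prices, including one missing value.
sample = pd.DataFrame({"분양가격": [5841.0, float("nan"), 2221.0]})

# Element-wise multiplication; NaN stays NaN, so missing prices remain missing.
sample["평당분양가격"] = sample["분양가격"] * SQM_PER_PYEONG
print(sample)
```

Keeping the factor in a named constant makes it easy to switch to a more precise value later if needed.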
Summarize the sales price
View a summary of the DataFrame with info(). Columns 5 and 6, the two new columns, have the float64 dtype, while the original price column (column 4) is still object.
df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 지역명 4335 non-null object
1 규모구분 4335 non-null object
2 연도 4335 non-null int64
3 월 4335 non-null int64
4 분양가격(㎡) 4058 non-null object
5 분양가격 3957 non-null float64
6 평당분양가격 3957 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB
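In the info() output above, the original price column has 4058 non-null entries but the converted column has only 3957, so 101 entries that were present as text became NaN during coercion. One way to see which raw values were lost is sketched below on a made-up Series. The sample values, including the comma-formatted number, are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw column: valid strings, a blank, a true missing value, and a comma-formatted number.
raw = pd.Series(["5841", " ", None, "5,652", "5882"], dtype=object)
numeric = pd.to_numeric(raw, errors="coerce")

# Entries that existed in the raw data but could not be parsed as numbers.
failed = raw[raw.notna() & numeric.isna()]

# Tally the offending raw values to decide how to clean them.
print(failed.value_counts())
```

Inspecting the failed values before discarding them helps confirm that only blanks or placeholders were dropped, not real prices in an unexpected format.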
Summarize the original sale price (㎡) column, as it was before the conversion. Because it has the object dtype, describe() returns a summary for string data:
- unique -> number of distinct values
- top -> most frequent value
- freq -> number of occurrences of the most frequent value
df_last["분양가격(㎡)"].describe()
count 4058
unique 1753
top 2221
freq 17
Name: 분양가격(㎡), dtype: object
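The fields of the object-dtype summary can be cross-checked against other pandas calls. This is a minimal sketch on a made-up Series (hypothetical values), relating count, unique, top and freq to count(), nunique() and value_counts().

```python
import pandas as pd

# Hypothetical string column with a repeated value, a blank, and a missing entry.
raw = pd.Series(["2221", "2221", "5841", " ", None], dtype=object)

summary = raw.describe()
counts = raw.value_counts()  # sorted by frequency, most frequent first

print(summary)
print(raw.count(), raw.nunique())      # non-null entries, distinct values
print(counts.index[0], counts.iloc[0])  # most frequent value and how often it occurs
```

Note that the blank string " " counts as a regular value here, which is why the object summary reports more entries than the numeric one.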
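Summarize the converted sale price column, which now holds numeric data. Because it has the float64 dtype, describe() returns numeric statistics:
- std -> standard deviation
- count -> NaN values are excluded, so the count is lower than for the original column
- mean -> average; min, max -> smallest and largest values; 25% -> first quartile (a quarter of the values fall below it); 50% -> median; 75% -> third quartile (a quarter of the values lie above it)
- mean > median: the maximum is far above the rest of the values, and this pulls the mean well above the median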
df_last["분양가격"].describe()
count 3957.000000
mean 3238.128633
std 1264.309933
min 1868.000000
25% 2441.000000
50% 2874.000000
75% 3561.000000
max 12728.000000
Name: 분양가격, dtype: float64
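The relationship between the percentile rows and the mean can be sketched with a few made-up prices. The five values below are hypothetical and are chosen only to mimic a distribution with one very large value.

```python
import pandas as pd

# Hypothetical prices with one value far above the rest (right-skewed).
prices = pd.Series([1868.0, 2441.0, 2874.0, 3561.0, 12728.0])

summary = prices.describe()

# The 25%, 50% and 75% rows are the quartiles; 50% is the median.
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

# A single large value pulls the mean well above the median.
print(summary["mean"], summary["50%"])
```

When the mean sits far above the median like this, the median is usually the better description of a typical price.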