DataMiningPratice(3)-RefiningUnnecessaryColumns

August 12, 2023

Adding a new column and removing unnecessary columns.

Change size classification to exclusive area column

The size category column (규모구분) holds information about the exclusive area (전용면적). Because an exclusive-area range is more intuitive than a size category label, create a new column called Exclusive Area and strip words such as “over” (초과) and “or less” (이하) from the existing size category values to make them concise.

To do this, use the replace function of the str accessor. For example, “전용면적 60㎡초과 85㎡이하” (exclusive area over 60㎡ and up to 85㎡) becomes “60㎡~85㎡”.

More information on pandas string-handling functions: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling

View the unique values of the size category (규모구분) column.

df_last["규모구분"].unique()  # 규모구분: size category

array(['전체', '전용면적 60㎡이하', '전용면적 60㎡초과 85㎡이하', '전용면적 85㎡초과 102㎡이하',
       '전용면적 102㎡초과'], dtype=object)

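Try removing the prefix “전용면적” (exclusive area) from the size category.

# Without .str, Series.replace matches whole cell values, so nothing changes here.
df_last["규모구분"].replace("전용면적", "")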

                    전체
            전용면적 60㎡이하
      전용면적 60㎡초과 85㎡이하
     전용면적 85㎡초과 102㎡이하
           전용면적 102㎡초과
              ...        
                 전체
         전용면적 60㎡이하
   전용면적 60㎡초과 85㎡이하
  전용면적 85㎡초과 102㎡이하
        전용면적 102㎡초과
Name: 규모구분, Length: 4335, dtype: object
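
The call above returns the column unchanged because `Series.replace` compares each whole cell value with the search value by default, whereas `Series.str.replace` works on substrings. A minimal sketch of the difference, using a small hypothetical Series `s` built from the unique values shown earlier rather than the full `df_last`:

```python
import pandas as pd

# Hypothetical sample taken from the unique values of 규모구분.
s = pd.Series(["전체", "전용면적 60㎡이하", "전용면적 60㎡초과 85㎡이하"])

# Series.replace looks for cells equal to "전용면적"; none match, so nothing changes.
whole_value = s.replace("전용면적", "")

# Series.str.replace substitutes the substring inside every cell.
substring = s.str.replace("전용면적", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Note that the substring version still leaves a leading space, which is why the spaces are removed in a later step.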

# With the .str accessor, replace works on substrings.
df_last["Exclusive Area"] = df_last["규모구분"].str.replace("전용면적", "")
df_last["Exclusive Area"] = df_last["Exclusive Area"].str.replace("초과", "~")
df_last["Exclusive Area"] = df_last["Exclusive Area"].str.replace("이하", "")
df_last["Exclusive Area"] = df_last["Exclusive Area"].str.replace(" ", "").str.strip()
df_last["Exclusive Area"]

           전체
          60㎡
      60㎡~85㎡
     85㎡~102㎡
        102㎡~
          ...   
        전체
       60㎡
   60㎡~85㎡
  85㎡~102㎡
     102㎡~
Name: Exclusive Area, Length: 4335, dtype: object
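
The same cleanup can also be written as a single method chain. The sketch below is one possible arrangement, not the code used in this post: the helper name `to_exclusive_area` is hypothetical, and `regex=False` is passed explicitly so that each pattern is treated as literal text.

```python
import pandas as pd

def to_exclusive_area(size_class: pd.Series) -> pd.Series:
    """Shorten labels such as '전용면적 60㎡초과 85㎡이하' to '60㎡~85㎡'."""
    return (
        size_class
        .str.replace("전용면적", "", regex=False)  # drop the "exclusive area" prefix
        .str.replace("초과", "~", regex=False)     # "over" becomes a range marker
        .str.replace("이하", "", regex=False)      # drop "or less"
        .str.replace(" ", "", regex=False)         # remove the remaining spaces
    )

# The five unique values of 규모구분 shown earlier.
labels = pd.Series([
    "전체",
    "전용면적 60㎡이하",
    "전용면적 60㎡초과 85㎡이하",
    "전용면적 85㎡초과 102㎡이하",
    "전용면적 102㎡초과",
])
result = to_exclusive_area(labels)
print(result.tolist())
```

With the real data this would be applied as `df_last["Exclusive Area"] = to_exclusive_area(df_last["규모구분"])`.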

Remove unnecessary columns

Use drop to remove the columns that have already been preprocessed. Many pandas DataFrame methods take an axis argument that says whether the labels refer to rows (axis=0) or columns (axis=1). It usually defaults to 0, meaning rows, so pass axis=1 when dropping columns. Afterward, check whether the memory usage has decreased.

Inspect the DataFrame with .info() and .head(1).

df_last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   지역명             4335 non-null   object 
 1   연도              4335 non-null   int64  
 2   월               4335 non-null   int64  
 3   분양가격            3957 non-null   float64
 4   평당분양가격          3957 non-null   float64
 5   Exclusive Area  4335 non-null   object 
 6   규모구분            4335 non-null   object 
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB

df_last.head(1)

	지역명	연도	월	분양가격	평당분양가격	Exclusive Area	규모구분
0	서울	2015	10	5841.0	19275.3	전체	전체

Use drop to delete unnecessary columns.

Pay attention to the axis when using drop.
axis 0: row, 1: column
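
To see what the axis argument changes, the sketch below uses a small hypothetical frame `demo` (not the real `df_last`). `drop(..., axis=1)` and the `columns=` keyword are two equivalent ways to remove columns, while `axis=0` removes rows by index label.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

by_axis = demo.drop(["b", "c"], axis=1)     # labels are column names
by_keyword = demo.drop(columns=["b", "c"])  # same result, axis implied
rows_removed = demo.drop([0], axis=0)       # labels are index values

print(by_axis.columns.tolist())
print(rows_removed.index.tolist())
```

drop returns a new DataFrame and leaves `demo` untouched, which is why the result is assigned back to a variable.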

df_last = df_last.drop(["규모구분", "분양가격"], axis=1)

# Check if the columns have been successfully deleted.
df_last.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   지역명             4335 non-null   object 
 1   연도              4335 non-null   int64  
 2   월               4335 non-null   int64  
 3   평당분양가격          3957 non-null   float64
 4   Exclusive Area  4335 non-null   object 
dtypes: float64(1), int64(2), object(2)
memory usage: 169.5+ KB

# info() above shows memory usage fell from 237.2+ KB to 169.5+ KB; preview the remaining columns.
df_last.head()

	지역명	연도	월	평당분양가격	Exclusive Area
0	서울	2015	10	19275.3	전체
1	서울	2015	10	18651.6	60㎡
2	서울	2015	10	19410.6	60㎡~85㎡
3	서울	2015	10	18879.3	85㎡~102㎡
4	서울	2015	10	19400.7	102㎡~
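
`info()` reports memory as a lower bound (the trailing `+`) because the contents of object columns are not measured in depth. To compare memory before and after dropping columns more directly, the per-column values from `DataFrame.memory_usage(deep=True)` can be summed. A sketch on a hypothetical three-row frame `demo`, built from values shown above as a stand-in for `df_last`:

```python
import pandas as pd

# Hypothetical three-row stand-in for df_last, built from values shown above.
demo = pd.DataFrame({
    "지역명": ["서울", "서울", "서울"],
    "규모구분": ["전체", "전용면적 60㎡이하", "전용면적 60㎡초과 85㎡이하"],
    "평당분양가격": [19275.3, 18651.6, 19410.6],
})

before = demo.memory_usage(deep=True).sum()  # bytes, including string contents
after = demo.drop(columns=["규모구분"]).memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```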


Wonha Leah Shin
