Adding new columns and removing unnecessary columns.

Convert the size category into an exclusive area column

The size category column (규모구분) encodes the exclusive area (전용면적) of each unit. Because the area range itself is easier to read than the category label, create a new column called Exclusive Area and strip words such as 초과 (exceeding) and 이하 (or less) from the size category values to make them concise.

To do this, use the replace method of the str accessor. For example, “전용면적 60㎡초과 85㎡이하” (exclusive area over 60㎡ and up to 85㎡) becomes “60㎡~85㎡”.

  • More on pandas string-handling methods: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling
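
As a rough illustration of the idea above, the following sketch applies chained str.replace calls to a small stand-in Series. The variable names size and area are made up for this example, the values are the five size categories that appear in this post, and regex=False is passed only to make it explicit that the patterns are plain text.

```python
import pandas as pd

# Hypothetical stand-in for df_last["규모구분"], holding the five categories from this post.
size = pd.Series([
    "전체",
    "전용면적 60㎡이하",
    "전용면적 60㎡초과 85㎡이하",
    "전용면적 85㎡초과 102㎡이하",
    "전용면적 102㎡초과",
])

# Each .str.replace works on substrings, so the label is trimmed step by step.
area = (
    size.str.replace("전용면적", "", regex=False)  # drop the "exclusive area" prefix
        .str.replace("초과", "~", regex=False)     # "exceeding" becomes a range marker
        .str.replace("이하", "", regex=False)      # drop "or less"
        .str.replace(" ", "", regex=False)         # remove remaining spaces
)

print(area.tolist())
```

Note that the "60㎡이하" category ends up as just "60㎡", so the "or less" meaning is implied rather than shown.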

View the unique values of the size category (규모구분) column.

df_last["규모구분"].unique()  # unique values of the size category column
array(['전체', '전용면적 60㎡이하', '전용면적 60㎡초과 85㎡이하', '전용면적 85㎡초과 102㎡이하',
       '전용면적 102㎡초과'], dtype=object)

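Strip the 전용면적 (exclusive area) prefix from the size category

# Series.replace only matches whole cell values, so the "전용면적" substring is left untouched.
df_last["규모구분"].replace("전용면적", "")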
0                      전체
1              전용면적 60㎡이하
2        전용면적 60㎡초과 85㎡이하
3       전용면적 85㎡초과 102㎡이하
4             전용면적 102㎡초과
              ...        
4330                   전체
4331           전용면적 60㎡이하
4332     전용면적 60㎡초과 85㎡이하
4333    전용면적 85㎡초과 102㎡이하
4334          전용면적 102㎡초과
Name: 규모구분, Length: 4335, dtype: object
# With the .str accessor, replace works on substrings, so the prefix is actually removed.
df_last["Exclusive Area"] = df_last["규모구분"].str.replace("전용면적", "")
df_last["Exclusive Area"] = df_last["Exclusive Area"].str.replace("초과", "~")
df_last["Exclusive Area"] = df_last["Exclusive Area"].str.replace("이하", "")
df_last["Exclusive Area"] = df_last["Exclusive Area"].str.replace(" ", "").str.strip()
df_last["Exclusive Area"]
0             전체
1            60㎡
2        60㎡~85㎡
3       85㎡~102㎡
4          102㎡~
          ...   
4330          전체
4331         60㎡
4332     60㎡~85㎡
4333    85㎡~102㎡
4334       102㎡~
Name: Exclusive Area, Length: 4335, dtype: object
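
The difference between the first attempt (Series.replace) and the working version (Series.str.replace) can be sketched on a two-value Series. The names s, whole, sub and rx are illustrative only. The last line shows one possible alternative, Series.replace with regex=True, which also substitutes inside strings.

```python
import pandas as pd

# Two of the size category values from this post.
s = pd.Series(["전체", "전용면적 60㎡이하"])

# Series.replace compares against the whole cell value, so nothing matches here.
whole = s.replace("전용면적", "")

# Series.str.replace substitutes inside each string.
sub = s.str.replace("전용면적", "", regex=False)

# Series.replace with regex=True treats the pattern as a regex and also edits substrings.
rx = s.replace("전용면적", "", regex=True)

print(whole.tolist())
print(sub.tolist())
print(rx.tolist())
```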

Remove unnecessary columns

Now that the size category column has been preprocessed, remove it with drop. Many pandas DataFrame methods take an axis option that specifies whether to operate along rows or columns. It usually defaults to 0, which for drop means the labels refer to rows, so pass axis=1 to drop columns. Afterwards, check whether the memory usage has decreased.

Inspect the DataFrame with .info() and .head(1).
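
A minimal sketch of how the axis option changes what drop removes, using a made-up three-column DataFrame (the column names a, b and c are placeholders, not columns from df_last):

```python
import pandas as pd

# Toy DataFrame; not the real apartment price data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# axis=0 (the default): the labels are looked up in the index, so rows are removed.
rows_dropped = df.drop([0, 1], axis=0)

# axis=1: the labels are looked up in the columns, so columns are removed.
cols_dropped = df.drop(["b", "c"], axis=1)

# The columns= keyword is an equivalent way to drop columns without axis.
same_cols = df.drop(columns=["b", "c"])

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
```

drop returns a new DataFrame by default, so the result has to be assigned back to keep the change.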

df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   지역명             4335 non-null   object 
 1   연도              4335 non-null   int64  
 2   월               4335 non-null   int64  
 3   분양가격            3957 non-null   float64
 4   평당분양가격          3957 non-null   float64
 5   Exclusive Area  4335 non-null   object 
 6   규모구분            4335 non-null   object 
dtypes: float64(2), int64(2), object(3)
memory usage: 237.2+ KB
df_last.head(1)
   지역명    연도   월    분양가격   평당분양가격  Exclusive Area  규모구분
0   서울  2015  10  5841.0  19275.3              전체    전체

Use drop to delete unnecessary columns.

  • Pay attention to the axis when using drop.
  • axis 0: row, 1: column
df_last = df_last.drop(["규모구분", "분양가격"], axis=1)
# Check if the columns have been successfully deleted.
df_last.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4335 entries, 0 to 4334
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   지역명             4335 non-null   object 
 1   연도              4335 non-null   int64  
 2   월               4335 non-null   int64  
 3   평당분양가격          3957 non-null   float64
 4   Exclusive Area  4335 non-null   object 
dtypes: float64(1), int64(2), object(2)
memory usage: 169.5+ KB
# Memory usage dropped from 237.2+ KB to 169.5+ KB after removing the two columns.
df_last.head()
   지역명    연도   월   평당분양가격  Exclusive Area
0   서울  2015  10  19275.3              전체
1   서울  2015  10  18651.6             60㎡
2   서울  2015  10  19410.6         60㎡~85㎡
3   서울  2015  10  18879.3        85㎡~102㎡
4   서울  2015  10  19400.7            102㎡~
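
The "+" in the memory usage reported by .info() means the figure is a lower bound, because the contents of object (string) columns are not measured by default. One way to compare memory before and after dropping a column is memory_usage(deep=True), sketched here on a made-up DataFrame whose column names are placeholders:

```python
import pandas as pd

# Toy data with two string columns and one float column.
df = pd.DataFrame({
    "region": ["서울"] * 1000,
    "size": ["전체"] * 1000,
    "price": [5841.0] * 1000,
})

# deep=True also counts the memory held by the strings themselves.
before = df.memory_usage(deep=True).sum()
after = df.drop(columns=["size"]).memory_usage(deep=True).sum()

print(before, after)
```

Calling df.info(memory_usage="deep") reports the same kind of full measurement in the .info() summary.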
