[ pandas ] Changing data types and binning data

2021. 8. 30. 21:19

Changing data types

import pandas as pd

lf_bus = pd.read_csv('./저상버스.csv')
display(lf_bus.head())  # display() is available in Jupyter/IPython
lf_bus.info()

	노선번호	인가대수	저상대수	보유율	배차간격	저상버스간격	차량count
0	100	32	32	1.0	8	8	135
1	742	31	24	1.0	8	10	105
2	2312	21	3	0.0	8	56	19
3	4312	29	14	0.0	8	17	65
4	5634	15	11	1.0	8	11	99

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   노선번호     294 non-null    object 
 1   인가대수     294 non-null    int64  
 2   저상대수     294 non-null    int64  
 3   보유율      293 non-null    float64
 4   배차간격     294 non-null    int64  
 5   저상버스간격   294 non-null    object 
 6   차량count  294 non-null    object 
dtypes: float64(1), int64(3), object(3)
memory usage: 16.2+ KB

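Looking at the DataFrame, every value appears to be numeric, but info() shows that some columns (for example 저상버스간격 and 차량count) are actually object dtype. Arithmetic does not work on object columns, so they have to be converted to a numeric type. The outputs that follow have 293 rows rather than 294, so the one row with a missing 보유율 value appears to have been dropped in a step that is not shown here.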

print(lf_bus['저상버스간격'])
print(lf_bus['저상버스간격'].astype('int'))

0       8
1      10
2      56
3      17
4      11
       ..
288     8
289     8
290    11
291     8
292     8
Name: 저상버스간격, Length: 293, dtype: object
0       8
1      10
2      56
3      17
4      11
       ..
288     8
289     8
290    11
291     8
292     8
Name: 저상버스간격, Length: 293, dtype: int32
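
When astype('int') fails on an object column, it usually means some entry is not a clean number. A minimal sketch of one way to find such entries, using pd.to_numeric with errors='coerce'; the sample Series and its '-' entry are made up for illustration and are not taken from the bus data:

```python
import pandas as pd

# Hypothetical sample: numbers stored as strings, plus one bad entry.
s = pd.Series(['8', '10', '56', '-', '11'], name='저상버스간격')

# astype('int') would raise ValueError on '-'.
# to_numeric with errors='coerce' turns unparseable entries into NaN instead.
converted = pd.to_numeric(s, errors='coerce')

print(converted)
print(s[converted.isna()])  # the original entries that could not be parsed
```

Because NaN is a float, the coerced result is float64; once the bad rows are fixed or dropped it can be cast to an integer type with astype.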

That converts a single column. To convert only the columns you want, use a for loop:

int_cols = lf_bus.columns.tolist()
int_cols.remove('노선번호')
print(int_cols)

for col in int_cols:
    lf_bus[col] = lf_bus[col].astype('int')

lf_bus.info()

['인가대수', '저상대수', '보유율', '배차간격', '저상버스간격', '차량count']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 293 entries, 0 to 292
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   노선번호     293 non-null    object
 1   인가대수     293 non-null    int32 
 2   저상대수     293 non-null    int32 
 3   보유율      293 non-null    int32 
 4   배차간격     293 non-null    int32 
 5   저상버스간격   293 non-null    int32 
 6   차량count  293 non-null    int32 
dtypes: int32(6), object(1)
memory usage: 11.4+ KB

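This converts the columns I specified to int32. (astype('int') produced int32 on this machine; on other platforms the same call may give int64. Also note that casting a float column such as 보유율 to int drops the fractional part.)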
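
As an alternative to the loop, DataFrame.astype also accepts a dict that maps column names to dtypes, so several columns can be converted in one call. A small sketch on a made-up miniature of the table (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the bus table; the values are made up.
df = pd.DataFrame({
    '노선번호': ['100', '742', '2312'],
    '인가대수': [32, 31, 21],
    '저상버스간격': ['8', '10', '56'],
    '차량count': ['135', '105', '19'],
})

# Map each column to convert to its target dtype; other columns are untouched.
int_cols = ['저상버스간격', '차량count']
df = df.astype({col: 'int64' for col in int_cols})

print(df.dtypes)
```

Spelling out 'int64' instead of 'int' keeps the result the same across platforms.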

Binning data

count = lf_bus['차량count']

# 1) An integer asks for that many equal-width bins.
lf_bus['범주'] = pd.cut(count, 4, include_lowest=False)
print(lf_bus['범주'])
print('*' * 30)

# 2) min, 25%, 50%, 75%, max from describe() are used as explicit bin edges.
print(count.describe().iloc[3:])
print('*' * 30)
lf_bus['범주'] = pd.cut(count, count.describe().iloc[3:], include_lowest=True)
print(lf_bus['범주'])
print('*' * 30)

# 3) labels=False returns the bin number instead of the interval.
lf_bus['범주'] = pd.cut(count, count.describe().iloc[3:], labels=False, include_lowest=False)
print(lf_bus['범주'])
print('*' * 30)

# 4) A list of labels names the bins.
lf_bus['범주'] = pd.cut(count, count.describe().iloc[3:], labels=['a', 'b', 'c', 'd'], include_lowest=False)
print(lf_bus['범주'])

0      (103.0, 135.0]
1      (103.0, 135.0]
2       (6.872, 39.0]
3        (39.0, 71.0]
4       (71.0, 103.0]
            ...      
288    (103.0, 135.0]
289    (103.0, 135.0]
290     (71.0, 103.0]
291    (103.0, 135.0]
292    (103.0, 135.0]
Name: 범주, Length: 293, dtype: category
Categories (4, interval[float64, right]): [(6.872, 39.0] < (39.0, 71.0] < (71.0, 103.0] < (103.0, 135.0]]
******************************
min      7.0
25%     68.0
50%     92.0
75%    122.0
max    135.0
Name: 차량count, dtype: float64
******************************
0      (122.0, 135.0]
1       (92.0, 122.0]
2       (6.999, 68.0]
3       (6.999, 68.0]
4       (92.0, 122.0]
            ...      
288    (122.0, 135.0]
289    (122.0, 135.0]
290     (92.0, 122.0]
291    (122.0, 135.0]
292    (122.0, 135.0]
Name: 범주, Length: 293, dtype: category
Categories (4, interval[float64, right]): [(6.999, 68.0] < (68.0, 92.0] < (92.0, 122.0] < (122.0, 135.0]]
******************************
0      3.0
1      2.0
2      0.0
3      0.0
4      2.0
      ... 
288    3.0
289    3.0
290    2.0
291    3.0
292    3.0
Name: 범주, Length: 293, dtype: float64
******************************
0      d
1      c
2      a
3      a
4      c
      ..
288    d
289    d
290    c
291    d
292    d
Name: 범주, Length: 293, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

The pd.cut function bins data into categories.

Until now I would define a function with if/else and apply it to each value with map, but pd.cut does the same binning much more simply. Unless labels=False is passed, the result is categorical data by default.
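
To make the comparison concrete, here is a rough sketch of both approaches side by side. The function name to_grade, the sample values, and the thresholds are all chosen just for this illustration:

```python
import pandas as pd

count = pd.Series([135, 105, 19, 65, 99])

# Hypothetical helper: the if/else version, applied per value with map.
def to_grade(x):
    if x <= 68:
        return 'a'
    elif x <= 92:
        return 'b'
    elif x <= 122:
        return 'c'
    else:
        return 'd'

by_map = count.map(to_grade)

# The same thresholds written as bin edges for pd.cut.
by_cut = pd.cut(count, bins=[0, 68, 92, 122, 135], labels=['a', 'b', 'c', 'd'])

print(by_map.tolist())
print(by_cut.tolist())
print(by_map.dtype, by_cut.dtype)
```

Both give the same labels for this sample; the difference is that map returns an object column, while pd.cut returns an ordered categorical.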

Passing a list to labels sets the category names.

If you pass a number of bins instead of explicit edges, the data range is cut into equal-width bins.
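
To see which edges pd.cut picked for the equal-width case, retbins=True returns them along with the result. A minimal sketch on made-up values that share the same minimum (7) and maximum (135) as 차량count:

```python
import pandas as pd

# Hypothetical values; only the min (7) and max (135) mirror the real column.
count = pd.Series([7, 19, 65, 99, 105, 135])

# retbins=True returns the computed bin edges as a second value.
binned, edges = pd.cut(count, 4, retbins=True)

print(edges)
print(binned.value_counts().sort_index())
```

The inner edges are evenly spaced between the minimum and the maximum, and the lowest edge is pushed slightly below the minimum so that the smallest value still lands in the first bin.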

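include_lowest=True makes the first interval include its left edge; with False the left edge is excluded. With explicit edges that start at the minimum, False therefore leaves the minimum value outside every bin, and it becomes NaN. That is why the labels=False output above is float64 rather than an integer type.

In the interval notation, ( means the bound is exclusive (<) and ] means it is inclusive (<=).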
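
A small sketch of the include_lowest behavior, on made-up values that sit exactly on the edges:

```python
import pandas as pd

# Hypothetical values placed on and around the bin edges.
count = pd.Series([7, 68, 69, 135])
edges = [7, 68, 92, 122, 135]

# First interval closed on the left: the minimum (7) is kept.
closed_first = pd.cut(count, edges, include_lowest=True)

# First interval open on the left: 7 is not > 7, so it becomes NaN.
open_first = pd.cut(count, edges, include_lowest=False)

# With labels=False the NaN forces the bin numbers to float.
codes = pd.cut(count, edges, labels=False, include_lowest=False)

print(closed_first)
print(open_first)
print(codes)
```

Note that 68 falls in the first bin and 69 in the second, because every interval is closed on the right.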
