[Pandas] Learning DataFrame with Pandas (1)

graph-dev 2024. 1. 18. 04:15

Pandas is a tool you forget quickly if you don't use it. Here I've written up what I studied on DataCamp again.

Finding the mean and median of a specific column with pandas

Suppose we have a DataFrame called sales. We can inspect its contents with head() and info(), then compute the mean and median of a single column, here weekly_sales.

# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame
# (info() prints its report itself and returns None, so print() also shows "None")
print(sales.info())

# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())

# Print the median of weekly_sales
def pct50(column):
    return column.quantile(0.5)

print(sales["weekly_sales"].agg(pct50))

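Oh, by the way, pandas also has a built-in median() method. I've added it here.

sales["weekly_sales"].median()

It gives the same result as the pct50 function above, since the 50th percentile is the median. I took the long way around.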

<script.py> output:
       store type  department       date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
    0      1    A           1 2010-02-05      24924.50       False          5.728                 0.679         8.106
    1      1    A           1 2010-03-05      21827.90       False          8.056                 0.693         8.106
    2      1    A           1 2010-04-02      57258.43       False         16.817                 0.718         7.808
    3      1    A           1 2010-05-07      17413.94       False         22.528                 0.749         7.808
    4      1    A           1 2010-06-04      17558.09       False         27.050                 0.715         7.808
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10774 entries, 0 to 10773
    Data columns (total 9 columns):
     #   Column                Non-Null Count  Dtype         
    ---  ------                --------------  -----         
     0   store                 10774 non-null  int64         
     1   type                  10774 non-null  object        
     2   department            10774 non-null  int32         
     3   date                  10774 non-null  datetime64[ns]
     4   weekly_sales          10774 non-null  float64       
     5   is_holiday            10774 non-null  bool          
     6   temperature_c         10774 non-null  float64       
     7   fuel_price_usd_per_l  10774 non-null  float64       
     8   unemployment          10774 non-null  float64       
    dtypes: bool(1), datetime64[ns](1), float64(4), int32(1), int64(1), object(1)
    memory usage: 641.9+ KB
    None
    
    23843.95014850566
    12049.064999999999
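As a quick check of the claim that median() and quantile(0.5) agree, here is a minimal sketch. The `toy_sales` DataFrame and its values are made up for illustration and are not the DataCamp sales data.

```python
import pandas as pd

# Hypothetical stand-in for the sales DataFrame (values are made up)
toy_sales = pd.DataFrame({"weekly_sales": [100.0, 200.0, 300.0, 400.0, 1000.0]})

def pct50(column):
    # The 50th percentile is the median
    return column.quantile(0.5)

mean_value = toy_sales["weekly_sales"].mean()
median_builtin = toy_sales["weekly_sales"].median()
median_custom = toy_sales["weekly_sales"].agg(pct50)

print(mean_value)      # 400.0
print(median_builtin)  # 300.0
print(median_custom)   # 300.0
```

The single large value (1000.0) pulls the mean above the median. The real output above shows the same pattern, with a mean of about 23844 against a median of about 12049.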

Finding the maximum and minimum

This time, let's find the latest and earliest dates in the date column with the max() and min() methods.

# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())

The latest date is October 26, 2012, and the earliest is February 5, 2010.

2012-10-26 00:00:00
2010-02-05 00:00:00
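Because min() and max() on a datetime column return Timestamp objects, subtracting them gives the length of the period covered. The sketch below assumes a small made-up `toy` DataFrame. Only the first and last dates are taken from the output above; the middle date is invented.

```python
import pandas as pd

# Hypothetical dates standing in for sales["date"]
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2010-02-05", "2011-07-15", "2012-10-26"])}
)

first_date = toy["date"].min()   # earliest Timestamp
last_date = toy["date"].max()    # latest Timestamp
span = last_date - first_date    # a Timedelta

print(first_date)  # 2010-02-05 00:00:00
print(last_date)   # 2012-10-26 00:00:00
print(span.days)   # 994
```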

Finding the IQR with pandas

IQR is short for inter-quartile range. It is the difference between the 75th and 25th percentiles (the third and first quartiles), and it is said to be useful for catching outliers.

I used the .agg() method to apply a custom function to the column.

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales['temperature_c'].agg(iqr))
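To show how the IQR can help with outliers, here is a minimal sketch of the common 1.5 × IQR rule (Tukey's fences). This rule is not part of the original exercise, and the `temps` values are made up, with 60.0 planted as an obvious outlier.

```python
import pandas as pd

# Hypothetical temperatures; 60.0 is planted as an obvious outlier
temps = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 60.0], name="temperature_c")

q1 = temps.quantile(0.25)
q3 = temps.quantile(0.75)
iqr_value = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged (a convention, not a law)
lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value

outliers = temps[(temps < lower) | (temps > upper)]
print(iqr_value)          # 6.0
print(outliers.tolist())  # [60.0]
```

The 1.5 multiplier is only a widely used default, so it may need adjusting depending on the data.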

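You can also apply it to several columns at once. Just select the columns from the DataFrame with double square brackets [[ ]], that is, index it with a list of column names.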

# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))

temperature_c           16.583
fuel_price_usd_per_l     0.073
unemployment             0.565
dtype: float64

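You can also pass several functions to .agg(). For example, to print the median along with the IQR, put both in a list inside agg(). Here NumPy's median function is used. Recent pandas versions may emit a FutureWarning when a NumPy function such as np.median is passed this way; the string "median" is an alternative.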

# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))


        temperature_c  fuel_price_usd_per_l  unemployment
iqr            16.583                 0.073         0.565
median         16.967                 0.743         8.099
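The list passed to .agg() can mix custom functions with the string names of built-in aggregations. Here is a minimal sketch on a made-up `toy` DataFrame (the values are not from the sales data), using the string "median" in place of np.median.

```python
import pandas as pd

# Hypothetical data reusing two column names from the sales DataFrame
toy = pd.DataFrame({
    "temperature_c": [10.0, 20.0, 30.0, 40.0, 50.0],
    "unemployment": [7.0, 8.0, 8.0, 9.0, 10.0],
})

def iqr(column):
    # Difference between the 75th and 25th percentiles
    return column.quantile(0.75) - column.quantile(0.25)

# One row per aggregation, one column per input column
summary = toy.agg([iqr, "median"])
print(summary)
```

The rows of the result are labelled with the function name (iqr) and the string (median), so a single value can be looked up with, for example, summary.loc["iqr", "temperature_c"].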

Convenient. Let's use pandas a lot.
