Big Data Analysis Engineer (빅데이터분석기사) Practical Exam, Type 2 Summary (2025 Type 2 Practice Problem + Past Exams 9 to 6)

달님🌙 2025. 6. 20. 15:42

Type 2: 2025 practice problem solution
Based on the 아답터 solution video

Long solution

# Use print() if you want to see output
# e.g. print(df.head())

# No need to set the working directory with getcwd(), chdir(), etc.
# File paths cannot access internal drive paths (C: and so on)

import pandas as pd

train = pd.read_csv("data/customer_train.csv")
test = pd.read_csv("data/customer_test.csv")


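# User code

# Submission note
# The code below is only an example; change variable names etc. as needed
# <your DataFrame variable>.to_csv("result.csv", index=False)


# Predicting 총구매액 (total purchase amount) is a regression problem!
# (Last year's video was a classification problem; only the model differs.)

# 1. Check the data (환불금액 has fewer non-null rows than the other columns, so it contains missing values)
# print(train.info())
# print(test.info())

# print(train)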


# 2. Preprocessing
# 1) First separate the independent variables X from the dependent variable y (train/test sets)
x_train = train.drop(['회원ID','총구매액'], axis=1)
# axis=0 drops rows, axis=1 drops columns
# We are predicting 총구매액 for the test data, so drop the 총구매액 column that train already has
# 회원ID (member ID) is just a unique value per row, so drop it (keeping it can cause overfitting) -> drop it from test as well
x_test = test.drop(['회원ID'], axis=1)
y = train['총구매액']


# Keep checking the shapes as you go
# print(x_train.shape, y.shape, x_test.shape)

# 2) Handle missing values
# A missing 환불금액 (refund amount) means there was no refund, so fill it with 0
x_train['환불금액'] = train['환불금액'].fillna(0)
x_test['환불금액'] = test['환불금액'].fillna(0)

# print(x_train.isna().sum())
# print(x_test.isna().sum())



# 3) Scale the numeric variables
# Two options: min-max scaling vs. standard scaling
from sklearn.preprocessing import MinMaxScaler
# import sklearn
# print(dir(sklearn.preprocessing))

scaler = MinMaxScaler()
# The numeric columns are everything except 주구매상품 and 주구매지점; you could type them by hand, but select_dtypes does it for you
num_columns = x_train.select_dtypes(exclude='object').columns
# print(num_columns)


x_train[num_columns] = scaler.fit_transform(x_train[num_columns])
# Do not fit on the test data! Only apply the scaler that was fitted on train
x_test[num_columns] = scaler.transform(x_test[num_columns])

# 4) Encode the categorical variables
# train: pizza chicken cola cider -> 0 1 2 3
# test: pizza chicken cola cider beer -> 0 1 2 3 4  # beer was not in train, so this raises an error

# Use a set difference to check that test - train is empty
# train - test may be non-empty (those categories were learned but simply go unused in test)

# print(set(x_test['주구매상품']) - set(x_train['주구매상품']))
# print(set(x_train['주구매상품']) - set(x_test['주구매상품'])) # may be non-empty
# print(set(x_test['주구매지점']) - set(x_train['주구매지점']))

# Label encoding vs. one-hot encoding
# Label encoding (fit_transform refits the same encoder for each column)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
x_train['주구매상품'] = encoder.fit_transform(x_train['주구매상품'])
x_test['주구매상품'] = encoder.transform(x_test['주구매상품'])

x_train['주구매지점'] = encoder.fit_transform(x_train['주구매지점'])
x_test['주구매지점'] = encoder.transform(x_test['주구매지점'])


## train: A B C D E -> 0 1 2 3 4  (this is what fit_transform does)
## test: A B D E -> 0 1 3 4  (this is what transform should give; calling fit_transform on test would wrongly produce 0 1 2 3)

## Preprocessing done

# 3. Split the data
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y, test_size = 0.2)
print(x_train.shape, x_val.shape, y_train.shape, y_val.shape)


# 4. Train and validate the model
from sklearn.ensemble import RandomForestRegressor
# print(dir(RandomForestRegressor))
model = RandomForestRegressor()
model.fit(x_train, y_train)
y_val_pred = model.predict(x_val)


# 5. Evaluate
from sklearn.metrics import root_mean_squared_error, r2_score
rmse = root_mean_squared_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)
print(rmse, r2)


# 6. Save
y_pred = model.predict(x_test)
print(y_pred)

result = pd.DataFrame(y_pred, columns=['pred'])
result.to_csv('result.csv', index= False)

# 7. Check the generated file
result = pd.read_csv('result.csv')
print(result)
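
The encoding notes in the long solution (fit on train, only transform test, and check that test - train is empty first) can be tried on toy data. Below is a minimal pure-Python sketch of that idea, not the sklearn implementation: the product names and the helper names `mapping` and `encode` are made up for illustration. LabelEncoder behaves in a similar way, raising a ValueError when `transform` meets a label it did not see during `fit`.

```python
# Toy stand-ins for a categorical column (hypothetical values, not exam data)
train_col = ["pizza", "chicken", "cola", "cider"]
test_col = ["pizza", "cola", "beer"]  # "beer" never appears in train

# "fit": learn one integer code per category from train only (sorted order)
mapping = {label: code for code, label in enumerate(sorted(set(train_col)))}

# Check before encoding: categories in test that train never saw
unseen = set(test_col) - set(train_col)
print("unseen in train:", unseen)


def encode(values, mapping):
    """'transform': reuse the mapping learned from train."""
    return [mapping[v] for v in values]  # KeyError if a value was never seen


train_codes = encode(train_col, mapping)
print(train_codes)

try:
    test_codes = encode(test_col, mapping)
except KeyError as missing:
    test_codes = None
    print("no code for", missing)
```

If the set difference is non-empty, one option is the approach used in the short solution: concatenate train and test and encode them together.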

Short solution

# Use print() if you want to see output
# e.g. print(df.head())

# No need to set the working directory with getcwd(), chdir(), etc.
# File paths cannot access internal drive paths (C: and so on)

import pandas as pd

train = pd.read_csv("data/customer_train.csv")
test = pd.read_csv("data/customer_test.csv")

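# User code
# 1) Check the data -> 2) Preprocess and separate -> 3) Split again (train_test_split)
# 4) Train/validate (random forest) -> 5) Evaluate (metrics -> RMSE) -> 6) Save
# Submission note
# The code below is only an example; change variable names etc. as needed
# <your DataFrame variable>.to_csv("result.csv", index=False)

# 1. Check the data
# print(train.info())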

# 2. Preprocessing
# 1) Separate features and target, then concat train and test
x = train.drop(['총구매액'], axis = 1)
y = train['총구매액']
x_full = pd.concat([x,test], axis = 0)
x_full = x_full.drop(['회원ID'], axis = 1)
# print(x_full.shape)

# 2) Fill missing values
# print(x_full.isnull().sum())
x_full['환불금액'] = x_full['환불금액'].fillna(0)

# print(x_full.isnull().sum())

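# 3) Encode the categorical variables (with a random forest, scaling the numeric columns can be skipped)
# get_dummies encodes only the categorical columns automatically, so the number of columns grows:
# each of the 42 categories of 주구매상품 and 주구매지점 becomes its own True/False column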
x_full = pd.get_dummies(x_full)
# print(x_full.shape)
# print(dir(pd))


# 3. Split back into train and test
x_train = x_full[:train.shape[0]] # you could also hard-code train.shape[0] as 3500
x_test = x_full[train.shape[0]:]
# print(x_train.shape, x_test.shape)

# 3-2. Split the training data 8:2 (so test_size = 0.2; 80% of 3500 rows is 2800)
# 80% is used for training and 20% for validation, hence the name x_val (validation)
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y, test_size = 0.2)

# print(x_train, x_val, y_train, y_val)

# print(x_train.shape, x_val.shape, y_train.shape, y_val.shape)

# 4. Train and validate the model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
y_val_pred = model.predict(x_val)
# print(y_val_pred)

# 5. Evaluate
from sklearn.metrics import root_mean_squared_error, r2_score
rmse = root_mean_squared_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)
# print(rmse, r2)

# 6. Save and check the result
y_pred = model.predict(x_test)
result = pd.DataFrame({'pred': y_pred}) # name the column pred, as in the long solution
result.to_csv('result.csv', index = False)
result = pd.read_csv('result.csv')
print(result)
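
The reason the short solution concatenates train and test before calling get_dummies is that encoding them separately can leave the two frames with different columns. Here is a small sketch on made-up data (the `amount` and `store` columns and all values are hypothetical) that compares the two approaches.

```python
import pandas as pd

# Toy frames (hypothetical values); category "C" exists only in train
train_x = pd.DataFrame({"amount": [10, 20, 30], "store": ["A", "B", "C"]})
test_x = pd.DataFrame({"amount": [15, 25], "store": ["A", "B"]})

# Encoding separately: test has no store_C column, so the columns differ
sep_train = pd.get_dummies(train_x)
sep_test = pd.get_dummies(test_x)
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding after concat: one shared set of columns, then slice back by row count
full = pd.get_dummies(pd.concat([train_x, test_x], axis=0))
n_train = train_x.shape[0]
enc_train = full[:n_train]
enc_test = full[n_train:]
print(enc_train.shape, enc_test.shape)
print(list(enc_train.columns) == list(enc_test.columns))
```

A model fitted on `sep_train` would see a different set of features than `sep_test` provides, while `enc_train` and `enc_test` always line up.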

Past exams 9 to 6 summary

Exam 8 solution

# Type 2, past exam 8
import pandas as pd
train = pd.read_csv("https://raw.githubusercontent.com/lovedlim/bigdata_analyst_cert/main/part4/ch8/churn_train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/lovedlim/bigdata_analyst_cert/main/part4/ch8/churn_test.csv")

# 1) Check the data
# print(train.info())
# print(test.info())
print(train.shape) # (4116, 19)
print(test.shape) # (1764, 18)

# 2) Preprocessing
# 1. concat
# 2. Clean up (drop the numeric target column, drop the ID column, fill nulls)
# 3. Encode
# 4. Split back into train and test data
# print(train.isnull().sum())
# print(test.isnull().sum())
# print(train['customerID'].nunique()) # must be dropped

# print(train.columns)
x = train.drop(['TotalCharges'], axis = 1)
y = train['TotalCharges']
# print(x.shape) # concat takes a list, so use square brackets!
x_full = pd.concat([x, test], axis=0)
# print(x_full.shape)
x_full = x_full.drop(['customerID'], axis = 1)
# print(x_full.shape)
# No fillna(0) needed -> there are no missing values

# Encode
x_full = pd.get_dummies(x_full)

x_train = x_full[:4116]
x_test = x_full[x_train.shape[0]:]
# print(x_train.shape, x_test.shape)


# 3) Split off validation data (train_test_split)
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y, test_size = 0.2, random_state = 0)


# 4) Modeling
from sklearn.ensemble import RandomForestRegressor # for classification use RandomForestClassifier
import sklearn
print(sklearn.ensemble.__all__)
model = RandomForestRegressor(random_state = 0)
model.fit(x_train, y_train) # fit uses the training pair (x_train, y_train)
y_val_pred = model.predict(x_val) # the validation predictions come from the validation features x_val
# print(y_val_pred)

# 5) Evaluate
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_val,y_val_pred)
# print(mae)

# 6) Save
y_pred = model.predict(x_test)
# Either form works; the second assignment simply overwrites the first
result = pd.DataFrame(y_pred, columns = ['pred'])
result = pd.DataFrame({'pred' : y_pred})

result.to_csv('result.csv', index=False)
result = pd.read_csv('result.csv')
print(result)
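
The practice problem above was scored with RMSE, while this exam 8 solution uses MAE. To see how the two differ, here is a small hand calculation on made-up numbers (the values in `y_true` and `y_hat` are hypothetical, not taken from the churn data).

```python
import math

# Made-up targets and predictions (hypothetical values)
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 300.0, 460.0]

# Per-row errors: 10, -10, 0, 60
errors = [p - t for t, p in zip(y_true, y_hat)]

# MAE: mean of absolute errors -> (10 + 10 + 0 + 60) / 4
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error -> sqrt((100 + 100 + 0 + 3600) / 4)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(mae, rmse)
```

Because the errors are squared before averaging, the single large miss (60) pulls RMSE well above MAE, so the two metrics can rank models differently.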

Exam 9, Type 2

# Problem definition
# Metric: f1-macro
# Target: 농약검출여부 (whether pesticide was detected)
# Final file: result.csv (one column, pred)

# Load libraries and data
import pandas as pd
# train = pd.read_csv("farm_train.csv")
# test = pd.read_csv("farm_test.csv")
train = pd.read_csv("https://raw.githubusercontent.com/lovedlim/bigdata_analyst_cert_v2/main/part4/ch9/farm_train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/lovedlim/bigdata_analyst_cert_v2/main/part4/ch9/farm_test.csv")


# 1. Check the data
# print(train.shape) # (4000, 9)
# print(test.shape) # (1000, 8)

# 2. Preprocessing
# (1) concat and drop
# print(train.isnull().sum())
x_train = train.drop(['농약검출여부'], axis = 1)
x_full = pd.concat([x_train, test])
y = train['농약검출여부']
# print(x_full.shape)

# print(x_train)
# (2) Fill missing values (fillna)
# none
# (3) Encode
x_full = pd.get_dummies(x_full)
# print(x_full.shape)
# (4) Split back into train and test
x_train = x_full[:4000]
x_test = x_full[4000:]


# 3. Split off a validation set (model_selection train_test_split)
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y, test_size = 0.2, random_state = 0)
print(x_train.shape, x_val.shape, y_train.shape, y_val.shape )

# 4. Modeling (ensemble: random forest)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state = 0)
model.fit(x_train, y_train)
pred = model.predict(x_val)

# 5. Evaluate (sklearn.metrics; lgb <- given in the problem)
from sklearn.metrics import f1_score
fs = f1_score(y_val, pred, average = 'macro')
# print(help(f1_score))
print(fs)

# 6. Save (predict on x_test and put the result in a DataFrame)
pred = model.predict(x_test) 
result = pd.DataFrame({'pred': pred})
result.to_csv("result.csv", index= False)
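
Since exam 9 is scored with f1-macro, it may help to see what `average = 'macro'` computes. The sketch below works it out by hand on made-up labels: the values in `y_true` and `y_hat` and the helper `f1_for_class` are hypothetical, and it assumes the classes are exactly those that appear in `y_true`.

```python
# Made-up labels (hypothetical values): 0 = not detected, 1 = detected
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]


def f1_for_class(y_true, y_hat, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Macro average: compute F1 per class, then take the plain (unweighted) mean
classes = sorted(set(y_true))
per_class = [f1_for_class(y_true, y_hat, c) for c in classes]
macro_f1 = sum(per_class) / len(per_class)

print(per_class, macro_f1)
```

Each class counts equally in the mean no matter how many rows it has, so a model that does poorly on the rarer class loses a lot of score under this metric.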

License: Attribution, NonCommercial, NoDerivatives