[TIL] Data Analytics Devcourse Day 62 (2) - Sentiment Analysis / Text Data Preprocessing / Yelp Sentiment Analysis Practice

상급닌자연습생 2024. 5. 18. 00:46

Sentiment Analysis

: The process of identifying and classifying the writer's emotional state or attitude from text; it belongs to the fields of text mining and natural language processing.

Basic sentiment categories

Positive
Negative
Neutral

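Application areas of sentiment analysis

Social media sentiment monitoring

Analyze posts on social media platforms to gauge public sentiment and attitudes
Monitor public reaction to specific events, products, brands, political issues, and so on
Measure the effectiveness of marketing

Customer service analysis and consumer insight

Gauge customer satisfaction by analyzing customer service conversations, call center transcripts, emails, and so on
Provide insight for resolving customer complaints
Analyze market needs and expectations when developing new products and services

Healthcare and medicine

Analyze patients' emotional states
Detect early signs of depression, anxiety disorders, and so on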

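Traditional text preprocessing

1. Tokenize

: The process of splitting the raw text into the basic units (tokens) used for analysis

📌 Token
: The object that serves as the basic unit of analysis

- What counts as a token depends on the problem and is up to the user → a whole sentence or a whole paragraph can also be a token
- ex. Splitting the text on whitespace → token = word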
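
As a rough illustration of how the choice of token changes the result, the sketch below splits the same text into sentence tokens and word tokens using only the standard library. The sample text and the simple splitting rules are assumptions made for this example.

```python
import re

# Made-up sample text (an assumption for illustration)
text = "The food was great. The service was slow."

# Token = sentence: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text.strip())

# Token = word: split on whitespace
word_tokens = text.split()

print(sentence_tokens)
print(word_tokens)
```

Note that plain whitespace splitting leaves punctuation attached to words (for example `great.`), which is one reason non-word characters are usually cleaned up before tokenizing.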

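2. Stop Words removal

: The process of removing tokens that occur very frequently but carry little meaning and therefore do not help the analysis

Stop words are defined in a list ahead of time, and any token found in that list is removed.
ex. When token = word:
- Korean: '그리고' (and), '아' (ah), '내가' (I) carry little meaning
- English: 'The', 'a', 'and' carry little meaning
- Yet these words are used extremely often

📌 Stop Words
: Tokens that violate a core premise of text mining: "tokens that appear often play an important role."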
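
The point about frequency can be seen with a quick count. The sketch below is a minimal illustration that assumes a made-up sentence and a tiny hand-written stop word list; it is not the list used later in the practice.

```python
from collections import Counter

# Made-up review text and a tiny hand-written stop word list (both assumptions)
text = "the food was good and the service was good but the wait was long"
stop_words = {"the", "was", "and", "but"}

tokens = text.split()
# Before removal, the most frequent tokens are stop words
print(Counter(tokens).most_common(3))

filtered = [t for t in tokens if t not in stop_words]
# After removal, content words rise to the top
print(Counter(filtered).most_common(3))
```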

3. Stemming

: The process of stripping prefixes or suffixes from a token so that it is reduced to its base stem

Pros: fast and simple
Cons: ignores context, so it can return wrong results (ex. University and Universe are reduced to the same stem)
ex. When token = word: each word is reduced to its base form (stem)
ex. Running → [remove -ing] → Run
ex. Runner → [remove -er] → Run
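
The suffix-stripping idea can be sketched in a few lines. The function below is a toy rule set written for this note (the name `naive_stem` and its rules are assumptions); it is not the Porter algorithm or any library's stemmer, which apply far more rules and may treat individual words differently.

```python
def naive_stem(token: str) -> str:
    """Strip one common English suffix. A toy rule set, not the Porter algorithm."""
    word = token.lower()
    for suffix in ("ing", "er", "ed", "s"):
        # Strip only when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant left by the cut (runn -> run)
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for token in ["Running", "Runner", "played", "butter"]:
    print(token, "->", naive_stem(token))
```

The last word shows the weakness described above: the same rule that handles "Runner" also cuts "butter" down to "but", because no context or dictionary is consulted.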

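+) Lemmatization

: The process of converting a word to its semantic base form (lemma), taking its part of speech and context into account

More complex and slower than stemming, but more accurate
ex. 'are', 'is', 'am' all become 'be'
ex. 'Running' and 'Ran' both become 'Run' (a noun such as 'Runner' keeps its own lemma, 'Runner')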
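
Unlike stemming, lemmatization needs a dictionary and the part of speech. The sketch below mimics that with a hand-written lookup table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names for this example, and a real lemmatizer would rely on a full lexical resource instead.

```python
# Hypothetical lookup table keyed by (word, part of speech); real lemmatizers
# use a full dictionary rather than a hand-written dict.
LEMMA_TABLE = {
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("am", "VERB"): "be",
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("Ran", "VERB"))
print(toy_lemmatize("running", "NOUN"))
```

An irregular form such as "Ran" cannot be reached by cutting suffixes, and the same surface form can map to different lemmas depending on its part of speech.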

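Deep-learning-based text preprocessing

: Uses modified versions of some of the steps that mattered in the traditional pipeline

1. Tokenize

Goes beyond word-level tokens and uses subword tokenization, which splits words into smaller pieces
Use the tokenization scheme chosen by the base model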
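
To make the subword idea concrete, here is a minimal sketch of a greedy longest-match-first split in the style of WordPiece. The vocabulary, the `##` continuation marker, and the function name are assumptions for this example; an actual model ships its own learned vocabulary and tokenizer.

```python
# Hypothetical subword vocabulary; real models ship their own learned vocabulary
VOCAB = {"un", "##believ", "##able", "play", "##ing", "##ed", "[UNK]"}


def subword_tokenize(word: str) -> list[str]:
    """Greedy longest-match-first split, in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in VOCAB:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(subword_tokenize("unbelievable"))
print(subword_tokenize("playing"))
```

Because rare words are assembled from known pieces, the vocabulary can stay small while still covering words the model never saw as a whole.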

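2. Stop Words (rarely used)

Deep learning models often learn the importance of each word from context on their own → there is little need to remove stop words explicitly

3. Stemming & Lemmatization (rarely used)

Normalizing word forms is left for the deep learning model to learn by itself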

Sentiment analysis practice

Data

Yelp, a platform where users share reviews of local restaurants

🔗 Practice link: https://www.kaggle.com/datasets/marklvl/sentiment-labelled-sentences-data-set

Sentiment Labelled Sentences Data Set

From Group to Individual Labels using Deep Features, Kotzias et al., KDD 2015

(※ The file used from the link above is yelp_labelled.txt)

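Problem definition

Build an AI model that predicts users' sentiment from their text reviews

► A binary classification problem with 100 input features (the 100-dimensional sentence embedding) and 2 output classes, to be solved with Logistic Regression

Input	Review sentences (text)
Output	Sentiment - 1: satisfied - 0: dissatisfied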

Step 1. Load the data

import pandas as pd

file_path = 'yelp_labelled.txt'
data = pd.read_csv(file_path, names=['text', 'sentiment'], sep='\t')

data.head()

## Check the target variable
sentiment_counts = data['sentiment'].value_counts()
print(sentiment_counts)

## Visualize the target variable
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
# sort_index() puts class 0 first so the bars line up with the tick labels below
sentiment_counts.sort_index().plot(kind='bar', color='blue', alpha=0.7)
plt.title('Sentiment Counts')
plt.xlabel('Sentiment')
plt.ylabel('Frequency')
plt.xticks([0, 1], ['Negative', 'Positive'], rotation=0)
plt.show()

► There are 500 satisfied (1) and 500 dissatisfied (0) reviews.

Step 2. Preprocess the data

1) Tokenize into words

# Pick an arbitrary example sentence
example_text = data['text'][0]
print(example_text)

# Convert everything to lowercase
example_text = example_text.lower()
print(example_text)

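Next, remove the non-word characters.

`re` package: (regular expression) used to find words or sets of characters that follow a given pattern
`sub(pattern, replacement, target)`: replaces every match of the pattern in the target string with the replacement
- `'\W'`: any non-word character (anything other than a letter, digit, or underscore)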
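
A small check of what `\W` actually matches can help before applying it to the reviews. The sentence below is made up for this example.

```python
import re

# Made-up review sentence (an assumption for illustration)
text = "Wow... didn't expect a 5_star meal!"

# Every character that is not a letter, digit, or underscore becomes a space
cleaned = re.sub(r"\W", " ", text.lower())
print(cleaned.split())
```

Digits and underscores survive, while the apostrophe is treated as a non-word character, so "didn't" is split into "didn" and "t".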

# Remove non-word characters
# using the re package
import re

example_text = re.sub(r'\W', ' ', example_text)
print(example_text)

# Apply tokenization
example_text = example_text.split()
print(example_text)

2) Remove stop words

Now remove the stop words that will not be used in the analysis.

NLTK already provides predefined stop word lists, so it is enough to load one and use it.

# Remove stop words
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords') # download the predefined stop word lists (available for many languages)
stop_words = set(stopwords.words('english')) # use the English list

# How to read the list comprehension:
# for each word (text) in the target sentence (example_text), keep the word only if it is not in stop_words
example_text = [text for text in example_text if text not in stop_words]
print(example_text)

3) Combine steps (1) and (2) above into a single function

# Final function
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocessing(text):
    text = text.lower() # (1) convert to lowercase
    text = re.sub(r'\W', ' ', text) # (2) remove non-word characters
    text = text.split() # (3) tokenize
    text = [t for t in text if t not in stop_words] # (4) remove stop words
    return text

Apply the function to the data with `apply`.

# Add the result to the existing DataFrame
# using apply
data['preprocessed'] = data['text'].apply(preprocessing)
data.head()

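Step 3. Build sentence embeddings

1) Apply the GloVe model

Use GloVe to get an embedding for each word.

Load the 100-dimensional GloVe model.

from gensim.downloader import load

# Higher-dimensional models take longer to load, so use the 100-dimensional one
# Takes about 2 minutes
glove = load('glove-wiki-gigaword-100')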

Applied to one example sentence, it looks like this.

preprocessed_word = data['preprocessed'][0]

for word in preprocessed_word :
    print(word, glove[word])

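2) Apply TF-IDF

A plain average of the GloVe vectors gives every word the same weight, so sentences with different meanings can end up with nearly the same vector.
To counter this, each word's embedding is weighted by its TF-IDF score, which reflects how important the word is.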
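
The effect of weighting can be seen on a toy case. The sketch below uses hand-picked 2-D vectors and weights (all hypothetical, not GloVe or TF-IDF values) and compares how similar two opposite reviews look with and without weights.

```python
import numpy as np

# Hypothetical 2-D word vectors and weights, chosen by hand for illustration
vectors = {
    "place": np.array([0.9, 0.1]),
    "food": np.array([0.8, 0.0]),
    "amazing": np.array([0.1, 0.9]),
    "awful": np.array([0.1, -0.9]),
}
weights = {"place": 0.2, "food": 0.2, "amazing": 0.9, "awful": 0.9}


def embed(tokens, use_weights):
    """Average the word vectors, optionally scaling each one by its weight."""
    scaled = [vectors[t] * (weights[t] if use_weights else 1.0) for t in tokens]
    return np.mean(scaled, axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


s1 = ["place", "food", "amazing"]
s2 = ["place", "food", "awful"]

plain_sim = cosine(embed(s1, False), embed(s2, False))
weighted_sim = cosine(embed(s1, True), embed(s2, True))
print(f"plain: {plain_sim:.2f}, weighted: {weighted_sim:.2f}")
```

With equal weights the shared words dominate and the two sentences look alike; once the distinctive words get larger weights, the two sentence vectors move apart.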

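📌 TF-IDF
: A statistical measure of how important a particular word is within a document
► TF-IDF = TF * IDF (meaning: the relative importance of term t within document d)
- A word with a high TF-IDF value carries more information about that document
- It is a more distinctive feature within the whole document collection D

TF (Term Frequency)
: How often the term occurs relative to all the words in the document
(t: term, d: document)

IDF (Inverse Document Frequency)
: How widely the term is spread across documents
- A word that appears often in every document: importance ↓
- A word that appears often only in particular documents: importance ↑
(D: the whole document collection)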
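
The definitions above do not come with formulas in the text. One common textbook formulation that matches them is given below as an assumption; here f_{t,d} denotes the number of times term t occurs in document d, and libraries often use smoothed or normalized variants of it.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad
\mathrm{IDF}(t, D) = \log \frac{|D|}{|\{\, d \in D : t \in d \,\}|}, \qquad
\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
```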

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform([' '.join(doc) for doc in data['preprocessed']]) # join each token list back into a string; every sentence is treated as one document in the corpus
tfidf_feature_names = vectorizer.get_feature_names_out()

tfidf_feature_names

# The position of a word in the vocabulary can be looked up with numpy
import numpy as np
np.where(tfidf_feature_names == 'wow')[0][0]

# Print the importance of each word within its document

for word in preprocessed_word:
    doc_idx = 0 # 0: this is sentence 0 of the original data
    word_idx = np.where(tfidf_feature_names == word)[0][0] # position of the word in the vocabulary

    # The tf-idf matrix is sparse (mostly zeros), so index it directly;
    # calling toarray() here would rebuild the full dense matrix on every iteration
    value = tfidf_matrix[doc_idx, word_idx]

    print(f'tf-idf value of {word}: {value:.4f}')
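
The printed values follow scikit-learn's default settings rather than a bare TF × IDF product: raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and rows scaled to unit L2 norm. The sketch below reproduces that computation with numpy on a toy corpus (the corpus is an assumption for illustration) to show where the numbers come from.

```python
import numpy as np

# Toy corpus of already tokenized "documents" (an assumption for illustration)
docs = [["good", "food"], ["good", "service"], ["bad", "food", "food"]]
vocab = sorted({w for doc in docs for w in doc})

# Term counts: one row per document, one column per vocabulary word
counts = np.array([[doc.count(w) for w in vocab] for doc in docs], dtype=float)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df = documents containing the word
n_docs = len(docs)
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale every row to unit L2 norm
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

for word, weight in zip(vocab, tfidf[0]):
    print(f"{word}: {weight:.4f}")
```

Because of the row normalization, a weight depends on the other words in the same sentence as well as on the word's own frequency.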

3) Combine steps (1) and (2) above into a single function

import numpy as np

def sentence_embedding(tfidf_matrix, tfidf_feature_names, doc, doc_idx):
    embeddings = []
    for word in doc:
        # Only use words known to both GloVe and the TF-IDF vocabulary
        # Any other word is skipped
        if word in glove and word in tfidf_feature_names:
            # Instead of calling transform, read the weight from the already fitted matrix by index
            # For sentences the vectorizer has never seen, transform is required!
            word_idx = np.where(tfidf_feature_names==word)[0][0]
            # Index the sparse matrix directly; toarray() here would densify the whole matrix for every word
            tfidf_weight = tfidf_matrix[doc_idx, word_idx]

            embeddings.append(glove[word] * tfidf_weight) # weight the word embedding and collect it

    # Return the average of the weighted embeddings
    return np.mean(embeddings, axis=0) if embeddings else np.zeros(100)  # zeros match the GloVe dimension

Applying the function to the sentence at position 0 produces the following embedding.

sentence_embedding(tfidf_matrix, tfidf_feature_names, preprocessed_word, 0) # the example sentence is the one at position 0

Now apply the function to every review and add a new column that holds the sentence embeddings.

# Store the sentence embeddings in a new column
# A lambda is used so that several arguments can be passed,
# and row.name supplies the row number of each row
data['sentence_emb'] = data.apply(lambda row: sentence_embedding(tfidf_matrix,
                                                                 tfidf_feature_names,
                                                                 row['preprocessed'],
                                                                 row.name), axis=1)  # row.name: the row's index label (its position here)
data.head()

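Q. `tfidf_matrix` has already been built, so why is it passed into the embedding function as an argument again?

A. `tfidf_matrix` depends on the set of documents it was built from.
If the set of input documents changes, a new `tfidf_matrix` has to be built.
Rebuilding it on a different document set is effectively a new fit, but the approach rests on the assumption that the data used for training and the data seen at inference or evaluation time have roughly similar distributions and characteristics.
So, as long as the distributions are similar, a `tfidf_matrix` rebuilt from the new data is expected to give broadly similar results.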

Step 4. Train a logistic regression model on the numeric embeddings

1) Split the data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = np.stack(data['sentence_emb'].values)
y = np.stack(data['sentiment'].values)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

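2) Create and train the model

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

3) Predict on the test data

# Run inference on the held-out test data to check what the model learned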
predictions = model.predict(X_test)

Step 5. Evaluate the model

# This is a classification problem, so use the metrics commonly used for classification

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(2)
plt.xticks(tick_marks, ['Negative', 'Positive'])
plt.yticks(tick_marks, ['Negative', 'Positive'])

# Show the actual count in each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > cm.max() / 2 else "black")

plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute Accuracy, Precision, Recall, and F1-Score
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

# Print the results
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
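
To connect these scores with the confusion matrix, the sketch below derives the same four metrics by hand. The matrix values are made up for illustration and are not the results of this practice; the layout (rows = true label, columns = predicted label, class order 0 then 1) follows scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows = true label, columns = predicted label, class order [0, 1]
cm = np.array([[80, 20],
               [10, 90]])

# Flattening a 2x2 matrix row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()           # share of all predictions that are correct
precision = tp / (tp + fp)                # of predicted positives, how many are right
recall = tp / (tp + fn)                   # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
```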

Step 6. Check the predictions on new, made-up sentences

1) Prepare the new data

# Made-up sentences about satisfaction with a restaurant

examples = [
    "This restaurant had the best service I've ever experienced.", # positive
    "Extremely disappointed with the late delivery.", # negative
    "The ambiance was enchanting and very relaxing.", # positive
    "Unfortunately, the food was bad and uninspired.", # negative
    "Amazing cocktails and a vibrant atmosphere!", # positive
    "Waited an hour for our table, even with a reservation.", # negative
    "The pasta dish was a delightful surprise with its rich flavors.", # positive
    "Too noisy to enjoy our meal, and the tables were too close together.", # negative
    "Exceptional customer service and a very friendly staff.", # positive
    "The dessert was undercooked and not what we expected.", # negative
]

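2) Apply the preprocessing, embedding, and prediction steps above

# Preprocess the example sentences
preprocessed_examples = [preprocessing(text) for text in examples]

# Rebuild the TF-IDF matrix from the new sentences, which are assumed to follow a distribution similar to the original reviews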
examples_vectorizer = TfidfVectorizer()
examples_tfidf_matrix = examples_vectorizer.fit_transform([' '.join(doc) for doc in preprocessed_examples])
examples_tfidf_feature_names = examples_vectorizer.get_feature_names_out()

# Combine the TF-IDF weights with the GloVe embeddings to build the sentence embeddings
example_sentence_embs = []
for doc_idx, doc in enumerate(preprocessed_examples):
    example_sentence_embs.append(sentence_embedding(examples_tfidf_matrix,
                                                    examples_tfidf_feature_names,
                                                    doc, doc_idx))

# Run sentiment prediction with the trained model
example_sentence_embs = np.array(example_sentence_embs)
predictions = model.predict(example_sentence_embs)

# Print the results
for idx, (text, pred) in enumerate(zip(examples, predictions)):
    origin_sent = 'Positive' if idx % 2 == 0 else 'Negative' # the examples alternate positive / negative
    pred_sent = 'Positive' if pred == 1 else 'Negative'

    print(f"Sentence: {text} \nActual: {origin_sent} / Predicted: {pred_sent}", end='\n\n')