[TIL] Data Analysis Devcourse Day 63 (2) - NLP Frameworks / Sentence Classification / Deep Learning-Based Sentence Classification


상급닌자연습생 2024. 5. 19. 23:54

Natural Language Processing (NLP)

Goals

Enable computers to understand and interpret human language
Use text to solve problems and provide a better user experience

NLP Problems

Text Understanding

Question Answering (QA)
Reading Comprehension
Information Retrieval

Text Generation

Sentence generation (Text Generation)
Summarization (Text Summarization)
Translation (Neural Machine Translation)

Text Classification & Tagging

Sentence classification (Text Classification)
Named Entity Recognition (NER)
Part-of-Speech tagging (POS tagging)

Text Relation Extraction

Sentence-level relation extraction (Relation Extraction)

NLP Frameworks

1. Natural Language Toolkit (NLTK)

: A collection of packages implementing traditional NLP techniques

Used for preprocessing and for NLP methods that predate deep learning

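2. PyTorch, TorchText

: An open-source ML library developed by Facebook (now Meta)

A framework specialized for deep learning

TorchText

: An NLP-focused library provided as part of the PyTorch project

Provides interfaces that make it easy to implement deep learning models, from early architectures to recent ones
Provides modules for data pre- and post-processing and for model training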

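3. HuggingFace

: A community-driven library ecosystem specialized for NLP

Research results from many fields beyond NLP (image, audio, generation, etc.) can be shared and reused
Includes useful features for using and training models: demo sites (Spaces) and dataset upload/download (Datasets)
Requires research artifacts to follow a common interface
Keeps the code workflow consistent across models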

4. KoNLPy

: A Python library specialized for processing Korean-language data

Provides Korean-specific preprocessing: morphological analysis, POS tagging and extraction

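Sentence Classification

: An NLP problem in which text data is used to decide which of a fixed set of classes a text belongs to

Goal

Understand the meaning of a text and classify it in a structured way

Sub-problems (= tasks that are themselves sentence classification)

Sentiment Analysis
Topic Classification: given a set of candidate topics, detect and assign the topic a text belongs to
Intent Detection: identify the intent of an utterance → enables faster routing of work
Language Detection: e.g., detecting the source language in translation

Higher-level problems (= tasks that use sentence classification to solve something more complex)

Text summarization
Text generation
QA chatbots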

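Deep Learning-Based Sentence Classification

1. Recurrent Neural Network (RNN)

: An early deep learning model for text processing, designed to mimic how a person reads and understands text

Process

(1) Take in the words one at a time.
(2) Produce new information based on what has been understood so far.
(3) Repeat steps (1) and (2) (hence "recurrent") until every word has been processed.
(4) Classify based on the information produced at the end.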
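
The four steps above can be sketched as a forward pass in NumPy. This is only a minimal illustration under my own assumptions: the layer sizes, the random (untrained) weights, and the name `rnn_classify` are all hypothetical, and a real model would learn the weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
embed_dim, hidden_dim, num_classes = 4, 8, 3

# Randomly initialised (untrained) parameters of a vanilla RNN classifier.
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))    # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden -> hidden
b_h = np.zeros(hidden_dim)
W_hy = rng.normal(scale=0.1, size=(hidden_dim, num_classes))  # hidden -> classes
b_y = np.zeros(num_classes)


def rnn_classify(word_vectors):
    """Run a vanilla RNN over the words and classify from the last state."""
    h = np.zeros(hidden_dim)          # nothing has been "understood" yet
    for x in word_vectors:            # take one word at a time
        # Combine what was understood so far (h) with the new word (x).
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    logits = h @ W_hy + b_y           # classify from the final state
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()            # class probabilities


sentence = rng.normal(size=(5, embed_dim))  # 5 made-up word vectors
probs = rnn_classify(sentence)
print(probs)
```

Because only the final hidden state reaches the classifier, information from early words has to survive every update in the loop, which is one reason attention was introduced later.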

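2. Attention Mechanism

: A text analysis algorithm in which the deep learning model automatically decides which words in the input text to focus its attention on

The model itself judges how informative each word is → better performance, and richer information can be extracted
To solve a classification problem with the attention mechanism:
- Use the sentence embedding built from the attention-weighted words
- Output a single sentence representation that covers the meaning of all the words
- Predict the target class from that summarized sentence representation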
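
One simple way to picture this is attention pooling: score every word, turn the scores into weights, and take the weighted sum as the sentence embedding. The sketch below is an assumption-laden toy (random untrained parameters, a single hypothetical `query` vector), not the multi-head self-attention that RoBERTa actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_classes = 8, 3

# Hypothetical, untrained parameters.
query = rng.normal(size=hidden_dim)                 # "what matters" vector
W_out = rng.normal(size=(hidden_dim, num_classes))  # sentence -> class scores


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_classify(word_states):
    """Pool word representations with attention, then classify the sentence."""
    scores = word_states @ query / np.sqrt(hidden_dim)  # one score per word
    weights = softmax(scores)                           # attention weights, sum to 1
    sentence_vec = weights @ word_states                # weighted sum = sentence embedding
    return weights, softmax(sentence_vec @ W_out)       # class probabilities


word_states = rng.normal(size=(6, hidden_dim))  # 6 made-up word representations
weights, probs = attention_classify(word_states)
print(weights)  # how much each word contributes
print(probs)
```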

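Transfer Learning, Fine-Tuning

: The process of taking a model that has already learned the characteristics of the data on one task (here, a classification problem) and applying it to a more complex problem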

Hands-on: Deep Learning-Based Sentence Classification (with HuggingFace)

🔗 Hugging Face : https://huggingface.co/models

Open the Hugging Face page above and click [Models] at the top.

For our NLP task, find [Natural Language Processing] in the left-hand menu and select [Text Classification], which covers sentence classification.

We will use a RoBERTa-based sentiment analysis model trained on Twitter data.

Click the link below, or type twitter roberta base sentiment latest into the search box and open the result.

https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

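The twitter-roberta-base-sentiment-latest model

Trained on about 124M tweets from January 2018 to December 2021 → output: positive / negative / neutral
Built on RoBERTa, an improved version of BERT; further trained on tweet data and fine-tuned for sentiment analysis on the TweetEval benchmark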

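Preprocessing in HuggingFace (Tokenize)

Preprocessing differs from one deep learning model to another, and HuggingFace provides the preprocessing code that matches each model.

`Auto`: pass the name of the model's HF page as input and the matching Tokenizer is provided automatically
Stop-word removal and stemming are not used
Tokenization output: the token indices, plus other inputs the model needs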

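from transformers import AutoTokenizer

# Placeholder: replace with the "<model author>/<model name>" shown on the HF page
MODEL = "model-author/model-name"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# text: the input string to tokenize
tokenizer(text)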
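
Besides the token indices (`input_ids`), the "other model inputs" typically include an `attention_mask` that tells the model which positions are real tokens and which are padding. The sketch below mimics that idea in plain Python; `pad_batch`, the token IDs, and `pad_id=1` are my own illustrative assumptions, and in practice the tokenizer does this for you.

```python
def pad_batch(batch_ids, pad_id):
    """Pad ID lists to equal length and build the matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two sentences of different lengths.
batch = pad_batch([[0, 11, 12, 2], [0, 13, 2]], pad_id=1)
print(batch["input_ids"])       # [[0, 11, 12, 2], [0, 13, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

This matters mostly when several sentences are fed to the model at once, because a batch has to be rectangular.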

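Token Index

: A mapping value that links a token to its embedding vector; it is the position of that vector in the stacked embedding matrix

Advantage 1. Lower memory use

Token = a string → takes more memory
Token index = an integer → saves memory

Advantage 2. Embedding vectors belong to the model, which makes them easy to manage

Deep learning models do not use standalone embeddings such as Word2Vec or GloVe
(∵ the setting a deep learning model is trained in differs from that of Word2Vec or GloVe.)
Each model designs its own embedding to suit its characteristics ► the embedding is inserted at the front of the model architecture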
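
To make the index-to-vector mapping concrete, here is a small NumPy sketch. The four-word vocabulary and the embedding values are made up for illustration; a real model has tens of thousands of rows and learns the values during training.

```python
import numpy as np

# Hypothetical toy vocabulary: token string -> index.
vocab = {"<s>": 0, "</s>": 1, "good": 2, "food": 3}

# One embedding row per index (placeholder values instead of learned ones).
embedding_dim = 4
embedding_matrix = np.arange(
    len(vocab) * embedding_dim, dtype=np.float32
).reshape(len(vocab), embedding_dim)

tokens = ["<s>", "good", "food", "</s>"]
input_ids = np.array([vocab[t] for t in tokens])

# Integer indexing picks the matching rows of the embedding matrix.
embedded = embedding_matrix[input_ids]
print(input_ids.tolist())  # [0, 2, 3, 1]
print(embedded.shape)      # (4, 4)
```

The model only ever receives the integers; the lookup table itself lives inside the model, which is the "belongs to the model" point above.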

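Loading a Model in HuggingFace

With the `Auto` model loader, a trained model can be loaded simply from the name of its HF page

from transformers import AutoModelForSequenceClassification

# Placeholder: replace with the "<model author>/<model name>" shown on the HF page
MODEL = "model-author/model-name"
model = AutoModelForSequenceClassification.from_pretrained(MODEL)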

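Config

: Information about the settings the model was built and trained with (in HuggingFace, mainly architecture hyperparameters and the label mapping `id2label`)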

Step 1. Load the data

import pandas as pd

# Tab-separated file: review text, then a 0/1 sentiment label (no header row)
file_path = 'yelp_labelled.txt'
data = pd.read_csv(file_path, names=['text', 'sentiment'], sep='\t')

data.head()

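Step 2. Preprocessing

from transformers import AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

Let's look at an example sentence.

text = data['text'][0]
print(text)

# By default the tokenizer returns a dictionary of Python lists
# To feed a deep learning model, the output must be in tensor form,
# so set return_tensors to 'pt' (PyTorch) up front
tokenized = tokenizer(text, return_tensors='pt')
print(tokenized)

※ Tokens are not simply split on whitespace.

Let's convert the indices back to the original text pieces.

# To see the actual tokens as strings,
# use the convert_ids_to_tokens function
tokenizer.convert_ids_to_tokens(tokenized['input_ids'].tolist()[0])
# This model's tokens also carry a marker on the character that starts a word
# (a token that follows a space is prefixed with 'Ġ')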
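
To see why the tokens do not line up with whitespace, here is a toy subword tokenizer. It is only a sketch of the general idea: the vocabulary is made up, and it uses greedy longest-match splitting rather than the byte-level BPE that RoBERTa's tokenizer actually uses.

```python
# Hypothetical subword vocabulary; a real tokenizer learns its entries from data.
VOCAB = {"un", "believ", "able", "food", "s", "good"}


def subword_tokenize(word, vocab=VOCAB, unk="<unk>"):
    """Greedily split one word into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece starts at this position
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


print(subword_tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(subword_tokenize("foods"))         # ['food', 's']
print(subword_tokenize("xyz"))           # ['<unk>']
```

Splitting into pieces lets a fixed-size vocabulary cover words it has never seen as a whole, which is also why stemming is not needed beforehand.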

Step 3. Load the model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(MODEL)

print(model)

# [Note] Look up the embedding values for the token indices
# Using the model structure, run only the embedding part

emb = model.roberta.embeddings(input_ids=tokenized['input_ids'])
print('Embedding values: ')
print(emb)

# 9 tokens were fed in,
# and each token is converted into a 768-dimensional vector
print('Embedding output shape: ', emb.shape)

Step 4. Run the model

# To run the model, pass the token values (input_ids)
# as a keyword argument
# https://huggingface.co/docs/transformers/main/en/model_doc/roberta#transformers.RobertaForSequenceClassification.forward.input_ids

output = model(input_ids = tokenized['input_ids'])

# Alternatively, as in the model card's usage example,
# the tokenizer output can be unpacked directly
# https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest#full-classification-example

output = model(**tokenized)

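Step 5. Inspect and interpret the output

# The model output
# It holds several fields, some of which are also used during training
print(output)

# Take the scores (logits), which are the values we are after
scores = output[0][0].detach().numpy()
print(scores)

Let's convert the scores into probabilities.

# Convert the scores into probabilities
# This uses the softmax function,
# taken here from an external package

from scipy.special import softmax

scores = softmax(scores)
print(scores)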
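
For reference, softmax itself is short enough to write by hand: exponentiate each score and divide by the sum, so the results are positive and add up to 1. The sketch below is a plain-Python version under a hypothetical name (`softmax_list`, so it does not shadow the scipy import), with made-up logits.

```python
import math


def softmax_list(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for (negative, neutral, positive).
probs = softmax_list([-2.0, 0.5, 3.0])
print(probs)
```

Subtracting the maximum does not change the result, because the same factor cancels in the numerator and denominator; it only keeps the exponentials in a safe range.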

To interpret the result, load the Config, which exposes the settings the model was trained with.

# Load the config to inspect the model's settings

from transformers import AutoConfig
config = AutoConfig.from_pretrained(MODEL)

config

# Finally, pick the sentiment from the index with the highest probability
import numpy as np

max_prob_index = int(np.argmax(scores))  # plain int for the id2label lookup
results = config.id2label[max_prob_index]
results
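
Instead of keeping only the top class, it can be useful to print every class with its probability. The sketch below does that in plain Python; the `id2label` dictionary is an assumed stand-in for `config.id2label`, and the probabilities are made up.

```python
# Assumed label mapping, standing in for config.id2label.
id2label = {0: "negative", 1: "neutral", 2: "positive"}

# Made-up probabilities standing in for the softmax output.
probs = [0.05, 0.15, 0.80]

# Rank class indices from most to least probable.
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
for rank, i in enumerate(ranking, start=1):
    print(f"{rank}) {id2label[i]} {probs[i]:.4f}")

best_label = id2label[ranking[0]]
```

Seeing the runner-up probability helps spot borderline cases, for example a review that is only narrowly classified as neutral.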

Let's wrap the steps above in a function.

# Everything above as a single function
def sentiment_analysis(sentence, tokenizer, model, config):
    # Tokenize and run the model on a single sentence
    tokenized = tokenizer(sentence, return_tensors='pt')
    output = model(input_ids = tokenized['input_ids'])

    # Logits -> probabilities
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)

    # Most probable class index and its label
    max_prob_index = int(np.argmax(scores))
    results = config.id2label[max_prob_index]
    return results, max_prob_index

sent = data['text'][2]
res, idx = sentiment_analysis(sent, tokenizer, model, config)
print(sent)
print('-> ', idx, res)

Step 6. Build the evaluation data

from sklearn.model_selection import train_test_split

X = np.stack(data['text'].values)
y = np.stack(data['sentiment'].values)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

For the comparison, I wrote a small helper function.

The earlier sentiment analysis had two outputs: 0 = Negative, 1 = Positive.

The deep learning model has three outputs: 0 = Negative, 1 = Neutral, 2 = Positive.

So the helper keeps 0 as 0, discards 1 by mapping it to -1, and maps 2 to 1.

def convertResults(index):
    # 0 (Negative) -> 0, 1 (Neutral) -> -1 (discarded), 2 (Positive) -> 1
    if index == 0 : return index
    elif index == 1 : return -1
    elif index == 2 : return 1

# Run inference over the whole evaluation set
predictions = []

for test in X_test :
    res, idx = sentiment_analysis(test, tokenizer, model, config)
    idx = convertResults(idx)
    predictions.append(idx)

predictions = np.array(predictions)

predictions

# This is a classification problem, so use a metric common in classification

from sklearn.metrics import confusion_matrix
# Fix the label order (-1 = Neutral, 0 = Negative, 1 = Positive) so the matrix
# is always 3x3 and lines up with the tick labels below
cm = confusion_matrix(y_test, predictions, labels=[-1, 0, 1])

import matplotlib.pyplot as plt
# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(3)
plt.xticks(tick_marks, ['Neutral', 'Negative', 'Positive'])
plt.yticks(tick_marks, ['Neutral', 'Negative', 'Positive'])

# Show the count in each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > cm.max() / 2 else "black")

plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()
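
Since neutral predictions (-1) can never match a 0/1 label, a single number that complements the confusion matrix is the accuracy over the non-neutral predictions, reported together with how often the model answered neutral. The helper below is a hypothetical sketch with made-up labels, and it assumes a non-empty prediction list.

```python
def accuracy_without_neutral(y_true, y_pred, neutral=-1):
    """Accuracy over samples not predicted neutral, plus the neutral rate."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p != neutral]
    neutral_rate = (len(y_pred) - len(kept)) / len(y_pred)
    if not kept:  # every prediction was neutral
        return float("nan"), neutral_rate
    correct = sum(t == p for t, p in kept)
    return correct / len(kept), neutral_rate


# Made-up labels: 0 = negative, 1 = positive, -1 = neutral prediction.
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, -1, 1, 1]
acc, neutral_rate = accuracy_without_neutral(y_true, y_pred)
print(acc, neutral_rate)  # 0.75 0.2
```

Reporting the neutral rate alongside the accuracy matters, because a model could look very accurate simply by answering neutral on every hard sentence.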

Step 7. Check the classification results on made-up data

# Made-up restaurant reviews expressing satisfaction or dissatisfaction

examples = [
    "This restaurant had the best service I've ever experienced.", # positive
    "Extremely disappointed with the late delivery.", # negative
    "The ambiance was enchanting and very relaxing.", # positive
    "Unfortunately, the food was bad and uninspired.", # negative
    "Amazing cocktails and a vibrant atmosphere!", # positive
    "Waited an hour for our table, even with a reservation.", # negative
    "The pasta dish was a delightful surprise with its rich flavors.", # positive
    "Too noisy to enjoy our meal, and the tables were too close together.", # negative
    "Exceptional customer service and a very friendly staff.", # positive
    "The dessert was undercooked and not what we expected.", # negative
]

# Run predictions on the made-up data

for idx, exp in enumerate(examples) :
    res, index = sentiment_analysis(exp, tokenizer, model, config)
    # The examples alternate: even positions are positive, odd positions negative
    origin_sent = 'positive' if idx % 2 == 0 else 'negative'
    if index == 0 : pred_sent = 'negative'
    elif index == 1 : pred_sent = 'neutral'
    else : pred_sent = 'positive'

    print(f"Sentence: {exp} \nTrue sentiment: {origin_sent} / Predicted: {pred_sent}", end='\n\n')

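► The machine learning model got about 2 of these wrong, while the deep learning model got all of them right.

On these ten examples, at least, the deep learning model clearly performs better.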