Text Data Preprocessing - Cleaning

🐱‍👤지식닌자 2023. 8. 18. 15:36

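Cleaning can be performed both before and after tokenization.

Before tokenization, cleaning means checking for characteristics that interfere with tokenization and filtering out the data that has them. After tokenization, it means checking for characteristics and noise that get in the way of the natural language processing task itself and filtering out that data.

Cleaning is commonly divided into the following three tasks.

Stop word handling
Removing unnecessary tags and special characters
Removing words that appear rarely in the corpus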

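1. Stop word handling

A stop word is a word that appears frequently in sentences but carries little meaning for analysis.

Examples include articles such as "the", "a", and "an", and verbs such as "is", "are", "was", and "were". These words carry almost no information on their own, so they are usually excluded.

Removing stop words from the tokens keeps the vocabulary smaller and easier to manage, which makes later processing more efficient. The NLTK library ships with a stop word list, but because the right set of stop words depends on the task, developers can also define their own.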

# Colab environment
!pip install nltk

import nltk
nltk.download('punkt') # Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('stopwords') # stop word list data

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


sentence = "Removing stop words helps make tokenization more efficient!"

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(sentence)

# Keep only the tokens that are not in the stop word set.
cleaned_tokens = []
for word in tokens:
    if word not in stop_words:
        cleaned_tokens.append(word)

# Equivalent list comprehension:
# cleaned_tokens = [word for word in tokens if word not in stop_words]

print(tokens)
print(cleaned_tokens)

Output >>

['Removing', 'stop', 'words', 'helps', 'make', 'tokenization', 'more', 'efficient', '!']
['Removing', 'stop', 'words', 'helps', 'make', 'tokenization', 'efficient', '!']
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
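
Since stop word lists can be defined by the developer, the following is a minimal sketch of extending a base list with task-specific words. The names base_stop_words, custom_stop_words, and remove_stop_words are hypothetical, and the small base set merely stands in for a library-provided list. The comparison is done in lowercase, on the assumption that the stop word list itself is lowercase.

```python
# Hypothetical base list standing in for a library-provided stop word list.
base_stop_words = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'more'}

# Task-specific additions chosen by the developer (assumed for this example).
custom_stop_words = base_stop_words | {'helps', '!'}

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ['The', 'cleaning', 'step', 'helps', 'make', 'tokens', 'more', 'useful', '!']
print(remove_stop_words(tokens, custom_stop_words))
# ['cleaning', 'step', 'make', 'tokens', 'useful']
```

Keeping the stop words in a set makes each membership check cheap, and building the custom set with a union leaves the base list unchanged.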

* Korean stop word list

https://www.ranks.nl/stopwords/korean

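2. Removing unnecessary tags and special characters

Text collected by web crawling often brings unnecessary tags and special characters into the corpus. These should be cleaned out so that the vocabulary stays easy to manage.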

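import re

clean = re.compile(r'^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9+_.-]+\.[a-zA-Z0-9+_.-]+$')
# The metacharacter '.' (any single character) is escaped with '\' so that it matches a literal period.
# ^: start of the string, $: end of the string
# '-' is placed last inside each [...] so it is a literal hyphen rather than part of a range such as '+-_'.

emails = ['pytorchee@ggamail.com',
            'tensorfloww@hahamail.co.kr',
            'kerakera@papamail.net',
            'hugging@hugmail.com',
            '@lalamail.com',
            'ww.dada.com',
            'mama@mail.']

# Keep only the strings that match the email pattern.
email_list = [mail for mail in emails if clean.match(mail)]
email_list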

Output >>

['pytorchee@ggamail.com',
 'tensorfloww@hahamail.co.kr',
 'kerakera@papamail.net',
 'hugging@hugmail.com']

Printing True or False for each email address

for mail in emails:
    is_valid = clean.match(mail) is not None
    print(f'{mail}: {is_valid}')

Output >>

pytorchee@ggamail.com: True
tensorfloww@hahamail.co.kr: True
kerakera@papamail.net: True
hugging@hugmail.com: True
@lalamail.com: False
ww.dada.com: False
mama@mail.: False
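
The email pattern above accepts or rejects whole strings. For stripping tags and special characters out of crawled text, a sketch along the following lines may be closer to what this section describes. The function name strip_tags_and_symbols and the sample string are hypothetical, and the rule of keeping only ASCII letters, digits, and whitespace is an assumption that would need adjusting for Korean or other non-ASCII text.

```python
import re

def strip_tags_and_symbols(text):
    # Replace anything that looks like an HTML tag, e.g. <p> or </b>, with a space.
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    # Keep ASCII letters, digits, and whitespace; replace everything else with a space.
    no_symbols = re.sub(r'[^A-Za-z0-9\s]', ' ', no_tags)
    # Collapse the runs of whitespace left behind by the replacements.
    return re.sub(r'\s+', ' ', no_symbols).strip()

raw = '<p>Hello, <b>NLP</b> world!! #cleaning</p>'
print(strip_tags_and_symbols(raw))
# Hello NLP world cleaning
```

A regular expression is enough for simple residue like this; badly nested or malformed HTML is usually better handled with a real HTML parser.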

3. Removing words that appear rarely in the corpus

Words that appear too rarely to help the natural language processing task can also be removed.

# Colab environment
!pip install nltk

import nltk
nltk.download('punkt') # Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('stopwords') # stop word list data

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist


sentence = "Removing, tokenization removing stops stop stop words words helps helps helps make make tokenization tokenization more more more efficient!"

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(sentence)

# Keep only the tokens that are not in the stop word set.
cleaned_tokens = []
for word in tokens:
    if word not in stop_words:
        cleaned_tokens.append(word)

# Equivalent list comprehension:
# cleaned_tokens = [word for word in tokens if word not in stop_words]

print(tokens)
print(cleaned_tokens)

# Compute the frequency distribution of the remaining tokens
fdist = FreqDist(cleaned_tokens)

# Remove words that appear min_frequency times or fewer (here, 2 or fewer),
# so only words that appear 3 or more times are kept
min_frequency = 2
cleaned_tokens = [word for word in cleaned_tokens if fdist[word] > min_frequency]

print(cleaned_tokens)

['Removing', ',', 'tokenization', 'removing', 'stops', 'stop', 'stop', 'words', 'words', 'helps', 'helps', 'helps', 'make', 'make', 'tokenization', 'tokenization', 'more', 'more', 'more', 'efficient', '!']
['Removing', ',', 'tokenization', 'removing', 'stops', 'stop', 'stop', 'words', 'words', 'helps', 'helps', 'helps', 'make', 'make', 'tokenization', 'tokenization', 'efficient', '!']

['tokenization', 'helps', 'helps', 'helps', 'tokenization', 'tokenization']
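
The same idea can be sketched with only the standard library, using collections.Counter in place of FreqDist. The function name drop_rare_tokens and the min_count parameter are hypothetical. Note that this sketch uses >=, so a token that appears exactly min_count times is kept, unlike the > comparison in the listing above.

```python
from collections import Counter

def drop_rare_tokens(tokens, min_count=2):
    # Count how many times each token appears in the list.
    counts = Counter(tokens)
    # Keep tokens that appear at least min_count times, preserving their order.
    return [t for t in tokens if counts[t] >= min_count]

tokens = ['stop', 'stop', 'words', 'helps', 'helps', 'helps', 'efficient']
kept = drop_rare_tokens(tokens, min_count=2)
print(kept)
# ['stop', 'stop', 'helps', 'helps', 'helps']
```

Whether the threshold is inclusive or exclusive changes which words survive, so it is worth stating the rule in a comment next to the comparison.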
