Natural Language Processing (NLP) - Stopword Removal

🐱‍👤지식닌자 2023. 3. 15. 04:25

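Stopwords are words that appear frequently in sentences but carry little meaning for analysis.
Examples include articles such as "the", "a", and "an", and forms of the verb "be" such as "is", "are", "was", and "were".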

The Natural Language Toolkit (NLTK) package for Python provides a stopword corpus that can be used for stopword removal.

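import nltk
nltk.download('stopwords')  # stopword lists
nltk.download('punkt')      # tokenizer data for word_tokenize(); newer NLTK releases may ask for 'punkt_tab' instead

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the English stopword list into a set for fast membership tests
stop_words = set(stopwords.words('english'))

text = "This is an example sentence to demonstrate stopword removal."

# Split the text into word and punctuation tokens
tokens = word_tokenize(text)

# Keep only the tokens whose lowercase form is not a stopword
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)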

In the code above, stopwords.words('english') loads the English stopword list. The text is then tokenized with word_tokenize(), and only the tokens that are not stopwords are stored in the filtered_tokens list.

Printing the list gives the following result.

['example', 'sentence', 'demonstrate', 'stopword', 'removal', '.']
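
The result above still contains the period, because punctuation is not part of the stopword list. Below is a minimal sketch of one way to drop such tokens as well. It assumes a small hand-written set, `demo_stop_words` (hypothetical, standing in for `stopwords.words('english')`), and a helper name `remove_stopwords_and_punct` chosen only for illustration, so that the snippet runs without downloading any NLTK data.

```python
# Tokens as produced by word_tokenize() for the example sentence above.
tokens = ["This", "is", "an", "example", "sentence", "to",
          "demonstrate", "stopword", "removal", "."]

# Hypothetical stand-in for set(stopwords.words('english'));
# the real NLTK list is much longer.
demo_stop_words = {"this", "is", "an", "to"}


def remove_stopwords_and_punct(tokens, stop_words):
    """Keep tokens that are not stopwords and contain at least one letter or digit."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and any(ch.isalnum() for ch in tok)
    ]


cleaned = remove_stopwords_and_punct(tokens, demo_stop_words)
print(cleaned)
# ['example', 'sentence', 'demonstrate', 'stopword', 'removal']
```

Checking for at least one alphanumeric character, rather than comparing against a fixed punctuation list, also removes multi-character tokens such as "..." while keeping tokens like "don't".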

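The next example first prints the first five entries of the English stopword list, then filters a sentence with an explicit loop.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Peek at the first five entries of the English stopword list
print(stopwords.words('english')[:5])
# ['i', 'me', 'my', 'myself', 'we']

nltk.download('punkt')  # tokenizer data for word_tokenize(); newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

input_sentence = "We have studied hard for the exam since last October."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input_sentence)

# Collect the tokens that are not in the stopword set.
# The comparison is case-sensitive, so the capitalized "We" is kept.
result = []
for w in word_tokens:
    if w not in stop_words:
        result.append(w)

print(word_tokens)  # tokens before filtering
print(result)       # tokens after filtering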

Printing the two lists gives the following result.

['We', 'have', 'studied', 'hard', 'for', 'the', 'exam', 'since', 'last', 'October', '.']
['We', 'studied', 'hard', 'exam', 'since', 'last', 'October', '.']
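
In this result "We" survives even though "we" is in the stopword list, because the loop compares each token exactly as written and the list entries are lowercase. The sketch below contrasts the two comparisons on the same tokens. `demo_stop_words` is a hypothetical, hand-written subset used in place of `stopwords.words('english')` so that the snippet runs on its own.

```python
# Tokens as produced by word_tokenize() for the sentence above.
word_tokens = ["We", "have", "studied", "hard", "for", "the",
               "exam", "since", "last", "October", "."]

# Hypothetical stand-in containing only the stopwords relevant to this sentence.
demo_stop_words = {"we", "have", "for", "the"}

# Case-sensitive check: "We" != "we", so the capitalized token is kept.
case_sensitive = [w for w in word_tokens if w not in demo_stop_words]

# Case-insensitive check: lowercase each token before the lookup.
case_insensitive = [w for w in word_tokens if w.lower() not in demo_stop_words]

print(case_sensitive)
# ['We', 'studied', 'hard', 'exam', 'since', 'last', 'October', '.']
print(case_insensitive)
# ['studied', 'hard', 'exam', 'since', 'last', 'October', '.']
```

Lowercasing only for the lookup, as in the first example of this post, keeps the original capitalization of the tokens that remain (for instance "October").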
