Removing Punctuation and Stopwords from String Data

mscha 2022. 5. 11. 17:30

Removing punctuation

Let's remove the punctuation from the test string below.

Test = 'Here is a mini challenge, that will teach you how to remove stopwords and punctuations!'

Python's built-in string module provides the ASCII punctuation characters as the constant string.punctuation.

import string
string.punctuation
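
As a quick check, the constant can be printed to see exactly which characters will be treated as punctuation. It holds only the 32 ASCII punctuation characters, so non-ASCII marks such as curly quotes are not covered. The variable name punct is chosen here for illustration.

```python
import string

# string.punctuation is a plain str holding the 32 ASCII punctuation characters.
punct = string.punctuation
print(punct)       # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
print(len(punct))  # 32
```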

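The code below keeps every character that is not in string.punctuation and then joins the remaining characters back into a single string.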

Test_punc_removed = [char for char in Test if char not in string.punctuation]
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join

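This produces the following string, with the comma and the exclamation mark removed.

'Here is a mini challenge that will teach you how to remove stopwords and punctuations'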
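
As an alternative sketch (not the approach used in this post), the same removal can be done with str.translate and a table built by str.maketrans, which avoids building an intermediate list of characters. The names table and no_punct are chosen here for illustration.

```python
import string

Test = 'Here is a mini challenge, that will teach you how to remove stopwords and punctuations!'

# Build a translation table that maps every ASCII punctuation character to None (delete).
table = str.maketrans('', '', string.punctuation)
no_punct = Test.translate(table)
print(no_punct)
# Here is a mini challenge that will teach you how to remove stopwords and punctuations
```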

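Removing stopwords

The English stopword list can be loaded through the nltk library. Each word of the punctuation-free string is lowercased and kept only if it does not appear in that list.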

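import nltk
nltk.download('stopwords')  # fetch the stopword corpus if it is not already present
from nltk.corpus import stopwords

# List of English stopwords (all lowercase)
my_stopwords = stopwords.words('english')

# Keep only the words whose lowercase form is not a stopword
Test_punc_removed_join_clean = []
for word in Test_punc_removed_join.split():
  if word.lower() not in my_stopwords:
    Test_punc_removed_join_clean.append(word)

Test_punc_removed_join_clean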
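
The word.lower() call matters because the entries in the stopword list are lowercase, so a capitalised word such as "Here" would otherwise slip through. The sketch below illustrates this with a small hand-written set, demo_stopwords, which is an assumption standing in for the NLTK list so that the example runs without a download. Using a set instead of a list also makes each membership test faster.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
demo_stopwords = {'here', 'is', 'a', 'that', 'will', 'you', 'how', 'to', 'and'}

text = 'Here is a mini challenge that will teach you how to remove stopwords and punctuations'

# Without lowercasing, the capitalised "Here" does not match the lowercase entry "here".
case_sensitive = [w for w in text.split() if w not in demo_stopwords]
# With lowercasing, every stopword is filtered regardless of its original case.
case_insensitive = [w for w in text.split() if w.lower() not in demo_stopwords]

print(case_sensitive[0])  # Here
print(case_insensitive)   # ['mini', 'challenge', 'teach', 'remove', 'stopwords', 'punctuations']
```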

The two steps above can now be combined into a single function, as shown below.

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')

# Sentence-cleaning function
def message_cleaning(sentence):
  # 1. Remove punctuation characters.
  Test_punc_removed = [char for char in sentence if char not in string.punctuation]
  # 2. Join the remaining characters back into a single string.
  Test_punc_removed_join = ''.join(Test_punc_removed)
  # 3. Split into words and drop any word whose lowercase form is a stopword.
  Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in my_stopwords]
  # 4. Return the list of remaining words.
  return Test_punc_removed_join_clean
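
A usage sketch may help show how such a function is applied to several messages at once. To keep it runnable without the NLTK download, this variant takes the stopword collection as a second parameter (stop_words, an assumption not present in the original function) and uses a small hand-written set, demo_stopwords, in place of the NLTK list.

```python
import string

# Hypothetical stand-in for the NLTK English stopword list used in the post.
demo_stopwords = {'i', 'is', 'a', 'the', 'this', 'to', 'and', 'you'}

def message_cleaning(sentence, stop_words):
    # Drop punctuation characters, then rejoin into one string.
    no_punct = ''.join(ch for ch in sentence if ch not in string.punctuation)
    # Keep only words whose lowercase form is not a stopword.
    return [w for w in no_punct.split() if w.lower() not in stop_words]

messages = ['This is a test!', 'I love Python, and you?']
cleaned = [message_cleaning(m, demo_stopwords) for m in messages]
print(cleaned)  # [['test'], ['love', 'Python']]
```

Note that the function returns a list of words rather than a string, so the results can be joined with ' '.join(...) if a cleaned sentence is needed.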
