CountVectorizer, which converts string data into numbers: the analyzer parameter, fit, and transform

mscha 2022. 5. 11. 17:50

Count vectorizing

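Given several strings, CountVectorizer splits each one into words, counts how many times each word occurs, and converts the result into a matrix of numbers.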
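The counting idea can be sketched in plain Python before turning to scikit-learn. This is only a simplified illustration: it splits on whitespace, whereas CountVectorizer uses its own tokenizer (which by default also drops single-character tokens such as "I"). The `docs` and `counts` names are made up for this sketch.

```python
from collections import Counter

docs = ['This document is the second document',
        'I am loving you']

# Lowercase each string, split it on whitespace, and count every word.
counts = [Counter(doc.lower().split()) for doc in docs]

print(counts[0]['document'])  # 2
print(counts[1]['you'])       # 1
```

CountVectorizer does the same bookkeeping, but it lines every document up against one shared vocabulary so that each document becomes a row of equal length.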

Example

from sklearn.feature_extraction.text import CountVectorizer

sample_data = ['This is the first document',
               'I loved them',
               'This document is the second document',
               'I am loving you',
               'And this is the third one']

vec = CountVectorizer()

# fit_transform() learns the vocabulary from every word in sample_data
# and counts, for each list element, how many times each word occurs.
count_vec = vec.fit_transform(sample_data)

count_vec.shape

fit_transform(sample_data) fits the vectorizer and transforms the data in one step, so the count of each word can be read off as follows.

count_vec.toarray()

The word that each column position stands for is given by the following call.

# scikit-learn 1.0+; the older get_feature_names() was removed in 1.2
vec.get_feature_names_out()

Example

Given the sentence 'i am fine thank you', count-vectorize it with the vec fitted above.

import numpy as np

test = ['i am fine thank you']
new_data = np.array(test)
# transform() reuses the vocabulary learned earlier; it does not refit.
new_X = vec.transform(new_data)
new_X = new_X.toarray()
new_X

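Compared against the training vocabulary, the test string contains am (the first vocabulary entry) once and you (the last entry) once, and none of the other vocabulary words. Words that were not seen during fitting, such as fine and thank, are simply ignored.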

The analyzer parameter of CountVectorizer

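If you pass a function you have written in advance as CountVectorizer's analyzer parameter, the vectorizer applies it to each string automatically and then converts the result into numbers.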

Suppose we have the following message_cleaning() function, which removes punctuation and stopwords.

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')

# Sentence-cleaning function
def message_cleaning(sentence) :
  # 1. Remove punctuation, character by character.
  Test_punc_removed = [char for char in sentence if char not in string.punctuation]
  # 2. Join the remaining characters back into a single string.
  Test_punc_removed_join = ''.join(Test_punc_removed)
  # 3. Split into words and drop any word that is a stopword.
  Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in my_stopwords]
  # 4. Return only the words that remain.
  return Test_punc_removed_join_clean

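If the analyzer is set to this function, as in CountVectorizer(analyzer=message_cleaning), the vectorizer applies the function to each string on its own and then converts the resulting words into counts.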
