**Question:** I want to go from a sentence to a "bag of words", derived after splitting, POS-tagging, lemmatizing and cleaning (removing punctuation and stop words). After searching the internet, I found a solution. Here's my code:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def lemmatize(tupla):
    """Given a tuple of the form (wordString, posTagString), like ('guitar', 'NN'),
    return the lemmatized word."""
    tag_map = {'NN': wn.NOUN, 'JJ': wn.ADJ, 'VB': wn.VERB, 'RB': wn.ADV}
    # the second parameter of .get() is an "optional" default, used in case of a
    # missing key in the dictionary
    return lemmatizer.lemmatize(tupla[0], tag_map.get(tupla[1][:2], wn.NOUN))
```

Is this the right way to pass POS tags to the lemmatizer?

---

**Answer 1:** First of all, you can use `nltk.pos_tag()` directly without training it. The function loads a pretrained tagger from a file; the file's path is stored in `nltk.tag._POS_TAGGER`:

```python
>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'
```

As it was trained on the Treebank corpus, it also uses the Treebank tag set. The following function maps Treebank tags to WordNet part-of-speech names:

```python
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''
```

You can then use the return value with the lemmatizer:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)  # 'go'
```

Check the return value before passing it to the lemmatizer, because an empty string would give a `KeyError`.

---

**Answer 2:** Steps to convert: Document -> Sentences -> Tokens -> POS -> Lemmas.

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

class Splitter(object):
    """Split the document into sentences and tokenize each sentence."""
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self, text):
        sentences = self.splitter.tokenize(text)
        return [self.tokenizer.tokenize(sent) for sent in sentences]

class LemmatizationWithPOSTagger(object):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def get_wordnet_pos(self, treebank_tag):
        """Return the WordNet POS (a, n, r, v) matching a Treebank tag."""
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # as the default pos in lemmatization is noun
            return wordnet.NOUN

    def pos_tag(self, tokens):
        # find the POS tag for each token:
        # [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...]
        pos_tokens = [nltk.pos_tag(token) for token in tokens]
        # convert into a feature set of (word, lemma, pos_tag) triples:
        # [('What', 'What', 'WP'), ('can', 'can', 'MD'), ...]
        pos_tokens = [
            [(word,
              self.lemmatizer.lemmatize(word, self.get_wordnet_pos(pos_tag)),
              pos_tag)
             for (word, pos_tag) in pos]
            for pos in pos_tokens
        ]
        return pos_tokens

# example text
text = ('What can I say about this place. '
        'The staff of these restaurants is nice and the eggplant is not bad')

splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

# step 1: split the document into sentences, followed by tokenization
tokens = splitter.split(text)

# step 2: POS-tag and lemmatize each token
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)
```
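The Treebank-to-WordNet mapping step above does not actually need NLTK at runtime: Penn Treebank tags for adjectives, nouns, adverbs and verbs begin with J, N, R and V, and WordNet's lemmatizer accepts the single-letter codes 'a', 'n', 'r', 'v'. A minimal standalone sketch (the function name `penn_to_wordnet` is my own, not from NLTK):

```python
def penn_to_wordnet(treebank_tag, default='n'):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    WordNet's lemmatizer accepts 'a' (adjective), 'n' (noun),
    'r' (adverb) and 'v' (verb); any other tag falls back to
    the given default ('n', since the lemmatizer defaults to noun).
    """
    mapping = {'J': 'a', 'N': 'n', 'R': 'r', 'V': 'v'}
    return mapping.get(treebank_tag[:1], default)

# Penn tags starting with J/N/R/V map to adjective/noun/adverb/verb;
# everything else (e.g. 'MD' for modals) falls back to noun.
print(penn_to_wordnet('VBD'))  # v
print(penn_to_wordnet('JJ'))   # a
print(penn_to_wordnet('MD'))   # n
```

Returning a real default here (instead of an empty string, as in the first answer) avoids the `KeyError` that answer warns about.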