Update: NLTK already has its own TF-IDF implementation; please forgive my ignorance.
http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf_idf
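For reference, here is a minimal sketch of how that built-in method can be called. The three toy documents and the whitespace tokenization are my own assumptions for illustration; `TextCollection` and its `tf_idf` method are the names from the NLTK module linked above.

```python
from math import log

from nltk.text import TextCollection

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    "the cell divides".split(),
    "the protein binds the cell".split(),
    "the gene encodes a protein".split(),
]

collection = TextCollection(docs)

# "the" occurs in every document, so its IDF (and its TF-IDF) is zero.
score_the = collection.tf_idf("the", docs[0])

# "gene" occurs in only one of the three documents.
score_gene = collection.tf_idf("gene", docs[2])

print(score_the, score_gene)
```

Note that `tf_idf` here scores a term against one document, using the term's relative frequency in that document, whereas my script below scores each term once for the whole corpus.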
=====================
I’ve been working on extracting information from a large body of text that consists mainly of the abstracts of thousands of journal articles. TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Wikipedia has a page that introduces the algorithm in detail, so I will not discuss it further here.
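To make the definition concrete, here is a small sketch that scores a term in one document using the common "count times log(N / document frequency)" form. The toy corpus and the function name `tf_idf_score` are assumptions for illustration, and other weighting variants exist.

```python
from math import log

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    ["gene", "expression", "gene"],
    ["protein", "expression"],
    ["protein", "structure"],
]


def tf_idf_score(term, doc, corpus):
    """Raw count of `term` in `doc` times log(N / docs containing `term`)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * log(float(len(corpus)) / df)


# "gene" is frequent in the first document and absent from the others.
print(tf_idf_score("gene", docs[0], docs))
# "expression" appears in two of the three documents, so it scores lower.
print(tf_idf_score("expression", docs[0], docs))
```

A term that is frequent in one document but rare across the collection gets a high score, while a term that appears in every document gets a score of zero.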
I expected there to be a TF-IDF function in Python’s NLTK library; however, I did not find one (I am an NLTK beginner, so this may be due to my ignorance). So I wrote the function myself. Several points are worth noting:
- Encoding is tricky, but the codecs module is a great help.
- The computation takes a lot of time. Computing the IDF requires a traversal of all the articles for each term. My algorithm may simply be inefficient.
- There are "ghost terms" that exist in the whole corpus but cannot be found when searching within each article individually. I use the variable ghost_terms to debug this.
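On the encoding point, a minimal sketch of the pattern: open files through `codecs` with an explicit encoding, so that text is decoded on read and encoded on write. The file name and the sample line here are made up for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical sample line containing non-ASCII characters.
sample = u"AB Caf\u00e9-style \u03b1-helix abstract\n"

path = os.path.join(tempfile.mkdtemp(), "sample-utf8.txt")

# Writing through codecs encodes the text to UTF-8 bytes on disk.
with codecs.open(path, "w", "utf8") as out_file:
    out_file.write(sample)

# Reading through codecs decodes the bytes back to text, line by line.
with codecs.open(path, "r", "utf8") as in_file:
    lines = [line for line in in_file]

print(lines[0] == sample)
```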
import codecs
from math import log

from nltk import word_tokenize, FreqDist

# Collect every abstract line (lines starting with 'AB'), lowercased.
raw_file = codecs.open('V6.4all-utf8', 'r', 'utf8')
ab_corpus_list = []
for x in raw_file:
    if x.startswith('AB'):
        ab_corpus_list.append(x.lower())
raw_file.close()
print('ab_corpus_list complete')

# Join all abstracts into one string and tokenize the whole corpus at once.
ab_corpus_string = ''.join(ab_corpus_list)
print('ab_corpus_string complete')

ab_corpus_token = word_tokenize(ab_corpus_string)
print('ab_corpus_token complete')

# Term frequency over the whole corpus.
word_freq_total = FreqDist(ab_corpus_token)
print('word_freq_total complete')

ghost_terms = []


def idf(tm):
    # Count the abstracts that contain the term. Every abstract is
    # re-tokenized on every call, which is why this is slow.
    num_doc_tm = 0
    for x in ab_corpus_list:
        if FreqDist(word_tokenize(x))[tm] != 0:
            num_doc_tm += 1
    if num_doc_tm == 0:
        # Present in the whole corpus but found in no single abstract.
        ghost_terms.append(tm)
        return 0
    return log(float(len(ab_corpus_list)) / num_doc_tm)


tf_idf = []
progress = 0
for x in word_freq_total:
    tf_idf_score = float(word_freq_total[x]) * float(idf(x))
    tf_idf.append([tf_idf_score, x])
    progress += 1
    print(str(round(float(progress) / len(word_freq_total) * 100, 2)) + '%')

# Sort ascending by score, so the last ten entries are the top terms.
tf_idf.sort()
print(tf_idf[-10:])

# Write through codecs so non-ASCII terms are encoded as UTF-8.
tf_idf_file = codecs.open('result.txt', 'w', 'utf8')
for x in tf_idf:
    tf_idf_file.write(u'%s, %s\n' % (x[0], x[1]))
tf_idf_file.close()

print(ghost_terms)
ghost_terms_file = codecs.open('ghost_terms.txt', 'w', 'utf8')
for x in ghost_terms:
    ghost_terms_file.write(x + u'\n')
ghost_terms_file.close()
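The slow part of the script is that `idf` tokenizes every abstract again for every term. One possible way around this, sketched here under my own assumptions rather than as a tested replacement, is to tokenize each abstract once and count document frequencies in a single pass. Because the vocabulary is then built from the same per-abstract tokens, no term can end up with a document count of zero, which should also avoid the ghost terms. The function name `corpus_tf_idf` and the default whitespace tokenizer are hypothetical; `word_tokenize` could be passed in instead.

```python
from collections import Counter
from math import log


def corpus_tf_idf(documents, tokenize=lambda text: text.split()):
    """Return {term: corpus-wide count * log(N / document frequency)}."""
    term_counts = Counter()  # total occurrences across the corpus
    doc_counts = Counter()   # number of documents containing the term
    for text in documents:
        tokens = tokenize(text.lower())
        term_counts.update(tokens)
        doc_counts.update(set(tokens))  # each document counts once per term
    n_docs = len(documents)
    return {
        term: count * log(float(n_docs) / doc_counts[term])
        for term, count in term_counts.items()
    }


# Toy abstracts (made up for illustration).
abstracts = [
    "AB gene expression in yeast",
    "AB protein structure in yeast",
    "AB gene regulation",
]
scores = corpus_tf_idf(abstracts)
top_terms = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(top_terms[:3])
```

The score is the same corpus-wide count times IDF that the script computes, but each abstract is tokenized only once, so the work grows with the size of the corpus rather than with the corpus size times the vocabulary size.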