In trying to finalise my PhD revisions, I am giving some background on text categorisation.
Extremely briefly, the problem of text categorisation is this: you have a document and some (usually pre-defined, unless you’re clustering) categories. For example, the categories might be news and editorial. Or academic article, newspaper article and blog entry. The choice of categories is application dependent.
Then you have a document you wish to assign to a category. Is it news, or editorial? The typical way of doing this is to assemble a set of training examples: pre-assigned news and editorial pieces. Then you measure the similarity of your new document to the pre-assigned collections, and whichever category it is most like is your document’s category. You might notice that I have not here defined “measure the similarity” and “most like”: that’s often the research question. How can you represent the collections efficiently so that they can be compared against new documents? What are good measures of similarity?
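The train-then-compare procedure above can be sketched in a few lines. The representation (summed word counts per category) and the similarity measure (cosine) used here are one common choice among many, not the only ones; as noted above, those choices are themselves the research question. The function names and the toy training data are mine, invented for illustration.

```python
import math
from collections import Counter

def centroid(documents):
    """Represent a category by the summed word counts of its training documents."""
    total = Counter()
    for doc in documents:
        total.update(doc.split())
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(document, training):
    """Assign the category whose training collection the document is most like."""
    counts = Counter(document.split())
    return max(training, key=lambda cat: cosine(counts, centroid(training[cat])))

# Toy pre-assigned training examples (two categories, two documents each).
training = {
    "news": ["the minister announced a budget", "police reported the incident"],
    "editorial": ["we believe the policy is wrong", "in our opinion the budget fails"],
}
print(classify("the new budget is wrong in our opinion", training))  # → editorial
```

Summing counts into a single centroid per category is one cheap answer to the "represent the collections efficiently" question; k-nearest-neighbour comparison against individual training documents is another.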
A fairly common way to picture this is (for historical reasons, as we’ll see) as a vector. For each word in the vocabulary (the vocabulary typically being the set of terms used across all documents in the training examples, though sometimes you might try to smooth the morphology out or similar), you construct a numerical representation. Say the vocabulary is no-good, bad, rotten, and a document reads “no-good no-good bad”: you might describe it as the vector <2 1 0>, showing two uses of the first vocabulary item, one of the second and none of the third. (Again, whether you count vocabulary items, or weight them in various ways, is a research question. You may also notice that this counting-of-occurrences model is a “bag of words” approach, that is, it does not distinguish between “bad rotten” and “rotten bad”, even though in language word order and syntactic structure are meaningful. It’s possible to transform the vectors so that this orthogonality of individual words does not hold.)
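The counting step described above is small enough to sketch directly, using the example vocabulary from the text (the function name is mine, not from any particular library):

```python
def bag_of_words(document, vocabulary):
    """Count occurrences of each vocabulary item in the document,
    yielding a vector in vocabulary order. Word order is lost:
    this is the bag-of-words assumption."""
    tokens = document.split()
    return [tokens.count(term) for term in vocabulary]

vocabulary = ["no-good", "bad", "rotten"]
print(bag_of_words("no-good no-good bad", vocabulary))  # → [2, 1, 0]
```

Note that `bag_of_words("bad rotten", vocabulary)` and `bag_of_words("rotten bad", vocabulary)` produce the identical vector, which is exactly the order-insensitivity mentioned above.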
For reasons that I won’t go into here, I am trying to discuss this model briefly in my PhD thesis — actually, more briefly than I did above — and therefore looking to cite the originator of the idea. I started coming across citations in other papers that looked something like: “Gerard Salton [and others] (1975). A vector space model for information retrieval.” Sounds good. It’s got the key words in it, and quite a few citations!
I like to sight before citing though, which means I found this interesting paper:
David Dubin (2004). The Most Influential Paper Gerard Salton Never Wrote, Library Trends 52(4):748–764.
Gerard Salton is often credited with developing the vector space model (VSM) for information retrieval (IR). Citations to Salton give the impression that the VSM must have been articulated as an IR model sometime between 1970 and 1975. However, the VSM as it is understood today evolved over a longer time period than is usually acknowledged, and an articulation of the model and its assumptions did not appear in print until several years after those assumptions had been criticized and alternative models proposed. An often cited overview paper titled “A Vector Space Model for Information Retrieval” (alleged to have been published in 1975) does not exist, and citations to it represent a confusion of two 1975 articles, neither of which were overviews of the VSM as a model of information retrieval. Until the late 1970s, Salton did not present vector spaces as models of IR generally but rather as models of specific computations. Citations to the phantom paper reflect an apparently widely held misconception that the operational features and explanatory devices now associated with the VSM must have been introduced at the same time it was first proposed as an IR model.
Naturally such a subtle treatment of the history of the model is not great for my immediate purposes: I need That One Citation! (As best I can tell from Dubin, if I have to pick one it should be G. Salton (1979). Mathematics and information retrieval. Journal of Documentation, 35(1), 1–29.) But it’s fun to come across the analysis of an idea in this form.
Update: if you want a reasonable overview of text classification/topic classification/topic assignment, the survey of choice seems to be Fabrizio Sebastiani (2002). Machine learning in automated text categorization, ACM Computing Surveys, 34(1):1–47. You know, modulo 11 years now.
In belated honour of my breakfast in New York, Sunday July 8.
Warning for baby loss discussion.
I really have to question why seeing someone else processing their emotions is her pet peeve.
Do I believe a miscarriage and neonatal death is the same thing — of course not. If they were the same thing, they would share the same term. But just because I see them as apples and oranges doesn’t mean that I don’t also see them as fruit. They are both loss.
Readers would not guess from the “national conversation” that the construction industry is sitting on a story as grave in its implications as the phone-hacking affair – graver I will argue. You are unlikely to have heard mention of it for a simple and disreputable reason: the victims are working-class men rather than celebrities… The construction companies could not be clearer that men who try to enforce minimum safety standards are their enemies. The files included formal letters notifying a company that a worker was the official safety rep on a site as evidence against him.
By most measures, I should have technical entitlement in spades… [and yet] I am very intimidated by the technically entitled.
You know the type. The one who was soldering when she was 6. The one who raises his hand to answer every question–and occasionally try to correct the professor. The one who scoffs at anyone who had a score below the median on that data structures exam (“idiots!”). The one who introduces himself by sharing his StackOverflow score.
A fun upcoming KDD 2012 paper out of Microsoft, “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” (PDF), has a lot of great insights into A/B testing and real issues you hit with A/B testing. It’s a light and easy read, definitely worthwhile.
We present … puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain … [requiring] months to properly analyze and get to the often surprising root cause … It [was] not uncommon to see experiments that impact annual revenue by millions of dollars … Reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts.
When Bing had a bug in an experiment, which resulted in very poor results being shown to users, two key organizational metrics improved significantly: distinct queries per user went up over 10%, and revenue per user went up over 30%! …. Degrading algorithmic results shown on a search engine result page gives users an obviously worse search experience but causes users to click more on ads, whose relative relevance increases, which increases short-term revenue … [This shows] it’s critical to understand that long-term goals do not always align with short-term metrics.
One of the various Longform collections, and like many of them, a crime piece:
On June 4, 1989, the bodies of Jo, Michelle and Christe were found floating in Tampa Bay. This is the story of the murders, their aftermath, and the handful of people who kept faith amid the unthinkable.
As almost everybody knows at this point, I have resigned my position at the University of New Mexico. Effective this July, I am working for Google, in their Cambridge (MA) offices.
Countless people, from my friends to my (former) dean, have asked “Why? Why give up an excellent [some say 'cushy'] tenured faculty position for the grind of corporate life?”
Honestly, the reasons are myriad and complex, and some of them are purely personal. But I wanted to lay out some of them that speak to larger trends at UNM, in New Mexico, in academia, and in the US in general. I haven’t made this move lightly, and I think it’s an important cautionary note to make: the factors that have made academia less appealing to me recently will also impact other professors.
Since its legalization in 2002, commercial surrogacy in India has grown into a multimillion-dollar industry, drawing couples from around the world. IVF procedures in the unregulated Indian clinics generally cost a fraction of what they would in Europe or the U.S., with surrogacy as little as one-tenth the price. Mainstream press reports in English-language publications occasionally devote a line or two to the ethical implications of using poor women as surrogates, but with few exceptions, these women’s voices have not been heard.
Sociologist Amrita Pande of the University of Cape Town set out to speak directly with the “workers” to see how they are affected by such “work.”
xkcd suddenly exploded in my circles in 2006, thanks to the comic Randall Munroe calls “Computational Linguists” (and most people refer to as “Fuck Computational Linguistics”) getting around at the annual conference of the Association for Computational Linguistics.
There have been requests for the xkcd store to sell it before, but it’s never been done.
I just ordered a batch through Sticker Mule, both of the full comic and of a smaller badge version I did. (They will do proofs of them, I’ll be interested to see if the “Fuck” bugs them.) In order to do so I did a vector version of the comic (via Inkscape’s “trace bitmap”), and because the original comic, and these variants, are under Creative Commons Attribution NonCommercial, I can share them with you here. If you want them, order copies from the sticker vendor of your choice!
Smaller badge-like variant:
The vector versions aren’t very clean, but neither is the original comic, so I’m hoping these look like the spirit of the original, rather than a nasty hack.
Reminder: these are licensed for free noncommercial use (the precise condition is noncommercial use with attribution to the original author, modifications OK). So don’t sell them!