As the scandal over Rupert Murdoch’s News Corporation’s illegal phone hacking activities broke to television audiences around the world, I could not help but wonder ‘why?’ And I am sure many others asked themselves the same question. What prompted Murdoch’s executives to condone illegal activities aimed at listening into private conversations? Obvious, you might say: getting the latest scoop on a murder investigation, or the most salacious tidbit about the royal family. But let us delve deeper and ask again, as a child might, ‘why?’ So that more readers would read the News of the World, of course! Stupid question? What drove so many people, estimated at over 4 million, a significant fraction of Britain’s population, to follow the tabloid press so avidly? The daily newspaper remains a primary source of news for the vast majority of the world’s population. Of course, most people also read more serious papers than the News of the World. Still, what is it that drives some news items to become headlines rather than be relegated to the corner of an inside page?
Shannon and Advertising
The scientific answer is Information; capitalized here because there is more to the term than as understood in its colloquial usage. You may call it voyeurism in the case of News of the World, or the hunger to know what is happening around the world for, say, the New York Times. Both forms of inquiry suffer from the need to filter the vast numbers of everyday events that take place every second, so as to determine those that would most likely be of interest to readers. The concept of Information is best illustrated by comparing the possible headlines “Dog Bites Man” and “Man Bites Dog.” Clearly the latter, being a far rarer event, is more likely to prompt you to read the story than the former, more commonplace occurrence.
In 1948, Claude E. Shannon published a now classic paper entitled “A Mathematical Theory of Communication.” By then the telegraph, telephone, and radio had spawned a whole new communications industry with the AT&T company at its locus. Shannon, working at AT&T Bell Laboratories, was concerned with how fast one could communicate meaning, or information in its colloquial sense, over wires or even the wireless. In defining a new theory with which to solve such practical problems, he also arrived at a precise mathematical definition of Information. Shannon’s Information measured the information (colloquial) content of a message in terms of the extent to which its being successfully transmitted reduced some degree of uncertainty on the part of the receiver. Thus, whether a telegraph operator transmitted “the price of AT&T stock just rose by five cents,” or “ATT + 5c,” the information content being transmitted was the same, at least to two equally intelligent receivers. Shannon quantified the amount of information in terms of the chance, or probability, of the event whose occurrences were being communicated. Thus, if it was quite normal for AT&T’s stock to rise by 5 cents, the information content was lower than for a rarer event, say the stock suddenly falling by 5 dollars. Similarly, the story “Man Bites Dog,” being a rather rare event, has a far greater information content than “Dog Bites Man.” The rarer the news, the more likely it is to catch our interest, and it therefore makes the headlines. Why? The paper wants you to buy a copy and read the story. In passing, you glance at the advertisements placed strategically close by, which is what an advertiser has paid good money for.
True, but what if some of us only read the sports pages?
Think of yourself at a party where you hear snippets of many conversations simultaneously, even as you focus on and participate in one particular interaction. Often you may pick up cues that divert your attention, nudging you to politely shift to another conversation circle. Interest is piqued by the promise both of an unlikely or original tale and one that is closely aligned with your own predilections, be they permanent or temporary. We all “listen for” the unexpected, and even more so for some subjects as compared to the rest. The same thing is going on when we read a newspaper, or, for that matter, search, surf, or scan stories on the web. We usually know, at least instinctively or subconsciously, what should surprise or interest us. But the newspaper does not. Its only measure of success is circulation, which is also what advertisers have to rely on to decide how much space to book with the particular paper. Apart from this the only additional thing an advertiser can do is discover, ex post facto, whether or not their money was well spent. Did Christmas sales actually go up or not? If the latter, well, the damage has already been done. Moreover, which paper should they pull their ads from for the next season? No clue. In Shannon’s language, the indirect message conveyed by a paper’s circulation, or for that matter ex post facto aggregate sales, contains precious little Information, in terms of doing little to reduce the uncertainty of which pages we are actually reading and thereby which advertisements should be catching our eye.
Of course, it is well known that Google and the Internet-based advertising industry it engendered have changed the rules of the game, as we shall describe in some detail very shortly. But it is interesting to view what they have actually achieved from the perspective of Shannon’s information theory, which was itself concerned more with the transmission of signals over wires and the ether. In our case we should look instead at other kinds of signals, such as a paper’s circulation, or an advertiser’s end-of-season sales figures. Think of these as being at the receiving end, again speaking in terms more familiar to Shannon’s world. And then there is the actual signal that is transmitted by you and me, i.e., the stories we seek out and actually read. The transmission loss along this communication path, from actual reader behaviour to the “lag” measures of circulation or sales, is huge, both in information content and in delay. If such a loss were suffered in a telegraph network, it would be like getting the message “AT&T goes out of business,” a year after the transmission of the original signal, which might have reported a sudden dip in share price. No stock trader would go anywhere near such a medium!
Shannon was concerned both with precisely measuring the information content of a signal and with how efficiently and effectively information could be transmitted along a channel, such as a telephone wire. He defined the information content of any particular value of a signal in terms of the probability of its occurrence. Thus, if the signal in question was the toss of a fair coin, then the information content of the signal “heads” would be defined in terms of the probability of this value showing up, which is exactly 1/2, provided of course that the coin was fair. A conman’s coin that had two heads would of course yield no information when it inevitably landed heads, with probability 1. Recall our discussion of logarithmic-time algorithms in Chapter 1, such as binary search. As it turns out, Shannon information is defined, surprising as it may seem, in terms of the logarithm of the inverse probability, with logarithms taken to base 2. Thus the information content conveyed by the fair coin toss is log 2, which is exactly 1, and that for the conman’s coin is log 1, which, as expected, turns out to be 0. Similarly, the roll of a fair six-sided die has an information content of log 6, which is about 2.58, and for the unusual case of an eight-sided die, log 8 is exactly 3.
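The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `information_bits` is made up for illustration, and it assumes base-2 logarithms, so that results are in bits.

```python
import math

def information_bits(probability: float) -> float:
    """Shannon information, in bits, of an outcome with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be greater than 0 and at most 1")
    # Information is the base-2 logarithm of the inverse probability.
    return math.log2(1.0 / probability)

examples = [
    ("fair coin showing heads", 1 / 2),
    ("two-headed coin showing heads", 1.0),
    ("one face of a six-sided die", 1 / 6),
    ("one face of an eight-sided die", 1 / 8),
]
for label, p in examples:
    print(f"{label}: {information_bits(p):.2f} bits")
```

The rarer the outcome, the larger the number: the certain outcome carries nothing, while one face out of eight carries three bits.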
It turns out, as you might have suspected, that the logarithm crept into the formal definition of information for good reason. Recall once more how we searched for a word in a list using binary search in a logarithmic number of steps: by asking, at each step, which half of the list to look at; as if being guided through a maze, “go left,” then “go right.” Now, once we are done, how should we convey our newly discovered knowledge, i.e., the place where our word actually occurs in the list? We might remember the sequence of decisions we made along the way and record the steps we took to navigate to our word of interest; these are, of course, logarithmic in number. So, recording the steps needed to reach one specific position out of n total possibilities requires us to record at most log n “lefts” or “rights,” or equivalently, log n zeros and ones.
Say the discovered position was the eighth one, i.e., the last in our list of eight. To arrive at this position we would have had to make a “rightwards” choice each time we split the list; we could record this sequence of decisions as 111. Other sequences of decisions would similarly have their rendition in terms of exactly three symbols, each one or zero: for example, 010 indicates that starting from the “middle” of the list, say position 4, we look leftward once to the middle of the first half of the list, which ends up being position 2.
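One way to see the bookkeeping is to let a binary search write down a 1 for every “go right” and a 0 for every “go left.” The sketch below is illustrative only: the word list and the name `search_path` are invented, and the rule used for splitting the list is just one possible convention, so the way other bit strings map to positions may differ from one scheme to another.

```python
def search_path(sorted_items, target):
    """Return (index, bits): where target sits and the 0/1 decisions taken to reach it."""
    lo, hi = 0, len(sorted_items)
    bits = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if target >= sorted_items[mid]:
            bits.append("1")  # "go right": keep the upper half
            lo = mid
        else:
            bits.append("0")  # "go left": keep the lower half
            hi = mid
    if not sorted_items or sorted_items[lo] != target:
        raise KeyError(target)
    return lo, "".join(bits)

words = ["ant", "bee", "cat", "dog", "eel", "fox", "gnu", "hen"]
index, bits = search_path(words, "hen")
print(index + 1, bits)  # prints: 8 111
```

With eight words every search takes exactly three decisions, and no two words share the same three-symbol record.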
Shannon, and earlier Hartley, called these zero–one symbols “bits,” heralding the information age of “bits and bytes” (where a byte is just a sequence of eight bits). Three bits can be arranged in exactly eight distinct sequences, since 2 × 2 × 2 = 8, which is why log 8 is 3. Another way of saying this is that, because these three bits are sufficient to represent the reduction in uncertainty about which of the eight words is being chosen, the information content in the message conveying the word position is 3. Rather long-winded? Why not merely convey the symbol “8”? Would this not be easier? Or were bits more efficient?
It makes no difference. The amount of information is the same whether conveyed by three bits or by one symbol chosen from eight possibilities. This was first shown by Shannon’s senior at Bell Labs, Hartley, way back in 1928, well before Shannon’s arrival there. What Shannon did was take this definition of information and use it to define, in precise mathematical terms, the capacity of any channel for communicating information. For Shannon, channels were wired or wireless means of communication, using the technologies of telegraph, telephone, and later radio. Today, Shannon’s theory is used to model data communication on computer networks, including, of course, the Internet. But as we have suggested, the notion of a channel can be quite general, and his information theory has since been applied in areas as diverse as physics and linguistics, and of course web technology.
If the information content of a precise message was the degree to which it reduced uncertainty upon arrival, it was important, in order to define channel capacity, to know what the uncertainty was before the signal’s value was known. As we have seen earlier, exactly one bit of information is received by either of the messages, “heads” or “tails,” signaling the outcome of a fair coin toss. We have also seen that no information is conveyed for a two-headed coin, since it can only show one result. But what about a peculiar coin that shows up heads a third of the time and tails otherwise? The information conveyed by each signal, “heads” or “tails,” is now different: each “heads,” which turns up 1/3 of the time, conveys log 3 bits of information, or about 1.58, while “tails,” which shows up with probability 2/3, conveys only log (3/2) bits, or about 0.58. Shannon defined the term entropy to measure the average information conveyed over a large number of outcomes, which could be calculated precisely as the information conveyed by each outcome, weighted by the probability of that outcome. So the entropy of the fair coin signal is 1/2 × 1 + 1/2 × 1 = 1, since each possible outcome conveys one bit, and moreover each outcome occurs half the time, on the average. Similarly, a sequence of tosses of the two-headed coin has zero entropy. However, for the loaded coin, the entropy becomes 1/3 × log 3 + 2/3 × log (3/2), which works out to about 0.92 bits when logarithms are taken to base 2; a shade less than that of the fair coin.
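Entropy, as just described, is a probability-weighted average of the information in each outcome, and can be sketched in a few lines. The function name `entropy_bits` is an illustrative choice, and base-2 logarithms are assumed throughout, as in the coin and dice examples.

```python
import math

def entropy_bits(probabilities):
    """Average information, in bits, over outcomes with the given probabilities."""
    total = 0.0
    for p in probabilities:
        if p > 0:  # an outcome that never occurs contributes nothing
            total += p * math.log2(1.0 / p)
    return total

print(entropy_bits([1 / 2, 1 / 2]))  # fair coin: 1.0
print(entropy_bits([1.0, 0.0]))      # two-headed coin: 0.0
print(entropy_bits([1 / 3, 2 / 3]))  # loaded coin: below the fair coin's entropy
```

No coin can beat the fair one: any bias makes one outcome more predictable and pulls the average down.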
Shannon was interested in a theory describing the transmission of information over any channel whatsoever. So he needed to figure out the relationship between the uncertainties in the two signals at each end of a communications channel, more precisely their entropies. He defined the idea of mutual information between the signal sent down a channel versus the one actually received. If the two signals corresponded closely to each other, with only occasional discrepancies, then the mutual information between them was high, otherwise it was lower. A simple way to understand mutual information is to imagine that you are the receiver of a signal, continuously getting messages over a communication channel such as a telegraph or radio. But you have no idea how closely the received messages match those that were sent. Now suppose you somehow got independent reports of what messages were actually sent, say by magic, or by a messenger on horseback who arrived days later. You could work out how often the channel misled you. The amount by which these reports would surprise you, such as how often there were transmission errors, would allow you to measure how good or bad the earlier transmissions were. As earlier with our coin tosses, the degree of surprise, on average, should be nothing but the entropy of these special reports, which Shannon called conditional entropy. If the conditional entropy was high, i.e., the reports often surprised you by pointing out errors in transmission, then the mutual information between the sent and received signals should be low. If the reports did not surprise you much, behaving almost like a loaded coin that always gave the same result as your observation of the received signal, then the conditional entropy was low and the mutual information high. Shannon defined the mutual information as the difference between the entropy of whatever was actually being transmitted and the conditional entropy.
For example, suppose that you are communicating the results of a fair coin toss over a communication channel that makes errors 1/3 of the time. The conditional entropy, measuring your surprise at these errors, is that of a loaded coin that shows one face a third of the time, i.e., 1/3 × log 3 + 2/3 × log (3/2), or about 0.92 with logarithms taken to base 2. The entropy of the transmitted signal, being a fair coin, is 1, and the mutual information is the difference, or about 0.08, indicating that the channel transmission does decrease your uncertainty about the source signal, though only slightly. On the other hand, if as many as half the transmissions were erroneous, then the conditional entropy would equal that of the fair coin, i.e., exactly 1, making the mutual information zero. In this case the channel transmission fails to convey anything about the coin tosses at the source.
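The subtraction described here is easy to try out for different error rates. The following is a minimal sketch under the same assumptions as the example: the source is a fair coin, the channel flips each result with a fixed probability, and logarithms are to base 2. The function names are invented for illustration.

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a two-outcome signal whose outcomes have probability p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

def mutual_information_fair_coin(error_rate):
    """Entropy of the fair-coin source minus the conditional entropy due to channel errors."""
    source_entropy = 1.0                              # a fair coin carries one bit
    conditional_entropy = binary_entropy(error_rate)  # average surprise at the errors
    return source_entropy - conditional_entropy

print(mutual_information_fair_coin(0.0))    # error-free channel: 1.0
print(mutual_information_fair_coin(1 / 3))  # errors a third of the time
print(mutual_information_fair_coin(1 / 2))  # errors half the time: 0.0
```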
Next Shannon defined the capacity of any communication channel as the maximum mutual information it could possibly exhibit as long as an appropriate signal was transmitted. Moreover, he showed how to actually calculate the capacity of a communication channel, without necessarily having to show which kind of signal had to be used to achieve this maximum value. This was a giant leap of progress, for it provided engineers with the precise knowledge of how much information they could actually transmit over a particular communication technology, such as a telegraph wire over a certain distance or a radio signal of a particular strength, and with what accuracy. At the same time it left them with the remaining task of actually trying to achieve that capacity in practice, by, for example, carefully encoding the messages to be transmitted.
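To make “maximum mutual information” concrete, here is a rough numerical sketch for the simplest case: a channel carrying only two symbols and flipping each one with a fixed probability. This toy channel, the brute-force search over candidate input signals, and all function names are assumptions made for illustration; Shannon’s own methods were analytical. The mutual information is computed in an equivalent form, as the entropy of the received signal minus the entropy of the channel’s errors.

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a two-outcome signal with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

def mutual_information(input_p, error_rate):
    """Mutual information, in bits, when 'heads' is sent with probability input_p."""
    # Probability that the receiver sees 'heads', allowing for flipped symbols.
    output_p = input_p * (1.0 - error_rate) + (1.0 - input_p) * error_rate
    return binary_entropy(output_p) - binary_entropy(error_rate)

def capacity(error_rate, steps=1000):
    """Largest mutual information found over a grid of candidate input signals."""
    return max(mutual_information(i / steps, error_rate) for i in range(steps + 1))

for rate in (0.0, 0.1, 1 / 3, 0.5):
    print(f"error rate {rate:.2f}: capacity about {capacity(rate):.3f} bits per symbol")
```

For this toy channel the best signal to send is the fair coin, which is why the figure for a given error rate matches the mutual information of a fair-coin source.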
Now, let us return to the world of advertising and the more abstract idea of treating paper circulation or sales figures as a signal about our own behavior of seeking and reading. In terms of Shannon’s information theory, the mutual information between reader behavior and measures such as circulation or sales is quite low. Little can be achieved to link these since the channel itself, i.e., the connection between the act of buying a newspaper and aggregate circulation or product sales, is a very tenuous one.
The Penny Clicks
Enter online advertising on the Internet. Early Internet “banner” advertisements, which continue to this day, merely translated the experience of traditional print advertising onto a web page. The more people viewed a page, the more one had to pay for advertising space. Instead of circulation, measurements of the total number of “eyeballs” viewing a page could easily be derived from page hits and other network-traffic statistics. But the mutual information between eyeballs and outcomes remained as weak as for print media. How weak became evident from the dot.com bust of 2001. Internet companies had fueled the preceding bubble by grossly overestimating the value of the eyeballs they were attracting. No one stopped to question whether the new medium was anything more than just that, i.e., a new way of selling traditional advertising. True, a new avenue for publishing justified some kind of valuation, but how much was never questioned. With 20/20 hindsight it is easy to say that someone should have questioned the fundamentals better. But hindsight always appears crystal clear. At the same time, history never fails to repeat itself.
As of this writing, a new bubble is looming in the world of social networking. Just possibly, a deeper analysis, based perhaps on the concept of mutual information, might reveal some new insight. Is the current enthusiasm for the potential profitability of “new age” social networking sites justified? Only time will tell. In the meanwhile, recent events such as the relatively lukewarm response to Facebook’s initial public offering in mid-2012 do give us reason to pause and ponder. Perhaps some deeper analyses using mutual information might come in handy. To see how, let us first look at what Google and other search engines did to change the mutual information equation between consumers and advertisers, thereby changing the fundamentals of online advertising and, for that matter, the entire media industry.
An ideal scenario from the point of view of an advertiser would be to have to pay only when a consumer actually buys their product. In such a model the mutual information between advertising and outcome would be very high indeed. Making such a connection is next to impossible in the print world. However, in the world of web pages and clicks, in principle this can be done by charging the advertiser only when an online purchase is made. Thus, instead of being merely a medium for attracting customer attention, such a website would instead become a sales channel for merchants. In fact Groupon uses exactly such a model: Groupon sells discount coupons to intelligently selected prospects, while charging the merchants a commission if and only if its coupons are used for actual purchases.
In the case of a search engine, such as Yahoo! or Google, however, consumers may choose to browse a product but end up not buying it because the product is poor, through no fault of the search engine provider. So why should Google or Yahoo! waste their advertising space on such ads? Today online advertisers use a model called “pay-per-click,” or PPC, which is somewhere in between, where an advertiser pays only if a potential customer clicks their ad, regardless of whether that click gets converted to a sale. At the same time, the advertiser does not pay if a customer merely looks at the ad, without clicking it. The PPC model was first invented by Bill Gross, who started GoTo.com in 1998. But it was Google that made PPC really work by figuring out the best way to charge for ads in this model. In the PPC model, the mutual information between the potential buyer and the outcome is lower than for, say, a sales channel such as Groupon. More importantly, however, the mutual information is highly dependent on which ad the consumer sees. If the ad is close to the consumer’s intent at the time she views it, there is a higher likelihood that she will click, thereby generating revenue for the search engine and a possible sale for the advertiser.
What better way to reduce uncertainty and increase the mutual information between a potential buyer’s intent and an advertisement, than to allow advertisers to exploit the keywords being searched on? However, someone searching on “dog” may be interested in dog food. On the other hand, they may be looking to adopt a puppy. The solution was to get out of the way and let the advertisers figure it out. Advertisers bid for keywords, and the highest bidder’s ad gets placed first, followed by the next highest and so on. The “keyword auction,” called AdWords by Google, is a continuous global event, where all kinds of advertisers, from large companies to individuals, can bid for placements against the search results of billions of web users. This “keyword auction” rivals the largest stock markets in volume, and is open to anyone who has a credit card with which to pay for ads!
Once more, as in the case of PPC, one should point out that the concept of a keyword auction was not actually Google’s invention. GoTo.com, later renamed Overture and then acquired by Yahoo!, actually introduced keyword auctions. But there was a problem with their model. The PPC-auction model allowed advertisers to offer to pay only for those keywords that would, in their view, best increase the mutual information between a buyer’s intent and the possible outcome of their viewing an ad. Still, the model would work only if the ads actually got displayed often enough. The problem was competition. Once Nike knew that Adidas ads were appearing first against some keywords, say “running shoes,” they would up their bid in an effort to displace their rival. Since the auction took place online and virtually instantaneously, Nike could easily figure out exactly what Adidas’s bid was (and vice versa), and quickly learn that by bidding a mere cent higher they would achieve first placement. Since the cost of outplacing a rival was so low, i.e., a very small increment to one’s current bid, Adidas would respond in turn, leading to a spiraling of costs. While this may have resulted in short-term gains for the search engine, in the long run advertisers did not take to this model due to its inherent instability.
Google first figured out how to improve the situation: instead of charging an advertiser the price they bid, Google charges a tiny increment over the next-highest bidder. Thus, Nike might bid 40 cents for “running shoes,” and Adidas 60 cents. But Adidas gets charged only 41 cents per click. Nike needs to increase its bid significantly in order to displace Adidas for the top placement, and Adidas can increase this gap without having to pay extra. The same reasoning works for each slot, not just the first one. As a result, the prices bid end up settling down into a stable configuration based on each bidder’s comfort with the slot they get, versus the price they pay. Excessive competition is avoided by this “second price” auction, and the result is a predictable and usable system. It wasn’t too long before other search engines including Yahoo! also switched to this second-price auction model to ensure more “stability” in the ad market.
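The charging rule can be sketched in a few lines, using the Nike and Adidas bids from the example. This is only an illustration of the rule as described here, not Google’s actual system: the function name, the one-cent increment, and the one-cent minimum charged to the lowest bidder (who has nobody below them) are all assumptions.

```python
def second_price_charges(bids_cents, increment=1, reserve=1):
    """Rank bidders by bid; charge each a small increment over the next-highest bid.

    bids_cents maps bidder name to bid in cents. The lowest bidder pays the
    assumed minimum price `reserve`. No one is ever charged more than they bid.
    """
    ranked = sorted(bids_cents.items(), key=lambda item: item[1], reverse=True)
    charges = []
    for slot, (bidder, bid) in enumerate(ranked):
        if slot + 1 < len(ranked):
            price = ranked[slot + 1][1] + increment  # just above the bidder below
        else:
            price = reserve                          # nobody below: assumed minimum
        charges.append((bidder, min(price, bid)))
    return charges

print(second_price_charges({"Nike": 40, "Adidas": 60}))
# [('Adidas', 41), ('Nike', 1)]
```

Raising Adidas’s bid from 60 to 90 cents leaves its charge at 41 cents, which is why a bidder can widen the gap to a rival without paying extra.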
What does the second-price auction give an advertiser from the perspective of mutual information? By bidding on keywords, merchants can place their ads more intelligently, using keywords to gauge the intent of the searcher they want to target. Further, they pay for ads only when someone clicks on one. Both these factors, i.e., targeting ads to keywords and linking payment to clicks, increase the mutual information between each advertising dollar spent and an actual sale. In fact, the correlation between the advertising expense and hits on a merchant’s website is perfect, i.e., the mutual information is as high as it can be, since the merchant pays only if a user actually visits the merchant’s site. The remaining uncertainty of whether such a visit actually translates to a sale is out of the hands of the search engine, and instead depends on how good a site and product the merchant can manage. Another way of looking at PPC is that the advertiser is paying to increase “circulation” figures for his site, ensuring that eyeballs read the material he wants people to read, rather than merely hoping that they glance at his ad while searching for something else.
Statistics of Text
However effective search-engine advertising might be, a bidder on Google’s AdWords (or Yahoo!’s “sponsored-search” equivalent) can only place advertisements on a search-results page, targeting only searchers who are looking for something. What about those reading material on the web after they have found what they wanted through search, or otherwise? They might be reading a travel site, blog, or magazine. How might such readers also be presented with ads sold through a keyword auction? Google’s solution, called AdSense, did precisely this. Suppose you or I have published a web page on the Internet. If we sign up for AdSense, Google allows us to include a few lines of computer code within our web page that displays contextually relevant ads right there, on our web page, just as if it were Google’s own page. Google then shares the revenue it gets from clicks on these ads with us, the authors of the web page. A truly novel business model: suddenly large numbers of independent web-page publishers became Google’s partners through whom it could syndicate ads sold through AdWords auctions.
Of course, as before in the case of the “second-price auction” idea, other search engines including Yahoo! quickly followed Google’s lead and developed AdSense clones. At the same time, they struggled to match Google’s success in this business: Yahoo! shut down its AdSense clone called “Publisher Network” in 2010, only to restart it very recently in 2012, this time in partnership with Media.net, a company that now powers contextual search for both Yahoo! and Microsoft’s Bing search engine.
So how does AdSense work? The AdWords ads are sold by keyword auction, so if Google could somehow figure out the most important keywords from within the contents of our web page, it could use these to position ads submitted to the AdWords auction in the same manner as done alongside Google search results. Now, we may think that since Google is really good at search, i.e., finding the right documents to match a set of keywords, it should be easy to perform the reverse, i.e., determine the best keywords for a particular document. Sounds simple, given Google’s prowess in producing such great search results. But not quite. Remember that the high quality of Google search was due to PageRank, which orders web pages by importance, not words. It is quite likely that, as per PageRank, our web page is not highly ranked. Yet, because of our loyal readers, we do manage to get a reasonable number of visitors to our page, enough to be a worthwhile audience for advertisers: at least we think so, which is why we might sign up for AdSense. “Inverting” search sounds easy, but actually needs much more work.
The keywords chosen for a particular web page should really represent the content of the page. In the language of information theory, the technique for choosing keywords should make sure that the mutual information between web pages and the keywords chosen for them is as high as possible. As it turns out, there is such a technique, invented as long ago as 1972, called TF-IDF, which stands for “term frequency times inverse document frequency.” The core idea here is the concept of inverse document frequency, or IDF, of a word (also called “term”). The idea behind IDF is that a word that occurs in many documents, such as the word “the,” is far less useful for searching for content than one that is rare, such as “intelligence.” All of us intuitively use this concept while searching for documents on the web; rarely do we use very common words. Rather, we try our best to choose words that are likely to be highly selective, occurring more often in the documents we seek, and thereby give us better results. The IDF of a word is computed from a ratio: the total number of web pages divided by the number of pages that contain a particular word. In fact the IDF that seemed to work best in practice was, interestingly enough, the logarithm of this ratio. Rare words have a high IDF, and are therefore better choices as keywords.
The term frequency, or TF, on the other hand, is merely the number of times the word occurs in some document. Multiplying TF and IDF therefore favours generally rare words that nevertheless occur often in our web page. Thus, out of two equally rare words, if one occurs more often in our web page, we would consider that a better candidate to be a keyword, representative of our content.
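The recipe of multiplying TF by IDF fits in a dozen lines. The sketch below follows the definitions just given: TF is a raw count, and IDF is the logarithm of total pages over pages containing the word. The three tiny “pages,” the function names, and the use of base-2 logarithms are assumptions made for illustration; real systems use many refinements.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: a list of word lists. Returns one {word: TF x IDF} dict per document."""
    total = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))  # count each word once per document
    scores = []
    for words in documents:
        counts = Counter(words)      # TF: occurrences within this document
        scores.append({word: tf * math.log2(total / doc_freq[word])
                       for word, tf in counts.items()})
    return scores

def top_keywords(score, k=2):
    """The k highest-scoring words, with ties broken alphabetically."""
    return sorted(score, key=lambda word: (-score[word], word))[:k]

pages = [
    "the marathon race the runners the marathon".split(),
    "the election race the voters".split(),
    "the dog and the puppy".split(),
]
scores = tf_idf(pages)
print(top_keywords(scores[0]))  # ['marathon', 'runners']
print(scores[0]["the"])         # 0.0
```

Although “the” is the most frequent word on the first page, it appears on every page, so its IDF, and hence its score, is zero.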
TF-IDF was invented as a heuristic, based only on intuition, and without any reference to information theory. Nevertheless, you might well suspect such a relationship. The presence of a rare word might be viewed as conveying more information than that of more common ones, just as does a message informing us that some unexpected event has nevertheless occurred. Similarly the use of the logarithm, introduced in the TF-IDF formula due to its practical utility, points to a connection with Shannon’s theory that also uses logarithms to define information content. Our intuition is not too far off; recent research has indeed shown that the TF-IDF formulation appears quite naturally when calculating the mutual information between “all words” and “all pages.” More precisely, it has been shown that the mutual information between words and pages is proportional to the sum, over all words, of the TF-IDFs of each word taken in isolation. Thus, it appears that by choosing, as keywords, those words in the page that have the highest TF-IDF, we are also increasing the mutual information and thereby reducing the uncertainty regarding the intent of the reader.
Is keyword guessing enough? What if an article mentions words such as “race,” “marathon,” and “trophy,” but omits a mention of “running” or “shoes”? Should an AdWords bidder, such as Nike or Adidas, be forced to imagine all possible search words against which their ads might be profitably placed? Is it even wise to do so? Perhaps so, if the article in question was indeed about running marathon races. On the other hand, an article with exactly these keywords might instead be discussing a national election, using the words “race,” “marathon,” and “trophy” in a totally different context. How could any keyword-guessing algorithm based on TF-IDF possibly distinguish between these situations? Surely it is asking too much for a computer algorithm to understand the meaning of the article in order to place it in the appropriate context. Surprisingly though, it turns out that even such seemingly intelligent tasks can be tackled using information-theoretic ideas like TF-IDF.
Just as TF-IDF measures the relative frequency of a word in a page weighted by its relative rarity overall, we can also consider pairs of words occurring together. For example, the word “marathon” and the terms “42 kilometres” or “26 miles” are likely to occur together in at least some articles dealing with actual running. On the other hand, words such as “election,” “voters,” or “ballot” are likely to occur together in news about campaigning and politics. Can a computer algorithm figure out such relationships by itself, without actually “understanding” the content, whatever that means? The frequency with which each pair of words occurs together, averaged over all pages, can certainly be calculated. Essentially we need to count co-occurrences of words, i.e., the number of times words occur together. But, just as in the case of individual words, it is also a good idea to weight each such co-occurrence by the IDF of both words in the pair. By doing this, the co-occurrence of a word with a very common word, such as “the,” counts for next to nothing, since the IDF of the common word will be almost zero. In other words we take a pair of words and multiply their TF-IDF scores in every document, and then add up all these products. The result is a measure of the correlation of the two words as inferred from their co-occurrences in whatever very large set of documents is available, such as all web pages. Of course, this is done for every possible pair of words as well. No wonder Google needs millions of servers.
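The pairwise calculation can be illustrated on a toy collection of four “pages.” The sketch assumes the same simple TF-IDF as described earlier (raw counts times the logarithm of the page ratio, here to base 2); the sample sentences and function names are invented, and a real system would need far more care to do this for every pair of words on the web.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: a list of word lists. Returns one {word: TF x IDF} dict per document."""
    total = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))
    return [{word: tf * math.log2(total / doc_freq[word])
             for word, tf in Counter(words).items()}
            for words in documents]

def correlation(scores, word_a, word_b):
    """Sum, over all documents, of the product of the two words' TF-IDF scores."""
    return sum(s.get(word_a, 0.0) * s.get(word_b, 0.0) for s in scores)

pages = [
    "the marathon runners ran 26 miles".split(),
    "the marathon was 26 miles long".split(),
    "the election had voters and a ballot".split(),
    "the voters cast a ballot in the election".split(),
]
scores = tf_idf(pages)
print(correlation(scores, "marathon", "miles"))   # 2.0
print(correlation(scores, "marathon", "ballot"))  # 0.0
print(correlation(scores, "the", "marathon"))     # 0.0
```

Words that keep company score highly together, words from different subjects score zero, and the ubiquitous “the” correlates with nothing because its IDF is zero.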
Exploiting such word–word correlations based on co-occurrences of words in documents is the basis of “Latent Semantic Analysis,” which involves significantly more complex mathematics than the procedure just outlined. Surprisingly, it turns out that Latent Semantic Analysis (or LSA) can perform tasks that appear to involve “real understanding,” such as resolving ambiguities due to the phenomenon of polysemy, where the same word, such as “run,” has different meanings in different contexts. LSA-based algorithms can also figure out the many millions of different topics that are discussed, in billions of pages, such as “having to do with elections” versus “having to do with running,” and also automatically determine which topic, or topics, each page is most likely about.
Sounds incredible? Maybe a simple example can throw some light on how such topic analysis takes place. For the computer, a topic is merely a bunch of words; computer scientists call this the “bag of words” model. For good measure, each word in a topic also has its TF-IDF score, measuring its importance in the topic weighted by its overall rarity across all topics. A bag of words, such as “election,” “running,” and “campaign,” could form a topic associated with documents having to do with elections. At the same time, a word such as “running” might find a place in many topics, whereas one such as “election” might span fewer topics.
Such topics can form the basis for disambiguating a web page on running marathons from a political one. All that is needed is a similarity score, again using TF-IDF values, between the page and each topic: for each word we multiply its TF-IDF in the page in question with the TF-IDF of the same word in a particular topic, and sum up all these products. In this manner we obtain scores that measure the relative contribution of a particular topic to the content of the page. Thus, using such a procedure, Google’s computers can determine that an article we may be reading is 90% about running marathons and therefore place Nike’s advertisement for us to see, while correctly omitting this ad when we read a page regarding elections. So, not only does Google watch what we read, it also tries to “understand” the content, albeit ‘merely’ by using number crunching and statistics such as TF-IDF.
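The scoring step just described can be sketched in a few lines of Python. Everything specific in it is an assumption made for illustration: the two topics, the words they contain, and all the weights are invented, and dividing by the total so that the scores read as shares is one simple choice among several. The function name `topic_scores` is hypothetical.

```python
def topic_scores(page_weights, topics):
    """Score a page against each topic by summing products of matching weights."""
    raw = {
        name: sum(weight * topic_weights.get(word, 0.0)
                  for word, weight in page_weights.items())
        for name, topic_weights in topics.items()
    }
    total = sum(raw.values())
    # Normalise so the scores read as relative contributions that sum to 1.
    return {name: (score / total if total else 0.0)
            for name, score in raw.items()}


# Hypothetical topics: each is a bag of words with made-up TF-IDF weights.
topics = {
    "running": {"marathon": 0.9, "race": 0.4, "shoes": 0.8, "trophy": 0.3},
    "elections": {"election": 0.9, "race": 0.4, "campaign": 0.7, "voters": 0.8},
}

# A hypothetical page mentioning "race," "marathon," and "trophy."
page = {"race": 0.5, "marathon": 0.6, "trophy": 0.4}

shares = topic_scores(page, topics)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The ambiguous word “race” contributes equally to both topics; it is “marathon” and “trophy,” which carry weight only in the running topic, that tip the balance.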
It is important to note that while the computer might place many words such as “election,” “running,” and “campaign,” in a topic that we easily recognize as “dealing with elections,” it usually cannot come up with a meaningful title for this topic. For it, a topic is a bag of words, and just that, without any other “meaning.” Finding ‘good’ labels for such automatically detected topics remains a difficult problem for computers. Topic labeling is also closely related to the problem of automatically creating a “back-of-the-book” index, which was briefly mentioned in Chapter 1. As in the case of topic titles, entries in a back-of-the-book index need to be succinct and informative, summarizing the most important concepts being discussed. Bags of words will not do. Accurate automatic back-of-the-book indexing is still an open problem, as discussed in a recent paper by András Csomai and Rada Mihalcea: “Although there is a certain degree of computer assistance, consisting of tools that help the professional indexer to organize and edit the index, there are however no methods that would allow for a complete or nearly-complete automation.”
* * *
It seems that Google is always listening to us: what we search for, what we read, even what we write in our emails. Increasingly sophisticated techniques are used, such as TF-IDF, LSA, and topic analysis, to bring this process of listening closer and closer to “understanding”—at least enough to place ads intelligently so as to make more profits.
Therein lies the rub. Is Google really understanding what we say? How hard does it need to try? Are TF-IDF-based techniques enough, or is more needed? Very early on after Google launched AdSense, people tried, not surprisingly, to fool the system. They would publish web pages full of terms such as “running shoes,” “buying,” and “price,” without any coherent order. The goal was to ensure that their pages were returned in response to genuine search queries. When visitors opened such a page they would realize that it contained junk. But it was hoped that even such visitors might, just maybe, click on an advertisement placed there by AdSense, thereby making money for the publisher of the junk page. Google needed to do more than rely only on the bag-of-words model. It needed to extract deeper understanding to combat such scams, as well as for much else. Thus, inadvertently driven by the motive of profitable online advertising, web companies such as Google quite naturally strayed into areas of research dealing with language, meaning, and understanding. The pressure of business was high. They also had the innocence of not necessarily wanting to solve the “real” problem of language or understanding (just good enough would do), and so they also made a lot of progress.
Reprinted from “The Intelligent Web” by Gautam Shroff with permission from Oxford University Press USA. Copyright © 2014 Oxford University Press USA and published by Oxford University Press USA. All rights reserved.