What makes Rupert Murdoch tick? The science behind media greed
Murdoch's paper hacked phones. But why? How the lust for advertising pay-dirt drives media companies to madness
As the scandal over illegal phone hacking at Rupert Murdoch’s News Corporation broke to television audiences around the world, I could not help but wonder ‘why?’ And I am sure many others asked themselves the same question. What prompted Murdoch’s executives to condone illegal activities aimed at listening in on private conversations? Obvious, you might say: getting the latest scoop on a murder investigation, or the most salacious tidbit about the royal family. But let us delve deeper and ask again, as a child might, ‘why?’ So that more readers would read the News of the World, of course! Stupid question? Perhaps, but what drove so many people, estimated at over 4 million, a significant fraction of Britain’s population, to follow the tabloid press so avidly? The daily newspaper remains a primary source of news for much of the world’s population. Of course, most people also read more serious papers than the News of the World. Still, what is it that drives some news items to become headlines rather than be relegated to the corner of an inside page?
Shannon and Advertising
The scientific answer is Information; capitalized here because there is more to the term than its colloquial usage suggests. You may call it voyeurism in the case of the News of the World, or the hunger to know what is happening around the world for, say, the New York Times. Both forms of inquiry share the need to filter the vast numbers of everyday events that take place every second, so as to determine those most likely to interest readers. The concept of Information is best illustrated by comparing the possible headlines “Dog Bites Man” and “Man Bites Dog.” Clearly the latter, being a far rarer event, is more likely to prompt you to read the story than the former, more commonplace occurrence.
In 1948, Claude E. Shannon published a now classic paper entitled “A Mathematical Theory of Communication.” By then the telegraph, telephone, and radio had spawned a whole new communications industry, with the AT&T company at its center. Shannon, working at AT&T’s Bell Laboratories, was concerned with how fast one could communicate meaning, or information in its colloquial sense, over wires or even the wireless. In defining a new theory with which to solve such practical problems, he also arrived at a precise mathematical definition of Information. Shannon’s Information measured the (colloquial) information content of a message in terms of the extent to which its successful transmission reduced uncertainty on the part of the receiver. Thus, whether a telegraph operator transmitted “the price of AT&T stock just rose by five cents” or “ATT + 5c,” the information content being transmitted was the same, at least to two equally intelligent receivers. Shannon quantified the amount of information in terms of the chance, or probability, of the event whose occurrence was being communicated. Thus, if it was quite normal for AT&T’s stock to rise by 5 cents, the information content was lower than for a rarer event, say the stock suddenly falling by 5 dollars. Similarly, the story “Man Bites Dog,” being a rather rare event, has a far greater information content than “Dog Bites Man.” The rarer the news, the more likely it is to catch our interest, and it therefore makes the headlines. Why? The paper wants you to buy a copy and read the story. In passing, you glance at the advertisements placed strategically close by, which is what an advertiser has paid good money for.
True, but what if some of us only read the sports pages?
Think of yourself at a party where you hear snippets of many conversations simultaneously, even as you focus on and participate in one particular interaction. Often you may pick up cues that divert your attention, nudging you to politely shift to another conversation circle. Interest is piqued by the promise both of an unlikely or original tale and of one closely aligned with your own predilections, be they permanent or temporary. We all “listen for” the unexpected, and even more so for some subjects than for others. The same thing goes on when we read a newspaper or, for that matter, search, surf, or scan stories on the web. We usually know, at least instinctively or subconsciously, what should surprise or interest us. But the newspaper does not. Its only measure of success is circulation, which is also what advertisers have to rely on to decide how much space to book with a particular paper. Beyond that, all an advertiser can do is discover, ex post facto, whether or not their money was well spent. Did Christmas sales actually go up or not? If not, well, the damage has already been done. Moreover, which paper should they pull their ads from next season? No clue. In Shannon’s language, the indirect message conveyed by a paper’s circulation, or for that matter by ex post facto aggregate sales, contains precious little Information: it does little to reduce the uncertainty about which pages we are actually reading and thereby which advertisements should be catching our eye.
Of course, it is well known that Google and the Internet-based advertising industry it engendered have changed the rules of the game, as we shall describe in some detail very shortly. But it is interesting to view what they have actually achieved from the perspective of Shannon’s information theory, which was itself concerned more with the transmission of signals over wires and the ether. In our case we should look instead at other kinds of signals, such as a paper’s circulation, or an advertiser’s end-of-season sales figures. Think of these as being at the receiving end, again speaking in terms more familiar to Shannon’s world. And then there is the actual signal that is transmitted by you and me, i.e., the stories we seek out and actually read. The transmission loss along this communication path, from actual reader behaviour to the lagging measures of circulation or sales, is huge, both in information content and in delay. If such a loss were suffered in a telegraph network, it would be like getting the message “AT&T goes out of business” a year after the transmission of the original signal, which might have reported a sudden dip in share price. No stock trader would go anywhere near such a medium!
Shannon was concerned both with precisely measuring the information content of a signal and with how efficiently and effectively information could be transmitted along a channel, such as a telephone wire. He defined the information content of any particular value of a signal in terms of the probability of its occurrence. Thus, if the signal in question was the toss of a fair coin, then the information content of the signal “heads” would be defined in terms of the probability of this value showing up, which is exactly 1/2. Provided, of course, that the coin was fair. A conman’s coin with two heads would of course yield no information when it inevitably landed heads, with probability 1. Recall our discussion of logarithmic-time algorithms in Chapter 1, such as binary search. As it turns out, Shannon information is defined, surprising as it may seem, in terms of the logarithm of the inverse probability. Thus the information content conveyed by the fair coin toss is log 2, which is exactly 1, and that for the conman’s coin is log 1, which, as expected, turns out to be 0. Similarly, each roll of a fair six-sided die conveys log 6, or about 2.58, bits of information, and for the unusual case of an eight-sided die, log 8 is exactly 3.
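To make the definition concrete, here is a minimal Python sketch (our own illustration, not Shannon’s notation) that computes this “self-information,” the logarithm of the inverse probability, for the examples above; the function name self_information is simply our choice.

```python
from math import log2

def self_information(p):
    """Bits of information conveyed by an outcome that occurs with probability p."""
    return log2(1 / p)

print(self_information(1 / 2))  # fair coin toss: 1.0 bit
print(self_information(1))      # two-headed coin landing heads: 0.0 bits
print(self_information(1 / 6))  # one face of a fair six-sided die: ~2.58 bits
print(self_information(1 / 8))  # one face of an eight-sided die: 3.0 bits
```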
It turns out, as you might have suspected, that the logarithm crept into the formal definition of information for good reason. Recall once more how we searched for a word in a list using binary search in a logarithmic number of steps: by asking, at each step, which half of the list to look at; as if being guided through a maze, “go left,” then “go right.” Now, once we are done, how should we convey our newly discovered knowledge, i.e., the place where our word actually occurs in the list? We might remember the sequence of decisions we made along the way and record the steps we took to navigate to our word of interest; these are, of course, logarithmic in number. So, recording the steps needed to reach one specific position out of n total possibilities requires us to record at most log n “lefts” or “rights,” or equivalently, log n zeros and ones.
Say the discovered position was the eighth one, i.e., the last in our list of eight. To arrive at this position we would have had to make a “rightwards” choice each time we split the list; we could record this sequence of decisions as 111. Other sequences of decisions similarly have renditions in terms of exactly three symbols, each a one or a zero: for example, 001 indicates that, starting from the “middle” of the list at position 4, we twice choose the left half, passing through the middle of the first half, and a final rightward choice lands us on position 2.
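As a hedged illustration of this encoding (our own sketch, with our own convention of 0 for “left” and 1 for “right”), the following Python snippet records the decisions a binary search over eight positions makes, which always number exactly log 8 = 3:

```python
def path_to_position(target, n):
    """Record the choices (0 = left half, 1 = right half) that a binary
    search over positions 1..n makes on its way to the target position."""
    lo, hi, path = 1, n, []
    while lo < hi:
        mid = (lo + hi) // 2
        if target <= mid:
            path.append(0)   # keep the lower half
            hi = mid
        else:
            path.append(1)   # keep the upper half
            lo = mid + 1
    return path

print(path_to_position(8, 8))  # [1, 1, 1] -- three rightward choices
print(path_to_position(2, 8))  # [0, 0, 1] -- left, left, then right
print(path_to_position(1, 8))  # [0, 0, 0] -- the very first position
```

Under this convention the recorded sequence is simply the position (counting from zero) written in binary, which is why log n zeros and ones always suffice.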
Shannon called these zero–one symbols “bits,” heralding the information age of “bits and bytes” (where a byte is just a sequence of eight bits). Three bits can be arranged in exactly eight distinct sequences, since 2 × 2 × 2 = 8, which is why log 8 is 3. Another way of saying this is that since these three bits suffice to represent the reduction in uncertainty about which of the eight words is being chosen, the information content of the message conveying the word’s position is 3 bits. Rather long-winded? Why not merely convey the symbol “8”? Would this not be easier? Or were bits more efficient?
It makes no difference. The amount of information is the same whether conveyed by three bits or by one symbol chosen from eight possibilities. This was first shown by Shannon’s senior at Bell Labs, Hartley, way back in 1928, well before Shannon’s arrival there. What Shannon did was take this definition of information and use it to define, in precise mathematical terms, the capacity of any channel for communicating information. For Shannon, channels were wired or wireless means of communication, using the technologies of telegraph, telephone, and later radio. Today, Shannon’s theory is used to model data communication on computer networks, including, of course, the Internet. But as we have suggested, the notion of a channel can be quite general, and his information theory has since been applied in areas as diverse as physics, linguistics, and, of course, web technology.
If the information content of a message was the degree to which it reduced uncertainty upon arrival, then in order to define channel capacity it was important to know what the uncertainty was before the signal’s value was known. As we have seen earlier, exactly one bit of information is received by either of the messages, “heads” or “tails,” signaling the outcome of a fair coin toss. We have also seen that no information is conveyed for a two-headed coin, since it can only show one result. But what about a peculiar coin that shows up heads a third of the time and tails otherwise? The information conveyed by each signal, “heads” or “tails,” is now different: each “head,” which turns up 1/3 of the time, conveys log 3, or about 1.58, bits of information, while each “tail,” showing up with probability 2/3, conveys only log 3/2, or about 0.58, bits. Shannon defined the term entropy to measure the average information conveyed over a large number of outcomes, calculated as the information conveyed by each outcome weighted by the probability of that outcome. So the entropy of the fair coin signal is 1/2 × 1 + 1/2 × 1 = 1, since each possible outcome conveys one bit, and moreover each outcome occurs half the time, on average. Similarly, a sequence of tosses of the two-headed coin has zero entropy. However, for the loaded coin, the entropy becomes 1/3 × log 3 + 2/3 × log 3/2, which works out to about 0.92; a shade less than that of the fair coin.
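A minimal Python sketch of this entropy calculation (the function name entropy is our own choice) reproduces these numbers:

```python
from math import log2

def entropy(probabilities):
    """Average information in bits: each outcome's log2(1/p), weighted by p."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

print(entropy([1 / 2, 1 / 2]))  # fair coin: 1.0 bit
print(entropy([1.0]))           # two-headed coin: 0.0 bits
print(entropy([1 / 3, 2 / 3]))  # loaded coin: ~0.92 bits
```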
Shannon was interested in a theory describing the transmission of information over any channel whatsoever. So he needed to figure out the relationship between the uncertainties in the two signals at either end of a communications channel, or more precisely their entropies. He defined the idea of mutual information between the signal sent down a channel and the one actually received. If the two signals corresponded closely to each other, with only occasional discrepancies, then the mutual information between them was high; otherwise it was lower. A simple way to understand mutual information is to imagine that you are the receiver of a signal, continuously getting messages over a communication channel such as a telegraph or radio. But you have no idea how closely the received messages match those that were sent. Now suppose you somehow got independent reports of what messages were actually sent, say by magic, or by a messenger on horseback who arrived days later. You could then work out how often the channel misled you. The degree to which these reports surprised you, by revealing transmission errors, would let you measure how good or bad the earlier transmissions were. As with our coin tosses earlier, this degree of surprise, on average, is nothing but the entropy of these special reports, which Shannon called conditional entropy. If the conditional entropy was high, i.e., the reports often surprised you by pointing out errors in transmission, then the mutual information between the sent and received signals was low. If the reports rarely surprised you, behaving almost like the two-headed coin by nearly always agreeing with your observation of the received signal, then the conditional entropy was low and the mutual information high. Shannon defined the mutual information as the difference between the entropy of whatever was actually being transmitted and the conditional entropy.
For example, suppose that you are communicating the results of a fair coin toss over a communication channel that makes errors 1/3 of the time. The conditional entropy, measuring your surprise at these errors, is the same as the entropy of the loaded coin described earlier, i.e., about 0.92. The entropy of the transmitted signal, being a fair coin, is 1, and the mutual information is the difference, or roughly 0.08, indicating that the channel transmission does somewhat decrease your uncertainty about the source signal. On the other hand, if as many as half the transmissions were erroneous, then the conditional entropy would equal that of the fair coin, i.e., exactly 1, making the mutual information zero. In this case the channel transmission fails to convey anything about the coin tosses at the source.
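Assuming, as this example implicitly does, that the channel flips each transmitted symbol independently with the stated error rate (a “binary symmetric channel,” our modelling choice for this sketch), the mutual information for a fair-coin source is simply the source entropy minus the conditional entropy:

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a two-outcome signal with probabilities p and 1 - p."""
    if p in (0, 1):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def mutual_information(error_rate):
    """Mutual information between a fair-coin source and the received signal,
    when the channel flips each transmitted symbol with the given error rate."""
    return binary_entropy(1 / 2) - binary_entropy(error_rate)

print(mutual_information(1 / 3))  # ~0.08 bits per toss
print(mutual_information(1 / 2))  # 0.0 -- the received signal tells us nothing
```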
Next, Shannon defined the capacity of any communication channel as the maximum mutual information it could possibly exhibit, over all possible choices of transmitted signal. Moreover, he showed how to actually calculate the capacity of a communication channel, without necessarily having to exhibit the kind of signal that would achieve this maximum. This was a giant leap, for it gave engineers precise knowledge of how much information they could actually transmit over a particular communication technology, such as a telegraph wire over a certain distance or a radio signal of a particular strength, and with what accuracy. At the same time, it left them with the remaining task of actually achieving that capacity in practice, by, for example, carefully encoding the messages to be transmitted.
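For the simple symmetric channel of the previous sketch, this maximum can be written down in closed form: it is 1 minus the binary entropy of the error rate, and it happens to be achieved by a fair-coin input. A hedged sketch, reusing our binary_entropy helper:

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a two-outcome signal with probabilities p and 1 - p."""
    if p in (0, 1):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def channel_capacity(error_rate):
    """Capacity, in bits per use, of a channel that flips each symbol with the
    given error rate: the maximum mutual information over all input signals,
    achieved here by a fair-coin input."""
    return 1.0 - binary_entropy(error_rate)

print(channel_capacity(0.0))    # a perfect channel carries 1.0 bit per use
print(channel_capacity(1 / 3))  # the noisy channel above: ~0.08 bits per use
print(channel_capacity(1 / 2))  # pure noise: 0.0 -- nothing gets through
```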
Now, let us return to the world of advertising and the more abstract idea of treating paper circulation or sales figures as a signal about our own behavior of seeking and reading. In terms of Shannon’s information theory, the mutual information between reader behavior and measures such as circulation or sales is quite low. Little can be done to link the two, since the channel itself, i.e., the connection between what we actually read and aggregate circulation or product sales, is a very tenuous one.
The Penny Clicks
Enter online advertising on the Internet. Early Internet “banner” advertisements, which continue to this day, merely translated the experience of traditional print advertising onto a web page. The more people viewed a page, the more one had to pay for advertising space. Instead of circulation, measurements of the total number of “eyeballs” viewing a page could easily be derived from page hits and other network-traffic statistics. But the mutual information between eyeballs and outcomes remained as weak as for print media. How weak became evident from the dot-com bust of 2001. Internet companies had fueled the preceding bubble by grossly overestimating the value of the eyeballs they were attracting. No one stopped to question whether the new medium was anything more than just that, i.e., a new way of selling traditional advertising. True, a new avenue for publishing justified some kind of valuation, but how much was never questioned. With 20/20 hindsight it is easy to say that someone should have questioned the fundamentals more closely. But hindsight always appears crystal clear. And history never fails to repeat itself.
As of this writing, a new bubble is looming in the world of social networking. Just possibly, a deeper analysis, based perhaps on the concept of mutual information, might reveal some new insight. Is the current enthusiasm for the potential profitability of “new age” social networking sites justified? Only time will tell. In the meantime, recent events such as the relatively lukewarm response to Facebook’s initial public offering in mid-2012 give us reason to pause and ponder. To see how mutual information might come in handy here, let us first look at what Google and other search engines did to change the mutual-information equation between consumers and advertisers, thereby changing the fundamentals of online advertising and, for that matter, the entire media industry.
An ideal scenario from the point of view of an advertiser would be to pay only when a consumer actually buys their product. In such a model the mutual information between advertising and outcome would be very high indeed. Making such a connection is next to impossible in the print world. However, in the world of web pages and clicks, it can in principle be done by charging the advertiser only when an online purchase is made. Thus, instead of being merely a medium for attracting customer attention, such a website becomes a sales channel for merchants. In fact, Groupon uses exactly such a model: it sells discount coupons to intelligently selected prospects, charging merchants a commission if and only if its coupons are used for actual purchases.
In the case of a search engine such as Yahoo! or Google, however, consumers may browse a product and end up not buying it because the product is poor, through no fault of the search engine provider. So why should Google or Yahoo! waste their advertising space on such ads? Today online advertisers use an intermediate model called “pay-per-click,” or PPC, in which an advertiser pays only if a potential customer clicks their ad, regardless of whether that click gets converted into a sale. At the same time, the advertiser does not pay if a customer merely looks at the ad without clicking it. The PPC model was invented by Bill Gross, who started GoTo.com in 1998. But it was Google that made PPC really work, by figuring out the best way to charge for ads in this model. In the PPC model, the mutual information between the potential buyer and the outcome is lower than for a sales channel such as Groupon. More importantly, however, the mutual information is highly dependent on which ad the consumer sees. If the ad is close to the consumer’s intent at the time she views it, there is a higher likelihood that she will click, thereby generating revenue for the search engine and a possible sale for the advertiser.
