TED Talks, What we learned from 5 million books

What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

(Applause)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science." (Laughter)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
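
[Editor's sketch] The per-year n-gram table described above can be illustrated in a few lines of Python. This is a minimal sketch, not the pipeline the speakers used: the `(year, text)` input shape, the whitespace tokenizer, the lowercasing, and the function names `ngrams` and `yearly_ngram_counts` are all assumptions made for illustration, and the two sample "books" are invented.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def yearly_ngram_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    `books` is an iterable of (year, text) pairs. That shape is an
    assumption of this sketch, not the format of the released data.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive tokenizer (assumption)
        for gram in ngrams(tokens, n):
            table[gram][year] += 1
    return table


# Invented sample texts, for illustration only.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
]
table = yearly_ngram_counts(books, 4)
series = table[("a", "gleam", "of", "happiness")]
print(sorted(series.items()))  # [(1801, 1), (1802, 2)]
```

Each key of `table` plays the role of one line of the big table: one n-gram with its count for every year in which it appears.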

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

(Laughter)

(Applause)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist. (Laughter)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

(Laughter)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not do is become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care. (Laughter)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany. Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is distribution as seen in Germany -- very different, it's shifted to the left. People talked about it twice less as it should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times fewer than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
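
[Editor's sketch] The suppression index described above (observed fame in a period divided by the average of the fame before and after it) can be written out as a short Python function. This is only a sketch under stated assumptions: the talk does not say how wide the "before" and "after" windows are, so the 10-year `window` default is an assumption, the function name `suppression_index` is hypothetical, and the frequencies below are made-up numbers, not values from the data set.

```python
from statistics import mean


def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean frequency in the `window`
    years before `start` and the `window` years after `end`. The window
    length is an assumption; the talk does not specify it.
    """
    def avg(years):
        values = [freq_by_year[y] for y in years if y in freq_by_year]
        if not values:
            raise ValueError("no data for the requested years")
        return mean(values)

    before = avg(range(start - window, start))
    after = avg(range(end + 1, end + 1 + window))
    observed = avg(range(start, end + 1))
    expected = (before + after) / 2
    if expected == 0:
        raise ValueError("expected fame is zero; index is undefined")
    return observed / expected


# Made-up yearly frequencies for one name, for illustration only.
freq = {}
freq.update({y: 4.0 for y in range(1923, 1933)})  # before the period
freq.update({y: 1.0 for y in range(1933, 1946)})  # during the period
freq.update({y: 6.0 for y in range(1946, 1956)})  # after the period

print(suppression_index(freq, 1933, 1945))  # 0.2
```

With these invented numbers the expected fame is 5.0 and the observed fame is 1.0, so the index is 0.2, well below one. An index near one means observed fame matches expectation, as in the English-language distribution mentioned above.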

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan. (Laughter)

JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there's newspapers, there's things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture. Thank you very much.

(Applause)
