How Can Big Data Save the World?

如何用大資料拯救世界？


Our ability to collect data far outpaces our ability to fully utilize it—yet those data may hold the key to solving some of the biggest global challenges facing us today.

我們蒐集資訊的能力遠遠強於分析使用的能力，然而，這些訊息可能包含了我們現如今正在面臨的全球性挑戰的解決辦法。

Take, for instance, the frequent outbreaks of waterborne illnesses as a consequence of war or natural disasters. The most recent example can be found in Yemen, where roughly 10,000 new suspected cases of cholera are reported each week—and history is riddled with similar stories. What if we could better understand the environmental factors that contributed to the disease, predict which communities are at higher risk, and put in place protective measures to stem the spread?

比如，戰爭或自然災害常常導致水媒疾病頻繁爆發。最近的例子發生在葉門，當地每週約通報一萬例新的疑似霍亂病例，而歷史上類似的事件比比皆是。如果我們能更好地瞭解導致疾病的環境因素，預測哪些社群風險較高，並採取保護措施來遏止疾病蔓延，將會怎麼樣呢？

Answers to these questions and others like them could potentially help us avert catastrophe.

這些問題以及其他類似問題的答案，或許能幫助我們避免災難。

We already collect data related to virtually everything, from birth and death rates to crop yields and traffic flows. IBM estimates that each day, 2.5 quintillion bytes of data are generated. To put that in perspective: that's the equivalent of all the data in the Library of Congress being produced more than 166,000 times per 24-hour period. Yet we don't really harness the power of all this information. It's time that changed—and thanks to recent advances in data analytics and computational services, we finally have the tools to do it.

我們已經在收集幾乎與所有事物有關的資料，從出生率、死亡率到農作物產量和交通流量。IBM公司估計每天產生250京(2.5×10^18)位元組的資料。換個角度來看：這相當於美國國會圖書館的全部資料每24小時被產生超過16.6萬次。然而我們並沒有真正發揮這些資訊的力量。是時候改變了。由於近來資料分析和運算服務的進步，我們終於有了做到這一點的工具。
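The comparison above can be checked with a few lines of arithmetic. The sketch below only divides the two figures quoted in the text; the article does not state the size of the Library of Congress collection, so the result is the size implied by those figures, and the use of decimal terabytes (10^12 bytes) is an assumption.

```python
# Figures quoted in the article.
DAILY_BYTES = 2.5e18      # 2.5 quintillion bytes generated per day (IBM estimate)
LOC_MULTIPLE = 166_000    # "more than 166,000 times" per 24-hour period

BYTES_PER_TB = 1e12       # assumption: decimal terabytes


def implied_collection_size_tb(daily_bytes: float, multiple: float) -> float:
    """Size of one 'Library of Congress' implied by the two quoted figures."""
    return daily_bytes / multiple / BYTES_PER_TB


size_tb = implied_collection_size_tb(DAILY_BYTES, LOC_MULTIPLE)
print(f"Implied collection size: about {size_tb:.1f} TB")
```

Because the text says "more than" 166,000 times, the result is best read as an upper bound of roughly 15 TB on the collection size the comparison assumes.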

As a data scientist for Los Alamos National Laboratory, I study data from wide-ranging, public sources to identify patterns in hopes of being able to predict trends that could be a threat to global security. Multiple data streams are critical because the ground-truth data (such as surveys) that we collect is often delayed, biased, sparse, incorrect or, sometimes, nonexistent.

作為洛斯阿拉莫斯國家實驗室的資料科學家，我研究來自廣泛公共來源的資料，以確定模式，希望能夠預測可能對全球安全構成威脅的趨勢。多個數據流是至關重要的，因為我們收集的基本事實資料(比如調查)常常是延遲的、有偏見的、稀疏的、不正確的，有時甚至是不存在的。

For example, knowing mosquito incidence in communities would help us predict the risk of mosquito-transmitted disease such as dengue, the leading cause of illness and death in the tropics. However, mosquito data at a global (and even national) scale are not available.

舉個例子，瞭解各社群的蚊蟲密度有助於我們預測登革熱等蚊媒疾病的風險，登革熱是熱帶地區致病和致死的首要原因。然而，目前還沒有全球(甚至全國)規模的蚊蟲資料。

To address this gap, we're using other sources such as satellite imagery, climate data and demographic information to estimate dengue risk. Specifically, we had success predicting the spread of dengue in Brazil at the regional, state and municipality level using these data streams as well as clinical surveillance data and Google search queries that used terms related to the disease. While our predictions aren't perfect, they show promise. Our goal is to combine information from each data stream to further refine our models and improve their predictive power.

為了彌補這一缺口，我們正在利用衛星影像、氣候資料和人口資訊等其他來源來估計登革熱風險。具體來說，我們利用這些資料流，加上臨床監測資料以及含有疾病相關詞彙的谷歌搜尋查詢，成功預測了登革熱在巴西地區、州和市各級的蔓延。雖然我們的預測並不完美，但已顯示出前景。我們的目標是結合各個資料流的資訊，進一步完善模型並提高其預測能力。
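One simple way to picture "combining information from each data stream" is a weighted average of per-stream forecasts, where streams that were more accurate in the past get more weight. This is a generic illustration, not the authors' model: the stream names, past errors, and forecast values below are all invented, and inverse-error weighting is just one common choice.

```python
def inverse_error_weights(past_errors: dict[str, float]) -> dict[str, float]:
    """Weight each stream by 1 / (its past mean absolute error), normalized to sum to 1."""
    inverse = {name: 1.0 / err for name, err in past_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(forecasts: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the per-stream forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Hypothetical values: past error and next-week case forecast for each stream.
past_errors = {"climate": 40.0, "search_queries": 20.0, "clinical": 10.0}
forecasts = {"climate": 130.0, "search_queries": 110.0, "clinical": 100.0}

weights = inverse_error_weights(past_errors)
combined = combine(forecasts, weights)
print(weights)
print(f"Combined forecast: {combined:.1f} cases")
```

With these made-up numbers the clinical stream, which had the smallest past error, receives four times the weight of the climate stream, so the combined forecast sits closer to 100 than to 130.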

Similarly, to forecast the flu season, we have found that Wikipedia and Google searches can complement clinical data. Because the rate of people searching the internet for flu symptoms often increases during their onset, we can predict a spike in cases where clinical data lags.

同樣，為了預測流感季節，我們發現維基百科和谷歌搜尋可以補充臨床資料。由於人們在網際網路上搜尋流感症狀的比率往往在發病初期上升，我們可以在臨床資料滯後的情況下預測病例的激增。
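The idea that search activity rises before clinical reports catch up can be tested by correlating the two series at different time offsets. The sketch below is an illustration under stated assumptions, not the authors' method: both weekly series are synthetic, and the case counts are constructed to trail the search curve by two weeks.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_lead(search: list[float], cases: list[float], max_lead: int) -> tuple[int, float]:
    """Find the lead (in weeks) at which search[t] best correlates with cases[t + lead]."""
    scores = {}
    for lead in range(max_lead + 1):
        # Pair each search value with the case count `lead` weeks later.
        scores[lead] = pearson(search[: len(search) - lead], cases[lead:])
    lead = max(scores, key=scores.get)
    return lead, scores[lead]


# Synthetic weekly data, invented for illustration.
search_volume = [10, 12, 15, 30, 60, 90, 70, 40, 20, 12, 10, 10]
reported_cases = [5, 5] + [v // 2 for v in search_volume[:-2]]  # trails by 2 weeks

lead, r = best_lead(search_volume, reported_cases, max_lead=4)
print(f"Search volume leads reported cases by {lead} weeks (r = {r:.3f})")
```

On real data the peak would be less sharp, and a strong lagged correlation alone would not show that searches are a reliable predictor; it only suggests which offset is worth modeling.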

We're using these same concepts to expand our research beyond disease prediction to better understand public sentiment. In partnership with the University of California, we're conducting a three-year study using disparate data streams to understand whether opinions expressed on social media map to opinions expressed in surveys.

我們正運用同樣的概念，把研究從疾病預測擴展到更好地理解公眾情緒。我們正與加州大學合作進行一項為期三年的研究，運用不同的資料流來瞭解社交媒體上所表達的觀點是否與調查中所表達的觀點一致。

For example, in Colombia, we are conducting a study to see whether social media posts about the peace process between the government and FARC, the socialist guerilla movement, can be ground-truthed with survey data. A University of California, Berkeley researcher is conducting on-the-ground surveys throughout Colombia—including in isolated rural areas—to poll citizens about the peace process. Meanwhile, at Los Alamos, we're analyzing social media data and news sources from the same areas to determine if they align with the survey data.

例如，在哥倫比亞，我們正在進行一項研究，看看關於政府與社會主義游擊隊運動FARC之間和平進程的社交媒體帖子是否可以用調查資料來驗證。加州大學伯克利分校的一名研究員正在哥倫比亞各地(包括偏遠的農村地區)進行實地調查，瞭解公民對和平進程的看法。與此同時，在洛斯阿拉莫斯，我們正在分析來自同一地區的社交媒體資料和新聞來源，以確定它們是否與調查資料一致。
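Checking whether social media "aligns with" survey data can be framed as comparing, region by region, the share of supportive posts with the share of supportive survey responses. The sketch below is a minimal illustration of that comparison only: the region names, post labels, and survey figures are hypothetical, and it assumes posts have already been labeled as "support" or "oppose" by some earlier step that the article does not describe.

```python
def support_share(labels: list[str]) -> float:
    """Fraction of labeled posts marked 'support'."""
    return sum(1 for label in labels if label == "support") / len(labels)


def mean_absolute_gap(social: dict[str, float], survey: dict[str, float]) -> float:
    """Average absolute difference between the two sources over shared regions."""
    regions = sorted(social.keys() & survey.keys())
    return sum(abs(social[r] - survey[r]) for r in regions) / len(regions)


# Hypothetical labeled posts and survey results per region.
posts_by_region = {
    "region_a": ["support", "support", "oppose", "support"],
    "region_b": ["oppose", "oppose", "support", "oppose"],
    "region_c": ["support", "oppose"],
}
survey_support = {"region_a": 0.70, "region_b": 0.30, "region_c": 0.55}

social_support = {region: support_share(posts) for region, posts in posts_by_region.items()}
gap = mean_absolute_gap(social_support, survey_support)
print(f"Mean absolute gap between social media and survey: {gap:.2f}")
```

A small gap across many regions would support using social media as a proxy; a real study would also need to account for who posts online, which this sketch ignores.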

If we can demonstrate that social media accurately captures a population's sentiment, it could be a more affordable, accessible and timely alternative to what are otherwise expensive and logistically challenging surveys. In the case of disease forecasting, if social media posts did indeed serve as a predictive tool for outbreaks, those data could be used in educational campaigns to inform citizens of the risk of an outbreak (due to vaccine exemptions, for example) and ultimately reduce that risk by promoting protective behaviors (such as washing hands, wearing masks, remaining indoors, etc.).

如果我們能證明社交媒體能準確捕捉民眾的情緒，那麼相較於成本高昂、執行困難的調查，它就可以成為一種更實惠、更容易取得且更及時的替代方法。在疾病預測方面，如果社交媒體帖子確實能作為預測疫情爆發的工具，這些資料就可以用於宣導活動，告知民眾爆發疫情的風險（例如因疫苗豁免所致），並最終透過推廣防護行為（如洗手、戴口罩、待在室內等）來降低風險。

All of this illustrates the potential for big data to solve big problems. Los Alamos and other national laboratories that are home to some of the world's largest supercomputers have the computational power augmented by machine learning and data analysis to take this information and shape it into a story that tells us not only about one state or even nation, but the world as a whole. The information is there; now it's time to use it.

所有這些都顯示了用大資料解決大問題的潛力。洛斯阿拉莫斯和其他擁有部分全球最大型超級電腦的國家實驗室，具備以機器學習和資料分析強化的運算能力，能夠運用這些資訊，把它塑造成一個故事，告訴我們的不只是一個州或一個國家的情況，而是整個世界的情況。資訊就在那裡，現在是時候使用它了。