How Big Data Can Help Save the World

Our ability to collect data far outpaces our ability to fully utilize it—yet those data may hold the key to solving some of the biggest global challenges facing us today.

我們收集數據的能力遠遠超過充分利用數據的能力，然而，這些數據可能正是解決當今一些最重大的全球性挑戰的關鍵。

Take, for instance, the frequent outbreaks of waterborne illnesses as a consequence of war or natural disasters. The most recent example can be found in Yemen, where roughly 10,000 new suspected cases of cholera are reported each week—and history is riddled with similar stories. What if we could better understand the environmental factors that contributed to the disease, predict which communities are at higher risk, and put in place protective measures to stem the spread?

比如，戰爭或自然災害常常導致水源性疾病頻繁爆發。最近的例子發生在也門，那裏每週新增約一萬例疑似霍亂病例，而歷史上類似的事例比比皆是。如果我們能更好地瞭解導致該病的環境因素，預測哪些社區風險更高，並採取保護措施遏制傳播，將會怎麼樣呢？

Answers to these questions and others like them could potentially help us avert catastrophe.

這些問題以及類似問題的答案，或許能幫助我們避免災難。

We already collect data related to virtually everything, from birth and death rates to crop yields and traffic flows. IBM estimates that each day, 2.5 quintillion bytes of data are generated. To put that in perspective: that's the equivalent of all the data in the Library of Congress being produced more than 166,000 times per 24-hour period. Yet we don't really harness the power of all this information. It's time that changed—and thanks to recent advances in data analytics and computational services, we finally have the tools to do it.

我們幾乎爲所有事物收集數據，從出生率和死亡率到作物產量和交通流量。IBM公司估計，每天產生的數據達250億億（2.5×10^18）字節。換個角度來看：這相當於每24小時把美國國會圖書館的全部數據生成16.6萬次以上。然而我們並沒有真正駕馭這些信息的力量。是時候改變了。得益於數據分析和計算服務的最新進展，我們終於有了實現這一點的工具。
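
As a rough check on the comparison above, dividing the daily volume by the stated multiple gives the collection size that the figure implies. This is only a back-of-the-envelope sketch: the two input numbers come from the text, while the function and constant names are illustrative and the use of decimal terabytes (10^12 bytes) is an assumption.

```python
# Back-of-the-envelope check of the Library of Congress comparison.
# The two input figures are quoted in the text; all names are illustrative.

BYTES_PER_DAY = 2.5e18               # 2.5 quintillion bytes per day (IBM estimate)
LIBRARY_MULTIPLES_PER_DAY = 166_000  # "more than 166,000 times" per 24 hours
BYTES_PER_TERABYTE = 1e12            # assumption: decimal terabytes


def implied_library_size_tb(bytes_per_day: float, multiples: int) -> float:
    """Return the collection size, in terabytes, implied by the comparison."""
    return bytes_per_day / multiples / BYTES_PER_TERABYTE


if __name__ == "__main__":
    size_tb = implied_library_size_tb(BYTES_PER_DAY, LIBRARY_MULTIPLES_PER_DAY)
    print(f"Implied collection size: about {size_tb:.1f} TB")
```

Because the text says "more than" 166,000 times, the result (roughly 15 TB) is best read as an upper bound on the collection size the comparison assumes.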

As a data scientist for Los Alamos National Laboratory, I study data from wide-ranging public sources to identify patterns, in hopes of being able to predict trends that could be a threat to global security. Multiple data streams are critical because the ground-truth data (such as surveys) that we collect are often delayed, biased, sparse, incorrect or, sometimes, nonexistent.

作爲洛斯阿拉莫斯國家實驗室的數據科學家，我研究來自廣泛公共來源的數據，以確定模式，希望能夠預測可能對全球安全構成威脅的趨勢。多個數據流是至關重要的，因爲我們收集的基本事實數據(比如調查)常常是延遲的、有偏見的、稀疏的、不正確的，有時甚至是不存在的。

For example, knowing mosquito incidence in communities would help us predict the risk of mosquito-transmitted disease such as dengue, the leading cause of illness and death in the tropics. However, mosquito data at a global (and even national) scale are not available.

舉個例子，瞭解各社區的蚊蟲出現情況，有助於我們預測登革熱等蚊媒疾病的風險，登革熱是熱帶地區致病和致死的首要原因。然而，目前還沒有全球（甚至全國）規模的蚊蟲數據。

To address this gap, we're using other sources such as satellite imagery, climate data and demographic information to estimate dengue risk. Specifically, we had success predicting the spread of dengue in Brazil at the regional, state and municipality level using these data streams as well as clinical surveillance data and Google search queries that used terms related to the disease. While our predictions aren't perfect, they show promise. Our goal is to combine information from each data stream to further refine our models and improve their predictive power.

爲了彌補這一差距，我們正在利用衛星圖像、氣候數據和人口信息等其他來源來估計登革熱風險。具體來說，我們成功地利用這些數據流、臨牀監測數據和使用與疾病有關的術語的谷歌搜索查詢，預測了登革熱在巴西的地區、州和市一級的蔓延。雖然我們的預測並不完美，但它們顯示出了希望。我們的目標是將來自每個數據流的信息結合起來，以進一步完善我們的模型並提高它們的預測能力。

Similarly, to forecast the flu season, we have found that Wikipedia and Google searches can complement clinical data. Because the rate of people searching the internet for flu symptoms often increases at the onset of those symptoms, we can predict a spike in cases even when clinical data lag.

同樣，爲了預測流感季節，我們發現維基百科和谷歌搜索可以補充臨牀數據。由於人們在症狀剛出現時上網搜索流感症狀的比率往往會上升，即使臨牀數據滯後，我們也能預測病例的激增。
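
The idea that search activity rises before clinical reports catch up can be illustrated with a lagged correlation. The sketch below is not the method used in the research described here; it is a minimal illustration on synthetic weekly numbers, and the function names `pearson` and `best_lead` are hypothetical.

```python
# Sketch: does search interest lead clinical case counts, and by how many weeks?
# All numbers below are synthetic and purely illustrative.
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_lead(searches, cases, max_lead):
    """Return (lead, correlation) for the lead, in weeks, at which search
    volume best matches later case counts."""
    scores = {}
    for lead in range(max_lead + 1):
        # Pair searches in week t with cases reported in week t + lead.
        paired_searches = searches[: len(searches) - lead]
        paired_cases = cases[lead:]
        scores[lead] = pearson(paired_searches, paired_cases)
    lead = max(scores, key=scores.get)
    return lead, scores[lead]


# Synthetic weekly series: reported cases echo search volume two weeks later.
searches = [10, 12, 15, 30, 55, 80, 60, 35, 20, 14, 11, 10]
cases = [5, 5] + [s // 2 for s in searches[:-2]]

if __name__ == "__main__":
    lead, r = best_lead(searches, cases, max_lead=4)
    print(f"Best lead: {lead} weeks (r = {r:.2f})")
```

With this synthetic data the strongest match is at a two-week lead, because the case series was built as a delayed copy of the search series. Real surveillance data would be far noisier and would call for a proper forecasting model.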

We're using these same concepts to expand our research beyond disease prediction to better understand public sentiment. In partnership with the University of California, we're conducting a three-year study using disparate data streams to understand whether opinions expressed on social media map to opinions expressed in surveys.

我們正運用同樣的概念，將研究從疾病預測擴展到更好地理解公衆情緒。我們正與加州大學合作開展一項爲期三年的研究，運用不同的數據流來了解社交媒體上表達的觀點是否與調查中表達的觀點一致。

For example, in Colombia, we are conducting a study to see whether social media posts about the peace process between the government and FARC, the socialist guerrilla movement, can be ground-truthed with survey data. A University of California, Berkeley researcher is conducting on-the-ground surveys throughout Colombia, including in isolated rural areas, to poll citizens about the peace process. Meanwhile, at Los Alamos, we're analyzing social media data and news sources from the same areas to determine if they align with the survey data.

例如，在哥倫比亞，我們正在進行一項研究，看看關於政府與社會主義游擊隊運動哥倫比亞革命武裝力量（FARC）之間和平進程的社交媒體帖子，是否可以用調查數據來證實。加州大學伯克利分校的一名研究員正在哥倫比亞各地(包括偏遠的農村地區)進行實地調查，瞭解公民對和平進程的看法。與此同時，在洛斯阿拉莫斯，我們正在分析來自同一地區的社交媒體數據和新聞來源，以確定它們是否與調查數據一致。

If we can demonstrate that social media accurately captures a population's sentiment, it could be a more affordable, accessible and timely alternative to what are otherwise expensive and logistically challenging surveys. In the case of disease forecasting, if social media posts did indeed serve as a predictive tool for outbreaks, those data could be used in educational campaigns to inform citizens of the risk of an outbreak (due to vaccine exemptions, for example) and ultimately reduce that risk by promoting protective behaviors (such as washing hands, wearing masks and remaining indoors).

如果我們能證明社交媒體能準確捕捉公衆情緒，那麼相較於費用高昂、後勤難度大的調查，它就可以成爲一種更實惠、更易獲取、更及時的替代方法。在疾病預測方面，如果社交媒體帖子確實能夠作爲預測疾病爆發的工具，這些數據就可以用於宣傳教育活動，告知公衆疾病爆發的風險（例如由疫苗豁免導致的風險），並最終通過倡導保護性行爲（如洗手、戴口罩、待在室內）來降低這種風險。

All of this illustrates the potential for big data to solve big problems. Los Alamos and other national laboratories, home to some of the world's largest supercomputers, have the computational power, augmented by machine learning and data analysis, to take this information and shape it into a story that tells us not only about one state or even one nation, but about the world as a whole. The information is there; now it's time to use it.

所有這些都表明了用大數據解決大問題的潛力。洛斯阿拉莫斯和其他國家實驗室擁有一些世界上最大的超級計算機，其運算能力又因機器學習和數據分析而得到增強，因此能夠把這些信息整理成一個完整的故事，它講述的不只是一個州乃至一個國家，而是整個世界。信息就在那裏，現在是時候使用它了。