Paper title
WARC-DL: Scalable Web Archive Processing for Deep Learning
Authors
Abstract
Web archives have grown to petabytes. In addition to providing invaluable background knowledge on many social and cultural developments over the last 30 years, they also provide vast amounts of training data for machine learning. To benefit from recent developments in deep learning, the use of web archives requires a scalable solution for their processing that supports inference with and training of neural networks. To date, there is no publicly available library for processing web archives in this way, and some existing applications use workarounds. This paper presents WARC-DL, a deep learning-enabled pipeline for web archive processing that scales to petabytes.