Dataset Documentation
=====================

Introduction
------------

Here at Google we are grateful for the years of research in image
processing, Optical Character Recognition (OCR), document understanding,
information retrieval and other fields. This work is helping us index the
world's books and we would like to give something back to the community.

The dataset you are looking at is a collection of 1000 public domain
volumes that were scanned as part of the Google Book Search project. It is
being distributed to support research in a variety of disciplines. Each
volume comes with the scanned images, OCR output, page tags and basic
metadata. The volumes in this dataset are written in 4 languages: English,
French, Italian and Spanish. This document describes the organization of
the dataset and the file formats.

Google's mission is to organize the world's information and to make it
universally accessible and useful. Google Book Search helps readers
discover the world's books while helping authors and publishers reach new
audiences. You can search through the full text of public domain books on
the web at http://books.google.com/

Dataset organization
--------------------

The file system structure of the dataset at the top-level looks like this:

  README.txt        (this file)
  VERSION.txt       (contains the dataset version and date)
  Volume_0000.zip   (first volume of the dataset)
  Volume_0001.zip
  Volume_0002.zip
  ...
  Volume_0999.zip   (last volume of the dataset)

Each Volume_X.zip archive contains the following files:

  hOCR.html         (OCR engine output for the volume; also contains basic
                     metadata)
  Pagetags.txt      (page tag information, e.g. front cover or back cover)
  Url               (contains the url of the volume on Google Book Search)
  Images            (contains an ordered list of image files for this volume)
  Image_0.JPEG      (first image of this volume)
  Image_1.JPEG
  ...
  Image_N.JPEG      (last image of this volume; the volume contains N+1
                     pages)

File formats
------------

hOCR.html:
  OCR output in hOCR format with integrated basic metadata. You can find a
  link to the format definition on the Ocropus project page (see below).
  The number of 'ocr_page' div elements in this file matches the number of
  pages in the volume. The ocr_page elements map one-to-one to the
  Image_X.JPEG files of the volume.

  The metadata comprises the title, creator (author), publisher, date
  (publication year) and language of a volume in Simple Dublin Core meta
  elements in the hOCR document header. You can find out more about Dublin
  Core at http://dublincore.org/

  hOCR is also used by the Ocropus project at
  http://code.google.com/p/ocropus/ and several tools are available at
  http://code.google.com/p/hocr-tools/

Pagetags.txt:
  Page tag information for all exported images, one line per image. Within
  each line, the records are separated by commas: the first column is the
  index of the image the tags apply to (0 = first image). The remaining
  columns contain the page tags for the image. Valid tags are:

    FRONT_COVER         the front cover of the volume
    TITLE               the title page
    COPYRIGHT           the copyright page
    PREFACE             a preface page
    TABLE_OF_CONTENTS   a table of contents page
    CHAPTER_START       a new chapter starts on this page
    REFERENCES          a page with references, citations or a bibliography
    INDEX               an index page
    BACK_COVER          the back cover
    BLANK               this page is blank
    IMAGE_ON_PAGE       the page contains an image

Images:
  The list of image files in the directory, one image per line. The images
  are in the same order as the pages in the volume, that is, the first
  image in the file is of the first page of the volume, the second image in
  the file is of the second page and so on.

Image_X.JPEG:
  Image of page X of the volume in jpeg format (the first image has index
  X = 0) at 300dpi resolution. The image has undergone a small amount of
  processing:

  1. Dewarping: "flattening" of the page to remove any curving of the
     paper.
  2. Deskewing: rotation of the page so that text lines are running
     horizontally (text in normal orientation).
  3. Cropping: the image is cropped down to the page size.
  4. Cleanup: the page border is filled with the background color of the
     page.

Dataset maintenance
-------------------

This dataset might evolve over time as our image processing algorithms and
OCR engine change. Each dataset release will be marked with a version
number and release date in the VERSION.txt file. If you use this dataset in
publications, please indicate which version of the dataset you were using
(see below).

Usage Guidelines
----------------

Google is proud to partner with libraries to digitize public domain
materials and make them widely accessible. The following guidelines are not
formal restrictions, but requests that we make of you as a user of this
dataset. We ask that you:

- Make only non-commercial use of the files in this dataset. We designed
  Google Book Search for use by individuals, and we request that you use
  these files for personal, non-commercial purposes.

- Refrain from automated querying. Do not send automated queries of any
  sort to Google's system. If you are conducting research on machine
  translation, optical character recognition or other areas where access to
  a large amount of text is helpful, please contact us. We encourage the
  use of public domain materials for these purposes and may be able to
  help.

- Keep it legal. Whatever your use, remember that you are responsible for
  ensuring that what you are doing is legal. We believe that the books in
  this dataset are in the public domain, but in general we can't offer
  guidance on whether any specific use of any specific book in Google Book
  Search is allowed. Please do not assume that a book's appearance in
  Google Book Search means it can be used in any manner anywhere in the
  world.

- Maintain attribution. If you redistribute any part of this dataset,
  please include copies of this file and the VERSION.txt file.
  If you use the dataset in a publication, please cite it as follows (fill
  in V and Month/Year):

    Google Inc.: Book Search Dataset, Version V, Month/Year.

  or use this BibTeX template:

    @misc{booksearchdata,
      author="{Google Inc.}",
      title="Book Search Dataset",
      note="Version V",
      month="Month",
      year=Year
    }

Contact/Questions
-----------------

We hope you will find this dataset useful in your research and plan to
continue to explore ways to support researchers through our Google Book
Search initiative. We are very interested in research results derived from
this corpus, so please share these results with us so we can understand the
impact of this dataset and how we can continue to support this community.

We realize that you may have a lot of questions about all the steps that
went into creating the dataset. Google does not share a lot of operational
details about our products - our focus is on the user experience. For this
reason, we cannot provide you with details about our scanning and
downstream processing technology beyond what's listed in this document.

If you have other comments or questions about the dataset, please contact
us at books-datasets-support@google.com.
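
Example: reading Pagetags.txt
-----------------------------

The Pagetags.txt format described under "File formats" (one line per image,
comma-separated, image index first, page tags after it) can be read with a
few lines of Python. This is a minimal sketch and not part of the dataset
tooling: the names parse_pagetags and image_name and the sample records are
hypothetical. It also assumes that a line may carry an index with no tags,
that blank lines can be skipped, and that whitespace around a field is not
significant. None of this is stated in the format description.

```python
from typing import Dict, List

# Tag names as listed under "File formats" in this document.
VALID_TAGS = {
    "FRONT_COVER", "TITLE", "COPYRIGHT", "PREFACE", "TABLE_OF_CONTENTS",
    "CHAPTER_START", "REFERENCES", "INDEX", "BACK_COVER", "BLANK",
    "IMAGE_ON_PAGE",
}


def parse_pagetags(text: str) -> Dict[int, List[str]]:
    """Map image index -> list of page tags, given the text of Pagetags.txt.

    Assumed layout: an integer image index, then zero or more
    comma-separated tags. Blank lines are skipped (an assumption).
    """
    tags_by_image: Dict[int, List[str]] = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if not fields[0]:
            continue  # blank line
        index = int(fields[0])
        tags = [field for field in fields[1:] if field]
        unknown = [tag for tag in tags if tag not in VALID_TAGS]
        if unknown:
            # Rejecting unknown tags is a choice made for this sketch.
            raise ValueError(f"image {index}: unknown tag(s) {unknown}")
        tags_by_image[index] = tags
    return tags_by_image


def image_name(index: int) -> str:
    """File name for an image index, following the Image_0.JPEG pattern
    shown in the archive listing. The Images file in each archive is the
    authoritative list of names."""
    return f"Image_{index}.JPEG"


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the dataset.
    sample = "0,FRONT_COVER,IMAGE_ON_PAGE\n1,BLANK\n2,TITLE\n3\n"
    for index, tags in sorted(parse_pagetags(sample).items()):
        print(image_name(index), tags)
```

Because the ocr_page elements in hOCR.html map one-to-one to the image
files, the same index can also be used to pick the matching ocr_page
element for a tagged page.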