About

This blog is a personal expression. It is about what interests me, where my focus is and how I can pass on my experiences. The postings will be when it makes sense to me and what I think will interest the readers. I look forward to feedback in the comments, however, this is not intended […]


Contact

John Latta
jnl@fourthwave.com

fourthwave.com

wave-report.com


Book Scanning

Posted on December 11, 2013 in Book Scanning

A Perspective on Personal Production Book Scanning

Printed books have a magnetism: they are held in the hand, they have heft, their pages are turned, and reading them is a physical act. Digital books, by contrast, can be searched, indexed and virtually marked. But purchased digital books have significant drawbacks. Two are that they cannot be resold and that, in one case, they were removed from ereaders after purchase without the buyers' knowledge. Buyers have no control over, or physical possession of, the books they “own” on their reader.

With the success of the Amazon Kindle, introduced in 2007, eBooks are taken for granted by many individuals. PricewaterhouseCoopers reports that eBooks account for over 25% of book sales in the US.

[Figure: Monthly e-book market share in the U.S., 2013]

There are dichotomies in today’s ebooks:

  • The pricing is keyed to the cost of the printed book and is unrelated to the cost of producing the ebook. In spite of the complaints by the book industry, the gross margins seem artificially high;
  • The cost of an ebook, for an out-of-print book, is frequently high compared to the cost of a used copy; and
  • The ebooks are intended for reading and not research.

Thus, personally created digital books, in PDF form, are appealing because they avoid many of these issues. My personal library once held 10,000 bound books and has since declined to 2,000+. The lure is the ability to travel with a large library on an Apple iPad, mark up a book, create references and text citations, and enjoy reading. However, the lure and the reality are far apart. This has been a quest for the last 3 years.

This short note puts into context my quest and its current state.

The most recent spur for my activity around personal ebooks has been the effort to write the book World Citizen, which is described in more detail in this blog.

The needs related to authoring the book include these elements, all tied to the reference books in ebook form:

  • While reading, easily annotate and highlight the books;
  • Extract quotes based on the reading annotations;
  • Search the books; and
  • Extract a reference.

Each of these functions should be platform agnostic – a laptop, an iPad or even a smartphone.

This quest has left me with many choices, shaped by the equipment bought, the design and implementation of an 80/20 book scanner, and the tests that have been run. These are shown in this illustration, which provides the process flow for creating an eBook with my resources, including service bureaus.

[Figure: Book Review Process V2]

Being of the Maker mentality, I originally set out to build an “ideal” hardware solution, with the emphasis on the best physical book scanner I could design. This is seen on the far right side of the In-House Processing option. It is the 80/20 based scanner discussed below. Yet, in the span of 3 years much has changed. The technology for creating eBooks has improved significantly, and there are now service bureaus that create eBooks from one’s own library.

First, the nature of the need and how to produce an ebook are best summarized in the illustration. For example,

  • There are currently 750+ books in the library subject to reading or reviewing. From these, books will be selected to be scanned and converted to a PDF ebook. The selection process is based on a review of the physical book and a scan of its contents. Some of the books have been read already.
  • A decision point, which has not yet been exercised, is whether I should go outside or scan in-house. The natural inclination is to do this in-house given the investment made and the desire to control the quality. As will be discussed below, having books scanned outside is becoming more attractive.
  • There is a key factor in ebook creation: should the book be destroyed by cutting the spine? This is an option for both outside and in-house ebook creation. In-house there is a ScanSnap iX500. This has huge advantages:
      • It scans both sides of the page at a time;
      • It auto-selects color;
      • The scanning rate is 25 pages/min; and
      • The quality is very good.
  • Likewise, outside service bureaus cost less when doing destructive scans. But the big drawbacks are that the physical book is hard to use again, that some vendors destroy it once scanned, and that it cannot be resold.

In-House there are two non-destructive scanning options: a ScanSnap SV600 and the 80/20 based scanner. The ScanSnap includes software, while the 80/20 scanner must rely on independent software, which will be discussed below.

The output from each of the In-House scanners is a JPEG file per page, once processed. This is then OCR’ed for text recognition and indexed so that the book can be searched.

Others share a desire for personally created digital books, and this has spawned the DIY book scanning movement. I reviewed many of these designs and found them lacking. Key elements of an acceptable book scanner included:

  • Performance: it must be a production tool;
  • Modularity: capable of handling newspapers, small books and books as large as atlases; and
  • High color quality: good image reproduction with minimal reflections.

I then set out to create a high-quality, robust scanner based on 80/20 materials, which provided modularity and ease of modification. I did the basic layout and specifications, and the detailed design was done by 80/20. The result was a superb design, as illustrated below.

[Figure: The 80/20 book scanner]

This scanner has these features:

  • It is modular, including platens for 2 book sizes and a flat scrapbook/newspaper platen which can easily be moved into place;
  • There is continuous image quality control with two monitors which show the images seen by the cameras;
  • The platen is counterbalanced, making it easy to move as it is lowered onto the book gutter;
  • The cradle moves so that the platen can center on the gutter no matter what part of the book is being scanned, from front to back;
  • The lights are color balanced and placed to avoid direct reflections on the plastic surface of the platen;
  • There is a solenoid camera trigger (not shown in this illustration; see below) which fires both cameras at the same time from the handle that raises and lowers the platen;
  • The platen which contacts the book pages is made of plastic and meets the gutter at an edge formed by the join of the two plastic surfaces. When weight is placed on the platen with the handle, this forces the pages apart in the book gutter. Plastic was used, in part, because it could be countersunk for attachment to the 80/20 aluminum;
  • The book scanner is placed on a moveable work table to allow access to all parts of the scanner; and
  • All the electronics and controls are mounted on the back frame, and most of the wires are routed inside the 80/20 T-slotted profiles.

Here is the flatbed scanner.

[Photo: the flatbed scanner]

Here is a photo of the current camera trigger frame which can support many camera types. A Sony RX-100 is shown here.

[Photo: the camera trigger frame with a Sony RX-100]

This was an engineering effort in and of itself.

Here are some of the lessons learned from the 80/20 scanner.

  • The individual monitors were critical for observing the camera setup and output.
  • When doing a book, every attempt was made to get the page image correct during the scanning process. Going back to redo a page once a book is scanned is much too time consuming. The downside is that there were frequently multiple images of the same page, and these had to be removed in the processing step.
  • Every page was checked in the processing phase to verify image quality and exposure before the book was assembled into a PDF.
  • The most significant time element in photographing pages was camera auto focus. It was not unusual to press the shutter several times to get the camera to focus on the page. Another approach is to use fixed focus; however, this has not been tried. The Sony RX-100 does not have fixed focus. Using fixed focus can also lead to image quality issues if the lens-to-page distance shifts, which happens as the pages are turned from the front to the back of the book and the thickness of the stack varies.
  • As a design objective, one wants the book page to fill as much of the camera image sensor as possible. This means putting as many pixel elements on a page as the camera and lens permit. For the 80/20 scanner this resulted in the need for a 250mm (equivalent) lens to cover a 10” tall book, with the ability to focus accurately at 33”. Note that for a book the size of a paperback, a 300mm (equivalent) lens would be required. Most 35mm DSLR lenses will not accomplish this. Some of the macro capabilities in point-and-shoot cameras come close. One reason to match the sensor to the page is that the higher the pixels/inch on a page, the better the OCR results. Further, photographs have better image quality, subject, of course, to the printing process.
  • Tests were run with a Nikon D800 with a 36MP sensor. The results were actually worse than with the Sony RX-100 because the lenses on the Nikon did not fit the application.
  • Reflections come from illuminated objects which scatter light in a cone angle that reaches the lens. This can be determined from simple ray tracing. On a practical level, the dominant reflections came from scattering objects near the book and the camera. Among the worst offenders were the camera frame and support bar. Gaffer's tape was used to mask most of these. In retrospect, the T-frame members should have been black anodized.
  • Multiple changes were made to the 80/20 system as its use became more refined; the last major refinement was the remote triggering of the shutter. All of these changes were readily accomplished as a direct result of the modular characteristics of 80/20.
  • Right-angle alignment of the 80/20 frame members which support the camera is important. The initial design had the camera on a lever arm, which can be seen in the photo. The resulting sag meant that the image was canted in the frame, which may or may not be correctable in processing. As a result, the set-up of the camera in its frame and its mount on the 80/20 frame became important and, in some cases, time consuming.
  • A design was developed for automatic page turning. A major downside of virtually any page-turning design is that the time to scan will increase. A key issue is that the page-turning mechanism cannot drop the platen on the book when a page has only been partially turned. The initial assessment was that machine vision might be required.
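The lesson above about filling the sensor with the page comes down to simple arithmetic, sketched here in Python. This is only an illustration: the function name, the `fill_fraction` parameter and the 5472-pixel sensor width are assumptions chosen for the example, not measurements from my scanner.

```python
def page_pixels_per_inch(sensor_long_px: int,
                         page_height_in: float,
                         fill_fraction: float = 1.0) -> float:
    """Pixels per inch landing on the page.

    sensor_long_px : pixel count along the sensor's long side (assumed value)
    page_height_in : page height in inches, imaged along that long side
    fill_fraction  : share of the long side actually covered by the page
    """
    if not 0 < fill_fraction <= 1:
        raise ValueError("fill_fraction must be in (0, 1]")
    if page_height_in <= 0:
        raise ValueError("page_height_in must be positive")
    return sensor_long_px * fill_fraction / page_height_in


# Hypothetical case: a sensor 5472 pixels long and a 10-inch-tall page.
full_frame = page_pixels_per_inch(5472, 10.0)        # page fills the frame
half_frame = page_pixels_per_inch(5472, 10.0, 0.5)   # page fills half of it

print(f"page fills frame:      {full_frame:.0f} pixels/inch")
print(f"page fills half frame: {half_frame:.0f} pixels/inch")
```

If the page covers only half of the sensor's long side, half of the available resolution is lost before OCR ever sees the image, which is why the lens choice mattered so much.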

I should disclose that my graduate education is in optics. As a result I am a hawk on image quality. This likely goes beyond the expectations of most, but I have no tolerance for images which are poorly focused, non-uniformly illuminated or hard to read.

It is easy to get fixated on the elegance of the hardware and lose sight of the objective. This is very evident when a systems perspective is taken. Specifically, converting printed books into ebooks is a systems problem where the value of the end result is measured by two parameters: quality and time. Quality has two interrelated factors: image quality of the scanned page and the accuracy of the final OCR output which allows the book to be indexed and made searchable. Time is how long it takes to produce a final ebook from the original printed book. Another factor is cost and this will be addressed in the discussion to follow on scanning service bureaus.

The scope of the time problem is shown in this table:

[Table: Pages per Hour #2]

The first column represents the best sustained performance that has been achieved with the 80/20 scanner, and the third column is the ScanSnap iX500, based on its 25 pages/min specification. One has to be careful with these numbers, as there is overhead such as preparing the book, loading and unloading it, and the normal glitches in digitizing a book. I cannot recall completely scanning a book without an interruption. It is clear that there is a significant advantage to the dual-sided automatic page feed of the iX500, but this comes at a significant price: the destruction of the book.

I have not yet run enough tests on the SV600 to determine its scanning performance but it is on the order of the 80/20 scanner.

Ars Technica reports a wooden book scanner which can reach 150 pages a minute. I have seen a number of these designs and tests. Usually only a few books are digitized, image quality is seldom discussed, and the production capabilities are not addressed. Thus, I read such articles and reviews with a level of skepticism.

There is a significant hole in the analysis seen in the table above: it does not evaluate the processing time, which is directly related to the software that processes the individual JPEG images. The various software components are shown here. Below, in the table row, is the goal that the processing time be less than or equal to the scanning time.

As we can see from the table, based on continuous operation within an 8 hour day, it would take a month to create 500 ebooks from the existing library and about ½ that time with the ScanSnap iX500. This estimate has not been borne out in practice and is unlikely to be achieved. A key issue is the level of processing on a per page basis. The objective is that the effort of creating the individual book pages from the photographs approaches zero, that is, the vast majority of the individual ebook pages are created automatically. Using Book Scan Wizard on the output of the 80/20 scanner, there is far too much individual page processing for this to be accomplished. The SV600 is a vast improvement when it automatically selects the page edges, but when this fails one must pick each corner of each page by hand.

By far the best is the ScanSnap iX500, where much of the initial processing is done in the scanning process. This includes exposure balancing, page cropping and automatic color page selection. Given that the pages are flat in the scanner, the exposure is uniform and there are no concerns about page curl. As a result, the output of the ScanSnap iX500 is essentially page-ready. It can be put into a PDF program such as Acrobat, OCR’ed and output as a PDF. In this example, the scanner and its operation, including the software, make a huge difference, with the result that the goal of zero page processing is approached.

In summary, creating a library of 500 ebooks is a significant task. Given the normal interruptions of daily life, it would likely take 6 months to scan these books. This raises the question: is it better to use an outside service bureau? To address this, several book scanning services were evaluated with a focus on cost, as seen in the table below. Note that the numbers are approximate, to provide a top level comparison.
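A rough way to reproduce this kind of estimate is sketched below in Python. Only the 25 pages/min rate, the 500 books and the 8 hour day come from this post; the 300 pages per book and the `overhead` fraction are assumptions made for the example, and the function name is purely illustrative.

```python
def scanning_days(books: int,
                  pages_per_book: int,
                  pages_per_min: float,
                  hours_per_day: float = 8.0,
                  overhead: float = 0.0) -> float:
    """Working days needed to scan a library.

    overhead is the assumed fraction of each day lost to book preparation,
    loading/unloading and glitches (0.0 means continuous operation).
    """
    if not 0 <= overhead < 1:
        raise ValueError("overhead must be in [0, 1)")
    total_pages = books * pages_per_book
    scan_hours = total_pages / (pages_per_min * 60)
    return scan_hours / (1 - overhead) / hours_per_day


# 500 books at an ASSUMED 300 pages each, at the iX500's 25 pages/min.
ideal = scanning_days(500, 300, 25)                     # continuous operation
realistic = scanning_days(500, 300, 25, overhead=0.5)   # assumed 50% lost time

print(f"continuous operation: {ideal:.1f} working days")
print(f"with 50% overhead:    {realistic:.1f} working days")
```

Under these assumptions continuous operation comes to about 12.5 working days, and any realistic overhead stretches that quickly. Note that this covers scanning only; per page processing time would be added on top.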

[Table: Service Bureau Costs]

Wired has done a review of the 1DollarScan service.

Only 1DollarScan has reasonable pricing for this volume of books. However, as Wired reported there were some quality problems. Once the book is sent to them they destroy it. Thus, it is not possible to do another scan in-house to improve the quality and one will never get the book back.
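For readers who want to run their own top level comparison, here is a minimal Python sketch of a per-book cost estimate. It assumes a bureau that charges per block of pages; the price, the block size and the page count below are made-up placeholders, not quotes from any vendor, so substitute the real figures from the service being considered.

```python
import math


def bureau_cost(books: int,
                pages_per_book: int,
                price_per_block: float,
                pages_per_block: int = 100) -> float:
    """Total cost when a bureau bills per started block of pages.

    All pricing inputs are hypothetical; a partially filled block is
    billed as a full one.
    """
    blocks_per_book = math.ceil(pages_per_book / pages_per_block)
    return books * blocks_per_book * price_per_block


# Hypothetical: 500 books of 300 pages, at 1.00 per block of 100 pages.
print(f"estimated cost: {bureau_cost(500, 300, 1.00):.2f}")

# A 301-page book already spills into a fourth block.
print(f"one 301-page book: {bureau_cost(1, 301, 1.00):.2f}")
```

The rounding up per book is the design choice worth noticing: with block pricing, many short books can cost more than the raw page count suggests.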

There are copyright issues which center on fair use by the owner of the book. Some of these are discussed in the links provided. We leave it up to the reader to explore this area.

Thus, based on this analysis, in-house destructive scanning using the ScanSnap iX500 provides the best quality and time performance. Of course, it is effectively free given that all the resources are already in place. It is not clear what the best means of accomplishing non-destructive scanning is, given the current state of the art and the tools available to me as outlined here.

About the Author

About the Author: With a background in Electrical Engineering, technology is both a career and a passion. As can be seen here, visiting the world, asking questions and learning are other passions. Attempting to pass one's experiences and that learning on to others is a motivator for this blog.
