Creating a readable book from Internet Archive DJVU.txt files

Many books on the Internet Archive have a TXT representation that appears to be generated when the book is uploaded to IA (I assume it’s generated, since some of the errors in the scans are evident in the txt-only files).  A lot of them don’t have the original formatted scans… just the TXT representation.

Obviously there are other formats already ready to go but I wondered how hard it would be to reconstruct a pure text representation of the original book.

TXTBOOK

To find the page breaks, I downloaded the TXT file, opened it in Notepad++ (free), and looked at what immediately preceded each page in the text.

After turning on symbols, I quickly found that each page was immediately preceded by four (4) LineFeeds. Because other formats were available for this book, I could check against the original that this was actually true: each page break had at least 4 LFs before the page number. Some had 6 or 7 LFs, which is fine for the method that follows.  YMMV and you’ll have to look at the text that is loaded up on IA to really know in your case (and modify the next step accordingly).
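If you would rather check the LF pattern with a script than by eye, here is a minimal Python sketch. The function name `lf_run_lengths` and the sample text are made up for illustration; the idea is just to tally how long each run of consecutive line feeds is, so you can see whether 4 (or 6 or 7) shows up about once per page in your file.

```python
import re
from collections import Counter


def lf_run_lengths(text: str) -> Counter:
    """Count how often each run length of consecutive line feeds occurs."""
    # Normalize Windows (CRLF) and bare CR line endings to plain LF first,
    # so the counts do not depend on how the file was saved.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return Counter(len(run) for run in re.findall(r"\n+", text))


# Hypothetical sample: a 4-LF break before page 2 and a 6-LF break before page 3.
sample = "page one text\n\n\n\n2\n\nmore text\n\n\n\n\n\n3\n\nlast page\n"
print(sorted(lf_run_lengths(sample).items()))
# → [(1, 1), (2, 2), (4, 1), (6, 1)]
```

On a real file, a count for runs of 4 or more that lands close to the book's page count is a good sign that those runs really are the page breaks.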

In Notepad++, click anywhere in the document, hit CTRL-A to select all the text and then CTRL-C to copy it.

Go over to Word (mine is Office 365, but this would work in older versions too), open a blank document and paste all the plain text (put your cursor in the document, then CTRL-V).   After a couple of minutes, Word updates the page count.  In my case, it shows more than 600 pages for a book that originally had 387 pages.

This is where we start doing some light reformatting to get it back to original shape:

Inserting Page Breaks

Word can insert special characters into the document which tell Word how to format it.  One is the page break (CTRL-ENTER for those who like keyboard shortcuts).  Of course we need 387 or so all at once… the REPLACE function works nicely.

In the replace dialog, search for 4 consecutive line feeds and replace them with a page break.  Both are selectable from the “SPECIAL” button at the bottom of the REPLACE dialog (Word’s list labels its line-ending entries Paragraph Mark and Manual Line Break, and the page break Manual Page Break; pick whichever line ending matches how your pasted text shows up). Select the line feed 4 times in the “FIND WHAT” field and put a Page Break in the Replace field.   Click REPLACE ALL.  In my case Word reports 387 replacements were made.  (That’s a very promising number!)
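The same find-and-replace can be sketched outside of Word in a few lines of Python. This is only an illustration under my own assumptions: `insert_page_breaks` is a made-up name, and I use the form feed character as a plain-text stand-in for a page break, which is not necessarily what Word stores internally. Like the REPLACE ALL dialog, it also reports how many replacements were made.

```python
PAGE_BREAK = "\f"  # form feed: an assumed plain-text stand-in for a page break


def insert_page_breaks(text: str, lfs_per_break: int = 4):
    """Replace every run of exactly `lfs_per_break` LFs with a page break.

    Returns the new text and the number of replacements made.
    """
    marker = "\n" * lfs_per_break
    # str.count and str.replace both scan left to right without overlapping,
    # so the count matches the number of replacements actually performed.
    replacements = text.count(marker)
    return text.replace(marker, PAGE_BREAK), replacements


# Hypothetical three-page sample with a 4-LF gap before each page number.
sample = "p1\n\n\n\n2\ntext\n\n\n\n3\ntext"
new_text, count = insert_page_breaks(sample)
print(count)  # → 2
```

If the count comes out close to the number of pages in the original book, as the 387 did for me in Word, the 4-LF assumption is probably right for your file.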

Uh oh.  My page count went UP not down.  Right.  My defaults in Word are double spacing with some extra space after each Line Feed, so there are lots of automatic page breaks where a single page of the original book’s text is now multiple pages in my Word doc.   You may be able to skip the next step depending on your defaults in Word (or you could set up a blank page template which already has the settings you want if you’re doing a few of these).

Getting book-style Line spacing

In Word, Select All the text (CTRL-A will work here too)

Select the Layout tab (this is Office 365, your version will have this somewhere but may be a different tab) and find the Line Spacing section.

Make two edits: 0 spacing after lines and Line Spacing of SINGLE.

My page count drops to around 450 (a little less… Word keeps updating but will eventually get it down to a much lower number)

Page Size

Printed books have really tight margins.  One reason I still have extra pages is that the text which fit on one page in the original book is spilling over and creating a 2nd page just due to the amount of text on the page.  Then my inserted page break (the place in the original text where the page ended) leaves only 1 to 5 sentences on that page of spilled-over text.  In my case, just setting the document margins to the minimum works (0.5 inch on all four sides).  In Office 365, the Layout menu also has a Margins setting with a predefined NARROW option that gives 0.5 inch on each side.

This dialed back my page count to about 20 over the total of 387 in the original. There are still some imperfections in the document (for instance, the places where there were more than four LFs for a page break leave extra blank lines behind) *but*… it’s pretty close.
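To see why the runs of more than four LFs leave imperfections, and one way a script could avoid them, here is a small Python comparison. Again this is a sketch with made-up names (`collapse_page_breaks`) and the form feed as an assumed page-break marker. A literal replace of exactly four LFs leaves the extra LFs sitting after the break, while a regular expression for "four or more" swallows the whole run.

```python
import re

PAGE_BREAK = "\f"  # form feed: an assumed plain-text stand-in for a page break


def collapse_page_breaks(text: str, min_lfs: int = 4):
    """Replace each run of `min_lfs` or more LFs with a single page break.

    Returns the new text and the number of breaks inserted.
    """
    pattern = re.compile(r"\n{%d,}" % min_lfs)
    return pattern.subn(PAGE_BREAK, text)


# Hypothetical page boundary with six LFs instead of four.
six = "end of page" + "\n" * 6 + "12"

# Literal replacement: one break, but two stray LFs are left before "12".
literal = six.replace("\n" * 4, PAGE_BREAK)

# Regex replacement: the whole six-LF run becomes one break.
collapsed, breaks = collapse_page_breaks(six)

print(literal == "end of page\f\n\n12")  # → True
print(collapsed == "end of page\f12")    # → True
```

A run of eight LFs would be worse for the literal approach, since it matches twice and produces two page breaks in a row.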

Reformatting the Index pages as multiple Columns

The book I picked had a 2 column Index at the back of the book which was also scanned in and appears in the text version.  To emulate this 2 column layout in Word, I need what is called a Section Break, Next Page.  You may have to mess around to get this right.  In my case, I just found the Index page (it says Index at the top of that page in the original… easy!), put my cursor just to the left of the I in INDEX and hit backspace.  This deletes the Page Break for that page.

Now on the Layout menu (Office 365, YMMV by version, but all supported versions of Word have this function), select the BREAKS icon and choose Section Break, Next Page.

I had to do this Section Break so that my next step will work… it creates a Section in the Word document so that my next formatting action will only apply to that section.  In the original, the entire book is a single column text layout until you get to the Index, which is two columns.  We’re now ready to insert the Two Columns formatting.

Click somewhere on that first Index page.  In Office 365, the Layout menu has a COLUMNS entry.  I just click two (2) columns in the dropdown and presto, the rest of the book from the Index page onwards is two columns.

NOTE: if I had additional text after the Index and I didn’t want it in columns, I could insert another Section Break, Next Page after the last Index entry, then move my cursor into that new section and set the Columns setting back to one (1) to get the original layout back after the Index.

 

That’s it. Maybe others know of formatting options which would easily take us even closer to book layout form.  Obviously, there are no pictures or graphics of any kind in this one.   This was “close enough” for me… I saved it from Word as a PDF alongside the Word document.  I can now compare against the original pages, read the text in a PDF, mark it up electronically and preserve page references. About 98% of the text matches the original in these scans and the rest is still decipherable.  And… I can search it.

Suggestions?
