Creating a readable book from Internet Archive DJVU.txt files

Many books on the Internet Archive (IA) have a plain-text representation that appears to be generated when the book is uploaded. (I assume it’s generated automatically, because some of the scanning errors are evident in the txt-only files.) A lot of them don’t have the original formatted scans, just the txt representation.

Obviously there are other formats that are ready to go, but I wondered how hard it would be to reconstruct a pure text representation of the original book.


To find the page breaks, I downloaded the TXT file, opened it in Notepad++ (a free text editor), and looked at what immediately preceded each page in the text.

After turning on the display of symbols, I quickly found that each page was immediately preceded by four (4) line feeds (LFs). Because other formats were available, I could verify against the original book that this was actually true: each page break was marked by at least 4 LFs before the page number. Some had 6 or 7 LFs, which is fine for the method that follows. YMMV: you’ll have to look at the text that is loaded up on IA to really know in your case (and modify the next step accordingly).
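The same inspection can be scripted instead of done by eye. The following is a minimal Python sketch, not part of the workflow I actually used: it counts how often each run length of consecutive line feeds occurs, so a page-break pattern such as "4 or more LFs" stands out. The function name `newline_run_lengths` and the sample string are made up for illustration; in practice you would read the downloaded TXT file instead.

```python
import re
from collections import Counter

def newline_run_lengths(text: str) -> Counter:
    """Count how often each run length of consecutive line feeds occurs."""
    # Normalise Windows (CRLF) and old Mac (CR) line endings to plain LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return Counter(len(run) for run in re.findall(r"\n+", text))

# Hypothetical sample: one page break of 4 LFs, one of 6 LFs,
# and ordinary single line feeds in between.
sample = "page one\n\n\n\n2\nsecond page text\n\n\n\n\n\n3\nthird page\n"

counts = newline_run_lengths(sample)
for length, occurrences in sorted(counts.items()):
    print(f"{length} consecutive LF: {occurrences} time(s)")
```

If the number of runs of 4 or more LFs is close to the page count of the original book, that is a good sign the pattern really marks page breaks.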

In Notepad++, click anywhere in the document, hit CTRL-A to select all the text, and then CTRL-C to copy.

Go over to Word (mine is Office 365; this would work in old versions too), open a blank document, and paste all the plain text (CTRL-V after you put your cursor in the document). After a couple of minutes, Word updates the page count. In my case, it showed more than 600 pages for a book that originally had 387 pages.

This is where we start doing some light reformatting to get it back to original shape:

Inserting Page Breaks

Word can insert special characters in the document that tell Word how to format it. One is the page break (CTRL-ENTER for those who like keyboard shortcuts). Of course, we need 387 or so all at once, and the REPLACE function works nicely for that.

In the REPLACE dialog, search for 4 consecutive line feeds and replace them with a page break. Both are selectable from the “SPECIAL” button at the bottom of the REPLACE dialog: select the line feed entry 4 times in the “FIND WHAT” field and put a Manual Page Break in the “REPLACE WITH” field. (Pasted line feeds normally show up in Word as paragraph marks, in which case the two fields read ^p^p^p^p and ^m.) Click REPLACE ALL. In my case, Word reported that 387 replacements were made. (That’s a very promising number!)

Uh oh. My page count went UP, not down. Right. My defaults in Word are double spacing with some extra space after each line feed, so there are lots of automatic page breaks where a single page of the original book’s text is now multiple pages in my Word doc. You may be able to skip this step depending on your defaults in Word (or you could set up a blank page template that already has the settings you want if you’re doing a few of these).

Getting book-style Line spacing

In Word, select all the text (CTRL-A will work here too).

Select the Layout tab (this is Office 365, your version will have this somewhere but may be a different tab) and find the Line Spacing section.

Make two edits: set the spacing after each paragraph to 0, and set the line spacing to SINGLE.

My page count drops to around 450 (a little less… Word keeps updating but will eventually get it down to a much lower number)

Page Size

Printed books have really tight margins. One reason I still have extra pages is that text which fit on one page in the original book spills over and creates a second page, simply because of the amount of text on the page. Then my inserted page break (the place where the page ended in the original text) falls anywhere from 1 to 5 sentences into that spilled-over page. In my case, just setting the document margins to the minimum works (0.5 inch on all four sides). In Office 365, the Layout menu also has a Margins setting with a predefined NARROW option that gives this 0.5 inch on each side.

This dialed back my page count to about 20 over the total of 387 in the original. There are still some imperfections in the document (for instance some of those places where there were more than four LFs for a page break) *but*…it’s pretty close.
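For the places where a page break is marked by more than four LFs, a script can treat any run of four or more line feeds as a single break, which the fixed four-LF replace in Word does not do. Below is a minimal Python sketch under that assumption; `split_pages`, `join_with_form_feeds`, and the sample text are hypothetical names and data for illustration. The form feed character is a common page-break marker in plain text, but whether Word turns it into a page break when the file is opened or pasted is something I have not verified.

```python
import re

def split_pages(text: str, min_breaks: int = 4) -> list[str]:
    """Split text into pages wherever min_breaks or more line feeds occur in a row."""
    # Normalise line endings so CRLF files behave the same as LF files.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    chunks = re.split(r"\n{%d,}" % min_breaks, text)
    # Trim leftover line feeds and drop chunks that contain only whitespace.
    return [chunk.strip("\n") for chunk in chunks if chunk.strip()]

def join_with_form_feeds(pages: list[str]) -> str:
    """Join pages with a form feed (assumed here to act as a page-break marker)."""
    return "\f".join(pages)

# Hypothetical sample: the first break has 4 LFs, the second has 7.
sample = "first page\n\n\n\n2\nsecond page\n\n\n\n\n\n\n3\nthird page\n"

pages = split_pages(sample)
print(f"{len(pages)} pages found")
rebuilt = join_with_form_feeds(pages)
```

Comparing `len(pages)` against the page count of the original book gives a quick check on whether four is the right threshold for a particular text.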

Reformatting the Index pages as multiple Columns

The book I picked had a 2-column Index at the back, which was also scanned in and appears in the text version. To emulate this 2-column layout in Word, I need what is called a Section Break, Next Page. You may have to mess around to get this right. In my case, I just found the Index page (it says Index at the top of that page in the original… easy!), put my cursor just to the left of the I in INDEX, and then hit backspace. This deletes the page break for that page.

Now, on the Layout menu (Office 365; YMMV by version, but all supported versions of Word have this function), select the BREAKS icon and choose the Next Page entry under Section Breaks.

I had to insert this Section Break so that my next step would work: it creates a section in the Word document, and my next formatting action will apply only to that section. As in the original text, the entire book is a single-column layout. When you get to the Index, it’s two columns. We’re now ready to apply the two-column formatting.

Click somewhere on that first Index page.  In Office 365, the Layout menu has a COLUMNS entry.  I just click two (2) columns in the dropdown and presto, the rest of the book from the Index page onwards is two columns.

NOTE: if I had additional text after the Index and I didn’t want it in columns, I could set another Word Section Break, Next Page after the last Index entry and then move my cursor into that new section and set the Columns setting back to one ( 1) to get the original layout back after the Index.


That’s it. Maybe others know of formatting options that would easily take us even closer to book layout. Obviously, there are no pictures or graphics of any kind in this one. This was “close enough” for me… I saved it from Word as a PDF alongside the Word document. I can now look at the original pages and read the text in a PDF, mark them up electronically, and preserve page references. About 98% of the text matches the original scans, and the rest is still decipherable. And… I can search it.



Informal Fallacies

Informal fallacies are a broad category of mistakes in the content or reasoning of an argument. Carelessness, ambiguity, and irrelevance are the root causes of informal fallacies. Formal fallacies, by contrast, relate to the structure and form of an argument.


Accent: Ambiguity of argument that results from improper tone of voice and emphasis on a given proposition
Ambiguity: Various informal fallacies that make communication unclear or ambiguous
Ab Annis: Appeal to age as a basis of truth
Ad Baculum: Appeal to force or fear as determinative of truth
Ad Futuris: Appeal to future possibilities as determinative of truth
Ad Hominem: Appeal to the person (abusive or circumstantial) as determinative of truth
Ad Ignorantiam: Appeal to one’s lack of knowledge or proof concerning an issue as determinative of truth
Ad Misericordiam: Appeal to pity or misery of an individual as determinative of truth
Ad Populum: Appeal to what is popular or in vogue as determinative of truth
Amphibole: Appeal to ambiguous propositions that cloud the meaning of a truth statement due to awkward wording
Analogy: An attempt to use similarity that is irrelevant to the argument
Argument of the Beard: Appealing to imperceptible differences among extremes to indicate there are no real differences among them
Category Mistake: Confusing things in one category with things from another category: “What does the color red taste like?”
Composition: Assuming that what is true of the parts is equally true of the whole
Dicto Simpliciter: Attempting to apply a general rule to a specific case when differences exist that militate against its application
Division: Assuming that what is true of a whole is also true of its parts
Equivocation: Change in meaning of a word in the midst of an argument though the overall context remains the same: “Some horses have short tails. My horse has a short tail. Therefore, my horse is some horse.”
False Cause: Assigning as the real cause of a given effect something that is not its cause, as in crediting one event as the cause of another simply because the first occurs prior to the second
False Dilemma: Occurring when one is given only two alternatives to choose from when at least one additional alternative exists: “Either we allow abortion or we force the children to be raised by parents who do not want them.”
Hasty Generalization: Reaching a conclusion after analyzing only unusual cases instead of reasoning from analysis of typical cases
Petitio Principii: Begging the question; assuming in the premises the very conclusion the argument is meant to prove
Non Sequitur: Occurring when the conclusion does not follow from the premises
Relevance: Incorporating irrelevant material into an argument
Slippery Slope: Occurring when an individual claims that accepting the conclusion of an argument will lead to a series of undesirable consequences and justifications
Straw Man: Occurring when one interprets an opposing or alternative viewpoint in its weakest form, or inaccurately, and then refutes it as if the strength of the position were being addressed
