Channel: miriam english

text unwrap revisited using sed again

Every now and then I look again at the problem of unwrapping text in which the lines within a paragraph are broken by line ends, while the paragraphs themselves are separated by a blank line. Here is an example from the beginning of Lewis Carroll's book, Alice in Wonderland, at Project Gutenberg:

            ALICE was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no pictures or
conversations in it, "and what is the use of a book," thought Alice,
"without pictures or conversations?"

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid) whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.


Strangely, there's no purpose-built tool that I could find to perform this deceptively simple-looking task.

I first approached this in 2010 and wrote a quite convoluted sed command to do it. It was made ridiculously difficult by sed's inability to see newline characters on a line unless the next line was read in after the current one, which meant setting up a loop to read in lines until a blank line is found. This is an absurdly wasteful way to use sed as it already loops implicitly over the whole file anyway -- this is part of its beauty.
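For readers who have not seen such a loop, here is a minimal sketch of the general idea. It is not the 2010 command, which is not reproduced in this post. The function name unwrap_loop is made up for illustration, and the sketch assumes GNU sed, which prints the pattern space when N runs out of input.

```shell
#!/bin/sh
# Hypothetical helper (not the original 2010 command): join the lines of each
# paragraph using an explicit loop inside sed.
# Assumes GNU sed, which prints the pattern space when N reaches end of input.
unwrap_loop() {
  # /./{ ... }   only start the loop on a non-blank line
  # :a           loop label
  # N            append the next input line to the pattern space
  # s/\n\(.\)/.. if the appended line is non-empty, turn the newline into a space
  # ta           loop again only if that substitution succeeded
  sed -e '/./{' -e ':a' -e 'N' -e 's/\n\(.\)/ \1/' -e 'ta' -e '}'
}

# Two wrapped paragraphs separated by one blank line.
printf 'one\ntwo\n\nthree\nfour\n' | unwrap_loop
```

The loop stops as soon as N pulls in an empty line, so that blank line is printed along with the joined paragraph and the next cycle starts on fresh input.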

Later, I found unwrapping text could be much simplified using awk because I could tell awk to read a whole paragraph as a single record by setting its record separator to a blank line (RS=""). Unfortunately something about this short script inserted annoying spurious blank lines at the beginning and end of the file. This became even more annoying when I used it as a kind of macro for a text editor because it now inserted the blank lines above and below the selection.
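As an illustration of paragraph mode, here is a rough sketch of an awk-based unwrap. The original awk script is not shown in this post, so this is only an assumption about its general shape, and the name unwrap_awk is invented for the example.

```shell
#!/bin/sh
# Hypothetical sketch of paragraph-mode unwrapping; not the author's script.
unwrap_awk() {
  awk 'BEGIN { RS = "" }                  # empty RS: each paragraph is one record
       NR > 1 { print "" }                # put a blank line between paragraphs
       { gsub(/[ \t]*\n[ \t]*/, " ")      # join the wrapped lines with spaces
         print }'
}

# Note the two blank lines between the paragraphs in the input.
printf 'one\ntwo\n\n\nthree\n' | unwrap_awk
```

In paragraph mode awk treats any run of blank lines as a single separator, so a sketch like this cannot tell a double blank line from a single one.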

Just a few days ago I realised a very simple way to do the job using tr to translate all the newline characters to something exceedingly unlikely to be found in a file, such as ASCII character 1 (\x01). Then I could convert pairs of character 1 found together back to two newlines, preserving the paragraph separators, then convert each remaining character 1 to a space. Add a tiny bit of extra pattern matching and any spaces and/or tabs before or after the single newlines are removed too.
tr "\n" "\001" | sed 's/\x01\x01/\n\n/g ; s/[ \t]*\x01[ \t]*/ /g'
This worked really well, but had a small bug. It collapsed triple newlines (double blank lines) down to doubles (single blank line). Double blank lines are often used as section breaks in texts, so losing them was a Bad Thing.
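The collapse is easy to reproduce. The sketch below wraps the pipeline in a function (the name unwrap_tr is made up here) and feeds it a section break. It assumes GNU sed for the \x01 and \n escapes, and it writes character 1 as the octal escape \001 for tr, since octal is the form tr documents.

```shell
#!/bin/sh
# Hypothetical wrapper around the tr + sed pipeline, to show the collapse.
# Assumes GNU sed (\x01 and \n escapes); tr gets the octal form \001.
unwrap_tr() {
  tr '\n' '\001' | sed 's/\x01\x01/\n\n/g; s/[ \t]*\x01[ \t]*/ /g'
}

# Input has TWO blank lines (three newlines in a row) between "b" and "c".
# Counting the empty lines in the output shows how many survive.
printf 'a\nb\n\n\nc\n' | unwrap_tr | grep -c '^$'
```

Of the three character 1s in a row, the first two become the paragraph break and the odd one left over is turned into a space, which is why only one blank line comes out the other side.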

Yesterday I happened to look online to find the latest version of sed (v4.2.2) at ftp://ftp.gnu.org/gnu/sed/
It has some nice improvements, best of which (to my mind) is the ability to use the -z option to change how sed defines a record. Instead of being stuck with reading in records only terminated by newline characters, now with the -z option it can use the zero byte character as a record terminator. This is great! Now I can read a whole file in as a single record and manipulate the newline characters. In the example below I also use the -r option to force extended regular expressions, so I don't have to escape parentheses with backslashes, making it much easier to read. Unfortunately, limitations of regular expressions (regex) still make this more difficult than it need be, but life suddenly becomes much simpler:
sed -z -r 's/[ \t]*\n[ \t]*/\n/g; s/([^\n])\n([^\n])/\1 \2/g'
First I get rid of the spaces and tabs on either side of newline characters, then comes the incredibly simple second command, which replaces a newline with a space if it doesn't have another newline on either side.
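To try this out, the command can be wrapped in a small shell function and fed some sample text. The function name unwrap is only for illustration, and the -z option needs GNU sed 4.2.2 or later.

```shell
#!/bin/sh
# Illustrative wrapper around the sed -z command above (GNU sed 4.2.2 or later).
unwrap() {
  sed -z -r 's/[ \t]*\n[ \t]*/\n/g; s/([^\n])\n([^\n])/\1 \2/g'
}

# Sample input: stray spaces around a line break, then a double blank line
# (a section break) before the second paragraph.
printf 'one  \n  two\n\n\nthree\nfour\n' | unwrap
```

One caveat, as far as I can tell: because the second substitution consumes the character on each side of the newline, a line that is only one character long gets joined to the line before it but not to the line after it. Repeating the second substitution once more should cover that case.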

No spurious lines inserted and it preserves double blank line section breaks too. Yay!

How simple is that!

(Crossposted from http://miriam-e.dreamwidth.org/324048.html at my Dreamwidth account.)
