text unwrap revisited using sed again

Every now and then I look again at the problem of unwrapping text in which the lines within a paragraph are broken by line ends, while the paragraphs themselves are separated by a blank line. Here is an example from the beginning of Lewis Carroll's book, Alice in Wonderland, at Project Gutenberg:

            ALICE was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no pictures or
conversations in it, "and what is the use of a book," thought Alice,
"without pictures or conversations?"

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid) whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

Strangely, there's no purpose-built tool that I could find to perform this deceptively simple-looking task.

I first approached this in 2010 and wrote a quite convoluted sed command to do it. It was made ridiculously difficult by sed's inability to see newline characters on a line unless the next line was read in after the current one, which meant setting up a loop to read in lines until a blank line is found. This is an absurdly wasteful way to use sed as it already loops implicitly over the whole file anyway -- this is part of its beauty.
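
The 2010 command is not reproduced in this post, so what follows is only a rough sketch of that kind of explicit loop, not the original. The function name unwrap_loop is made up for illustration, and trailing spaces or tabs are not tidied up here.

```shell
# Hypothetical reconstruction of the "loop until a blank line" idea.
unwrap_loop() {
    sed ':a
$!{
  /^$/!{
    N
    /\n$/!ba
  }
}
s/\n\(.\)/ \1/g'
}
# How the script works:
#   :a / ba     loop label and the branch back to it
#   $!{ ... }   never try to read past the last line
#   /^$/!{ }    a blank line on its own is printed untouched
#   N           append the next input line (this is when \n becomes visible)
#   /\n$/!ba    keep looping until the appended line is blank
#   s/.../      turn every newline that is followed by text into a space

printf 'one\ntwo\n\nthree\nfour\n' | unwrap_loop
```

The explicit label and branch are exactly the bookkeeping that sed's implicit line loop is normally supposed to spare you.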

Later, I found unwrapping text could be much simplified using awk because I could tell awk to read a whole paragraph as a single record by setting its record separator to a blank line (RS=""). Unfortunately something about this short script inserted annoying spurious blank lines at the beginning and end of the file. This became even more annoying when I used it as a kind of macro for a text editor because it now inserted the blank lines above and below the selection.
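
The awk script is not shown here either, so this is a minimal sketch of the paragraph-mode idea under my own assumptions, not the script described above. The name unwrap_awk is hypothetical.

```shell
# Hypothetical helper: unwrap paragraphs using awk paragraph mode.
unwrap_awk() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }   # RS="" makes each blank-line-separated paragraph one record
         { gsub(/[ \t]*\n[ \t]*/, " ")     # join the lines inside the paragraph
           print }'
}

printf 'one\ntwo\n\nthree\nfour\n' | unwrap_awk
```

Because every record is printed with a blank line after it, this sketch leaves an extra blank line at the end of the output. Paragraph mode also treats any run of blank lines as a single separator, so double blank lines do not survive it.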

Just a few days ago I realised a very simple way to do the job using tr to translate all the newline characters to something exceedingly unlikely to be found in a file, such as ASCII character 1 (\x01). Then I could convert pairs of character 1 found together back to two newlines, preserving the paragraph separators, then convert all the remaining single character 1s to spaces. Add a tiny bit of extra pattern matching and any spaces and/or tabs before or after the single line breaks are removed too.
tr '\n' '\001' | sed 's/\x01\x01/\n\n/g ; s/[ \t]*\x01[ \t]*/ /g'
This worked really well, but had a small bug. It collapsed triple newlines (double blank lines) down to doubles (single blank line). Double blank lines are often used as section breaks in texts, so losing them was a Bad Thing.
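
To see the bug in action, here is a small sketch that wraps the pipeline in a function and feeds it a double blank line. The name unwrap_tr is made up, and GNU sed is assumed for the \x01, \n and \t escapes. Note that tr wants the octal form \001 rather than \x01.

```shell
# Hypothetical wrapper around the tr + sed pipeline (GNU sed assumed).
unwrap_tr() {
    tr '\n' '\001' |
        sed 's/\x01\x01/\n\n/g ; s/[ \t]*\x01[ \t]*/ /g'
}

# Two paragraphs separated by a double blank line (three newlines in a row).
# od -c makes the surviving newlines and spaces visible.
printf 'a\nb\n\n\nc\nd\n' | unwrap_tr | od -c
```

With GNU sed the three newlines should come out as two newlines followed by a stray space: the first pair of character 1s is restored, and the odd one left over is treated as an ordinary line break. The final newline of the input turns into a trailing space in the same way.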

Yesterday I happened to look online to find the latest version of sed (v4.2.2) at ftp://ftp.gnu.org/gnu/sed/
It has some nice improvements, the best of which (to my mind) is the -z option, which changes how sed defines a record. Instead of being stuck with reading in records terminated only by newline characters, with -z it uses the zero byte (NUL) character as the record terminator. This is great! A text file normally contains no zero bytes, so now I can read a whole file in as a single record and manipulate the newline characters. In the example below I also use the -r option to force extended regular expressions, so I don't have to escape parentheses with backslashes, making it much easier to read. Unfortunately, limitations of regular expressions (regex) still make this more difficult than it need be, but life suddenly becomes much simpler:
sed -z -r 's/[ \t]*\n[ \t]*/\n/g; s/([^\n])\n([^\n])/\1 \2/g'
First I get rid of the spaces and tabs on either side of newline characters, then comes the incredibly simple command (the second one) to replace a newline with a space if it doesn't have another newline on either side.
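
As a quick check, the command can be wrapped in a shell function and fed a sample with a double blank line. This is only a usage sketch: the name unwrap is hypothetical, and it needs a GNU sed that has the -z option.

```shell
# Hypothetical wrapper; needs GNU sed 4.2.2 or later for -z.
unwrap() {
    sed -z -r 's/[ \t]*\n[ \t]*/\n/g; s/([^\n])\n([^\n])/\1 \2/g'
}

# One wrapped paragraph (with a trailing space on its first line),
# a double blank line, then another wrapped paragraph.
printf 'a line \nwrapped here\n\n\nnext section\ncontinues\n' | unwrap
```

One caveat: the second command consumes the character on each side of the newline, so consecutive lines that are only one character long are not all joined (a, b, c on separate lines becomes "a b" followed by c on its own line). For ordinary prose this should rarely matter.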

No spurious lines inserted and it preserves double blank line section breaks too. Yay!

How simple is that!

(Crossposted from http://miriam-e.dreamwidth.org/324048.html at my Dreamwidth account.)
