Thursday, July 10, 2014

Parsing Text from converted PDF file

Most lines of the table came out fine, then all of a sudden you get this:

5      112   1.485791479630   0.626697417315   0.795310854254   0.428214564624   1.022558054845   1.001054969661   0.694652941279
             1.489646364222   0.625959518299   0.796389500449   0.427816472287   1.022558054845   1.001054969661   0.69071376525
5      113
             1.493491298915   0.625146823391   0.797466687679   0.427451228618   1.022558054845   1.001054969661   0.686751994675  5      114
             1.497273891632   0.624481104524   0.798605655202   0.427085672593   1.022558054845   1.001054969661   0.682920926101  5      115
I am parsing the data out of this file to merge data from separate pages. The columns here go with 7 additional columns from many pages down.  

I can see solving this problem in one of two ways:
1. Unpivot the data while keeping the column index and grouping, then re-pivot in SQL.
2. Write a program to parse the keys, store the rows in a dictionary, and put the pieces together.
Either way, it is going to be a lot easier once the lines are put back onto the rows they belong to.
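
A rough sketch of what option 2 might look like, assuming every repaired row is a group number, a row index, and then the factor values, all separated by whitespace. The names `parseRow` and `mergeRows` are hypothetical, and the rows in the usage line are placeholder tokens rather than values from the real file.

```fsharp
open System
open System.Collections.Generic

// Key each row by its two leading integers (group, row index).
// Factor values stay as strings so no digits are lost to float parsing.
let parseRow (line: string) =
    let parts = line.Split([| ' '; '\t' |], StringSplitOptions.RemoveEmptyEntries)
    if parts.Length < 3 then None
    else
        match Int32.TryParse(parts.[0]), Int32.TryParse(parts.[1]) with
        | (true, grp), (true, idx) -> Some ((grp, idx), List.ofArray parts.[2..])
        | _ -> None

// Append the factors of each row to whatever is already stored under its key,
// so columns from a later page land after the columns from an earlier page.
let mergeRows (lines: seq<string>) =
    let table = Dictionary<int * int, string list>()
    for line in lines do
        match parseRow line with
        | None -> () // skip headers, blank lines, anything that is not a data row
        | Some (key, factors) ->
            match table.TryGetValue(key) with
            | true, existing -> table.[key] <- existing @ factors
            | _ -> table.[key] <- factors
    table

// Usage with placeholder tokens: the last row stands in for a later page.
let merged = mergeRows [ "5  112  a1  a2"; "header"; "5  113  b1  b2"; "5  112  a3  a4" ]
```

Rows are merged in the order they are read, so this only works if the pages are fed in the order their columns should appear.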

Here's the F# code I used to do this (F# seems nice since you get an interactive window):

open System.IO
open System.Text.RegularExpressions

// make a copy of the source file; the third argument allows overwrite if the copy already exists
File.Copy(@"<source filename>", @"<copy filename>", true)

// regex pattern to match factors followed by indexes on the next line
// (verbatim string, so the backslashes reach the regex engine untouched)
let pattern = @"\s+(?<factors>(\s+[0-9]+[.][0-9]+){7})\r\n(?<varindex>\s+[0-9]+\s+[0-9]+)(?=\r\n|$)"

let alltext = File.ReadAllText(@"<copy filename>")

// move each index pair in front of the factors that preceded it
let replacedtext = Regex.Replace(alltext, pattern, "\n${varindex}  ${factors}")

File.WriteAllText(@"<output filename>", replacedtext)

As I was testing the regex pattern, I was working in Visual Studio: I copied the text from above into the editor and bound it to a name like this:
let text = " <the lines above were here>"

Then I was doing stuff like ...

Regex.Replace(text,pattern,"\n${varindex}  ${factors}")
Regex.Match(text,pattern).Success

...to see if it was going to come out as expected. The original pattern used \n to match the new line. Once it was working I tried it on the file and it failed!?!
I could not understand why, and it took me a while to figure out that to match the newline in the file I had to use \r\n. The file uses Windows line endings, a carriage return followed by a line feed (\r\n), while the text bound in the let expression evidently contained only \n. I still do not completely understand why the two differ. (I assume text editors (VS2013 vs Notepad++) behave differently in regard to line endings?)
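
One way to sidestep the difference, as a sketch rather than what I actually ran, is to make the carriage return optional with `\r?\n` so one pattern handles both kinds of text. The name `tolerantPattern` and the helper `sample` are made up for illustration, and the leading space before the index pair is an assumption about the real file.

```fsharp
open System.Text.RegularExpressions

// The pattern from the post, except that each \r\n became \r?\n (carriage return optional).
let tolerantPattern =
    @"\s+(?<factors>(\s+[0-9]+[.][0-9]+){7})\r?\n(?<varindex>\s+[0-9]+\s+[0-9]+)(?=\r?\n|$)"

// The \r\n-only pattern, kept for comparison.
let strictPattern =
    @"\s+(?<factors>(\s+[0-9]+[.][0-9]+){7})\r\n(?<varindex>\s+[0-9]+\s+[0-9]+)(?=\r\n|$)"

// Build the same two-line sample with a chosen line ending.
// The leading space before the index pair is an assumption about the real file.
let sample (eol: string) =
    "             1.489646364222   0.625959518299   0.796389500449   0.427816472287   1.022558054845   1.001054969661   0.69071376525"
    + eol + " 5      113" + eol

let tolerantOnLf = Regex.IsMatch(sample "\n", tolerantPattern)     // true
let tolerantOnCrLf = Regex.IsMatch(sample "\r\n", tolerantPattern) // true
let strictOnLf = Regex.IsMatch(sample "\n", strictPattern)         // false: no \r to match
```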

The other thing which gave me grief was that the pattern didn't match the final line of the file. So I figured I had to add a positive lookahead as the last part of the pattern, accepting either a newline or the end of the string.
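
A minimal sketch of why the lookahead helps, using simplified patterns that only cover the index pair (the names `needsNewline` and `withLookahead` are made up for this example):

```fsharp
open System.Text.RegularExpressions

// Simplified index-row patterns: two integers, then the end of the row.
let needsNewline = @"[0-9]+\s+[0-9]+\r\n"         // requires a line break after the row
let withLookahead = @"[0-9]+\s+[0-9]+(?=\r\n|$)"  // a line break OR end of text, neither consumed

let lastRow = "5      115"  // final line of a file with no trailing line break

let plain = Regex.IsMatch(lastRow, needsNewline)   // false
let ahead = Regex.IsMatch(lastRow, withLookahead)  // true

// Because the lookahead consumes nothing, the line break is not part of the match,
// so a replacement leaves it in place.
let matched = Regex.Match("5 114\r\n5 115", withLookahead).Value  // "5 114"
```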

I am a complete beginner with no real training. Lessons learned:
1. If you are testing regex replacements, try to replicate the source you are matching against (if you are matching a file, put the test data in a file rather than copy-pasting it into the editor).
2. Keep in mind newline character differences (\n vs \r\n).
3. Think about end-of-file conditions: the text may not end with a newline.
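
For lesson 1, something like the following sketch would have caught my problem earlier. It writes two rows to a temporary file with explicit Windows line endings and reads them back the way the real run does. The file name is hypothetical and the patterns are cut down to the bare row ending.

```fsharp
open System
open System.IO
open System.Text.RegularExpressions

// Write the test rows to a real file with explicit \r\n line endings,
// then read them back the same way the real run reads its input.
let testPath = Path.Combine(Path.GetTempPath(), "regex-newline-test.txt") // hypothetical file name
File.WriteAllText(testPath, "5      114\r\n5      115")                     // no line break after the last row

let fromFile = File.ReadAllText(testPath)

// Count the row endings each pattern sees in the file contents.
let lfOnly = Regex.Matches(fromFile, @"[0-9]\n").Count   // 0: every \n is preceded by \r
let crLf = Regex.Matches(fromFile, @"[0-9]\r\n").Count   // 1

// Lesson 3: check whether the text ends with a line break at all.
let endsWithBreak = fromFile.EndsWith("\n", StringComparison.Ordinal)  // false

File.Delete(testPath)
```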