joke vest amount (jkndrkn) wrote in ruby_lang,
joke vest amount
jkndrkn
ruby_lang

Ruby Newbie: Code Critique Request

Hello Friends

I have played with Ruby tutorials and ploughed through a great deal of the Pickaxe book, but have not really written much of anything on my own.

Finally, an opportunity came up for using Ruby for a practical application: a quick and dirty text-processing script. The script is designed to insert onclick event handlers into HTML anchor tags that point to PDF and other files. Our customer purchased a web analytics package and wanted to be able to track clicks to non-HTML files on their server.

The script uses regular expressions, string manipulation, and basic File IO.

If you have some time, could you review the script? I'm interested in code that is compact, legible, and efficient. Please point out if my approach takes advantage of the correct Ruby concepts. I've already found a place where I might improve efficieny: write a file only if changes were made to the file.

I very much appreciate your time!


  1 require 'cgi'
  2 
  3 files = ARGV
  4 
  5 # Match url of form href="foo/bar/baz/quux.pdf"
  6 regex = /(href\s*=)\s*\"([^\"]+\/)([^\"]+\.(pdf|mdb|wmv|xls|doc|ppt))\"/i
  7 debug = false
  8 
  9 files.each do |infile|
 10     if debug: 5.times {puts ""} end
 11     if debug: puts ">>>" + infile.to_s end
 12 
 13     output = ""
 14 
 15     File.open(infile) do |file|
 16         while line = file.gets
 17             # If a line contains a url of the correct form but NO onclick handler
 18             if line =~ regex and !line.include? 'onclick'
 19                 if debug: puts "<" + line end
 20 
 21                 # Retrieve URL components
 22                 matches = regex.match(line)
 23                 href, dir, name = matches[1..3]
 24 
 25                 # Rebuild URL
 26 
 27                 # No filtering necessary, as URL is enclosed in double-quotes and is assumed to have been tested in the past
 28                 url = href + '"' + dir + name + '"'
 29 
 30                 # Build onclick handler from URL components
 31                 
 32                 # Improve legibility of name to be displayed in Webtrends interface 
 33                 # by stripping out %20 and other characters 
 34                 name_clean = CGI.unescape(name)
 35                 
 36                 # Encode single quotes as they can break the onclick handler
 37                 [href, dir, name, name_clean].each do |str|
 38                     str.gsub!("'", "%27")
 39                 end 
 40                 
 41                 onclick = "onclick=\"dcsMultiTrack(\'DCS.dcsuri\',\'#{dir}#{name}\',\'WT.ti\',\'#{name_clean}\');\""
 42                 
 43                 # Substitute url link with both the original link and the onclick event handler
 44                 line.gsub!(regex, url + " " + onclick)
 45                 
 46                 if debug: puts ">" + line + "\n" end
 47             end 
 48             output << line
 49         end 
 50     end 
 51     
 52     if debug: puts "<<<" + infile.to_s end
 53     
 54     # Write output file 
 55     leading_slash = (debug) ? '_' : ''
 56     
 57     File.open(File.dirname(infile) + '/' + leading_slash + File.basename(infile), "w") do |outfile|
 58         outfile.write(output)
 59     end 
 60 end 

  • Post a new comment

    Error

    default userpic
  • 23 comments
On like 41, I’d use %{} instead of "" as delimiters—less escaping necessary.

I’d also consider using %r{} instead of // as delimiters from the regex, since you’re escaping at least one slash in there.
Er, sorry—the second bit would be line 6.
For HTML manipulation you might want to consider using _why's library Hpricot.
Very nice, thanks. Sure beats having to build complicated Regular Expressions.
Yeah :-)

I only discovered it a few days ago when I needed to convert HTML into MediaWiki markup :P
Oh, ugh. Let me guess. You were converting legacy HTML documentation to Wiki format?

Were you able to automate conversion of HTML tables to MediaWiki tables?
Thankfully it was nothing more complex than converting one or two links to MediaWiki markup.
You're not doing anything that needs it now, but you might enjoy looking at the Mechanize library too.
Looks quite reasonable to me.

Personally, I prefer:

5.times { puts "" } if debug

or

if debug
  5.times { puts "" }
end

to:

if debug: 5.times {puts ""} end

I feel that if your block needs an end, it should go on separate lines.

(Additionally, there's good argument for making debug a constant.)

Ah, good observation.

Also, I was thinking of searching ARGV for the string -d. If found, I would remove it from ARGV and set debug to true.
what argument is that?
I would say that declaring debug as a constant helps to visually separate it from the application logic and gives a hint to the programmer that it is an editable configuration value.
and I would say that naming it "debug" did that in the first place.

Re:

Anonymous

May 20 2008, 15:25:54 UTC 8 years ago

And I would say that "DEBUG" as a constant gives a much clearer warning to the writer than a local var called "debug".
5.times { puts "" } if debug

Very nice, thanks.
Lines 10-11:
    puts "\n\n\n\n\n>>>#{infile}" if debug

Lines 16-18:
    file.each_line do |line|
        # If a line contains a url of the correct form but NO onclick handler
        if line =~ regex and !line.include? 'onclick'
            puts "<#{line}" if debug

Line 46:
            puts ">#{line}\n" if debug # I'm sure you get the idea by now.


As well, apollotiger's comments are about right.
Yeah, that usage of if is very handy. Thanks.

Deleted comment

Thanks! Regular Expressions are certainly one of my major weaknesses.
Sorry, I hit 'send' on that sooner than I meant to. The negative-lookahead is definitely a way to pre-filter for links that include an onclick handler, but you'll have to be a bit smarter than I was about how to apply it.

However, another feature you might want to take a look at is the form of String#gsub which accepts a block argument, which could greatly simplify the control flow of your code. The code would look something like the following:

regexp = %r{(
[Error: Irreparable invalid markup ('<a[\s\w="]*(?!onclick)href=['"])([^"]*)(['"][^>') in entry. Owner must fix manually. Raw contents below.]

Sorry, I hit 'send' on that sooner than I meant to. The negative-lookahead is definitely a way to pre-filter for links that include an onclick handler, but you'll have to be a bit smarter than I was about how to apply it.

However, another feature you might want to take a look at is the form of String#gsub which accepts a block argument, which could greatly simplify the control flow of your code. The code would look something like the following:

<pre>
regexp = %r{(<a[\s\w="]*(?!onclick)href=['"])([^"]*)(['"][^>]*(?!onclick)[^>]*>)}

Files.map |path| do
data = File.read(path)
data.gsub(regexp) do |matchdata|
raw_url = matchdata[2]
orig_url = URI.parse(raw_url)
filename = File.split(orig_url.path).last
onclick = " onclick='dcsMultiTrack(\""DCS.dcsuri\", \"#{orig_url.path}\", \"TW.ti\", \"#{filename}\");"
[matchdata[1], matchdata[2], onclick, matchdata[3]].join('')
end
end
</pre>

(Notice that it's no longer doing line-structured matching, since a single 'a' element could easily span multiple lines in the HTML file.)
I think all but one of these comments kinda misses the point that you're not using the right tool for the job. Whether you're using 5.times { puts } or puts "\n\n\n\n\n" doesn't matter one whit. Using regexps doesn't matter either. Your original code breaks if the HTML is all on one line. Even the last comment from rcoder misses a lot of edge cases (malformed html? no quotes? etc etc). The comment from robhu was dead on.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'cgi'

ARGV.each do |url|
  html = URI.parse(url).read
  doc = Hpricot(html)

  (doc / :a).each do |a|
    next if a[:onclick]
    href = a[:href]
    next unless href =~ /.(pdf|mdb|wmv|xls|doc|ppt)$/
    name_clean = CGI.unescape(File.basename href)
    a[:onclick] = "dcsMultiTrack(\'DCS.dcsuri\',\'#{href}\',\'WT.ti\',\'#{name_clean}\');"
  end

  puts doc.to_s
end


By going this route. It is MUCH more clear what you're trying to accomplish, making it easier to understand, debug, etc.
Thanks, I'll consider hpricot for future work with HTML. I'm fairly new to Ruby, so I've had little chance to be exposed to third-party libraries.

As for failing to handle malformed HTML, I was working on a code base written entirely in a fairly strong CMS system that outputs generally trustworthy HTML.

Thanks again.