Old School vs New School lite HTML Parsing

Gawk + sed VS Nokogiri

  • Etsy’s API can be queried to fetch recent listings on their website. One of the parameters that may be used in the query is category, which takes an item’s “category path” as its value. An example is /crochet/doll, where crochet is the category and doll is the subcategory.
  • Unfortunately, the only source I could find that lists the possible categories and subcategories is this HTML page. That list is what you would need when, say, creating a pair of dropdown menus to run a search with.
  • So, how should we get the pertinent information (category and subcategory) out of this HTML file, and get something like this and maybe even this?

Gawk & Sed

  • All the paths are listed in the HTML file, so let’s use a regex to create a JSON file to use in some JavaScript application
    • <a href="/listing-category/crochet/doll">Doll</a>
  • gawk 'match($0, /<a href="\/listing-category\/(.*)\/(.*)"/, ary) {print ary[1],ary[2]}'
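    • why not just awk? I want to capture two groups (category and subcategory) and then reformat them. gawk’s match() can store the captured groups in an array, which POSIX awk’s match() cannot do
    • match() tests the line against the regex
      • wrapping the whole program in single quotes keeps the shell from interpreting the $0, the double quotes, and the spaces inside it
      • remember that gawk uses POSIX extended regular expressions: ( and ) group without a backslash, but Perl-style extras such as non-greedy .*? are not available
    • $0 is the whole current line, so that is what we are searching
    • the 3rd argument to match() is an optional array (a gawk extension); ary[1] and ary[2] hold the text captured by the first and second groups
    • print runs only on lines where match() succeeds, so everything other than the category and subcategory pairs is dropped from the stream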
  • sed 's/\([a-z,_]*\) \([a-z,_,0-9]*\)/\    {"\1": "\2"},/'
    • now that gawk has eliminated all the superfluous lines, we use sed to format each line of the JSON file
    • the s (substitute) command is used here. sed defaults to basic regular expressions, which is why the groups are written \( and \)
    • notice that this leaves an extra comma on the last line of the file, so a second sed strips it:
      • sed '$ s/\(.*\),/\1/' >> etsy_cat.js
        • $ is an address that selects the last line, so the substitution only runs there
  • All of this is then inserted into a file between two square brackets ([ and ]), so the file holds a JSON array of single-pair objects
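
For comparison, here is a minimal sketch of the same extract-then-format idea in plain Ruby, using only the standard library. It is not the gawk and sed pipeline itself: the SAMPLE_HTML string, the LINK_PATTERN constant, and the category_pairs helper are all invented for illustration, and the character classes are an assumption that is stricter than the .* used above.

```ruby
require 'json'

# Hypothetical sample input, modeled on the anchor tag shown above.
SAMPLE_HTML = <<~HTML
  <div class="children">
    <a href="/listing-category/crochet/doll">Doll</a>
    <a href="/listing-category/crochet/hat">Hat</a>
    <a href="/about">About</a>
  </div>
HTML

# Two capture groups: category and subcategory. The allowed characters
# are an assumption about what a path segment may contain.
LINK_PATTERN = %r{<a href="/listing-category/([a-z0-9_]+)/([a-z0-9_]+)"}

# Returns an array of single-pair hashes, one per matching link.
def category_pairs(html)
  html.scan(LINK_PATTERN).map { |category, subcategory| { category => subcategory } }
end

# Serializing the finished array means there is no trailing comma to clean up.
puts JSON.pretty_generate(category_pairs(SAMPLE_HTML))
```

Building the whole array first and handing it to the JSON library replaces both the per-line sed formatting and the last-line comma fix, at the cost of reading the entire file into memory.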

Ruby & Nokogiri

  • in irb, after installing the nokogiri gem
    • require 'nokogiri'
    • require 'json'
    • f = File.open('etsy_categories.html')
    • noko = Nokogiri::HTML(f)
    • f.close
    • looking at the page source, all the lines that we are interested in are children of a div with class="children"
      • noko.css('.children').map { |e| e.children.map {|c| c.values}.flatten }
        • this gets us started: a nested array of the attribute values (here, the href strings) for each group. If we want exactly the same output as above:
        • noko.css('.children').map do |e|
          • e.children.map do |c|
            • if c.values.first
              • key_val = c.values.first.gsub("/listing-category/", "").split("/")
              • {key_val.first => key_val.last}
            • end
          • end.compact
        • end.flatten.to_json

So Which is Better?

  • gawk and sed were a little bit more fun to use, but that’s just me
  • Some would say that regex is not the tool for parsing general HTML (breaking it down into its components). In this case, though, the fact that our source text is wrapped in HTML is almost irrelevant, since we identify the text we care about with a keyword (listing-category) in addition to the anchor tag
    • So, although nokogiri is a tool built specifically for parsing html, its specialization isn’t entirely a must-have in such a situation
  • But I would say that if I had to do any more with this source page than I already have, I would continue with nokogiri
    • So nokogiri is better suited and fancier, but its flexibility isn’t needed in this lite case of HTML parsing, so the fun of using gawk and sed makes them the winners for me
