Parslet: a very cool Parsing Expression Grammar parser
Recently I made a Japanese language learning tool, which is very cool :). For this tool we needed
a parser for a small markup language whose syntax looks like [kanji|furigana].
My colleague zete used his very professional regexp
skills to write a regular expression that parses this text.
WORD_PARSER = /
(?<kanji> [\p{Han}\p{Hiragana}\p{Katakana}]+ ){0}
(?<furi> \p{Hiragana}+ ){0}
(?<oword> [^\]]+ ){0}
(?<invalid> . ){0}
(?<word> \[ \g<sp> \g<word_content> \g<sp> \] ){0}
(?<word_content> \g<kanji> \g<sp> \g<right> | \g<oword> ){0}
(?<right> \| \g<sp> \g<furi> ){0}
(?<sp> [\t\ ]* ){0}
\g<word> | \g<invalid>
/ux
Look at the stunning regular expression above. I have never touched a regexp at this level of complexity.
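To make the regexp a little less mysterious, here is a minimal sketch of how it might be applied. The pattern is repeated so the snippet runs on its own. The helper `scan_words` is a hypothetical name of mine, not something from the original tool, and the way it skips invalid characters is an assumption about the intended behaviour.

```ruby
# The regular expression from the post, repeated so the sketch is self-contained.
# Each (?<name> ... ){0} only defines a named sub-pattern; \g<name> calls it.
WORD_PARSER = /
  (?<kanji>        [\p{Han}\p{Hiragana}\p{Katakana}]+ ){0}
  (?<furi>         \p{Hiragana}+ ){0}
  (?<oword>        [^\]]+ ){0}
  (?<invalid>      . ){0}
  (?<word>         \[ \g<sp> \g<word_content> \g<sp> \] ){0}
  (?<word_content> \g<kanji> \g<sp> \g<right> | \g<oword> ){0}
  (?<right>        \| \g<sp> \g<furi> ){0}
  (?<sp>           [\t\ ]* ){0}
  \g<word> | \g<invalid>
/ux

# Hypothetical helper (not part of the original tool): collect one hash per
# bracketed word and ignore characters that fall outside any [...] pair.
def scan_words(text)
  words = []
  text.scan(WORD_PARSER) do
    m = Regexp.last_match
    next if m[:invalid]                 # assumption: stray characters are skipped
    if m[:oword]
      # The word had no "|furigana" part, e.g. [の] or [27].
      words << { kanji: m[:oword] }
    else
      words << { kanji: m[:kanji], furigana: m[:furi] }
    end
  end
  words
end

p scan_words("[中国|ちゅうごく][の]")
```

The named groups play the same role that rules play in a grammar, which is why a dedicated PEG library can express the same idea more readably.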
But later on, while I was looking at an HTTP header parser, I found a good tool called parslet.
It does exactly the same thing, and it can even handle more complicated tasks, such as parsing a whole language.
require 'parslet'
include Parslet

furidown = "[中国|ちゅうごく][雲南|うんなん][省|しょう][の][昆|こん][明|あきら][市|し][在住|ざいじゅう][の][27][歳|さい][の][アメリカ][人|じん][。]"

# Constructs a parser using a Parsing Expression Grammar
class Furidown < Parslet::Parser
  # Optional whitespace after brackets and the separator
  rule(:space)    { match('\s').repeat(1) }
  rule(:space?)   { space.maybe }

  # Delimiters
  rule(:lbracket) { str('[') >> space? }
  rule(:rbracket) { str(']') >> space? }
  rule(:split)    { str("|") >> space? }

  # A word is [kanji] or [kanji|furigana]
  rule(:kanji)    { match('[^\]\|]').repeat.as(:kanji) }
  rule(:furigana) { split >> match('[^\]\[]').repeat.as(:furigana) }
  rule(:word)     { lbracket >> kanji >> furigana.maybe >> rbracket }
  rule(:words)    { word.repeat }

  root(:words)
end

parser = Furidown.new
p parser.parse(furidown)
# The result looks like this:
=> [{:kanji=>"中国"@1, :furigana=>"ちゅうごく"@8}, {:kanji=>"雲南"@25, :furigana=>"うんなん"@32}, {:kanji=>"省"@46, :furigana=>"しょう"@50}, {:kanji=>"の"@61}, {:kanji=>"昆"@66, :furigana=>"こん"@70}, {:kanji=>"明"@78, :furigana=>"あきら"@82}, {:kanji=>"市"@93, :furigana=>"し"@97}, {:kanji=>"在住"@102, :furigana=>"ざいじゅう"@109}, {:kanji=>"の"@126}, {:kanji=>"27"@131}, {:kanji=>"歳"@135, :furigana=>"さい"@139}, {:kanji=>"の"@147}, {:kanji=>"アメリカ"@152}, {:kanji=>"人"@166, :furigana=>"じん"@170}, {:kanji=>"。"@178}]
Fantastic!!!
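Since the result is just an array of hashes, it is easy to consume. The post does not show how the tool renders it, so the following is only a sketch of one possible use: turning the words into HTML ruby annotations. The helper names `escape_html` and `to_ruby_html` are hypothetical, and the sample data uses plain strings in place of Parslet slices (values are converted with `to_s`, so either should work).

```ruby
# Characters that must be escaped before being placed into HTML.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Hypothetical helper: minimal HTML escaping without extra libraries.
def escape_html(text)
  text.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Hypothetical helper: render an array of { kanji:, furigana: } hashes as HTML.
# Words without furigana are emitted as plain text.
def to_ruby_html(words)
  words.map do |word|
    kanji = escape_html(word[:kanji])
    if word[:furigana]
      "<ruby>#{kanji}<rt>#{escape_html(word[:furigana])}</rt></ruby>"
    else
      kanji
    end
  end.join
end

# Plain strings stand in for the slices returned by the parser.
sample = [{ kanji: "中国", furigana: "ちゅうごく" }, { kanji: "の" }]
puts to_ruby_html(sample)
# prints: <ruby>中国<rt>ちゅうごく</rt></ruby>の
```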
References: DSL doc, Examples, Get started, Parslet intro (a very good intro to parsing ERB)