Seunghyun Yoo


A tutorial on extracting the P/E ratio from a website using DiffScraper

  1. Suppose you have already downloaded the HTML files (1.html and 2.html in this tutorial). This can be done by calling wget ""

  2. Generate a template file from the downloaded HTML files.

    ./ --generate shiller-pe.template 1.html 2.html

  3. Next, synthesize the crawling script. Use the suggest and interactive modes together.

    ./ --suggest --interactive 1.html 2.html

  4. Choose a selector that you think is good enough. I chose the one marked as recommended: ts([selector.tagattr("div", "id", "current")], 1)

  5. Put the synthesized crawling script in a directory, usually named crawling.

    def diffscraper(T, raw_html):
        item = {}
        F = list(map(lambda x: tokenizer.Tokenizer.feature("html", x), T))
        D = template.extract(T, raw_html)
        ts = lambda x, y: D[, x, y)].strip()
        # Copy the suggested code snippet for a proper selector
        # ex: item["title"] = ts([selector.starttag("title")], 1)
        item["pe"] = ts([selector.tagattr("div", "id", "current")], 1)  # recommended
        return item

  6. To obtain the pickle files from the crawling script and the template that we just generated, run:

    ./ --scrape shiller-pe --template shiller-pe.template --output-dir shiller-pe 1.html 2.html

  7. Now we have the extracted data. The template file can be incrementally updated until it converges… :)
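
To take a quick look at what was extracted, a small Python script along these lines should be enough. It is only a sketch and not part of DiffScraper: the helper name load_scraped_items and the "*" file pattern are my own assumptions, since the exact names of the pickle files are not covered above. I also assume that each file holds the item returned by diffscraper() (for example a dict with a "pe" key), so adjust it to whatever you find in the output directory.

```python
import pickle
from pathlib import Path


def load_scraped_items(output_dir, pattern="*"):
    """Unpickle every file in output_dir that matches pattern.

    Hypothetical helper: `pattern` is an assumption, narrow it (e.g. to
    "*.pickle") if the directory also contains files that are not pickles.
    Returns a dict mapping file name -> unpickled object.
    """
    directory = Path(output_dir)
    if not directory.is_dir():
        # Nothing has been scraped into this directory yet.
        return {}

    items = {}
    for path in sorted(directory.glob(pattern)):
        if not path.is_file():
            continue
        # Only unpickle files you produced yourself: pickle can run
        # arbitrary code while loading untrusted data.
        with path.open("rb") as handle:
            items[path.name] = pickle.load(handle)
    return items


if __name__ == "__main__":
    # "shiller-pe" is the --output-dir used in step 6.
    for name, item in load_scraped_items("shiller-pe").items():
        print(name, item)
```

Keeping the result keyed by file name makes it easy to see which HTML snapshot each extracted value came from.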