Web scraping: Interactive prototyping

Prototyping a solution for Web scraping a friend's blog interactively…

  1. Fetching the root page:

    >>> import urllib
    >>> u0 = "http://blog.tapuz.co.il/linsom/"
    >>> r0 = urllib.urlopen(u0) # Response.

    Nice feature: urllib handles redirections transparently:

    >>> r0.geturl()
    'http://www.tapuz.co.il/blog/userBlog.asp?FolderName=linsom'

    Not so nice: response body seems to end prematurely. Need to specify size to read()? No, doesn't seem to help.

    >>> r0 = urllib.urlopen(u0)
    >>> b0 = r0.read(int(1e6)) # Body.
    >>> len(b0)
    335709
    >>> b0[-40:] # How ends?
    'tyFF2();\r\n }\r\n\t}\t\r\n\r\n </script>\r\n'

    Docs say: "caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream". Foo.

    Bah, it's their rotten HTML.
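One way to rule out a short read for good is to keep calling read() until it returns nothing. A minimal sketch in Python 3 (an assumption; the session above is Python 2), where read_all and chunk_size are made-up names and an in-memory stream stands in for the network response:

```python
import io


def read_all(response, chunk_size=64 * 1024):
    """Read a file-like response until read() returns an empty bytes object."""
    chunks = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # b"" signals end of stream
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-in for a network response, so the sketch runs offline.
fake_response = io.BytesIO(b"x" * 335709)
body = read_all(fake_response, chunk_size=4096)
print(len(body))
```

With a live connection the same function should accept the object returned by urllib.request.urlopen(u0). If the length still comes out the same, the page really does end there.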

  2. Saving the response so can analyze it:

    >>> import os
    >>> os.chdir("/tmp") # Someplace to dump stuff.
    >>> open("b0","w").write(r0.read())

    Note: if the response body is not read immediately after sending the request, the connection closes, apparently, and throws "socket.error: (104, 'Connection reset by peer')".

    Oy, but note the encoding, too:

    >>> r0.info()
    <httplib.HTTPMessage instance at 0x2adcb4b573f8>
    >>> r0.info().items()
    [('x-powered-by', 'ASP.NET'),
     ('set-cookie', 'TapuzBlog=blogId59549=1&Blognum59549=yes&blogId=1; expires=Sat, 22-Nov-2008 22:00:00 GMT; domain=tapuz.co.il; path=/, ASPSESSIONIDQSCDRBQS=BHLEMCODAENIMBFCIIDJCABP; path=/'),
     ('expires', 'Fri, 21 Nov 2008 17:31:49 GMT'),
     ('server', 'Microsoft-IIS/6.0'),
     ('connection', 'close'),
     ('cache-control', 'private'),
     ('date', 'Sat, 22 Nov 2008 10:11:50 GMT'),
     ('content-type', 'text/html; Charset=windows-1255')]
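The charset can be pulled out of that content-type header rather than hard-coded. A small sketch in Python 3 using the standard email package; charset_of and the latin-1 fallback are my own choices for illustration, not something the site or urllib prescribes:

```python
from email.message import Message


def charset_of(content_type, default="latin-1"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Parameter names are matched case-insensitively, so "Charset=" works too.
    return msg.get_content_charset(default)


header = "text/html; Charset=windows-1255"  # value taken from the response above
raw = "על מגע".encode("windows-1255")       # stand-in for the response body
text = raw.decode(charset_of(header), "replace")
print(charset_of(header), text)
```

Decoding with "replace" keeps going when the body contains bytes that are not valid in the declared encoding, which seems likely with this site.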
  3. Crawling:

    The blog's root page shows the last few entries, but it would be more convenient to harvest them one by one, no?

    1. Links to posts seem to all look like this:

      href='ViewEntry.asp?EntryId=1352026'

      so can be matched with a simple regular expression:

      >>> import re
      >>> m = re.search(r"href='(ViewEntry\.asp\?EntryId=.+?)'>", b0)
      >>> m.group(1)
      'ViewEntry.asp?EntryId=1371470'

    2. Fetch that post (then recursively):

      >>> u1 = u0[ : u0.rindex("/")] + "/" + m.group(1)
      >>> r1 = urllib.urlopen(u1); b1 = r1.read(int(1e6))
      >>> len(b1)
      307991

      This response, too, seems truncated, ends the same, but I wouldn't be surprised if their horrid HTML just ends like that.
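To harvest every post rather than just the first match, the same idea extends to findall plus urljoin. A rough sketch in Python 3; post_urls is a made-up helper, and the \d+ in the pattern assumes entry IDs are always numeric (the search above used .+? instead):

```python
import re
from urllib.parse import urljoin

# Assumption: entry IDs are purely numeric, as in the two examples seen so far.
POST_LINK = re.compile(r"href='(ViewEntry\.asp\?EntryId=\d+)'")


def post_urls(base_url, html):
    """Return absolute post URLs found in html, in order, without duplicates."""
    seen = set()
    urls = []
    for relative in POST_LINK.findall(html):
        url = urljoin(base_url, relative)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = (
    "<a href='ViewEntry.asp?EntryId=1371470'>a</a>"
    "<a href='ViewEntry.asp?EntryId=1352026'>b</a>"
    "<a href='ViewEntry.asp?EntryId=1371470'>a again</a>"
)
for url in post_urls("http://blog.tapuz.co.il/linsom/", sample):
    print(url)
```

urljoin does the same job as the rindex("/") slicing, and also copes with a base URL that has a query string.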

  4. Parsing HTML:

    "Sterilize" the HTML so can render it in Konqueror without running embedded scripts, ignoring styling, etc, and in UTF-8:

    $ ./sterilize_html.py < b1 > b1-sterilized

    (Cleaning the main HTML: 17,355 lines, replaced 235 occurrences of "http://" with "httpXXX://", 65 "<\s*SCRIPT" with "<XXXscript", 7 "<\s*iframe" with "<XXXiframe", 67 "on([a-z]+=)" with "onXXX\1", removed leading spaces from 12,350 lines. And trailing. File shrunk from 846KB to 301KB. Page won't render because embedded JavaScript contains HTML. Need to also disable "<object", "<param", "marquee", and "<embed". Still won't render: it's the embedded scripts. Did everything again, incrementally, found 29 scripts that rearrange content blocks on the page; they all call "insertBefore". To reveal structure: replaced 286 "border=" with "borderXXX=", 660 "width=", 496 "ing=", 646 "style=", 92 "color=", 438 "height=", etc. Crude, but works. Told you it's horrible.)
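The first few substitutions listed there could look roughly like this. It is a hypothetical reconstruction in Python 3, not the actual sterilize_html.py (which isn't shown here); RULES and sterilize are made-up names, and the \b in the event-handler pattern is my addition so that attributes like content= are left alone:

```python
import re

# Hypothetical rules, mirroring the first four replacements described above.
RULES = [
    (re.compile(r"http://", re.I), "httpXXX://"),
    (re.compile(r"<\s*script", re.I), "<XXXscript"),
    (re.compile(r"<\s*iframe", re.I), "<XXXiframe"),
    (re.compile(r"\bon([a-z]+=)", re.I), r"onXXX\1"),  # \b is an addition
]


def sterilize(html):
    """Apply each rule, then strip leading/trailing whitespace per line.

    Returns the cleaned text and the number of replacements made per rule.
    """
    counts = []
    for pattern, replacement in RULES:
        html, n = pattern.subn(replacement, html)
        counts.append(n)
    cleaned = "\n".join(line.strip() for line in html.splitlines())
    return cleaned, counts


sample = '  <SCRIPT src="http://x/a.js"></SCRIPT>\n<body onload="f()">  '
cleaned, counts = sterilize(sample)
print(cleaned)
print(counts)
```

Returning the counts makes it easy to log how many occurrences each rule hit, as in the tally above.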

    Found the text:

    <table widthXXX='98%' cellpaddingXXX='0' cellspacingXXX='0'><tr><td align='right' ><a id='FontHedT' styleXXX='font-size:18px;color:#9A8575;text-decoration:none;' href='ViewEntry.asp?EntryId=1371470'><strong>על מגע touch</strong></a>&nbsp;&nbsp; <br><!-- This is a test --></td><td align='left' ><XXXfont styleXXX='font-size:9pt;color:black;' ></font></td><td align='left' widthXXX='120'><div id='divHit_1371470' styleXXX='float:left;'></div><div id='divHitTitle_1371470' styleXXX='display:none;'>על מגע touch</div><div id='divHitDescr_1371470' styleXXX='display:none;'>על מגע המגע המחבר בין אדם לאדם ויש כל מיני סוגים כל אדם זקוק למגע ויש הבדל עצום בין מגע לנגיעה חטופה אדם שרק נוגע בחטוף בודק או רק זורק סימן של קירבה אדם שיודע להושיב את ידו לגעת - מגיע כל אדם מתמלא ממגע יש כאלה שמפחדים נבהלים בדיוק כמו שהנפש שלהם לא משוחררת כך גופם קפוא מלקבל או לתת נגיעה היא סוג...</div></td></tr></table><table widthXXX=98% borderXXX=0 cellspacingXXX=0 cellpaddingXXX=0 align=center><tr><td align='center' id='divHitBlank_1371470'></td></tr></table><table widthXXX='98%' cellpaddingXXX='0' cellspacingXXX='0'><tr><td heightXXX='4'></td></tr><tr><td heightXXX='1' id='FontHed'><hr id='FontHedT' styleXXX='color:#9A8575'></td></tr><tr><td heightXXX='4'></td></tr></table><table dir=rtl widthXXX='98%' cellpaddingXXX='0' cellspacingXXX='0' styleXXX='font-size:10pt;' borderXXX=0><tr><td align='left' ><table dir=rtl cellpaddingXXX='0' cellspacingXXX='0' styleXXX='font-size:10pt;' borderXXX=0><tr><td align='left'><XXXfont id='FontHedT' colorXXX='#9A8575'>פורסם ב21 בנובמבר 2008, 22:47</font></td><td align='left' widthXXX=5></td><td align='left'><XXXfont id='FontHedT' colorXXX='#9A8575'>במדור&nbsp;</font></td><td align='left' id=NameCat><a href='userblog.asp?passok=yes&Catid=&blogId=59549' target='_self'><XXXfont id='FontHed'><u><b></b></u></font></a></td></tr></table></td></tr></table><div></div><table dir='ltr' widthXXX='100%' borderXXX='0' cellspacingXXX='0' cellpaddingXXX='0' align='center' 
styleXXX='font-size:10pt;'><tr><td dir='rtl' valign='top' align='center'><a target='_blank' href='httpXXX://blog.tapuz.co.il/linsom/images/1626680_928.jpg'><img borderXXX='0' src='httpXXX://blog.tapuz.co.il/linsom/images/1626680_928.jpg' widthXXX='243'></a></td></tr></table><br><XXXfont dir='rtl' id='FontSizeA'><div align="center"><u><br>על מגע</u></div><div align="center">&nbsp;</div><div align="center">המגע&nbsp;המחבר בין אדםלאדם</div><div align="center">ויש כל מיני סוגים </div><div align="center">&nbsp;</div><div align="center">כל אדם זקוק למגע</div><div align="center">ויש הבדל עצום בין מגע לנגיעה חטופה</div><div align="center">אדם שרק נוגע&nbsp; בחטוף בודק או רק זורק סימן של קירבה</div><div align="center">אדם שיודע להושיב את ידו לגעת - מגיע</div><div align="center">&nbsp;</div><div align="center">כל אדם מתמלא ממגע</div><div align="center">יש כאלה שמפחדים</div><div align="center">נבהלים</div><div align="center">בדיוק כמו שהנפש שלהם לא משוחררת</div><div align="center">כך גופם קפוא מלקבל או לתת</div><div align="center">&nbsp;</div><div align="center">נגיעה היא סוג של רצון לקשר</div><div align="center">מגע הוא החיבור המופלא לכל סוג של אהבה</div><div align="center">&nbsp;</div><div align="center">מגע מרפא</div><div align="center">נותן את הביטחון שלו</div><div align="center">לילד - בזוגיות - בחברות</div><div align="center">במשפחה </div><div align="center">&nbsp;</div><div align="center">לתת לאנשים שחשובים לנו לדעת שאנחנו שם בשבילם</div><div align="center">ויותר ממילה פרח או מתנה</div><div styleXXX="FONT-SIZE: 12pt" align="center">המגע</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center">החיבוק&nbsp;שיכול תמיד להחזיר&nbsp;בנו נשימה</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center">&nbsp;</div><div align="center"><b><XXXfont id='FontSizeA' 
colorXXX="#660000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; © כל הזכויות שמורות&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </font></b><div id="FontSizeA" align="center"><XXXfont id='FontSizeA' id="FontSizeA" colorXXX="#660000" size="2"><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;רויתשחם</b></font></div><div id="FontSizeA" align="center"><base target="_self"><XXXfont id='FontSizeA' id="FontSizeA"><XXXfont id='FontSizeA' id="FontSizeA" colorXXX="#660000"><XXXfont id='FontSizeA' id="FontSizeA" size="2"><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; RAVIT SHAHAM<br></b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span id='FontSizeA' id="FontSizeA">&nbsp;<b>&nbsp;&nbsp;"התעוררות "</b></span></font></font></font></div></div><div align="center">&nbsp;</div></font>

    Now, how to find it programmatically?!
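One crude handle: the hidden divHitTitle_ and divHitDescr_ divs carry the entry ID in their id attribute, so a regular expression can at least pull out the title and the teaser. A sketch in Python 3, assuming every post page contains these divs (only one page has been checked); hit_fields is a made-up name:

```python
import re

# Matches the hidden title/teaser divs; [^>]* skips whatever attributes follow,
# so it works on both the raw and the sterilized markup.
HIT_DIV = re.compile(r"<div id='divHit(Title|Descr)_(\d+)'[^>]*>(.*?)</div>", re.S)


def hit_fields(html):
    """Map entry ID -> {'title': ..., 'descr': ...} for the divs found in html."""
    posts = {}
    for kind, entry_id, text in HIT_DIV.findall(html):
        posts.setdefault(entry_id, {})[kind.lower()] = text.strip()
    return posts


sample = (
    "<div id='divHit_1371470' styleXXX='float:left;'></div>"
    "<div id='divHitTitle_1371470' styleXXX='display:none;'>על מגע touch</div>"
    "<div id='divHitDescr_1371470' styleXXX='display:none;'>על מגע המגע המחבר...</div>"
)
print(hit_fields(sample))
```

The teaser is cut off with "...", so the full body still has to come from the FontSizeA block; this only helps with indexing.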

  5. Beautiful Soup (BS):
    1. Install Beautiful Soup: unpacked the tarball, then:

      $ sudo python setup.py install

      This copied everything to /usr/lib/python2.5/site-packages/.

    2. Beautiful Soup doesn't guess the encoding right, it seems.

      >>> import BeautifulSoup as bs # Assumed import (Beautiful Soup 3).
      >>> b1soup=bs.BeautifulSoup(b1,fromEncoding="windows-1255")

      Doing it myself works:

      >>> b1soup=bs.BeautifulSoup(unicode(b1,"windows-1255","replace"))
      >>> file("b1-soup","w").write(b1soup.prettify())

      BS outputs UTF-8 by default (I think). Yes, docs say: "Beautiful Soup stores only Unicode strings".

    3. BS code seems a bit, eh, archaic. Not too clean, nor efficient. Inconvenient for interactive DOM exploration. Use Firebug!?
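For comparison, the decode-it-myself approach also works without Beautiful Soup. The sketch below swaps in Python 3's standard html.parser (not BS, and not what was used above) to collect the visible text; TextCollector and extract_text are made-up names:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-blank text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()  # also drops runs of &nbsp; padding
        if text and not self._skip:
            self.chunks.append(text)


def extract_text(raw, encoding="windows-1255"):
    """Decode raw bytes (bad bytes replaced) and return the text chunks."""
    parser = TextCollector()
    parser.feed(raw.decode(encoding, "replace"))
    parser.close()
    return parser.chunks


sample = (
    '<div align="center">כל אדם זקוק למגע</div>'
    '<script>var x = "<b>";</script>'
    '<div align="center">&nbsp;</div>'
).encode("windows-1255")
print(extract_text(sample))
```

This only flattens the page to text; picking out the post body would still need some anchor such as the FontSizeA id seen in the markup.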




Last modified 2009-08-03 09:44:37 +0000