WordsCount

From Wizardsforge

Revision as of 13:22, 15 November 2006 by Kurt (Talk | contribs)

SCALE

Steve seems interested and will probably be a big help with our WordsCount project as well as with getting Psyche back in order for the next SCALE. He said he would like to go over it and fix a few things, and I might be able to learn more Python from him. We should probably start devoting a reasonable effort to SCALE soon, too.

Here is a brief rundown on the current state of affairs as of early November 2006.

html2text

I don't know how this was overlooked; maybe it wasn't, and it was found lacking. Anyway, it might make a great tool for grabbing the text from websites. You may already have it installed; I did. It is html2text, and more info on it can be found at http://userpage.fu-berlin.de/~mbayer/tools/html2text.html

Beware: there are other tools with the same name, including one in Python, http://userpage.fu-berlin.de/~mbayer/tools/html2text.html , which sends a mixed message in that its license is GPL 2.0 but it has "Try" and "Buy" headings, suggesting either a sense of humor or a desire to get money. That is just the beginning and about as far as I got, but there are more free as well as commercial versions of similar or identical products and services. I don't know how well they do commercially, but we may want to take a closer look at some of their business models and marketing strategies.

Briefly, the tool takes options and arguments, converts the HTML at a URL passed to it (or read from standard input) to plain text, and sends the result to standard output or to a file. It may be the quickest and easiest way to grab text from a page.
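For comparison, here is a rough sketch of the same idea (HTML in, plain text out) using only Python's standard library. This is not html2text and does not try to match its output; the names TextExtractor and html_to_text are made up for illustration, and it is written for current Python 3.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, skipping script and style."""

    SKIP = {"script", "style"}  # elements whose contents are not page text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Something like html_to_text(page_source) could then feed a word counter, though a real tool such as html2text handles layout, entities and odd markup far better than this.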

pdftohtml and pdftotext

I also discovered these open source tools. I've only tested pdftotext so far. It did a fair job of converting but left out most traces of formatting. It turned a 6.3MB PDF into a 1.4MB text file, and it managed to do that in a relatively short amount of time. It means we don't have to leave PDF documents out of a survey. There are probably other cool tools to do other stuff. All of this is for future reference, since we can obviously proceed for now without these capabilities, but it is good to note.
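If we ever want to drive pdftotext from a script, a minimal sketch might look like the following. It assumes pdftotext is installed and on the PATH; the function names and the file names in the usage note are hypothetical. The -layout option asks pdftotext to keep the original physical layout as best it can, which might help with the lost formatting, though I have not tested that here.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for a pdftotext run (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Ask pdftotext to preserve the physical layout of the page text.
        cmd.append("-layout")
    cmd.extend([pdf_path, txt_path])
    return cmd


def convert_pdf(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

For example, convert_pdf("survey.pdf", "survey.txt", keep_layout=True) would write the text of survey.pdf to survey.txt.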

ForPractice Update

I posted to IRC, but that can be a bit too ephemeral. I have set up an FTP/POP account at: http://forpractice.com/kbsig/ I don't have anything there yet except the old Psyche files.

I also dusted off this site I created some time ago, which it seems you never used, at: http://forpractice.com/brian/ Contact me if you want the password.

We also still have this one I made a long time ago when our regular server was down. I'd forgotten about it but rediscovered it when setting up this other stuff: http://forpractice.com/sfvlug/ It also has a link to the Psyche files in its own subdirectory.

SFVLUG Wiki

Don't forget about this one. You may not have the time to work on anything there yourself but maybe you can encourage others to at least check it out. http://editthis.info/sfvlug/Main_Page


Here is info on getting our own URL to point to our wiki. There are probably other ways than this to get the job done.

Python Mutable Error

I don't know if you recall but I mentioned a problem where Python gives an erroneous answer without complaining. Here is an example:

>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s:
...     if x % 2 != 0: s.remove(x)
...
>>> print(s)
[2, 4, 6, 3, 9, 8]
>>>
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s[:]:
...     if x % 2 != 0: s.remove(x)
...
>>> print(s)
[2, 4, 6, 8]
What my text says is, "If the sequence is a list, don't modify it in place in the body of the loop; if you do, Python may skip or repeat sequence items. Iterate over a copy of the sequence instead. This problem happens only for mutable sequences (that is, lists)." As you can see, the first loop, which removes items from the list it is iterating over, silently gave a wrong result (the odd numbers 3 and 9 were skipped and left in the list), while the second loop, which iterates over a copy (s[:]), worked just fine.
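Another way around the problem, as a small sketch, is to avoid removing items during the loop at all and build the filtered list instead. The function names here are made up for illustration.

```python
def keep_evens(values):
    """Return a new list with the odd numbers left out; values is untouched."""
    return [x for x in values if x % 2 == 0]


def drop_odds_in_place(values):
    """Filter the same list object in place using slice assignment."""
    values[:] = [x for x in values if x % 2 == 0]
```

With s = [2, 4, 6, 1, 3, 5, 9, 8], keep_evens(s) returns [2, 4, 6, 8] and leaves s alone, while drop_odds_in_place(s) changes s itself, so any other name that refers to the same list sees the change too.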



Dust Bin is just a place to put old stuff I'm not quite ready to toss out yet.
