WordsCount

From Wizardsforge

Revision as of 23:07, 14 November 2006 by Kurt (Talk | contribs)

Brian,

Today is Sunday, Nov. 12, and I have already updated the code at http://forpractice.com/kbsig/psyche/WordsCount/ a couple of times. Beyond what we discussed on the phone, I changed the initialization so that everything operates on paths relative to the directory the program is run from, so we don't need to change a lot of stuff across different systems. Later I'll make it install its required directory structure, but for now you need to make sure the 'text' directory (with the text files to be parsed) and the 'text-count' directory (with its own sub-directory 'totals') are already there. I probably won't do any more work on it till tomorrow, but if there are significant changes I'll post them here. Many comments that refer to the file structure are out of date, so the code must speak for itself in that regard till I get around to fixing them.
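For reference, here is a minimal sketch of how the relative-path setup and the directory check might look. It is not the posted WordsCount code. The directory names come from the note above; the function names are made up for illustration, and it assumes "the directory the program is run from" means the current working directory.

```python
from pathlib import Path

# Directory names taken from the note above; everything else is illustrative.
REQUIRED_DIRS = ("text", "text-count", "text-count/totals")


def resolve_layout(base=None):
    """Map each required directory name to a path under base.

    Assumption: base defaults to the current working directory.
    """
    root = Path(base) if base is not None else Path.cwd()
    return {name: root / name for name in REQUIRED_DIRS}


def missing_dirs(layout):
    """Return the names of required directories that do not exist yet."""
    return [name for name, path in layout.items() if not path.is_dir()]


def ensure_layout(layout):
    """Create any missing directories (the planned 'install' step)."""
    for path in layout.values():
        path.mkdir(parents=True, exist_ok=True)
```

Calling `missing_dirs(resolve_layout())` at startup would give a clear message about what to create by hand until the install step exists.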

Today is Nov. 11, 2006. The code I have so far is posted at http://forpractice.com/kbsig/psyche/WordsCount/ and, if there is more than one version there, the one with the largest version number is the latest. Once I have a stable version I'll strip the version number from the name.


It is now Nov. 6, 2006. Perhaps we can use this for now, at least until we get something better going. I'd like to eventually use this site and/or GoodNix for related but non-SFVLUG activities and projects. For now we can use it as a temporary place that is a little more private than the SFVLUG wiki. At some point we will need more security for commercial applications, but this should do for now.

What I Need From You

ToDo List

It would be a huge help for me to have some sort of ToDo list of the next steps, especially tasks that I can do or quickly learn how to do. I'm reasonably but not completely proficient at Python up to the function and module level, and I still have a way to go with classes. I should have started taking better notes myself. Ironically, I forget that my memory isn't what it once was.


Rough Schedule

I don't need anything fancy or rigid. I can probably wrap up the word counting, saving and addition part in short order. It is there in theory; it is just a matter of getting it polished to where we need it for the next step(s). You can do that however and wherever you want.
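To make the three pieces concrete (counting, saving, adding), here is a rough sketch in present-day Python. It is not the posted WordsCount code. The function names, the word pattern, and the "word count" line format are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters and apostrophes, case-insensitive.
WORD_RE = re.compile(r"[a-z']+")


def count_words(text):
    """Count how often each word occurs in one text."""
    return Counter(WORD_RE.findall(text.lower()))


def format_counts(counts):
    """Render counts as 'word count' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{word} {n}" for word, n in ordered)


def add_counts(*per_file):
    """Add several per-file counts into one running total."""
    total = Counter()
    for counts in per_file:
        total.update(counts)
    return total
```

The string from `format_counts` could be written to a file under 'text-count', and the result of `add_counts` to a file under 'totals'.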

Feedback

I probably won't need constant feedback, but it would be a big help if you could at least touch base now and then, even if it is just once every day or so, and even if it is just to let me know you are really busy and I shouldn't bug you for X amount of time. I just need enough info to help me manage my own time better. I'll continue to try and get some of our collaborative tools in better shape.

SCALE

Steve seems interested and will probably be a big help to me, both with our WordsCount project and in getting Psyche back in order for the next SCALE. He did say he would like to go over it and fix up some stuff, and I might be able to learn more Python from him. We should probably devote a reasonable effort to SCALE soon too.

Here is a brief rundown on the current state of affairs as of early November 2006.

html2text

I don't know how this was overlooked; maybe it wasn't, and it was found lacking. Anyway, it might make a great tool for grabbing the text from websites. You may already have it installed; I did. It is html2text, and more info on it can be found at http://userpage.fu-berlin.de/~mbayer/tools/html2text.html Beware: there are other tools with the same name, including one in Python http://userpage.fu-berlin.de/~mbayer/tools/html2text.html which sends a mixed message, in that its license is GPL 2.0 but it has "Try" and "Buy" headings, suggesting either a sense of humor or a desire to get money. That is just the beginning and about as far as I got, but there are more free as well as commercial versions of similar or identical products and services. I don't know how well they do commercially, but we may want to take a closer look at some of their business models and marketing strategies.

Briefly, the tool takes a URL (or standard input) along with some options, converts the HTML to text, and sends the result to standard output or to a file. It may be the quickest and easiest way to grab text from a page.
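If we end up calling it from Python, a thin wrapper along these lines might be enough. This is only a sketch. It assumes an `html2text` command is on the PATH and that, run with no arguments, it reads HTML from standard input and writes text to standard output, as described above. The function name and the `command` parameter are made up for illustration.

```python
import subprocess


def html_to_text(html, command=("html2text",)):
    """Pipe html through an external filter command and return its output.

    Assumption: the command reads standard input and writes standard output.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        list(command),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Keeping the command as a parameter means another converter with the same stdin/stdout behavior can be swapped in without touching the rest of the code.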

pdftohtml and pdftotext

I also discovered these open source tools. I've only tested pdftotext so far. It did a fair job of converting but left out most traces of formatting. It turned a 6.3 MB PDF into a 1.4 MB text file, and it did that in a relatively short amount of time. This means we don't have to leave PDF documents out of a survey. There are probably other cool tools to do other stuff. All of this is for future reference, since we can obviously proceed for now without these capabilities, but it is good to note.

ForPractice Update

I posted to IRC but that can be a bit too ephemeral. I have set up a ftp/pop account at: http://forpractice.com/kbsig/ I don't have anything there yet except the old Psyche files.

I also dusted off this site, which I created some time ago and which it seems you never used: http://forpractice.com/brian/ Contact me if you want the password.

We also still have this one, which I made a long time ago when our regular server was down. I'd forgotten about it but rediscovered it when setting up this other stuff: http://forpractice.com/sfvlug/ It also has a link to the Psyche files in its own subdirectory.

SFVLUG Wiki

Don't forget about this one. You may not have the time to work on anything there yourself but maybe you can encourage others to at least check it out. http://editthis.info/sfvlug/Main_Page


Here is info on getting our own URL to point to our wiki. There are probably other ways than this to get the job done.

Python Mutable Error

I don't know if you recall, but I mentioned a problem where Python gives an erroneous answer without complaining. Here is an example:

>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s:
...     if x % 2 != 0: s.remove(x)
...
>>> print(s)
[2, 4, 6, 3, 9, 8]
>>>
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s[:]:
...     if x % 2 != 0: s.remove(x)
...
>>> print(s)
[2, 4, 6, 8]

What my text says is, "If the sequence is a list, don't modify it in place in the body of the loop; if you do, Python may skip or repeat sequence items. Iterate over a copy of the sequence instead. This problem happens only for mutable sequences (that is, lists)." As you can see, the for loop over the list itself gave a wrong result without complaining (3 and 9 were left in), while the for loop over the copy worked just fine.
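As far as I understand it, the loop steps through the list by position. Each removal shifts the later items one place to the left, so the item right after a removed one is never visited. Here is a small sketch that reproduces the problem and shows another common way around it, building a new list rather than changing the old one. The function names are made up for illustration.

```python
def remove_odds_in_place(values):
    """Buggy on purpose: changes the list while looping over it."""
    for x in values:
        if x % 2 != 0:
            values.remove(x)  # later items shift left; the next one is skipped
    return values


def keep_evens(values):
    """Build a new list instead of changing the one being looped over."""
    return [x for x in values if x % 2 == 0]


s = [2, 4, 6, 1, 3, 5, 9, 8]
print(remove_odds_in_place(list(s)))  # [2, 4, 6, 3, 9, 8]
print(keep_evens(s))                  # [2, 4, 6, 8]
```

The list comprehension leaves the original list untouched, which also makes it easier to compare before and after.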


Dust Bin is just a place to put old stuff I'm not quite ready to toss out yet.