WordsCount

=Note to All=
I found some issues that remain with splitting and filtering words. I changed things around a bit, hopefully for the better, but even so it still needs work. It works, but it still includes some junk words. I tried to err on the side of including too much rather than too little, and we can hope that statistically those problems will be insignificant. I've improved the module that can be used to play with the data after all the documents are parsed, so post-parsing tweaks need not bog things down. The main module still does the same things but can be pared down to essentials later. I've named the tweaking module process_piclesxxx.py and will post it at http://forpractice.com/kbsig/psyche/WordsCount/
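Roughly, the splitting and filtering works like this (a simplified sketch only; the regex and the stop list here are placeholders, not what the script actually uses):

<pre>
import re

# a few obvious junk words; the real stop list is longer
STOP_WORDS = set(['the', 'a', 'an', 'and', 'of', 'to', 'in'])

def split_and_filter(text):
    """Split raw text into lowercase words, dropping junk."""
    # split on anything that isn't a letter, erring on the side
    # of keeping too much rather than too little
    words = re.split(r'[^a-zA-Z]+', text.lower())
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]
</pre>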
--[[User:Miasma|Miasma]] 22:48, 24 November 2006 (EST)
Today we have some results that we can tinker with. The WordsCount script now has a function that reads out the percentage of documents in which a word appears. The function show_range has two local variables that can be adjusted to tweak the results: the low_range variable sets the low end of the range of words (in percentage), while the high_range variable sets the high end of the range to be displayed. The words within that range are just printed to standard out at this point, but they could easily be used or stored. My preliminary impression is that there won't be a simple and clear-cut use of this information. In a given set of 100 documents, if only two of them have a statistical relationship, the words creating that relationship could sit as low as 2%, since a word shared by just those two documents appears in only 2% of the set. We may need to rely on other indicators; however, when we get to creating sets we may find that a cluster of common words between related documents will be more obvious. A sketch of the idea is below.--[[User:Miasma|Miasma]] 15:23, 21 November 2006 (EST)
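Here is a sketch of what show_range does, assuming the counts live in a dictionary mapping each word to the number of documents containing it (the real function in WordsCount may differ in detail):

<pre>
def show_range(word_doc_counts, total_docs):
    """Print words appearing in a given percentage range of documents."""
    low_range = 2.0    # low end of the range, in percent; tweak as needed
    high_range = 50.0  # high end of the range, in percent
    for word, doc_count in sorted(word_doc_counts.items()):
        percent = 100.0 * doc_count / total_docs
        if low_range <= percent <= high_range:
            print '%-20s %5.1f%%' % (word, percent)
</pre>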
I have been cleaning up my script with the goal of reducing unneeded global variables and making each function more generic and therefore more reusable. I am leaving in the printing of the counts to a text file to make it easier for a human to read, but I have been adding cPickle and shelve files so we can play with the results later without having to parse all the text files again. I'll build the analysis portion as a separate module, but in a manner that it could be inserted easily into the original script if desired.--[[User:Miasma|Miasma]] 16:07, 20 November 2006 (EST)
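The save-and-reload idea, roughly; this is a sketch only, with made-up file names, and the real script may organize the data differently:

<pre>
import cPickle
import shelve

def save_counts(counts, pickle_name='counts.pickle', shelf_name='counts.shelve'):
    """Persist the word counts so later runs can skip re-parsing."""
    f = open(pickle_name, 'wb')
    cPickle.dump(counts, f, cPickle.HIGHEST_PROTOCOL)
    f.close()
    shelf = shelve.open(shelf_name)
    for word, count in counts.items():
        shelf[word] = count   # shelve keys must be strings
    shelf.close()

def load_counts(pickle_name='counts.pickle'):
    """Reload the counts without parsing the text files again."""
    f = open(pickle_name, 'rb')
    counts = cPickle.load(f)
    f.close()
    return counts
</pre>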
=Note to Brian=
I've moved the conversation to the discussion tab for this page. I'll do some more work to clean up this area for any future work. Please check there for the latest comments. We can move them somewhere else, and even return them here, if you prefer. I'd like to make it easy and quick to add notes, and the discussion tab seemed a good place, but you can also get there by making a direct link like this:
http://editthis.info/wizardsforge/Talk:WordsCount
==Interesting Resources==
Not a lot yet, but I did find this site on AI where the guy running it has a bunch of interesting stuff, including programs written in Python. I think I posted the link to his ftp downloads in IRC the other day, but the main site is worth a look.
http://zhar.net/
==SCALE==

Steve seems interested and will probably be a big help to me with our WordsCount project, as well as with getting Psyche back in order for the next SCALE. He did say he would like to go over it and fix up some stuff, and I might be able to learn more Python from him. We should probably devote a reasonable effort to SCALE soon too.

Here is a brief rundown on the current state of affairs as of early November 2006.

==html2text==

I don't know how this was overlooked; maybe it wasn't, and it was found lacking. Anyway, it might make a great tool for grabbing the text from websites. You may already have it installed; I did. It is html2text, and more info on it can be found at http://userpage.fu-berlin.de/~mbayer/tools/html2text.html Beware that there are other tools with the same name, including one in Python, which has a mixed message in that its license is GPL 2.0 but its site has "Try" and "Buy" headings, suggesting either a sense of humor or a desire to get money. That is just the beginning and about as far as I got, but there are more free as well as commercial versions of similar or identical products and services. I don't know how well they do commercially, but we may want to take a closer look at some of their business models and marketing strategies.

Briefly, the tool takes arguments and options, produces a text version of a URL (or standard in) passed to it, and sends the results to standard out or to a file. It may be the quickest and easiest way to grab text from a page. A sketch of calling it from Python is below.
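For our scripts, the quickest way to use it would probably be to call it from Python. A sketch, assuming html2text is installed and on the PATH (flags may vary by version):

<pre>
import subprocess

def grab_text(url):
    """Run html2text on a URL and return the plain text it prints."""
    # html2text sends its output to standard out by default
    proc = subprocess.Popen(['html2text', url], stdout=subprocess.PIPE)
    text, err = proc.communicate()
    return text
</pre>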
==pdftohtml and pdftotext==
I also discovered these open source tools. I've only tested pdftotext so far. It did a fair job of converting but left out most traces of formatting. It turned a 6.3MB PDF into a 1.4MB text file, and managed to do that in a relatively short amount of time. It means we don't have to leave PDF documents out of a survey. There are probably other cool tools to do other stuff. All of this is for future reference, since we can obviously proceed for now without these capabilities, but it is good to note.
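Calling it would look much like the html2text sketch above; again, this assumes pdftotext is installed, and the real invocation may need options:

<pre>
import subprocess

def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to plain text so WordsCount can parse it."""
    # pdftotext writes its output to the named text file
    subprocess.call(['pdftotext', pdf_path, txt_path])
</pre>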
==ForPractice Update==

I posted to IRC, but that can be a bit too ephemeral. I have set up an ftp/pop account at http://forpractice.com/kbsig/ I don't have anything there yet except the old Psyche files.

I also dusted off this site I created some time ago, which it seems you never used: http://forpractice.com/brian/ Contact me for how/if you want the password.

We also still have this one I made a long time ago when our regular server was down. I'd forgotten about it but rediscovered it when setting up this other stuff: http://forpractice.com/sfvlug/ It also has a link to the Psyche files in its own subdirectory.
[http://www.constantsun.com/blog/2006/09/16/setting-a-domain-name-to-point-to-your-wiki/ Here] is info on getting our own URL to point to our wiki. There are probably other ways than this to get the job done.
==Python Mutable Error==
I don't know if you recall, but I mentioned a problem where Python gives an erroneous answer without complaining. Here is an example:
<pre>
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s:
...     if x % 2 != 0: s.remove(x)
...
>>> print s
[2, 4, 6, 3, 9, 8]
>>>
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s[:]:
...     if x % 2 != 0: s.remove(x)
...
>>> print s
[2, 4, 6, 8]
</pre>
What my text says is, "If the sequence is a list, don't modify it in place in the body of the loop; if you do, Python may skip or repeat sequence items. Iterate over a copy of the sequence instead. This problem happens only for mutable sequences (that is, lists)."

As you can see, the for loop over the list itself gave wrong results without complaining, while the pass over the copy worked just fine.
----
Kurt, this is because you're shrinking the length of the list you're iterating over without re-evaluating it. I'm guessing that the index it's using jumps a cell as the list shrinks (which your output shows), whereas with s[:] you've sliced the list, which copies it. Keep in mind you're removing with s.remove(x), which removes only the first occurrence of x; if you have redundant cells, the copies will be left in the list until your modulus test gets to them.
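For instance, printing the loop index as it goes makes the skipping visible; a small made-up demonstration, separate from the linked files below:

<pre>
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for i, x in enumerate(s):
...     print i, x
...     if x % 2 != 0: s.remove(x)
...
0 2
1 4
2 6
3 1
4 5
5 8
>>> print s    # 3 and 9 were never even visited
[2, 4, 6, 3, 9, 8]
</pre>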
See here for examples and output: http://www.nixwit.org/kurt.py.txt and http://www.nixwit.org/output.txt
PS. The <nowiki><code></code></nowiki> tags don't work as billed in this wiki. This might be a very large problem although I must admit I may just have to read up on them a bit.
----
Return to [[Main Page]]
[[Dust Bin]] is just a place to put old stuff I'm not quite ready to toss out yet.
