WordsCount
From Wizardsforge
Current revision as of 03:48, 25 November 2006
Note to All
I found some issues that remain with splitting and filtering words. I changed things around a bit, hopefully for the better, but even so it still needs work. It works but still includes some junk words. I tried to err on the side of including too much rather than too little, and we can hope that statistically those problems will be insignificant. I've improved the module that can be used to play with the data after all the documents are parsed, so post-parsing tweaks need not bog down the main run. The main module still does the same things but can be pared down to essentials later. I've named the tweaking module process_piclesxxx.py and will post it at http://forpractice.com/kbsig/psyche/WordsCount/ --Miasma 22:48, 24 November 2006 (EST)
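The actual splitting and filtering code isn't shown here, but the approach described above might look something like the following minimal sketch. The stop list, the regex, and the function names are all assumptions for illustration; note how a fragment like "nd" (from "2nd") survives, which is exactly the kind of junk word mentioned above when erring toward inclusion.

```python
import re

# Assumed stop list; the real WordsCount filter is not shown in the notes.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def split_words(text):
    """Lowercase the text and keep runs of letters/apostrophes, erring toward inclusion."""
    return re.findall(r"[a-z']+", text.lower())

def filter_words(words):
    """Drop stop words and one-character fragments; keep everything else."""
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]

# 'nd' (the tail of '2nd') slips through -- a junk word of the kind noted above.
words = filter_words(split_words("The quick brown fox, the lazy dog's 2nd nap."))
```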
Today we have some results that we can tinker with. The WordsCount script now has a function that reads out the percentage of documents in which a word appears. The function show_range has two local variables that can be adjusted to tweak the results: low_range sets the low end of the range of words (as a percentage) while high_range sets the high end of the range to be displayed. The words within that range are just printed to standard out at this point, but could easily be used or stored. My preliminary impression is that there won't be a simple, clear-cut use of this information. Given a set of 100 documents, if only two of them have a statistical relationship, the words creating that relationship could appear in as little as 2% of the documents. We may need to rely on other indicators; however, when we get to creating sets we may find a cluster of common words between related documents that will be more obvious. --Miasma 15:23, 21 November 2006 (EST)
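The show_range idea described above could be sketched as follows. This is a guess at the shape of the function, not the actual script: doc_freq and total_docs are assumed names, and this version returns the selected words rather than printing them to standard out, to make the result easier to reuse.

```python
# Hypothetical sketch of show_range: doc_freq maps each word to the number of
# documents it appears in; low_range/high_range are percentages as described.
def show_range(doc_freq, total_docs, low_range=2.0, high_range=50.0):
    """Return (word, percent) pairs whose document frequency falls in the range."""
    selected = []
    for word, count in sorted(doc_freq.items()):
        pct = 100.0 * count / total_docs
        if low_range <= pct <= high_range:
            selected.append((word, pct))
    return selected

# With 100 documents, a word shared by only two related documents sits right
# at the 2% floor, as the note above points out.
freqs = {"python": 2, "the": 98, "wiki": 30}
result = show_range(freqs, 100)
```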
I have been cleaning up my script with the goal of reducing unneeded global variables and making each function more generic and therefore more reusable. I am leaving in printing the counts to a text file to make it easier for a human to read, but have been adding cPickle and shelve files so we can play with the results later without having to parse all the text files again. I'll build the analysis portion as a separate module, but in such a way that it could easily be inserted into the original script if desired. --Miasma 16:07, 20 November 2006 (EST)
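The persistence idea above, saving counts so the text files need not be re-parsed, can be sketched like this. The note mentions Python 2's cPickle; in modern Python the plain pickle module uses the fast C implementation automatically. The file names here are examples, not the ones the script actually uses.

```python
import os
import pickle
import shelve
import tempfile

# Example word counts standing in for the parsed results.
counts = {"python": 12, "word": 7}

tmpdir = tempfile.mkdtemp()
pickle_path = os.path.join(tmpdir, "counts.pkl")
shelf_path = os.path.join(tmpdir, "counts_shelf")

# pickle: dump the whole dict at once, load it all back later.
with open(pickle_path, "wb") as f:
    pickle.dump(counts, f)
with open(pickle_path, "rb") as f:
    restored = pickle.load(f)

# shelve: a persistent dict-like object, handy for per-word access
# without loading everything into memory.
with shelve.open(shelf_path) as shelf:
    shelf.update(counts)
with shelve.open(shelf_path) as shelf:
    python_count = shelf["python"]
```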
Note to Brian
I've moved the conversation to the discussion tab for this page. I'll do some more work to clean up this area for any future work. Please check there for the latest comments. We can move them somewhere else, and even return them here if you prefer. I'd like to make it easy and quick to add notes, and the discussion tab seemed a good place, but you can also get there by making a direct link like this: http://editthis.info/wizardsforge/Talk:WordsCount
Interesting Resources
Not a lot yet, but I did find this site on AI where the guy running it has a bunch of interesting material, including programs written in Python. I think I posted the link to his ftp downloads in IRC the other day, but the main site is worth a look: http://zhar.net/
SCALE
Steve seems interested and will probably be a big help to me, both with our WordsCount project and with getting Psyche back in order for the next SCALE. He did say he would like to go over it and fix up some things, and I might be able to learn more Python from him. We should probably devote a reasonable effort to SCALE soon too.
Here is a brief rundown on the current state of affairs as of early November 2006.
html2text
I don't know how this was overlooked; maybe it wasn't, and it was found lacking. Anyway, it might make a great tool for grabbing the text from websites. You may already have it installed; I did. It is html2text, and more info on it can be found at http://userpage.fu-berlin.de/~mbayer/tools/html2text.html. Beware that there are other tools with the same name, including one written in Python, which sends a mixed message in that its license is GPL 2.0 but it has "Try" and "Buy" headings, suggesting either a sense of humor or a desire to get money. That is just the beginning and about as far as I got, but there are more free as well as commercial versions of similar or identical products and services. I don't know how well they do commercially, but we may want to take a closer look at some of their business models and marketing strategies.
Briefly, the tool takes a URL (or standard in) along with arguments and options, and sends a text version of the page to standard out or to a file. It may be the quickest and easiest way to grab text from a page.
pdftohtml and pdftotext
I also discovered these open source tools. I've only tested pdftotext so far. It did a fair job of converting but left out most traces of formatting. It turned a 6.3MB PDF into a 1.4MB text file, and managed to do that in a relatively short amount of time. It means we don't have to leave PDF documents out of a survey. There are probably other cool tools to do other things. All of this is for future reference, since we can obviously proceed for now without these capabilities, but it is good to note.
ForPractice Update
I posted this to IRC, but that can be a bit too ephemeral. I have set up an ftp/pop account at: http://forpractice.com/kbsig/ I don't have anything there yet except the old Psyche files.
I also dusted off this site I created some time ago, which it seems you never used: http://forpractice.com/brian/ Contact me for how/if you want the password.
We also still have the one I made a long time ago when our regular server was down. I'd forgotten about it but rediscovered it when setting up this other stuff: http://forpractice.com/sfvlug/ It also has a link to the Psyche files in its own subdirectory.
Here is info on getting our own URL to point to our wiki. There are probably other ways than this to get the job done.
Python Mutable Error
I don't know if you recall, but I mentioned a problem where Python gives an erroneous answer without complaining. Here is an example:
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s:
... if x % 2 != 0: s.remove(x)
...
>>> print s
[2, 4, 6, 3, 9, 8]
>>>
>>> s = [2, 4, 6, 1, 3, 5, 9, 8]
>>> for x in s[:]:
... if x % 2 != 0: s.remove(x)
...
>>> print s
[2, 4, 6, 8]
What my text says is, "If the sequence is a list, don't modify it in place in the body of the loop; if you do, Python may skip or repeat sequence items. Iterate over a copy of the sequence instead. This problem happens only for mutable sequences (that is, lists)."
As you can see, the for loop over the list itself gave a wrong result without complaining, while the for loop over the copy worked just fine.
Kurt, this is because you're shrinking the list you're iterating over without re-evaluating its length. The index being used effectively jumps a cell as the list shrinks (which your output shows), whereas with s[:] you've sliced the list, which copies it. Keep in mind that s.remove(x) removes only the first occurrence of x, so if you have redundant cells the copies will be left in the list until your modulus test gets to them.
See here for examples and output: http://www.nixwit.org/kurt.py.txt and http://www.nixwit.org/output.txt
PS. The <code></code> tags don't work as billed in this wiki. This might be a very large problem, although I must admit I may just have to read up on them a bit.
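For reference, besides iterating over a copy with s[:], a list comprehension sidesteps the issue entirely by building a new list instead of removing items from the one being traversed:

```python
s = [2, 4, 6, 1, 3, 5, 9, 8]

# Build a new list of the even elements rather than removing odds in place.
evens = [x for x in s if x % 2 == 0]

# Or update the original list object in place via slice assignment,
# so other references to s see the filtered result too.
s[:] = [x for x in s if x % 2 == 0]
```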