Tuesday, October 25, 2005

Help with string handling....


Hi all,
My requirement is as follows:
I need to read a text file, eliminate certain special characters (like
! , - = +), convert the text to lower case, and then remove certain
stopwords (like and, a, an, by, the, etc.), which are listed in another
text file.
Then I need to run it through a stemmer (a program which converts words
like "running" to "run", i.e., reduces them to their root words).
Then I need to create a term-by-document matrix, that is, a matrix M
where M(i,j) gives the number of times term j occurs in document i.
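
To make the last requirement concrete, here is a rough sketch of one way the term-by-document matrix could be represented. It assumes each document has already been cleaned, lower-cased, stripped of stopwords, and stemmed into a list of words. The names TermDocMatrix and buildTermDocMatrix are made up for illustration, and it needs a C++11 compiler.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One row per document, one column per distinct term.
struct TermDocMatrix {
    std::map<std::string, std::size_t> column;  // term -> column index j
    std::vector<std::vector<int>> counts;       // counts[i][j] is M(i,j)
};

// docs[i] holds the already-processed words of document i (assumption).
TermDocMatrix buildTermDocMatrix(const std::vector<std::vector<std::string>>& docs)
{
    TermDocMatrix m;

    // First pass: give every distinct term its own column.
    // insert() leaves the map unchanged if the term is already present.
    for (const auto& doc : docs)
        for (const auto& term : doc)
            m.column.insert(std::make_pair(term, m.column.size()));

    // Second pass: allocate a zero-filled matrix and count occurrences.
    m.counts.assign(docs.size(), std::vector<int>(m.column.size(), 0));
    for (std::size_t i = 0; i < docs.size(); ++i)
        for (const auto& term : docs[i])
            ++m.counts[i][m.column[term]];

    return m;
}
```

With this layout, the count of a term in document i is m.counts[i][m.column[term]]. A dense matrix like this is fine for a handful of files; for a large collection most entries would be zero, so a sparse representation would probably be a better fit.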

Here is where I am so far:
I have read the file contents into a string variable, replaced the
special characters with spaces using the replace function, and then
converted the whole string to lower case using the transform function.
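
As a possible simplification of the cleaning steps described above, the replacement and the lower-casing could be done in a single pass over the string. This is only a sketch: the function name normalize is hypothetical, and it assumes that every character that is not a letter, a digit, or whitespace counts as "special", which is broader than the handful of characters listed in the requirement.

```cpp
#include <cctype>
#include <string>

// Returns a copy of text in which letters are lower-cased and every
// character that is not alphanumeric or whitespace becomes a space.
std::string normalize(const std::string& text)
{
    std::string out(text);
    for (std::string::size_type i = 0; i < out.size(); ++i) {
        // The <cctype> functions require a value representable as unsigned char.
        unsigned char c = static_cast<unsigned char>(out[i]);
        if (std::isalnum(c) || std::isspace(c))
            out[i] = static_cast<char>(std::tolower(c));
        else
            out[i] = ' ';
    }
    return out;
}
```

Replacing with a space rather than deleting keeps words such as "well-known" from being fused into one token.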

I would really appreciate any help. Thanks in advance.

The code is below:
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <set>
#include <algorithm>
#include <cctype>
#include <cstdio>   // FILE, fopen, fseek, ftell, fread, fclose
#include <cstdlib>  // malloc

using namespace std;

int main(int argc, char *argv[])
{
using std::cout;
using std::endl;

int var_len;

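//Read the whole file into a string
FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
if(fp==NULL)
{
cerr<<"Could not open 01t.txt"<<endl;
return 1;
}
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len+1);//one extra byte for the terminating '\0'
len=(long)fread(buf,1,len,fp);//number of bytes actually read
buf[len]='\0';//without this, file=buf reads past the end of the buffer
fclose(fp);
string file;
file=buf;
free(buf);//the string now holds its own copy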

cout<< file <<endl;

vector<string> files;
vector<string> punct;//Punctuation strings to remove from each file
cout<<"This is a sample program"<<endl;
punct.push_back(",");punct.push_back(":");punct.push_back(";");
punct.push_back("'");
punct.push_back("'");punct.push_back("=");punct.push_back("-");
punct.push_back(".");punct.push_back(",");punct.push_back(",");

for (size_t i=0;i<punct.size();i++)
{
cout<<punct.at(i)<<endl;
}

//'' is not a valid character literal, so every special character is
//replaced with a single space
std::replace(file.begin(),file.end(),',',' ');
std::replace(file.begin(),file.end(),';',' ');
std::replace(file.begin(),file.end(),':',' ');
std::replace(file.begin(),file.end(),'-',' ');
std::replace(file.begin(),file.end(),'=',' ');
std::replace(file.begin(),file.end(),'+',' ');
std::replace(file.begin(),file.end(),')',' ');
std::replace(file.begin(),file.end(),'(',' ');
std::replace(file.begin(),file.end(),'&',' ');
std::replace(file.begin(),file.end(),'!',' ');
std::replace(file.begin(),file.end(),'.',' ');
std::replace(file.begin(),file.end(),'/',' ');

//Replacing single and double quotes with spaces
std::replace(file.begin(),file.end(),'\'',' ');
std::replace(file.begin(),file.end(),'\"',' ');

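//A bare tolower is ambiguous here (the <cctype> and <locale> versions
//both match), and it must not be called with a negative char value
std::transform(file.begin(),file.end(),file.begin(),
[](unsigned char c){return static_cast<char>(std::tolower(c));});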

/*if((pos=file.find(remword,0))!=string::npos)
{
file.erase(pos,remword.length());
}
cout << "After removing 'the'" <<endl;
*/

return 0;
}
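
A note on the commented-out find/erase attempt: erasing every occurrence of "the" from the raw string would also cut it out of longer words such as "other". One alternative, sketched below, is to split the cleaned text into words first and drop the ones found in a std::set. The function names loadStopwords and removeStopwords are made up for illustration, and the sketch assumes the stopword file simply holds words separated by whitespace.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated stopwords from any input stream
// (for example a std::ifstream opened on the stopword file).
std::set<std::string> loadStopwords(std::istream& in)
{
    std::set<std::string> stopwords;
    std::string word;
    while (in >> word)
        stopwords.insert(word);
    return stopwords;
}

// Split text on whitespace and keep only the words that are not stopwords.
// The text is expected to be cleaned and lower-cased already.
std::vector<std::string> removeStopwords(const std::string& text,
                                         const std::set<std::string>& stopwords)
{
    std::vector<std::string> kept;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        if (stopwords.find(word) == stopwords.end())
            kept.push_back(word);
    return kept;
}
```

The resulting vector of words is also a convenient input for the stemmer and for counting terms per document, since both work word by word.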
