Python Help

Hi, I am doing a science project that requires me to analyze very long strings of text. I have to compare two strings with each other and determine how many positions differ between them. For example, the strings \(18351294\) and \(19352994\) have \(3\) differences. My actual strings, however, are about \(250\) characters long and made of letters, and I have to compare \(10\) of them against each other, so I can't compare them manually. I could do this with Excel, but it would take me a really long time. My friend told me about these things called "while loops" that I could use in Python, but I don't know anything about them. Obviously, I could learn Python on sites like Codecademy or KhanAcademy, but I am \(\textit{really}\) pressed for time (I have to have the code written and run by next weekend). Can someone please post an example of a while loop that would compare the characters of two strings of text and return the number of differences? Thank you very much!

Note: This is not me being lazy and trying to take advantage of you. I am going to learn Python at some point, but I am really busy and have very little time to work.

Note by Trevor B.
3 years, 9 months ago

Comments

This is a classic problem in bioinformatics, comparing strings of DNA. As long as these strings are the same size, this is easy to do. The number of corresponding symbols that differ, by the way, is called the Hamming distance between the two strings. Check out this link. The website in general is great for practicing programming and bioinformatics skills.

Anyway, to the code. Since you don't know what while loops are, I'm going to assume that you are new to programming. While your problem could be solved with a while loop, I'm going to use a for loop, so that we can be sure that our process terminates. Here is the code that will give you your desired answer.

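string1 = '18351294'  # example strings from the question;
string2 = '19352994'  # replace with your own (same length)

count = 0

for i in range(0,len(string1)):
    if string1[i] != string2[i]:
        count += 1

print(count)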

If you're more descriptive about your problem (i.e., tell me whether the strings are all the same size, or how you want to be able to compare all ten of them more easily), I'd be happy to write you another code. (And to those who actually code well, I know that this isn't the shortest or most efficient piece of code for this problem. However, I think that it is probably the most understandable to a beginner.) Bob Krueger · 3 years, 9 months ago
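
Since the original question asked for a while loop, here is a minimal sketch of the same count written that way, next to a shorter form using `zip` and `sum`. The function names `hamming_while` and `hamming_zip` are made up for illustration, and the length check is an added assumption based on Bob's remark that the strings must be the same size.

```python
def hamming_while(string1, string2):
    """Count differing positions with a while loop (equal-length strings only)."""
    if len(string1) != len(string2):
        raise ValueError("strings must have the same length")
    count = 0
    i = 0
    while i < len(string1):
        if string1[i] != string2[i]:
            count += 1
        i += 1  # without this line the loop would never end
    return count


def hamming_zip(string1, string2):
    """Same count: zip pairs up the characters, sum adds up the mismatches."""
    if len(string1) != len(string2):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(string1, string2))


print(hamming_while("18351294", "19352994"))  # 3
print(hamming_zip("18351294", "19352994"))    # 3
```

The `i += 1` line is the reason Bob preferred a for loop: forgetting it leaves the while loop running forever, whereas `range` advances the index automatically.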

@Bob Krueger It's funny you mention bioinformatics, because that is exactly my project. I'm comparing the amino acid sequences of a protein from ten different animals. The strings have the same length. I had originally intended to copy and paste the code for the \(45\) different pairwise comparisons to be made, but now that I think about it, there is probably a way to repeat it in Python.

I can sort of see how that program works. It puts \(i\) in a range of numbers from \(0\) to the length of the first string, and then tests if that position [\(i\)] is the same as in the second string. Then it prints the count, the number of times the first string's [\(i\)] is not the same as the second string's. (I think)

I am a novice in programming (except for LaTeX, which will do nothing except make my project look pretty); in fact, I only started programming in Python \(15\) minutes ago.

Thank you very much! Trevor B. · 3 years, 9 months ago
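
Trevor's reading of the loop is essentially right. One detail worth adding: `range(0, len(string1))` stops one short of the length, so for an 8-character string `i` runs from 0 to 7. A small trace on the two strings from the original question makes each step visible; the helper name `trace_differences` is made up for this sketch.

```python
def trace_differences(string1, string2):
    """Return (position, char1, char2) for every position where the strings differ."""
    mismatches = []
    for i in range(0, len(string1)):  # i takes the values 0, 1, ..., len(string1) - 1
        if string1[i] != string2[i]:
            mismatches.append((i, string1[i], string2[i]))
    return mismatches


mismatches = trace_differences("18351294", "19352994")
for position, a, b in mismatches:
    print("position " + str(position) + ": " + a + " vs " + b)
print(str(len(mismatches)) + " differences")
# position 1: 8 vs 9
# position 4: 1 vs 2
# position 5: 2 vs 9
# 3 differences
```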

@Trevor B. You're welcome. What format do you currently have the information in? Is it in a text file? How is it laid out? Or is it easiest to copy the information into a list in the code? I could easily whip something up that would cycle through all the possible pairs for you. It would just use two for loops, but I'm sure you wouldn't know how to do it.

Also, note that some complications could arise. When you compare the strings in this way, you are only looking for point mutations in the amino acid (AA) string. A deleted or inserted AA shifts every position after it, which can completely change this picture, and the process above would then give an inaccurate count of the differences. If that is the case, the code becomes much more complex, but still doable. Bob Krueger · 3 years, 9 months ago
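
One standard way to count differences when characters may have been inserted or deleted is the Levenshtein edit distance. This is a swapped-in technique for illustration, not necessarily what Bob had in mind, and real sequence-alignment tools use more elaborate scoring. The function name `edit_distance` and the example sequences are made up.

```python
def edit_distance(a, b):
    """Minimum number of single-character substitutions, insertions,
    and deletions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]  # turning i chars of a into an empty string takes i deletions
        for j in range(1, len(b) + 1):
            substitution = previous[j - 1] + (a[i - 1] != b[j - 1])
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[len(b)]


# Made-up sequences: the second one is the first with the 'V' deleted.
print(edit_distance("MKVLA", "MKLA"))  # 1
```

A position-by-position comparison cannot even be applied to these two strings, because their lengths differ; the edit distance reports the single deletion directly.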

@Bob Krueger Sorry it's been a while. I have the text in a Word document, copied off of a database. A little editing to the text enabled me to account for the additions and omissions in the text. I added dashes to the text and added loops to the code based off of your original post to count those. I'd put the code, but I don't know how to insert code into those grey boxes using LaTeX. Can you tell me what commands you used? Trevor B. · 3 years, 9 months ago

@Trevor B. I'm glad you were able to figure it out. To post the code, just indent each line, including the empty ones, four spaces. I hope everything turns out well for your project. Bob Krueger · 3 years, 8 months ago

@Bob Krueger Thank you very much for the help, Bob. Here is the code.

protein_1 = '1st prion protein here'
protein_2 = '2nd prion protein here'

# count_1: positions where the two proteins differ and neither has a dash
count_1 = 0

for i in range(0,len(protein_1)):
    if protein_1[i] != protein_2[i]:
        if protein_1[i] == '-':
            count_1 = count_1 - 1   # cancels the + 1 below
        elif protein_2[i] == '-':
            count_1 = count_1 - 1   # cancels the + 1 below
        count_1 = count_1 + 1

# count_2: dashes in the first protein (reported as additions)
count_2 = 0

for i in range(0,len(protein_1)):
    if protein_1[i] == '-':
        count_2 = count_2 + 1

# count_3: dashes in the second protein (reported as omissions)
count_3 = 0

for i in range(0,len(protein_1)):
    if protein_2[i] == '-':
        count_3 = count_3 + 1


# Print each count, using the singular word when the count is exactly 1
if count_1 == 1:
    print(str(count_1) + ' difference')
else:
    print(str(count_1) + ' differences')

if count_2 == 1:
    print(str(count_2) + ' addition')
else:
    print(str(count_2) + ' additions')

if count_3 == 1:
    print(str(count_3) + ' omission')
else:
    print(str(count_3) + ' omissions')
Trevor B. · 3 years, 8 months ago
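
For readers who want a more compact form, the three loops above can be folded into a single pass. This is only a sketch of the same counting rules, not a replacement for Trevor's working code; the names `compare_aligned` and `label`, the length check, and the example sequences are assumptions made for illustration.

```python
def compare_aligned(protein_1, protein_2, gap="-"):
    """Return (differences, gaps_in_1, gaps_in_2) for two aligned sequences.

    A position counts as a difference only if the characters differ
    and neither of them is the gap character.
    """
    if len(protein_1) != len(protein_2):
        raise ValueError("aligned sequences must have the same length")
    differences = 0
    gaps_in_1 = 0
    gaps_in_2 = 0
    for a, b in zip(protein_1, protein_2):
        if a == gap:
            gaps_in_1 += 1
        if b == gap:
            gaps_in_2 += 1
        if a != b and a != gap and b != gap:
            differences += 1
    return differences, gaps_in_1, gaps_in_2


def label(count, noun):
    """Format a count with a singular or plural noun, e.g. '1 omission'."""
    return str(count) + " " + noun + ("" if count == 1 else "s")


# Made-up aligned sequences, with dashes marking gaps.
differences, additions, omissions = compare_aligned("MK-LA", "MRVL-")
print(label(differences, "difference"))  # 1 difference
print(label(additions, "addition"))      # 1 addition
print(label(omissions, "omission"))      # 1 omission
```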

@Trevor B. That's awesome. Although I bet you have already done this procedure to all the proteins, there is a way to cycle through all the pairs of AA sequences. The idea isn't tricky, but the syntax is relatively hard to figure out. If you'd like to know how to do that, feel free to ask. Bob Krueger · 3 years, 8 months ago
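
The "two for loops" idea Bob mentions might look like the sketch below. The animal names and sequences are made-up placeholders, and `hamming` and `all_pair_distances` are hypothetical helper names. The inner loop starts at `i + 1` so that each pair is compared exactly once; for ten sequences that gives 45 comparisons.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def all_pair_distances(sequences):
    """Compare every unordered pair of named sequences exactly once."""
    names = sorted(sequences)
    results = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # start after i: no repeats, no self-pairs
            first, second = names[i], names[j]
            results[(first, second)] = hamming(sequences[first], sequences[second])
    return results


# Placeholder data; real sequences would be pasted in here.
sequences = {
    "animal_A": "MKVLA",
    "animal_B": "MRVLA",
    "animal_C": "MRVLG",
}

for (first, second), distance in sorted(all_pair_distances(sequences).items()):
    print(first + " vs " + second + ": " + str(distance))
# animal_A vs animal_B: 1
# animal_A vs animal_C: 2
# animal_B vs animal_C: 1
```

The standard library's `itertools.combinations(names, 2)` produces the same pairs without writing the nested loops by hand.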

@Bob Krueger I'm good. I actually ran this code this morning and got the data I needed. I copied the sequences into the first two variables from a Word file and was done with the code in \(15\) minutes (instead of the hours it would have taken me to do manually). Thanks for all of the help. Trevor B. · 3 years, 8 months ago
